# distributed_minimum_error_entropy_algorithms__abb592ea.pdf

Journal of Machine Learning Research 21 (2020) 1-31 Submitted 10/18; Revised 10/19; Published 7/20

Distributed Minimum Error Entropy Algorithms

Xin Guo xin.guo@uq.edu.au Department of Applied Mathematics The Hong Kong Polytechnic University Hong Kong, China, and School of Mathematics and Physics The University of Queensland Brisbane, QLD 4072, Australia

Ting Hu tinghu@whu.edu.cn School of Mathematics and Statistics Wuhan University Wuhan, 430072, China

Qiang Wu qwu@mtsu.edu Department of Mathematical Sciences Middle Tennessee State University Murfreesboro, TN 37132, USA

Editor: Lorenzo Rosasco

Minimum Error Entropy (MEE) principle is an important approach in Information Theoretical Learning (ITL). It is widely applied and studied in various ﬁelds for its robustness to noise. In this paper, we study a reproducing kernel-based distributed MEE algorithm, DMEE, which is designed to work with both fully supervised data and semi-supervised data. The divide-andconquer approach is employed, so there is no inter-node communication overhead. Similar as other distributed algorithms, DMEE signiﬁcantly reduces the computational complexity and memory requirement on single computing nodes. With fully supervised data, our proved learning rates equal the minimax optimal learning rates of the classical pointwise kernel-based regressions. Under the semi-supervised learning scenarios, we show that DMEE exploits unlabeled data eﬀectively, in the sense that ﬁrst, under the settings with weak regularity assumptions, additional unlabeled data signiﬁcantly improves the learning rates of DMEE. Second, with suﬃcient unlabeled data, labeled data can be distributed to many more computing nodes, that each node takes only O(1) labels, without spoiling the learning rates in terms of the number of labels. This conclusion overcomes the saturation phenomenon in unlabeled data size. It parallels a recent results for regularized least squares (Lin and Zhou, 2018), and suggests that an inﬂation of unlabeled data is a solution to the MEE learning problems with decentralized data source for the concerns of privacy protection. Our work refers to pairwise learning and non-convex loss. The theoretical analysis is achieved by distributed U-statistics and error decomposition techniques in integral operators. Keywords: Information theoretic learning, minimum error entropy, distributed method, semisupervised data, reproducing kernel Hilbert space

1. Introduction

Pioneered by the work of Principe and his collaborators in Erdogmus and Principe (2000), MEE principle has been playing an essential role in ITL (Principe, 2010). MEE principle is widely adopted as a powerful alternative to the traditional least squares method which is suboptimal in the non Gaussian situations (Erdogmus and Principe, 2000). The classical least square method minimizes the variance of the prediction error. Its optimality heavily depends on the assumption of Gaussianity

c 2020 Xin Guo, Ting Hu and Qiang Wu.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-696.html.

Guo, Hu and Wu

due to the use of a second order statistics. When Gaussianity assumption is violated, high order methods are desired. Entropy is a functional of the probability density function of the error variable and measures the average information contained in the distribution. Minimizing entropy allows one to take into account high-order statistical behavior in the learning process and thus is advantageous in non-Gaussian scenarios. Given an error variable E, Renyi s entropy and Shannon entropy are widely used to quantify the information contained in E. In this paper we focus on the quadratic Renyi s entropy, which is deﬁned by

H(E) = log E(p E) = log Z

E p2 E(e)de, (1)

where p E is the probability density function of E. MEE employs entropy as a new measurement of error to substitute the mean squared error R

E e2p E(e)de in the least squares. Renyi s entropy takes into consideration all higher moments rather than the variance used by the least squares. Hence, MEE is capable of dealing with outliers, heavy-tailed noise or skewed noise distribution. Because of its robustness to non-Gaussian noise, MEE performs well in a large number of applications such as signal processing, regression analysis, feature selection, and data clustering. See Erdogmus and Principe (2003); Chen et al. (2010); Gokcay and Principe (2002); Shen and Li (2015); Silva et al. (2010). Meanwhile, MEE has the nature of pairwise learning (Christmann and Zhou, 2016; Wang et al., 2012; Ying and Zhou, 2016), which focuses on approximating the diﬀerence of labels between each pair of sample points, incurring high computational complexity. The complexity restricts the application of MEE algorithms on the problems with large data size. Although there is a series of work on the theory and applications of MEE (Hu et al., 2015; Fan et al., 2016; Hu et al., 2013), few works have been done to reduce the computational complexity, which is one of the motivations of this paper. In the recent decade, the growth of computing facility power falls way behind the growth of the scale of data, and the research and practice of privacy protection falls way behind the growing concerns of privacy. Distributed algorithms have drawn much attention of machine learning and optimization communities, and are widely implemented in industry. Distributed approaches reduce computational complexity and memory requirement for single computing nodes, and can also be applied to the scenarios where data have to be stored and analyzed locally for privacy concerns. In this paper, we study a distributed MEE algorithm without communication overhead. Speciﬁcally, one ﬁrst divides a large data set into several subsets, then sends each subset as training sample to a computing node for a local output function, and ﬁnally averages these local output functions to synthesize the overall output function. Alternatively, diﬀerent data subsets may directly be used locally to train local output functions, and the prediction is done by distributing the new instance, then collecting and averaging the local predictions. This scheme has been developed for a lot of classical learning algorithms, including kernel ridge regression (Lin et al., 2017; Zhang et al., 2015), stochastic gradient descent algorithm (Lin and Zhou, 2018; Zinkevich et al., 2010), spectral algorithm (M ucke and Blanchard, 2018; Guo et al., 2017a), and bias correction (Guo et al., 2017b). For the applications of MEE algorithm that have privacy concerns, we adopt semi-supervised learning to our distributed scheme. Semi-supervised learning itself is an active research area, with one of the earliest ideas stemming from self-learning in classiﬁcation, known as self-training, self-labeling, or decisiondirected learning (Chapelle et al., 2006), and is later extended in various forms to other applications, including co-training in text classiﬁcation (Blum and Mitchell, 1998), graph-based methods (Wang et al., 2013), and manifold regularization (Belkin and Niyogi, 2004). We focus on improving the distributed MEE algorithm performance by utilizing unlabeled data. Most existing works on MEE methods study only linear models. The distributed MEE algorithms we study employ reproducing kernels and are able to ﬁt nonlinear models. The non-convex and pairwise loss functions caused the main diﬃculties in analysis, which we overcome by employing some decomposition techniques in U-statistics.

Distributed Minimum Error Entropy Algorithms

This paper provides three main contributions. First, existing analysis of MEE algorithm in the literature has largely been improved, and extended losslessly to DMEE. Our obtained learning rates coincide with the minimax optimal rates of regularized least squares algorithms for pointwise learning. Second, we prove that unlabeled data can signiﬁcantly improve learning rates under the setting with weak regularity assumptions. Third, we prove that with suﬃcient unlabeled data, the restriction on the maximum number of computing nodes that labeled data are distributed to is removed. The paper is organized as follows. In Section 2, we review the background of MEE learning, deﬁne the DMEE algorithms, and present our main results on learning rates. In Section 3, we provide detailed discussions and comparisons. Mathematical analysis goes to Sections 4 and 5 for supervised and semi-supervised data respectively.

2. Backgrounds and main results

In this paper we study regression problems. We assume that the explanatory variable X takes values in a compact domain X in an Euclidean space, the response variable Y takes values in the output space Y which is a subset of the real line R, and

Y = g (X) + ϵ,

where g is the target function and ϵ is the noise in the regression model. Let ρ be a Borel probability measure on the product space Z = X Y. Let ρX and ρ(y|x) denote the marginal distribution of ρ on X, and the conditional distribution on Y given x X, respectively. The purpose of regression is to estimate g (X) according to a sample D = {(xi, yi)}|D| i=1 drawn independently from ρ, where |D| is the sample size, the cardinality of D. Given a hypothesis space of functions g : X Y, MEE looks for a good approximation of g by minimizing the entropy of the prediction error E = g(X) Y . Here we consider Renyi s quadratic entropy (1). Denote ei = g(xi) yi, (xi, yi) D for 1 i |D| and p E can be estimated by Parzen windowing (Parzen, 1962). Given a windowing function G : ( , + ) [0, + ) and a scaling parameter h > 0, one gets the density estimator

ˆp E(e) = c |D|

i=1 Gh(e ei) = c |D|h

i=1 G (e ei)2

where c is a normalization constant so that R c Gh(t)dt = 1. A typical example is the windowing function G(a) = exp( a) with a 0, associated to which are the constant c = 1 π and the Gaussian

kernel c Gh(t) = 1 πh exp( t2

h2 ). Then the empirical Renyi s entropy of (1) is

HD(g) = log

j=1 G (ei ej)2

MEE algorithm searches over a suitable hypothesis space H for a minimizer of HD(g). Equivalently, one can just minimizes the factor

j=1 G (ei ej)2

(x,y) D (u,v) D

G [(g(x) y) (g(u) v)]2

Guo, Hu and Wu

MEE algorithm outputs g D := arg min g H RD(g) as the estimator of g .

In this work, we study the kernel based MEE algorithm, which includes the linear models studied in the literature as a special case. The learning process of the kernel MEE method is associated with a pairwise reproducing kernel Hilbert space (RKHS) (Ying and Zhou, 2016) HK on X 2 := X X. Denote by K : X 2 X 2 R a pairwise Mercer kernel, which is continuous, symmetric and positive semi-deﬁnite. The pairwise RKHS HK is deﬁned to be the completion of the linear span of the set of functions {K(x,u)( ) := K((x, u), ( , )) : (x, u) X 2} with respect to the inner product that satisﬁes K(x,u), K(x ,u ) K = K((x, u), (x , u )) for any (x, u), (x , u ) X 2. We replace the hypothesis function diﬀerence g(x) g(u) in (2), by a pairwise function f(x, u) in HK, and generalize the scheme (2) to the pairwise empirical risk

(x,y) D (u,v) D

G [(f(x, u) y + v)]2

Our target function is now fρ(x, u) := g (x) g (u). It is pointed out in Ying and Zhou (2016) that by the restriction

K((x, u), (x , u )) = W(x, x ) + W(u, u ) W(x, u ) W(u, x ) (3)

(where W is a reproducing kernel on X), any pairwise function f HK has the form of function diﬀerence f(x, u) = g(x) g(u) with g HW . However, here we do not impose such restriction and will give analysis for general pairwise Mercer kernels. To avoid overﬁtting, we consider the regularized MEE as follows.

Deﬁnition 2.1 Given a labeled data set D = {(xi, yi)}|D| i=1, the regularized MEE algorithm with an RKHS HK in supervised learning is deﬁned by

f D,λ := arg min f HK ED(f) + λ f 2 K, (4)

where λ > 0 is the regularization parameter.

The eﬃciency of the regularized MEE (4) in applications has been observed in considerable experimental results and theoretical analysis have been given in Hu et al. (2016); Fan et al. (2016). As a byproduct of our main results, we shall prove that the learning rates of (4) equal the minimax optimal learning rates of the classical pointwise regularized least squares. This greatly improves the results in the literature (we defer the detailed comparison to Section 3). For simplicity and without loss of much generality, we formulate the fully supervised data set D for our distributed MEE algorithm as the union of k independent and equal-sized subsets D1, . . . , Dk, all drawn independently from (Z, ρ). So,

For technical simplicity, in this paper we assume

|D1| = . . . = |Dk| = |D|

A local predicted function f Dl,λ is obtained from (4) with Dl. The distributed MEE algorithm outputs its predicted function f D,λ as the average of the local output functions

l=1 f Dl,λ. (5)

Distributed Minimum Error Entropy Algorithms

In this paper we study the convergence of f D,λ to fρ in the square integrable space (L2 ρX2, ρ), where

L2 ρX2 := f : X 2 R : f 2 ρ := Z

X 2 |f(x, u)|2dρX (x)dρX (u) < .

Below we elaborate three important assumptions to carry out the analysis. The ﬁrst assumption (6) is about the regularity of the target function fρ. Deﬁne the integral operator LK : L2 ρX2 L2 ρX2 associated with the pairwise kernel K by

X f(x, u)K(x,u)dρX (x)dρX (u), f L2 ρX2.

Since K is a Mercer kernel on the compact domain X 2, LK is of trace class (hence compact) and positive. So we write Lr K as the r-th power of LK for r > 0. Our error bounds are stated in terms of the regularity of the target function fρ(x, u), given by

fρ = Lr K(hρ), for some r > 0 and hρ L2 ρX2, (6)

The condition (6) characterizes the regularity of fρ and is directly related to the smoothness of fρ when HK is a Sobolev space. If (6) holds with r 1

2, fρ lies in the space HK. The second assumption (7) is about the capacity of HK, measured by the eﬀective dimension (Zhang, 2002; Caponnetto and Yao, 2010; Blanchard and Kr amer, 2016)

N(λ) = Trace((LK + λI) 1LK), for λ > 0,

where I is the identity operator on HK. In this paper, we assume that

N(λ) C0λ s for some C0 > 0 and 0 < s 1. (7)

We postpone some discussions on (7) to Section 3. The third assumption (8) is about the conditional probability distribution ρ(y|x) on the output space Y. We only assume that the output variable Y satisﬁes the moment condition (van der Vaart and Wellner, 1996, page 103): there exist two positive numbers σ, M > 0, both independent of X, such that for any integer q 2,

E (|Y |q|X) 1

2q!σ2M q 2. (8)

The assumption (8) covers many common distributions, for example, Gaussian and the distributions with compact support. Throughout the paper, we assume that the windowing function G is diﬀerentiable, G (0) = 1, and G(a) G(0) for a > 0. We assume that

CG := sup a (0, ) |G (a)| < ,

and there exists some p, cp > 0 such that

|G (a) G (0)| cpap, for any a > 0. (9)

For example, the windowing function G(a) = e a for Gaussian kernel satisﬁes the above assumptions with cp = 1 and p = 1. Since the convergence requires to select λ 0 as |D| , we assume λ 1 in the sequel to simplify the notations. Without loss of generality, we also assume

sup (x,u) X 2

K((x, u), (x, u)) = 1. (10)

Guo, Hu and Wu

2.1. Convergence of DMEE with fully supervised data

The following theorem bounds the error of (5) with overwhelming probability.

Theorem 2.2 Assume (6) for r > 0. For any 0 < δ < 1, we have with probability at least 1 δ that,

f D,λ fρ ρ hρ ρλmin{r,1} + (2 fλ K + 8M + 8σ)AD,λ,k log 8

+ 128( fλ K + M + σ)λ 1

4 max 1 l k

+ 16cp,σ,M 1 + λp+ 1

2 h 2pλ p 1 log 16k

3 log 16|D|

2 max 1 l k

where the constant cp,σ,M is independent of D, δ, or h, and it will be speciﬁed later after the bound (27). Here and in the sequel, |D|/4 denotes the largest integer not exceeding |D|/4, AD,λ,k :=

λ + 1 |D|/4

N(λ) |D|/4 , AD,λ := AD,λ,1, and ADl,λ := ADl,λ,1.

Corollary 2.3 Assume (6) for r > 0 and (7). Let

( |D| 1 1+s , for 0 < r 1

2, |D| 1 s+2 min{1,r} , for r > 1

( = 1, for 0 < r 1

2, λ min{r 1

2} log 4 |D|, for r > 1

Then for any 0 < δ < 1, with probability 1 δ,

f D,λ fρ ρ C1 max n λmin{r,1}, h 2pλ p 1(log |D|)2p+4o log 16

2p+4 , (13)

where C1 is a constant independent of D, δ, k, or h, and it will be speciﬁed in the proof.

As a direct corollary, the following theorem provides the learning rates for DMEE (and hence MEE) with large h.

Theorem 2.4 Under the same conditions of Corollary 2.3, if one further has h λ r 1 p(log |D|)2p+4 1

2p , then with probability at least 1 δ,

f D,λ fρ ρ C1

( |D| r s+1 log 16

δ 2p+4 , for 0 < r 1

2, |D| min{ r s+2r , 1 s+2} log 16

δ 2p+4 , for r > 1

where C1 is deﬁned in Corollary 2.3 above. Furthermore, we employ Lemma B.1 in Appendix B to see that for any real number µ > 0,

E( f D,λ fρ µ ρ) 1/µ [16Γ(µ(2p + 4) + 1)]1/µ C1

( |D| r s+1 , for 0 < r 1

2, |D| min{ r s+2r , 1 s+2}, for r > 1

In particular, since the only assumption (12) on k permits the case k = 1, the above bounds (14) and (15) for f D,λ also hold for f D,λ.

Distributed Minimum Error Entropy Algorithms

Remark 2.5 Theorem 2.4 suggests that as r (0, 1] increases, the learning rate is improved. However, further increasing of r beyond 1 may not help to improve the learning rate. This saturation phenomenon is widely observed in the literature; see e.g. Lo Gerfo et al. (2008); Lin et al. (2017).

Recall the goal of regression analysis is to get a good estimator of g . In this work, we aim to learn the diﬀerence of the regression function fρ(x, u) = g (x) g (u) based on the idea of pairwise learning (Ying and Zhou, 2016, 2017). For this purpose, it is natural to adopt pariwise reproducing kernels, and build the theory in a general way thereupon. To derive an estimator of g , we ﬁrst consider the following orthogonal projection P on L2 ρX2 (note that ρX 2 is a probability measure)

(Pf)(x, u) = (Mf)(x) (Mf)(u), where (Mf)(x) = 1

X [f(x, u) f(u, x)] dρX(u).

In particular, if f(x, u) = g(x) g(u) for some g L2 ρX , then Pf = f, Mf = g R

X g(u)dρX (u), and R

X (Mf)(x)dρX(x) = 0. So f D,λ fρ 2 ρ P f D,λ fρ 2 ρ = 2 (M f D,λ + g Mfρ) g 2 L2ρX .

With data D, one replaces M by MD : C(X 2) C(X),

(MDf)(x) := 1 2|D|

(u,v) D [f(x, u) f(u, x)] ,

and replaces the diﬀerence g Mfρ = R

X g (x)dρX(x) by its unbiased and eﬃcient estimator 1 |D| P

2.2. Convergence of DMEE with semi-supervised data

We also study the inﬂuence of unlabeled data on the convergence of DMEE. Besides the labeled data D = k l=1Dl (with disjoint and equal-sized subsets D1, . . . , Dk), assume that we also have an

unlabeled data set D = { xi}| D| i=1. We assume that the input observations xi are drawn independently from ρX , and D is independent of D. For technical simplicity we assume that D is also divided randomly into k disjoint and equal-sized subsets D = k l=1 Dl. We deﬁne the semi-supervised training

data set by D = k l=1D l , where for each 1 l k, we write Dl = {(xi, yi)}|Dl| i=1 , Dl = { xi}| Dl| i=1 , and deﬁne D l = {(x i , y i )}|D l | i=1 with |D l | = |Dl| + | Dl| by

(x i , y i ) =

( (xi, |D l | |Dl| yi), for 1 i |Dl|, ( xi |Dl|, 0), for |Dl| + 1 i |D l |.

Here the factor |D l |/|Dl| for yi is given to compensate the bias introduced by the fake labels 0 for the unlabeled data. By faking the zero labels, there is no need to reform the algorithm itself. The output function f D ,λ of the regularized MEE algorithm with semi-supervised data is deﬁned by (4) with D substituted by D . The semi-supervised DMEE outputs the predictive function

l=1 f D l ,λ. (16)

In this subsection, we assume that K is antisymmetric. That is, we assume K(x,u) = K(u,x).

Guo, Hu and Wu

Theorem 2.6 The following bound holds with probability at least 1 δ.

f D ,λ fρ ρ hρ ρλmin{r,1} + 256(1 + M + σ)λ 1/2 log 16k

A2 D l ,λ λ + 1

AD l ,λ AD l ,λ fλ K + ADl,D l ,λ

+ 16λ 1/2C2h 2p log 16k

3 max 1 l k

" A2 D l ,λ λ + 1

AD l ,λλ 1/2 + 1

+ (2AD ,λ,k fλ K + 8(M + σ)AD,D ,λ,k) log 16

where AD l ,λ is deﬁned in the same way as AD,λ in Theorem 2.2 by substituting D with D l ,

AD,D ,λ,k = k |D |

λ + 1 |D|/4

N(λ) |D|/4 , AD,D ,λ = AD,D ,λ,1,

D,D ,λ = |D |

and C2 is a constant independent of D, D , k, h, or δ, and it will be speciﬁed in the proof.

To demonstrate the idea of Theorem 2.6, we give the following corollary, which suggests that, with suﬃcient unlabeled data, the number k of the computing nodes is only technically bounded from above by the assumption |Dl| = |D|

k 4. Similar results for distributed regularized least squares are obtained in Lin and Zhou (2018). Note that this increase of computing nodes does not help to reduce single-node time or space complexity, but signiﬁcantly improves the learning rates under the scenario that the regression function fρ has low regularity 0 < r < 1

2, therefore our analysis suggests the semi-supervised scheme for learning less regular target functions. Another scenario of such learning rate improvement is when locally stored data can not be centralized due to privacy concerns, and therefore our analysis suggests an inﬂation of unlabeled data solution to DMEE for privacy-sensitive distributed learning. We will elaborate the details in Section 3.

Corollary 2.7 Assume (8), (9), (6) for r > 0, r + s 1

2, and |D | max{|D|

s+1 2 min{r,1}+s , 2|D|}.

Let λ = |D| 1 2 min{r,1}+s , and

k (log |D|) 4 min np

|D |λ1+s, (|D |λ2 2 min{r,1} s)1/3o . (18)

Then with probability at least 1 δ, one has

f D ,λ fρ ρ C3 max

|D| min{ r 2r+s , 1 2+s}, D,D ,λ (log |D|)2p+4

where C3 is a constant independent of D, D , k, h, or δ, and will be speciﬁed in the proof.

Theorem 2.8 Under the same conditions of Corollary 2.7, if one further has

h h |D| 3 2+s D,D ,λ (log |D|)2p+4i 1

Distributed Minimum Error Entropy Algorithms

then with probability 1 δ,

f D ,λ fρ ρ C3|D| min{ r 2r+s , 1 2+s} log 16

2p+4 , (20)

where C3 is the constant deﬁned in Corollary (2.7). Furthermore, we employ Lemma B.1 in Appendix B to see that for any real number µ > 0, E f D ,λ fρ µ ρ 1/µ [16Γ((2p + 4)µ + 1)]1/µ C3|D| min{ r 2r+s , 1 2+s}. (21)

In particular, since the only assumption (18) on k permits the case k = 1, the above bounds (20) and (21) for f D ,λ also holds for f D ,λ.

3. Discussion and comparison with other works

The condition (7) is widely adopted in the literature to characterize the capacity of HK (Lin et al., 2017; Caponnetto and De Vito, 2007; Zhang, 2002; Blanchard and Kr amer, 2010). We see that since LK is of trace class, the condition (7) always holds with s = 1. When the eigenvalues of LK decay faster, one can have (7) with a smaller s. In particular, if the eigenvalues {γi} i=1 of the operator LK decay as γi C0i 1

b for some 0 < b < 1 and C0 > 0, then

γi γi + λ Z

C0 C0 + λt 1 b dt 2 1 b Cb 0λ b Z

(1 + t) 1 b dt = 2 1 b Cb 0b 1 b λ b.

The condition (7) roughly measures the smoothness of K. For example, if K Cα(X 2 X 2) with some integer α 1, and X 2 is locally the graph of a Lipschitz function, then (7) is satisﬁed with

s = α 2dim(X) + 1

2 1 (Mendelson and Neeman, 2010). There are some other capacity characteristics,

for example covering numbers (Zhou, 2002) and entropy numbers (Steinwart et al., 2009). Compared to the regularity assumptions, capacity assumption is not necessary for deriving learning rates, and there are works on capacity independent analysis in the literature; see e.g. Smale and Zhou (2007).

Figure 1: To organize the discussion of DMEE learning rates, we divide the space of regularity and capacity parameters into several parts. As we discussed at the beginning of Section 3, one always has N(λ) C0λ 1, so we exclude the area s > 1.

We organize the following discussions around Figure 1. In Area 1, our analysis suggests the saturation phenomenon of DMEE with respect to regularity, as shared also by regularized least

Guo, Hu and Wu

squares (Lin et al., 2017). That is, when the regularity index r exceeds 1, its further increasing does not help to improve the algorithm convergence. This saturation is proved overcome by spectral algorithms (Lo Gerfo et al., 2008; M ucke and Blanchard, 2018; Blanchard and M ucke, 2018; Guo et al., 2017a), bias corrected approach (Guo et al., 2017b), or a gradient descent approach for MEE (Hu et al., 2020) without Tikhonov regularization. In Area 2, DMEE achieves the learning rates which equal the minimax optimal rates O(|D| r 2r+s ) for pointwise regression learning (Steinwart et al., 2009; Caponnetto and De Vito, 2007; Bauer et al., 2007; Blanchard and M ucke, 2018). Without spoiling the learning rates, unlabeled data help to essentially remove the restriction of the maximum number of computing nodes DMEE can be distributed to. Note that, while allowing a more distributed computation, unlabeled data may increase the single-node computational complexity and memory requirement. This being said, the removal of the restriction on maximum computing nodes has signiﬁcant impact on the applications where for privacy reasons, data can not be centralized and must be processed locally. For these applications, computational complexity and memory requirement are usually not the concern, and our analysis suggests a solution of inﬂating data subsets with unlabeled data to improve generalization power. Similar observations are reported for pointwise regularized least squares (Lin and Zhou, 2018). For fully supervised data, Theorem 2.4 suggests that without spoiling the learning rate, DMEE can at most reduce the single-node data size, from |D| to |D|1 r (1/2)

s+2r log4 |D| for Area 2, and to

s+2 log4 |D| for Area 1, respectively. However, For Area 3 and Area 4, Theorem 2.4 suggests that DMEE could not indeed reduce single-node computational complexity without sacriﬁcing the learning rates. For the semi-supervised data, unlabeled data serve mainly for the purpose of relaxing the restriction of the maximum number of computing nodes for privacy protection, and improving the learning performance under weak regularity conditions. In fact, for Areas 1 and 2, our analysis suggests that to maintain the best possible learning rates that DMEE can achieve with fully supervised data, if one relaxes the restriction of the maximum number of computing nodes by adding unlabeled data, then the single-node sample size will be increased. In Area 3, the optimal lower bound O(|D| r 2r+s ) for point kernel-based regression for 0 < r < 1

2 is derived in Steinwart et al. (2009) with the boundedness assumption of Lr K : L2 L . Fully supervised DMEE does not achieve this lower bound. Unlabeled data improves DMEE in two ways. First, semi-supervised DMEE achieves the learning rates O(|D| r 2r+s ). Second, again, semisupervised DMEE has essentially no restriction on the maximum number of computing nodes. The coverage of Area 3 is also one of the important improvements we have in this paper, compared with Hu et al. (2020), and even Lin and Zhou (2018). In Area 4, the learning rate of fully supervised DMEE is O(|D| r s+1 ), and our analysis of semisupervised DMEE fails to improve the rate to O(|D| r 2r+s ). The learning rates are only provided for fully supervised DMEE. Nevertheless, Area 4 seems to be the situation that one should avoid. In fact, with a less regular target function, typically one needs a larger hypothesis space, which corresponds to a larger s. It is unknown to us at this moment whether the suboptimal rates in Areas 3 for fully supervised DMEE and in Area 4 for semi-supervised DMEE are the inherent features of these algorithms or the consequences as limited by the analysis tools. It will be an interesting future research topic. It is worth mentioning that all the error bounds and convergence rates obtained in this paper apply to non-distributed MEE, which corresponds to the case k = 1, and they improve some existing results in the literature. Most studies on MEE algorithms in the literature are carried out empirically. Theoretical results are relatively sparse. In Chen et al. (2010), the consistency of MEE is proved in the local region and no explicit learning rate was given. In our earlier works (Hu et al., 2015; Fan et al., 2016), we studied MEE algorithms in the empirical risk minimization (ERM) and regularized ERM frameworks respectively. The main results include that if the target function lies in the hypothesis space H, |y| M and the logarithm of the covering number of the hypothesis space H

Distributed Minimum Error Entropy Algorithms

by C(X) balls of radius ϵ grows no faster than ϵ p for some index p > 0 when ϵ decays to zero,

then with high probability the learning rate is of order O |D| 1 2(1+p) . To elaborate it clearly,

assume H to be an RKHS induced either by a pointwise kernel W Cα(X X) or a pairwise kernel K Cα(X 2 X 2) and X Rd satisﬁed certain mild regularity conditions. By Cucker and Zhou (2007, page 72, Theorem 5.1) and Mendelson and Neeman (2010), the covering number index p = 2d/α while the eﬀective dimension index s = 2d/(d + α). When target function lies in HK, i.e.

2, the rate O |D| 1 2(1+p) in (Hu et al., 2015) is always inferior to O |D| r 2r+s in (14). To our best knowledge, various MEE algorithms in the existing literature are implemented by gradient descent-based methods. Note that in general, Algorithm (4) is not a convex optimization problem, and a comprehensive discussion about the global/local optimal, the dependence of convergence on the initial value of variables, and the convergence speed, go beyond the scope of this paper, and are interesting questions for future research. While our analysis is given for general pairwise reproducing kernels, the design (3) is usually adopted in practice (Ying and Zhou, 2016, 2017). We point out that under the design (3), Algorithm (4) is reduced to a smooth (though not convex) optimization problem with |D| 1 variables. In fact, for any x X, write W (x) = (W(x, xi))|D| i=1 a column vector of dimension |D|. Write C = (ci,j)|D| i,j=1 the coeﬃcient matrix of the function

f C(x, u) =

j=1 ci,j K((xi, xj), (x, u)). (22)

By the representer theorem, the solution to (4) takes the form of (22). Meanwhile, we write 1 = (1, . . . , 1)T R|D| and use the kernel structure (3) to obtain that f C(x, u) = c T (W (x) W (u)), where c = (C CT )1 R|D|. We also have f C 2 K = c T Wc, where W = (W(xi, xj))|D| i,j=1. So, Algorithm (3) is reduced to an optimization problem of a smooth function of c. Since 1T c = 0, the vector c has only |D| 1 free variables. The gradient vector and the Hessian matrix of the target function can be directly computed. The computational complexity can further be reduced for DMEE.

4. Estimates in supervised learning

Now we are in a position to prove the consistency results stated in Section 2. First, we will estimate the bound of f D,λ deﬁned by (4). In the sequel, for notational simplicity, write w = (x, y) and z = (u, v). Deﬁne the empirical operator LK,D : HK HK by

LK,D := 1 |D|2 X

w,z D , K(x,u) KK(x,u),

so for any f HK, LK,Df = 1 |D|2 X

w,z D f(x, u)K(x,u). Then we have the following representation

Lemma 4.1 Deﬁne f D,λ by (4). Then it satisﬁes

f D,λ = (LK,D + λI) 1 ˆfρ,D + (LK,D + λI) 1ED,λ (23)

ˆfρ,D = 1 |D|2 X

w,z D (y v)K(x,u)

Guo, Hu and Wu

ED,λ = 1 |D|2 X

G (f D,λ(x, u) y + v)2

G (0) (f D,λ(x, u) y + v)K(x,u).

Proof. Since f D,λ is the minimizer of algorithm (4), we take the gradient of the regularized functional on HK in (4) to give

w,z D G (f D,λ(x, u) y + v)2

(f D,λ(x, u) y + v)K(x,u) + λf D,λ = 0,

or equivalently (recall the assumption G (0) = 1),

w,z D (f D,λ(x, u) y + v)K(x,u) + λf D,λ ED,λ = 0,

which is (LK,D + λI)f D,λ ˆfρ,D ED,λ = 0. The proof is completed.

4.1. Bounds of f D,λ and ED,λ

Under the moment condition (8), similar to (Wang and Hu, 2019, Proposition 3) we can prove that, with probability at least 1 δ, there holds

max{|y| : there exists an x X, such that (x, y) D} (4M + 5σ) log |D|

By the deﬁnition of f D,λ in (4), we have that ED(f D,λ) + λ f D,λ 2 K ED(0). Recall that CG = supa |G (a)|. With the fact G(a) < G(0) for all a > 0 and Taylor expansion,

λ f D,λ 2 K ED(0) ED(f D,λ) h2

w,z D G (y v)2

w,z D (y v)2 CG

w,z D 2(y2 + v2) 4CG max w D |y|2.

It follows that

f D,λ K 2 p

2 max w D |y|. (25)

By (9), we see that

ED,λ K cph 2p 1 |D|2 X

w,z D ( f D,λ K + |y v|)2p+1

22pcph 2p 1 |D|2 X

f D,λ 2p+1 K + |y v|2p+1

22pcph 2p f D,λ 2p+1 K + 22p+1 max w D |y|2p+1 . (26)

This in combination with the bounds (24) and (25) gives that, with probability at least 1 δ,

ED,λ K cp,σ,M λ (p+ 1

2 ) + 1 h 2p log |D|

where cp,σ,M := 24p+1cp(C p+ 1

2 G + 1)(4M + 5σ)2p+1.

Distributed Minimum Error Entropy Algorithms

4.2. Two error decompositions in MEE algorithms

To derive the explicit learning rate of the distributed algorithm (5) and (16), we introduce the regularization function fλ in HK, deﬁned by

fλ := arg min f HK Els(f) + λ f 2 K,

where Els(f) = R

Z2(f(x, u) y +v)2dρ(x, y)dρ(u, v) is the expected risk associated with the pairwise square loss. Similar to the argument in Smale and Zhou (2007), we can verify that

fλ = (LK + λI) 1LKfρ, (28)

so fλ fρ = λ(LK +λI) 1fρ. We have used the property that the operator norm of LK on L2 ρX2 is no greater than 1, thanks to the assumption (10). Under the regularity assumption (6) with r > 0,

fλ fρ ρ hρ ρλr, when 0 < r 1, hρ ρλ, when r > 1, (29)

2 , when 0 < r < 1/2, hρ ρ, when r 1/2. (30)

Now we state two error decompositions for f D,λ fλ. By (28), LK,Dfλ λfλ = LK,Dfλ + LKfλ LKfρ, so

fλ = (LK,D + λI) 1[(LK LK,D)fλ LKfρ], (31)

and we obtain the ﬁrst decomposition by adding (23) and (31),

f D,λ fλ =(LK,D + λI) 1(LK LK,D)fλ + (LK,D + λI) 1 ˆfρ,D LKfρ

+ (LK,D + λI) 1ED,λ. (32)

Recall that λf D,λ = LK,Df D,λ + ˆfρ,D + ED,λ, so

(LK + λI)f D,λ = (LK LK,D)(f D,λ fλ) + (LK LK,D)fλ + ˆfρ,D + ED,λ,

and we obtain the second decomposition

f D,λ fλ = f D,λ (LK + λI) 1LKfρ = (LK + λI) 1[(LK + λI)f D,λ LKfρ]

= (LK + λI) 1(LK LK,D)(f D,λ fλ) + (LK + λI) 1(LK LK,D)fλ

+ (LK + λI) 1( ˆfρ,D LKfρ) + (LK + λI) 1ED,λ. (33)

In the sequel, we denote by op the operator norm from HK to itself, and

BD,λ = (LK,D + λI) 1(LK + λI) op,

CD,λ = (LK + λI) 1

2 (LK LK,D) op,

l=1 (LK + λI) 1

2 (LK LK,Dl)

l=1 (LK + λI) 1

2 ( ˆfρ,Dl LKfρ)

GD,λ = (LK + λI) 1

2 ( ˆfρ,D LKfρ) K.

We cite the following lemma from Blanchard and Kr amer (2010, Lemma E.4), which was proved for positive deﬁnite matrices in Bhatia (1997, pages 255 256, Theorems IX.2.1-2).

Guo, Hu and Wu

Lemma 4.2 Let A and B be positive deﬁnite operators on a separable Hilbert space H. Let op(H) denote the operator norm. Then

As Bs op(H) AB s op(H), for any 0 s 1.

Noting that for any f HK,

λ f K} (LK + λI) 1 2 f K (34)

by the fact f ρ = L

1 2 Kf K, one gets a bound for the sample error f D,λ fλ ρ by the two decompositions (32) and (33) above.

Proposition 4.3 Deﬁne f D,λ by (5). Then there holds

f D,λ fλ ρ S1 + S2 + DD,λ fλ K + FD,λ, (35)

where S1 = max 1 l k

BDl,λC2 Dl,λ fλ Kλ 1

2 + BDl,λCDl,λGDl,λλ 1

S2 = max 1 l k

BDl,λCDl,λλ 1 EDl,λ K + 1

Proof. Let I1, I2, and I3 denote the three terms on the right-hand side of (32), respectively. Consider the HK norm of

(LK + λI)1/2(f D,λ fλ) = (LK + λI)1/2(I1 + I2 + I3).

By Lemma 4.2,

(LK + λI)1/2I1 K

(LK + λI)1/2(LK,D + λI) 1/2 op (LK,D + λI) 1/2(LK + λI)1/2 op

(LK + λI) 1/2(LK LK,D) op fλ K BD,λCD,λ fλ K.

(LK + λI)1/2I2 K

(LK + λI)1/2(LK,D + λI) 1(LK + λI)1/2 op (LK + λI) 1/2( ˆfρ,D LKfρ) K BD,λGD,λ,

(LK + λI)1/2I3 K

(LK + λI)1/2(LK,D + λI) 1(LK + λI)1/2 op 1

λ 1/2BD,λ ED,λ K.

With the above bounds, we use (34) to obtain

f D,λ fλ ρ BD,λCD,λ fλ K + BD,λGD,λ + λ 1

2 BD,λ ED,λ K

Distributed Minimum Error Entropy Algorithms

f D,λ fλ K BD,λCD,λ fλ Kλ 1

2 + BD,λGD,λλ 1

2 + λ 1BD,λ ED,λ K. (36)

By the fact f D,λ fλ = 1

k Pk l=1(f Dl,λ fλ) and the second decomposition (33), one obtains that

f D,λ fλ ρ 1

l=1 (LK + λI) 1

2 (LK LK,Dl)(f Dl,λ fλ) K

l=1 (LK + λI) 1

2 (LK LK,Dl)fλ

l=1 (LK + λI) 1

2 ˆfρ,Dl LKfρ K + λ 1

l=1 EDl,λ K

DD,λ fλ K + FD,λ + max 1 l k

CDl,λ f Dl,λ fλ K + λ 1

2 EDl,λ K .

Plugging (36) into the above bounds (with substitution of D by Dl) completes the proof.

4.3. Estimates in distributed U-statistics

To present the learning power of the algorithm (5), we will make use of Proposition 4.3, that is related to the quantities BD,λ, CD,λ, DD,λ, FD,λ and GD,λ. By the work in Hu et al. (2020), we can see that the following bounds hold.

Proposition 4.4 Each of the following three bounds holds with probability 1 δ.

BD,λ 2 2AD,λ log 2

2 + 2, CD,λ 2AD,λ log 2

DD,λ 2AD,λ,k log 2

For other two quantities FD,λ and GD,λ, they are both involved with the unbounded condition (8), which brings diﬃculties in pairwise distributed concentration inequalities. We will handle them by some decomposition techniques in U-statistics.

Proposition 4.5 Each of the following two bounds holds with probability 1 δ.

FD,λ 8(M + σ)AD,λ,k log 2

δ , and GD,λ 8(M + σ)AD,λ log 2

Proof. Deﬁne a random variable

ξ(w, z) = (y v)(LK + λI) 1

with w = (x, y) and z = (u, v). The moment condition (8) implies that,

2 E|Y | 2λ 1

2 (E|Y |2) 1 2 2σλ 1

One applies H older s inequality to have that for any q 2, (E ξ K)q E( ξ q K). Note the equation E[ (LK +λI) 1/2K(x,u) 2 K] = N(λ) for λ > 0 (Lin et al., 2017, Lemma 18). We obtain the following

Guo, Hu and Wu

bound for any integer q 2.

E[ ξ Eξ q K] 2q 1E[ ξ q K] + 2q 1 Eξ q K 2q E[ ξ q K]

22q sup x ,u X (LK + λI) 1/2K(x ,u ) q 2 K E h |Y |q (LK + λI) 1/2K(x,u) 2 K i

q 2 E h E(|Y |q|X) (LK + λI) 1/2K(x,u) 2 K i

2q!σ2M q 2N(λ)

= 8q!σ2N(λ)(4M/

Let π be a permutation of the set {1, . . . , |Dl| = |D|/k} of integers. Then {zl π(1), . . . , zl π(|Dl|)} is the

associate permutation of Dl = {zl i = (xl i, yl i)}|Dl| i=1 . Let

U l π = 1 |Dl|/2

i=1 ξ(zl π(2i 1), zl π(2i)).

Since ξ(z, z) = 0, the average (LK + λI) 1/2 ˆfρ,Dl = 1 |Dl|2 P

w,z Dl ξ(w, z) can be written as

(LK + λI) 1/2 ˆfρ,Dl = |Dl| 1

|Dl| 1 |Dl|(|Dl| 1)

w,z Dl ξ(w, z) = |Dl| 1

|Dl| 1 |Dl|!

where the last sum is taken over all the |Dl|! permutations π of {1, . . . , |Dl|}. Since |D1| = . . . = |Dk|,

l=1 (LK + λI) 1/2 ˆfρ,Dl = |D1| 1

|D1| 1 |D1|!

By deﬁnition, it is easy to see that E[ξ] = (LK + λI) 1/2LKfρ. One applies (37) to obtain

FD,λ |D1| 1

l=1 U l π Eξ

K + 1 |D1| Eξ K

l=1 U l π Eξ

K + 2σk |D|

We observe that for each π, 1 k Pk l=1 U l π is the average of k |D1|/2 independent copies of ξ(z, z ) with z and z independently drawn from ρ, and k |D1|/2 |D|/4 (in fact, recall our assumption |D1| 4 and k|D1| = |D|, obviously when |D1| is even, k |D1|/2 = k|D1|/2 = |D|/2 |D|/4 , and when |D1| is odd, one still has k |D1|/2 = k(|D1| 1)/2 k|D1|/4 |D|/4 ). Note that the hyperbolic function cosh(x) = (ex + e x)/2 is convex. To estimate the ﬁrst term on the right-hand

Distributed Minimum Error Entropy Algorithms

side of (38), we apply Lemma A.2 to obtain that for any ϵ > 0,

l=1 U l π Eξ

l=1 U l π Eξ

inf c>0 1 |D1|!

l=1 U l π Eξ

2 16N(λ)σ2 + 4λ 1

One takes the right-hand side of (39) as δ and recalls (38) to have that with probability 1 δ,

FD,λ 2σk |D|

λ + 8M log(2/δ)

32N(λ)σ2 log(2/δ)

Note that GD,λ equals FD,λ when k = 1. The proof is complete.

4.4. Proof of learning rates in supervised learning

Proof of Theorem 2.2 We can decompose f D,λ fρ ρ as the sample error f D,λ fλ ρ and the approximation error fλ fρ ρ. As stated in (29), fλ fρ ρ λr hρ ρ for 0 < r 1. Thus, we just estimate f D,λ fλ ρ by Proposition 4.3. By Propositions 4.4 and 4.5, and the bound (27), we get that for any ﬁxed l, with probability at least 1 4δ, the following three bounds hold simultaneously,

BDl,λC2 Dl,λλ 1

BDl,λCDl,λGDl,λλ 1

2 128(M + σ)

BDl,λCDl,λλ 1 EDl,λ K

16cp,σ,M(1 + λp+ 1

ADl,λλ (p+ 3

2 )h 2p log 2

With the notations in (35), it follows that with probability at least 1 4kδ, the following two bounds hold true simultaneously,

S1 128( fλ K + M + σ) max 1 l k

S2 16cp,σ,M(1 + λp+ 1

2 )h 2pλ p 1 log3 2

2 max 1 l k

Guo, Hu and Wu

By Proposition 4.4 and 4.5 again, we see that with probability at least 1 δ

2, the following bounds hold simultaneously,

DD,λ 2AD,λ,k log 8

δ and FD,λ 8(M + σ)AD,λ,k log 8

Substitute δ by δ 8k in (40) and (41), one has with probability at least 1 δ that

f D,λ fρ ρ f D,λ fλ ρ + fλ fρ ρ hρ ρλr + S1 + S2 + DD,λ fλ K + FD,λ

hρ ρλr + 128( fλ K + M + σ)λ 1

+ 16cp,σ,M(1 + λp+ 1

2 )h 2pλ p 1 log3 16k

2 max 1 l k

+ (2 fλ K + 8(M + σ))AD,λ,k log 8

The proof is complete. Proof of Corollary 2.3. Our assumption |D| 4 implies |D|/4 |D|/7 and log |D| > 1. By (7),

AD,λ,k = k |D|

λ + 1 |D|/4

N(λ) |D|/4 8k |D|

2 , when 0 < r < 1

2 λ 1 2 r+ s 1

= λs 1. (42)

By (30), fλ K hρ ρ(1 + λr 1

2 ) for 0 < r 1. Since 1 |D| = λs+max{2r,1}, the second term on the right-hand side of (11) is bounded by

J2 :=(2 fλ K + 8M + 8σ)AD,λ,k log 8

δ C1 1(1 + λr 1

=C1 1(1 + λr 1

2 +max{r, 1

δ 2C1 1λr log 8

where C1 1 = (4 hρ ρ + 8M + 8σ) 2(8 + 7C0). By deﬁnition and the assumption |Dl| 4 (hence |Dl|/4 |Dl|/7),

ADl,λ = 1 |Dl|

λ + 1 |Dl|/4

N(λ) |Dl|/4 8k |D|

Distributed Minimum Error Entropy Algorithms

where the last step follows from (42). Since

|D| 1λ s 1λs+1 = 1, when 0 < r < 1

2 λ 1 2 rλ 1 sλ2r+s = λr 1

we have A2 Dl,λ/λ 4(8 + 7C0)2kλ s 1/|D| 4(8 + 7C0)2. Now we bound the third term on the right-hand side of (11).

J3 := 128( fλ K + M + σ)λ 1

When 0 < r < 1

2, k = 1 and 1 |D| = λ1+s,

J3 C2 1λr 1 log4 16

|D| = C2 1λr log4 16

where C2 1 = 128( hρ ρ + M + σ)[4(8 + 7C0)2 + 1] 4(8 + 7C0)2. When 1 2 r 1, k λ 1 2 r log 4 |D| and 1 |D| = λ2r+s. Recall that |D| 4 which implies log 16k

δ log 16|D|

δ 2(log |D|) log 16

2 log4 |D| log4 16

λ 1 2 r log 4 |D| λ sλ2r+s = C3 1λr log4 16

where C3 1 := 128( hρ ρ + M + σ) 24[4(8 + 7C0)2 + 1] 4(8 + 7C0)2. The last term on the right-hand side of (11) is bounded as follows.

J4 :=16cp,σ,M(1 + λp+ 1

2 )h 2pλ p 1 log3 16k

2 max 1 l k

C4 1h 2pλ p 1 (log |D|)2p+4 log 16

where C4 1 := 16cp,σ,M 22p+5 [1 + (4(8 + 7C0)2 + 1) 2(8 + 7C0)]. One completes the proof by letting C1 := hρ ρ + 2C1 1 + max{C2 1, C3 1} + C4 1.

5. Estimates in semi-supervised learning

To derive the optimal learning rate in semi-supervised learning, we give the following proposition by taking similar procedures in the proof of Proposition 4.3.

Corollary 5.1 Let f D ,λ be deﬁned in (16). One has

f D ,λ fλ K S 1 + S 2 + DD ,λ fλ K + FD ,λ, (46)

where S 1 = max 1 l k

BD l ,λC2 D l ,λ fλ Kλ 1

2 + BD l ,λCD l ,λGD l ,λλ 1

and S 2 = max 1 l k

BD l ,λCD l ,λλ 1 + λ 1

2 ED l ,λ K.

Guo, Hu and Wu

To quantify the above bounds, we get the following probability inequalities about BD ,λ, CD ,λ, DD ,λ, whose proofs can be found in the earlier work in Hu et al. (2020).

Corollary 5.2 Each of the following three bounds holds with probability 1 δ.

BD ,λ 2 2AD ,λ log 2

2 + 2, CD ,λ 2AD ,λ log 2

and DD ,λ 2AD ,λ,k log 2

where AD ,λ,k = k |D |

λ + 1 |D |/4

N(λ) |D |/4 and AD ,λ = AD ,λ,1, which are consistent with the notations AD,λ,k and AD,λ we deﬁned in Theorem 2.2.

5.1. Distributed U-statistics with unlabeled data and unbounded sampling

Next we turn to the bounds of FD ,λ and GD ,λ, which are more involved in semi-supervised learning. To this end, we need the following lemma.

Lemma 5.3 Let (H, ) be a separable Hilbert space. Let ξ(w, z) be an H-valued random variable deﬁned on (W Z, ρW ρZ). Assume that there exist two constants σ > 0 and M > 0, such that for any integer q 2,

E [ ξ Eξ q] 1

2q!σ2M q 2.

Suppose that from (W, ρW), one draws independently a sample D = {w1, . . . , w|D|}, which is evenly divided to k disjoint subsets D = k l=1Dl with |D1| = = |Dk| = |D|/k. Suppose that similarly, one divides another sample D = k l=1 Dl independently drawn from (Z, ρZ) such that the subsets Dl s are disjoint and | D1| = = | Dk| = | D|/k. Assume that |D| | D|. Then with probability at least 1 δ, there holds

1 | Dl||Dl|

z Dl ξ(w, z) Eξ

2M log(2/δ)

2σ2 log(2/δ)

Proof. Let π be a permutation of the set {1, . . . , |Dl|} of integers, so {wl π(1), . . . , wl π(|Dl|)} is the asso-

ciated permutation of Dl = {wl i}|Dl| i=1 . Let ψ be a permutation of {1, . . . , | Dl|}, so {zl ψ(1), . . . , zl ψ(| Dl|)}

is the associated permutation of Dl = {zl i}| Dl| i=1 . For any 1 l k, we write U l π,ψ = 1 |Dl| P|Dl| i=1 ξ(wl π(i), zl ψ(i)) to obtain

1 |Dl|| Dl|

z Dl ξ(w, z) = 1 |Dl|!| Dl|!

ψ U l π,ψ, (47)

where the last two sums are taken over all the |Dl|! permutations π of {1, . . . , |Dl|} and all the | Dl|! permutations ψ of {1, . . . , | Dl|}, respectively. Note that |D1| = = |Dl| = |D|

k and | D1| = =

| Dl| = | D|

k . One takes the average of (47) over 1 l k, to give

1 |Dl|| Dl|

z Dl ξ(w, z) =1

1 |Dl|!| Dl|!

π,ψ U l π,ψ

= 1 |D1|!| D1|!

l=1 U l π,ψ,

Distributed Minimum Error Entropy Algorithms

l=1 U l π,ψ =1

i=1 ξ(wl π(i), zl ψ(i)) = 1 |D|

i=1 ξ(wl π(i), zl ψ(i)).

By the convexity of the function cosh(t), we obtain that for any c, ϵ > 0,

1 |Dl|| Dl|

z Dl ξ(w, z) Eξ

1 |Dl|!| Dl|!

ψ U l π,ψ Eξ

1 |D1|!| D1|!

l=1 U l π,ψ Eξ

1 |D1|!| D1|!

i=1 ξ(wl π(i), zl ψ(i)) Eξ

/ cosh(cϵ).

One employs Lemma A.2 in Appendix to obtain

1 |Dl|| Dl|

z Dl ξ(w, z) Eξ

2 exp |D|ϵ2

We take δ = 2 exp n |D|ϵ2

2(σ2+Mϵ) o to complete the proof.

Noting that if f(x, u) = f(u, x) for any (x, u) X 2, then K(x,u) is antisymmetric, i.e., K(x,u) = K(u,x) for any (x, u) X 2. The quantities FD ,λ and GD ,λ are involved with unlabeled data, and we will handle them by the feature of antisymmetry of K and get the bounds as follows.

Proposition 5.4 Each of the following two bounds holds with probability at least 1 δ.

FD ,λ 8(M + σ)AD,D ,λ,k log 4

δ , and GD ,λ 8(M + σ)AD,D ,λ log 4

Proof. Recall that K is antisymmetric. So

Z (y v)K(x,u)dρ(x, y)dρ(u, v)

Z y K(x,u)dρ(x, y)dρ(u, v) Z

Z v K(x,u)dρ(x, y)dρ(u, v)

Z 2y K(x,u)dρ(x, y)dρ(u, v).

Recall that we have the relation (x, y) 7 (x, |D |

|D| y) when we embed D to D . Write w = (x, y) and z = (u, v). We have the following decomposition

ˆfρ,D := 1 |D |2 X

w,z D (y v)K(x,u)

= 1 |D |2 X

|D| (y v)K(x,u) + 2 |D |2 X

|D| y K(x,u)

|D | ˆfρ,D + | D|

|D | ˆfρ,D, D,

Guo, Hu and Wu

where ˆfρ,D is deﬁned in Lemma 4.1 and ˆfρ,D, D = 1 |D|| D| P

w D,z D 2y K(x,u). Below, ˆfρ,Dl and ˆfρ,Dl, Dl are similarly deﬁned by substituting D and D with Dl and Dl, respectively. Note that both ˆfρ,D and ˆfρ,D, D are empirical analogs of LKfρ. We have

l=1 (LK + λI) 1

2 ( ˆfρ,Dl LKfρ)

l=1 (LK + λI) 1

2 ( ˆfρ,Dl, Dl LKfρ)

For the ﬁrst term, it has been proved that with probability at least 1 δ

2, there holds

l=1 (LK + λI) 1

2 ( ˆfρ,Dl LKfρ)

K 8(M + σ)AD,λ,k log 4

For the second term, let ξ(w, z) = (LK + λI) 1

2 2y K(x,u) with w = (x, y) D and z = (u, v) D, so

(LK + λI) 1/2 ˆfρ,Dl, Dl = 1 |D|| D|

z D ξ(w, z),

and (LK + λI) 1/2LKfρ = Eξ. With (8), for any integer q 2,

E[ ξ Eξ q K] 2q 1E[ ξ q K] + 2q 1 Eξ q K 2q E[ ξ q K]

22q sup x ,u X (LK + λI) 1/2K(x ,u ) q 2 K E h |Y |q (LK + λI) 1/2K(x,u) 2 K i

q 2 E h E(|Y |q|X) (LK + λI) 1/2K(x,u) 2 K i

2q!σ2M q 2N(λ) = 1

2q![16σ2N(λ)](4M/

Applying Lemma 5.3, one gets that with probability at least 1 δ

l=1 (LK + λI) 1

2 ( ˆfρ,Dl, Dl LKfρ)

Recall that |D| + | D| = |D |. We combine the analysis above by

|D| |D |AD,λ,k + | D|

! |D| |D | + | D|

λ + 1 |D|/4

N(λ) |D|/4 = AD,D ,λ,k.

One completes the proof by observing that GD ,λ is a special case of FD ,λ with k = 1.

Distributed Minimum Error Entropy Algorithms

5.2. Proof of learning rates in semi-supervised learning

Proof of Theorem 2.6. We can decompose f D ,λ fρ ρ as the sample error f D ,λ fλ ρ and the approximation error fλ fρ ρ. As stated in (29), fλ fρ ρ λr hρ ρ for 0 < r 1. Thus, we just estimate the sample error by Corollary 5.1 and bound the right-hand side of (46) term by term. Recall the deﬁnition of D in Section 2.2. One has

λ f D ,λ 2 K ED (0) ED (f D ,λ) h2

w,z D G (y v)2

w,z D (y v)2 = CG |D |2

|D| v 2 + 2 X

w D y2 + | D|

max w D y2 = 4CG |D |

|D| max w D y2.

Thus, f D ,λ K 2 CG|D |

λ|D| 1/2 maxw D |y|. Similar to the estimation (26),

cph 2p 1 |D |2 X

w,z D ( f D ,λ K + |y v|)2p+1

22pcph 2p f D ,λ 2p+1 K + 22p+1 max w D |y|2p+1

22p+1 CG|D |

2 max w D |y|2p+1 + 22p+1 |D |

2p+1 max w D |y|2p+1 !

24p+1cph 2p(C p+ 1

2 G + 1) D,D ,λ(4M + 5σ)2p+1 log2p+1 |D|

where D,D ,λ = |D | λ|D| p+ 1

|D| 2p+1 . Therefore with probability 1 δ,

ED ,λ K C2 D,D ,λh 2p log2p+1 |D|

where C2 = 24p+1cp(C p+ 1

2 G +1)(4M +5σ)2p+1. The following part of the proof is similar to the proof of Theorem 2.2. We include it for the sake of completeness. Recall Corollary 5.2 and Proposition 5.4. For each ﬁxed l = 1, . . . , k, with probability at least 1 4δ, the following three bounds hold true simultaneously.

BD l ,λC2 D l ,λλ 1/2 8

A2 D l ,λ λ + 1

A2 D l ,λ λ + 1

A2 D l ,λλ 1/2 log4 2

Guo, Hu and Wu

BD l ,λCD l ,λλ 1 ED l ,λ K

A2 D l ,λ λ + 1

λ 1C2h 2p Dl,D l ,λ log2p+1 |Dl|

=16C2 Dl,D l ,λh 2pλ 1 A2 D l ,λ λ + 1

log2p+1 |Dl|

BD l ,λCD l ,λGD l ,λλ 1/2

A2 D l ,λ λ + 1

8(M + σ)ADl,D l ,λ

A2 D l ,λ λ + 1

AD l ,λADl,D l ,λλ 1/2 log4 2

where we used log 4

δ , which follows from 4/δ 4/δ2. Therefore, with probability at least 1 4kδ, the following two bounds hold true simultaneously.

S 1 256(1 + M + σ)λ 1/2 log4 2

A2 D l ,λ λ + 1

AD l ,λ AD l ,λ fλ K + ADl,D l ,λ , (48)

S 2 16λ 1/2C2 Dl,D l ,λh 2p log3 2

" A2 D l ,λ λ + 1

AD l ,λλ 1/2 + 1

log2p+1 |Dl|

By Corollary 5.2 and Proposition 5.4, we see that with probability at least 1 δ

2, the following two bounds hold simultaneously.

DD ,λ 2AD ,λ,k log 8

FD ,λ 8(M + σ)AD,D ,λ,k log 16

The proof is completed by scaling δ to δ 8k in (48) and (49), and combining them with (50) and (51).

Proof of Corollary 2.7. We have |D | |D| s+1 2r+s = λ s 1 when 0 < r < 1

2, and |D | |D| = λ 2r s when 1

2 r 1. By (30),

|D | hρ ρλr 1

2 , when 0 < r < 1

|D|, when 1

= hρ ρλr+ s

For any 1 l k, the assumption |D l | |Dl| 4 implies |D l |/4 |D l |/7. The assumption k p

|D |λ1+s implies k

|D |λ (1+s)/2 1. Recall N(λ) C0λ s. We have

AD l ,λ = 1 |D l |

λ + 1 |D l |/4

N(λ) |D l |/4 8k |D |

k |D |λ s 2 1

Distributed Minimum Error Entropy Algorithms

A2 D l ,λ λ + 1 4(8 + p

7C0)2 k |D |λ s 1 + 1 C1 3,

where C1 3 := 4(8 + 7C0)2 + 1. From deﬁnition,

ADl,D l ,λ = 1 |D l |

λ + 1 |Dl|/4

N(λ) |Dl|/4

|D| = 8kλ2r+s 1

So, for any 1 l k,

AD l ,λ fλ K + ADl,D l ,λ 2(8 + p

kλ s fλ K p

|D | + ADl,D l ,λ

kλr + 8kλ2r+s 1

kλr + kλ2r+s 1

where C2 3 := 2(8 + 7C0) hρ ρ + 8 + 7C0. Recall log 16k

δ log 16|D|

δ 2(log |D|) log 16

δ . We summarize the analysis above to give the bound of the second term on the right-hand side of (17).

J 2 :=256(1 + M + σ)λ 1/2 log4 16k

A2 D l ,λ λ + 1

AD l ,λ AD l ,λ fλ K + ADl,D l ,λ

log4 |D| log4 16

kλr + kλ2r+s 1

where C3 3 := 256(1 + M + σ) 24C1 3 2(8 + 7C0)C2 3. We use the assumption (18) to obtain

kλr log4 |D| p

|D |λs+1 λr,

k3/2λ2r+s 1

|D |λs log4 |D|

|D | 1 3 λ 2 2r s

3 3/2 λ2r+s 1 p

So J 2 2C3 3λr log4 16

Guo, Hu and Wu

Next, it is easy to see that for 1 l k, Dl,D l ,λ = D,D ,λ. Note that a a2 + 1 for a 0. We have

J 3 :=16λ 1/2C2h 2p log3 16k

" A2 D l ,λ λ + 1

Dl,D l ,λ log2p+1 16|D|

16C2h 2p 23 log3 |D| log3 16

λ 1/2 (C1 3)2 + 1

D,D ,λ 22p+1 (log |D|)2p+1 log 16

= C4 3h 2pλ 1/2 D,D ,λ (log |D|)2p+4 log 16

where C4 3 := 16C2 22p+4 (C1 3)2 + 1 . Now we bound the last term on the right-hand side of (17). By deﬁnition and the bound (52) above,

AD ,λ,k fλ K = fλ K

λ + 1 |D |/4

N(λ) |D |/4

The assumption (18) of k implies k

λ|D | λs/2 1. So AD ,λ,k fλ K hρ ρ(8 + 7C0)λr. By

the assumption |D | λ 1 s, k p

|D |λ1+s, and r + s 1

AD,D ,λ,k = k |D |

λ + 1 |D|/4

|D| + 7λ2r+s 1

We summarize the above analysis and use (17) to obtain that with probability 1 δ,

f D ,λ fρ ρ hρ ρλr + 2C3 3λr log4 16

+ C4 3h 2pλ 1/2 D,D ,λ (log |D|)2p+4 log 16

+ (2 hρ ρ + 8M + 8σ)(8 + p

7C0)λr log 16

which yields the conclusion with C3 := hρ ρ + 2C3 3 + C4 3 + (2 hρ ρ + 8M + 8σ)(8 + 7C0).

Acknowledgments

The work described in this paper was partially supported by National Natural Science Foundation of China (Projects 11671307, 11571078, 11671171) and the Research Grants Council of the Hong Kong

Distributed Minimum Error Entropy Algorithms

Special Administrative Region, China (Project No. Poly U 25301115). We thank the anonymous reviewers for their constructive comments. All the three authors contributed equally to the paper. The corresponding author is Ting Hu.

Appendix A. Concentration inequalities

Lemma A.1 (Pinelis and Sakhanenko, 1986, Theorem 3) Let {ηi}n i=1 be a be a sequence of independent random variables with values in a separable Hilbert space (H, ) and E(ηi) = 0 for each i = 1 , n. Then for any c > 0, there holds

i=1 E ec ηi c ηi .

Lemma A.2 Let ξ(z) be a random variable deﬁned on (Z, ρ) with values in a separable Hilbert space (H, ). Assume that there are two positive constants σ > 0 and M > 0 such that for any integer q 2,

E[ ξ(z) Eξ q] 1

2q!σ2M q 2, (53)

then for any ϵ > 0 and any positive integer n, there holds

i=1 ξ(zi) E(ξ)

/ cosh(cϵ) 2 exp nϵ2

Proof. Applying Lemma A.1 with ηi = 1

n [ξ(zi) E(ξ)], we get that

i=1 ξ(zi) E(ξ)

c ξ(zi) E(ξ)

n c ξ(zi) E(ξ)

For each 1 i n and any c > 0, by Taylor expansion and the elementary inequality 1 + a ea

for any a R, we have

c ξ(zi) E(ξ)

n c ξ(zi) E(ξ)

cq E[ ξ(z) E(ξ) q]

cq E[ ξ(z) E(ξ) q]

We set 0 < c < n/M and recall (53) to obtain,

c ξ(zi) E(ξ)

n c ξ(zi) E(ξ)

i=1 ξ(zi) E(ξ)

Taking c = nϵ σ2+Mϵ (0, n/M) and noting that cosh(cϵ) ecϵ

2 , the conclusion is obtained.

Guo, Hu and Wu

Appendix B. A technical lemma

The following lemma is straightforward, and is used in the literature to yield learning rates in expectation, from the learning rates with probability. We include it just for completeness.

Lemma B.1 Let R be a non-negative random variable. Let α, γ > 0 and β 1. If for any 0 < δ < 1, with probability at least 1 δ,

then for any real number µ > 0,

[E(Rµ)]1/µ α [βΓ(µγ + 1)]1/µ ,

where Γ(t) = R 0 e uut 1du is the Gamma function.

Proof. For any real number µ > 0 and 0 < δ < 1, we have Prob(Rµ > αµ logµγ β

t = αµ logµγ β

δ > αµ logµγ β to give δ = β exp n (t/αµ)

1 µγ o . So, when t > αµ logµγ β,

Prob {Rµ > t} β exp n (t/αµ)

1 µγ o . (55)

When 0 < t < αµ logµγ β, the bound (55) also holds true because its right-hand side is greater than 1,

β exp n (t/αµ)

1 µγ o > β exp { log β} = 1.

0 Prob(Rµ > t)dt β Z

0 exp n (t/αµ)

uµγ=t/αµ ====== βαµµγ Z

0 e uuµγ 1du = αµβΓ(µγ + 1).

The proof is complete.

Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms in learning theory. J. Complexity, 23(1):52 72, 2007.

Mikhail Belkin and Partha Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56(1):209 239, 2004.

Rajendra Bhatia. Matrix analysis, volume 169 of Graduate Texts in Mathematics. Springer-Verlag, New York, 1997.

Gilles Blanchard and Nicole Kr amer. Optimal learning rates for kernel conjugate gradient regression. In Advances in Neural Information Processing Systems 23, pages 226 234. Curran Associates, Inc., 2010.

Gilles Blanchard and Nicole Kr amer. Convergence rates of kernel conjugate gradient for random design regression. Anal. Appl. (Singap.), 14(6):763 794, 2016.

Distributed Minimum Error Entropy Algorithms

Gilles Blanchard and Nicole M ucke. Optimal rates for regularization of statistical inverse learning problems. Found. Comput. Math., 18(4):971 1013, 2018.

Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (Madison, WI, 1998), pages 92 100. ACM, New York, 1998.

Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Found. Comput. Math., 7(3):331 368, 2007.

Andrea Caponnetto and Yuan Yao. Cross-validation based adaptation for regularization operators in learning theory. Anal. Appl. (Singap.), 8(2):161 183, 2010.

Olivier Chapelle, Bernhard Sch olkopf, and Alexander Zien. Semi-Supervised Learning. MIT Press, 2006.

Badong Chen, Yu Zhu, and Jinchun Hu. Mean-square convergence analysis of ADALINE training with minimum error entropy criterion. IEEE Transactions on Neural Networks, 21(7):1168 1179, 2010.

Andreas Christmann and Ding-Xuan Zhou. On the robustness of regularized pairwise learning methods based on kernels. J. Complexity, 37:1 33, 2016.

Felipe Cucker and Ding-Xuan Zhou. Learning theory: an approximation theory viewpoint, volume 24 of Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, Cambridge, 2007. With a foreword by Stephen Smale.

Deniz Erdogmus and Jose C. Principe. Comparison of entropy and mean square error criteria in adaptive system training using higher order statistics. In Proceedings of the Second International Workshop on Independent Component Analysis and Blind Signal, pages 75 90. Berlin: Springer Verlag, 2000.

Deniz Erdogmus and Jose C. Principe. Convergence properties and data eﬃciency of the minimum error entropy criterion in ADALINE training. IEEE Transactions on Signal Processing, 51(7): 1966 1978, 2003.

Jun Fan, Ting Hu, Qiang Wu, and Ding-Xuan Zhou. Consistency analysis of an empirical minimum error entropy algorithm. Appl. Comput. Harmon. Anal., 41(1):164 189, 2016.

Erhan Gokcay and Jose C. Principe. Information theoretic clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):158 171, 2002.

Zheng-Chu Guo, Shao-Bo Lin, and Ding-Xuan Zhou. Learning theory of distributed spectral algorithms. Inverse Problems, 33(7):074009, 29, 2017a.

Zheng-Chu Guo, Lei Shi, and Qiang Wu. Learning theory of distributed regression with bias corrected regularization kernel network. J. Mach. Learn. Res., 18:Paper No. 118, 25, 2017b.

Ting Hu, Jun Fan, Qiang Wu, and Ding-Xuan Zhou. Learning theory approach to minimum error entropy criterion. J. Mach. Learn. Res., 14:377 397, 2013.

Ting Hu, Jun Fan, Qiang Wu, and Ding-Xuan Zhou. Regularization schemes for minimum error entropy principle. Anal. Appl. (Singap.), 13(4):437 455, 2015.

Ting Hu, Qiang Wu, and Ding-Xuan Zhou. Convergence of gradient descent for minimum error entropy principle in linear regression. IEEE Trans. Signal Process., 64(24):6571 6579, 2016.

Guo, Hu and Wu

Ting Hu, Qiang Wu, and Ding-Xuan Zhou. Distributed kernel gradient descent algorithm for minimum error entropy principle. Appl. Comput. Harmon. Anal., 49(1):229 256, 2020.

Shao-Bo Lin and Ding-Xuan Zhou. Distributed kernel-based gradient descent algorithms. Constr. Approx., 47(2):249 276, 2018.

Shao-Bo Lin, Xin Guo, and Ding-Xuan Zhou. Distributed learning with regularized least squares. J. Mach. Learn. Res., 18:Paper No. 92, 31, 2017.

L. Lo Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri. Spectral algorithms for supervised learning. Neural Comput., 20(7):1873 1897, 2008.

Shahar Mendelson and Joseph Neeman. Regularization in kernel learning. Ann. Statist., 38(1): 526 565, 2010.

Nicole M ucke and Gilles Blanchard. Parallelizing spectrally regularized kernel algorithms. J. Mach. Learn. Res., 19:Paper No. 30, 29, 2018.

Emanuel Parzen. On estimation of a probability density function and mode. Ann. Math. Statist., 33:1065 1076, 1962.

I. F. Pinelis and A. I. Sakhanenko. Remarks on inequalities for large deviation probabilities. Theory of Probability & Its Applications, 30(1):143 148, 1986.

Jose C. Principe. Information theoretic learning. Information Science and Statistics. Springer, New York, 2010. Renyi s entropy and kernel perspectives.

Pengcheng Shen and Chunguang Li. Minimum total error entropy method for parameter estimation. IEEE Trans. Signal Process., 63(15):4079 4090, 2015.

Lu ıs M. Silva, J. Marques de S a, and Lu ıs A. Alexandre. The MEE principle in data classiﬁcation: a perceptron-based analysis. Neural Comput., 22(10):2698 2728, 2010.

Steve Smale and Ding-Xuan Zhou. Learning theory estimates via integral operators and their approximations. Constr. Approx., 26(2):153 172, 2007.

I. Steinwart, D. R. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In COLT 2009 - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009.

Aad W. van der Vaart and Jon A. Wellner. Weak convergence and empirical processes. Springer Series in Statistics. Springer-Verlag, New York, 1996. With applications to statistics.

Cheng Wang and Ting Hu. Online minimum error entropy algorithm with unbounded sampling. Anal. Appl. (Singap.), 17(2):293 322, 2019.

Jun Wang, Tony Jebara, and Shih-Fu Chang. Semi-supervised learning using greedy max-cut. J. Mach. Learn. Res., 14:771 800, 2013.

Y. Wang, R. Khardon, D. Pechyony, and R. Jones. Generalization bounds for online learning algorithms with pairwise loss functions. In COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, pages 13.1 13.22, 2012.

Yiming Ying and Ding-Xuan Zhou. Online pairwise learning algorithms. Neural Comput., 28(4): 743 777, 2016.

Yiming Ying and Ding-Xuan Zhou. Unregularized online learning algorithms with general loss functions. Appl. Comput. Harmon. Anal., 42(2):224 244, 2017.

Distributed Minimum Error Entropy Algorithms

T. Zhang. Eﬀective dimension and generalization of kernel learning. In Advances in Neural Information Processing Systems, pages 454 461, 2002.

Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates. J. Mach. Learn. Res., 16:3299 3340, 2015.

Ding-Xuan Zhou. The covering number in learning theory. J. Complexity, 18(3):739 767, 2002.

M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems 23, pages 2595 2603. Curran Associates, Inc., 2010.