Generalization Bounds for Regularized Pairwise Learning

Yunwen Lei1, Shao-Bo Lin2, Ke Tang1
1 Shenzhen Key Laboratory of Computational Intelligence, Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen
2 Department of Mathematics, Wenzhou University, Wenzhou
leiyw@sustc.edu.cn, sblin1983@gmail.com, tangk3@sustc.edu.cn

Abstract

Pairwise learning refers to learning tasks whose associated loss functions depend on pairs of examples. Pairwise learning has recently received increasing attention since it covers many machine learning schemes, e.g., metric learning, ranking and AUC maximization, in a unified framework. In this paper, we establish a unified generalization error bound for regularized pairwise learning that requires neither Bernstein conditions nor capacity assumptions. We apply this general result to typical learning tasks, including distance metric learning and ranking, and in each case our analysis improves the state-of-the-art results.

1 Introduction

Recently, there has been growing interest in a large family of machine learning problems called pairwise learning problems. Unlike traditional learning problems whose loss functions depend only on a single example (e.g., classification and regression), pairwise learning refers to learning tasks for which the associated loss function involves a pair of examples. Specifically, for any two examples $(x,y),(\tilde{x},\tilde{y})\in\mathbb{R}^d\times\mathbb{R}$, the loss function for pairwise learning often takes the form $V(h,(x,y),(\tilde{x},\tilde{y}))$ for a hypothesis function $h:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$.
Many learning tasks can be cast into the framework of pairwise learning, including ranking [Rejchel, 2012; Kriukova et al., 2016], metric learning [Xing et al., 2003; Cao et al., 2016], AUC maximization [Cortes and Mohri, 2004; Gao and Zhou, 2015; Gao et al., 2013; Zhao et al., 2011], gradient learning [Mukherjee and Zhou, 2006] and learning under the minimum error entropy criterion [Hu et al., 2015]. For example, supervised metric learning aims to find a Mahalanobis metric $d_w(x,\tilde{x})=(x-\tilde{x})^\top w(x-\tilde{x})$ encoded by a positive semi-definite matrix $w\in S^{d\times d}$ to bring examples with similar labels together while keeping examples with different labels apart [Xing et al., 2003], where $S^{d\times d}$ is the class of all positive semi-definite matrices in $\mathbb{R}^{d\times d}$. In this case, a common loss function $V(d_w,(x,y),(\tilde{x},\tilde{y}))=g(y\tilde{y}(1-d_w(x,\tilde{x})))$ involves two examples $(x,y),(\tilde{x},\tilde{y})$, where $g:\mathbb{R}\to\mathbb{R}_+$ is convex and a typical choice is the hinge loss $g(t)=\max(1-t,0)$ [Cao et al., 2016]. Motivated by the growing interest in pairwise learning, generalization analyses of different pairwise learning machines have been conducted to better understand their practical behavior [Kar et al., 2013; Ying and Zhou, 2016; Christmann and Zhou, 2016; Clémençon et al., 2008; Rejchel, 2012]. A difficulty in the learning theory analysis of pairwise learning is that the empirical error cannot be written as a sum of independent and identically distributed (i.i.d.) random variables, rendering standard techniques from the i.i.d. case inapplicable in this context [Sridharan et al., 2009; Bartlett et al., 2005]. For these learning problems with coupled examples, existing studies use either techniques from U-processes to derive uniform convergence bounds [Rejchel, 2012; Clémençon et al., 2008; Cao et al., 2016; Zhao et al., 2017; Lei and Ying, 2016] or algorithmic stability/robustness to establish algorithm-specific bounds [Bellet and Habrard, 2015; Jin et al., 2009; Agarwal and Niyogi, 2009].
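To make the metric-learning loss above concrete, the following is a minimal Python sketch. It is our own illustration, not code from the paper; the names `mahalanobis_sq` and `metric_pairwise_loss` are hypothetical.

```python
import numpy as np

def mahalanobis_sq(w, x, x_tilde):
    """Squared Mahalanobis distance d_w(x, x~) = (x - x~)^T w (x - x~)
    for a positive semi-definite matrix w."""
    diff = x - x_tilde
    return float(diff @ w @ diff)

def metric_pairwise_loss(w, z, z_tilde):
    """Pairwise loss V(d_w, (x, y), (x~, y~)) = g(y y~ (1 - d_w(x, x~)))
    with the hinge choice g(t) = max(1 - t, 0)."""
    (x, y), (x_t, y_t) = z, z_tilde
    t = y * y_t * (1.0 - mahalanobis_sq(w, x, x_t))
    return max(1.0 - t, 0.0)

# A similar pair that is already close incurs little loss; a similar pair
# that lies far apart is penalized heavily.
w = np.eye(2)
close_similar = metric_pairwise_loss(w, (np.array([0.0, 0.0]), 1), (np.array([0.1, 0.0]), 1))
far_similar = metric_pairwise_loss(w, (np.array([2.0, 0.0]), 1), (np.array([0.0, 0.0]), 1))
```

Averaging this loss over all pairs of training examples, plus a regularization term, gives exactly the kind of regularized pairwise objective analyzed in this paper.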
However, existing studies on pairwise learning problems are unsatisfactory in three respects. Firstly, existing generalization bounds are mostly derived for specific instantiations of pairwise learning problems, and a unified framework for studying generalization errors of regularized pairwise learning is still lacking. Secondly, most of these discussions consider only estimation errors. Thirdly, the existing estimation error bounds either are slow [Jin et al., 2009; Bellet and Habrard, 2015; Agarwal and Niyogi, 2009; Cao et al., 2016] or require capacity assumptions on hypothesis spaces and Bernstein conditions for an application of Talagrand's inequality [Rejchel, 2012; Clémençon et al., 2008]. In this paper, we provide a unified analysis for regularized pairwise learning by showing that the regularized generalization error of the estimator converges to the optimal value at a rate of order $O(1/n)$, where $n$ is the number of training examples. Our discussion requires neither Bernstein conditions on the bias and variance nor capacity assumptions on the hypothesis spaces. The fact that the empirical error is a U-statistic makes the argument from the i.i.d. case inapplicable to our context; we bypass this obstacle by resorting to established techniques for U-processes. Based on this, we develop generalization error bounds for regularized pairwise learning that also take approximation errors into consideration. We apply our general results to metric learning and ranking, and in each case our general analysis implies generalization bounds tighter than the state of the art.

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

2 Related Work

Here, we review related work on generalization analysis of pairwise learning algorithms, organized by approach.
Generalization error bounds were established for regularized metric learning [Jin et al., 2009] and ranking [Agarwal and Niyogi, 2009] based on algorithmic stability. The basic idea is to use the strong convexity of the objective function in regularized pairwise learning problems to show that the learned model changes only slightly if a single training example is replaced by another one. Based on this, McDiarmid's inequality is applied to establish generalization bounds. However, this approach can only yield a suboptimal estimation bound $O(1/(\lambda\sqrt{n}))$, where $\lambda$ is the regularization parameter. The tool of U-processes was also used in the generalization analysis of regularized metric learning [Cao et al., 2016] and ranking [Clémençon et al., 2008; Rejchel, 2012]. The basic idea is to use the symmetry of U-statistics to control the supremum of a U-process by the supremum of a Rademacher process, the latter of which can be bounded by standard techniques from the i.i.d. setting. However, existing studies with this approach either imply a suboptimal estimation bound $O(1/\sqrt{\lambda n})$ or require both Bernstein conditions and capacity assumptions for a fast learning rate. Integral operators were used to establish learning rates for regularized least squares ranking [Zhao et al., 2017]. The basic idea is to show that the involved optimization problem has a closed-form solution in terms of integral operators, which, however, applies only to the least squares loss. Recently, regret bounds for online pairwise learning algorithms were established based on online-to-batch conversion together with covering numbers [Wang et al., 2012] and Rademacher complexities [Kar et al., 2013]. The convergence rates of the last iterate of online pairwise learning algorithms were studied based on convex analysis [Ying and Zhou, 2016; Guo et al., 2016; Lin et al., 2017], which, however, as we show below, are suboptimal.
3 Problem Setup and Main Results

3.1 Regularized Pairwise Learning

Let $\rho$ be a probability measure defined over the sample space $Z=X\times Y\subset\mathbb{R}^d\times\mathbb{R}$, where $X$ and $Y$ are the input and output spaces, respectively. Let $\mathbf{z}=\{z_i=(x_i,y_i)\}_{i=1}^n$ be a sequence of examples independently drawn from $\rho$. Let $W$ be a Hilbert space with associated inner product $\langle\cdot,\cdot\rangle$, and let $\|\cdot\|$ and $|||\cdot|||$ be two norms defined on $W$. Let $\phi:X\times X\to W$ be a feature map and $\tau:Y\times Y\to\mathbb{R}$. The definition of $\tau(y,\tilde{y})$ depends on the specific application domain (see Examples 1 and 2 below). From the feature map $\phi$, one can define a Mercer kernel $K:(X\times X)\times(X\times X)\to\mathbb{R}$ satisfying $K((x_1,\tilde{x}_1),(x_2,\tilde{x}_2))=\langle\phi(x_1,\tilde{x}_1),\phi(x_2,\tilde{x}_2)\rangle$ for the reproducing kernel Hilbert space $W$ [Clémençon et al., 2008; Rejchel, 2012]. For any two examples $z=(x,y)$, $\tilde{z}=(\tilde{x},\tilde{y})$ and a prediction rule $h:X\times X\to\mathbb{R}$, we use a loss function of the form $V(h,z,\tilde{z})=\ell(\tau(y,\tilde{y}),h(x,\tilde{x}))$ to measure the quality of $h$ on $z$ and $\tilde{z}$, where $\ell:\mathbb{R}\times\mathbb{R}\to\mathbb{R}_+$ is convex with respect to (w.r.t.) the second argument. We assume $V$ is symmetric in the sense that $V(h,z,\tilde{z})=V(h,\tilde{z},z)$. The generalization error of any $h:X\times X\to\mathbb{R}$ is defined as $\mathcal{E}(h):=\mathbb{E}_{z,\tilde{z}}[V(h,z,\tilde{z})]$, where $\mathbb{E}_{z,\tilde{z}}$ denotes the conditional expectation w.r.t. $z$ and $\tilde{z}$. We omit the subscript $z,\tilde{z}$ if the expectation is taken over all random variables. We consider prediction rules of the form $h_w(x,\tilde{x}):=\langle w,\phi(x,\tilde{x})\rangle$, $w\in W$, and we find the estimator $w_{\mathbf{z},\lambda}$ by minimizing the empirical error plus a regularization term to avoid overfitting:
$$w_{\mathbf{z},\lambda}=\arg\min_{w\in W}\Big[F_{\mathbf{z},\lambda}(w):=\lambda|||w|||^2+\frac{1}{n(n-1)}\sum_{i,j\in\mathbb{N}_n,i\neq j}\ell\big(\tau(y_i,y_j),\langle w,\phi(x_i,x_j)\rangle\big)\Big],\qquad(1)$$
where $\lambda>0$ is a regularization parameter and we use the notation $\mathbb{N}_n=\{1,\ldots,n\}$. The regularized generalization error of the prediction rule $h_w$ is defined as
$$F_\lambda(w)=\mathbb{E}_{z,\tilde{z}}\big[\ell(\tau(y,\tilde{y}),\langle w,\phi(x,\tilde{x})\rangle)\big]+\lambda|||w|||^2.\qquad(2)$$
We denote by $w_\lambda$ the minimizer of $F_\lambda(w)$ over $W$ and by $h_\rho$ the minimizer of the generalization error over measurable functions defined on $X\times X$, respectively:
$$w_\lambda:=\arg\min_{w\in W}F_\lambda(w)\quad\text{and}\quad h_\rho:=\arg\min_h\mathcal{E}(h).\qquad(3)$$
The framework of pairwise learning covers many machine learning problems as special cases. Here we clarify how distance metric learning and ranking are recovered with specific instantiations of $\ell$, $\phi$ and $\tau$. We denote $(t)_+:=\max(t,0)$ and by $x^\top$ the transpose of $x\in\mathbb{R}^d$.

Example 1 (Supervised metric learning). Assume $Y=\{\pm1\}$. Supervised metric learning aims to find a Mahalanobis metric $d_w(x_i,x_j)=\langle w,(x_i-x_j)(x_i-x_j)^\top\rangle$, $w\in S^{d\times d}$, such that two examples with the same label are close to each other, while two examples with different labels are far apart. A common loss function used in metric learning takes the form $V_m(d_w,z,\tilde{z})=g(y\tilde{y}(1-d_w(x,\tilde{x})))$ [Jin et al., 2009], where $g:\mathbb{R}\to\mathbb{R}_+$ is a convex function for which a typical choice is $g(t)=(1-t)_+$. This scheme falls into the pairwise learning framework if we define $\tau(y,\tilde{y})=y\tilde{y}$, $\phi(x,\tilde{x})=(x-\tilde{x})(x-\tilde{x})^\top$, $\ell(a,b)=g(a(1-b))$ and $h_w(x,\tilde{x})=\langle w,\phi(x,\tilde{x})\rangle$, $w\in S^{d\times d}$. That is, we have $V_m(d_w,z,\tilde{z})=\ell(\tau(y,\tilde{y}),\langle w,\phi(x,\tilde{x})\rangle)$.

Example 2 (Ranking). In ranking problems, we use the output $y_i$ to indicate the ordering between instances, i.e., the instance $x_i$ is considered better than $x_j$ if $y_i>y_j$. The task is to predict the ordering between objects by constructing ranking rules $h:X\times X\to\mathbb{R}$ and predicting $y>\tilde{y}$ if $h(x,\tilde{x})>0$ [Rejchel, 2012; Clémençon et al., 2008]. A common pairwise loss function used in ranking problems takes the form $V_r(h,z,\tilde{z})=g(\mathrm{sign}(y-\tilde{y})h(x,\tilde{x}))$, where $g:\mathbb{R}\to\mathbb{R}_+$ is a convex function that can be the exponential cost $g(t)=e^{-t}$, the logit cost $g(t)=\log(1+e^{-t})$ or the hinge loss $g(t)=(1-t)_+$ [Clémençon et al., 2008]. Here $\mathrm{sign}(t)$ denotes the sign of $t\in\mathbb{R}$.
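The empirical objective of Eq. (1) can be written down directly as an average over all ordered pairs of examples. The following Python sketch is our own illustration (not code from the paper): `empirical_objective` is a hypothetical name, and the Euclidean norm stands in for $|||\cdot|||$. It evaluates $F_{\mathbf{z},\lambda}(w)$ for any instantiation of $\ell$, $\tau$ and $\phi$, here the ranking instantiation of Example 2.

```python
import numpy as np

def empirical_objective(w, xs, ys, phi, tau, ell, lam):
    """F_{z,lam}(w) = lam * |||w|||^2
       + (1 / (n (n - 1))) * sum_{i != j} ell(tau(y_i, y_j), <w, phi(x_i, x_j)>),
    with |||.||| taken to be the Euclidean/Frobenius norm for illustration."""
    n = len(xs)
    pair_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                pair_sum += ell(tau(ys[i], ys[j]), float(np.sum(w * phi(xs[i], xs[j]))))
    return lam * float(np.sum(w * w)) + pair_sum / (n * (n - 1))

# Ranking instantiation (Example 2): tau = sign(y - y~), ell(a, b) = g(a b)
# with the logit cost g(t) = log(1 + exp(-t)).
tau = lambda y, y_t: float(np.sign(y - y_t))
ell = lambda a, b: float(np.log(1.0 + np.exp(-a * b)))
phi = lambda x, x_t: x - x_t

xs = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
ys = [0.0, 1.0, 2.0]
obj = empirical_objective(np.array([0.0]), xs, ys, phi, tau, ell, lam=0.1)
```

At $w=0$ every pairwise prediction is $0$, so the empirical part equals $\log 2$; a $w$ aligned with the correct ordering decreases the objective. The double loop over $i\neq j$ is exactly the U-statistic structure that complicates the generalization analysis.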
The above formulation of ranking falls into our framework of pairwise learning by taking $\tau(y,\tilde{y})=\mathrm{sign}(y-\tilde{y})$, $\ell(a,b)=g(ab)$ and ranking rules of the form $h_w(x,\tilde{x})=\langle w,\phi(x,\tilde{x})\rangle$, $w\in W$. That is, we have $V_r(h_w,z,\tilde{z})=\ell(\tau(y,\tilde{y}),\langle w,\phi(x,\tilde{x})\rangle)$.

3.2 Main Results

We now present our main result, showing that the uniform deviation of the population sub-optimality $F_\lambda(w)-F_\lambda(w_\lambda)$ from the empirical sub-optimality $F_{\mathbf{z},\lambda}(w)-F_{\mathbf{z},\lambda}(w_\lambda)$ decays at the rate $O(1/n)$, provided that $F_\lambda$ is strongly convex w.r.t. the norm $\|\cdot\|$, whose dual norm is denoted by $\|\cdot\|_*$. Introduce the notation $X_*=\sup_{x,\tilde{x}\in X}\|\phi(x,\tilde{x})\|_*$. For any $w\in W$, we refer to $F_\lambda(w)-F_\lambda(w_\lambda)$ as the excess regularized generalization error (ERGE) of $w$. Let $e=\exp(1)$.

Definition 1 (Strong Convexity). A function $f:W\to\mathbb{R}$ is said to be $\beta$-strongly convex ($\beta>0$) w.r.t. $\|\cdot\|$ if for all $w,\tilde{w}\in W$ and $\alpha\in(0,1)$ we have
$$f(\alpha w+(1-\alpha)\tilde{w})\le\alpha f(w)+(1-\alpha)f(\tilde{w})-\frac{\beta}{2}\alpha(1-\alpha)\|w-\tilde{w}\|^2.$$

Theorem 1 (Main theorem). Let $L>0$ and $\beta>0$. Assume the loss function $\ell$ is $L$-Lipschitz continuous in the sense that
$$|\ell(\tau(y,\tilde{y}),a)-\ell(\tau(y,\tilde{y}),b)|\le L|a-b|,\quad\forall y,\tilde{y}\in Y,\;a,b\in\mathbb{R}.\qquad(4)$$
Assume that $F_\lambda(w)$ defined in Eq. (2) is $\beta$-strongly convex w.r.t. the norm $\|\cdot\|$. Let $w_\lambda$ be defined by Eq. (3), and let $\rho_0>0$ be arbitrary. Then, for any $0<\delta_0<1/e$, with probability at least $1-2\delta_0$ the following inequality holds for all $w\in W$:
$$F_\lambda(w)-F_\lambda(w_\lambda)\le F_{\mathbf{z},\lambda}(w)-F_{\mathbf{z},\lambda}(w_\lambda)+\frac{4LX_*\Big(1+\sqrt{\log\delta_0^{-1}+\log\max\big(1,2\rho_0^{-1}(F_\lambda(w)-F_\lambda(w_\lambda))\big)}\Big)}{\sqrt{n\beta}}\sqrt{\max\big(\rho_0,2(F_\lambda(w)-F_\lambda(w_\lambda))\big)}.\qquad(5)$$
In particular, for $w_{\mathbf{z},\lambda}$ in Eq. (1), the following inequality holds with probability at least $1-2\delta_0$ for all $0<a<1$:
$$F_\lambda(w_{\mathbf{z},\lambda})-F_\lambda(w_\lambda)\le\frac{128L^2X_*^2}{n\beta(1-a)}\log\frac{1}{a\delta_0}.\qquad(6)$$
Note that the norm $|||\cdot|||$ in the definition of the regularization algorithm (1) is not necessarily equal to the norm w.r.t. which $F_\lambda$ is strongly convex. In learning theory, we often refer to the term $D(\lambda):=\mathcal{E}(w_\lambda)-\mathcal{E}(h_\rho)+\lambda|||w_\lambda|||^2$ as the approximation error.
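Definition 1 is easy to check numerically for the regularizers used later: as noted in Section 4.2, $\lambda\|\cdot\|^2$ is $(2\lambda)$-strongly convex w.r.t. the inner-product norm. The sketch below is our own illustration of that fact for the Euclidean case; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.3
beta = 2.0 * lam  # f(w) = lam * ||w||_2^2 is beta-strongly convex with beta = 2 * lam

def f(w):
    return lam * float(w @ w)

# Definition 1: f(alpha w + (1 - alpha) w~)
#   <= alpha f(w) + (1 - alpha) f(w~) - (beta / 2) alpha (1 - alpha) ||w - w~||^2
holds = True
for _ in range(1000):
    w, w_t = rng.normal(size=5), rng.normal(size=5)
    alpha = rng.uniform(0.01, 0.99)
    lhs = f(alpha * w + (1.0 - alpha) * w_t)
    rhs = (alpha * f(w) + (1.0 - alpha) * f(w_t)
           - 0.5 * beta * alpha * (1.0 - alpha) * float((w - w_t) @ (w - w_t)))
    holds = holds and lhs <= rhs + 1e-9
```

For a quadratic $f$ the inequality in Definition 1 holds with equality, so $\beta=2\lambda$ is the largest admissible strong-convexity constant in this example; the small tolerance only absorbs floating-point rounding.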
It is a standard assumption that the approximation error admits a polynomial decay rate [Smale and Zhou, 2003; Ying and Zhou, 2016; Zhao et al., 2017]. In particular, if $h_\rho\in W$ then $D(\lambda)\le\lambda|||h_\rho|||^2$ [Guo et al., 2016]. Throughout the paper we use the abbreviation $\mathcal{E}(w):=\mathcal{E}(h_w)$.

Assumption 1. We assume the approximation error $D(\lambda)$ enjoys a polynomial decay rate with exponent $0<\alpha\le1$, in the sense that $D(\lambda)\le c_\alpha\lambda^\alpha$ for all $\lambda>0$, where $c_\alpha>0$ is a constant.

Theorem 2. Suppose that the assumptions of Theorem 1 and Assumption 1 hold. Then, for any $0<\delta_0<1/e$, with probability $1-2\delta_0$ there holds
$$\mathcal{E}(w_{\mathbf{z},\lambda})-\mathcal{E}(h_\rho)+\lambda|||w_{\mathbf{z},\lambda}|||^2\le c_\alpha\lambda^\alpha+\frac{256L^2X_*^2}{n\beta}\log\frac{2}{\delta_0}.$$

Proof. Plugging $a=\frac12$ into Eq. (6) and recalling $F_\lambda(w)=\mathcal{E}(w)+\lambda|||w|||^2$, with probability at least $1-2\delta_0$ we can upper bound the term $\mathcal{E}(w_{\mathbf{z},\lambda})-\mathcal{E}(h_\rho)+\lambda|||w_{\mathbf{z},\lambda}|||^2$ by
$$\mathcal{E}(w_\lambda)-\mathcal{E}(h_\rho)+\lambda|||w_\lambda|||^2+\frac{256L^2X_*^2}{n\beta}\log\frac{2}{\delta_0}.$$
The stated bound follows by plugging Assumption 1 into the above inequality.

Remark 1. The regularization parameter $\lambda$ should be chosen according to $\beta$ and $n$ to achieve optimal generalization bounds. Convergence rates $O(\log(n)/(n\beta^2))$ were established for the $n$-th iterate of online regularized pairwise learning algorithms [Guo et al., 2016], which are suboptimal compared to (6) if $\beta\le1$. For example, under the conditions of Theorem 2 with $\beta\asymp\lambda$, Theorem 2 with $\lambda\asymp n^{-\frac{1}{\alpha+1}}$ implies the learning rate $O(n^{-\frac{\alpha}{\alpha+1}})$ with high probability, while the discussions in [Guo et al., 2016] and [Lin et al., 2017] only imply the rate $O(n^{-\frac{\alpha}{2(\alpha+1)}}\log n)$ and the rate $O(n^{-\frac{\alpha}{1+2\alpha}}\log n)$ in expectation, respectively. Here, $A\asymp B$ means there are universal constants $C_1,C_2>0$ with $C_1A\le B\le C_2A$. In Section 4, we apply Theorem 2 to improve the existing bounds for metric learning and ranking.

Remark 2. Another quantity of interest in the literature on regularized pairwise learning is $\mathcal{E}(w_{\mathbf{z},\lambda})-\mathcal{E}_{\mathbf{z}}(w_{\mathbf{z},\lambda})$. We now relate it to $\mathcal{E}(w_{\mathbf{z},\lambda})-\mathcal{E}(h_\rho)+\lambda|||w_{\mathbf{z},\lambda}|||^2$.
On the one hand, by Theorem 8.3 in [Cucker and Zhou, 2007] and $|||w_\lambda|||\le\sqrt{D(\lambda)/\lambda}$, we have $\mathcal{E}(w_{\mathbf{z},\lambda})-\mathcal{E}(h_\rho)+\lambda|||w_{\mathbf{z},\lambda}|||^2=O(\lambda^{\frac{\alpha-1}{2}}n^{-\frac12})+\mathcal{E}(w_{\mathbf{z},\lambda})-\mathcal{E}_{\mathbf{z}}(w_{\mathbf{z},\lambda})+D(\lambda)$ with high probability, where $\mathcal{E}_{\mathbf{z}}(w)=F_{\mathbf{z},\lambda}(w)-\lambda|||w|||^2$ is the empirical error (without the regularizer). On the other hand, we can take $w=w_{\mathbf{z},\lambda}$ in (5) to get $\mathcal{E}(w_{\mathbf{z},\lambda})-\mathcal{E}_{\mathbf{z}}(w_{\mathbf{z},\lambda})\le\mathcal{E}(w_\lambda)-\mathcal{E}_{\mathbf{z}}(w_\lambda)+E$ with high probability, where $E$ is the last term in (5) with $w=w_{\mathbf{z},\lambda}$. Taking $\rho_0$ in (5) as the right-hand side of (6) and using (6), we get $E=O(1/(n\beta))$. This, together with $\mathcal{E}(w_\lambda)-\mathcal{E}_{\mathbf{z}}(w_\lambda)=O(\lambda^{\frac{\alpha-1}{2}}n^{-\frac12})$, allows us to derive $\mathcal{E}(w_{\mathbf{z},\lambda})-\mathcal{E}_{\mathbf{z}}(w_{\mathbf{z},\lambda})=O\big(\lambda^{\frac{\alpha-1}{2}}n^{-\frac12}+1/(n\beta)\big)$, which improves the bound $O(1/\sqrt{n\lambda})$ based on the U-process approach [Cao et al., 2016] and the bound $O(1/(\lambda\sqrt{n}))$ based on the stability approach, since we often have $1/n\le\beta\le\lambda\le1$ (for example, Theorem 2 suggests $\lambda\asymp n^{-\frac{1}{\alpha+1}}$).

4 Applications

We now apply Theorem 2 to distance metric learning and ranking. We always assume $g:\mathbb{R}\to\mathbb{R}_+$ is convex.

4.1 Regularized Distance Metric Learning

As clarified in Example 1, an established regularization framework for learning the Mahalanobis distance metric can be reformulated as a regularized pairwise learning problem (1) with specific instantiations of $W$, $\tau$, $\phi$ and $\ell$. Interesting regularizers in metric learning include the $\ell_1$-norm regularizer favoring element-wise sparsity [Cao et al., 2016], the mixed (2,1)-norm regularizer encouraging column-wise sparsity [Ying et al., 2009] and the trace-norm regularizer encouraging low rank [Cao et al., 2016]. For any $w\in\mathbb{R}^{d\times d}$, we define the Schatten-$p$ norm $\|w\|_{S(p)}$ as the $\ell_p$-norm of $\sigma(w)$, $p\ge1$, where $\sigma(w)$ is the vector of all singular values of $w$ and the $\ell_p$-norm of $a=(a_1,\ldots,a_d)\in\mathbb{R}^d$ is defined as $\|a\|_p=\big(\sum_{j=1}^d|a_j|^p\big)^{1/p}$. For any $w=(w_1,w_2,\ldots,w_m)\in\mathbb{R}^{n\times m}$, the mixed $(p,q)$-norm of $w$ is $\|w\|_{(p,q)}:=\big\|\big(\|w_1\|_p,\ldots,\|w_m\|_p\big)\big\|_q$, $p,q\ge1$.
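The mixed $(p,q)$-norm and Schatten-$p$ norm just defined are straightforward to compute with NumPy. The sketch below is our own illustration; `mixed_norm` and `schatten_norm` are hypothetical names.

```python
import numpy as np

def mixed_norm(w, p, q):
    """Mixed (p, q)-norm: the l_q-norm of the vector (||w_1||_p, ..., ||w_m||_p)
    of column-wise l_p-norms of w."""
    col_norms = np.linalg.norm(w, ord=p, axis=0)
    return float(np.linalg.norm(col_norms, ord=q))

def schatten_norm(w, p):
    """Schatten-p norm: the l_p-norm of the vector of singular values of w."""
    return float(np.linalg.norm(np.linalg.svd(w, compute_uv=False), ord=p))
```

As a sanity check, the mixed (2,2)-norm and the Schatten-2 norm both coincide with the Frobenius norm, and the Schatten-1 norm of a matrix is the trace norm mentioned above.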
It is known that the dual norm of $\|\cdot\|_{(p,q)}$ is $\|\cdot\|_{(p',q')}$, and the dual norm of $\|\cdot\|_{S(p)}$ is $\|\cdot\|_{S(p')}$, where $p'$ is the conjugate exponent of $p$ satisfying $p^{-1}+(p')^{-1}=1$. For brevity, we introduce the notations $X_{(p,q)}=\sup_{x,\tilde{x}\in X}\|(x-\tilde{x})(x-\tilde{x})^\top\|_{(p,q)}$, $X_{S(p)}=\sup_{x,\tilde{x}\in X}\|(x-\tilde{x})(x-\tilde{x})^\top\|_{S(p)}$ and $X_p=\sup_{x,\tilde{x}\in X}\|x-\tilde{x}\|_p$. Denote $\tilde{d}=\frac{\log d}{\log d-1}$. We now apply Theorem 2 to study regularized metric learning with regularizers involving various norms, including the mixed $(p,q)$-norm, the mixed $(p,\tilde{d})$-norm, the $S(\tilde{d})$-norm and the $S(p)$-norm. Since $\|\cdot\|_{(p,1)}^2$ and $\|\cdot\|_{S(1)}^2$ are not strongly convex, we use the strongly convex regularizers $\|\cdot\|_{(p,\tilde{d})}^2$ and $\|\cdot\|_{S(\tilde{d})}^2$ [Kakade et al., 2012] to mimic the effects of the mixed $(p,1)$-norm regularizer and the Schatten-1 norm regularizer, respectively. We hide the linear dependency on $\log\delta_0^{-1}$ in the big-$O$ notation.

Corollary 3. Consider regularized metric learning (1) with $W=S^{d\times d}$, $\tau(y,\tilde{y})=y\tilde{y}$, $\phi(x,\tilde{x})=(x-\tilde{x})(x-\tilde{x})^\top$ and $\ell(a,b)=g(a(1-b))$. Let $p,q\in(1,2]$ and $0<\delta_0<1/e$. Assume (4) and Assumption 1 hold. If we choose $\lambda\asymp X^{\frac{2}{\alpha+1}}n^{-\frac{1}{\alpha+1}}$, where $X$ is the complexity quantity appearing in each case below (e.g., $X=X_{p'}X_{q'}$ in case (7a)), then with probability $1-2\delta_0$ we have
$$\mathcal{E}(w_{\mathbf{z},\lambda})-\mathcal{E}(h_\rho)+\lambda|||w_{\mathbf{z},\lambda}|||^2=\begin{cases}O\big((X_{p'}X_{q'})^{\frac{2\alpha}{\alpha+1}}n^{-\frac{\alpha}{\alpha+1}}\big),&\text{if }|||\cdot|||=\|\cdot\|_{(p,q)},\quad(7a)\\O\big((X_{p'}^2X_\infty^2\log d)^{\frac{\alpha}{\alpha+1}}n^{-\frac{\alpha}{\alpha+1}}\big),&\text{if }|||\cdot|||=\|\cdot\|_{(p,\tilde{d})},\quad(7b)\\O\big(X_{S(p')}^{\frac{2\alpha}{\alpha+1}}n^{-\frac{\alpha}{\alpha+1}}\big),&\text{if }|||\cdot|||=\|\cdot\|_{S(p)},\quad(7c)\\O\big((X_{S(\infty)}^2\log d)^{\frac{\alpha}{\alpha+1}}n^{-\frac{\alpha}{\alpha+1}}\big),&\text{if }|||\cdot|||=\|\cdot\|_{S(\tilde{d})}.\quad(7d)\end{cases}$$

Proof. It was shown in [Kakade et al., 2012] that $r(M):=\lambda\|M\|_{(p,q)}^2$ is $\frac{2\lambda(p-1)(q-1)}{p+q-2}$-strongly convex w.r.t. the norm $\|\cdot\|_{(p,q)}$. Therefore, Theorem 2 shows that, with probability $1-2\delta_0$, the term $\mathcal{E}(w_{\mathbf{z},\lambda})-\mathcal{E}(h_\rho)+\lambda\|w_{\mathbf{z},\lambda}\|_{(p,q)}^2$ can be upper bounded by
$$c_\alpha\lambda^\alpha+\frac{128L^2X_{(p',q')}^2(p+q-2)}{\lambda n(p-1)(q-1)}\log\frac{2}{\delta_0}.$$
Eq. (7a) then follows by taking $\lambda\asymp(X_{p'}X_{q'})^{\frac{2}{\alpha+1}}n^{-\frac{1}{\alpha+1}}$ and noticing $X_{(p',q')}=X_{p'}X_{q'}$. Eqs. (7b), (7c) and (7d) can be derived in a similar way; we omit the deduction for brevity.

Remark 3 (Comparison with the state of the art). We now compare our learning rates with the state-of-the-art bounds for regularized metric learning.
We complement the existing estimation error bounds with approximation error bounds under Assumption 1. Jin et al. [2009] studied regularized distance metric learning (1) with $\tau(y,\tilde{y})=y\tilde{y}$, $\phi(x,\tilde{x})=(x-\tilde{x})(x-\tilde{x})^\top$ and $|||\cdot|||$ the Frobenius norm, and derived the error bound
$$\mathcal{E}(w_{\mathbf{z},\lambda})-\mathcal{E}(h_\rho)+\lambda|||w_{\mathbf{z},\lambda}|||^2=O\Big(\frac{X_2^4}{\lambda}\cdot\frac{1+\sqrt{\log\delta_0^{-1}}}{\sqrt{n}}+\lambda^\alpha\Big).$$
Taking $\lambda\asymp X_2^{\frac{4}{\alpha+1}}n^{-\frac{1}{2(\alpha+1)}}$, this bound becomes $O\big(X_2^{\frac{4\alpha}{\alpha+1}}n^{-\frac{\alpha}{2(\alpha+1)}}\big)$, which is worse than the bound $O\big((X_2^4n^{-1})^{\frac{\alpha}{\alpha+1}}\big)$ in (7a) with $p=q=2$. For regularized metric learning (1) with a general regularizer, with probability $1-\delta_0$ the term $\mathcal{E}(w_{\mathbf{z},\lambda})-\mathcal{E}(h_\rho)+\lambda|||w_{\mathbf{z},\lambda}|||^2$ was upper bounded in [Cao et al., 2016] by
$$c_\alpha\lambda^\alpha+O\Big(\lambda^{-\frac12}\,\mathbb{E}_{\mathbf{z},\sigma}\sup_{|||w|||\le1}\frac{1}{n}\sum_{i=1}^{\lfloor n/2\rfloor}\sigma_i\big\langle w,x_ix_{\lfloor n/2\rfloor+i}^\top\big\rangle\Big),\qquad(8)$$
where $\{\sigma_i\}_{i\in\mathbb{N}_n}$ is a sequence of independent Rademacher variables taking the values $+1$ and $-1$ with equal probability. The expectation term in (8) is closely related to the Rademacher complexity for metric learning in [Cao et al., 2016] and can be estimated by the seminal complexity bounds for linear prediction with specific instantiations of strongly convex functions [Kakade et al., 2012]. In Table 1, we list the best generalization error bounds derived by choosing the regularization parameter $\lambda$ to minimize (8). It is clear from Table 1 that our generalization analysis yields bounds of order $O(n^{-\frac{\alpha}{\alpha+1}})$, which significantly improve the bound $O(n^{-\frac{\alpha}{2\alpha+1}})$ in [Cao et al., 2016].

4.2 Regularized Ranking

As shown in Example 2, a regularized ranking problem is a specific regularized pairwise learning problem with particular instantiations of $\tau$ and $\ell$. In this subsection, we assume both $\|\cdot\|$ and $|||\cdot|||$ are the norm induced by the inner product $\langle\cdot,\cdot\rangle$ in $W$, i.e., $\|w\|=\sqrt{\langle w,w\rangle}$ for any $w\in W$. The following corollary follows directly from Theorem 2 by noting the $(2\lambda)$-strong convexity of $\lambda\|\cdot\|^2$ w.r.t. $\|\cdot\|$. We omit the proof due to space constraints.

Corollary 4.
Consider the regularized ranking problem (1) with $\ell(a,b)=g(ab)$ and either $\tau(y,\tilde{y})=\mathrm{sign}(y-\tilde{y})$ or $\tau(y,\tilde{y})=y-\tilde{y}$. Assume (4) and Assumption 1 hold. If we choose $\lambda\asymp X_*^{\frac{2}{\alpha+1}}n^{-\frac{1}{\alpha+1}}$, then with probability $1-2\delta_0$ we have
$$\mathcal{E}(w_{\mathbf{z},\lambda})-\mathcal{E}(h_\rho)+\lambda|||w_{\mathbf{z},\lambda}|||^2=O\big((X_*^2n^{-1})^{\frac{\alpha}{\alpha+1}}\big).$$

Remark 4 (Comparison with the state of the art). Fast learning rates were established for unregularized ranking over a compact hypothesis space satisfying a capacity assumption, with loss functions satisfying an assumption on the modulus of convexity [Rejchel, 2012]. These error bounds are stated only for the estimation error (ignoring the approximation error), and can be as slow as $O(n^{-\frac12})$ if the capacity assumption is removed. Agarwal and Niyogi [2009] studied the generalization performance of regularized ranking algorithms based on a stability approach and derived the bound
$$\mathcal{E}(w_{\mathbf{z},\lambda})-\mathcal{E}(h_\rho)+\lambda|||w_{\mathbf{z},\lambda}|||^2=O\Big(\frac{1}{\lambda\sqrt{n}}+\lambda^\alpha\Big).$$
If we take $\lambda\asymp n^{-\frac{1}{2(\alpha+1)}}$, this bound becomes $O\big(n^{-\frac{\alpha}{2(\alpha+1)}}\big)$, which is improved to $O(n^{-\frac{\alpha}{\alpha+1}})$ in Corollary 4 without capacity assumptions on the hypothesis spaces. Zhao et al. [2017] derived generalization bounds $O(n^{\epsilon-\frac{\alpha}{1+\alpha}})$ for the specific regularized ranking problem (1) with $\tau(y,\tilde{y})=y-\tilde{y}$ and $\ell(a,b)=(a-b)^2$, where $\epsilon$ is a positive constant. Therefore, even for this specific regularized ranking problem, our general error bound for regularized pairwise learning (1) is able to imply a tighter bound than existing bounds derived

| $|||\cdot|||$ | $\|\cdot\|_{(p,q)}$, $p,q\in(1,2]$ | $\|\cdot\|_{(p,\tilde{d})}$, $p\in(1,2]$ | $\|\cdot\|_{S(p)}$, $p\in(1,2]$ | $\|\cdot\|_{S(\tilde{d})}$ |
|---|---|---|---|---|
| Ours | $(X_{p'}^2X_{q'}^2n^{-1})^{\frac{\alpha}{\alpha+1}}$ | $(X_{p'}^2X_\infty^2(\log d)n^{-1})^{\frac{\alpha}{\alpha+1}}$ | $(X_{S(p')}^2n^{-1})^{\frac{\alpha}{\alpha+1}}$ | $(X_{S(\infty)}^2(\log d)n^{-1})^{\frac{\alpha}{\alpha+1}}$ |
| [Cao et al., 2016] | $(X_{p'}^2X_{q'}^2n^{-1})^{\frac{\alpha}{2\alpha+1}}$ | $(X_{p'}^2X_\infty^2(\log d)n^{-1})^{\frac{\alpha}{2\alpha+1}}$ | $(X_{S(p')}^2n^{-1})^{\frac{\alpha}{2\alpha+1}}$ | $(X_{S(\infty)}^2(\log d)n^{-1})^{\frac{\alpha}{2\alpha+1}}$ |

Table 1: Comparison of generalization bounds for regularized metric learning.
The first row shows different instantiations of the norm $|||\cdot|||$ in (1), for which the generalization bounds established in this paper and in [Cao et al., 2016] are presented in the second and third rows.

exclusively for a specific regularized ranking algorithm using the least squares loss. It should be mentioned that the bound $O(n^{\epsilon-\frac{\alpha}{1+\alpha}})$ in [Zhao et al., 2017] can be improved if a further capacity assumption on the hypothesis space is imposed.

5 Proof of the Main Theorem

In this section, we present the proof of the main theorem (Theorem 1). The key idea in our deduction is to partition the hypothesis space into a sequence of subsets according to the value of the ERGE, using the idea of peeling [Bartlett et al., 2005]. The strong convexity of the regularized objective then implies that the infinity norm of the functions in each sub-class can be bounded in terms of the maximal ERGE associated with that sub-class, which allows us to conduct a localization analysis controlling the uniform deviation between the ERGE and its empirical counterpart in each sub-class by the local ERGE. For any $w\in W$, define the excess loss function
$$h_w^\lambda(z,\tilde{z}):=\ell\big(\tau(y,\tilde{y}),\langle w,\phi(x,\tilde{x})\rangle\big)-\ell\big(\tau(y,\tilde{y}),\langle w_\lambda,\phi(x,\tilde{x})\rangle\big).$$
Let $\rho_0>0$ and $0<\delta_0<1/e$ be two fixed numbers. We construct two geometric sequences $\rho_k=2^k\rho_0$ and $\delta_k=2^{-k}\delta_0$, $k=1,2,\ldots$. For brevity, we set $\rho_{-1}=0$. We group those $w$ whose ERGE belongs to $(\rho_{k-1},\rho_k]$ into the class
$$W_k=\{w\in W:\rho_{k-1}<F_\lambda(w)-F_\lambda(w_\lambda)\le\rho_k\},\quad k\in\mathbb{N}\cup\{0\}.$$
Define $H_k=\{h_w^\lambda:w\in W_k\}$, $k\in\mathbb{N}\cup\{0\}$. We denote by $\lfloor x\rfloor$ the largest natural number not larger than $x$. We begin our deduction with the following lemma controlling the infinity norm of functions in $H_k$.

Lemma 5. If (4) holds and $F_\lambda$ is $\beta$-strongly convex, then $\sup_{h_w^\lambda\in H_k}\|h_w^\lambda\|_\infty\le2LX_*\sqrt{\rho_k/\beta}$.

Proof.
For any $h_w^\lambda\in H_k$, $\|h_w^\lambda\|_\infty$ can be upper bounded by
$$\sup_{z,\tilde{z}}\big|\ell(\tau(y,\tilde{y}),\langle w,\phi(x,\tilde{x})\rangle)-\ell(\tau(y,\tilde{y}),\langle w_\lambda,\phi(x,\tilde{x})\rangle)\big|\le\sup_{z,\tilde{z}}L\big|\langle w-w_\lambda,\phi(x,\tilde{x})\rangle\big|\le LX_*\|w-w_\lambda\|,\qquad(9)$$
where the first inequality follows from the Lipschitz property of the loss function $\ell$ and the second inequality is due to the definition of the dual norm together with the definition of $X_*$. According to the definition of $w_\lambda$ and the strong convexity of $F_\lambda$, the following inequality holds for all $w\in W$:
$$F_\lambda(w_\lambda)\le F_\lambda\Big(\frac{w+w_\lambda}{2}\Big)\le\frac12F_\lambda(w)+\frac12F_\lambda(w_\lambda)-\frac{\beta}{8}\|w-w_\lambda\|^2,$$
which, together with the assumption $h_w^\lambda\in H_k$, implies
$$\|w-w_\lambda\|\le\sqrt{\frac{4}{\beta}\big(F_\lambda(w)-F_\lambda(w_\lambda)\big)}\le\sqrt{\frac{4\rho_k}{\beta}}.$$
Putting this inequality into (9) and using the definition of $h_w^\lambda$ shows $\|h_w^\lambda\|_\infty\le2LX_*\sqrt{\rho_k/\beta}$.

We prove Theorem 1 in four steps. For any $f$ defined over $Z\times Z$, we use $\hat{\mathbb{E}}_{\mathbf{z}}f=\frac{1}{n(n-1)}\sum_{i,j\in\mathbb{N}_n,i\neq j}f(z_i,z_j)$ to denote the empirical average of $f$. In the first and second steps, we conduct a localization analysis in each sub-class $H_k$ and obtain a bound, holding with probability $1-\delta_k$, on $\sup_{h_w^\lambda\in H_k}[\mathbb{E}[h_w^\lambda]-\hat{\mathbb{E}}_{\mathbf{z}}[h_w^\lambda]]$ in terms of $\rho_k$. In the third step, we show that both $\delta_k^{-1}$ and $\rho_k$ can be bounded by the ERGE, which, together with union bounds, implies a probabilistic inequality bounding the ERGE in terms of a square root of the ERGE. In the last step, we solve this probabilistic inequality to obtain the stated bound. The intuitive observation is that both the reciprocal of the confidence and the constant in the bounded difference property used in the application of McDiarmid's inequality in each sub-class can be controlled by the ERGE associated with that sub-class, which allows us to bound the ERGE in terms of a square root of the ERGE.

Proof of Theorem 1. Our proof consists of four steps.

Step 1. We first apply McDiarmid's inequality (Theorem D.3 in [Mohri et al., 2012]) to control the deviation of $\sup_{h_w^\lambda\in H_k}[\mathbb{E}[h_w^\lambda]-\hat{\mathbb{E}}_{\mathbf{z}}[h_w^\lambda]]$ from its expectation. To this end, we need to show that functions in $H_k$ satisfy the bounded difference property. Indeed, for any $\mathbf{z}=(z_1,\ldots,z_{t-1},z_t,z_{t+1},\ldots,z_n)$ and $\bar{\mathbf{z}}=(z_1,\ldots,z_{t-1},\bar{z}_t,z_{t+1},\ldots$
$,z_n)$ differing only in the $t$-th coordinate, we have
$$\sup_{h_w^\lambda\in H_k}\big(\mathbb{E}[h_w^\lambda]-\hat{\mathbb{E}}_{\mathbf{z}}[h_w^\lambda]\big)-\sup_{h_w^\lambda\in H_k}\big(\mathbb{E}[h_w^\lambda]-\hat{\mathbb{E}}_{\bar{\mathbf{z}}}[h_w^\lambda]\big)\le\sup_{h_w^\lambda\in H_k}\big|\hat{\mathbb{E}}_{\mathbf{z}}[h_w^\lambda]-\hat{\mathbb{E}}_{\bar{\mathbf{z}}}[h_w^\lambda]\big|\le\frac{2}{n(n-1)}\sup_{h_w^\lambda\in H_k}\sum_{j\in\mathbb{N}_n,j\neq t}\big(|h_w^\lambda(z_t,z_j)|+|h_w^\lambda(\bar{z}_t,z_j)|\big)\le\frac{4}{n}\sup_{h_w^\lambda\in H_k}\|h_w^\lambda\|_\infty\le\frac{8LX_*}{n}\sqrt{\frac{\rho_k}{\beta}},$$
where the last inequality follows from Lemma 5. Applying McDiarmid's inequality (Theorem D.3 in [Mohri et al., 2012]) with increments bounded by $\frac{8LX_*}{n}\sqrt{\frac{\rho_k}{\beta}}$ implies that with probability at least $1-\delta_k$ the term $\sup_{h_w^\lambda\in H_k}[\mathbb{E}[h_w^\lambda]-\hat{\mathbb{E}}_{\mathbf{z}}[h_w^\lambda]]$ can be upper bounded by
$$\mathbb{E}\sup_{h_w^\lambda\in H_k}\big[\mathbb{E}[h_w^\lambda]-\hat{\mathbb{E}}_{\mathbf{z}}[h_w^\lambda]\big]+4LX_*\sqrt{\frac{2\rho_k\log(1/\delta_k)}{n\beta}}.\qquad(10)$$

Step 2. We now use techniques from U-processes to bound the expectation in (10). Specifically, Lemma A.1 in [Clémençon et al., 2008] with $q_w(z_i,z_j)=\mathbb{E}[h_w^\lambda]-h_w^\lambda(z_i,z_j)$ and the index set $W$ allows us to derive
$$\mathbb{E}_{\mathbf{z}}\sup_{h_w^\lambda\in H_k}\big[\mathbb{E}[h_w^\lambda]-\hat{\mathbb{E}}_{\mathbf{z}}[h_w^\lambda]\big]\le\mathbb{E}_{\mathbf{z}}\sup_{h_w^\lambda\in H_k}\frac{2}{n}\sum_{i=1}^{\lfloor n/2\rfloor}\big(\mathbb{E}[h_w^\lambda]-h_w^\lambda(z_i,z_{\lfloor n/2\rfloor+i})\big).$$
Let $\tilde{\mathbf{z}}=\{\tilde{z}_1,\ldots,\tilde{z}_n\}$ be i.i.d. samples independent of $\mathbf{z}$ and let $\{\sigma_i\}_{i\in\mathbb{N}_n}$ be a sequence of independent Rademacher variables. According to Jensen's inequality and a standard symmetrization technique, the term $\mathbb{E}_{\mathbf{z}}\sup_{h_w^\lambda\in H_k}[\mathbb{E}[h_w^\lambda]-\hat{\mathbb{E}}_{\mathbf{z}}[h_w^\lambda]]$ can be upper bounded by
$$\mathbb{E}_{\mathbf{z},\tilde{\mathbf{z}}}\sup_{h_w^\lambda\in H_k}\frac{2}{n}\sum_{i=1}^{\lfloor n/2\rfloor}\big[h_w^\lambda(\tilde{z}_i,\tilde{z}_{\lfloor n/2\rfloor+i})-h_w^\lambda(z_i,z_{\lfloor n/2\rfloor+i})\big]=\mathbb{E}_{\mathbf{z},\tilde{\mathbf{z}},\sigma}\sup_{h_w^\lambda\in H_k}\frac{2}{n}\sum_{i=1}^{\lfloor n/2\rfloor}\sigma_i\big[h_w^\lambda(\tilde{z}_i,\tilde{z}_{\lfloor n/2\rfloor+i})-h_w^\lambda(z_i,z_{\lfloor n/2\rfloor+i})\big]\le\frac{4}{n}\mathbb{E}_{\mathbf{z},\sigma}\sup_{w\in W_k}\sum_{i=1}^{\lfloor n/2\rfloor}\sigma_i\ell\big(\tau(y_i,y_{\lfloor n/2\rfloor+i}),\langle w,\phi(x_i,x_{\lfloor n/2\rfloor+i})\rangle\big)\le\frac{4L}{n}\mathbb{E}_{\mathbf{z},\sigma}\sup_{w\in W_k}\sum_{i=1}^{\lfloor n/2\rfloor}\sigma_i\langle w,\phi(x_i,x_{\lfloor n/2\rfloor+i})\rangle,\qquad(11)$$
where the second inequality uses the fact that $w_\lambda$ is a fixed element and the last inequality follows from a contraction property of Rademacher averages (Lemma 4.2 in [Mohri et al., 2012] with the Lipschitz composition operators $t\mapsto\ell(\tau(y_i,y_{\lfloor n/2\rfloor+i}),t)$, $i=1,\ldots,\lfloor n/2\rfloor$). The function $\bar{F}_\lambda(w):=F_\lambda(w)-F_\lambda(w_\lambda)$ is $\beta$-strongly convex. Note that any $w\in W_k$ satisfies $\bar{F}_\lambda(w)\le\rho_k$. Moreover, the definition of $w_\lambda$ implies $\inf_{w\in W}\bar{F}_\lambda(w)=0$. Therefore, the conditions of Theorem 7 in [Kakade et al., 2012] are satisfied and we can apply it here to show
$$\mathbb{E}_{\mathbf{z},\sigma}\sup_{w\in W_k}\sum_{i=1}^{\lfloor n/2\rfloor}\sigma_i\langle w,\phi(x_i,x_{\lfloor n/2\rfloor+i})\rangle\le X_*\sqrt{\frac{n\rho_k}{2\beta}}.$$
Plugging it back into Eq.
(11) gives
$$\mathbb{E}_{\mathbf{z}}\sup_{h_w^\lambda\in H_k}\big[\mathbb{E}[h_w^\lambda]-\hat{\mathbb{E}}_{\mathbf{z}}[h_w^\lambda]\big]\le2LX_*\sqrt{\frac{2\rho_k}{n\beta}}.\qquad(12)$$

Step 3. We now establish probabilistic bounds for $\sup_{w\in W}[\mathbb{E}[h_w^\lambda]-\hat{\mathbb{E}}_{\mathbf{z}}[h_w^\lambda]]$. For any $h_w^\lambda\in H_k$ with $k\ge1$, we have $\frac{\rho_k}{2}=\rho_{k-1}<F_\lambda(w)-F_\lambda(w_\lambda)$. Therefore,
$$\rho_k\le\max\{\rho_0,2(F_\lambda(w)-F_\lambda(w_\lambda))\},\quad\forall w\in W_k,\;k\in\mathbb{N}\cup\{0\}.$$
Moreover, it can be directly checked that
$$\frac{1}{\delta_k}=\frac{\rho_02^k}{\delta_0\rho_0}\le\frac{\max\{\rho_0,2(F_\lambda(w)-F_\lambda(w_\lambda))\}}{\delta_0\rho_0}.\qquad(13)$$
According to the definitions of $F_\lambda$ and $F_{\mathbf{z},\lambda}$ given in (1) and (2) together with the definition of $h_w^\lambda$, we derive
$$[F_\lambda(w)-F_\lambda(w_\lambda)]-[F_{\mathbf{z},\lambda}(w)-F_{\mathbf{z},\lambda}(w_\lambda)]=\mathbb{E}[h_w^\lambda]-\hat{\mathbb{E}}_{\mathbf{z}}[h_w^\lambda].$$
Combining the above identity, (10), (12) and (13), with probability $1-\delta_k$ the following inequality holds uniformly for all $h_w^\lambda\in H_k$:
$$F_\lambda(w)-F_\lambda(w_\lambda)\le F_{\mathbf{z},\lambda}(w)-F_{\mathbf{z},\lambda}(w_\lambda)+\frac{4LX_*\Big(1+\sqrt{\log\delta_0^{-1}+\log\max\big(1,2\rho_0^{-1}(F_\lambda(w)-F_\lambda(w_\lambda))\big)}\Big)}{\sqrt{n\beta}}\sqrt{\max\big(\rho_0,2(F_\lambda(w)-F_\lambda(w_\lambda))\big)}.$$
The stated inequality (5) follows directly by applying union bounds over $H_0,H_1,\ldots$ with confidences $\delta_0,\delta_1,\ldots$ (notice that $\sum_{k=0}^\infty\delta_k=2\delta_0$).

Step 4. Finally, we establish probabilistic bounds for the estimator $w_{\mathbf{z},\lambda}$. According to the definition of $w_{\mathbf{z},\lambda}$, we have $F_{\mathbf{z},\lambda}(w_{\mathbf{z},\lambda})\le F_{\mathbf{z},\lambda}(w_\lambda)$, and therefore (5) implies (note $(a+b)^2\le2(a^2+b^2)$ for $a,b\in\mathbb{R}$)
$$F_\lambda(w_{\mathbf{z},\lambda})-F_\lambda(w_\lambda)\le4LX_*\sqrt{\frac{2\max\big(\rho_0,2(F_\lambda(w_{\mathbf{z},\lambda})-F_\lambda(w_\lambda))\big)\Big(1+\log\delta_0^{-1}+\log\max\big(1,2\rho_0^{-1}(F_\lambda(w_{\mathbf{z},\lambda})-F_\lambda(w_\lambda))\big)\Big)}{n\beta}}.$$
Take the assignment $\rho_0=\frac{256L^2X_*^2}{n\beta}$. If $F_\lambda(w_{\mathbf{z},\lambda})-F_\lambda(w_\lambda)\ge\frac{\rho_0}{2}$, the term $F_\lambda(w_{\mathbf{z},\lambda})-F_\lambda(w_\lambda)$ can be further upper bounded by
$$\frac{128L^2X_*^2}{n\beta}\Big(1+\log\delta_0^{-1}+\log\frac{2(F_\lambda(w_{\mathbf{z},\lambda})-F_\lambda(w_\lambda))}{\rho_0}\Big).$$
For this assignment of $\rho_0$ and any $0<a<1$, the term $\frac{n\beta(F_\lambda(w_{\mathbf{z},\lambda})-F_\lambda(w_\lambda))}{128L^2X_*^2}$ can therefore be upper bounded by
$$1+\log\delta_0^{-1}+\log\frac{n\beta(F_\lambda(w_{\mathbf{z},\lambda})-F_\lambda(w_\lambda))}{128L^2X_*^2}\le\log\delta_0^{-1}+\frac{an\beta(F_\lambda(w_{\mathbf{z},\lambda})-F_\lambda(w_\lambda))}{128L^2X_*^2}+\log\frac{1}{a},\qquad(14)$$
where the last inequality follows from $\log u\le ub+\log\frac{1}{b}-1$ for all $u,b>0$. Solving the linear inequality (14) directly yields the stated bound (6) with probability at least $1-2\delta_0$. If $F_\lambda(w_{\mathbf{z},\lambda})-F_\lambda(w_\lambda)\le\frac{\rho_0}{2}$, it is clear that (6) holds.

6 Conclusions

We develop a unified generalization bound for regularized pairwise learning and apply it to improve existing results.
It would be interesting to study pairwise learning by exploiting the smoothness of loss functions [Zhang et al., 2017] and to consider distributed pairwise learning algorithms [Lin and Zhou, 2018].

Acknowledgments

This work was supported in part by the National Key Research and Development Program of China (Grant No. 2017YFB1003102), the National Natural Science Foundation of China (Grant Nos. 61672478, 61502342), and the Science and Technology Innovation Committee Foundation of Shenzhen (Grant No. ZDSYS201703031748284).

References

[Agarwal and Niyogi, 2009] Shivani Agarwal and Partha Niyogi. Generalization bounds for ranking algorithms via algorithmic stability. Journal of Machine Learning Research, 10:441–474, 2009.
[Bartlett et al., 2005] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005.
[Bellet and Habrard, 2015] Aurélien Bellet and Amaury Habrard. Robustness and generalization for metric learning. Neurocomputing, 151:259–267, 2015.
[Cao et al., 2016] Qiong Cao, Zheng-Chu Guo, and Yiming Ying. Generalization bounds for metric and similarity learning. Machine Learning, 102(1):115–132, 2016.
[Christmann and Zhou, 2016] Andreas Christmann and Ding-Xuan Zhou. On the robustness of regularized pairwise learning methods based on kernels. Journal of Complexity, 37:1–33, 2016.
[Clémençon et al., 2008] Stéphan Clémençon, Gábor Lugosi, and Nicolas Vayatis. Ranking and empirical minimization of U-statistics. Annals of Statistics, 36(2):844–874, 2008.
[Cortes and Mohri, 2004] Corinna Cortes and Mehryar Mohri. AUC optimization vs. error rate minimization. In NIPS, pages 313–320, 2004.
[Cucker and Zhou, 2007] Felipe Cucker and Ding-Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, Cambridge, 2007.
[Gao and Zhou, 2015] Wei Gao and Zhi-Hua Zhou.
On the consistency of AUC pairwise optimization. In IJCAI, pages 939–945, 2015.
[Gao et al., 2013] Wei Gao, Rong Jin, Shenghuo Zhu, and Zhi-Hua Zhou. One-pass AUC optimization. In ICML, pages 906–914, 2013.
[Guo et al., 2016] Zheng-Chu Guo, Yiming Ying, and Ding-Xuan Zhou. Online regularized learning with pairwise loss functions. Advances in Computational Mathematics, pages 1–24, 2016.
[Hu et al., 2015] Ting Hu, Jun Fan, Qiang Wu, and Ding-Xuan Zhou. Regularization schemes for minimum error entropy principle. Analysis and Applications, 13(04):437–455, 2015.
[Jin et al., 2009] Rong Jin, Shijun Wang, and Yang Zhou. Regularized distance metric learning: theory and algorithm. In NIPS, pages 862–870, 2009.
[Kakade et al., 2012] Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13:1865–1890, 2012.
[Kar et al., 2013] Purushottam Kar, Bharath Sriperumbudur, Prateek Jain, and Harish Karnick. On the generalization ability of online learning algorithms for pairwise loss functions. In ICML, pages 441–449, 2013.
[Kriukova et al., 2016] Galyna Kriukova, Oleksandra Panasiuk, Sergei V. Pereverzyev, and Pavlo Tkachenko. A linear functional strategy for regularized ranking. Neural Networks, 73:26–35, 2016.
[Lei and Ying, 2016] Yunwen Lei and Yiming Ying. Generalization analysis of multi-modal metric learning. Analysis and Applications, 14(04):503–521, 2016.
[Lin and Zhou, 2018] Shao-Bo Lin and Ding-Xuan Zhou. Distributed kernel-based gradient descent algorithms. Constructive Approximation, 47(2):249–276, 2018.
[Lin et al., 2017] Junhong Lin, Yunwen Lei, Bo Zhang, and Ding-Xuan Zhou. Online pairwise learning algorithms with convex loss functions. Information Sciences, 406:57–70, 2017.
[Mohri et al., 2012] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
[Mukherjee and Zhou, 2006] Sayan Mukherjee and Ding-Xuan Zhou.
Learning coordinate covariances via gradients. Journal of Machine Learning Research, 7:519–549, 2006.
[Rejchel, 2012] Wojciech Rejchel. On ranking and generalization bounds. Journal of Machine Learning Research, 13(1):1373–1392, 2012.
[Smale and Zhou, 2003] Steve Smale and Ding-Xuan Zhou. Estimating the approximation error in learning theory. Analysis and Applications, 1(01):17–41, 2003.
[Sridharan et al., 2009] Karthik Sridharan, Shai Shalev-Shwartz, and Nathan Srebro. Fast rates for regularized objectives. In NIPS, pages 1545–1552, 2009.
[Wang et al., 2012] Yuyang Wang, Roni Khardon, Dmitry Pechyony, and Rosie Jones. Generalization bounds for online learning algorithms with pairwise loss functions. In COLT, pages 1–22, 2012.
[Xing et al., 2003] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. Distance metric learning with application to clustering with side-information. In NIPS, pages 505–512, 2003.
[Ying and Zhou, 2016] Yiming Ying and Ding-Xuan Zhou. Online pairwise learning algorithms. Neural Computation, 28(4):743–777, 2016.
[Ying et al., 2009] Yiming Ying, Kaizhu Huang, and Colin Campbell. Sparse metric learning via smooth optimization. In NIPS, pages 2214–2222, 2009.
[Zhang et al., 2017] Lijun Zhang, Tianbao Yang, and Rong Jin. Empirical risk minimization for stochastic convex optimization: O(1/n)- and O(1/n²)-type of risk bounds. In COLT, pages 1954–1979, 2017.
[Zhao et al., 2011] Peilin Zhao, Rong Jin, Tianbao Yang, and Steven C. Hoi. Online AUC maximization. In ICML, pages 233–240, 2011.
[Zhao et al., 2017] Yulong Zhao, Jun Fan, and Lei Shi. Learning rates for regularized least squares ranking algorithm. Analysis and Applications, 15(06):815–836, 2017.