Sharper Generalization Bounds for Pairwise Learning

Yunwen Lei¹,² Antoine Ledent² Marius Kloft²
¹School of Computer Science, University of Birmingham, Birmingham B15 2TT, United Kingdom
²Department of Computer Science, TU Kaiserslautern, Kaiserslautern 67653, Germany
y.lei@bham.ac.uk ledent@cs.uni-kl.de kloft@cs.uni-kl.de

Abstract

Pairwise learning refers to learning tasks with loss functions depending on a pair of training examples, which includes ranking and metric learning as specific examples. Recently, there has been an increasing amount of attention on the generalization analysis of pairwise learning to understand its practical behavior. However, the existing stability analysis provides suboptimal high-probability generalization bounds. In this paper, we provide a refined stability analysis by developing generalization bounds which can be $\sqrt{n}$-times faster than the existing results, where $n$ is the sample size. This implies excess risk bounds of the order $O(n^{-1/2})$ (up to a logarithmic factor) for both regularized risk minimization and stochastic gradient descent. We also introduce a new on-average stability measure to develop optimistic bounds in a low noise setting. We apply our results to ranking and metric learning, and clearly show the advantage of our generalization bounds over the existing analysis.

1 Introduction

In modern machine learning, we frequently encounter problems where the performance of a model depends on pairs of training instances. As examples, consider the following. In ranking problems, our aim is to learn a function that can predict the ordering of examples [13, 44]. In metric learning, which plays a key role in clustering problems [9, 28], we wish to learn an adequate distance metric between instances. In AUC maximization, which is deployed in class-imbalanced learning problems, we aim to find a classifier that maximizes the probability of scoring a positive example higher than a negative one [14]. Further examples include learning with minimum error entropy loss functions [27], multiple kernel learning [31], preference learning [22], and gradient learning [41]. All these so-called pairwise learning problems involve a loss function based on pairs of training examples. This is in sharp contrast to classification and regression, where the loss function depends only on a single instance; those problems are referred to as pointwise learning problems.

In machine learning, we frequently build predictive models by optimizing their empirical behavior on training instances, that is, by achieving a small training error. However, a small training error does not imply that the learnt models will generalize well to test examples. Generalization analysis, which is a central topic in statistical learning theory (SLT) [40], studies the generalization gap between the training and testing errors. There is a large amount of work on the generalization analysis of learning algorithms, largely based on either algorithmic stability [7, 17], complexity analysis of models [3, 52], PAC-Bayesian analysis [38], or integral operators [49, 53]. Most of this work focuses on pointwise learning, while pairwise learning is far less studied. A difficulty in the generalization analysis of pairwise learning is that the objective function is not a sum of independently and identically distributed (i.i.d.) random variables [1, 9, 13, 30], a fundamental assumption in SLT.
*The first two authors contributed equally. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

In this paper, we employ the methodology of algorithmic stability for the generalization analysis of pairwise learning. Appealingly, algorithmic stability considers just the one prediction function output by the learner [7], while methods based on uniform convergence, such as the Rademacher complexity [3], bound the difference between training and testing errors for all prediction functions. The latter approach generally involves a square-root dependency on the input dimension [2, 18, 54]. By comparison, algorithmic stability enables dimension-independent generalization bounds [20].

While there is preliminary work on the algorithmic stability of metric learning and ranking, the resulting generalization bounds are not satisfactory. The best existing bounds decay at the order of $O(\gamma\sqrt{n})$ [1, 28, 55], where $\gamma$ is the uniform-stability constant of the learning algorithm. In regularized risk minimization (RRM), this results in an excess risk bound of order $O(n^{-1/4})$ at best, where $n$ is the number of training examples. As a main contribution of this paper, we show an improved bound for this setting of order $O(\gamma\log n)$, which translates into $O(\log(n)/\sqrt{n})$ for excess risks of RRM. Remarkably, although the bound improves the previously best known rate achieved through stability analysis by a factor of $\sqrt{n}/\log(n)$, it applies more generally: we remove the standard assumption of a bounded loss function used in the prevalent stability analyses [1, 8, 19, 20, 55]. The loss of some of the most commonly used pairwise learning methods, including RankSVM [29] and MPRank [15], is unbounded, and for such methods we show, for the first time, a stability analysis.

Based on our connection between generalization and stability, we also derive, to the best of our knowledge, the first probabilistic generalization bound for stochastic gradient descent (SGD) in pairwise learning. Our result quantifies how to trade off optimization and generalization to achieve a refined excess risk bound in this setting. The above bound holds for any confidence level, which is informative for understanding the variability of the algorithm and is necessary if the algorithm is used many times [20]. Furthermore, we show a sharper bound, which, however, holds in expectation and in a realizable case (where zero training error is achievable). Such bounds are called optimistic bounds in the literature [50]. For this setting, we show an excess risk bound of order $O(n^{-1})$. For the proof, we introduce a new on-average stability measure for pairwise learning and quantify its implication for generalization.

Finally, we consider applications of our general theory to ranking and metric learning, where we obtain generalization bounds with significantly improved dependence on $n$ as compared to the existing stability analysis. Furthermore, our stability analysis also removes the dependency on the complexity of the hypothesis space and the input dimension present in the uniform convergence analysis.

Structure. We review related work in Section 2, and give background information in Section 3. We present the main results in Section 4 and give applications in Section 5. We conclude the paper in Section 6.

2 Related Work

In this section, we summarize the related work on the generalization analysis of pairwise learning, which we categorize according to the employed proof techniques.
In complexity (uniform convergence) analysis, we view generalization gaps between training and testing errors as U-statistics of order two. We can then bound the supremum of U-statistics over the hypothesis space, i.e., the U-process [9, 13, 34, 44, 54, 58, 61]. To this end, decoupling techniques have been introduced to represent the objective function as a summation of i.i.d. random variables plus a degenerate U-statistic [13]. This approach can yield meaningful generalization bounds of the order $O(1/\sqrt{n})$ for several pairwise learning problems, including ranking [13, 44] and metric learning [9, 54]. The authors of [13, 44] show fast-rate generalization bounds under stronger capacity assumptions on the hypothesis space and Bernstein-type assumptions on the relationship between variances and expectations. Fast generalization bounds were established for metric learning [58], which, however, require a boundedness assumption on the output model, a bounded gradient assumption, and the learning models to be linear. The complexity approach ignores the interaction between the learning algorithm and the training dataset in the search for the output model. It therefore implies generalization bounds depending on the complexity of the hypothesis space [3] and the input dimension. As indicated in [2, 13, 54], a square-root dependency on the dimension is generally inevitable for uniform convergence in metric learning and ranking. This means that such bounds can quickly become uninformative in high dimensions [1]. For hypothesis spaces with unbounded complexity, uniform convergence bounds cannot be applied at all [1]. The stability analysis is preferable in both cases. An advantage of the uniform convergence approach is that it can imply meaningful generalization bounds in a non-convex learning setting [16, 21, 39]. By comparison, stability analysis requires very small step sizes to enjoy good stability for non-convex problems [25], which inevitably leads to very slow convergence of optimization errors.

The second popular approach studies pairwise learning using algorithmic stability, which is a fundamental concept in SLT dating back to the 1970s [46]. The modern framework of stability analysis was established in the seminal paper [7], where the important concept of uniform stability was introduced. This stability measure was then extended to study randomized algorithms [17], to investigate the concentration of output models [35], and to exploit the summation structure of the empirical risk [47]. These stability measures have found applications in privacy learning [4], stochastic optimization [5, 10, 25, 32, 33], and structured prediction [36]. The fundamental role of algorithmic stability in SLT was illustrated by establishing its close connection to learnability [42, 47]. Very recently, elegant high-probability bounds were established for uniformly stable algorithms [8, 19, 20, 37]. The above-mentioned stability analysis was conducted in the setting of pointwise learning. There is also some interesting work on the stability analysis of pairwise learning. For example, the connection between generalization and algorithmic stability was established for ranking [1, 23]. Furthermore, it was shown there that kernel-based ranking algorithms in a regularization setting enjoy uniform stability. Algorithmic stability was further used to yield dimension-independent bounds for regularized metric learning [28, 55].
The stability and its trade-off with optimization errors were studied for a variant of SGD in pairwise learning [48], inspired by recent work in the pointwise learning setting [11, 25].

We now briefly mention related work on the generalization analysis of pairwise learning using proof techniques other than complexity analysis or algorithmic stability. Algorithmic robustness was estimated for pairwise learning [12], which in turn implies generalization bounds [6]. Convex analysis was applied to study regret bounds and generalization bounds of online pairwise learning [30, 56]. The tool of integral operators was used to exploit the structure of the specific least squares loss function, for which the learnt model can be written in closed form [59].

3 Background

3.1 Pairwise learning

Assume we are given a training dataset $S = \{z_1, \ldots, z_n\}$ drawn independently from a probability measure $\rho$ defined over a sample space $Z = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ is an input space of dimension $d$ and $\mathcal{Y} \subseteq \mathbb{R}$ is an output space. Based on $S$, we wish to build a model $h : \mathcal{X} \to \mathcal{Y}$ or $h : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that can be used for prediction on testing examples. We consider a parametric model where the model $h_w$ is parameterized by an index $w \in \mathcal{W}$, where $\mathcal{W} \subseteq \mathbb{R}^{d'}$ is a parameter space of dimension $d'$ ($d'$ is not necessarily equal to $d$, and can be infinite). Unlike pointwise learning, a distinctive property of pairwise learning is that the performance of a model is measured over pairs of training examples. That is, the behavior of $h_w$ over $z, z' \in Z$ is measured by $\ell(w; z, z')$, where $\ell : \mathcal{W} \times Z \times Z \to \mathbb{R}_+$ is a loss function. Then, the empirical behavior of $h_w$ can be quantified by the empirical risk $R_S(w)$ defined by
$$R_S(w) = \frac{1}{n(n-1)} \sum_{i,j \in [n]: i \neq j} \ell(w; z_i, z_j), \tag{3.1}$$
where we use the notation $[n] = \{1, \ldots, n\}$. We train a predictive model by applying an algorithm $A$ to $S$; popular choices include empirical risk minimization, regularized/structural risk minimization, and (stochastic) gradient descent. An algorithm $A$ can be understood as a mapping from $Z^n$ to $\mathcal{W}$, with $A(S)$ being the output of $A$ when applied to $S$. Typically, the output model $A(S)$ enjoys a small empirical risk, since we are fitting the training examples. However, this does not necessarily mean that it also enjoys a small population risk $R(w) = \mathbb{E}_{z, z'}\big[\ell(w; z, z')\big]$, which quantifies the prediction behavior of $w$ over testing examples. The generalization gap of a model $w$ is defined as the difference between the population risk and the empirical risk, i.e., $R(w) - R_S(w)$.

We are particularly interested in RRM, where a regularizer $r : \mathcal{W} \to \mathbb{R}_+$ is added to the data-fitting term $R_S$ to increase the regularity of an algorithm. The resulting algorithm outputs the model
$$w_S = \arg\min_{w \in \mathcal{W}} \Big[ F_S(w) := \frac{1}{n(n-1)} \sum_{i,j \in [n]: i \neq j} f(w; z_i, z_j) \Big], \tag{3.2}$$
where $f : \mathcal{W} \times Z \times Z \to \mathbb{R}_+$ is defined as $f(w; z, z') = \ell(w; z, z') + r(w)$. Although the above objective function involves $O(n^2)$ terms in the summand, one can use SGD to achieve sample-size-independent convergence rates [45]. Let $w_1 \in \mathcal{W}$. At the $t$-th iteration, SGD first randomly selects $(i_t, j_t)$ from the uniform distribution over the set $\{(i, j) : i, j \in [n], i \neq j\}$, and then updates the model by
$$w_{t+1} = w_t - \eta_t \big( \ell'(w_t; z_{i_t}, z_{j_t}) + r'(w_t) \big), \tag{3.3}$$
where $\{\eta_t\}_t$ is a step size sequence and $\ell'(w_t; z_{i_t}, z_{j_t})$ denotes a subgradient of $\ell(\cdot; z_{i_t}, z_{j_t})$ at $w_t$.
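To make the setting concrete, here is a minimal Python sketch of the empirical risk (3.1) and the SGD update (3.3). The function names and the zero initialization are our own illustrative choices; `loss` and its (sub)gradient are assumed to be supplied by the user.

```python
import numpy as np

def pairwise_empirical_risk(loss, w, Z):
    """R_S(w) in (3.1): average of loss(w; z_i, z_j) over all ordered pairs i != j."""
    n = len(Z)
    total = sum(loss(w, Z[i], Z[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def sgd_pairwise(loss_grad, reg_grad, Z, eta, T, dim, seed=0):
    """SGD update (3.3): at step t, draw (i_t, j_t) uniformly among pairs with
    i_t != j_t, then step along a (sub)gradient of loss plus regularizer."""
    rng = np.random.default_rng(seed)
    n = len(Z)
    w = np.zeros(dim)  # w_1 = 0, an illustrative initialization
    for t in range(T):
        i, j = rng.choice(n, size=2, replace=False)  # two distinct indices
        w = w - eta[t] * (loss_grad(w, Z[i], Z[j]) + reg_grad(w))
    return w
```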
3.2 Algorithmic stability

Algorithmic stability plays an important role in studying the behavior of a learning algorithm. Intuitively, we say an algorithm $A : Z^n \to \mathcal{W}$ is stable if the output model $A(S)$ is insensitive to perturbations of $S$. There are various notions of stability, including uniform stability, hypothesis stability, error stability and on-average stability [7, 17, 47]. A particularly interesting stability measure is uniform stability, which was introduced in [7] and extended in [1] to pairwise learning.

Definition 1 (Uniform stability). We say a deterministic algorithm $A : Z^n \to \mathcal{W}$ is $\gamma$-uniformly stable if for any training datasets $S, S' \in Z^n$ that differ by at most a single example we have
$$\big| \ell(A(S); z, z') - \ell(A(S'); z, z') \big| \le \gamma, \quad \forall z, z' \in Z.$$

We will use the above notion of uniform stability to develop high-probability generalization bounds. To construct optimistic bounds, we introduce a novel on-average stability for pairwise learning, which is motivated by recent work on on-average stability for pointwise learning [24, 32, 33, 47]. The difference is that we consider perturbations of a training dataset by two examples.

Definition 2 (On-average stability). Let $S = \{z_1, \ldots, z_n\}$ and $S' = \{z'_1, \ldots, z'_n\}$ be drawn independently from $\rho$. For any $i < j$, we denote
$$S_{i,j} = \big\{ z_1, \ldots, z_{i-1}, z'_i, z_{i+1}, \ldots, z_{j-1}, z'_j, z_{j+1}, \ldots, z_n \big\}. \tag{3.4}$$
We say a deterministic algorithm $A$ is $\gamma$-on-average stable if
$$\frac{1}{n(n-1)} \sum_{i,j \in [n]: i \neq j} \mathbb{E}_{S, S'} \Big[ \ell\big(A(S_{i,j}); z_i, z_j\big) - \ell\big(A(S); z_i, z_j\big) \Big] \le \gamma.$$

It is clear that on-average stability is weaker than uniform stability, since it involves an expectation over the training examples and an average over the indices. By comparison, uniform stability involves a supremum over both the training examples and the testing examples $z, z'$.
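As a concrete illustration of Definition 1 (not a construction used in the proofs), the following Python sketch probes the uniform stability of a generic training routine by swapping out one training example at a time; `train` and `loss` are placeholder callables standing in for $A$ and $\ell$.

```python
def estimate_uniform_stability(train, loss, S, test_pairs, z_new):
    """Empirical probe of Definition 1: replace each training example with z_new,
    retrain, and track the largest loss change over the supplied test pairs.
    Since gamma is a supremum over all perturbations and all test pairs, this
    only yields a lower estimate of the true uniform-stability parameter."""
    w = train(S)
    gamma_hat = 0.0
    for k in range(len(S)):
        S_pert = list(S)
        S_pert[k] = z_new          # S and S_pert differ in a single example
        w_pert = train(S_pert)
        for z, z_prime in test_pairs:
            gap = abs(loss(w, z, z_prime) - loss(w_pert, z, z_prime))
            gamma_hat = max(gamma_hat, gap)
    return gamma_hat
```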
4 Main Results

In this section, we present our main results on generalization bounds based on stability. We always let $\|\cdot\|$ be a norm induced by an inner product $\langle \cdot, \cdot \rangle$ in a Hilbert space, i.e., $\|\cdot\|^2 = \langle \cdot, \cdot \rangle$; its dual norm is then $\|\cdot\|$ itself. We say a function $g : \mathcal{W} \to \mathbb{R}$ is $\sigma$-strongly convex w.r.t. a norm $\|\cdot\|$ if
$$g(w) \ge g(w') + \langle w - w', g'(w') \rangle + \sigma \|w - w'\|^2 / 2, \quad \forall w, w' \in \mathcal{W}.$$

4.1 Generalization by algorithmic stability

Our first result (Theorem 1), proved in Appendix A, is a high-probability generalization bound for uniformly stable algorithms in pairwise learning, motivated by the recent analysis in pointwise learning [8, 19, 20, 37]. One of the key tools in our analysis is a concentration inequality from [8], which considers a summation of $n$ functions of $n$ independent random variables. However, this concentration inequality does not fit the structure of pairwise learning: a difficulty is the interdependency among the $n(n-1)$ terms in the objective function. Our novelty in tackling this difficulty is to introduce a new decomposition exploiting the structure of the U-statistic in the pairwise objective function (3.1). Below, $e$ denotes the base of the natural logarithm. For any $\alpha \ge 0$, $\lceil \alpha \rceil$ denotes the least integer no smaller than $\alpha$. For any random variable $Z$, we denote by $\mathbb{E}_Z[\cdot]$ the conditional expectation with respect to (w.r.t.) $Z$.

Theorem 1. Let $A : Z^n \to \mathcal{W}$ be $\gamma$-uniformly stable and $M > 0$. Suppose $\mathbb{E}_S[\ell(A(S); z, z')] \le M$ for all $z, z' \in Z$. Then for all $\delta \in (0, 1/e)$, the following inequality holds with probability $1 - \delta$:
$$|R_S(A(S)) - R(A(S))| \le 4\gamma + e \Big( 12 M \sqrt{\tfrac{\log(e/\delta)}{n-1}} + 48\sqrt{6}\, \gamma \lceil \log_2(n-1) \rceil \log(e/\delta) \Big). \tag{4.1}$$

Remark 1. We now compare Theorem 1 with the existing stability analysis. Roughly speaking, Theorem 1 shows that the generalization gap of a $\gamma$-uniformly stable algorithm decays as $O(\gamma \log n + n^{-1/2})$ with high probability (we ignore $\log(1/\delta)$ factors for brevity). Under the same conditions, it was shown for pairwise learning that [1, 15, 28, 55]
$$|R_S(A(S)) - R(A(S))| = O\big( \sqrt{n}\, \gamma + n^{-1/2} \big). \tag{4.2}$$
It is clear that our result significantly improves (4.2) by replacing the dominant term $\sqrt{n}\,\gamma$ with $\gamma \log n$. Specifically, if $\gamma = O(n^{-\alpha})$ with $\alpha \in (1/2, 1]$ (actually $\gamma = O(1/(n\sigma))$ if $F_S$ is $\sigma$-strongly convex [7]), then (4.1) becomes $|R_S(A(S)) - R(A(S))| = O(n^{-1/2})$, while (4.2) becomes $|R_S(A(S)) - R(A(S))| = O(n^{1/2 - \alpha})$. The existing complexity analysis for pointwise learning suggests $\sigma = O(n^{-1/2})$ to get an optimal bound [51, eq. (14)]. In this case, $\gamma = O(n^{-1/2})$ and our stability analysis implies the nice bound $O(n^{-1/2} \log n)$, while (4.2) implies the vacuous bound $O(1)$.

4.2 Generalization bounds for regularized risk minimization

We now apply Theorem 1 to establish generalization bounds for pairwise learning with strongly convex objective functions. A preliminary step in the application of Theorem 1 is to control $\mathbb{E}_S[\ell(A(S); z, z')]$. To this aim, we establish the following lemma, proved in Appendix B. Let
$$w^* = \arg\min_{w \in \mathcal{W}} \big[ R(w) + r(w) \big], \qquad w_R^* = \arg\min_{w \in \mathcal{W}} R(w).$$

Lemma 2. Suppose $F_S$ is $\sigma$-strongly convex w.r.t. a norm $\|\cdot\|$. Define the algorithm $A$ as $A(S) = \arg\min_{w \in \mathcal{W}} F_S(w)$. If $A$ is $\gamma$-uniformly stable, then $\mathbb{E}[\|A(S) - w^*\|^2] \le 8\gamma/\sigma$.

In the existing analysis, one often uses the $\sigma$-strong convexity of $F_S$ to show $\|A(S)\| = O(1/\sqrt{\sigma})$ [9], which implies a suboptimal bound, since the convexity parameter $\sigma$ is often very small in practice, i.e., $\sigma = O(n^{-\alpha})$ for $\alpha \in (0, 1)$ ($\sigma$ is roughly the regularization parameter, which should decay in this way [19, 51]). By comparison, Lemma 2 implies that $\mathbb{E}_S[\|A(S) - w^*\|] = O(\sqrt{\gamma/\sigma})$, which is significantly smaller than $O(1/\sqrt{\sigma})$ since the uniform stability parameter is often very small [7].

We need the following assumption to derive Theorem 3, whose proof is given in Appendix B.

Assumption 1. Let $b, \sigma_0 > 0$. We assume $0 \le \ell(0; z, z') \le b$ for all $z, z' \in Z$. We also assume $\mathrm{Var}[\ell(w^*; Z, Z')] \le \sigma_0^2$, where $\mathrm{Var}[\ell(w^*; Z, Z')]$ denotes the variance of $\ell(w^*; Z, Z')$.

We use the notation $B \asymp \tilde{B}$ if there exist constants $c_1, c_2 > 0$ such that $c_1 \tilde{B} < B \le c_2 \tilde{B}$.

Theorem 3. Let Assumption 1 hold and $L \in \mathbb{R}_+$. Define $A$ as $A(S) = \arg\min_{w \in \mathcal{W}} F_S(w)$. Suppose $F_S$ is $\sigma$-strongly convex w.r.t. $\|\cdot\|$ for all $S$. Assume
$$|\ell(w; z, z') - \ell(w'; z, z')| \le L \|w - w'\|, \quad \forall z, z' \in Z, \ \forall w, w' \in \mathcal{W}. \tag{4.3}$$
Then for $\delta \in (0, 1/e)$, with probability $1 - \delta$ the generalization gap satisfies
$$|R_S(A(S)) - R(A(S))| = O\Big( (n\sigma)^{-1} \log n \log(1/\delta) + \sqrt{n^{-1} \log(1/\delta)} \Big). \tag{4.4}$$
Furthermore, if $r(w) = O(\sigma \|w\|^2)$, $\sigma \asymp n^{-1/2}$, $\sup_{z, z'} \ell(w_R^*; z, z') = O(\sqrt{n})$ and Assumption 1 holds with $w^*$ replaced by $w_R^*$, then with probability at least $1 - \delta$ we have the following bound on the excess risk:
$$R(A(S)) - R(w_R^*) = O\big( n^{-1/2} \log n \log(1/\delta) \big). \tag{4.5}$$

Remark 2. We present some comparisons with the existing work. Under similar assumptions plus additional boundedness assumptions, the existing stability analysis implies the generalization bound $|R_S(A(S)) - R(A(S))| = O(\sigma^{-1} n^{-1/2})$ for pairwise learning with $\sigma$-strongly convex objective functions [1, 15, 28, 55], which can be $\sqrt{n}$-times slower than the bound (4.4). To see this, assume $\sigma \asymp n^{-\alpha}$ with $\alpha \in [0, 1/2]$. If $\alpha \in [0, 1/2)$, then (4.4) implies the bound $O(n^{-1/2})$, while the bounds in [1, 28, 55] become $|R_S(A(S)) - R(A(S))| = O(n^{\alpha - 1/2})$. For the special case $\alpha = 1/2$ suggested in the existing analysis of pointwise learning [51], Eq. (4.4) implies the bound $O(n^{-1/2} \log n)$, while the existing bound becomes $O(1)$ [1, 28, 55]. As we will clarify in Remark B.1, the existing stability analysis yields at best the excess risk bound $R(A(S)) - R(w_R^*) = O(n^{-1/4})$, no matter how $\sigma$ changes. By comparison, our stability analysis yields the bound $R(A(S)) - R(w_R^*) = O(n^{-1/2} \log n)$.
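The improvement in Remarks 1 and 2 can be seen numerically. The following sketch evaluates the dominant terms of the old bound (4.2) and the new bound (4.1) for the strongly convex case $\gamma = O(1/(n\sigma))$ with $\sigma \asymp n^{-1/2}$; all absolute constants are set to 1 purely for illustration.

```python
import math

# Old bound (4.2): sqrt(n)*gamma + n**-0.5; new bound (4.1): gamma*log(n) + n**-0.5.
# With sigma ~ n**-0.5 we have gamma ~ 1/(n*sigma) ~ n**-0.5.
for n in (10**3, 10**4, 10**5, 10**6):
    gamma = n ** -0.5
    old = math.sqrt(n) * gamma + n ** -0.5   # stays O(1): vacuous
    new = gamma * math.log(n) + n ** -0.5    # decays like n**-0.5 * log n
    print(f"n={n:>7}: old bound ~ {old:.3f}, new bound ~ {new:.4f}")
```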
Remark 3 (Boundedness assumption). To get the bound $O(n^{-1/2} \log n)$, the existing stability analysis requires a boundedness assumption on the loss function, i.e., $0 \le \ell(A(S); z, z') \le B$ for a constant $B > 0$ and all $S \in Z^n$, $z, z' \in Z$ ($B$ is treated as a constant absorbed in the big-$O$ notation) [1, 8, 19, 20, 55], or a boundedness assumption on $\mathcal{W}$ [28]. However, one can only show $\|A(S)\| = O(1/\sqrt{\sigma})$ [1], and therefore the constant $B$ needs to grow as $O(1/\sqrt{\sigma})$ for popular loss functions (e.g., the hinge loss and the logistic loss), from which the stability analysis in [8] can only imply the suboptimal bound $O((n\sigma)^{-1/2})$, even in the case of pointwise learning, if one does not impose a boundedness assumption (note that $\sigma$ is often very small). We develop the generalization bound $O(n^{-1/2} \log n)$ by relaxing the boundedness assumption to a variance assumption on $\ell(w^*; Z, Z')$. Note that the expectation of $\ell(w^*; Z, Z')$ is $R(w^*)$, which is small according to the definition of $w^*$. Therefore, it is reasonable to assume that the variance of $\ell(w^*; Z, Z')$ is bounded. To achieve this relaxation, we use a novel application of Theorem 1 to the shifted loss $\tilde{\ell}(w; z, z') = \ell(w; z, z') - \ell(w^*; z, z')$ instead of $\ell(w; z, z')$ (cf. Line 107 in the Appendix). Moreover, we introduce a novel lemma (Lemma 2) to show $\mathbb{E}_S[\tilde{\ell}(A(S); z, z')] = O(1/(\sqrt{n}\sigma))$ (cf. Line 105 in the Appendix).

4.3 Generalization bounds for stochastic gradient descent

As a further application of Theorem 1, we establish generalization bounds for SGD (3.3) in pairwise learning, which can be considered a deterministic algorithm once we fix $\{(i_t, j_t)\}_t$ in (3.3). SGD is a highly popular algorithm with wide applications in the big-data era. Note that we do not require strong convexity here. We say $g : \mathcal{W} \to \mathbb{R}$ is $\alpha$-smooth w.r.t. a norm $\|\cdot\|$ if $g$ is differentiable and $\|g'(w) - g'(w')\| \le \alpha \|w - w'\|$ for all $w, w' \in \mathcal{W}$. Popular smooth loss functions include the logistic loss, the Huber loss, the squared hinge loss and the least squares loss [52]. Note that the logistic loss and the Huber loss are also Lipschitz continuous. The proof is given in Appendix C.

Theorem 4. Let (4.3) hold. Assume for all $z, z'$ that $w \mapsto \ell(w; z, z')$ is convex and $\alpha$-smooth w.r.t. the Euclidean norm, and for any $\{(i_t, j_t)\}_t$ we have $\mathbb{E}_S[\ell(w_T; z, z')] \le M$, where $w_T$ is produced by SGD with $\eta_t = c/\sqrt{T}$, $c \le 2/\alpha$ and $r(w) = 0$. For any $\delta \in (0, 1)$, with probability $1 - \delta$ we have
$$R_S(w_T) - R(w_T) = O\Big( \sqrt{T}\, n^{-1} \log n \log(1/\delta) + n^{-1/2} \log n \log^{3/2}(1/\delta) \Big). \tag{4.6}$$

Remark 4. We now show the implications of Theorem 4 for understanding the generalization behavior and implicit regularization of SGD. We can show (the details are given in Remark C.1)
$$R(w_T) - R(w_R^*) = \big[ R(w_T) - R_S(w_T) \big] + \big[ R_S(w_T) - R_S(w_R^*) \big] + O(n^{-1/2}). \tag{4.7}$$
The first term is the estimation error and the second term is the optimization error. Therefore, Theorem 4 actually gives an estimation error bound. If we further assume $\|w_t\| \le B$ for some $B > 0$ and all $t$, the optimization error was shown to satisfy² $R_S(w_T) - R_S(w_R^*) = O(T^{-1/2} \log T)$ [26] with high probability. Plugging these estimation and optimization error bounds back into (4.7), we derive, with high probability,
$$R(w_T) - R(w_R^*) = O\big( \sqrt{T}\, n^{-1} \log n + n^{-1/2} \log n \big) + O\big( T^{-1/2} \log T \big).$$
It is clear that estimation errors increase as we run more iterations, while optimization errors decrease.
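The trade-off in Remark 4 can be made concrete with a toy computation. The sketch below evaluates the estimation term $\sqrt{T}n^{-1}\log n + n^{-1/2}\log n$ and the optimization term $T^{-1/2}\log T$ (constants set to 1, logarithmic factors in $\delta$ dropped); their sum is smallest around $T \asymp n$.

```python
import math

n = 10**4
for T in (10**2, 10**3, 10**4, 10**5, 10**6):
    estimation = math.sqrt(T) / n * math.log(n) + math.log(n) / math.sqrt(n)
    optimization = math.log(T) / math.sqrt(T)
    print(f"T={T:>7}: estimation ~ {estimation:.3f}, "
          f"optimization ~ {optimization:.3f}, sum ~ {estimation + optimization:.3f}")
```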
One can take an optimal $T \asymp n$ to trade off the optimization and estimation errors, and get $R(w_T) - R(w_R^*) = O(n^{-1/2} \log n)$. To the best of our knowledge, this gives the first high-probability generalization bound for SGD in pairwise learning. Although we do not use an explicit regularizer in Theorem 4, our analysis shows that an implicit regularization can be achieved by tuning the number of iterations [25, 57].

²Although they considered the step size $\eta_t = c/\sqrt{t}$ [26], their result also holds for $\eta_t = c/\sqrt{T}$.

We can compare our results with those based on the existing connection (4.2) between generalization and stability. Indeed, if we combine the best known optimization error bounds [26], the uniform stability of SGD established in Lemma C.3 and (4.2), we can only derive the vacuous excess risk bound $R(w_T) - R(w_R^*) = O(1)$ (details are given in Remark C.2), which is significantly improved to $O(n^{-1/2} \log n)$ based on Theorem 1. In Appendix F we show that $O(n^{-1/2})$ is minimax optimal for pairwise learning in a general convex setting. The authors of [48] studied a variant of SGD where the models are updated by
$$w_{t+1} = w_t - \frac{\eta_t}{t-1} \sum_{k=1}^{t-1} \ell'(w_t; z_{i_t}, z_{i_k}), \quad t > 1.$$
Their stability bounds were stated in expectation [48], while we give a high-probability analysis. It should be mentioned that our generalization analysis can also be applied to other iterative algorithms for pairwise learning, including gradient descent, Nesterov's accelerated gradient descent, the heavy ball method and stochastic gradient Langevin dynamics [11]. To this aim, it suffices to estimate the uniform stability of these algorithms in pairwise learning.

4.4 Optimistic generalization bounds

Our key idea for deriving optimistic bounds is to use the on-average stability in Definition 2, whose connection to generalization is established in the following theorem, proved in Appendix D.

Theorem 5. If $A$ is $\gamma$-on-average stable, then $\mathbb{E}\big[ R(A(S)) - R_S(A(S)) \big] \le \gamma$.

We now present an optimistic generalization bound for pairwise learning by exploiting the smoothness of loss functions. By optimistic we mean that the decay rate of the generalization bound depends on the behavior of the best model; that is, we get faster bounds if $R(w_R^*) = o(1)$. Optimistic bounds were studied for pointwise learning in the literature [43, 50, 60], and are becoming increasingly interesting in the big-data era, where models are often powerful enough to achieve a very small training error. For any $w \in \mathcal{W}$, let $F(w) = R(w) + r(w)$. Theorem 6 is proved in Appendix D.

Theorem 6. Assume for all $z, z'$ that the map $w \mapsto \ell(w; z, z')$ is $\alpha$-smooth w.r.t. $\|\cdot\|$. Let $A(S) = \arg\min_{w \in \mathcal{W}} F_S(w)$ and $\sigma n \ge 8\alpha$. If for all $S \in Z^n$, $F_S$ is $\sigma$-strongly convex w.r.t. $\|\cdot\|$, then
$$\mathbb{E}\big[ F(A(S)) - F(w^*) \big] \le \mathbb{E}\big[ R(A(S)) - R_S(A(S)) \big] \le \frac{1024 \alpha^2}{n^2 \sigma^2}\, \mathbb{E}\big[ R_S(A(S)) \big]. \tag{4.8}$$
Furthermore, if $r(w) = O(\sigma \|w\|^2)$, we can take some appropriate $\sigma$ to get
$$\mathbb{E}[R(A(S))] - R(w_R^*) = O\Big( \sqrt{R(w_R^*)}\, \|w_R^*\|\, n^{-1/2} + \|w_R^*\|^2 n^{-1} \Big). \tag{4.9}$$

Note that if $R(w_R^*) = O(\|w_R^*\|^2 / n)$, the above excess risk bound becomes $\mathbb{E}[R(A(S))] - R(w_R^*) = O(\|w_R^*\|^2 / n)$. That is, we get a fast excess risk bound if there exists a model with a small population risk.
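To see what (4.9) buys, the toy computation below evaluates the bound (constants set to 1, $\|w_R^*\| = 1$) in three noise regimes: the rate interpolates between $O(n^{-1/2})$ in the high-noise regime and $O(n^{-1})$ when $R(w_R^*) = O(1/n)$.

```python
for n in (10**3, 10**5):
    for best_risk in (1.0, 1.0 / n, 0.0):   # R(w*_R): high noise, low noise, realizable
        bound = (best_risk ** 0.5) / n ** 0.5 + 1.0 / n   # bound (4.9) with ||w*_R|| = 1
        print(f"n={n:>6}, R(w*_R)={best_risk:.1e}: excess risk bound ~ {bound:.1e}")
```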
5 Applications

5.1 Ranking

For ranking we assume real-valued labels indicating a ranking preference on instances, i.e., $y_i < y_j$ means $x_i$ has a lower rank than $x_j$. We aim to build a function $h_w : \mathcal{X} \to \mathbb{R}$ that ranks instances with larger labels higher than those with smaller labels [1, 13, 44]. The performance of a model $h_w$ at a pair $z, z'$ can be measured by the 0-1 loss
$$\ell_{0\text{-}1}(w; z, z') = \mathbb{I}\big[ \mathrm{sgn}(y - y') \big( h_w(x) - h_w(x') \big) < 0 \big],$$
where $\mathbb{I}[\cdot]$ is the indicator function taking the value 1 if the argument holds and 0 otherwise, and $\mathrm{sgn}(a)$ denotes the sign of the number $a$. Since the 0-1 loss leads to an NP-hard problem, we consider loss functions of the form
$$\ell_\psi(w; z, z') = \psi\big( \mathrm{sgn}(y - y') \big( h_w(x) - h_w(x') \big) \big).$$
Here $\psi : \mathbb{R} \to \mathbb{R}_+$ is convex and decreasing; popular choices include the hinge loss $\psi(t) = \max\{1 - t, 0\}$ and the logistic loss $\psi(t) = \log(1 + \exp(-t))$. Below we provide bounds for RRM and SGD, as well as optimistic bounds.

Regularized risk minimization. The following proposition follows directly from Theorem 3 by noticing that $r(w) = \lambda \|w\|^2$ is $2\lambda$-strongly convex w.r.t. $\|\cdot\|$. We omit the proof for simplicity.

Proposition 7. Let Assumption 1 hold for both $w^*$ and $w^*$ replaced by $w_R^*$. Consider ranking problems with $f(w; z, z') = \ell_\psi(w; z, z') + \lambda \|w\|^2$. Assume $\ell_\psi$ is convex w.r.t. $w$ and satisfies (4.3). Then for $A(S) = \arg\min_{w \in \mathcal{W}} F_S(w)$, $\lambda \asymp n^{-1/2}$ and any $\delta \in (0, 1/e)$, with probability at least $1 - \delta$ there holds $R(A(S)) - R(w_R^*) = O(n^{-1/2} \log n \log(1/\delta))$.

Remark 5. High-probability bounds for ranking were developed in the literature under capacity assumptions on the hypothesis space $\{h_w : w \in \mathcal{W}\}$, measured by either covering numbers [44, 58] or the VC dimension [13]. The arguments there are based on the uniform convergence of empirical risks to the population risk and ignore the specific properties of the learning algorithm, which inevitably introduces a dependence on the complexity of the hypothesis space. Furthermore, a dependency on the dimension is necessary if no structural assumptions are imposed [18]. For example, generalization bounds $O(\sqrt{d/n})$ were derived for bipartite ranking (AUC maximization) via the uniform convergence approach [2]. By comparison, we derive dimension-independent bounds of order $O(n^{-1/2} \log n)$. Furthermore, the existing stability analysis implies $|R_S(A(S)) - R(A(S))| = O(\lambda^{-1} \sqrt{1/n})$ [1, 15] and the excess risk bound $R(A(S)) - R(w_R^*) = O(n^{-1/4})$, which are worse than the results in Proposition 7.

Stochastic gradient descent. The following proposition is a direct application of Theorem 4.

Proposition 8. Consider ranking problems with $f(w; z, z') = \ell_\psi(w; z, z')$, i.e., $r(w) = 0$. Let (4.3) hold and assume for all $z, z'$ that the map $w \mapsto \ell_\psi(w; z, z')$ is convex and $\alpha$-smooth. Let $w_T$ be produced by SGD with $\eta_t = c/\sqrt{T}$, $c \le 2/\alpha$, and assume $\mathbb{E}_S[\ell(w_T; z, z')] \le M$. Then for any $\delta \in (0, 1)$ and $T \asymp n$, with probability $1 - \delta$ we have $R_S(w_T) - R(w_T) = O(n^{-1/2} \log n \log^{3/2}(1/\delta))$.

Optimistic bounds. Proposition 9 on optimistic bounds is a direct application of Theorem 6.

Proposition 9. Consider ranking problems with $f(w; z, z') = \ell_\psi(w; z, z') + \lambda \|w\|^2$. If $\ell_\psi$ is convex and $\alpha$-smooth, and $A(S) = \arg\min_{w \in \mathcal{W}} F_S(w)$, we can choose some $\lambda \ge 8\alpha/n$ such that (4.9) holds.

Our results directly apply to bipartite ranking (AUC maximization) [14] with $\mathcal{Y} = \{+1, -1\}$. To see this, note that bipartite ranking is a specific instance of (3.1) with loss functions of the form $\ell_\psi(w; z, z') = \psi\big( h_w(x) - h_w(x') \big) \mathbb{I}[y = 1, y' = -1]$. We omit this discussion for brevity.
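For concreteness, here is a short Python sketch of the surrogate ranking loss $\ell_\psi$ with the logistic choice of $\psi$ and an illustrative linear scorer $h_w(x) = \langle w, x \rangle$ (the linear model is our assumption for the example, not a restriction of the theory).

```python
import numpy as np

def logistic_psi(t):
    """psi(t) = log(1 + exp(-t)), evaluated in a numerically stable way."""
    return np.logaddexp(0.0, -t)

def ranking_loss(w, z, z_prime, psi=logistic_psi):
    """l_psi(w; z, z') = psi(sgn(y - y') * (h_w(x) - h_w(x'))) with h_w(x) = <w, x>."""
    (x, y), (x2, y2) = z, z_prime
    return psi(np.sign(y - y2) * (np.dot(w, x) - np.dot(w, x2)))

# A pair that the model ranks correctly incurs a small loss.
w = np.array([1.0, -0.5])
z, z2 = (np.array([1.0, 0.0]), 2.0), (np.array([0.0, 1.0]), 1.0)
print(ranking_loss(w, z, z2))  # ~0.20
```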
5.2 Metric learning

We consider metric learning, i.e., learning a metric to measure the distance between instance pairs. We consider supervised metric learning with $\mathcal{Y} = \{-1, +1\}$, where we want an instance pair to be close if the instances have the same class label, and apart from each other if they have different class labels [9, 28, 58]. We consider the Mahalanobis metric $h_w(x, x') = \langle w, (x - x')(x - x')^\top \rangle$, where $x^\top$ denotes the transpose of $x$ and $w \in \mathbb{R}^{d \times d}$. The performance of $h_w$ on $z, z'$ can be measured by the 0-1 loss $\ell_{0\text{-}1}(w; z, z') = \mathbb{I}\big[ \tau(y, y') (1 - h_w(x, x')) \le 0 \big]$, where $\tau(y, y') = 1$ if $y = y'$ and $-1$ otherwise. We often use a convex surrogate $\psi : \mathbb{R} \to \mathbb{R}_+$, which leads to $\ell_\psi(w; z, z') = \psi\big( \tau(y, y') (1 - h_w(x, x')) \big)$. We assume $\sup_{x \in \mathcal{X}} \|x\|_2 \le B$ for some $B > 0$, where $\|\cdot\|_2$ is the Euclidean norm.

Regularized risk minimization. Corollary 10 on RRM is proved in Appendix E.

Corollary 10. Let Assumption 1 hold for both $w^*$ and $w^*$ replaced by $w_R^*$. Consider metric learning with $\mathcal{Y} = \{-1, +1\}$ and $f(w; z, z') = \psi\big( \tau(y, y') (1 - h_w(x, x')) \big) + \lambda \|w\|^2$, where $\|\cdot\|$ is the Frobenius norm and $\psi(t) = \max\{0, 1 - t\}$. Then for $A(S) = \arg\min_{w \in \mathcal{W}} F_S(w)$, $\lambda \asymp n^{-1/2}$ and any $\delta \in (0, 1/e)$, with probability $1 - \delta$ it holds that $R(A(S)) - R(w_R^*) = O(n^{-1/2} \log n \log(1/\delta))$.

Remark 6. We make some comparisons. It was previously shown that $R(A(S)) - R_S(A(S)) = O(n^{-1/2} \lambda^{-1})$ [9, 28, 55], which leads to the excess risk bound $O(n^{-1/4})$. This is significantly improved to $O(n^{-1/2} \log n)$ in Corollary 10. A uniform convergence rate $O(\sqrt{d/n})$ was shown for metric learning [54], which is not appealing for high-dimensional problems. It was further indicated that a strong dependence on $d$ is generally necessary for uniform convergence if one does not impose structural assumptions [54]. By comparison, our bound in Corollary 10 is dimension-independent.

Stochastic gradient descent. Corollary 11 on bounds for SGD is proved in Appendix E.

Corollary 11. Consider metric learning with $\mathcal{Y} = \{-1, +1\}$ and $f(w; z, z') = \psi\big( \tau(y, y') (1 - h_w(x, x')) \big)$, where $\psi(t) = \log(1 + \exp(-t))$. Let $w_T$ be produced by SGD with $\eta_t = c/\sqrt{T}$, $c \le 1/(8B^4)$, and assume $\mathbb{E}_S\big[ \psi\big( \tau(y, y') (1 - h_{w_T}(x, x')) \big) \big] \le M$. Then for any $\delta \in (0, 1)$ and $T \asymp n$, with probability $1 - \delta$ we have $R_S(w_T) - R(w_T) = O(n^{-1/2} \log n \log^{3/2}(1/\delta))$.

Optimistic bounds. We also obtain optimistic bounds for metric learning; we omit the proof for brevity.

Corollary 12. Let the assumptions of Corollary 10 hold, except that we consider the logistic loss $\psi(t) = \log(1 + \exp(-t))$. If $A(S) = \arg\min_{w \in \mathcal{W}} F_S(w)$ and $\|w_R^*\| = O(1)$, then we can choose some appropriate $\lambda$ such that Eq. (4.9) holds.
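As with ranking, the metric learning loss is straightforward to instantiate. The following Python sketch (our illustrative implementation, not code from the paper) computes the Mahalanobis similarity $h_w$ and the logistic surrogate of Corollary 11.

```python
import numpy as np

def mahalanobis_h(w, x, x2):
    """h_w(x, x') = <w, (x - x')(x - x')^T> = (x - x')^T w (x - x')."""
    diff = x - x2
    return float(diff @ w @ diff)

def metric_loss(w, z, z2, psi=lambda t: np.logaddexp(0.0, -t)):
    """l_psi(w; z, z') = psi(tau(y, y') * (1 - h_w(x, x'))), with tau = +1 for
    same-label pairs and -1 otherwise, and the logistic surrogate psi."""
    (x, y), (x2, y2) = z, z2
    tau = 1.0 if y == y2 else -1.0
    return float(psi(tau * (1.0 - mahalanobis_h(w, x, x2))))

# With the identity metric: similar pair close together, dissimilar pair far apart.
w = np.eye(2)
z_a = (np.array([0.0, 0.1]), 1)
z_b = (np.array([0.1, 0.0]), 1)
z_c = (np.array([2.0, 0.0]), -1)
print(metric_loss(w, z_a, z_b))  # small loss: similar pair at small distance
print(metric_loss(w, z_a, z_c))  # small loss: dissimilar pair at large distance
```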
6 Conclusion

We analyze the generalization ability of pairwise learning using the methodology of algorithmic stability. We significantly improve the existing high-probability bounds $O(\sqrt{n}\,\gamma)$ to $O(\gamma \log n)$ for $\gamma$-uniformly stable algorithms. This allows us to improve the previously best excess risk bounds, $O(n^{-1/4})$ for RRM and $O(1)$ for SGD, to $O(n^{-1/2} \log n)$. As compared to the uniform convergence analysis, our stability analysis implies the first high-probability risk bound for SGD in pairwise learning, and yields bounds independent of the complexity of models and the input dimension. Furthermore, we introduce an on-average stability to develop optimistic bounds as fast as $O(1/n)$ for learning in a low noise setting. Specific applications are further given to show the advantage of our generalization bounds over the existing analysis.

Below we mention some interesting directions for future research. First, it would be interesting to extend the analysis here to other learning settings, such as distributed learning and online learning for pairwise learning. Second, one could tackle the challenging problem of stability and generalization bounds for non-convex pairwise learning problems, which are popular in modern machine learning.

Broader Impact

This work does not present any foreseeable societal consequence.

Acknowledgments and Disclosure of Funding

YL acknowledges support by the National Natural Science Foundation of China (Grant Nos. 61806091, 11771012), and by the Alexander von Humboldt Foundation for a Humboldt Research Fellowship. AL and MK acknowledge support by the German Research Foundation (DFG) award KL 2698/2-1 and by the Federal Ministry of Science and Education (BMBF) awards 01IS18051A and 031B0770E.

References

[1] S. Agarwal and P. Niyogi. Generalization bounds for ranking algorithms via algorithmic stability. Journal of Machine Learning Research, 10:441–474, 2009.
[2] S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth. Generalization bounds for the area under the ROC curve. Journal of Machine Learning Research, 6:393–425, 2005.
[3] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
[4] R. Bassily, V. Feldman, K. Talwar, and A. G. Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, pages 11279–11288, 2019.
[5] R. Bassily, V. Feldman, C. Guzmán, and K. Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. In Advances in Neural Information Processing Systems, 2020.
[6] A. Bellet and A. Habrard. Robustness and generalization for metric learning. Neurocomputing, 151:259–267, 2015.
[7] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
[8] O. Bousquet, Y. Klochkov, and N. Zhivotovskiy. Sharper bounds for uniformly stable algorithms. In Conference on Learning Theory, pages 610–626, 2020.
[9] Q. Cao, Z.-C. Guo, and Y. Ying. Generalization bounds for metric and similarity learning. Machine Learning, 102(1):115–132, 2016.
[10] Z. Charles and D. Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In International Conference on Machine Learning, pages 744–753, 2018.
[11] Y. Chen, C. Jin, and B. Yu. Stability and convergence trade-off of iterative optimization algorithms. arXiv preprint arXiv:1804.01619, 2018.
[12] A. Christmann and D.-X. Zhou. On the robustness of regularized pairwise learning methods based on kernels. Journal of Complexity, 37:1–33, 2016.
[13] S. Clémençon, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of U-statistics. The Annals of Statistics, pages 844–874, 2008.
[14] C. Cortes and M. Mohri. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems, pages 313–320, 2004.
[15] C. Cortes, M. Mohri, and A. Rastogi. Magnitude-preserving ranking algorithms. In Proceedings of the 24th International Conference on Machine Learning, pages 169–176, 2007.
[16] D. Davis and D. Drusvyatskiy. Uniform graphical convergence of subgradients in nonconvex optimization and learning. arXiv preprint arXiv:1810.07590, 2018.
[17] A. Elisseeff, T. Evgeniou, and M. Pontil. Stability of randomized learning algorithms. Journal of Machine Learning Research, 6:55–79, 2005.
[18] V. Feldman. Generalization of ERM in stochastic convex optimization: The dimension strikes back. In Advances in Neural Information Processing Systems, pages 3576–3584, 2016.
[19] V. Feldman and J. Vondrak. Generalization bounds for uniformly stable algorithms. In Advances in Neural Information Processing Systems, pages 9747–9757, 2018.
[20] V. Feldman and J. Vondrak. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Conference on Learning Theory, pages 1270–1279, 2019.
[21] D. J. Foster, A. Sekhari, and K. Sridharan. Uniform convergence of gradients for non-convex learning and optimization. In Advances in Neural Information Processing Systems, pages 8759–8770, 2018.
[22] J. Fürnkranz and E. Hüllermeier. Preference learning and ranking by pairwise comparison. In Preference Learning, pages 65–82. Springer, 2010.
[23] W. Gao and Z.-H. Zhou. Uniform convergence, stability and learnability for ranking problems. In International Joint Conference on Artificial Intelligence, pages 1337–1343. AAAI Press, 2013.
[24] A. Gonen and S. Shalev-Shwartz. Average stability is invariant to data preconditioning: Implications to exp-concave empirical risk minimization. The Journal of Machine Learning Research, 18(1):8245–8257, 2017.
[25] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234, 2016.
[26] N. J. Harvey, C. Liaw, Y. Plan, and S. Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579–1613, 2019.
[27] T. Hu, J. Fan, Q. Wu, and D.-X. Zhou. Regularization schemes for minimum error entropy principle. Analysis and Applications, 13(04):437–455, 2015.
[28] R. Jin, S. Wang, and Y. Zhou. Regularized distance metric learning: Theory and algorithm. In Advances in Neural Information Processing Systems, pages 862–870, 2009.
[29] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142, 2002.
[30] P. Kar, B. Sriperumbudur, P. Jain, and H. Karnick. On the generalization ability of online learning algorithms for pairwise loss functions. In International Conference on Machine Learning, pages 441–449, 2013.
[31] A. Kumar, A. Niculescu-Mizil, K. Kavukcoglu, and H. Daumé. A binary classification framework for two-stage multiple kernel learning. In International Conference on Machine Learning, pages 1331–1338, 2012.
[32] I. Kuzborskij and C. Lampert. Data-dependent stability of stochastic gradient descent. In International Conference on Machine Learning, pages 2820–2829, 2018.
[33] Y. Lei and Y. Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning, 2020.
[34] Y. Lei, S.-B. Lin, and K. Tang. Generalization bounds for regularized pairwise learning. In International Joint Conference on Artificial Intelligence, pages 2376–2382, 2018.
[35] T. Liu, G. Lugosi, G. Neu, and D. Tao. Algorithmic stability and hypothesis complexity. In International Conference on Machine Learning, pages 2159–2167, 2017.
[36] B. London, B. Huang, and L. Getoor. Stability and generalization in structured prediction. The Journal of Machine Learning Research, 17(1):7808–7859, 2016.
[37] A. Maurer. A second-order look at stability and generalization. In Conference on Learning Theory, pages 1461–1475, 2017.
[38] D. A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.
[39] S. Mei, Y. Bai, and A. Montanari. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46(6A):2747–2774, 2018.
[40] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
[41] S. Mukherjee and D.-X. Zhou. Learning coordinate covariances via gradients. Journal of Machine Learning Research, 7:519–549, 2006.
[42] A. Rakhlin, S. Mukherjee, and T. Poggio. Stability results in learning theory. Analysis and Applications, 3(04):397–417, 2005.
[43] H. W. Reeve and A. Kaban. Optimistic bounds for multi-output prediction. arXiv preprint arXiv:2002.09769, 2020.
[44] W. Rejchel. On ranking and generalization bounds. Journal of Machine Learning Research, 13:1373–1392, 2012.
[45] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, pages 400–407, 1951.
[46] W. H. Rogers and T. J. Wagner. A finite sample distribution-free performance bound for local discrimination rules. The Annals of Statistics, pages 506–514, 1978.
[47] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11:2635–2670, 2010.
[48] W. Shen, Z. Yang, Y. Ying, and X. Yuan. Stability and optimization error of stochastic gradient descent for pairwise learning. Analysis and Applications, pages 1–41, 2019.
[49] S. Smale and D.-X. Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26(2):153–172, 2007.
[50] N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, pages 2199–2207, 2010.
[51] K. Sridharan, S. Shalev-Shwartz, and N. Srebro. Fast rates for regularized objectives. In Advances in Neural Information Processing Systems, pages 1545–1552, 2009.
[52] I. Steinwart and A. Christmann. Support Vector Machines. Springer Science & Business Media, 2008.
[53] Z. Szabó, B. K. Sriperumbudur, B. Póczos, and A. Gretton. Learning theory for distribution regression. The Journal of Machine Learning Research, 17(1):5272–5311, 2016.
[54] N. Verma and K. Branson. Sample complexity of learning Mahalanobis distance metrics. In Advances in Neural Information Processing Systems, pages 2584–2592, 2015.
[55] B. Wang, H. Zhang, P. Liu, Z. Shen, and J. Pineau. Multitask metric learning: Theory and algorithm. In International Conference on Artificial Intelligence and Statistics, pages 3362–3371, 2019.
[56] Y. Wang, R. Khardon, D. Pechyony, and R. Jones. Generalization bounds for online learning algorithms with pairwise loss functions. In Conference on Learning Theory, volume 23, pages 13.1–13.22, 2012.
[57] Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
[58] H.-J. Ye, D.-C. Zhan, and Y. Jiang. Fast generalization rates for distance metric learning. Machine Learning, 108(2):267–295, 2019.
[59] Y. Ying and D.-X. Zhou. Online pairwise learning algorithms. Neural Computation, 28(4):743–777, 2016.
[60] L. Zhang, T. Yang, and R. Jin. Empirical risk minimization for stochastic convex optimization: O(1/n)- and O(1/n²)-type of risk bounds. In Conference on Learning Theory, pages 1954–1979, 2017.
[61] Y. Zhou, H. Chen, R. Lan, and Z. Pan. Generalization performance of regularized ranking with multiscale kernels. IEEE Transactions on Neural Networks and Learning Systems, 27(5):993–1002, 2015.