# convergence_of_unregularized_online_learning_algorithms__6500297a.pdf

Journal of Machine Learning Research 18 (2018) 1-33 Submitted 8/17; Published 4/18

Convergence of Unregularized Online Learning Algorithms

Yunwen Lei yunwelei@cityu.edu.hk Shenzhen Key Laboratory of Computational Intelligence Department of Computer Science and Engineering Southern University of Science and Technology Shenzhen, 518055, China and Department of Mathematics City University of Hong Kong Kowloon, Hong Kong, China

Lei Shi leishi@fudan.edu.cn School of Mathematical Sciences Shanghai Key Laboratory for Contemporary Applied Mathematics Fudan University Shanghai, 200433, China

Zheng-Chu Guo guozhengchu@zju.edu.cn School of Mathematical Sciences Zhejiang University Hangzhou, 310027, China

Editor: Andreas Christmann

In this paper we study the convergence of online gradient descent algorithms in reproducing kernel Hilbert spaces (RKHSs) without regularization. We establish a suﬃcient condition and a necessary condition for the convergence of excess generalization errors in expectation. A suﬃcient condition for the almost sure convergence is also given. With high probability, we provide explicit convergence rates of the excess generalization errors for both averaged iterates and the last iterate, which in turn also imply convergence rates with probability one. To our best knowledge, this is the ﬁrst high-probability convergence rate for the last iterate of online gradient descent algorithms in the general convex setting. Without any boundedness assumptions on iterates, our results are derived by a novel use of two measures of the algorithm s one-step progress, respectively by generalization errors and by distances in RKHSs, where the variances of the involved martingales are cancelled out by the descent property of the algorithm. Keywords: Learning theory, Online learning, Convergence analysis, Reproducing kernel Hilbert space

1. Introduction

Online gradient descent is a scalable method able to tackle large-scale data arriving in a sequential manner (Zhang, 2004; Kivinen et al., 2004; Duchi and Singer, 2009; Dieuleveut and Bach, 2016), which is becoming ubiquitous within the big data era. As a ﬁrst-order method, it iteratively builds an unbiased estimate of the true gradient upon the arrival of

c 2018 Yunwen Lei, Lei Shi and Zheng-Chu Guo.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/17-457.html.

Lei, Shi and Guo

a new example and uses this information to guide the learning process (Zinkevich, 2003; Zhang, 2004). As veriﬁed by theoretical and empirical analysis, online gradient descent enjoys comparable performance as compared to its batch counterpart such as gradient descent (Zhang, 2004; Yao, 2010; Shalev-Shwartz et al., 2011), while attaining a great computational speed-up since its gradient calculation involves only a single example. As a comparison, the gradient calculation in gradient descent requires to traverse all training examples. Recently, online gradient descent has received renewed attention due to the wide applications of its stochastic analogue, i.e., stochastic gradient descent, in training deep neural networks (Bottou, 1991; Ngiam et al., 2011; Sutskever et al., 2013). In this paper, we are interested in the setting that training examples {zt = (xt, yt)}t N are sequentially and identically drawn from a probability measure ρ deﬁned in the sample space Z = X Y, where X Rd is the input space and Y R is the output space. We focus on the nonparametric setting, where the learning process is implemented in a reproducing kernel Hilbert space (RKHS) HK associated with a Mercer kernel K : X X R which is assumed to be continuous, symmetric and positive semi-deﬁnite. The space HK is deﬁned as the completion of the linear span of the set of functions {Kx( ) := K(x, ) : x X} with the inner producing satisfying the reproducing property f(x) = f, Kx for any x X and f HK. In this setting, the use of Mercer kernels provides a unifying way to measure similarities between pairs of objects (Cortes and Vapnik, 1995; M uller et al., 2001; Steinwart, 2001; Sch olkopf and Smola, 2001), which turns out to be a key to the great success of kernel methods in many practical learning problems. We wish to build a prediction rule f HK after seeing a sequence of training examples, the performance of which at an example (x, y) can be quantitatively measured by a loss function φ : Y R R+ as φ(y, f(x)). With a sequence {ηt}t N of positive step sizes and f1 = 0, online gradient descent is a realization of learning schemes by keeping a sequence of iterates as follows

ft+1 = ft ηtφ (yt, ft(xt))Kxt, t N, (1.1)

where φ denotes the derivative of φ with respect to the second argument. Although our focus is on the nonparametric setting, it should be mentioned that the above algorithm also recovers the parametric case in which the kernel is taken to be the linear kernel with Kx(x ) = x, x , x, x X, to which our results also apply. Despite its widespread applications, the theoretical understanding of the online gradient descent algorithms is still not satisfactory in the following three aspects. Firstly, boundedness assumptions on the iterates are often imposed in the literature, which may be violated in practical implementations if the underlying domain is not bounded. Although a projection of iterates onto a bounded domain guarantees the boundedness assumption, the projection operator may be time-consuming and this introduces an additional challenging problem of tuning the size of the domain. Secondly, most of the theoretical results are stated in expectation, while we are sometimes more interested in either almost sure convergence or convergence rates with high probability. Indeed, an algorithm may suﬀer from a high variability and should be used with caution if neither almost sure convergence nor high-probability bounds hold (Shamir and Zhang, 2013). In particular, an almost sure convergence is still lacking for online gradient descent algorithms applied to general convex problems (Ying and Zhou, 2017). Lastly, most existing convergence rates are stated for some average of iterates. Though taking average of iterates can improve the robustness

Convergence of Unregularized Online Learning Algorithms

of the solution (Nemirovski et al., 2009), it can either destroy the sparsity of the solution which is crucial for a proper interpretation of models in many applications, or slow down the training speed in practical implementations (Rakhlin et al., 2012). In this paper, we aim to take a further step to tackle the above mentioned problems. We establish a general suﬃcient condition and a necessary condition on the step sizes for the convergence of online gradient descent algorithms in expectation. With Doob s martingale convergence theorem and the Borel-Cantelli lemma, a suﬃcient condition for the almost sure convergence and explicit convergence rates with probability one are also established. Furthermore, we present high-probability bounds for both averaged iterates and the last iterate of online gradient descent algorithms. To our best knowledge, this is the ﬁrst high-probability convergence rate for the last iterate of online gradient descent algorithms in the general convex setting. Our analysis does not impose any boundedness assumptions on the iterates. Indeed, we show that, although implemented in an unbounded domain, the iterates produced by (1.1) fall into a bounded domain with high probability (up to logarithmic factors). Our analysis is performed by viewing the one-step progress of online gradient descent algorithms from diﬀerent yet uniﬁed perspectives: one in terms of generalization errors and one in terms of RKHS distances. For both viewpoints, we relate the one-step progress to a martingale diﬀerence sequence and a negative term due to the descent nature of the algorithm. Our novelty is to show that the dominant variance term appearing in the application of a Bernstein-type inequality to these martingales can be cancelled out by the negative terms in the one-step progress inequalities. Both viewpoints of the one-step progress are indispensable in our analysis. The remaining parts of this paper are organized as follows. We present main results in Section 2. Discussions and comparisons with related work are given in Section 3. The proofs of main results are given in Section 4.

2. Main Results

Our convergence rates are stated for generalization errors, which, for a prediction rule f : X R, are deﬁned as the expected error E(f) = R

Z φ(y, f(x))dρ incurred from using f to perform prediction. Our analysis requires to impose mild assumptions on the loss functions.

Assumption 1 We assume the loss function φ : Y R R+ is convex and diﬀerentiable with respect to the second argument. Let α (0, 1] and L > 0 be two constants. We assume that the gradients of φ are (α, L)-H older continuous in the sense

|φ (y, s) φ (y, s)| L|s s|α, s, s R, y Y. (2.1)

We say φ is smooth if it satisﬁes (2.1) with α = 1. Loss functions satisfying Assumption 1 are wildly used in machine learning. Smooth loss functions include the least squares loss φ(y, a) = 1

2(y a)2 and the Huber loss φ(y, a) = 1

2(y a)2 if |y a| 1 and |y a| 1

2 otherwise for regression, as well as the logistic loss φ(y, a) = log(1 + exp( ya)) and the quadratically smoothed hinge loss φ(y, a) = max{0, 1 ya}2 for classiﬁcation (Zhang, 2004). If p (1, 2], both the p-norm hinge loss φ(y, a) = max{0, 1 ya}p for classiﬁcation and the p-th power absolute distance φ(y, a) = |y a|p for regression satisfy (2.1) with α = p 1 (Chen et al., 2004; Steinwart and Christmann, 2008).

Lei, Shi and Guo

Throughout this paper, we assume that a minimizer f H = arg minf HK E(f) exists in HK. We also assume

max sup y Y φ(y, 0), sup z Z φ(y, f H(x)) < and κ := sup x X

K(x, x) < ,

which is satisﬁed if the sample space Z is bounded. Denote as the norm in HK. We always use the notation At = E[E(ft)] E(f H) and ˆAt = E(ft) E(f H), t N for brevity, which are referred to as the expected excess generalization errors and excess generalization errors, respectively.

In the following, we present the main results of this paper. We consider three types of convergence: convergence in expectation, almost sure convergence and convergence rates with high probability.

2.1 Convergence in Expectation

The ﬁrst part of our main results to be proved in Section 4.1 establishes a general suﬃcient condition (Theorem 1) and a necessary condition (Theorem 2) on the step size sequence {ηt}t N for the convergence of At to zero.

Theorem 1 Let {ft}t N be the sequence produced by (1.1) and suppose Assumption 1 holds with α (0, 1]. If X

t=1 ηt = and lim t ηα t

k=1 η2 k = 0, (2.2)

then limt E[E(ft)] E(f H) = 0.

Theorem 2 Let {ft}t N be the sequence produced by (1.1). Suppose that for any y Y, the function φ(y, ) : R R+ is convex and its derivative φ (y, ) is (1, L)-H older continuous. Assume that the step size sequence satisﬁes ηt 1/(6Lκ2), t N and E(f1) = E(f H). If limt E[E(ft)] = E(f H), then P t=1 ηt = .

Remark 3 We now illustrate the above theorems by considering the polynomially decaying step sizes ηt = η1t θ, t N, θ 0. The condition P t=1 ηt = requires θ 1, while the condition limt ηα t Pt k=1 η2 k = 0 requires θ > 1 2+α. Therefore, Theorem 1 shows that the iteration scheme (1.1) with ηt = η1t θ and θ 1 2+α, 1 guarantees the convergence of {At}t N. Theorem 2 shows that the condition θ 1 is also necessary for the convergence.

2.2 Almost Sure Convergence

The second part of our main results focuses on a suﬃcient condition (Theorem 4) for the almost sure convergence of { ˆAt}t N to zero and convergence rates with probability 1 (Theorem 6). The proofs of results in this section can be found in Section 4.2.

Theorem 4 Let {ft}t N be the sequence given by (1.1). If Assumption 1 holds with α (0, 1] and the step size sequence satisﬁes

t=1 ηt = and

t=1 η1+α t < , (2.3)

Convergence of Unregularized Online Learning Algorithms

then limt E(ft) = E(f H) almost surely.

Remark 5 According to Theorem 4, we know that { ˆAt}t N would converge almost surely to 0 if we consider either the step sizes ηt = η1t θ with θ ( 1 1+α, 1] or the step sizes

ηt = η1(t logβ t) 1 1+α with β > 1. Speciﬁcally, if the loss function is smooth, then we can

choose either ηt = η1t θ with θ (1

2, 1] or ηt = η1 t logβ t 1

2 with β > 1 to guarantee the convergence of the algorithm (1.1) almost surely in the sense of generalization errors.

Theorem 6 Suppose that Assumption 1 holds with α (0, 1]. Let {ft}t N be the sequence given by (1.1) with ηt = η1t θ, θ ( 1 α+1, 1) and η1 1 Aκ2 (A is deﬁned in (4.27)). Then for any ϵ > 0, lim t tmin{(1 θ),(α+1)θ 1} ϵ ˆAt = 0 almost surely. (2.4)

Speciﬁcally, if we choose θ = 2 2+α, then limt t α 2+α ϵ ˆAt = 0 almost surely.

2.3 Convergence Rates with High Probability

The last part of our main results is on high-probability bounds for the excess generalization errors, the proof of which is given in Section 4.3. With high probability, Theorem 7 establishes the boundedness (up to logarithmic factors) of the weighted summation PT t=1 ηt ˆAt, from which the decay rate of the excess generalization error E( fη T ) E(f H) associated to a

weighted average of the iterates fη T := PT t=1 ηtft PT t=1 ηt follows directly.

Theorem 7 Let {ft}t N be the sequence given by (1.1). Suppose that Assumption 1 holds with α (0, 1]. Assume the step size sequence satisﬁes ηt 1 Aκ2 , ηt+1 ηt for all t N and P t=1 η2 t < . Then, there exists a constant e C independent of T (explicitly given in the proof) such that for any δ (0, 1) the following inequality holds with probability at least 1 δ

t=1 ηt E(ft) E(f H) e C log 3 2 2T

δ and E( fη T ) E(f H) e C log 3 2 2T

δ PT t=1 ηt . (2.5)

Remark 8 For the step size sequence ηt = η1t θ, θ > 1

2, Theorem 7 implies that E( fη T )

E(f H) = O T θ 1 log 3 2 T

δ with probability at least 1 δ. If we consider ηt = η1 t logβ(et) 1

with β > 1, then with probability 1 δ we have E( fη T ) E(f H) = O T 1

A key feature of Theorem 7 distinguishing it from the existing results is that it avoids boundedness assumptions on the iterates, which are always imposed in the literature (Nemirovski et al., 2009; Duchi et al., 2010). Indeed, an essential ingredient in proving Theorem 7 is to show that {ft}t N produced by (1.1) would fall into a bounded ball of HK (up to logarithmic factors) with high probability, as shown in the following proposition.

Proposition 9 Suppose assumptions in Theorem 7 hold. Then, there exists a constant C 1 independent of T (explicitly given in the proof) such that for any δ (0, 1) the following inequality holds with probability at least 1 δ

max 1 t T ft f H 2 C log T

Lei, Shi and Guo

A key ingredient to prove Proposition 9 is to establish the following one-step progress inequality in terms of the RKHS distances (see (4.37))

ft+1 f H 2 ft f H 2 + Cη2 t + 2ηt E(f H) E(ft) + ξt,

where C is a constant and {ξt}t N is a Martingale diﬀerence sequence. Our novelty in applying a Bernstein-type inequality to control the martingale PT t=1 ξt is to show that the associated variances can be cancelled out by the negative term 2 PT t=1 ηt E(f H) E(ft) (see (4.38) and the last inequality of Proposition 23). Although Theorem 7 only considers the behavior of the weighted average fη T of iterates, it is possible to establish similar convergence rates for the uniform average of iterates f T := 1

T PT t=1 ft (Proposition 24). Theorem 10 establishes a general high-probability bound for the excess generalization error of the last iterate in terms of the step size sequence.

Theorem 10 Suppose that the assumptions in Theorem 7 hold. Then, there exists a constant e C independent of T (explicitly given in the proof) such that for any δ (0, 1) the following inequality holds with probability at least 1 δ

E(f T+1) E(f H) e C max n T X

2 ηt 1, η T

2 η1+α t o log2 3T

2 denotes the largest integer not greater than T

To establish high-probability error bounds for the last iterate of online gradient descent algorithm is an interesting problem which is not well studied, to our best knowledge, in the general convex setting. The key ingredient in our analysis is the following one-step progress inequality in terms of generalization errors (see (4.47))

ˆAt+1 ˆAt ηt E(ft) 2 + ξt + Cη1+α t ,

where C is a constant and { ξt} is a martingale diﬀerence sequence. A key observation of our analysis is that the variance of the martingale PT t=1 ξt can be cancelled out by the negative term PT t=1 ηt E(ft) 2 in the above one-step progress inequality (see (4.48) and (4.52)), paving the way for the application of a Bernstein-type inequality for martingales. We can derive explicit convergence rates in Corollary 11 by considering polynomially decaying step sizes in Theorem 10.

Corollary 11 Let {ft}t N be the sequence given by (1.1) with ηt = η1t θ, θ (1

2, 1) and η1 1 Aκ2 If Assumption 1 holds and δ (0, 1), then the following inequality holds with probability 1 δ

E(f T+1) E(f H) = O T max θ 1,1 (1+α)θ log2 T

If we choose θ = 2 2+α, then with probability at least 1 δ we derive E(f T+1) E(f H) =

O T α 2+α log2 T

Convergence of Unregularized Online Learning Algorithms

Remark 12 It should be mentioned that, unlike Theorem 7, the convergence rates in Corollary 11 depend on the smoothness parameter α and are not able to attain the minimax optimal convergence rate O(T 1

2 ) (Agarwal et al., 2009). Indeed, for smooth loss functions, Corollary 11 establishes the convergence rate O T 1

δ with high probability, which matches the bounds in-expectation AT = O(T 1

3 ) up to logarithmic factors established in Moulines and Bach (2011); Ying and Zhou (2017). It remains a challenging problem to further improve the high-probability bounds for ˆAT .

3. Discussions

In this section, we discuss related work on convergence of online/stochastic gradient descent algorithms from three viewpoints: convergence in expectation, almost sure convergence and convergence rates with high probability.

3.1 Related Work on Convergence in Expectation

Most studies of online gradient descent algorithms focus on convergence in expectation (Zhang, 2004; Ying and Zhou, 2006; Duchi and Singer, 2009; Shamir and Zhang, 2013; Lin et al., 2016; Hardt et al., 2016; Ying and Zhou, 2017). Convergence rates O(T 1

2 ) were established for some averaged iterates produced by (1.1) in a parametric setting with the linear kernel Kx = x (Zhang, 2004). These results were extended to online gradient descent algorithms in RKHSs with the speciﬁc least squares loss function (Ying and Pontil, 2008; Dieuleveut and Bach, 2016; Guo and Shi, 2017), and online mirror descent algorithms performing updates in Banach spaces (Duchi et al., 2010). Under boundedness assumptions on the iterates and (sub)gradients, the convergence rate O(T 1

2 log T) was established for the expected excess generalization error of the last iterate (Shamir and Zhang, 2013). Recently, a general condition on the step sizes as (2.3) was established for the convergence of the algorithm (1.1), in the sense limt At = 0, with loss functions satisfying Assumption 1 (Ying and Zhou, 2017). This suﬃcient condition is stricter than our condition (2.2). To see this clearly, we consider the polynomially decaying step sizes ηt = η1t θ, for which the condition (2.3) requires θ ( 1 1+α, 1] while our condition (2.2) requires θ ( 1 2+α, 1]. Furthermore, our discussion also implies a necessary condition for the convergence in expectation. Implemented in either a parametric or a nonparametric setting, regularized online learning algorithms have also received considerable attention (Kivinen et al., 2004; Smale and Yao, 2006; Ying and Zhou, 2006; Smale and Zhou, 2009), which diﬀer from (1.1) by introducing a regularization term to avoid overﬁtting. This algorithm updates iterates as follows

ft+1 = (1 ληt)ft ηtφ (yt, ft(xt))Kxt, (3.1)

where λ > 0 is a regularization parameter and the term λft + φ (yt, ft(xt))Kxt is used as an unbiased estimator of the gradient for the regularized generalization error Eλ(f) := E(f) + λ

2 f 2 at f = ft. Convergence rates in expectation can be stated for either the excess regularized generalization error Eλ(f T ) Eλ(fλ) (Shamir and Zhang, 2013) or the RKHS distance f T fλ (Smale and Yao, 2006; Ying and Zhou, 2006; Yao, 2010), where fλ = arg minf HK Eλ(f) is the minimizer of the regularized generalization error. When the

Lei, Shi and Guo

loss function is smooth, a suﬃcient and necessary condition as

lim t ηt = 0 and

t=1 ηt = (3.2)

was recently established for the convergence of {E[ ft fλ 2]}t N to zero in the parametric case (Lei and Zhou, 2017). A disadvantage of the regularization scheme (3.1) is that it requires to tune two sequences of hyper-parameters: a regularization parameter and the step sizes. As a comparison, an implicit regularization can be attained in the unregularized scheme (1.1) by tuning only the step sizes.

3.2 Related Work on Almost Sure Convergence

Existing almost sure convergence of online learning algorithm is mainly stated for the RKHS distances, which requires to impose some type of strong convexity assumption on the objective function E(f). In the parametric setting with the learning scheme (1.1), a suﬃcient condition as X

t=1 ηt = and

was established for the almost sure convergence of ft f H 2 if the objective function attains a unique minimizer and satisﬁes (Bottou, 1998)

inf f f H 2>ϵ f f H, E(f) > 0, ϵ > 0,

EZ φ (Y, f(X))KX 2 e A + e B f f H 2, f HK,

where e A and e B are two constants. This result was extended to the online mirror descent setting under some convexity assumption on the objective function measured by Bregman distances induced by the associated mirror map (Lei and Zhou, 2017). For polynomially decaying step sizes ηt = η1t θ with θ (0, 1), almost sure convergence of ft fλ was shown for regularized online learning algorithms (3.1) speciﬁed to the least squares loss function (Yao, 2010). The analysis in Yao (2010) roots its foundation on the martingale decompositions of the reminders ft fλ, which only holds in the least squares regularization setting. Almost sure convergence was recently studied for the randomized Kaczmarz algorithm (Lin and Zhou, 2015), which is an instantiation of (1.1) with φ(y, a) = 1

and Kx = x. The analysis there heavily depends on a restricted strong convexity of the objective function in a linear subspace where the learning takes place, which can not apply to general loss functions. As compared to the above mentioned results, our almost sure convergence is stated for the excess generalization errors with general loss functions and requires no assumptions on the strong convexity of the objective function E(f).

3.3 Related Work on Convergence Rates with High Probability

In this section, we survey related work on convergence rates with high probability. We divide our discussions into two parts according to the convexity of the objective function. As far as we know, all existing high-probability convergence rates of online gradient descent algorithms with general convex functions focus on some average of iterates (here we

Convergence of Unregularized Online Learning Algorithms

are not interested in probabilistic bounds with a polynomial dependence on 1/δ). The following online projected gradient descent algorithm with Kx = x was studied in Nemirovski et al. (2009); Duchi et al. (2010)

ft+1 = Proj e H h ft ηtφ (yt, ft(xt))Kxt i , (3.3)

where e H is a compact subset of HK and Proj e H(f) = arg min f e H f f is the projection

of f onto e H. Under the boundedness assumption

E h exp φ (y, f(x))Kx 2/G2 i exp(1) f e H,

it was shown that the weighted average fη T = PT t=1 ηtft PT t=1 ηt of iterates produced by (3.3) with a

constant step size satisﬁes the following inequality with probability 1 δ

E( fη T ) E(f H) = O GDT 1

2 log δ 1 ,

where D = supf, f e H f f is the diameter of the subspace e H. Under a stronger assumption

φ (y, f(x))Kx G for all (x, y) Z, f e H, the uniform average f T = 1 T PT t=1 ft of iterates produced by (3.3) with step sizes ηt = η1t 1

2 was shown to enjoy the bound E( f T ) E(f H) = O(DGT 1

2 log 1 2 1

δ) with probability at least 1 δ. In comparison with these results, the convergence rates in Theorem 7 are derived without the projection step and any boundedness assumption on the gradients. Indeed, most of the eﬀorts in proving Theorem 7 is to show ft f H 2 = O(log T

δ ) with probability at least 1 δ. It is implied that the possibly computationally expensive projection step can be removed without harming the behavior of the online gradient descent algorithms. Furthermore, Theorem 10 gives, to our best knowledge, the ﬁrst high-probability bounds for the last iterate of online gradient descent algorithms in the general convex setting. A framework to transfer regret bounds of online learning algorithms to high-probability bounds for the uniform average of iterates was established by Cesa-Bianchi et al. (2004). Now we review some high-probability studies for online gradient descent algorithms in the strongly convex setting, for which some results for the last iterate can be found in the literature. For the online regularized algorithm (3.1) with the least squares loss function and ηt = η1t θ, θ [0, 1), the following inequality was derived with probability at least 1 δ (Yao, 2010)

f T fλ 2 = O λ 2+ 1 1 θ T θ log 1

The analysis in Yao (2010) is based on an integral operator approach, which can not be extended to general loss functions. Under almost sure boundedness assumption (φ (yt, ft(xt))+ λ)Kxt G for all t N, the following improved bound for the last iterate of (3.1) with general loss functions and step sizes ηt = η1(tλ) 1 was established with probability at least 1 δ (Rakhlin et al., 2012)

f T fλ 2 = O G2λ 2T 1 log log T

Lei, Shi and Guo

Although this bound enjoys a tight dependence on T, its dependence on the regularization parameter λ is suboptimal. To make a clear comparison between this result and ours, we consider here the speciﬁc least squares loss function and assume that the regression function fρ(x) := E[Y |X = x] belongs to HK. In this case, Lemma 13 translates (3.4) to the following high-probability bounds on excess generalization errors

E(f T ) + λ

2 f T 2 = E(fλ) + λ

2 fλ 2 + O G2λ 2T 1 log log T

The assumption fρ HK implies D(λ) := E(fλ) E(fρ) + λ

2 fλ 2 = O(λ) (Cucker and Zhou, 2007) and therefore (3.5) reads as

E(f T ) E(fρ) = E(f T ) E(fλ) λ

2 fλ 2 + E(fλ) E(fρ) + λ

= O G2λ 2T 1 log log T

If we choose λ = c G2T 1 log log T

3 for a constant c > 0, then the above inequality trans-

lates to E(f T ) E(fρ) = O G2T 1 log log T

3 , which matches our convergence rates up to logarithmic factors. Note that the regularization parameter λ needs to be tuned according to T to balance the bias and variance in (3.5), which may not be accessible in practical implementations. To deal with this issue, a class of fully online regularized algorithms was proposed and investigated by allowing the regularization parameters to vary along the learning process (Ye and Zhou, 2007; Tarres and Yao, 2014). As a comparison, without a regularization parameter to tune, the unregularized online learning algorithm (1.1) achieves a bias-variance balance by tuning only the step sizes. Furthermore, the convergence rates (3.4) require to impose the non-intuitive boundedness assumptions on the gradients encountered during the iterations, which may be violated in practical implementations. This boundedness assumption is removed in our analysis.

In this section, we present the proofs for the results given in Section 2. Our discussions require to use a property established in the following lemma on functions with (α, L)-H older continuous gradients. This lemma is motivated by similar results in the literature (see, e.g., Ying and Zhou, 2017) and we present the proof in Section A for completeness.

Lemma 13 Let H be a Hilbert space associated with the inner product , . Let G : H R be a convex and diﬀerentiable functional satisfying

G(f) G( f) L f f α, f, f H,

where L > 0, α (0, 1], is the gradient operator and is the norm induced by the inner product. Then, the following inequality holds for any f, f H

α G(f) G( f) 1+α

(1 + α)L 1 α G(f) G( f) + f f, G( f) L f f 1+α

1 + α . (4.1)

Convergence of Unregularized Online Learning Algorithms

With Lemma 13, we can derive the following lemma on gradients of loss functions at iterates of the algorithm (1.1). Its power consists in bounding the gradients for the possibly unbounded iterates {ft}t N by the gradients for f H and the excess generalization errors, the ﬁrst of which can be considered as a constant while the second of which are exactly the terms we are interested in. For a random variable z, we use Ez[ ] to denote the conditional expectation with respect to z.

Lemma 14 Suppose Assumption 1 holds and β (0, 1]. Then,

Ezt |φ (yt, ft(xt))|1+β 2βL 1 α (1 + β) E(ft) E(f H) +

1 + α + 2βEzt |φ (yt, f H(xt))|1+β , t N. (4.2)

Proof With the elementary inequality |u + v|1+β 2β[|u|1+β + |v|1+β] and the Young s inequality uv p 1|u|p + q 1|v|q, u, v R, p 1 + q 1 = 1, p 0, (4.3)

the term |φ (yt, ft(xt))|1+β can be controlled by

|φ (yt, ft(xt))|1+β h |φ (yt, ft(xt)) φ (yt, f H(xt))| + |φ (yt, f H(xt))| i1+β

2β|φ (yt, ft(xt)) φ (yt, f H(xt))|1+β + 2β|φ (yt, f H(xt))|1+β

1 + α |φ (yt, ft(xt)) φ (yt, f H(xt))| 1+α

α + 2β(1 αβ)

1 + α + 2β|φ (yt, f H(xt))|1+β.

It follows from the ﬁrst inequality of (4.1) that

α 1 + α|φ (yt, ft(xt)) φ (yt, f H(xt))| 1+α

L 1 α h φ(yt, ft(xt)) φ(yt, f H(xt)) φ (yt, f H(xt))(ft(xt) f H(xt)) i .

Plugging the above inequality into (4.4) and taking expectations with respect to zt (note ft is independent of zt), we get

Ezt |φ (yt, ft(xt))|1+β 2βL 1 α (1 + β) h E(ft) E(f H) ft f H, Ezt φ (yt, f H(xt))Kxt i

1 + α + 2βEzt |φ (yt, f H(xt))|1+β

= 2βL 1 α (1 + β) E(ft) E(f H) + 2β(1 αβ)

1 + α + 2βEzt |φ (yt, f H(xt))|1+β .

Here the last identity holds since

E(f H) = Ez φ (y, f H(x))Kx = 0.

The proof is complete.

Lei, Shi and Guo

4.1 Proofs for Convergence in Expectation

Before proving Theorem 1 and Theorem 2 on convergence in expectation, we ﬁrst present some preparatory results. Our ﬁrst preliminary result is a weak result on convergence in expectation under a weak condition on the step size sequence (4.5). Eq. (4.6) implies the existence of a sub-index sequence {it}t N satisfying limt Ait = 0, while (4.7) shows the convergence of a weighted average of the expected excess generalization errors. This result is derived based on a one-step progress inequality in terms of distances in RKHSs (see (4.10)).

Proposition 15 Let {ft}t N be the sequence given by (1.1) and suppose Assumption 1 holds. If

lim t ηt = 0 and

t=1 ηt = , (4.5)

then lim inf t E[E(ft) E(f H)] = 0 (4.6)

t=1 ηt 1 T X

t=1 ηt E[E(ft) E(f H)] = 0. (4.7)

Lemma 16 Let {ηt}t N be a sequence of positive numbers. If limt ηt = 0 and P t=1 ηt = , then limt Pt k=1 ηk 1 Pt k=1 η2 k = 0.

Proof of Proposition 15 According to the iteration strategy (1.1), we derive

ft+1 f H 2 = ft ηtφ (yt, ft(xt))Kxt f H 2

ft f H 2 + η2 t |φ (yt, ft(xt))|2κ2 2ηt ft f H, φ (yt, ft(xt))Kxt (4.8)

ft f H 2 + η2 t |φ (yt, ft(xt))|2κ2 + 2ηt φ(yt, f H(xt)) φ(yt, ft(xt)) . (4.9)

Note that ft is independent of zt. Taking expectations with respect to zt on both sides and using (4.2) with β = 1, we derive

Ezt ft+1 f H 2 ft f H 2 + η2 t κ2Ezt |φ (yt, ft(xt))|2 + 2ηt[E(f H) E(ft)]

ft f H 2 + 4η2 t κ2L 1 α [E(ft) E(f H)] + 2(1 α)η2 t κ2

1 + α + 2η2 t κ2Ezt |φ (yt, f H(xt))|2 + 2ηt[E(f H) E(ft)]

= ft f H 2 + 2ηt 1 2ηtκ2L 1 α [E(f H) E(ft)] + 2η2 t κ2 Ezt |φ (yt, f H(xt))|2 + 1 α

Since limt ηt = 0, we can ﬁnd an integer t1 N such that ηt 1 4κ2L 1 α , t t1. This

together with E(f H) E(ft) implies

ηt[E(ft) E(f H)] ft f H 2 Ezt ft+1 f H 2 + γη2 t , t t1, (4.10)

Convergence of Unregularized Online Learning Algorithms

where we introduce γ = 2κ2 Ezt |φ (yt, f H(xt))|2 + 1 α

1+α . Taking expectations followed with a summation from t = t1 to t = T gives

t=t1 ηt At E[ ft1 f H 2] + γ

t=t1 η2 t .

It then follows that

t=1 ηt 1 T X

t=1 ηt At = lim T T X

t=1 ηt 1 t1 1 X

t=1 ηt At + lim T T X

t=1 ηt 1 T X

t=1 ηt 1h E[ ft1 f H 2] + γ

t=t1 η2 t i = 0,

where we have used limt Pt k=1 ηk 1 Pt k=1 η2 k = 0 established in Lemma 16. This establishes (4.7). We now prove (4.6) by contradiction strategy. Suppose to the contrary that lim inf t At =

a > 0. Then, there exists t N such that At 2 1 a, t t, from which we derive from (4.7) that

PT t=1 ηt At PT t=1 ηt a

PT t= t+1 ηt PT t=1 ηt = a

P t t=1 ηt PT t=1 ηt = a

This leads to a contradiction. Therefore, lim inft At = 0 and the proof is complete.

As our second preliminary result, Lemma 17 establishes an upper bound on E[ ft f H 2 2] in terms of the step size sequence, as well as a lower bound on E[ E(ft) 2] in terms of the step size sequence and the expected excess generalization errors.

Lemma 17 Let {ft}t N be the sequence given by (1.1). If Assumption 1 holds and limt ηt = 0, then there exist constants b C, γ > 0 independent of t such that the following inequalities hold for any t N

E[ ft f H 2] b C + γ

k=1 η2 k (4.11)

E[ E(ft) 2]

E[E(ft) E(f H)] 2

b C + γ Pt k=1 η2 k . (4.12)

Proof Since E(ft) E(f H) for all t N, (4.10) implies

E[ ft+1 f H 2] E[ ft f H 2] + η2 t γ, t t1,

Lei, Shi and Guo

where γ and t1 are deﬁned in the proof of Proposition 15. Taking a summation of the above inequality from t = t1 to t = T shows

E[ f T+1 f H 2] E[ ft1 f H 2] + γ

t=t1 η2 t b C + γ

where we introduce b C = E[ ft1 f H 2]. This establishes (4.11). We now turn to (4.12). According to the convexity of E and Schwartz inequality, we get

E[E(ft)] E(f H) E E(ft), ft f H E[ E(ft) ft f H ]

E E(ft) 2 1

2 E ft f H 2 1

The above inequality together with (4.11) gives

E[ E(ft) 2]

E E(ft) E(f H) 2

E[ ft f H 2]

E E(ft) E(f H) 2

b C + γ Pt k=1 η2 k .

This establishes (4.12) and completes the proof.

Remark 18 Eq. (4.11) was derived in Ying and Zhou (2017) under an additional assumption P k=1 η1+α k < , which is removed in Lemma 17. For step sizes of the form ηt = η1t θ

2 < θ 1, it was shown E[ ft f H 2] = O t1 θ E(f H) inff E(f) (Lin and Zhou, 2018). As compared with the results in Ying and Zhou (2017); Lin and Zhou (2018), our discussion implies boundedness of E[ ft 2] under a milder condition P k=1 η2 k < .

We are now in a position to prove Theorem 1 for the convergence in expectation. Let ϵ > 0 be an arbitrary small number. Our idea is to use Proposition 15, based on onestep progress in terms of the distances in RKHSs, to show that {At}t N can be smaller than ϵ inﬁnitely often. Once A t ϵ for a suﬃciently large t, we can use the assumption limt ηα t Pt k=1 η2 k = 0 and the one-step progress inequality (4.15) in terms of generalization errors to show At ϵ for any t t. Proof of Theorem 1 Since φ (y, ) is (α, L)-H older continuous, we can apply the second inequality of (4.1) to show that

φ(y, ft+1(x)) φ(y, ft(x)) + (ft+1(x) ft(x))φ (y, ft(x)) + L 1 + α|ft+1(x) ft(x)|1+α.

According to the reproducing property f(x) = f, Kx , f H and the iteration scheme (1.1), we know

φ(y, ft+1(x)) φ(y, ft(x)) + ft+1 ft, φ (y, ft(x))Kx + L 1 + α| ft+1 ft, Kx |1+α

φ(y, ft(x)) ηt φ (yt, ft(xt))Kxt, φ (y, ft(x))Kx + Lκ1+α

1 + α ft+1 ft 1+α

φ(y, ft(x)) ηt φ (yt, ft(xt))Kxt, φ (y, ft(x))Kx + Lκ2(1+α)η1+α t 1 + α |φ (yt, ft(xt))|1+α.

Convergence of Unregularized Online Learning Algorithms

Putting (4.2) with β = α back into (4.13) followed with a conditional expectation with respect to zt and z yields

Ezt[E(ft+1)] = Ezt,z[φ(y, ft+1(x))] Ez[φ(y, ft(x))] ηt Ezt[φ (yt, ft(xt))Kxt], Ez[φ (y, ft(x))Kx]

+ Lκ2(1+α)η1+α t 1 + α

h 2αL 1 α (1 + α)(E(ft) E(f H)) + 2α(1 α) + 2αEz[|φ (y, f H(x))|1+α] i

E(ft) ηt E(ft) 2+ Lκ2(1+α)2αη1+α t h L 1 α (1+α)(E(ft) E(f H))+(1 α)+Ez[|φ (y, f H(x))|1+α] i

Subtracting E(f H) from both sides of the above inequality gives

Ezt[E(ft+1)] E(f H) h 1 + L1+ 1

α κ2(1+α)2αη1+α t i (E(ft) E(f H)) ηt E(ft) 2

+ Lκ2(1+α)2αη1+α t 1 + α

h (1 α) + Ez[|φ (y, f H(x))|1+α] i . (4.14)

Taking expectations over both sides, the above inequality can be written as

At+1 (1 + aη1+α t )At + bη1+α t ηt E[ E(ft) 2], (4.15)

where we introduce the notations

α κ2(1+α)2α and b = Lκ2(1+α)2α

h (1 α) + Ez[|φ (y, f H(x))|1+α] i . (4.16)

Plugging (4.12) into the above inequality gives

At+1 (1 + aη1+α t )At + bη1+α t ηt A2 t b C + γ Pt k=1 η2 k , (4.17)

where b C and γ are deﬁned in the proof of Lemma 17. The assumption limt ηα t Pt k=1 η2 k = 0 implies limt ηt = 0 and therefore the assumptions of Proposition 15 hold. Let ϵ (0, 1) be an arbitrary number. According to lim inft At = 0 established in Proposition 15, we can ﬁnd a t N ( t can be suﬃciently large) such that A t ϵ and

ηα t b C + γ

k=1 η2 k ϵ2

4(a + b), η1+α t ϵ 2(a + b) t t. (4.18)

We now prove by induction that At ϵ for all t t. It suﬃces to show that At+1 ϵ under the assumption At ϵ and t t. Since At 1, we derive from (4.17) that

At+1 At + (a + b)η1+α t ηt A2 t b C + γ Pt k=1 η2 k .

We now consider two cases. If A2 t (a+b)ηα t b C +γ Pt k=1 η2 k , then we know At+1 At ϵ. Otherwise, we derive from (4.18) that

At+1 At + (a + b)η1+α t

v u u t(a + b)ηα t b C + γ

k=1 η2 k + (a + b)η1+α t ϵ.

Lei, Shi and Guo

Putting the above two cases together we derive At+1 ϵ. That is, At ϵ for all t t. Since ϵ (0, 1) is arbitrarily chosen, we get limt At = 0.

The necessary condition in Theorem 2 is established by applying the co-coercivity given in Lemma 13 to bound E(ft+1) in terms of E(ft) from below. Proof of Theorem 2 Since φ (y, ) is (1, L)-H older continuous for any y Y, we have

E(f) E( f) = E[φ (y, f(x))Kx φ (y, f(x))Kx] E |φ (y, f(x)) φ (y, f(x))| Kx

LE | f f, Kx | Kx Lκ2 f f . (4.19)

That is, E is (1, Lκ2)-H older continuous. Lemma 13 with α = 1 and E(f H) = 0 then yield the following inequality

E(ft) E(f H) + ft f H, E(f H) + E(ft) E(f H) 2

2Lκ2 = E(f H) + E(ft) 2

2Lκ2 . (4.20)

It follows from the convexity of E and (1.1) that

E(ft+1) E(ft) + E(ft), ft+1 ft = E(ft) ηt E(ft), φ (yt, ft(xt))Kxt .

Taking expectations over both sides and using (4.20), we derive the following inequality for all t N

E[E(ft+1)] E[E(ft)] ηt E[ E(ft) 2] E[E(ft)] 2Lκ2ηt E[E(ft) E(f H)].

Hence, At+1 1 2Lκ2ηt At, t N.

The assumption ηt 1/(6Lκ2) and the elementary inequality 1 η exp( 2η), η (0, 1/3) (Lin and Zhou, 2015) then show

At+1 exp 4Lκ2ηt At

k=1 exp 4Lκ2ηk A1 = exp 4Lκ2 t X

which, together with the condition limt At = 0 and A1 = 0, then establishes the necessary condition P t=1 ηt = .

4.2 Proofs for Almost Sure Convergence

We use the following Doob s martingale convergence theorem (see, e.g., Doob, 1994, page 195) to prove Theorem 4 on almost sure convergence. Speciﬁcally, we will use the one-step progress inequality in terms of generalization errors to construct a supermartingale, whose almost sure convergence would imply the almost sure convergence of { ˆAt}t N.

Lemma 19 Let { Xt}t N be a sequence of non-negative random variables with E[ X1] < and let {Ft}t N be a nested sequence of sets of random variables with Ft Ft+1 for all t N. If E[ Xt+1|Ft] Xt for every t N, then Xt converges to a nonnegative random variable X almost surely. Furthermore, X < almost surely.

Convergence of Unregularized Online Learning Algorithms

Proof of Theorem 4 Eq. (4.14) gives

Ezt[ ˆAt+1] (1 + aη1+α t ) ˆAt + bη1+α t , t N, (4.21)

with a and b are deﬁned in the proof of Theorem 1. Denote c = Q k=1(1 + aη1+α k ), which, according to the elementary inequality 1 + τ exp(τ), τ 0 and (2.3), satisﬁes

k=1 exp(aη1+α k ) = exp a

k=1 η1+α k < .

Multiplying both sides of (4.21) by Q k=t+1(1 + aη1+α k ), we derive

k=t+1 (1 + aη1+α k )Ezt ˆAt+1

k=t (1 + aη1+α k ) ˆAt + bη1+α t

k=t+1 (1 + aη1+α k )

k=t (1 + aη1+α k ) ˆAt + bcη1+α t . (4.22)

Introduce the stochastic process

k=t (1 + aη1+α k ) ˆAt + bc

k=t η1+α k , t N (4.23)

According to (2.3), we know E[ ˆX1] < . Eq. (4.22) implies Ezt[ ˆXt+1] ˆXt for all t N, that is, { ˆXt}t N is a supermartingale taking non-negative values. Lemma 19 then implies that limt ˆXt = ˆX for a non-negative random variable ˆX almost surely. Let Ω= {ω = {zt}t N} be the set for which { ˆXt(ω)}t converges to ˆX(ω) as t and ˆX(ω) < . Then, Pr{Ω} = 1, where Pr{Ω} denotes the probability with which the event Ωhappens. Let ω Ωand ϵ > 0. Since P t=1 η1+α t < , we can ﬁnd t N such that

t= t η1+α t < ϵ 3bc,

k= t (1 + aη1+α k ) < 1 + ϵ

3 ˆX(ω) + ϵ and | ˆXt(ω) ˆX(ω)| < ϵ

It then follows from (4.23) that

ˆAt(ω) ˆXt(ω) 1 + ϵ

3 ˆX(ω) + ϵ

3 ˆAt(ω) + ϵ ˆXt(ϵ)

3 ˆX(ω) + ϵ + ϵ

ˆAt(ω) + ϵ ˆX(ω) + ϵ

3 ˆX(ϵ) + ϵ + ϵ

3 ˆAt(ω) + 2ϵ

from which we derive

ˆX(ω) ϵ ˆXt(ω) 2ϵ

3 ˆAt(ω) ˆX(ω) + ϵ, t t.

That is, limt ˆAt(ω) = ˆX(ω) for any ω Ω, i.e., limt ˆAt = ˆX almost surely. Since P t=1 η1+α t < , we know P t=1 η2 t < and limt ηt = 0. This further implies

k=1 η2 k = 0

Lei, Shi and Guo

and therefore the assumptions in Theorem 1 hold. Theorem 1 shows that limt E[ ˆAt] = 0. By Fatou s lemma, we get

0 E[ ˆX] = E lim t ˆAt lim inf t E[ ˆAt] = 0,

which implies that E[ ˆX] = 0 and therefore ˆX = 0 almost surely since ˆX is non-negative. Combining the above deductions together, we know that limt ˆAt = 0 almost surely.

Our proof of Theorem 6 is based on the following lemma which can be found in Lin and Zhou (2015) as an easy consequence of the Borel-Cantelli Lemma.

Lemma 20 Let {ξt}t N be a sequence of non-negative random variables and {ϵt}t N be a sequence of positive numbers satisfying limt ϵt = 0. If P t=1 Pr{ξt > ϵt} < , then ξt converges to 0 almost surely.

Proof of Theorem 6 Introduce δt = t 2 for all t N. According to Corollary 11, there exists a constant e C1 such that

Pr n tmin{1 θ,(α+1)θ 1} ϵ ˆAt e C1t ϵ log2 t

Since P t=1 δt < and limt t ϵ log2 t

δt = 0, we can apply Lemma 20 here to show (2.4). The proof is complete.

4.3 Proofs for Convergence Rates with High Probability

Our discussion on high-probability convergence rates roots its foundation on the following concentration inequalities of martingales. Part (a) is the Azuma-Hoeﬀding inequality for martingales with bounded diﬀerences (Hoeﬀding, 1963), while Part (b) is a Bernsteintype inequality which exploits information on variances to derive improved concentration inequalities for martingales (Zhang, 2005). A remarkable property of this Bernstein-type inequality is that it involves a conditional variance which itself is a random variable.

Lemma 21 Let z1, . . . , zn be a sequence of random variables such that zk may depend on the previous random variables z1, . . . , zk 1 for all k = 1, . . . , n. Consider a sequence of functionals ξk(z1, . . . , zk), k = 1, . . . , n.

(a) Assume that |ξk Ezk[ξk]| bk for each k. Let δ (0, 1). With probability at least 1 δ we have n X

k=1 Ezk[ξk] 2

k=1 b2 k log 1

(b) Assume that ξk Ezk[ξk] b for each k. Let ρ > 0 and δ (0, 1). With probability at least 1 δ we have

k=1 Ezk[ξk] (eρ ρ 1)σ2 n ρb + b log 1

δ ρ , (4.25)

where σ2 n = Pn k=1 Ezk(ξk Ezkξk)2 is the conditional variance.

Convergence of Unregularized Online Learning Algorithms

Since φ (y, ) is (α, L)-H older continuous, convex and non-negative, Proposition 1 in Ying and Zhou (2017) shows that φ(y, ) satisﬁes the following self-bounding property

|φ (y, s)| 1+α

α (1 + α)1+ 1

α α L 1 α φ(y, s), y Y, s R.

The Young s inequality (4.3) then implies

|φ (y, s)|2 α 2α

1+α (1 + α)2L 2 1+α φ(y, s) 2α 1+α

1+α L 2 1+α (1 + α) 2αφ(y, s) + 1 α = Aφ(y, s) + B, (4.26)

where A = 2α 1 α 1+α L 2 1+α (1 + α) and B = α 2α

1+α L 2 1+α (1 α2). (4.27)

Below we will use Part (b) of Lemma 21 to show almost boundedness of {ft}t N with high probability (Proposition 9). To this aim, we ﬁrst establish a crude bound on the iterates {ft}t N in terms of the step size sequence.

Lemma 22 Let {ft}t N be the sequence given by (1.1). Assume ηt 1 Aκ2 for all t N. Then, the following inequalities hold for all t N

ft+1 f H 2 C1

k=0 ηk and ft+1 2 C1

k=1 ηk, (4.28)

where we introduce for brevity η0 = 1 and

C1 = f H 2 2 + A 1B + 2 max sup y Y φ(y, 0), sup z Z φ(y, f H(x)) . (4.29)

Furthermore, if ηt 1 Aκ2 and ηt+1 ηt for all t N, we have

k=1 η2 kφ(yk, fk(xk)) η1 f H 2 + C2

k=1 η2 k, (4.30)

where we introduce C2 = 2 sup z Z φ(y, f H(x)) + η1κ2B. (4.31)

Proof Plugging (4.26) into (4.9) gives

ft+1 f H 2 ft f H 2 + η2 t κ2[Aφ(yt, ft(xt)) + B] + 2ηt[φ(yt, f H(xt)) φ(yt, ft(xt))]

= ft f H 2 + 2ηtφ(yt, f H(xt)) + η2 t κ2B + ηt(Aηtκ2 2)φ(yt, ft(xt)) (4.32)

ft f H 2 + 2ηtφ(yt, f H(xt)) + η2 t κ2B ft f H 2 + ηt 2φ(yt, f H(xt)) + A 1B ,

where the last two inequalities follow from the assumption ηt 1 Aκ2 . According to the deﬁnitions of C1 in (4.29) and η0, it then follows that

ft+1 f H 2 = f H 2 +

fk+1 f H 2 fk f H 2 C1

Lei, Shi and Guo

This establishes the ﬁrst inequality in (4.28). We now prove the second inequality in (4.28). Notice that (4.9) also holds if we replace f H with 0. This, together with (4.26) and ηt 1 Aκ2 , gives

ft+1 2 ft 2 + η2 t κ2[Aφ(yt, ft(xt)) + B] + 2ηt[φ(yt, 0) φ(yt, ft(xt))]

= ft 2 + 2ηtφ(yt, 0) + η2 t κ2B + ηt(Aηtκ2 2)φ(yt, ft(xt))

ft 2 + 2ηtφ(yt, 0) + ηt A 1B.

It is now clear

fk+1 2 fk 2 C1

We now show (4.30). Applying ηt 1 Aκ2 in (4.32) gives

ηtφ(yt, ft(xt)) ft f H 2 ft+1 f H 2 + 2ηtφ(yt, f H(xt)) + η2 t κ2B. (4.33)

Multiplying both sides of the above inequality by ηt and using ηt+1 ηt, we derive

η2 t φ(yt, ft(xt)) ηt ft f H 2 ft+1 f H 2 + 2η2 t φ(yt, f H(xt)) + η3 t κ2B

ηt ft f H 2 ηt+1 ft+1 f H 2 + 2η2 t φ(yt, f H(xt)) + η3 t κ2B.

Taking a summation of the above inequality gives (4.30). The proof is complete.

Based on the above lemma, Proposition 23 gives a high-probability bound on ft+1 f H 2 in terms of Pt k=1 η2 k fk f H 2. Proposition 23 is proved based on a one-step progress inequality (4.37) in terms of the RKHS distances, where the involved martingale is controlled by a Bernstein-type inequality with the dominant variance term cancelled out by the negative term 2 Pt k=1 ηk Ak existing in the one-step progress inequality.

Proposition 23 Suppose assumptions in Theorem 7 hold. Let δ (0, 1) and Cη, C3, C4 be constants deﬁned by

Cη = sup k N ηk

j=0 ηj < , (4.34)

C3 = sup zk Z

φ (yk, f H(xk))Kxk Ez[φ (y, f H(x))Kx] , C4 = 2(1 α)κ2

1 + α + 2κ2Ez |φ (y, f H(x))|2 .

Then, there exists a constant ρ1 (explicitly given in the proof and independent of t as well as the step size sequence) such that the following inequality holds with probability at least 1 δ

ft+1 f H 2 (η1κ2A + 1) f H 2 + (AC2 + B)κ2 t X

k=1 η2 k + C4 Pt k=1 η2 k f H fk 2

2C1Cηκ2L 1 α

1 2 1 Cη + 4L(C

1 2 1 κ)α+1Cη log 1

δ ρ1 . (4.36)

Convergence of Unregularized Online Learning Algorithms

Proof The assumption P t=1 η2 t < implies that Cη in (4.34) is well deﬁned since ηk Pk j=1 ηj Pk j=1 η2 j < . According to (4.8) and (4.26), we derive

fk+1 f H 2 fk f H 2+η2 kκ2 Aφ(yk, fk(xk))+B +2ηk f H fk, Ezk[φ (yk, fk(xk))Kxk]

+ 2ηk f H fk, φ (yk, fk(xk))Kxk Ezk[φ (yk, fk(xk))Kxk] . (4.37)

Using the convexity of φ followed with a summation from k = 1 to t gives

ft+1 f H 2 f H 2 + κ2 t X

k=1 η2 k Aφ(yk, fk(xk)) + B + 2

k=1 ηk E(f H) E(fk)

k=1 ηk f H fk, φ (yk, fk(xk))Kxk Ezk φ (yk, fk(xk))Kxk

(η1Aκ2 + 1) f H 2 + (AC2 + B)κ2 t X

k=1 η2 k + 2

k=1 ηk E(f H) E(fk)

k=1 ηk f H fk, φ (yk, fk(xk))Kxk Ezk φ (yk, fk(xk))Kxk , (4.38)

where the last inequality is due to (4.30). We now estimate the last term of the above inequality with Lemma 21. To this aim, we need to control both the magnitudes and variances for the martingale diﬀerence sequences. Introduce a sequence of functionals ξk, k N as follows

ξk = ηk f H fk, φ (yk, fk(xk))Kxk Ezk φ (yk, fk(xk))Kxk .

It is clear φ (yk, fk(xk))Kxk Ezk[φ (yk, fk(xk))Kxk] φ (yk, fk(xk))Kxk φ (yk, f H(xk))Kxk

+ φ (yk, f H(xk))Kxk Ez[φ (y, f H(x))Kx] + Ez[ (φ (y, f H(x)) φ (y, fk(x)))Kx ]

φ (yk, f H(xk))Kxk Ez[φ (y, f H(x))Kx] + 2Lκ sup x X |fk(x) f H(x)|α,

where we have used the Jensen s inequality in the ﬁrst step. But

|fk(x) f H(x)| = | fk f H, Kx | fk f H κ.

Combining the above two inequalities and using the deﬁnition of C3 in (4.35) give φ (yk, fk(xk))Kxk Ezk[φ (yk, fk(xk))Kxk] C3 + 2L fk f H ακα+1. (4.39)

It then follows from (4.28) and Ezk[ξk] = 0 that (note η0 = 1)

ξk Ezk[ξk] = ξk ηk f H fk φ (yk, fk(xk))Kxk Ezk φ (yk, fk(xk))Kxk

ηk C3 f H fk + 2Lηkκα+1 f H fk 1+α (4.40)

1 2 1 k 1 X

1 2 1 κ)α+1ηk k 1 X

1 2 1 Cη + 2L(C

1 2 1 κ)α+1Cη.

Lei, Shi and Guo

Here we have used the deﬁnition of Cη given in (4.34). Furthermore, according to Lemma 14 with β = 1 and the deﬁnition of C4 in (4.35), the conditional variances can be controlled by (note E[(ξ E[ξ])2] < E[ξ2] for a real-valued random variable ξ)

k=1 Ezk(ξk Ezk[ξk])2

k=1 η2 k Ezk f H fk, φ (yk, fk(xk))Kxk 2

k=1 η2 k f H fk 2κ2Ezk[|φ (yk, fk(xk))|2]

k=1 η2 k f H fk 2 4κ2L 1 α [E(fk) E(f H)] + C4 .

According to (4.28) and the deﬁnition of Cη in (4.34), we can further get

k=1 Ezk(ξk Ezk[ξk])2 4L 1 α C1κ2 t X

h η2 k k 1 X

j=0 ηj E(fk) E(f H) i + C4

k=1 η2 k f H fk 2

4L 1 α C1Cηκ2 t X

k=1 ηk E(fk) E(f H) + C4

k=1 η2 k f H fk 2.

Let ρ1 be the largest positive constant such that (such ρ1 exists since limρ 0 eρ ρ 1

(eρ1 ρ1 1)L 1 α C

2 1 κα+1 ρ1

Since C1 and C3 do not depend on the step size sequence, ρ1 is also a constant independent of the step size sequence. Plugging the above estimates on the magnitudes and variances of ξk into Part (b) of Lemma 21, we derive the following inequality with probability at least 1 δ

k=1 ξk (eρ1 ρ1 1)

1 2 1 Cη + 2L(C

1 2 1 κ)α+1Cη h 4L 1 α C1Cηκ2 t X

k=1 ηk E(fk) E(f H)

k=1 η2 k f H fk 2i +

1 2 1 Cη + 2L(C

1 2 1 κ)α+1Cη log 1

k=1 ηk E(fk) E(f H) + C4 Pt k=1 η2 k f H fk 2

4C1Cηκ2L 1 α +

1 2 1 Cη + 2L(C

1 2 1 κ)α+1Cη log 1

Plugging this inequality into (4.38) gives the stated inequality with probability at least 1 δ.

According to Proposition 9 and the assumption P k=1 η2 k < , one can show essentially that max 1 t T ft f H 2 1

2 max 1 t T ft f H 2+c log T for a constant c > 0, from which one can

establish the boundedness of the iterates with high probability (up to logarithmic factors).

Convergence of Unregularized Online Learning Algorithms

Proof of Proposition 9 We deﬁne the subset Ω ZT by

Ω= n (z1, . . . , z T ) : ft+1 f H 2 C5+C4 Pt k=1 η2 k f H fk 2

2C1Cηκ2L 1 α +C6 log T

δ for all t = 1, . . . , T o ,

where we introduce

C5 = (η1κ2A+1) f H 2 +(AC2 +B)κ2 X

k=1 η2 k, C6 = 2C3C

1 2 1 Cη + 4L(C

1 2 1 κ)α+1Cη ρ1 . (4.41)

Applying Proposition 23 together with union bounds on probabilities of events, we have Pr{Ω} 1 δ. Since P t=1 η2 t < , there exists a t2 N such that

k=t2 η2 k C1Cηκ2L 1 α .

Under the event Ω, we know

ft+1 f H 2 C5 + C4 Pt2 k=1 η2 k f H fk 2

2C1Cηκ2L 1 α + C4 Pt k=t2+1 η2 k f H fk 2

2C1Cηκ2L 1 α + C6 log T

C5 + C7 + 1

2 max t2<k t fk f H 2 + C6 log T

C5 + C7 + 1

2 max 1 k T fk f H 2 + C6 log T

δ , t = 1, . . . , T.

where we have used the inequality

C4 Pt2 k=1 η2 k f H fk 2

2C1Cηκ2L 1 α C4C1 Pt2 k=1 η2 k Pk 1 j=0 ηj

2C1Cηκ2L 1 α := C7.

Under the event Ω, it is now clear that

max 1 t T ft f H 2 C5 + C7 + 1

2 max 1 k T fk f H 2 + C6 log T

Solving the above linear inequality yields the stated inequality with C = max{2(C5 + C6 + C7), 1} with probability at least 1 δ.

We are now in a position to prove Theorem 7 on general high-probability convergence rates for a weighted average of iterates. The underlying idea is to construct a modiﬁed martingale diﬀerence sequence by imposing a constraint on the iterates, which is then estimated by applying the Azuma-Hoeﬀding inequality on martingales. Furthermore, according to Proposition 9, this modiﬁed martingale diﬀerence sequence would be identical to the original martingale diﬀerence sequence with high probability. Let IA denote the indicator function of an event A.

Lei, Shi and Guo

Proof of Theorem 7 We now introduce the following sequence of functionals ξ k, k = 1, . . . , T by

ξ k = ηk f H fk, φ (yk, fk(xk))Kxk Ezk φ (yk, fk(xk))Kxk I{ fk f H 2 C log 2T

where C is deﬁned in Proposition 9. Analogous to (4.40), we have

|ξ k| h ηk C3 f H fk + 2Lηkκα+1 f H fk 1+αi I{ fk f H 2 C log 2T

C3 + 2Lκα+1 ηk max f H fk 2, 1 I{ fk f H 2 C log 2T

C3 + 2Lκα+1 ηk C log 2T

δ := bk. (4.42)

It is clear that Ezk[ξ k] = 0 and ξ k only depends on z1, . . . , zk. According to Part (a) of Lemma 21, there exists a subset Ω = {(z1, . . . , z T ) : z1, . . . , z T Z} ZT with probability measure Pr{Ω } 1 δ

2 such that for any (z1, . . . , z T ) Ω the following inequality holds

k=1 b2 k log 2

2 C3 + 2Lκα+1 C log 2T

According to Proposition 9, there exists a subset Ω= {(z1, . . . , z T ) : z1, . . . , z T Z} ZT

with probability measure Pr{Ω} 1 δ

2 such that for any (z1, . . . , z T ) Ωthe following inequality holds

max 1 k T fk f H 2 C log 2T

Let {ξk}k be the martingale diﬀerence sequence deﬁned in the proof of Proposition 23. For any (z1, . . . , z T ) Ω Ω , we then have

k=1 ξ k C3 + 2Lκα+1 C log 2T

Under this intersection of these two events, it follows from (4.38) and the deﬁnition of C5 given in (4.41) that

k=1 ηk E(fk) E(f H) (η1Aκ2 + 1) f H 2 + (AC2 + B)κ2 T X

k=1 η2 k + 2

C5 + 2 C3 + 2Lκα+1 C log 2T

But Pr{Ω Ω } 1 δ. Therefore, the ﬁrst inequality of (2.5) holds with probability at least 1 δ and

2 + C3 + 2Lκα+1 C 2

The second inequality of (2.5) follows from the convexity of E( ). The proof is complete.

Convergence of Unregularized Online Learning Algorithms

Other than the high-probability bounds for the weighted average of iterates fη T , we can also derive similar results for the uniform average of iterates f T . If we choose the step sizes ηt = η1(t logβ t) 1

2 with β > 1, then Proposition 24 implies E( f T ) E(f H) = O(T 1

2 log 3 2 T

δ ) with probability at least 1 δ. We present the proof in the appendix due to its similarity to the proof of Theorem 7.

Proposition 24 Suppose assumptions in Theorem 7 hold. Then, for any δ (0, 1), with probability at least 1 δ

t=1 [E(ft) E(f H)]=O T 1 2+

t=1 ηt log 3 2 2T

δ and E( f T ) E(f H)=O T 1

t=1 ηt log 3 2 2T

Theorem 10 is a speciﬁc case of Proposition 25 with e T = T

2 . The step-stone in proving this proposition is the inequality (4.48) following from the one-step progress (4.47) in terms of generalization errors. The ﬁrst term on the right hand side of (4.48) can be tackled by Theorem 7 on a weighted summation of ˆAt deduced from the one-step analysis in terms of RKHS distances. The variance of the martingales PT t= t ξt can be controlled by PT t= t ηt E(ft) 2, which is then cancelled out by the third term PT t= t ηt E(ft) 2. A notable fact is that the martingale diﬀerence ξt Ezt[ ξt] is bounded by O(η e T ) for all t e T with high probability, which would be small if e T is large. We can balance the three terms on the right hand side of (4.43) by choosing an appropriate e T.

Proposition 25 Suppose that the assumptions in Theorem 7 hold. Let e T N satisfy 1 e T T. Then, there exists a constant e C independent of T and e T (explicitly given in the proof) such that for any δ (0, 1) the following inequality holds with probability at least 1 δ

E(f T+1) E(f H) e C max n T X

t= e T ηt 1, η e T ,

t= e T η1+α t o log2 3T

Proof Recall that ˆAt = E(ft) E(f H). According to the proof of Theorem 7, there exists a subset Ω= {(z1, . . . , z T ) : z1, . . . , z T Z} ZT with Pr{Ω} 1 2δ

3 such that for any (z1, . . . , z T ) Ω, we have

t=1 ηt ˆAt e C log 3 2 3T

δ and max 1 t T ft f H 2 C log 3T

where e C and C are constants independent of T and δ. Under the event of Ω, we have PT t= e T ηt ˆAt e C log 3 2 3T

δ . Therefore, there exists a t N satisfying e T t T and

t= e T ηt 1 e C log 3 2 3T

Lei, Shi and Guo

Taking expectations only with respect to z over both sides of (4.13) gives

ˆAt+1 ˆAt ηt φ (yt, ft(xt))Kxt, E(ft) + Lκ2(1+α)η1+α t 1 + α |φ (yt, ft(xt))|1+α. (4.46)

According to (4.4), the term |φ (yt, ft(xt))|1+α can be controlled by

|φ (yt, ft(xt))|1+α 2α|φ (yt, ft(xt)) φ (yt, f H(xt))|1+α + 2α|φ (yt, f H(xt))|1+α

2αL1+α| ft f H, Kxt |α(1+α) + 2α|φ (yt, f H(xt))|1+α

2αL1+ακα(1+α) ft f H α(1+α) + 2α|φ (yt, f H(xt))|1+α.

Plugging the above bound into (4.46) gives

ˆAt+1 ˆAt ηt E(ft) 2 + ηt E(ft) φ (yt, ft(xt))Kxt, E(ft)

+ a ft f H α(1+α) + b η1+α t , (4.47)

where we introduce

a = 2αL2+ακ(2+α)(1+α)(1 + α) 1 and b = 2αLκ2(1+α)(1 + α) 1 sup z Z |φ (y, f H(x))|1+α.

Taking a summation from t = t to T yields

ˆAT+1 ˆA t +

a ft f H α(1+α) + b η1+α t

t= t ηt E(ft) 2 +

where we introduce the following two sequences of functionals

ξt = ηt E(ft) φ (yt, ft(xt))Kxt, E(ft) , ξ t = ηt E(ft) φ (yt, ft(xt))Kxt, E(ft) I{ ft f H 2 C log 3T

Under the event Ω, it is clear ξt = ξ t. In the following, we will use Part (b) of Lemma 21 to estimate PT t= t ξ t. It is clear that Ezt[ ξ t] = 0 for all t N. Let t be any integer in [ e T, T]. It follows from Lemma 14 with β = 1 and the deﬁnition of C4 given in (4.35) that (note E[(ξ E[ξ])2 < E[ξ2] for a real-valued random variable ξ)

t= t Ezt ξ t Ezt[ ξ t] 2

t= t η2 t Ezt φ (yt, ft(xt))Kxt, E(ft) 2 I{ ft f H 2 C log 3T

t= t η2 t κ2 E(ft) 2Ezt |φ (yt, ft(xt))|2 I{ ft f H 2 C log 3T

t= t η2 t E(ft) 2 4κ2L 1 α E(ft) E(f H) + C4 I{ ft f H 2 C log 3T

δ }. (4.49)

Convergence of Unregularized Online Learning Algorithms

Analyzing analogously to (4.19), one can show that E is (α, Lκ1+α)-H older continuous. Then, Lemma 13 together with E(f H) = 0 shows that

ˆAt = E(ft) E(f H) Lκ1+α ft f H 1+α

1 + α . (4.50)

Plugging the above inequality into (4.49) shows

t= t Ezt ξ t Ezt[ ξ t] 2 C8 log 3T

t= t η2 t E(ft) 2 η e T C8 log 3T

t= t ηt E(ft) 2, (4.51)

where we have used t e T and introduced

C8 = 4κ3+αL1+ 1

α C 1 + α + C4.

According to (4.39), there holds ξ t Ezt[ ξ t] ηt φ (yt, ft(xt))Kxt E(ft), E(ft) I{ ft f H 2 C log 3T

ηt E(ft) φ (yt, ft(xt))Kxt Ezt[φ (yt, ft(xt))Kxt] I{ ft f H 2 C log 3T

C3 + 2L ft f H ακα+1 ηt E(ft) I{ ft f H 2 C log 3T

Due to the (α, Lκ1+α)-H older continuity of E

E(ft) = E(ft) E(f H) Lκ1+α ft f H α,

we further get ξ t Ezt[ ξ t] ηt C3 + 2Lκα+1 max ft f H α, 1 Lκ1+α ft f H αI{ ft f H 2 C log 3T

η e T C3 + 2Lκα+1 Lκ1+α C log 3T

δ := η e T C9 log 3T

We can ﬁnd a ρ2 > 0 independent of T such that (eρ2 ρ2 1)C8 ρ2C9. Applying Part (b) of Lemma 21 with the above bounds on variances and magnitudes of ξ k followed with union bounds on probabilities, we can ﬁnd a subset Ω = {(z1, . . . , z T ) : z1, . . . , z T Z} ZT

with Pr{Ω } 1 δ

3 such that for any (z1, . . . , z T ) Ω there holds (note Ezt[ ξ t] = 0)

ξ t η e T (eρ2 ρ2 1)C8 log 3T

δ PT t= t ηt E(ft) 2

η e T ρ2C9 log 3T

δ + η e T C9 log2 3T

t= t ηt E(ft) 2 + η e T C9 log2 3T

δ ρ2 , t [ e T, T]. (4.52)

Under the event Ω Ω , we can plug the above inequality with t = t, ξ t = ξt and ft f H 2 C log 3T

δ , t = 1, . . . , T into (4.48) to derive

ˆAT+1 ˆA t + a C log 3T

t= t η1+α t + η e T C9 log2 3T

t= e T ηt 1 e C log 3 2 3T

δ + a C + b log 3T

t= t η1+α t + η e T C9 log2 3T

Lei, Shi and Guo

where the last inequality is due to (4.45). This establishes the stated inequality with probability 1 δ and e C = e C + a C + b + C9ρ 1 2 .

It is clear that e C is independent of T and e T. The proof is complete.

Proof of Corollary 11 The polynomially decaying step sizes ηt = η1t θ(θ > 1

2) satisﬁes the monotonicity and P t=1 η2 t < Furthermore, we have

2 ηt 1 2 TηT = O(T θ 1) and

2 η1+α t (T + 1)η1+α T

2 2 = O(T 1 (1+α)θ).

The proof is complete if we plug the above estimates into Theorem 10.

Acknowledgments

We thank the anonymous referees for their constructive suggestions. The work described in this paper is supported partially by the National Natural Science Foundation of China (Grants No. 11401524, 11531013, 11571078, 11631015, 11771012). Yunwen Lei is also supported by the Science and Technology Innovation Committee Foundation of Shenzhen (Grant No. ZDSYS201703031748284). Lei Shi is also supported by the Joint Research Fund by National Natural Science Foundation of China and Research Grants Council of Hong Kong (Project No. 11461161006 and RGC Project No. N City U120/14), Program of Shanghai Subject Chief Scientist (Project No.18XD1400700), and Zhuo Xue program of Fudan University. The corresponding author is Zheng-Chu Guo.

Appendix A. Some Additional Proofs

Proof of Lemma 16 Let ϵ > 0 be an arbitrary number. Since limt ηt = 0 we can ﬁnd a t3 N such that ηt ϵ

2 for all t t3. Since P t=1 ηt = , we can also ﬁnd a t4 > t3 such that Pt3 k=1 η2 k ϵ

2 Pt4 k=1 ηk. Then, for any t t4, it holds

k=1 ηk 1 t X

k=1 η2 k = t X

k=1 ηk 1 t3 X

k=1 η2 k + t X

k=1 ηk 1 t X

k=t3+1 η2 k

k=1 ηk 1 t X

k=t3+1 ηk ϵ.

Since ϵ > 0 is arbitrarily chosen, the proof is complete.

Convergence of Unregularized Online Learning Algorithms

Proof of Lemma 13 Fix f, f H. Deﬁne a function g : R R by g(t) = G( f +t(f f)). It is clear that g (t) = f f, G( f + t(f f)) and

|g (t) g ( t)| = D f f, G( f + t(f f)) G( f + t(f f)) E

f f G f + t(f f) G f + t(f f)

L f f 1+α|t t|α.

It then follows that

g(1) g(0) g (0) = Z 1

0 [g (t) g (0)]dt Z 1

0 |g (t) g (0)|dt

L f f 1+α Z 1

0 tαdt = L f f 1+α

which amounts to the second inequality in (4.1)

G(f) G( f) + f f, G( f) + L f f 1+α

1 + α . (A.1)

We now turn to the ﬁrst inequality in (4.1). Fix f and f H. Deﬁne a functional L : H R by L( f) = G( f) f, G(f) . It is clear that L is a convex function and L(f) = G(f) G(f) = 0. According to the ﬁrst-order optimality condition, we know L attains its minimum at f and

L(f) = min f H L( f) = min f H

G( f) f, G(f)

h G( f) + f f, G( f) + L f f 1+α

1 + α f, G(f) i

= L( f) + min f H

h f f, G(f) G( f) + L f f 1+α

= L( f) + min f H

h f, G(f) G( f) + L f 1+α

where the inequality follows from (A.1). Taking f = L 1

α G( f) G(f) 1 α

α G( f) G(f) in the above inequality, we derive

L(f) L( f) L 1

α G( f) G(f) 1+α

α G( f) G(f) 1+α

= L( f) αL 1

α 1 + α G(f) G( f) 1+α

This establishes the ﬁrst inequality in (4.1). The proof is complete.

Proof of Proposition 24 Consider the following sequence of functionals ξk, k = 1, . . . , T by ξk = f H fk, φ (yk, fk(xk))Kxk Ezk φ (yk, fk(xk))Kxk ,

Lei, Shi and Guo

where C is deﬁned in Proposition 9. Eq. (4.42) implies that

| ξk|I{ fk f H 2 C log 2T

δ } C3 + 2Lκα+1 C log 2T

By Part (a) of Lemma 21, there exists a subset Ω = {(z1, . . . , z T ) : z1, . . . , z T Z} ZT

with probability measure Pr{Ω } 1 δ

2 such that for any (z1, . . . , z T ) Ω the following inequality holds

k=1 ξk I{ fk f H 2 C log 2T

δ } C3 + 2Lκα+1 C log 2T

According to Proposition 9, there exists a subset Ω= {(z1, . . . , z T ) : z1, . . . , z T Z} ZT

with probability measure Pr{Ω} 1 δ

2 such that for any (z1, . . . , z T ) Ωthere holds the inequality max1 k T fk f H 2 C log 2T

δ . Under the event Ω Ω , we then have

k=1 ξk C3 + 2Lκα+1 C log 2T

Furthermore, it follows from (4.37) that

2[E(fk) E(f H)] η 1 k fk f H 2 fk+1 f H 2 + ηkκ2 Aφ(yk, fk(xk)) + B + 2 ξk.

Taking a summation of the above inequality from k = 1 to T yields the following inequality under the event Ω Ω

k=1 [E(fk) E(f H)]

η 1 k+1 η 1 k fk+1 f H 2 + η 1 1 f1 f H 2

k=1 ηk Aφ(yk, fk(xk)) + B + 2

k=1 ξk. (A.3)

It follows from (4.33) that

k=1 ηkφ(yk, fk(xk)) f H 2 + 2

k=1 ηkφ(yk, f H(xk)) + κ2B

Plugging the above bound into (A.3) and using the monotonicity of ηk together with (A.2), we derive the following inequality with probability at least 1 δ

k=1 [E(fk) E(f H)] (Aκ2+η 1 1 ) f H 2+κ2 T X

2Aηk sup z φ(y, f H(x))+Bηk+ABκ2η2 k

+ C3 + 2Lκα+1 C(8T) 1 2 log 3 2 2T

The proof is complete.

Convergence of Unregularized Online Learning Algorithms

Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1 9, 2009.

L eon Bottou. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nımes, 91(8), 1991.

L eon Bottou. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9):142, 1998.

Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050 2057, 2004.

Di-Rong Chen, Qiang Wu, Yiming Ying, and Ding-Xuan Zhou. Support vector machine soft margin classiﬁers: error analysis. Journal of Machine Learning Research, 5:1143 1175, 2004.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3): 273 297, 1995.

Felipe Cucker and Ding-Xuan Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, 2007.

Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363 1399, 2016.

Joseph L Doob. Measure Theory, Graduate Texts in Mathematics. Springer, 1994.

John Duchi and Yoram Singer. Eﬃcient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10(Dec):2899 2934, 2009.

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In Conference on Learning Theory, pages 14 26, 2010.

Zheng-Chu Guo and Lei Shi. Fast and strong convergence of online learning algorithms. submitted, 2017.

Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225 1234, 2016.

Wassily Hoeﬀding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13 30, 1963.

Jyrki Kivinen, Alexander J Smola, and Robert C Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165 2176, 2004.

Lei, Shi and Guo

Yunwen Lei and Ding-Xuan Zhou. Convergence of online mirror descent algorithms. submitted, 2017.

Junhong Lin and Ding-Xuan Zhou. Learning theory of randomized Kaczmarz algorithm. Journal of Machine Learning Research, 16:3341 3365, 2015.

Junhong Lin and Ding-Xuan Zhou. Online learning algorithms can converge comparably fast as batch learning. IEEE Transactions on Neural Networks and Learning Systems, in press, 2018. doi: 10.1109/TNNLS.2017.2677970.

Junhong Lin, Raﬀaello Camoriano, and Lorenzo Rosasco. Generalization properties and implicit regularization for multiple passes sgm. In International Conference on Machine Learning, pages 2340 2348, 2016.

Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451 459, 2011.

Klaus-Robert M uller, Sebastian Mika, Gunnar Ratsch, Koji Tsuda, and Bernhard Sch olkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181 201, 2001.

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574 1609, 2009.

Jiquan Ngiam, Adam Coates, Ahbik Lahiri, Bobby Prochnow, Quoc V Le, and Andrew Y Ng. On optimization methods for deep learning. In International Conference on Machine Learning, pages 265 272, 2011.

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In International Conference on Machine Learning, pages 449 456, 2012.

Bernhard Sch olkopf and Alexander J Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT press, 2001.

Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for svm. Mathematical Programming, 127(1):3 30, 2011.

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71 79, 2013.

Steve Smale and Yuan Yao. Online learning algorithms. Foundations of Computational Mathematics, 6(2):145 170, 2006.

Steve Smale and Ding-Xuan Zhou. Online learning with markov sampling. Analysis and Applications, 7(01):87 113, 2009.

Convergence of Unregularized Online Learning Algorithms

Ingo Steinwart. On the inﬂuence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2(Nov):67 93, 2001.

Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.

Ilya Sutskever, James Martens, George Dahl, and Geoﬀrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139 1147, 2013.

Pierre Tarres and Yuan Yao. Online learning as stochastic approximation of regularization paths: optimality and almost-sure convergence. IEEE Transactions on Information Theory, 60(9):5716 5735, 2014.

Yuan Yao. On complexity issues of online learning algorithms. IEEE Transactions on Information Theory, 56(12):6470 6481, 2010.

Gui-Bo Ye and Ding-Xuan Zhou. Fully online classiﬁcation by regularization. Applied and Computational Harmonic Analysis, 23(2):198 214, 2007.

Yiming Ying and Massimiliano Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561 596, 2008.

Yiming Ying and Ding-Xuan Zhou. Online regularized classiﬁcation algorithms. IEEE Transactions on Information Theory, 52(11):4775 4788, 2006.

Yiming Ying and Ding-Xuan Zhou. Unregularized online learning algorithms with general loss functions. Applied and Computational Harmonic Analysis, 2(42):224 244, 2017.

Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In International Conference on Machine Learning, pages 919 926, 2004.

Tong Zhang. Data dependent concentration bounds for sequential prediction algorithms. In Conference on Learning Theory, pages 173 187, 2005.

Martin Zinkevich. Online convex programming and generalized inﬁnitesimal gradient ascent. In International Conference on Machine Learning, pages 928 936, 2003.