# asymptotic_analysis_of_conditioned_stochastic_gradient_descent__57f02969.pdf

Published in Transactions on Machine Learning Research (08/2023)

Asymptotic Analysis of Conditioned Stochastic Gradient Descent

Rémi Leluc remi.leluc@gmail.com CMAP, École Polytechnique Institut Polytechnique de Paris, Palaiseau (France)

François Portier francois.portier@gmail.com CREST, ENSAI École Nationale de la Statistique et de l Analyse de l Information, Rennes (France)

Reviewed on Open Review: https: // openreview. net/ forum? id= U4Xgz Rjf F1

In this paper, we investigate a general class of stochastic gradient descent (SGD) algorithms, called conditioned SGD, based on a preconditioning of the gradient direction. Using a discrete-time approach with martingale tools, we establish under mild assumptions the weak convergence of the rescaled sequence of iterates for a broad class of conditioning matrices including stochastic ﬁrst-order and second-order methods. Almost sure convergence results, which may be of independent interest, are also presented. Interestingly, the asymptotic normality result consists in a stochastic equicontinuity property so when the conditioning matrix is an estimate of the inverse Hessian, the algorithm is asymptotically optimal.

1 Introduction

Consider some unconstrained optimization problem of the following form:

min θ Rd{F(θ) = Eξ[f(θ, ξ)]},

where f is a loss function and ξ is a random variable. This key methodological problem, known under the name of stochastic programming (Shapiro et al., 2014), includes many ﬂagship machine learning applications such as empirical risk minimization (Bottou et al., 2018), adaptive importance sampling (Delyon & Portier, 2018) and reinforcement learning (Sutton & Barto, 2018). When F is diﬀerentiable, a common appproach is to rely on ﬁrst-order methods. However, in many scenarios and particularly in large-scale learning, the gradient of F may be hard to evaluate or even intractable. Instead, a random unbiased estimate of the gradient is available at a cheap computing cost and the state-of-the-art algorithm, stochastic gradient descent (SGD), just moves along this estimate at each iteration. It is an iterative algorithm, simple and computationally fast, but its convergence towards the optimum is generally slow. Conditioned SGD, which consists in multiplying the gradient estimate by some conditioning matrix at each iteration, can lead to better performance as shown in several recent studies ranging from natural gradient (Amari, 1998; Kakade, 2002) and stochastic second-order methods with quasi-Newton (Byrd et al., 2016) and (L)-BFGS methods (Liu & Nocedal, 1989) to diagonal scaling methods such as Ada Grad (Duchi et al., 2011), RMSProp (Tieleman et al., 2012), Adam (Kingma & Ba, 2014), AMSGrad (Reddi et al., 2018) and adaptive coordinate sampling (Wangni et al., 2018; Leluc & Portier, 2022). These conditioning techniques are based on diﬀerent strategies: diagonal scaling rely on feature normalization, stochastic second-order methods are concerned with minimal variance and adaptive coordinate sampling techniques aim at taking advantage of particular data structure. Furthermore, these methods proved to be the current state-of-the-art for training machine learning models (Zhang, 2004; Le Cun et al., 2012) and are implemented in widely used programming tools (Pedregosa et al., 2011; Abadi et al., 2016).

Published in Transactions on Machine Learning Research (08/2023)

Conditioned SGD generalizes standard SGD by adding a conditioning step to reﬁne the descent direction. Starting from θ0 Rd, the algorithm of interest is deﬁned by the following iteration

θk+1 = θk γk+1Ckg(θk, ξk+1), k 0,

where g(θk, ξk+1) is some unbiased gradient valued in Rd, Ck Rd d is called conditioning matrix and (γk)k 1 is a decreasing learning rate sequence. An important question, which is still open to the best of our knowledge, is to characterize the asymptotic variance of such algorithms for non-convex objective F and general estimation procedure for the conditioning matrix Ck.

Related work. Seminal works around standard SGD (Ck = Id) were initiated by Robbins & Monro (1951) and Kiefer et al. (1952). Since then, a large literature known as stochastic approximation, has developed. The almost sure convergence is studied in Robbins & Siegmund (1971) and Bertsekas & Tsitsiklis (2000); rates of convergence are investigated in Kushner & Huang (1979) and Pelletier (1998a); non-asymptotic bounds are given in Moulines & Bach (2011). The asymptotic normality can be obtained using two diﬀerent approaches: a diﬀusion-based method is employed in Pelletier (1998b) and Benaïm (1999) whereas martingale tools are used in Sacks (1958) and Kushner & Clark (1978). We refer to Nevelson & Khas minski ı (1976); Delyon (1996); Benveniste et al. (2012); Duﬂo (2013) for general textbooks on stochastic approximation.

The aforementioned results do not apply directly to conditioned SGD because of the presence of the matrix sequence (Ck)k 0 involving an additional source of randomness in the algorithm. Seminal papers dealing with the weak convergence of conditioned SGD are Venter (1967) and Fabian (1968). Within a restrictive framework (univariate case d = 1 and strong assumptions on the function F), their results are encouraging because the limiting variance of the procedure is shown to be smaller than the limiting variance of standard SGD. Venter s and Fabian s results have then been extended to more general situations (Fabian, 1973; Nevelson & Khas minski ı, 1976; Wei, 1987). In Wei (1987), the framework is still restrictive not only because the random errors are assumed to be independent and identically distributed but also because the objective F must satisfy their assumption (4.10) which hardly extends to objectives other than quadratic.

More recently, Bercu et al. (2020) have obtained the asymptotic normality as well as the eﬃciency of certain conditioned SGD estimates in the particular case of logistic regression. The previous approach has been generalized not long ago in Boyer & Godichon-Baggioni (2022) where the use of the Woodbury matrix identity is promoted to compute the Hessian inverse in the online setting. Several theoretical results, including the weak convergence of conditioned SGD, are obtained for convex objective functions. An alternative to conditioning, called averaging, developed by Polyak (1990) and Polyak & Juditsky (1992), allows to recover the same asymptotic variance as conditioned SGD. When dealing with convex objectives, the theory behind this averaging technique is a well-studied topic (Moulines & Bach, 2011; Gadat & Panloup, 2017; Dieuleveut et al., 2020; Zhu et al., 2021). However, it is inevitably associated with a large bias caused by poor initialization and requires some parameter tuning through the burn-in phase.

Contributions. The main result of this paper deals with the weak convergence of the rescaled sequence of iterates. Interestingly, our asymptotic normality result consists of the following continuity property: whenever the matrix sequence (Ck)k 0 converges to a matrix C and the iterates (θk)k 0 converges to a minimizer θ , the algorithm behaves in the same way as an oracle version in which C would be used instead of Ck. We stress that contrary to Boyer & Godichon-Baggioni (2022), no convexity assumption is needed on the objective function and no rate of convergence is required on the sequence (Ck)k 0. This is important because, in most studies, deriving a convergence rate on (Ck)k 0 requires a speciﬁc convergence rate on the iterates (θk)k 0 which, in general, is unknown at this stage of the analysis. From a more practical point of view, our main result claims that the impact of the approximation error resulting from the conditioning matrices estimation assumes a secondary role. This ﬁnding promotes the use of simple and cheap sequential algorithm to estimate the conditioning matrix which encompasses a broad spectrum of conditioned SGD methods, highlighting the applicability and generalizability of the obtained result. Another result of independent interest dealing with the almost sure convergence of the gradients is also provided.

In addition, for illustration purposes, we apply our results to the popular variational inference problem where one seeks to approximate a target density out of a parametric family by solving an optimization problem. In this framework, by optimizing the forward Kullback-Liebler divergence (Jerfel et al., 2021) and building

Published in Transactions on Machine Learning Research (08/2023)

stochastic gradients relying on importance sampling schemes (Delyon & Portier, 2018), we show that the approach has some eﬃciency properties. For the sake of completeness, we present in appendix practical ways to compute the conditioning matrix Ck and show that the resulting procedure satisﬁes the high-level conditions of our main Theorem. This yields a feasible algorithm achieving minimum variance.

To obtain these results, instead of approximating the rescaled sequence of iterates by a continuous diﬀusion (as for instance in Pelletier (1998b)), we rely on a discrete-time approach where the recursion scheme is directly analyzed (as for instance in Delyon (1996)). More precisely, the sequence of iterates is studied with the help of an auxiliary linear algorithm whose limiting distribution can be deduced from the central limit theorem for martingale increments (Hall & Heyde, 1980). The limiting variance is derived from a discrete time matrix-valued dynamical system algorithm. It corresponds to the solution of a Lyapunov equation involving the matrix C. It allows a special choice for C which guarantees an optimal variance. Finally, a particular recursion is identiﬁed to examine the remaining part. By studying it on a particular event, this part is shown to be negligible.

Outline. Section 2 introduces the framework of standard SGD with asymptotic results. Section 3 is dedicated to conditioned SGD: it ﬁrst presents popular optimization methods that fall in the considered framework and then presents our main results, namely the weak convergence and asymptotic optimality. Section 4 gathers practical implications of the main results for machine learning models in the framework of variational inference and Section 5 concludes the paper with a discussion of avenues for further research. Technical proofs, additional propositions and numerical experiments are available in the appendix.

2 Mathematical background

In this section, the mathematical background of stochastic gradient descent (SGD) methods is presented and illustrated with the help of some examples. Then, to motivate the use of conditioning matrices, we present a known result from Pelletier (1998b) about the weak convergence of SGD.

2.1 Problem setup

Consider the problem of ﬁnding a minimizer θ Rd of a function F : Rd R, that is,

θ arg min θ Rd F(θ).

In many scenarios and particularly in large scale learning, the gradient of F cannot be fully computed and only a stochastic unbiased version of it is available. The SGD algorithm moves the iterate along this direction. To increase the eﬃciency, the random generators used to derive the unbiased gradients might evolve during the algorithm, e.g., using the past iterations. To analyse such algorithms, we consider the following probabilistic setting.

Deﬁnition 1. A stochastic algorithm is a sequence (θk)k 0 of random variables deﬁned on a probability space (Ω, F, P) and valued in Rd. Deﬁne (Fk)k 0 as the natural σ-ﬁeld associated to the stochastic algorithm (θk)k 0, i.e., Fk = σ(θ0, θ1, . . . , θk), k 0. A policy is a sequence of random probability measures (Pk)k 0, each deﬁned on a measurable space (S, S) that are adapted to Fk.

Given a policy (Pk)k 0 and a learning rates sequence (γk)k 1 of positive numbers, the SGD algorithm (Robbins & Monro, 1951) is deﬁned by the update rule

θk+1 = θk γk+1g(θk, ξk+1) with ξk+1 Pk, (1)

where g : Rd S Rd is called the gradient generator. The choice of the policy (Pk)k 0 in SGD is important as it can impact the convergence speed, generalization performance, and eﬃciency of the optimization algorithm. While most classical approaches rely on uniform sampling and mini-batch sampling, it may be more eﬃcient to use advanced selection sampling strategy such as stratiﬁed sampling or importance sampling (see Example 1 for details). The policy (Pk)k 0 is used at each iteration to produce random gradients through the function g. Those gradients are assumed to be unbiased.

Published in Transactions on Machine Learning Research (08/2023)

Assumption 1 (Unbiased gradient). The gradient generator g : Rd S Rd is such that for all θ Rd, g(θ, ) is measurable, and we have: k 0, E [g(θk, ξk+1)|Fk] = F(θk).

We emphasize three important examples covered by the developed approach. In each case, explicit ways to generate the stochastic gradient are provided.

Example 1. (Empirical Risk Minimization) Given some observed data z1, . . . , zn Rp and a diﬀerentiable loss function ℓ: Rd Rp R, the objective function F approximates the true expected risk Ez[ℓ(θ, z)] using its empirical counterpart F(θ) = n 1 Pn i=1 ℓ(θ, zi). Classically, the gradient estimates at θk are given by the policy

g(θk, ξk+1) = θℓ(θk, ξk+1) with ξk+1

Another one, more subtle, referred to as mini-batching (Gower et al., 2019), consists in generating uniformly a set of nk samples (z1, . . . , znk) and computing the gradient as the average n 1 k Pnk j=1 θℓ(θk, zj). Note that interestingly, we allow changes of the minibatch size throughout the algorithm. Our framework also includes adaptive non-uniform sampling (Papa et al., 2015) and survey sampling (Clémençon et al., 2019), which use Pk = Pn i=1 w(k) i δzi with Fk-adapted weights satisfying Pn i=1 w(k) i = 1 for each k 0.

Example 2. (Adaptive importance sampling for variational inference) Given a target density function f, which for instance might result from the posterior distribution of some observed data, and a parametric family of samplers {qθ : θ Θ}, the aim is to ﬁnd a good approximation of f out of the family of samplers. A standard choice (Jerfel et al., 2021) for the objective function is the so called forward Kullback-Leibler divergence given by F(θ) = R log(qθ(y)/f(y))f(y)dy. Then in the spirit of adaptive importance sampling schemes (Delyon & Portier, 2018), gradient estimates are given by

g(θk, ξk+1) = θ log(qθk(ξk+1)) f(ξk+1)

qθk(ξk+1), ξk+1 qθk.

Other losses such as α-divergence (Daudel et al., 2021) or generalized method of moment (Delyon & Portier, 2018) may also be considered depending on the problem of interest. Some applications of conditioned SGD algorithm to this particular framework are considered with more details in Section 4.

Example 3. (Policy-gradient methods) In reinforcement learning (Sutton & Barto, 2018), the goal of the agent is to ﬁnd the best action-selection policy to maximize the expected reward. Policy-gradient methods (Baxter & Bartlett, 2001; Williams, 1992) use a parameterized policy {πθ : θ Θ} to optimize an expected reward function F given by F(θ) = Eξ πθ[R(ξ)] where ξ is a trajectory including nature states and selected actions. Using the policy gradient theorem, one has F(θ) = Eξ πθ [R(ξ) θ log πθ(ξ)], leading to the REINFORCE algorithm (Williams, 1992) given by

g(θk, ξk+1) = R(ξk+1) θ log πθk(ξk+1), ξk+1 πθk.

2.2 Weak convergence of SGD

This section is related to the weak convergence property of the normalized sequence of iterates (θk θ )/ γk. The working assumptions include the almost sure convergence of the sequence of iterates (θk)k 0 towards a stationary point θ . Note that, given Assumptions 1 and 2, there exist many criteria on the objective function that give such almost sure convergence. For these results, we refer to Bertsekas & Tsitsiklis (2000); Benveniste et al. (2012); Duﬂo (2013). In addition to this high-level assumption of almost sure convergence, we require the following classical assumptions. Let S++ d (R) denote the space of real symmetric positive deﬁnite matrices and deﬁne for all k 0,

wk+1 = F(θk) g(θk, ξk+1), Γk = E wk+1w k+1|Fk .

The learning rates sequence (γk)k 1 should decay to eventually anneal the noise but not too fast so that the iterates (θk)k 0 can reach interesting places in a ﬁnite time.

Published in Transactions on Machine Learning Research (08/2023)

Assumption 2 (Learning rates). The sequence of step-size is γk = αk β with β (1/2, 1].

This classical form of the step-size ensures theoretical convergence guarantee through the Robbins-Monro condition: P k γk = , P k γ2 k < . However, note that in practice, the choice of learning rate is often determined through experimentation and ﬁne-tuning to achieve the best performance on the given task.

Assumption 3 (Hessian). The Hessian matrix at stationary point is positive deﬁnite, i.e., H = 2F(θ ) S++ d (R) and the mapping θ 7 2F(θ) is continuous at θ .

The positive deﬁniteness of the Hessian matrix provides stability and robustness guarantees in the optimization process. It ensures that small perturbations or noise in the objective function or the training data do not signiﬁcantly aﬀect the convergence behavior. The positive curvature helps in conﬁning the optimization trajectory near the minimum and prevents it from getting trapped in ﬂat regions or saddle points.

The noise sequence (wk)k 1 deﬁnes a sequence of conditional covariance matrices (Γk)k 1 that is assumed to converge so that one can identify the limiting covariance Γ = E[g(θ , ξ)g(θ , ξ) ].

Assumption 4 (Covariance matrix). There exists Γ S++ d (R) such that Γk k + Γ a.s.

Finally, in order to derive a central limit theorem for the iterates of the algorithm, there is an extra need for stability which is synonymous with a uniform bound on the noise around the minimizer.

Assumption 5 (Lyapunov bound). There exist δ, ε > 0 such that:

sup k 0 E[ wk+1 2+δ 2 |Fk]1{ θk θ ε} < a.s.

Note that all these assumptions are stated in the spirit of Pelletier (1998b) making them mild and general. In particular, Assumptions 4 and 5 are similar to (A1.2) in Pelletier (1998b). More precisely, Assumption 4 is needed to identify the limiting distribution while Assumption 5 is a stability condition often referred to as the Lyapunov condition. This last condition is technical but not that strong as it is similar to the Lindeberg s condition which is necessary (Hall & Heyde, 1980) for tightness. The following result can be either derived from (Pelletier, 1998b, Theorem 1) or as a direct corollary of our main result, Theorem 2, given in Section 3.2.

Theorem 1 (Weak convergence of SGD). Let (θk)k 0 be obtained by the SGD rule (1). Suppose that Assumptions 1, 2, 3, 4, 5 are fulﬁlled and that θk θ almost surely. If moreover, (H ζI) is positive deﬁnite with ζ = 1{β=1}/2α, it holds that

1 γk (θk θ ) N(0, Σ), as k

where Σ satisﬁes the Lyapunov equation: (H ζId)Σ + Σ(H ζId) = Γ.

Several remarks are to be explored. Since Γ and (H ζI) are positive deﬁnite matrices, there exists a unique solution Σ to the Lyapunov equation (H ζId)Σ + Σ(H ζId) = Γ given by Σ = R + 0 exp[ t(H ζId)]Γ exp[ t(H ζId) ]dt. Second, the previous result can be expressed as kβ/2(θk θ ) N(0, αΣ). Hence, the fastest rate of convergence is obtained when β = 1 for which we recover the classical 1/

k-rate of a Monte Carlo estimate. In this case, the coeﬃcient α should be chosen large enough to ensure the convergence through the condition H Id/(2α) 0, but also such that the covariance matrix αΣ is small. The choice of α is discussed in the next section and should be replaced with a matrix gain.

3 The asymptotics of conditioned stochastic gradient descent

This Section ﬁrst presents practical optimization schemes that fall in the framework of conditioned SGD. Then it contains our main results, namely the weak convergence and asymptotic optimality. Another result of independent interest dealing with the almost sure convergence of the gradients and the iterates is also provided.

Published in Transactions on Machine Learning Research (08/2023)

3.1 Framework and Examples

We introduce the general framework of conditioned SGD as an extension of the standard SGD presented in Section 2. It is deﬁned by the following update rule, for k 0,

θk+1 = θk γk+1Ckg(θk, ξk+1), (2)

where the conditioning matrix Ck S++ d (R) is a Fk-measurable real symmetric positive deﬁnite matrix so that the search direction always points to a descent direction. In convex optimization, inverse of the Hessian is a popular choice but (1) it may be hard to compute, (2) it is not always positive deﬁnite and (3) it may increase the noise of SGD especially when the Hessian is ill-conditioned.

Quasi-Newton. These methods build approximations of the Hessian Ck 2F(θk) 1 with gradient-only information, and are applicable for convex and nonconvex problems. For scalability issue, variants with limited memory are the most used in practice (Liu & Nocedal, 1989). Following Newton s method idea with the secant equation, the update rule is based on pairs (sk, yk) tracking the diﬀerences of iterates and stochastic gradients, i.e., sk = θk+1 θk and yk = g(θk+1, ξk+1) g(θk, ξk+1). Let ρk = 1/(s k yk) then the Hessian updates are

Ck+1 = (I ρkyks k ) Ck(I ρkyks k ) + ρksks k .

In the deterministic setting, the BFGS update formula above is well-deﬁned as long as s k yk > 0. Such condition preserves positive deﬁnite approximations and may be obtained in the stochastic setting by replacing the Hessian matrix with a Gauss-Newton approximation and using regularization.

Adaptive methods and Diagonal scalings. These methods adapt locally to the structure of the optimization problem by setting Ck as a function of past stochastic gradients. General adaptive methods diﬀer in the construction of the conditioning matrix and whether or not they add a momentum term. Using diﬀerent representations such as dense or sparse conditioners also modify the properties of the underlying algorithm. For instance, the optimizers Adam and RMSProp maintain an exponential moving average of past stochastic gradients with a factor τ (0, 1) but fail to guarantee Ck+1 Ck. Such behaviour can lead to large ﬂuctuations and prevent convergence of the iterates. Instead, Ada Grad and AMSGrad ensure the monotonicity Ck+1 Ck.

Optimizer Gradient matrix Gk+1 m Ada Full Gk + gkg k 0 Ada Norm Gk + gk 2 2 0 Ada Diag Gk + diag(gkg k ) 0 RMSProp τGk + (1 τ)diag(gkg k ) 0 Adam [τGk + (1 τ)diag(gkg k )]/(1 τ k) m AMSGrad [τGk + (1 τ)diag(gkg k )]/(1 τ k) m

Table 1: Adaptive Gradient Methods.

Denote by gk = g(θk, ξk+1) and m [0, 1) a momentum parameter. General adaptive gradient methods are deﬁned by: θk+1 = θk γk+1Ckˆgk, ˆgk = mˆgk 1 + (1 m)gk. Diﬀerent optimizers are summarized in Table 1 above. They all rely on a gradient matrix Gk which accumulates the information of stochastic gradients. The conditioning matrix is equal to Ck = G 1/2 k except for AMSGrad which uses Ck = max{Ck 1; G 1/2 k }. Starting from G0 = δI with δ > 0, Gk+1 is updated either in a dense or sparse (diagonal) manner or using an exponential moving average. Note that conditioned SGD methods also include schemes with general estimation of the matrix Ck such as Hessian sketching (Gower et al., 2016) or Jacobian sketching (Gower et al., 2021).

A common assumption made in the literature of adaptive methods is that conditioning matrices are wellbehaved in the sense that their eigenvalues are bounded in a ﬁxed interval. This property is easy to check for diagonal matrices and can always be implemented in practice using projection.

Published in Transactions on Machine Learning Research (08/2023)

3.2 Main result

Similarly to standard SGD, it is interesting to search for an appropriate rescaled process to obtain some convergence rate and asymptotic normality results. In fact the only additional assumption needed, compared to SGD, is the almost sure convergence of the sequence (Ck)k 0. This makes Theorem 1 a particular case of the following Theorem which is the main result of the paper (the proof is given in Appendix A.1).

Theorem 2 (Weak convergence of Conditioned SGD). Let (θk)k 0 be obtained by conditioned SGD (2). Suppose that Assumptions 1, 2, 3, 4, 5 are fulﬁlled and that θk θ almost surely. If moreover, Ck C S++ d (R) almost surely and all the eigenvalues of (CH ζI) are positive with ζ = 1{β=1}/2α, it holds that

1 γk (θk θ ) N(0, ΣC), as k ,

where ΣC satisﬁes: (CH ζId) ΣC + ΣC (CH ζId) = CΓC .

Sketch of the proof. The idea of the proof is to rely on the following bias-variance decomposition. Remark that the diﬀerence k = θk θ is subjected to the iteration:

k+1 = k γk+1Ck F(θk) + γk+1Ckwk+1, k 0.

In a similar spirit as in Delyon (1996), we use the Taylor approximation F(θk) = F(θ ) + H(θk θ ) + o(θk θ ) H(θk θ ) to deﬁne the following auxiliary linear stochastic algorithm which carries the same variance as the main algorithm,

e k+1 = e k γk+1K e k + γk+1Ckwk+1, k 1,

where K = CH. As a ﬁrst step we establish the weak convergence of e k+1 using discrete martingale tools. Note that the analysis is made possible because the matrix K is ﬁxed along this algorithm. As a second step, we prove that the diﬀerence ( k e k), which represents some bias term, is negligible.

Comparison with previous works. Theorem 2 stated above is comparable to Theorem 1 given in Pelletier (1998b). However, our result on the weak convergence cannot be recovered from the one of Pelletier (1998b) due to their Assumption (A1.2) about convergence rates. Indeed, this assumption would require that the sequence (Ck)k 0 converges towards C faster than γk. This condition is either hardly meet in practice or diﬃcult to check. Unlike this prior work, our result only requires the almost sure convergence of the sequence (Ck)k 0. In a more restrictive setting of convex objective and online learning framework, i.e. in which data becomes available in a sequential order, another way to obtain the weak convergence of the rescaled sequence of iterates (θk θ )/ γk is to rely on the results of Boyer & Godichon-Baggioni (2022). However, once again, their work rely on a particular convergence rate for the matrix sequence (Ck)k 0. This implies the derivation of an additional result on the almost sure convergence rate of the iterates. To overcome all these issues, we show in Appendix B that our conditions on the matrices Ck are easily satisﬁed in common situations.

3.3 Asymptotic optimality of Conditioned SGD

The best conditioning matrix C that could be chosen regarding the asymptotic variance is speciﬁed in the next proposition whose proof is given in the supplementary material (Appendix C.3).

Proposition 1 (Optimal choice). The choice C = H 1 is optimal in the sense that ΣC ΣC for all C CH. Moreover, we have ΣC = H 1ΓH 1.

Another remarkable result, which directly follows from the Theorem 2 is now stated as a corollary.

Corollary 1 (Asymptotic optimality). Under the assumptions of Theorem 2, if γk = 1/k and C = H 1, then

k(θk θ ) N(0, H 1ΓH 1), as k .

Published in Transactions on Machine Learning Research (08/2023)

Moreover, let (Z1, . . . , Zd) N(0, Id) and (λk)k=1,...,d be the eigenvalues of the matrix H 1/2ΓH 1/2, we have the convergence in distribution:

k(F(θk) F(θ ))

k=1 λk Z2 k, as k .

This result shows the success of the proposed approach as the asymptotic variance is the optimal one. It provides the user a practical choice for the sequence of rate, γk = 1/k and also removes the assumption that 2αH Id which is usually needed in SGD (see Theorem 1). Concerning the almost sure convergence of the conditioning matrices, we provide in Appendix B an explicit way to ensure that Ck H 1.

The above statement also provides insights about the convergence speed. It claims that the convergence rate of F(θk) towards the optimum F(θ ), in 1/k, is faster than the convergence rate of the iterates, in 1/

k. Another important feature, which is a consequence of Proposition 1, is that the eigenvalues (λk)k=1,...,d that appear in the limiting distribution are the smallest ones among all the other possible version of conditioned SGD (deﬁned by the matrix C).

3.4 Convergence of the iterates (θk) of Conditioned SGD

To apply both Theorem 2 and Corollary 1, it remains to check the almost sure convergence of the iterates. In a non-convex setting, the iterates of stochastic ﬁrst-order methods can only reach local optima, i.e. the iterates are expected to converge to the following set S = {θ Rd : F(θ) = 0}. Going in this direction, we ﬁrst prove the almost sure convergence of the gradients towards zero for general conditioned SGD methods under mild assumptions. This theoretical result may be of independent interest. Under a condition on S, one may uniquely identify a limit point θ and consider the event {θk θ } which is needed for the weak convergence results. The next analysis is based on classical assumptions which are used in the literature to obtain the convergence of standard SGD.

Assumption 6 (L-smooth). L > 0 : θ, η Rd, F(θ) F(η) 2 L θ η 2.

Assumption 7 (Lower bound). F R : θ Rd, F F(θ).

To handle the noise of the stochastic estimates, we consider a weak growth condition, related to the notion of expected smoothness as introduced in Gower et al. (2019) (see also Gazagnadou et al. (2019); Gower et al. (2021)). In particular, we extend the condition of Gower et al. (2019) to our general context in which the sampling distributions are allowed to change along the algorithm.

Assumption 8 (Growth condition). With probability 1, there exist 0 L, σ2 < such that for all θ Rd, k N, E g(θ, ξk+1) 2 2|Fk 2L(F(θ) F ) + σ2.

This almost-sure bound on the stochastic noise E g(θ, ξk) 2 2|Fk 1 is key in the analysis of the conditioned SGD algorithm. This weak growth condition on the stochastic noise is general and can be achieved in practice with a general Lemma available in the supplement (Appendix C.4). Note that Assumption 8, often referred to as a growth condition, is mild since it allows the noise to be large when the iterate is far away from the optimal point. In that aspect, it contrasts with uniform bounds of the form E g(θk, ξk+1) 2 2|Fk σ2 for some deterministic σ2 > 0 (see Nemirovski et al. (2009); Nemirovski & Yudin (1983); Shalev-Shwartz et al. (2011)). Observe that such uniform bound is recovered by taking L = 0 in Assumption 8 but cannot hold when the objective function F is strongly convex (Nguyen et al., 2018). Besides, fast convergence rates have been derived in Schmidt & Roux (2013) under the strong-growth condition: E[ g(θ, ξk+1) 2 2|Fk] M F(θ) 2 2 for some M > 0. Similarly to our growth condition, Bertsekas & Tsitsiklis (2000) and Bottou et al. (2018) performed an analysis under the condition E[ g(θ, ξk+1) 2 2|Fk] M F(θ) 2 2 + σ2 for M, σ2 > 0. Under Assumptions 6 and 7, we have F(θ) 2 2 2L (F(θ) F(θ )) (Gower et al., 2019, Proposition A.1) so our growth condition is less restrictive. If F satisﬁes the Polyak-Lojasiewicz condition (Karimi et al., 2016), then our growth condition becomes a bit stronger. Another weak growth condition has been used for a non-asymptotic study in Moulines & Bach (2011). The success of conditioned SGD relies on the following extended Robbins-Monro condition which ensures a control on the eigenvalues of the conditioning matrices.

Published in Transactions on Machine Learning Research (08/2023)

Assumption 9 (Eigenvalues). Let (µk)k 1 and (νk)k 1 be positive sequences such that: k 1, µk Id Ck 1 νk Id; P k γkνk = + ; P k(γkνk)2 < + ; lim supk νk/µk < a.s.

The last condition deals with the ratio (νk/µk) which may be seen as a conditioned number and ensures that the matrices Ck are well-conditioned. The following Theorem reveals that all these assumptions are suﬃcient to ensure the almost sure convergence of the gradients towards zero.

Theorem 3 (Almost sure convergence). Suppose that Assumptions 1, 6, 7, 8, 9 are fulﬁlled. Then (θk)k 0 obtained by conditioned SGD (2) satisﬁes F(θk) 0 as k a.s.

Other convergence results concerning the sequence of iterates towards global minimizers may be obtained by considering stronger assumptions such as convexity or that F is coercive and the level sets of stationary point S {θ, F(θ) = y} are locally ﬁnite for every y Rd (see Gadat et al. (2018)). In our analysis, the proof of Theorem 3 reveals that θk+1 θk 0 in L2 and almost surely. Thus, as soon as the stationary points are isolated, the sequence of iterates will converge towards a unique stationary point θ Rd. This result is stated in the next Corollary.

Corollary 2 (Almost sure convergence). Under the assumptions of Theorem 3, assume that F is coercive and let (θk)k 0 be the sequence of iterates obtained by the conditioned SGD (2), then d(θk, S) 0 as k . In particular, if S is a ﬁnite set, (θk) converges to some θ S.

4 Asymptotic optimality in Adaptive importance sampling

The aim of this section is to demonstrate that statistical eﬃciency can be ensured in variational inference problems through the combination of adaptive importance sampling and conditioned SGD. While wellknown results regarding the asymptotic optimality of maximum likelihood estimates (MLE) obtained from conditioned SGD (see for instance Amari (1998) or Bercu et al. (2020)) are initially revisited, attention is subsequently shifted towards the variational inference topic relying on adaptive sampling schemes methods. A novel result is then presented, asserting that even within this challenging framework, conditioned SGD allows for the recovery of a certain statistical eﬃciency.

4.1 Maximum likelihood estimation

Assume that (Xk)k 1 is an independent sequence of random variables with distribution q . Consider a parametric family {qθ : θ Θ} from which we aim to obtain an estimate of q . We further assume that the model is well-speciﬁed, i.e. q = qθ for some θ Θ. The MLE is given by

ˆθn arg max θ Θ

i=1 log(qθ(Xi)).

Under suitable condition (van der Vaart, 1998), it is well known that ˆθn is eﬃcient, meaning it is asymptotically unbiased and has the smallest achievable variance. The Cramer-Rao bound is given by the inverse of the Fisher information matrix, denoted by I 1 and deﬁned as

I = Z θ log(qθ ) θ log(qθ ) qθ dλ.

Unfortunately, the estimate ˆθn is often unknown in closed-form, requiring the use of a sequential procedure for approximation. This raises the further question of whether the estimate obtained through the sequential procedure achieves the eﬃciency bound. When using standard SGD without conditioning, the update rule is θk+1 = θk γk+1 log(qθk(Xk+1)). However, the optimal variance bound is not achieved in this case. To recover eﬃciency, one can rely on conditioned SGD, incorporating a conditioning matrix that estimates the inverse of the Hessian. In light of the deﬁnition of the Fisher information matrix, the conditioning matrix can be estimated iteratively using at each step a new sample Xk+1 as follows

Ik+1 = (1 γk+1)Ik + γk+1 log(qθk(Xk)) θ log(qθk(Xk))

Published in Transactions on Machine Learning Research (08/2023)

and then relying on the CSGD algorithm with update rule θk+1 = θk γk+1I 1 k log(qθk(Xk+1)). As a consequence of Theorem 2, under stipulated assumptions, one can recover the optimal bound I 1 as the asymptotic variance of

4.2 Adaptive importance sampling

Consider the variational inference problem where the aim is to approximate a target distribution q = qθ based on a family of density {qθ : θ Θ}. Unlike the previous statistical framework, one does not have access to random variables distributed according to q . Instead, one can usually evaluate the target function q . More background about this type of problem might be found in Zhang et al. (2018). In the following, we show that conditioned SGD methods allow to achieve the same variance as the optimal variance described in the previous statistical setting. To the best of our knowledge, this result is novel and has potential implications in variational inference problems using forward KL (or α-divergence) as described in Jerfel et al. (2021) and Section 5.2 in Zhang et al. (2018). Consider the objective function deﬁned as the Kullback-Liebler divergence between a sampler qθ and the target distribution q , i.e.,

F(θ) = Z log(qθ/q )q dλ.

Under regularity conditions, the gradient and Hessian are respectively written as θF(θ) = Eq [ θ log(qθ)] and 2 θF(θ) = Eq [ 2 θ log(qθ)]. Stochastic gradients can be deﬁned using adaptive importance samplingbased estimate as in Delyon & Portier (2018). Given the current iterate θk, one needs to generate Xk+1 from qθk and compute the (unbiased) stochastic gradient gk+1 = vk+1 k+1 where vk+1 = q (Xk+1)/qθk(Xk+1) and k+1 = θ log(qθk(Xk+1)). Based on our almost sure convergence result, one can obtain that θk θ

and then deduce Γ = lim k E[gk+1g k+1] = lim k

Z θ log(qθk) θ log(qθk) q 2

qθk dλ = I,

where the value I for the limit comes from replacing qθk by its limit qθ = q . The choice of the conditioning matrix Ck may be done using an auxiliary algorithm of the following form

Ck+1 = (1 γk+1)Ck + γk+1vk+1 k+1 k+1.

It can be shown that the sequence of conditioning matrices (Ck) converges to I. Thus, Theorem 2 implies that conditioned SGD is eﬃcient in this framework as it matches the lower bound of the previous less restrictive statistical framework in which qk = q . Similar computations, left for future work, may be performed to investigate if the same optimal variance can be achieved with more general similarity measures such as α-divergences (Daudel et al., 2021).

5 Conclusion and Discussion

We derived an asymptotic theory for Conditioned SGD methods in a general non-convex setting. Compared to standard SGD methods, the only additional assumption required to obtain the weak convergence is the almost sure convergence of the conditioning matrices. The use of appropriate conditioning matrices with the help of Hessian estimates is the key to achieve asymptotic optimality. While our study focuses on the weak convergence of the rescaled sequence of iterates - an appropriate tool to deal with eﬃciency issues since algorithms can be easily compared through their asymptotic variances - it would be interesting to complement our asymptotic results with concentration inequalities. This research direction, left for future work, may be done at the cost of extra assumptions, e.g., strong convexity of the objective function combined with bounded gradients. Furthermore, by using some recent results on the behavior of adaptive gradient methods in non-convex settings (Daneshmand et al., 2018; Staib et al., 2019; Antonakopoulos et al., 2022), another research direction would be to extend the current weak convergence analysis to edge cases where the objective function possesses saddle points.

From a practical standpoint, the approach proposed in Appendix B may not be computationally optimal as it requires eigenvalue decomposition. However, conditioned SGD methods and especially stochastic secondorder methods do not actually require the full computation of a matrix decomposition but rely on matrixvector products which may be performed in O(d2) operations. Futhermore, using low-rank approximation

Published in Transactions on Machine Learning Research (08/2023)

with BFGS algorithm (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970) and its variant L-BFGS (Liu & Nocedal, 1989), those algorithms approximately invert Hessian matrices in O(d) operations. More recently, this technique was extended to the online learning framework (Schraudolph et al., 2007) and a purely stochastic setting (Moritz et al., 2016). Similarly, the diﬀerent adaptive optimizers presented in Section 3.1 are concerned with both fast computations and high precision. Designing an eﬃcient conditioned SGD algorithm involves a careful trade-oﬀbetween the low-memory storage of the scaling matrix representation Ck and the quality of its approximation of either the inverse Hessian 2F(θ ) 1 or the information brought in by the underlying geometry of the problem.

Acknowledgments

The authors are grateful to the Associate Editor and three anonymous Reviewers for their many valuable comments and interesting suggestions.

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeﬀrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoﬀrey Irving, Michael Isard, et al. Tensorﬂow: A system for large-scale machine learning. In 12th {USENIX} symposium on operating systems design and implementation ({OSDI} 16), pp. 265 283, 2016.

Shun-Ichi Amari. Natural gradient works eﬃciently in learning. Neural computation, 10(2):251 276, 1998.

Kimon Antonakopoulos, Panayotis Mertikopoulos, Georgios Piliouras, and Xiao Wang. Adagrad avoids saddle points. In International Conference on Machine Learning, pp. 731 771. PMLR, 2022.

Jonathan Baxter and Peter L Bartlett. Inﬁnite-horizon policy-gradient estimation. Journal of Artiﬁcial Intelligence Research, 15:319 350, 2001.

Michel Benaïm. Dynamics of stochastic approximation algorithms. In Seminaire de probabilites XXXIII, pp. 1 68. Springer, 1999.

Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media, 2012.

Bernard Bercu, Antoine Godichon, and Bruno Portier. An eﬃcient stochastic newton algorithm for parameter estimation in logistic regressions. SIAM Journal on Control and Optimization, 58(1):348 367, 2020.

Dimitri P Bertsekas and John N Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627 642, 2000.

M Émile Borel. Les probabilités dénombrables et leurs applications arithmétiques. Rendiconti del Circolo Matematico di Palermo (1884-1940), 27(1):247 271, 1909.

Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. Siam Review, 60(2):223 311, 2018.

Claire Boyer and Antoine Godichon-Baggioni. On the asymptotic rate of convergence of stochastic newton algorithms and their weighted averaged versions. Computational Optimization and Applications, pp. 1 52, 2022.

Charles George Broyden. The convergence of a class of double-rank minimization algorithms 1. general considerations. IMA Journal of Applied Mathematics, 6(1):76 90, 1970.

Richard H Byrd, Samantha L Hansen, Jorge Nocedal, and Yoram Singer. A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008 1031, 2016.

Published in Transactions on Machine Learning Research (08/2023)

Stephan Clémençon, Patrice Bertail, Emilie Chautru, and Guillaume Papa. Optimal survey schemes for stochastic gradient descent with applications to m-estimation. ESAIM: Probability and Statistics, 23: 310 337, 2019.

Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. In International Conference on Machine Learning, pp. 1155 1164. PMLR, 2018.

Kamélia Daudel, Randal Douc, and François Portier. Inﬁnite-dimensional gradient-based descent for alphadivergence minimisation. The Annals of Statistics, 49(4):2250 2270, 2021. doi: 10.1214/20-AOS2035. URL https://doi.org/10.1214/20-AOS2035.

Bernard Delyon. General results on the convergence of stochastic algorithms. IEEE Transactions on Automatic Control, 41(9):1245 1255, 1996.

Bernard Delyon and François Portier. Asymptotic optimality of adaptive importance sampling. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3138 3148. Curran Associates Inc., 2018.

Bernard Delyon and François Portier. Safe adaptive importance sampling: A mixture approach. The Annals of Statistics, 49(2):885 917, 2021.

Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and markov chains. Annals of Statistics, 48(3):1348 1382, 2020.

Dheeru Dua and Casey Graﬀ. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ ml.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(Jul):2121 2159, 2011.

Marie Duﬂo. Random iterative models, volume 34. Springer Science & Business Media, 1st edition, 2013.

Vaclav Fabian. On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, 39(4):1327 1332, 1968.

Vaclav Fabian. Asymptotically eﬃcient stochastic approximation; the rm case. The Annals of Statistics, 1 (3):486 495, 1973.

Roger Fletcher. A new approach to variable metric algorithms. The computer journal, 13(3):317 322, 1970.

Sébastien Gadat and Fabien Panloup. Optimal non-asymptotic bound of the ruppert-polyak averaging without strong convexity. ar Xiv preprint ar Xiv:1709.03342, 2017.

Sébastien Gadat, Fabien Panloup, Soﬁane Saadane, et al. Stochastic heavy ball. Electronic Journal of Statistics, 12(1):461 529, 2018.

Nidham Gazagnadou, Robert Gower, and Joseph Salmon. Optimal mini-batch and step sizes for saga. In International conference on machine learning, pp. 2142 2150. PMLR, 2019.

Donald Goldfarb. A family of variable-metric methods derived by variational means. Mathematics of computation, 24(109):23 26, 1970.

Robert Gower, Donald Goldfarb, and Peter Richtárik. Stochastic block bfgs: Squeezing more curvature out of data. In International Conference on Machine Learning, pp. 1869 1878. PMLR, 2016.

Robert M Gower, Peter Richtárik, and Francis Bach. Stochastic quasi-gradient methods: Variance reduction via jacobian sketching. Mathematical Programming, 188(1):135 192, 2021.

Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. Sgd: General analysis and improved rates. In International Conference on Machine Learning, pp. 5200 5209. PMLR, 2019.

Published in Transactions on Machine Learning Research (08/2023)

P. Hall and C.C. Heyde. Martingale Limit Theory and Its Application. Probability and mathematical statistics. Academic Press, 1980. ISBN 9781483240244. URL https://books.google.fr/books?id= wd Lajg EACAAJ.

David Harrison Jr and Daniel L Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of environmental economics and management, 5(1):81 102, 1978.

Ghassen Jerfel, Serena Wang, Clara Wong-Fannjiang, Katherine A Heller, Yian Ma, and Michael I Jordan. Variational reﬁnement for importance sampling using the forward kullback-leibler divergence. In Uncertainty in Artiﬁcial Intelligence, pp. 1819 1829. PMLR, 2021.

Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pp. 1531 1538, 2002.

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795 811. Springer, 2016.

Hassan K Khalil. Nonlinear systems; 3rd ed. Prentice-Hall, Upper Saddle River, NJ, 2002. URL https: //cds.cern.ch/record/1173048.

Jack Kiefer, Jacob Wolfowitz, et al. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462 466, 1952.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014.

Harold J Kushner and Hai Huang. Rates of convergence for stochastic approximation type algorithms. SIAM Journal on Control and Optimization, 17(5):607 617, 1979.

Harold Joseph Kushner and Dean S Clark. Stochastic approximation methods for constrained and unconstrained systems. 1978.

Yann A Le Cun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Eﬃcient backprop. In Neural networks: Tricks of the trade, pp. 9 48. Springer, 2012.

Rémi Leluc and François Portier. Sgd with coordinate sampling: Theory and practice. Journal of Machine Learning Research, 23(342):1 47, 2022. URL http://jmlr.org/papers/v23/21-1240.html.

Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1):503 528, 1989.

Philipp Moritz, Robert Nishihara, and Michael Jordan. A linearly-convergent stochastic l-bfgs algorithm. In Artiﬁcial Intelligence and Statistics, pp. 249 258. PMLR, 2016.

Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp. 451 459, 2011.

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574 1609, 2009.

Arkadi Semenovich Nemirovski and David Borisovich Yudin. Problem complexity and method eﬃciency in optimization. 1983.

Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.

Mikhail Borisovich Nevelson and Rafail Zalmanovich Khas minski ı. Stochastic approximation and recursive estimation, volume 47. American Mathematical Soc., 1976.

Published in Transactions on Machine Learning Research (08/2023)

Lam Nguyen, Phuong Ha Nguyen, Marten Dijk, Peter Richtárik, Katya Scheinberg, and Martin Takác. Sgd and hogwild! convergence without the bounded gradients assumption. In International Conference on Machine Learning, pp. 3750 3758. PMLR, 2018.

Guillaume Papa, Pascal Bianchi, and Stéphan Clémençon. Adaptive sampling for incremental optimization using stochastic gradient descent. In International Conference on Algorithmic Learning Theory, pp. 317 331. Springer, 2015.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011.

Mariane Pelletier. On the almost sure asymptotic behaviour of stochastic algorithms. Stochastic processes and their applications, 78(2):217 244, 1998a.

Mariane Pelletier. Weak convergence rates for stochastic approximation with application to multiple targets and simulated annealing. Annals of Applied Probability, pp. 10 44, 1998b.

Boris T Polyak. A new method of stochastic approximation type. Avtomatika i telemekhanika, (7):98 107, 1990.

Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838 855, 1992.

Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400 407, 1951.

Herbert Robbins and David Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Optimizing methods in statistics, pp. 233 257. Elsevier, 1971.

Jerome Sacks. Asymptotic distribution of stochastic approximation procedures. The Annals of Mathematical Statistics, 29(2):373 405, 1958.

Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. ar Xiv preprint ar Xiv:1308.6370, 2013.

Nicol N Schraudolph, Jin Yu, and Simon Günter. A stochastic quasi-newton method for online convex optimization. In Artiﬁcial intelligence and statistics, pp. 436 443. PMLR, 2007.

Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated subgradient solver for svm. Mathematical programming, 127(1):3 30, 2011.

David F Shanno. Conditioning of quasi-newton methods for function minimization. Mathematics of computation, 24(111):647 656, 1970.

Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2014.

Matthew Staib, Sashank Reddi, Satyen Kale, Sanjiv Kumar, and Suvrit Sra. Escaping saddle points with adaptive gradient methods. In International Conference on Machine Learning, pp. 5956 5965. PMLR, 2019.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.

Tijmen Tieleman, Geoﬀrey Hinton, et al. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26 31, 2012.

Published in Transactions on Machine Learning Research (08/2023)

A. W. van der Vaart. Asymptotic statistics, volume 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 1998.

J. H. Venter. An extension of the robbins-monro procedure. The Annals of Mathematical Statistics, 38(1): 181 190, 1967.

Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsiﬁcation for communication-eﬃcient distributed optimization. In Advances in Neural Information Processing Systems, pp. 1299 1309, 2018.

CZ Wei. Multivariate adaptive stochastic approximation. The Annals of Statistics, 15(3):1115 1130, 1987.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229 256, 1992.

Mishael Zedek. Continuity and location of zeros of linear combinations of polynomials. Proceedings of the American Mathematical Society, 16(1):78 84, 1965.

Cheng Zhang, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. Advances in variational inference. IEEE transactions on pattern analysis and machine intelligence, 41(8):2008 2026, 2018.

Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the twenty-ﬁrst international conference on Machine learning, pp. 116, 2004.

Wanrong Zhu, Xi Chen, and Wei Biao Wu. Online covariance matrix estimation in stochastic gradient descent. Journal of the American Statistical Association, pp. 1 12, 2021.

Published in Transactions on Machine Learning Research (08/2023)

Appendix: Asymptotic Analysis of Conditioned Stochastic Gradient Descent

Appendix A contains the mathematical proofs of the main results while Appendix B is dedicated to a practical procedure and numerical experiments for illustration purposes. Appendix C gathers some technical auxiliary results and additional propositions.

A Proofs of main results 16

A.1 Proof of the weak convergence (Theorem 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

A.2 Proof of the almost sure convergence (Theorem 3) . . . . . . . . . . . . . . . . . . . . . . . . 26

A.3 Proof of Corollary 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

B Practical procedure 28

B.1 Construction of the conditioning matrix Ck . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

B.2 Numerical illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

C Auxiliary results 32

C.1 Robbins-Siegmund Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

C.2 Auxiliary lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

C.3 Additional propositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

C.4 Auxiliary results on expected smoothness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

A Proofs of main results

A.1 Proof of the weak convergence (Theorem 2)

For any matrix A Rd d, we denote by A = max u 2=1 Au 2 the operator norm associated to the Euclidian norm and by ρ(A) the spectral radius of A, i.e., ρ(A) = max{|λ1|, . . . , |λn|} where λ1, . . . , λn are the eigenvalues of A. We also introduce λmin(A) = min{|λ1|, . . . , |λn|}. Note that when A is symmetric A = ρ(A) and recall that the spectral radius is a (submultiplicative) norm on the real linear space of symmetric matrices.

Structure of the proof.

In virtue of Assumption 5, there exist δ, ε > 0 such that almost surely

sup k 0 E[ wk+1 2+δ 2 |Fk]1{ θk θ 2 ε} < . (3)

An important event in the following is

Ak = { θk θ 2 ε, Ck < 2 C , Γk 2 Γ }.

By assumption, this event has probability going to 1.

Introduce the diﬀerence

and remark that k is subjected to the iteration:

k+1 = k γk+1Ck F(θk) + γk+1Ckwk+1, k 0,

Published in Transactions on Machine Learning Research (08/2023)

with wk+1 = F(θk) g(θk, ξk+1). We have by assumption that Ck C almost surely and we can deﬁne K = limk Ck H = CH. The proof relies on the introduction of an auxiliary stochastic algorithm which follows the iteration:

e k+1 = e k γk+1K e k + γk+1Ckwk+11Ak, k 0

The previous algorithm is a linear approximation of the algorithm that deﬁnes k in the sense that F(θk) = F(θ ) + H(θk θ ) + o(θk θ ) H(θk θ ) has been linearly expanded around θ . Writing

k = e k + ( k e k),

and invoking the Slutsky lemma, the proof will be complete as soon as we obtain that

γ 1/2 k e k N(0, Σ), (4)

( k e k) = o P(γ1/2 k ). (5)

H the positive square root of the real symmetric positive deﬁnite matrix H and consider the transformation Θk =

H e k which satisﬁes

Θk+1 = Θk γk+1 e KΘk + γk+1

HCkwk+11Ak, k 1,

where e K =

H S++ d (R) is a real symmetric positive deﬁnite matrix. The sequence (Θk)k 0 is easier to study than e k because contrary to e K, the matrix K = CH is not symmetric in general unless C and H commute. In view of Assumption 3, the eigenvalues of e K are real and positive. Denote by λm (resp. λM) the smallest (resp. the largest) eigenvalue of e K, i.e.,

λm = λmin( e K), λM = λmax( e K).

Because CH is similar to e K, they share the same eigenvalues. Since by assumption, the eigenvalues of (CH ζId) are positive, we have 2αλm > 1{β=1}. For all k 1, introduce the real symmetric matrix Ak = I γk e K. Observe that all these matrices commute, i.e., for any i, j 0, we have Ai Aj = Aj Ai. For any k, n 0, denote the matrices product Πn,k = An . . . Ak+1 if k < n Πn,k = Id if k n, Πn = Πn,0

Since the matrices Ak commute, we have Π n,k = Πn,k is also real symmetric.

Step 1. Proof of Equation (4). The random process (Θk)k 0 follows the recursion equation

Θk = AkΘk 1 + γk

HCk 1wk1Ak 1.

We have by induction

Θn = ΠnΘ0 +

HCk 1wk1Ak 1,

and the rescaled process is equal to

Θn γn = Πn γn Θ0

| {z } initial error Yn

HCk 1wk1Ak 1 | {z } sampling error Mn

Published in Transactions on Machine Learning Research (08/2023)

Bound on the initial error. Deﬁne τn = Pn k=1 γk the partial sum of the learning rates. Since Πn is symmetric, we have ρ(ΠnΘ0) ρ(Πn) Θ0 2. In view of Lemma 5, since γk 0, there exists j 1 such that

ρ(Πn) ρ(Πj) exp( λm(τn τj)).

Therefore, the initial error is bounded by

ρ(Yn) ρ(Πj) exp(λmτj) Θ0 2 exp(dn) with dn = λmτn log( γn).

Using Lemma 6, we can treat the two cases β < 1 and β = 1. On the one hand, if β < 1 then we always have dn . On the other hand, if β = 1, we have dn 1

2 γλm log(n) and the condition 2αλm 1 > 0 ensures dn . In both cases we get exp(dn) 0 and the initial error vanishes to 0.

Weak convergence of the sampling error. Consider the random process

Mn = γ 1/2 n

HCk 1wk1Ak 1.

Note that θk, Ak and Ck are Fk-measurable. As a consequence, Mn is a sum of martingale increments and we may rely on the following central limit theorem for martingale arrays.

Theorem 4. (Hall & Heyde, 1980, Corollary 3.1) Let (Wn,i)1 i n, n 1 be a triangular array of random vectors such that

E[Wn,i | Fi 1] = 0, for all 1 i n, (6)

i=1 E[Wn,i W n,i | Fi 1] V 0, in probability, (7)

i=1 E[ Wn,i 21{ Wn,i >ε} | Fi 1] 0, in probability, (8)

then, Pn i=1 Wn,i N(0, V ), as n .

We start by verifying (7). Let Dk =

HCk 1Γk 1CT k 1

H1Ak 1 Sd(R). The quadratic variation of Mn is given by

k=1 γ2 kΠn,k DkΠ n,k.

First we can check that Σn is bounded. Using the triangle inequality and since the operator norm is submultiplicative, we have

k=1 γ2 k Πn,k DkΠT n,k γ 1 n

k=1 γ2 k Dk Πn,k 2 = γ 1 n

k=1 γ2 k Dk ρ(Πn,k)2,

where we use in the last equality that Πn,k is real symmetric so Πn,k = ρ(Πn,k). On the event Ak 1, the matrices Ck 1 and Γk 1 are bounded as Ck 1 2 C and Γk 1 2 Γ leading to the following bound for the matrix Dk,

HCk 1Γk 1CT k 1

H Γk 1 Ck 1 21Ak 1 8 H Γ C 2 = UD.

Published in Transactions on Machine Learning Research (08/2023)

It follows that

k=1 γ2 kρ(Πn,k)2.

In view of Lemma 5, we shall split the summation from k = 1, . . . , j and k = j + 1, . . . , n as

k=1 γ2 kρ Πn,k 2 = γ 1 n

k=1 γ2 kρ Πn,k 2 + γ 1 n

k=j+1 γ2 kρ Πn,k 2

k=1 γ2 kρ Πn,k 2

i=k+1 (1 λmγi)2

For the ﬁrst term an, we have for all k = 1, . . . , j

ρ(Πn,k) ρ(Πn,j)

i=j+1 (1 λmγi) exp( λm(τn τj)),

which implies since (γk) is decreasing with γ1 = α that

k=1 γ2 kρ Πn,k 2 ατj exp( 2λm(τn τj)).

Therefore, similarly to the initial error term, we get

an ατj exp(2λmτj)) exp(dn) with dn = 2λmτn log(γn),

and the condition 2αλm 1 > 0 ensures dn so that an goes to 0 and is almost surely bounded by Ua.

For the second term bn, we can apply Lemma 3 and need to distinguish between the two cases:

(β = 1) If γn = α/n, since 2αλm > 1, we can apply Lemma 3 (p = 1, m = 2, λ = λmα, xj = 0, εk = α2) and obtain

2αλm 1 = Ub.

(β < 1) If γn = γ/nβ, we deduce the same as before because λm > 0.

Finally in both cases, we get

Σn UD (Ua + Ub) . (9)

We now derive the limit of Σn. We shall use a recursion equation to recover a stochastic approximation scheme. Note that

k=1 γ2 kΠn,k DkΠT n,k (10)

= γ2 n Dn + An

k=1 γ2 kΠn 1,k DkΠT n 1,k

and recognize

γnΣn = γ2 n Dn + γn 1AnΣn 1A n .

Published in Transactions on Machine Learning Research (08/2023)

Replacing the symmetric matrix An = I γn e K, we get (because Σn is bounded almost surely)

γnΣn = γ2 n Dn + γn 1(I γn e K)Σn 1(I γn e K)

= γ2 n Dn + γn 1 h Σn 1 γnΣn 1 e K γn e KΣn 1 + O(γ2 n) i .

Divide by γn to obtain

Σn = γn Dn + γn 1

h Σn 1 γn( e KΣn 1 + Σn 1 e K) + O(γ2 n) i ,

and we recognize a stochastic approximation scheme

Σn = Σn 1 γn h e KΣn 1 + Σn 1 e K Dn i + γn 1 γn

γn Σn 1 + O(γn 1γn + |γn 1 γn|)

Recall that when β < 1 we have

1 γn 1 γn 1 0, i.e., γn 1 γn

γn = o(γn).

(β = 1) If γn = α/n we get

Σn = Σn 1 α

e KΣn 1 + Σn 1 e K 1

Σn = Σn 1 α

Σn 1 + Σn 1

(β < 1) If γn = α/nβ we get

Σn = Σn 1 γn h e KΣn 1 + Σn 1 e K Dn i + o(γn).

Recall that ζ = 1{β=1}/(2α) and deﬁne e Kζ = e K ζI, so that in both cases, the recursion equation becomes

Σn = Σn 1 γn h e KζΣn 1 + Σn 1 e K ζ Dn i + o(γn).

We can vectorize this equation. The vectorization of an m n matrix A = (ai,j), denoted vec(A), is the mn 1 column vector obtained by stacking the columns of the matrix A on top of one another:

vec(A) = [a1,1, . . . , am,1, a1,2, . . . , am,2, . . . , a1,n, . . . , am,n]T .

Applying this operator to our stochastic approximation scheme gives

vec(Σn) = vec(Σn 1) γn h vec e KζΣn 1 + Σn 1 e K ζ vec(Dn) i + o(γn).

Denote by the Kronecker product, we have the following property

vec KζΣn 1 + Σn 1K ζ = Id Kζ + K ζ Id vec(Σn 1).

Deﬁne D as the almost sure limit of Dn, i.e.

D = lim n Dn =

Introduce vn = vec(Σn) and Q = Id e Kζ + e Kζ Id . We have almost surely

vn = vn 1 γn (Qvn 1 vec(D)) + γnvec(Dn D) + o(γn)

= vn 1 γn (Qvn 1 vec(D)) + εnγn

Published in Transactions on Machine Learning Research (08/2023)

where εn 0 almost surely. This is a stochastic approximation scheme with the aﬃne function h(v) = Qv

vec(D) for v Rd2. Let v be the solution of h(v) = 0 which is well deﬁned since Q = Id e Kζ + e K ζ Id

is invertible. Indeed, the eigenvalues of Q are µi + µj, 1 i, j d, where the µi, i = 1, . . . , d are the eigenvalues of e Kζ. Equivalently, the eigenvalues of Q are of the form (λi ζ) + (λj ζ) where the λi, i = 1, . . . , d are the eigenvalues of e K. Because λm > ζ, we have that Q 0. As a consequence

(vn v ) = (vn 1 v ) γn (h(vn 1) h(v )) + εnγn = (vn 1 v ) γn Q (vn 1 v ) + εnγn = Bn (vn 1 v ) + εnγn,

with Bn = (Id2 γn Q). By induction, we obtain

(vn v ) = (Bn . . . B1) (v0 v ) +

k=1 γk (Bn . . . Bk+1) εk,

Deﬁne λQ = λmin(Q) > 0 and remark that

Bn . . . Bk+1

j=k+1 (1 γjλQ).

It follows that

vn v 2 Bn . . . B1 v0 v 2 +

k=1 γk Bn . . . Bk+1 εk 2

j=1 (1 γjλQ) v0 v 2 +

j=k+1 (1 γjλQ) εk 2

Applying Lemma 3 we obtain that the right-hand side term goes to 0. The left-hand side term goes to 0 under the eﬀect of the product by deﬁnition of (γk)k 1. We therefore conclude that vn v almost surely. From easy manipulation involving vec( ) and , this is equivalent to Σn Σ, where Σ is the solution of the Lyapunov equation

( e K ζI)Σ + Σ( e K ζI) = D.

Now we turn our attention to (8). We need to show that almost surely,

k=1 γ2 k E[ Πn,k

HCk 1wk 2 21{γk Πn,k

HCk 1wk 2>εγ1/2 n } | Fk 1]1Ak 1 0.

E[γ 1 n γ2 k Πn,k

HCk 1wk 2 21{γk Πn,k

HCk 1wk 2>εγ1/2 n } | Fk 1]

ε δE[(γ 1/2 n γk Πn,k

HCk 1wk 2)2+δ | Fk 1]

ε δ(γ 1/2 n γk Πn,k

HCk 1 2+δE[ wk 2+δ 2 | Fk 1].

Let U(ω) = supk 1 E[ wk 2+δ 2 | Fk 1]1Ak 1 which is almost surely ﬁnite by Assumption 5. We get

E[γ 1 n γ2 k Πn,k

HCk 1wk 2 21{γk Πn,k

HCk 1wk 2>εγ1/2 n } | Fk 1]1Ak 1

H C 2+δ U(ω)(γ 1/2 n γkρ(Πn,k))2+δ

Published in Transactions on Machine Learning Research (08/2023)

Hence by showing that

k=1 (γ 1/2 n γkρ(Πn,k))2+δ 0,

we will obtain (8). The previous convergence can be deduced from Lemma 3 with p = 1 + δ/2, m = 2 + δ, ϵk = γδ/2 k , checking that (2 + δ)αλm > 1 + δ/2.

Step 2. Proof of Equation (5).

A preliminary step to the derivation of Equation (5) is to obtain that e k 0 almost surely. For any θ and η in Rd, we have

θ 2 = η 2 + 2η (θ η) + θ η 2

implying that for all k 0

E[ Θk+1 2|Fk] = eΘk 2 2γk+1Θ k e KΘk + γ2 k+1 E[ e KΘk Ckwk+11Ak 2|Fk].

Since (wk) is a martingale increment and because on Ak, ρ(Ck) 2ρ(C), we get

E[ e KΘk Ckwk+11Ak 2|Fk] = E[ e KΘk 2|Fk] + E[ Ckwk+11Ak 2|Fk]

λ2 M Θk 2 + ρ(Ck)2 E[ wk+11Ak 2|Fk]

λ2 M Θk 2 + 4ρ(C)2 E[ wk+11Ak 2|Fk],

Injecting this bound in the previous equality yields

E[ Θk+1 2|Fk] Θk 2(1 + γ2 k+1λ2 M) 2γk+1Θ k e KΘk + 4ρ(C)2γ2 k+1 E[ wk+1 2|Fk]1Ak.

Since, using (3),

k 0 γ2 k+1 E[ wk+1 2|Fk]1Ak sup k 0 E[ wk+1 2|Fk]1Ak

we are in position to apply the Robbins-Siegmund Theorem 6 and we obtain the almost sure convergence of P

k γk+1Θ k e KΘk and Θk 2 2 V . Because e K is positive deﬁnite, it gives that, with probability 1, P

k 0 γk+1 Θk 2 < + , from which, we deduce lim infk Θk 2 = 0. Therefore one can extract a subsequence Θk such that Θk 2 0. Using the above second condition yields V = 0 and we conclude that e k = H 1/2Θk 0.

Deﬁne the diﬀerence

Ek = k e k.

Since θ 7 2F(θ) is continous at θ , we can apply a coordinate-wise mean value theorem. Indeed, for any θ Rd, we have F(θ) = ( 1F(θ), . . . , d F(θ)) where for all j = 1, . . . , d, the partial derivatives functions j F : Rd R are Lipschitz continuous. Denote by ( j F) : Rd Rd the gradient of the partial derivative j F, i.e., ( j F)(θ) = ( 2 1,j F(θ), . . . , 2 d,j F(θ)). For any θ, η B(θ , ε), there exists ξj Rd such that

j F(θ) j F(η) = ( j F)(ξj)(θ η).

We construct a Hessian matrix by rows H(ξ) = H(ξ1, . . . , ξd) where the j-th row is equal to ( j F)(ξj) = ( 2 1,j F(ξj), . . . , 2 d,j F(ξj))

2 1,1F(ξ1) . . . 2 1,d F(ξ1) ... ... ... 2 d,1F(ξd) . . . 2 d,d F(ξd)

Published in Transactions on Machine Learning Research (08/2023)

and we can write

F(θ) F(η) = H(ξ)(θ η).

There exists ξk = (ξ(1) k , . . . , ξ(d) k ) with ξ(j) k [θ + Ek, θk] and ξ k = (ξ (1) k , . . . , ξ (d) k ) with ξ (j) k [θ + Ek, θ ] such that

F(θ + Ek) F(θk) = H(ξk)e k (12)

F(θ + Ek) = H(ξ k)Ek. (13)

Let η > 0 such that 2αλm(1 3η) > 1. This choice will come clear at the end of the reasoning. On the one hand, we have Ck C. On the other hand, using Lemma 4, the spectrum of Ck H is real and positive. Hence, we have the convergence of the eigenvalues of Ck H towards the eigenvalues of K = CH. This follows from the deﬁnition of eigenvalues as roots of the characteristic polynomial and the fact that the roots of any polynomial P C[X] are continuous functions of the coeﬃcients (Zedek, 1965). Consequently, there exists n1(ω) such that for all k n1(ω),

(1 η)λm λmin(Ck H) λmax(Ck H) (1 + η)λM. (14)

We can deﬁne n2(ω) such that for all k n2(ω)

Ak is realized. (15)

H 1 Id 0 as k , there is n3(ω) and n4(ω) such that for all k n3(ω)

H 1 Id η 1 + η λm λM , (16)

and for all k n4(ω),

H 1 1. (17)

Since γk 0, there is n5 such that for all k n5

γk+1 2ηλm (1 + η)2λ2 M . (18)

To use the previous local properties, deﬁne n0(ω) = n1(ω) n2(ω) n3(ω) n4(ω) n5 and introduce the set Ej along with its complement Ec j , deﬁned by

Ej = {ω : j n0(ω)}.

Let δ > 0 and take j 1 large enough such that P(Ec j ) δ. Invoking the Markov inequality, we have for all a > 0

P(γ 1/2 k Ek > a) = P(γ 1/2 k Ek > a, Ej) + P(γ 1/2 k Ek > a, Ec j )

P(γ 1/2 k Ek > a, Ej) + δ

γ 1/2 k a 1E[ Ek 1Ej] + δ

Because δ is arbitrary, we only need to show that for any value of j 1,

ek := E[ Ek 1Ej] = o(γ1/2 k ).

To prove this fact, we shall recognize a stochastic algorithm for the sequence ek.

Let k j and assume further that Ej is realized. We have, because of (15),

Ek+1 = k e k γk+1Ck F(θk) + γk+1K e k.

Published in Transactions on Machine Learning Research (08/2023)

Introducing e Ek =

HEk, we ﬁnd

e Ek+1 = e Ek γk+1

HCk F(θk) + γk+1

and using (12), it comes that

e Ek+1 = e Ek γk+1

HCk F(θ + Ek) γk+1

HCk H(ξk)e k + γk+1

= e Ek γk+1

HCk F(θ + Ek) + γk+1

H(K Ck H(ξk))e k.

Using Minkowski inequality, we have

e Ek+1 e Ek γk+1

HCk F(θ + Ek) + γk+1

H(K Ck H(ξk))e k .

We shall now focus on the ﬁrst term. Still on the set Ej, we have

HCk F(θ + Ek) 2

= e Ek 2 2γk+1 e Ek,

HCk F(θ + Ek) + γ2 k+1

HCk F(θ + Ek) 2 (19)

We have on the one hand using (13)

HCk F(θ + Ek) = e Ek,

HCk H(ξ k)Ek

HCk HEk + e Ek,

HCk(H(ξ k) H)Ek

Due to (14), the ﬁrst term satisﬁes

HCk HEk = e Ek,

λmin(Ck H) e Ek 2

(1 η)λm e Ek 2

The second term satisﬁes

HCk(H(ξ k) H)Ek = e Ek,

H 1 Id) e Ek

H 1 Id) e Ek

Using Cauchy-Schwarz inequality, the submultiplicativity of the norm, (14) and (16), we have e Ek,

H 1 Id) e Ek

H 1 Id e Ek 2

ηλm e Ek 2.

Finally, it follows that

HCk F(θ + Ek) (1 2η)λm e Ek 2 (20)

On the other hand using (13), (14) and (17),

HCk F(θ + Ek) 2 =

HCk H(ξ k)Ek 2

H 1) e Ek 2

λmax(Ck H)2

H 1 2 e Ek 2

(1 + η)2λ2 M e Ek 2 (21)

Published in Transactions on Machine Learning Research (08/2023)

Putting together (19), (20), (21) and using (18) gives that, on Ej,

HCk F(θ + Ek) 2

e Ek 2(1 2γk+1(1 2η)λm + γ2 k+1(1 + η)2λ2 M)

e Ek 2(1 2γk+1(1 3η)λm).

By the Minkowski inequality and the fact that (1 x)1/2 1 x/2, on Ej, it holds

e Ek+1 e Ek (1 2γk+1(1 3η)λm)1/2 + γk+1

H(K Ck H(ξk))e k

e Ek (1 γk+1(1 3η)λm) + γk+1

H (K Ck H(ξk))e k

Hence, we have shown that for any k j,

e Ek 1Ej e Ek 1Ej(1 γk+1(1 3η)λm) + γk+1

H (K Ck H(ξk))1Ej e k .

It follows that, for any k j,

ek+1 ek(1 γk+1(1 3η)λm) + γk+1

H E[ Uk e k ],

with Uk = (K Ck H(ξk))1Ej. Because with probability 1, Uk is bounded, we can apply the Lebesgue dominated convergence theorem to obtain that εk = E[ Uk 2] 0. From the Cauchy-Schwarz inequality, we get

E[ Uk e k ] εk q

E[ e k 2 2].

On the other hand, we have already shown in (9) that ρ(Σk) = Σk UD (Ua + Ub). Since e k =

H 1 γk(Yk + Mk), we have

E[ e k 2 2] 2(γk/λm)( Yk 2 2 + E[ Mk 2 2]),

where the last term is the leading term and satisﬁes

E[ Mk 2 2]] = E[Tr(Σk)] d E[ρ(Σk)].

Therefore, we have

E[ e k 2 2] γk A

for some A > 0. Consequently, for all k j,

ek+1 ek(1 γk+1(1 3η)λm) + γ3/2 k+1A

The condition 2αλm(1 3η) > 1 ensures that we can apply Lemma 3 with (mλ > p), m = 1, p = 1/2, λ = α(1 3η)λm. we ﬁnally get

lim sup k (ek/γ1/2 k ) = 0.

As a consequence, ek = o( γk), which concludes the proof.

Since γ 1/2 k

H e k N(0, Σ), we have γ 1/2 k e k N(0, eΣ) where eΣ =

H 1. Recall that Σ satisﬁes the Lyapunov equation

H ζId)Σ + Σ(

Multiplying on the left and right sides by

H 1, we get

where we recognize the following Lyapunov equation

(CH ζId)eΣ + eΣ(CH ζId) = CΓC.

Published in Transactions on Machine Learning Research (08/2023)

A.2 Proof of the almost sure convergence (Theorem 3)

The idea behind the proof of the almost sure convergence is to apply the Robbins-Siegmund Theorem (Theorem 6) (which can be found in Appendix C) in combination with the following key deterministic result.

Lemma 1 (Deterministic result). Let F : Rd R be a L-smooth function and (θt) a random sequence obtained by the SGD update rule θt+1 = θt γt+1Ctgt where (γ)t 1 a positive sequence of learning rates and Ct 1 νt Id are such that P

t γtνt = . Let ω Ωsuch that the following limits exist:

t 0 γt+1νt+1 F(θt(ω)) 2 2 < (ii) X

t 1 γt Ct 1(gt 1(ω) F(θt 1(ω))) <

then F(θt(ω)) 0 as t .

Proof. The proof (and in particular the reasoning by contradiction) is inspired from the proof of Proposition 1 in Bertsekas & Tsitsiklis (2000). For ease of notation we omit the ω in the proof. Note that condition (i) along with P

t γtνt = implie that lim inft F(θt) = 0. Now, by contradiction, let ε > 0 and assume that

lim sup t F(θt) > ε

We have that there is inﬁnitely many t such that F(θt) < ε/2 and also inﬁnitely many t such that F(θt) > ε. It follows that there is inﬁnitely many crossings between the sets {t N : F(θt) < ε/2} and {t N : F(θt) > ε}. A crossing is a collection of indexes Ik = {Lk, Lk +1, . . . , Uk 1} with Lk Uk (Ik = when Lk = Uk) such that for all t Ik,

F(θLk 1) < ε/2 F(θt) ε < F(θUk) .

Deﬁne the following partial Cauchy sequence Rk = PUk t=Lk γt(gt 1 F(θt 1)) and note that condition (ii) implies that Rk 0 as k . For all k 1,

ε/2 F(θUk) 2 F(θLk 1) 2 F(θUk) F(θLk 1) 2 L θUk θLk 1 2,

where we use that F is L-Lipschitz. Then using the update rule θt θt 1 = γt Ct 1gt 1, we have by sum

t=Lk θt θt 1 2 = L

t=Lk γt Ct 1gt 1 2

t=Lk γt Ct 1 F(θt 1) 2 + L

t=Lk γt Ct 1(gt 1 F(θt 1)) 2

t=Lk γtνt F(θt 1) 2 + L Rk 2

Since in the previous equation F(θt 1) 2 > ε/2, we get

t=Lk γtνt F(θt 1) 2 2 + (ε/2)L Rk 2

But since P t 0 γt+1νt+1 F(θt) 2 is ﬁnite and limk Rk = 0, the previous upper bound goes to 0 and implies a contradiction.

Published in Transactions on Machine Learning Research (08/2023)

It remains to show that points (i) and (ii) in Lemma 1 are valid with probability one. Since θ 7 F(θ) is L-smooth, we have the quadratic bound (see Nesterov (2013))

θ, η Rd F(η) F(θ) + F(θ), η θ + L

Using the update rule θk+1 = θk γk+1Ckg(θk, ξk+1), we get

F(θk+1) F(θk) + F(θk), θk+1 θk + L

2 θk+1 θk 2 2

= F(θk) γk+1 F(θk), Ckg(θk, ξk+1) + L

2 γ2 k+1 Ckg(θk, ξk+1) 2 2.

The last term can be upper bounded using the matrix norm and Assumption 9 as

Ckg(θk, ξk+1) 2 2 Ck 2 g(θk, ξk+1) 2 2 ν2 k+1 g(θk, ξk+1) 2 2,

and we have the inequality

F(θk+1) F(θk) γk+1 F(θk), Ckg(θk, ξk+1) + L

2 (γk+1νk+1)2 g(θk, ξk+1) 2 2.

Introduce uk = γkνk and vk = γkµk, we have P

k 1 vk = + and P

k 1 u2 k < + a.s. in virtue of Assumption 9. The random variables F(θk), Ck are Fk-measurable and the gradient estimate is unbiased with respect to Fk. Taking the conditional expectation denoted by Ek leads to

Ek F(θk+1) F(θk) γk+1 F(θk), Ek [Ckg(θk, ξk+1)] + L

2 u2 k+1 Ek g(θk, ξk+1) 2 2

= γk+1 F(θk) Ck F(θk) + L

2 u2 k+1 Ek g(θk, ξk+1) 2 2 .

On the one hand for the ﬁrst term, using Assumption 9 ,

F(θk) Ck F(θk) λmin(Ck) F(θk) 2 2 µk+1 F(θk) 2 2.

On the other hand, using Assumption 8, there exist 0 L, σ2 < such that almost surely

k N, Ek g(θk, ξk+1) 2 2 2L (F(θk) F ) + σ2.

Inject these bounds in the previous inequality and substract F(θ ) on both sides to have

Ek [F(θk+1) F ] (1 + LLu2 k+1)(F(θk) F ) vk+1 F(θk) 2 2 + (L/2)u2 k+1σ2.

Introduce Vk = F(θk) F , Wk = vk+1 F(θk) 2 2, ak = LLu2 k+1 and bk = (L/2)u2 k+1σ2. These four random sequences are non-negative Fk-measurable sequences with P

k ak < and P

k bk < almost surely. Moreover we have

k N, E [Vk+1|Fk] (1 + ak)Vk Wk + bk.

We can apply Robbins-Siegmund Theorem 6 to have

k 0 Wk < a.s. (b) Vk a.s. V , E [V ] < . (c) sup k 0 E [Vk] < .

Therefore we have the almost sure convergence of the series P vk+1 F(θk) 2 2 which, given that lim supk νk/µk exists, implies that P uk+1 F(θk) 2 2 is ﬁnite. Hence we obtain (i) in Lemma 1. We now show that (ii) in Lemma 1 is also valid. The term of interest is a sum of martingale increments. The quadratic variation is given by X

t 1 γ2 t Et[ Ct 1(gt 1(ω) f(θt 1(ω))) 2 2] X

t 1 γ2 t ν2 t Et[ (gt 1(ω) F(θt 1(ω))) 2 2]

t 1 γ2 t ν2 t Et[ gt 1(ω) 2 2]

t 1 γ2 t ν2 t (2L(F(θt 1) F ) + σ2).

Published in Transactions on Machine Learning Research (08/2023)

Now we can use that Vk = F(θk) F a.s. V (which was deduced from Robbins-Siegmund Theorem) to obtain that the previous series converges. Invoking Theorem 2.17 in Hall & Heyde (1980), we obtain (ii) in Lemma 1. Furthermore we can prove that θk+1 θk 0 almost surely and in L2. Indeed, we have

E θk+1 θk 2 2 = E γk+1Ckg(θk, ξk+1 2 2 u2 k+1 2L (F(θk) F ) + σ2 .

In virtue of the almost sure convergence of Vk = F(θk) F , the last term in parenthesis is upper bounded by a constant so that in view of the convergence of P u2 k+1, we have the convergence of the series P E θk+1 θk 2 2 . We then deduce that E θk+1 θk 2 2 0 and P θk+1 θk 2 2 < + almost surely. In particular, θk+1 θk 0 in L2 and almost surely. The last point follows from the fact that, for every δ > 0,

lim n P sup k n θk+1 θk δ δ 2 lim n

k n E θk+1 θk 2 2 = 0.

A.3 Proof of Corollary 2

First observe that since F is coercive, the convergence of (F(θk)) obtained by Robbins-Siegmund theorem implies that the sequence of iterates (θk)k 0 remains in a compact subset K Rd. Let ε > 0. Since θ 7 d(θ, S) is continuous, the set D(ε) = {θ Rd : d(θ, S) ε} is closed and the set K(ε) = K D(ε) is compact. On this set, the map θ 7 F(θ) 2 is stricly positive and there exists ηε > 0 such that: θ K(ε) F(θ) 2 > ηε. Thus, P(θ K(ε)) P( F(θ) 2 > ηε) and this last quantity goes to zero which proves the convergence in probability d(θk, S) 0. Actually the almost sure convergence F(θk) 0 implies the convergence of the distances. Deﬁne Ak(ε) = {ω : θk(ω) K(ε)} and Bk(ε) = {ω : F(θk(ω)) 2 > ηε}. We have Ak(ε) Bk(ε) then n 1 k n Ak(ε) n 1 k n Bk(ε). Conclude by using the almost sure convergence P( n 1 k n Bk(ε)) = 0 for each ε > 0. If S is ﬁnite, it is in particular a compact set so the distance is attained for every k 0, d(θk, S) = mins S d(θk, s) 0. Since θk+1 θk 0, the sequence of iterates can only converge to a single point of S.

B Practical procedure

For the sake of completeness, the aim of this Section is to derive a feasible procedure that achieves the optimal asymptotic variance described in Corollary 1. First, we present a practical way to compute the conditioning matrix Ck and then we show that the resulting algorithm satisﬁes the high-level conditions of Theorem 2. This method is considered in a numerical illustration along with a novel variant of Ada Grad.

B.1 Construction of the conditioning matrix Ck

Similarly to the unavailability of gradients, one may not have access to values of the Hessian matrix but only stochastic versions of it (see details in numerical experiments below). As a consequence, we consider the following framework which involves random Hessian matrices. As for gradients, a policy (P k)k 0 is used at each iteration to produce random Hessians through H(θk, ξ k+1) with ξ k+1 P k.

Assumption 10 (Unbiased and bounded Hessians). The Hessian generator H : Rd S Rd d is uniformly bounded around the minimizer and is such that for all θ Rd, H(θ, ) is measurable and we have: k 0, E H(θk, ξ k+1)|Fk = 2F(θk).

An estimate of the Hessian matrix H = 2F(θ ) is now introduced as the weighted average

j=0 ωj,k H(θj, ξ j+1) with

j=0 ωj,k = 1. (22)

The previous estimate has two advantages. First, thanks to averaging, the noise associated to each evaluation H(θj, ξ j+1) will eventually vanished due to the sum of martingale increments. Second, the weights ωj,k may help to give more importance to most recent iterates. In the idea that θk lies near θ eventually, it might be helpful to reduce the bias when estimating H = 2F(θ ).

Published in Transactions on Machine Learning Research (08/2023)

Proposition 2. Let (Φk)k 0 be obtained by (22). Suppose that Assumptions 3 and 10 are fulﬁlled and that θk θ a.s. If sup0 j k ωj,k = O(1/k), then we have Φk H = 2F(θ ) a.s.

A natural choice is to take equal weights ωj,k = (k + 1) 1. However, since the last iterates are more likely to bring more relevant information through their Hessian estimates, we advocate the use of adaptive weights of the form ωj,k exp( η θj θk 1) with a parameter η 0 that recovers equal weights with η = 0. These two weights sequences satisfy the assumption of Proposition 2. They are considered in the numerical illustration of the next Section. While inverting Φk would produce a simple estimate of H 1, such an approach might result in a certain instability in practice caused by large jumps towards wrong directions (large eigenvalues) or a too restrictive visit along other components (vanishing eigenvalues). To overcome this issue, we rely on the following ﬁlter which clamps the eigenvalues of a symmetric matrix. For any symmetric matrix S and two positive numbers 0 < a < b, denote by S[a, b] the associated matrix where all the eigenvalues are clamped to [a, b], i.e., any eigenvalue λ of S is modiﬁed as λ max{a, min{λ, b}}.

Let (λ(m) k )k 1 and (λ(M) k )k 1 be two sequence of positive numbers such that λ(m) k λ(M) k for all k 1. Deﬁne the matrices

k N, Ck = Φk[(λ(M) k+1) 1, (λ(m) k+1) 1] 1 . (23)

Such a deﬁnition guarantees two properties. First, Ck S++ d (R) with λ(m) k+1Id Ck λ(M) k+1Id. Second,

in virtue of Proposition 2, Φk H a.s. so that, as soon as (λ(m) k )k 1 and (λ(M) k )k 1 go to 0 and + respectively, the matrix Ck converges almost surely to H 1 (as recommended by Corollary 1). Therefore, we obtain a feasible procedure leading to asymptotic optimality.

Theorem 5 (Asymptotic optimality of the iterates). Let (θk)k 0 be obtained by conditioned SGD (2) with γk = 1/k, Φk deﬁned by (22), λ(m) k 0, λ(M) k + and Ck given by (23). Suppose that Assumptions 1 to 9 are fulﬁlled and sup0 j k ωj,k = O(1/k). We have

k(θk θ ) N(0, H 1ΓH 1), as k .

This algorithm is theoretically asymptotically optimal. However in practice, adaptive gradient methods described in Table 1 have become the workhorse for training deep learning models as they take advantage of low rank-approximations and diagonal scalings. Interestingly, the conditioned matrices involved in these methods are linked to gradient estimates and thus to covariance matrices Γk (see Assumption 4) rather than the Hessian H. Indeed, since θ S, we have for the limiting covariance Γ = Eξ[g(θ , ξ)g(θ , ξ) ]. Consider a variant of Ada Grad which accumulates the average gradients Gk = δI + (1/k) Pk i=1 gig i and Ck = G 1/2 k . Averaging allows to anneal the stochastic noise of the gradient estimate. By the law of large numbers, the limiting matrix in our Theorem 2 will be C = (Γ + δI) 1/2.

B.2 Numerical illustration

Consider the empirical risk minimization framework applied to Generalized Linear Models. Given a data matrix X = (xi,j) Rn d with labels y Rn and a regularization parameter λ > 0, we are interested in solving minθ Rd F(θ) with

i=1 fi(θ), fi(θ) = L(x i θ, yi) + λΩ(θ),

L : R R R is smooth loss function and Ω: Rd R+ is a smooth convex regularizer chosen as Tikhonov regularization Ω(θ) = 1

2 θ 2 2. The gradient and Hessian of each component fi are given for all i = 1, . . . , n by

fi(θ) = L (x i θ, yi)xi + λθ

2fi(θ) = L (x i θ, yi)xix i + λId,

Published in Transactions on Machine Learning Research (08/2023)

where L ( , ) and L ( , ) are the ﬁrst and second derivative of L( , ) with respect to the ﬁrst argument. Consider two well-known losses, namely least-squares and logistic. These losses are respectively associated to the Ridge regression problem with y Rn and the binary classication task with y { 1, +1}n. The regularization parameter is set to the classical value λ = 1/n. Denote by σ(z) = 1/(1+exp( z)) the sigmoid function, we have the following closed-form equations

(Ridge Regression)

L(x i θ, yi) = 1

2(yi x i θ)2

L (x i θ, yi) = x i θ yi L (x i θ, yi) = 1

(Logistic Regression)

L(x i θ, yi) = log(1 + exp( yix i θ)) L (x i θ, yi) = σ(x i θ) yi L (x i θ, yi) = σ(x i θ)(1 σ(x i θ))

As stated in Example 1 of Section 2, stochastic versions of both the gradient and the Hessian of the objective F can be easily computed using only a batch B {1, . . . , n} of data and BF(θ) = P

i B fi(θ)/|B| (resp. 2 BF(θ) = P

i B 2fi(θ)/|B|) for the gradient (resp. Hessian) estimate. Note that these random generators meet Assumptions 1 and 10 as they produce unbiased estimates of the gradient and the Hessian matrix respectively.

For the sake of completeness and illustrative purposes, we compare the performance of classical stochastic gradient descent (sgd) and the conditionned variant (csgd) presented in Appendix B where the matrix Φk is an averaging of past Hessian estimates as given in Equation (22). We shall compare equal weights ωj,k = (k + 1) 1 and adaptive weights ωj,k exp( η θj θk 1) with η > 0 to give more importance to Hessian estimates associated to iterates which are closed to the current point. Furthermore, for computational reason, we consider a novel adaptive stochastic ﬁrst-order method which is a variant of Adagrad.

Starting from the null vector θ0 = (0, . . . , 0) Rd, we use optimal learning rate of the form γk = α/(k + k0) (Bottou et al., 2018) and set λ(m) k 0, λ(M) k = Λ

k in the experiments where γ, k0 and Λ are tuned using a grid search. The means of the optimality ratio k 7 [F(θk) F(θ )]/[F(θ0) F(θ )], obtained over 100 independent runs, are presented in Figures below.

Methods in competition. The diﬀerent methods in the experiments are:

sgd: standard stochastic gradient descent.

sgd_avg: Polyak-averaging stochastic gradient descent , with a burn-in period (n0 = 15 for d = 20 and n0 = 30 for d = 100) to avoid the poor performance of bad initialization.

csgd(η = 0) and csgd(η > 0): conditioned stochastic gradient descent methods with equal and adaptive weights where the matrix Φk is an averaging of past Hessian estimates as given in Equation (22).

adafull_avg: The variant of Adagrad presented in Appendix B where the gradient matrix Gk is updated as an average Gk = δI + (1/k) Pk i=1 gig i and Ck = G 1/2 k instead of the cumulative sum provided in the literature of Adagrad. Note that averaging here allows to anneal the stochastic noise whereas classical versions of Adagrad often rely on true gradients and may use cumulative sums. The parameter δ is also tuned using a grid search.

We focus on Ridge regression on simulated data with n = 10, 000 samples in dimensions d {20; 100}. Stochastic gradient methods are known to greatly beneﬁt from mini-batch instead of picking a single random sample when computing the gradient estimate. We use a batch-size equal to |B| = 16. In Figure 1, we can see that conditioned SGD outperforms standard SGD. Furthermore, adaptive weights (η > 0) improve the convergence speed of conditioned SGD methods. Interestingly, the novel approach adafull_avg oﬀers great performance at a cheap computing cost. Indeed, the update of Ck+1 relies on the inverse of an average. This operation can be carried out in an eﬃcient way thanks to Woodbury matrix identity.

Real-world data. We now turn our attention to real-world data and consider again the Ridge regression problem on the following datasets: Boston Housing dataset (Harrison Jr & Rubinfeld, 1978) (n = 506; d = 14) and Diabetes dataset (Dua & Graﬀ, 2017) (n = 442; d = 10).

Published in Transactions on Machine Learning Research (08/2023)

0 20 40 60 80 100 Iterations

Optimaliy Ratio

sgd sgd_avg csgd( =0)

adafull_avg

(a) Ridge d = 20

0 50 100 150 200 250 300

Optimaliy Ratio

sgd sgd_avg csgd( =0)

adafull_avg

(b) Ridge d = 100

Figure 1: Optimality ratio k 7 [F(θk) F(θ )]/[F(θ0) F(θ )] for Ridge regression in dimension d {20; 100}.

Boston Housing dataset (Harrison Jr & Rubinfeld, 1978): This dataset contains information collected by the U.S Census Service concerning housing in the area of Boston Mass. It contains n = 506 samples in dimension d = 14.

Diabetes dataset (Dua & Graﬀ, 2017): Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.

The means of the optimality ratio k 7 [F(θk) F(θ )]/[F(θ0) F(θ )], obtained over 100 independent runs, are presented in Figure 2. Once again, the conditioned SGD methods oﬀer better performance than plain SGD. For these datasets, it is the conditioning matrix with adaptive weights as given in Equation (22) which presents the best results.

0 50 100 150 200 250 300

Optimaliy Ratio

sgd sgd_avg csgd( =0)

adafull_avg

(a) Boston Dataset

0 100 200 300 400 500 Iterations

Optimaliy Ratio

sgd sgd_avg csgd( =0)

adafull_avg

(b) Diabetes Dataset

Figure 2: Optimality ratio k 7 [F(θk) F(θ )]/[F(θ0) F(θ )] for Ridge regression on datasets Boston and Diabetes.

Published in Transactions on Machine Learning Research (08/2023)

C Auxiliary results

C.1 Robbins-Siegmund Theorem

Theorem 6. (Robbins & Siegmund (1971)) Consider a ﬁltration (Fn)n 0 and four sequences of random variables(Vn)n 0 , (Wn)n 0 , (an)n 0 and (bn)n 0 that are adapted and non-negative. Assume that almost surely P

k ak < and P

k bk < . Assume moreover that E [V0] < and for all n N, E[Vn+1|Fn] (1 + an)Vn Wn + bn. Then it holds

k Wk < a.s. (b) Vn a.s. V , E [V ] < . (c) sup n 0 E [Vn] < .

C.2 Auxiliary lemmas

Lemma 2. Let (un)n 1, (vn)n 1 and (γn)n 1 be non-negative sequences such that γn 0 and P

n γn = + . Assume that there exists a real number m 1 and j 1 such that for all n j, un (1 γn)mun 1+γnvn. Then it holds that lim sup n + un lim sup n + vn.

Proof. Denote x+ = max(x, 0). One has (x + y)+ x+ + y+. Set ε > 0 and v = lim supn vn + ε. Then there exists an integer N 1 such that (1 γn)m (1 γn) and vn < v, i.e., (vn v)+ = 0 for n N. We have for large enough n N j,

un v (1 γn)(un 1 v) + γn(vn v),

and taking the positive part gives

(un v)+ (1 γn)(un 1 v)+ + γn(vn v)+ = (1 γn)(un 1 v)+.

n γn = + , this inequality implies that (un v)+ tends to zero, but this is true for all ε > 0 so v is arbitrarily close to lim supn vn and the result follows.

Lemma 3. Let (γn)n 1 be a non-negative sequence converging to zero, and λ, m and p three real numbers with λ > 0, m 1, p 0. Consider two non-negative sequences (xn), (εn) and an integer j 1 such that

n j, xn = (1 λγn)mxn 1 + γp+1 n εn,

i=j (1 λγi)mxj 1 +

i=k+1 (1 λγi)m !

The following holds if γn = n β, β (1/2, 1), then for any p

lim sup n + xn γp n 1 mλ lim sup n + εn.

if γn = 1/n, then for any p < mλ

lim sup n + xn γp n 1 mλ p lim sup n + εn.

In particular, when εn 0 with j = 1 and x0 = 0,

i=k+1 (1 λγi)mεk = 0,

(mλ > 1) lim n + 1 γn

i=k+1 (1 λγi)mεk = 0.

Published in Transactions on Machine Learning Research (08/2023)

Before proving this result, note that if we consider γn = γ/nβ then we can write

xn = (1 λγn)mxn 1 + γp+1 n εn = (1 (λγ)n β)mxn 1 + (n β)p+1(γp+1εn)

and apply the result with λ = γλ and εn = γp+1εn.

Proof. We apply Lemma 2 to the sequence un = xn

γp n . We have for all n j,

(1 λγn)mxn 1 + γp+1 n εn

p (1 λγn)mun 1 + γnεn

= exp p log γn 1

+ m log(1 λγn) un 1 + γnεn.

1 exp p log γn 1

+ m log(1 λγn) ,

so we get the recursion equation

n j, un = (1 λnγn)un 1 + λnγn εn λn .

if γn = n β, β (1/2, 1) then 1/γn 1/γn 1 0 and the ratio γn 1/γn tends to 1 with

γn 1 (1 + o(1)) = γn 1

(1 + o(1)) = o(γn).

Besides, m log(1 λγn) = mλγn + o(γn) when n + and we get

γn [1 exp ( mλγn + o(γn))] ,

which implies that λn converges to mλ. We conclude with Lemma 2. if γn = 1/n then the ratio γn 1/γn tends to 1 with

= log 1 + 1 n 1

= γn + o(γn).

We still have m log(1 λγn) = mλγn + o(γn) when n + and therefore

γn [1 exp ((p mλ)γn + o(γn))] ,

which implies λn converges to (mλ p) and we conclude in the same way.

Lemma 4. Let A, B S++ d (R) then the eigenvalues of AB are real and positive with Sp(AB) [λmin(A)λmin(B); λmax(A)λmax(B)].

Proof. Denote by

B the unique positive square root of B. The matrix AB is similar to the real symmetric positive deﬁnite matrix

B. Therefore its eigenvalues are real and positive. Since A 7 λmax(A) is a sub-multiplicative matrix norm on S++ d (R), λmax(AB) λmax(A)λmax(B) which gives λmax((AB) 1) λmax(A 1)λmax(B 1), i.e., λmin (AB) 1 λmin(A) 1λmin(B) 1, and ﬁnally λmin(A)λmin(B) λmin (AB) .

Published in Transactions on Machine Learning Research (08/2023)

Lemma 5. Let S S++ d (R) be a real symmetric positive deﬁnite matrix. Let (γk)k 1 be a positive decreasing sequence converging to 0 such that P k γk = + . Denote by λm the smallest eigenvalue of S. It holds that there exists j 1 such that for any k > j, all the eigenvalues of the real symmetric matrix Ak = I γk S are positive and we have

ρ(Πn) = ρ(An . . . A1) n + 0,

k > j, ρ(Πn,k) = ρ(An . . . Ak+1)

i=k+1 (1 γiλm).

Proof. For any k N, the eigenvalues of the real symmetric matrix Ak = I γk S are given by Sp(Ak) = {(1 γkλ), λ Sp(S)}. Since γk 0, there exists j 1 such that γkλm < 1 for all k > j. Therefore for any k > j, we have Sp(Ak) R + and the largest eigenvalue is ρ(Ak) = 1 γkλm. Since ρ is a sub-multiplicative norm for real symmetric matrices, we get ρ(Πn) Qn k=1 ρ(Ak) = Qj k=1 ρ(Ak) Qn k=j+1 ρ(Ak). The second product can be upper bounded with the convexity of exponential,

k=j+1 ρ(Ak) =

k=j+1 (1 γkλm)

k=j+1 exp ( γkλm) = exp ( λm(τn τj)) n + 0.

Similarly we have for all k > j, ρ(Πn,k) Qn i=k+1 ρ(Ai) Qn i=k+1(1 γiλm).

Lemma 6. Let γn = αn β with β (1/2, 1] then it holds

1 β = α 1 β n1 β, (β = 1)

k=1 γk α log(n).

Proof. By series-integral comprison, R n+1 1 t βdt Pn k=1 k β 1 + R n 1 t βdt.

Theorem 7. (Delyon & Portier, 2021, Theorem 17)(Freedman inequality) Let (Xj)1 j n be random variables such that E[Xj|Fj 1] = 0 for all 1 j n then, for all t 0 and v, m > 0,

j=1 Xj t, max j=1,...,n |Xj| m,

j=1 E X2 j | Fj 1 v

2 exp t2/2 v + tm/3

Lemma 7. Let A Rn n be a symmetric positive semi-deﬁnite matrix. Then for any B Rm n, the matrix BAB Rm m is symmetric positive semi-deﬁnite.

Proof. First note that (BAB ) = (B ) A B = BAB because A is symmetric. Then for any vector x R, we have x (BAB )x = (B x) A(B x) 0 since A is positive semi-deﬁnite.

Proposition 3. (Khalil, 2002, Theorem 4.6) Let H be a positive deﬁnite matrix and Γ a symmetric positive deﬁnite matrix of same dimension. Then there exists a symmetric positive deﬁnite matrix Σ, unique solution of the Lyapunov equation HΣ + ΣH = Γ, which is given by Σ = R + 0 e t HΓe t H dt.

The results remains true if the matrix Γ is only symmetric positive semi-deﬁnite: in that case the matrix Σ is also symmetric positive semi-deﬁnite and is the solution of the Lyapunov equation.

C.3 Additional propositions

This section gathers the proofs of Proposition 1 about the optimal choice for the conditioning matrix and of Proposition 2 about the almost sure convergence of the conditioning matrices.

Proposition 1. The choice C = H 1 is optimal in the sense that ΣC ΣC, C CH. Moreover, ΣC = H 1ΓH 1.

Published in Transactions on Machine Learning Research (08/2023)

Proof. Deﬁne C = ΣC H 1ΓH 1 and check that C satisﬁes

(CH Id/2) C + C (CH Id/2) = (C H 1)Γ(C H 1).

Because Γ is symmetric positive semi-deﬁnite, we have using Lemma 7 that the term on the right side is symmetric positive semi-deﬁnite. Therefore, in view of Proposition 3, we get that C is symmetric positive semi-deﬁnite C 0 which implies ΣC H 1ΓH 1 for all C CH. The equality is reached for C = H 1

with C = 0, ΣC = H 1ΓH 1.

Proposition 2. Let (Φk)k 0 be obtained by (22). Suppose that Assumptions 3 and 10 are fulﬁlled and that θk θ almost surely . If sup0 j k ωj,k = O(1/k), then we have Φk H = 2F(θ ) almost surely.

Proof. We use the decomposition

j=0 ωj,k 2F(θj) H +

j=0 ωj,k H(θj, ξ j+1) 2F(θj) .

The continuity of 2F at θ and the fact that θj θ a.s. implie that 2F(θj) H 0 a.s. Since sup0 j k ωj,k = O(1/k), there exists a > 0 such that

j=0 ωj,k 2F(θj) H a k + 1

which goes to 0 in virtue of Cesaro s Lemma, therefore limk Pk j=0 ωj,k 2F(θj) H = 0. The second term is a sum of martingale increments and shall be treated with Freedman inequality and Borel-Cantelli Lemma. Introduce the martingale increments

0 j k, Xj+1,k = ωj,k H(θj, ξ j+1) 2F(θj) .

For a ﬁxed k, we have Xj+1,k = x(i,l) j+1

1 i,l d where we remove the index k for the sake of clarity. Because

the Hessian generator is unbiased, we have for all coordinates

E h x(i,l) j+1|Fj i = 0 for all 0 j k.

By deﬁnition of the Hessian generator and using that ( 2F(θj)) is bounded, we get that H(θj, ξ j+1) 2F(θj) = O(1) for all j 0. For any b > 0, consider the following event

Ωb = sup k 0 max j=0,...,k(k + 1) x(i,l) j+1 b ,

and note that since ωj,k = O(1/k) we have P(Ωb) 1 as b . On this event, the martingale increments and the variance term are bounded as

max j=0,...,k

x(i,l) j+1 b(k + 1) 1,

j=0 E x(i,l) j+1 2 | Fj

b2(k + 1) 1.

Using Freedman inequality (Theorem 7), we have for all coordinates i, l = 1, . . . , d,

j=0 x(i,l) j+1

2 exp ε2(k + 1)

The last term is the general term of a convergent series. Apply Borel-Cantelli Lemma (Borel, 1909) to ﬁnally get almost surely on Ωb that limk Pk j=0 x(i,l) j+1 = 0. Since b > 0 is arbitrary and P(Ωb) 1 when

Published in Transactions on Machine Learning Research (08/2023)

b , we have almost surely limk Pk j=0 x(i,l) j+1 = 0. This is true for all the coordinates of the martingale increments and therefore

j=0 ωj,k H(θj, ξ j+1) 2F(θj) = 0 a.s.

C.4 Auxiliary results on expected smoothness

The following Lemma gives suﬃcient conditions to meet the weak growth condition on the stochastic noise as stated in Assumption 8.

Lemma 8. Suppose that for all k 1, θ Rd, F(θ) = E [f(θ, ξk)|Fk 1] with ξk Pk 1. Assume that for all ξk Pk 1, the function θ 7 f(θ, ξk) is L-smooth almost surely and there exists m R such that for all θ Rd, f(θ, ξk) m. Then a gradient estimate is given by g(θ, ξ) = f(θ, ξ) and the growth condition of Assumption 8 is satisﬁed with σ2 = 2L(F m) and

θ Rd, k N, E g(θ, ξk) 2 2|Fk 1 2L (F(θ) F ) + σ2.

Proof. For all ξk Pk 1, Lipschitz continuity of the gradient θ 7 f(θ, ξk) implies (see Nesterov (2013))

f(y, ξk) f(θ, ξk) + f(θ, ξk), y θ + (L/2) y θ 2 2.

Plug y = θ (1/L) f(θ, ξk) and use the lower bound f(y, ξk) m to obtain

1 2L f(θ, ξk) 2 2 f(θ, ξk) f(y, ξk) f(θ, ξk) m,

which gives,

g(θ, ξk) 2 2 2L (f(θ, ξk) f(θ , ξk)) + 2L (f(θ , ξk) m)

and conclude by taking the conditional expectation with respect to Fk 1.

The next Lemma links our weak growth condition with the notion of expected smoothness as introduced in Gower et al. (2019). In particular, this notion can be extended to our general context where the sampling distribution can evolve through the stochastic algorithm.

Lemma 9. (Expected smoothness) Assume that with probability one,

sup k 1 sup x =x E g(θ, ξk) g(θ , ξk) 2 2|Fk 1

F(θ) F < and sup k 1 E g(θ , ξk) 2 2|Fk 1 < .

Then there exist 0 L, σ2 < such that

θ Rd, k N, E g(θ, ξk) 2 2|Fk 1 2L (F(θ) F ) + 2σ2.

Proof. For all θ Rd and all k N, we have

g(θ, ξk) 2 2 = g(θ, ξk) g(θ , ξk) + g(θ , ξk) 2 2 2 g(θ, ξk) g(θ , ξk) 2 2 + 2 g(θ , ξk) 2 2.

Using the expected smoothness, with probability one, there exists 0 L < such that

E g(θ, ξk) g(θ , ξk) 2 2|Fk 1 L (F(θ) F ) .

Since the noise at optimal point is almost surely ﬁnite there exists 0 σ2 < such that

E g(θ , ξk) 2 2|Fk 1 σ2,

which allows to conclude by taking the conditional expectation.