# asymptotic_analysis_of_conditioned_stochastic_gradient_descent__57f02969.pdf Published in Transactions on Machine Learning Research (08/2023) Asymptotic Analysis of Conditioned Stochastic Gradient Descent Rémi Leluc remi.leluc@gmail.com CMAP, École Polytechnique Institut Polytechnique de Paris, Palaiseau (France) François Portier francois.portier@gmail.com CREST, ENSAI École Nationale de la Statistique et de l Analyse de l Information, Rennes (France) Reviewed on Open Review: https: // openreview. net/ forum? id= U4Xgz Rjf F1 In this paper, we investigate a general class of stochastic gradient descent (SGD) algorithms, called conditioned SGD, based on a preconditioning of the gradient direction. Using a discrete-time approach with martingale tools, we establish under mild assumptions the weak convergence of the rescaled sequence of iterates for a broad class of conditioning matrices including stochastic first-order and second-order methods. Almost sure convergence results, which may be of independent interest, are also presented. Interestingly, the asymptotic normality result consists in a stochastic equicontinuity property so when the conditioning matrix is an estimate of the inverse Hessian, the algorithm is asymptotically optimal. 1 Introduction Consider some unconstrained optimization problem of the following form: min θ Rd{F(θ) = Eξ[f(θ, ξ)]}, where f is a loss function and ξ is a random variable. This key methodological problem, known under the name of stochastic programming (Shapiro et al., 2014), includes many flagship machine learning applications such as empirical risk minimization (Bottou et al., 2018), adaptive importance sampling (Delyon & Portier, 2018) and reinforcement learning (Sutton & Barto, 2018). When F is differentiable, a common appproach is to rely on first-order methods. However, in many scenarios and particularly in large-scale learning, the gradient of F may be hard to evaluate or even intractable. Instead, a random unbiased estimate of the gradient is available at a cheap computing cost and the state-of-the-art algorithm, stochastic gradient descent (SGD), just moves along this estimate at each iteration. It is an iterative algorithm, simple and computationally fast, but its convergence towards the optimum is generally slow. Conditioned SGD, which consists in multiplying the gradient estimate by some conditioning matrix at each iteration, can lead to better performance as shown in several recent studies ranging from natural gradient (Amari, 1998; Kakade, 2002) and stochastic second-order methods with quasi-Newton (Byrd et al., 2016) and (L)-BFGS methods (Liu & Nocedal, 1989) to diagonal scaling methods such as Ada Grad (Duchi et al., 2011), RMSProp (Tieleman et al., 2012), Adam (Kingma & Ba, 2014), AMSGrad (Reddi et al., 2018) and adaptive coordinate sampling (Wangni et al., 2018; Leluc & Portier, 2022). These conditioning techniques are based on different strategies: diagonal scaling rely on feature normalization, stochastic second-order methods are concerned with minimal variance and adaptive coordinate sampling techniques aim at taking advantage of particular data structure. Furthermore, these methods proved to be the current state-of-the-art for training machine learning models (Zhang, 2004; Le Cun et al., 2012) and are implemented in widely used programming tools (Pedregosa et al., 2011; Abadi et al., 2016). Published in Transactions on Machine Learning Research (08/2023) Conditioned SGD generalizes standard SGD by adding a conditioning step to refine the descent direction. Starting from θ0 Rd, the algorithm of interest is defined by the following iteration θk+1 = θk γk+1Ckg(θk, ξk+1), k 0, where g(θk, ξk+1) is some unbiased gradient valued in Rd, Ck Rd d is called conditioning matrix and (γk)k 1 is a decreasing learning rate sequence. An important question, which is still open to the best of our knowledge, is to characterize the asymptotic variance of such algorithms for non-convex objective F and general estimation procedure for the conditioning matrix Ck. Related work. Seminal works around standard SGD (Ck = Id) were initiated by Robbins & Monro (1951) and Kiefer et al. (1952). Since then, a large literature known as stochastic approximation, has developed. The almost sure convergence is studied in Robbins & Siegmund (1971) and Bertsekas & Tsitsiklis (2000); rates of convergence are investigated in Kushner & Huang (1979) and Pelletier (1998a); non-asymptotic bounds are given in Moulines & Bach (2011). The asymptotic normality can be obtained using two different approaches: a diffusion-based method is employed in Pelletier (1998b) and Benaïm (1999) whereas martingale tools are used in Sacks (1958) and Kushner & Clark (1978). We refer to Nevelson & Khas minski ı (1976); Delyon (1996); Benveniste et al. (2012); Duflo (2013) for general textbooks on stochastic approximation. The aforementioned results do not apply directly to conditioned SGD because of the presence of the matrix sequence (Ck)k 0 involving an additional source of randomness in the algorithm. Seminal papers dealing with the weak convergence of conditioned SGD are Venter (1967) and Fabian (1968). Within a restrictive framework (univariate case d = 1 and strong assumptions on the function F), their results are encouraging because the limiting variance of the procedure is shown to be smaller than the limiting variance of standard SGD. Venter s and Fabian s results have then been extended to more general situations (Fabian, 1973; Nevelson & Khas minski ı, 1976; Wei, 1987). In Wei (1987), the framework is still restrictive not only because the random errors are assumed to be independent and identically distributed but also because the objective F must satisfy their assumption (4.10) which hardly extends to objectives other than quadratic. More recently, Bercu et al. (2020) have obtained the asymptotic normality as well as the efficiency of certain conditioned SGD estimates in the particular case of logistic regression. The previous approach has been generalized not long ago in Boyer & Godichon-Baggioni (2022) where the use of the Woodbury matrix identity is promoted to compute the Hessian inverse in the online setting. Several theoretical results, including the weak convergence of conditioned SGD, are obtained for convex objective functions. An alternative to conditioning, called averaging, developed by Polyak (1990) and Polyak & Juditsky (1992), allows to recover the same asymptotic variance as conditioned SGD. When dealing with convex objectives, the theory behind this averaging technique is a well-studied topic (Moulines & Bach, 2011; Gadat & Panloup, 2017; Dieuleveut et al., 2020; Zhu et al., 2021). However, it is inevitably associated with a large bias caused by poor initialization and requires some parameter tuning through the burn-in phase. Contributions. The main result of this paper deals with the weak convergence of the rescaled sequence of iterates. Interestingly, our asymptotic normality result consists of the following continuity property: whenever the matrix sequence (Ck)k 0 converges to a matrix C and the iterates (θk)k 0 converges to a minimizer θ , the algorithm behaves in the same way as an oracle version in which C would be used instead of Ck. We stress that contrary to Boyer & Godichon-Baggioni (2022), no convexity assumption is needed on the objective function and no rate of convergence is required on the sequence (Ck)k 0. This is important because, in most studies, deriving a convergence rate on (Ck)k 0 requires a specific convergence rate on the iterates (θk)k 0 which, in general, is unknown at this stage of the analysis. From a more practical point of view, our main result claims that the impact of the approximation error resulting from the conditioning matrices estimation assumes a secondary role. This finding promotes the use of simple and cheap sequential algorithm to estimate the conditioning matrix which encompasses a broad spectrum of conditioned SGD methods, highlighting the applicability and generalizability of the obtained result. Another result of independent interest dealing with the almost sure convergence of the gradients is also provided. In addition, for illustration purposes, we apply our results to the popular variational inference problem where one seeks to approximate a target density out of a parametric family by solving an optimization problem. In this framework, by optimizing the forward Kullback-Liebler divergence (Jerfel et al., 2021) and building Published in Transactions on Machine Learning Research (08/2023) stochastic gradients relying on importance sampling schemes (Delyon & Portier, 2018), we show that the approach has some efficiency properties. For the sake of completeness, we present in appendix practical ways to compute the conditioning matrix Ck and show that the resulting procedure satisfies the high-level conditions of our main Theorem. This yields a feasible algorithm achieving minimum variance. To obtain these results, instead of approximating the rescaled sequence of iterates by a continuous diffusion (as for instance in Pelletier (1998b)), we rely on a discrete-time approach where the recursion scheme is directly analyzed (as for instance in Delyon (1996)). More precisely, the sequence of iterates is studied with the help of an auxiliary linear algorithm whose limiting distribution can be deduced from the central limit theorem for martingale increments (Hall & Heyde, 1980). The limiting variance is derived from a discrete time matrix-valued dynamical system algorithm. It corresponds to the solution of a Lyapunov equation involving the matrix C. It allows a special choice for C which guarantees an optimal variance. Finally, a particular recursion is identified to examine the remaining part. By studying it on a particular event, this part is shown to be negligible. Outline. Section 2 introduces the framework of standard SGD with asymptotic results. Section 3 is dedicated to conditioned SGD: it first presents popular optimization methods that fall in the considered framework and then presents our main results, namely the weak convergence and asymptotic optimality. Section 4 gathers practical implications of the main results for machine learning models in the framework of variational inference and Section 5 concludes the paper with a discussion of avenues for further research. Technical proofs, additional propositions and numerical experiments are available in the appendix. 2 Mathematical background In this section, the mathematical background of stochastic gradient descent (SGD) methods is presented and illustrated with the help of some examples. Then, to motivate the use of conditioning matrices, we present a known result from Pelletier (1998b) about the weak convergence of SGD. 2.1 Problem setup Consider the problem of finding a minimizer θ Rd of a function F : Rd R, that is, θ arg min θ Rd F(θ). In many scenarios and particularly in large scale learning, the gradient of F cannot be fully computed and only a stochastic unbiased version of it is available. The SGD algorithm moves the iterate along this direction. To increase the efficiency, the random generators used to derive the unbiased gradients might evolve during the algorithm, e.g., using the past iterations. To analyse such algorithms, we consider the following probabilistic setting. Definition 1. A stochastic algorithm is a sequence (θk)k 0 of random variables defined on a probability space (Ω, F, P) and valued in Rd. Define (Fk)k 0 as the natural σ-field associated to the stochastic algorithm (θk)k 0, i.e., Fk = σ(θ0, θ1, . . . , θk), k 0. A policy is a sequence of random probability measures (Pk)k 0, each defined on a measurable space (S, S) that are adapted to Fk. Given a policy (Pk)k 0 and a learning rates sequence (γk)k 1 of positive numbers, the SGD algorithm (Robbins & Monro, 1951) is defined by the update rule θk+1 = θk γk+1g(θk, ξk+1) with ξk+1 Pk, (1) where g : Rd S Rd is called the gradient generator. The choice of the policy (Pk)k 0 in SGD is important as it can impact the convergence speed, generalization performance, and efficiency of the optimization algorithm. While most classical approaches rely on uniform sampling and mini-batch sampling, it may be more efficient to use advanced selection sampling strategy such as stratified sampling or importance sampling (see Example 1 for details). The policy (Pk)k 0 is used at each iteration to produce random gradients through the function g. Those gradients are assumed to be unbiased. Published in Transactions on Machine Learning Research (08/2023) Assumption 1 (Unbiased gradient). The gradient generator g : Rd S Rd is such that for all θ Rd, g(θ, ) is measurable, and we have: k 0, E [g(θk, ξk+1)|Fk] = F(θk). We emphasize three important examples covered by the developed approach. In each case, explicit ways to generate the stochastic gradient are provided. Example 1. (Empirical Risk Minimization) Given some observed data z1, . . . , zn Rp and a differentiable loss function ℓ: Rd Rp R, the objective function F approximates the true expected risk Ez[ℓ(θ, z)] using its empirical counterpart F(θ) = n 1 Pn i=1 ℓ(θ, zi). Classically, the gradient estimates at θk are given by the policy g(θk, ξk+1) = θℓ(θk, ξk+1) with ξk+1 Another one, more subtle, referred to as mini-batching (Gower et al., 2019), consists in generating uniformly a set of nk samples (z1, . . . , znk) and computing the gradient as the average n 1 k Pnk j=1 θℓ(θk, zj). Note that interestingly, we allow changes of the minibatch size throughout the algorithm. Our framework also includes adaptive non-uniform sampling (Papa et al., 2015) and survey sampling (Clémençon et al., 2019), which use Pk = Pn i=1 w(k) i δzi with Fk-adapted weights satisfying Pn i=1 w(k) i = 1 for each k 0. Example 2. (Adaptive importance sampling for variational inference) Given a target density function f, which for instance might result from the posterior distribution of some observed data, and a parametric family of samplers {qθ : θ Θ}, the aim is to find a good approximation of f out of the family of samplers. A standard choice (Jerfel et al., 2021) for the objective function is the so called forward Kullback-Leibler divergence given by F(θ) = R log(qθ(y)/f(y))f(y)dy. Then in the spirit of adaptive importance sampling schemes (Delyon & Portier, 2018), gradient estimates are given by g(θk, ξk+1) = θ log(qθk(ξk+1)) f(ξk+1) qθk(ξk+1), ξk+1 qθk. Other losses such as α-divergence (Daudel et al., 2021) or generalized method of moment (Delyon & Portier, 2018) may also be considered depending on the problem of interest. Some applications of conditioned SGD algorithm to this particular framework are considered with more details in Section 4. Example 3. (Policy-gradient methods) In reinforcement learning (Sutton & Barto, 2018), the goal of the agent is to find the best action-selection policy to maximize the expected reward. Policy-gradient methods (Baxter & Bartlett, 2001; Williams, 1992) use a parameterized policy {πθ : θ Θ} to optimize an expected reward function F given by F(θ) = Eξ πθ[R(ξ)] where ξ is a trajectory including nature states and selected actions. Using the policy gradient theorem, one has F(θ) = Eξ πθ [R(ξ) θ log πθ(ξ)], leading to the REINFORCE algorithm (Williams, 1992) given by g(θk, ξk+1) = R(ξk+1) θ log πθk(ξk+1), ξk+1 πθk. 2.2 Weak convergence of SGD This section is related to the weak convergence property of the normalized sequence of iterates (θk θ )/ γk. The working assumptions include the almost sure convergence of the sequence of iterates (θk)k 0 towards a stationary point θ . Note that, given Assumptions 1 and 2, there exist many criteria on the objective function that give such almost sure convergence. For these results, we refer to Bertsekas & Tsitsiklis (2000); Benveniste et al. (2012); Duflo (2013). In addition to this high-level assumption of almost sure convergence, we require the following classical assumptions. Let S++ d (R) denote the space of real symmetric positive definite matrices and define for all k 0, wk+1 = F(θk) g(θk, ξk+1), Γk = E wk+1w k+1|Fk . The learning rates sequence (γk)k 1 should decay to eventually anneal the noise but not too fast so that the iterates (θk)k 0 can reach interesting places in a finite time. Published in Transactions on Machine Learning Research (08/2023) Assumption 2 (Learning rates). The sequence of step-size is γk = αk β with β (1/2, 1]. This classical form of the step-size ensures theoretical convergence guarantee through the Robbins-Monro condition: P k γk = , P k γ2 k < . However, note that in practice, the choice of learning rate is often determined through experimentation and fine-tuning to achieve the best performance on the given task. Assumption 3 (Hessian). The Hessian matrix at stationary point is positive definite, i.e., H = 2F(θ ) S++ d (R) and the mapping θ 7 2F(θ) is continuous at θ . The positive definiteness of the Hessian matrix provides stability and robustness guarantees in the optimization process. It ensures that small perturbations or noise in the objective function or the training data do not significantly affect the convergence behavior. The positive curvature helps in confining the optimization trajectory near the minimum and prevents it from getting trapped in flat regions or saddle points. The noise sequence (wk)k 1 defines a sequence of conditional covariance matrices (Γk)k 1 that is assumed to converge so that one can identify the limiting covariance Γ = E[g(θ , ξ)g(θ , ξ) ]. Assumption 4 (Covariance matrix). There exists Γ S++ d (R) such that Γk k + Γ a.s. Finally, in order to derive a central limit theorem for the iterates of the algorithm, there is an extra need for stability which is synonymous with a uniform bound on the noise around the minimizer. Assumption 5 (Lyapunov bound). There exist δ, ε > 0 such that: sup k 0 E[ wk+1 2+δ 2 |Fk]1{ θk θ ε} < a.s. Note that all these assumptions are stated in the spirit of Pelletier (1998b) making them mild and general. In particular, Assumptions 4 and 5 are similar to (A1.2) in Pelletier (1998b). More precisely, Assumption 4 is needed to identify the limiting distribution while Assumption 5 is a stability condition often referred to as the Lyapunov condition. This last condition is technical but not that strong as it is similar to the Lindeberg s condition which is necessary (Hall & Heyde, 1980) for tightness. The following result can be either derived from (Pelletier, 1998b, Theorem 1) or as a direct corollary of our main result, Theorem 2, given in Section 3.2. Theorem 1 (Weak convergence of SGD). Let (θk)k 0 be obtained by the SGD rule (1). Suppose that Assumptions 1, 2, 3, 4, 5 are fulfilled and that θk θ almost surely. If moreover, (H ζI) is positive definite with ζ = 1{β=1}/2α, it holds that 1 γk (θk θ ) N(0, Σ), as k where Σ satisfies the Lyapunov equation: (H ζId)Σ + Σ(H ζId) = Γ. Several remarks are to be explored. Since Γ and (H ζI) are positive definite matrices, there exists a unique solution Σ to the Lyapunov equation (H ζId)Σ + Σ(H ζId) = Γ given by Σ = R + 0 exp[ t(H ζId)]Γ exp[ t(H ζId) ]dt. Second, the previous result can be expressed as kβ/2(θk θ ) N(0, αΣ). Hence, the fastest rate of convergence is obtained when β = 1 for which we recover the classical 1/ k-rate of a Monte Carlo estimate. In this case, the coefficient α should be chosen large enough to ensure the convergence through the condition H Id/(2α) 0, but also such that the covariance matrix αΣ is small. The choice of α is discussed in the next section and should be replaced with a matrix gain. 3 The asymptotics of conditioned stochastic gradient descent This Section first presents practical optimization schemes that fall in the framework of conditioned SGD. Then it contains our main results, namely the weak convergence and asymptotic optimality. Another result of independent interest dealing with the almost sure convergence of the gradients and the iterates is also provided. Published in Transactions on Machine Learning Research (08/2023) 3.1 Framework and Examples We introduce the general framework of conditioned SGD as an extension of the standard SGD presented in Section 2. It is defined by the following update rule, for k 0, θk+1 = θk γk+1Ckg(θk, ξk+1), (2) where the conditioning matrix Ck S++ d (R) is a Fk-measurable real symmetric positive definite matrix so that the search direction always points to a descent direction. In convex optimization, inverse of the Hessian is a popular choice but (1) it may be hard to compute, (2) it is not always positive definite and (3) it may increase the noise of SGD especially when the Hessian is ill-conditioned. Quasi-Newton. These methods build approximations of the Hessian Ck 2F(θk) 1 with gradient-only information, and are applicable for convex and nonconvex problems. For scalability issue, variants with limited memory are the most used in practice (Liu & Nocedal, 1989). Following Newton s method idea with the secant equation, the update rule is based on pairs (sk, yk) tracking the differences of iterates and stochastic gradients, i.e., sk = θk+1 θk and yk = g(θk+1, ξk+1) g(θk, ξk+1). Let ρk = 1/(s k yk) then the Hessian updates are Ck+1 = (I ρkyks k ) Ck(I ρkyks k ) + ρksks k . In the deterministic setting, the BFGS update formula above is well-defined as long as s k yk > 0. Such condition preserves positive definite approximations and may be obtained in the stochastic setting by replacing the Hessian matrix with a Gauss-Newton approximation and using regularization. Adaptive methods and Diagonal scalings. These methods adapt locally to the structure of the optimization problem by setting Ck as a function of past stochastic gradients. General adaptive methods differ in the construction of the conditioning matrix and whether or not they add a momentum term. Using different representations such as dense or sparse conditioners also modify the properties of the underlying algorithm. For instance, the optimizers Adam and RMSProp maintain an exponential moving average of past stochastic gradients with a factor τ (0, 1) but fail to guarantee Ck+1 Ck. Such behaviour can lead to large fluctuations and prevent convergence of the iterates. Instead, Ada Grad and AMSGrad ensure the monotonicity Ck+1 Ck. Optimizer Gradient matrix Gk+1 m Ada Full Gk + gkg k 0 Ada Norm Gk + gk 2 2 0 Ada Diag Gk + diag(gkg k ) 0 RMSProp τGk + (1 τ)diag(gkg k ) 0 Adam [τGk + (1 τ)diag(gkg k )]/(1 τ k) m AMSGrad [τGk + (1 τ)diag(gkg k )]/(1 τ k) m Table 1: Adaptive Gradient Methods. Denote by gk = g(θk, ξk+1) and m [0, 1) a momentum parameter. General adaptive gradient methods are defined by: θk+1 = θk γk+1Ckˆgk, ˆgk = mˆgk 1 + (1 m)gk. Different optimizers are summarized in Table 1 above. They all rely on a gradient matrix Gk which accumulates the information of stochastic gradients. The conditioning matrix is equal to Ck = G 1/2 k except for AMSGrad which uses Ck = max{Ck 1; G 1/2 k }. Starting from G0 = δI with δ > 0, Gk+1 is updated either in a dense or sparse (diagonal) manner or using an exponential moving average. Note that conditioned SGD methods also include schemes with general estimation of the matrix Ck such as Hessian sketching (Gower et al., 2016) or Jacobian sketching (Gower et al., 2021). A common assumption made in the literature of adaptive methods is that conditioning matrices are wellbehaved in the sense that their eigenvalues are bounded in a fixed interval. This property is easy to check for diagonal matrices and can always be implemented in practice using projection. Published in Transactions on Machine Learning Research (08/2023) 3.2 Main result Similarly to standard SGD, it is interesting to search for an appropriate rescaled process to obtain some convergence rate and asymptotic normality results. In fact the only additional assumption needed, compared to SGD, is the almost sure convergence of the sequence (Ck)k 0. This makes Theorem 1 a particular case of the following Theorem which is the main result of the paper (the proof is given in Appendix A.1). Theorem 2 (Weak convergence of Conditioned SGD). Let (θk)k 0 be obtained by conditioned SGD (2). Suppose that Assumptions 1, 2, 3, 4, 5 are fulfilled and that θk θ almost surely. If moreover, Ck C S++ d (R) almost surely and all the eigenvalues of (CH ζI) are positive with ζ = 1{β=1}/2α, it holds that 1 γk (θk θ ) N(0, ΣC), as k , where ΣC satisfies: (CH ζId) ΣC + ΣC (CH ζId) = CΓC . Sketch of the proof. The idea of the proof is to rely on the following bias-variance decomposition. Remark that the difference k = θk θ is subjected to the iteration: k+1 = k γk+1Ck F(θk) + γk+1Ckwk+1, k 0. In a similar spirit as in Delyon (1996), we use the Taylor approximation F(θk) = F(θ ) + H(θk θ ) + o(θk θ ) H(θk θ ) to define the following auxiliary linear stochastic algorithm which carries the same variance as the main algorithm, e k+1 = e k γk+1K e k + γk+1Ckwk+1, k 1, where K = CH. As a first step we establish the weak convergence of e k+1 using discrete martingale tools. Note that the analysis is made possible because the matrix K is fixed along this algorithm. As a second step, we prove that the difference ( k e k), which represents some bias term, is negligible. Comparison with previous works. Theorem 2 stated above is comparable to Theorem 1 given in Pelletier (1998b). However, our result on the weak convergence cannot be recovered from the one of Pelletier (1998b) due to their Assumption (A1.2) about convergence rates. Indeed, this assumption would require that the sequence (Ck)k 0 converges towards C faster than γk. This condition is either hardly meet in practice or difficult to check. Unlike this prior work, our result only requires the almost sure convergence of the sequence (Ck)k 0. In a more restrictive setting of convex objective and online learning framework, i.e. in which data becomes available in a sequential order, another way to obtain the weak convergence of the rescaled sequence of iterates (θk θ )/ γk is to rely on the results of Boyer & Godichon-Baggioni (2022). However, once again, their work rely on a particular convergence rate for the matrix sequence (Ck)k 0. This implies the derivation of an additional result on the almost sure convergence rate of the iterates. To overcome all these issues, we show in Appendix B that our conditions on the matrices Ck are easily satisfied in common situations. 3.3 Asymptotic optimality of Conditioned SGD The best conditioning matrix C that could be chosen regarding the asymptotic variance is specified in the next proposition whose proof is given in the supplementary material (Appendix C.3). Proposition 1 (Optimal choice). The choice C = H 1 is optimal in the sense that ΣC ΣC for all C CH. Moreover, we have ΣC = H 1ΓH 1. Another remarkable result, which directly follows from the Theorem 2 is now stated as a corollary. Corollary 1 (Asymptotic optimality). Under the assumptions of Theorem 2, if γk = 1/k and C = H 1, then k(θk θ ) N(0, H 1ΓH 1), as k . Published in Transactions on Machine Learning Research (08/2023) Moreover, let (Z1, . . . , Zd) N(0, Id) and (λk)k=1,...,d be the eigenvalues of the matrix H 1/2ΓH 1/2, we have the convergence in distribution: k(F(θk) F(θ )) k=1 λk Z2 k, as k . This result shows the success of the proposed approach as the asymptotic variance is the optimal one. It provides the user a practical choice for the sequence of rate, γk = 1/k and also removes the assumption that 2αH Id which is usually needed in SGD (see Theorem 1). Concerning the almost sure convergence of the conditioning matrices, we provide in Appendix B an explicit way to ensure that Ck H 1. The above statement also provides insights about the convergence speed. It claims that the convergence rate of F(θk) towards the optimum F(θ ), in 1/k, is faster than the convergence rate of the iterates, in 1/ k. Another important feature, which is a consequence of Proposition 1, is that the eigenvalues (λk)k=1,...,d that appear in the limiting distribution are the smallest ones among all the other possible version of conditioned SGD (defined by the matrix C). 3.4 Convergence of the iterates (θk) of Conditioned SGD To apply both Theorem 2 and Corollary 1, it remains to check the almost sure convergence of the iterates. In a non-convex setting, the iterates of stochastic first-order methods can only reach local optima, i.e. the iterates are expected to converge to the following set S = {θ Rd : F(θ) = 0}. Going in this direction, we first prove the almost sure convergence of the gradients towards zero for general conditioned SGD methods under mild assumptions. This theoretical result may be of independent interest. Under a condition on S, one may uniquely identify a limit point θ and consider the event {θk θ } which is needed for the weak convergence results. The next analysis is based on classical assumptions which are used in the literature to obtain the convergence of standard SGD. Assumption 6 (L-smooth). L > 0 : θ, η Rd, F(θ) F(η) 2 L θ η 2. Assumption 7 (Lower bound). F R : θ Rd, F F(θ). To handle the noise of the stochastic estimates, we consider a weak growth condition, related to the notion of expected smoothness as introduced in Gower et al. (2019) (see also Gazagnadou et al. (2019); Gower et al. (2021)). In particular, we extend the condition of Gower et al. (2019) to our general context in which the sampling distributions are allowed to change along the algorithm. Assumption 8 (Growth condition). With probability 1, there exist 0 L, σ2 < such that for all θ Rd, k N, E g(θ, ξk+1) 2 2|Fk 2L(F(θ) F ) + σ2. This almost-sure bound on the stochastic noise E g(θ, ξk) 2 2|Fk 1 is key in the analysis of the conditioned SGD algorithm. This weak growth condition on the stochastic noise is general and can be achieved in practice with a general Lemma available in the supplement (Appendix C.4). Note that Assumption 8, often referred to as a growth condition, is mild since it allows the noise to be large when the iterate is far away from the optimal point. In that aspect, it contrasts with uniform bounds of the form E g(θk, ξk+1) 2 2|Fk σ2 for some deterministic σ2 > 0 (see Nemirovski et al. (2009); Nemirovski & Yudin (1983); Shalev-Shwartz et al. (2011)). Observe that such uniform bound is recovered by taking L = 0 in Assumption 8 but cannot hold when the objective function F is strongly convex (Nguyen et al., 2018). Besides, fast convergence rates have been derived in Schmidt & Roux (2013) under the strong-growth condition: E[ g(θ, ξk+1) 2 2|Fk] M F(θ) 2 2 for some M > 0. Similarly to our growth condition, Bertsekas & Tsitsiklis (2000) and Bottou et al. (2018) performed an analysis under the condition E[ g(θ, ξk+1) 2 2|Fk] M F(θ) 2 2 + σ2 for M, σ2 > 0. Under Assumptions 6 and 7, we have F(θ) 2 2 2L (F(θ) F(θ )) (Gower et al., 2019, Proposition A.1) so our growth condition is less restrictive. If F satisfies the Polyak-Lojasiewicz condition (Karimi et al., 2016), then our growth condition becomes a bit stronger. Another weak growth condition has been used for a non-asymptotic study in Moulines & Bach (2011). The success of conditioned SGD relies on the following extended Robbins-Monro condition which ensures a control on the eigenvalues of the conditioning matrices. Published in Transactions on Machine Learning Research (08/2023) Assumption 9 (Eigenvalues). Let (µk)k 1 and (νk)k 1 be positive sequences such that: k 1, µk Id Ck 1 νk Id; P k γkνk = + ; P k(γkνk)2 < + ; lim supk νk/µk < a.s. The last condition deals with the ratio (νk/µk) which may be seen as a conditioned number and ensures that the matrices Ck are well-conditioned. The following Theorem reveals that all these assumptions are sufficient to ensure the almost sure convergence of the gradients towards zero. Theorem 3 (Almost sure convergence). Suppose that Assumptions 1, 6, 7, 8, 9 are fulfilled. Then (θk)k 0 obtained by conditioned SGD (2) satisfies F(θk) 0 as k a.s. Other convergence results concerning the sequence of iterates towards global minimizers may be obtained by considering stronger assumptions such as convexity or that F is coercive and the level sets of stationary point S {θ, F(θ) = y} are locally finite for every y Rd (see Gadat et al. (2018)). In our analysis, the proof of Theorem 3 reveals that θk+1 θk 0 in L2 and almost surely. Thus, as soon as the stationary points are isolated, the sequence of iterates will converge towards a unique stationary point θ Rd. This result is stated in the next Corollary. Corollary 2 (Almost sure convergence). Under the assumptions of Theorem 3, assume that F is coercive and let (θk)k 0 be the sequence of iterates obtained by the conditioned SGD (2), then d(θk, S) 0 as k . In particular, if S is a finite set, (θk) converges to some θ S. 4 Asymptotic optimality in Adaptive importance sampling The aim of this section is to demonstrate that statistical efficiency can be ensured in variational inference problems through the combination of adaptive importance sampling and conditioned SGD. While wellknown results regarding the asymptotic optimality of maximum likelihood estimates (MLE) obtained from conditioned SGD (see for instance Amari (1998) or Bercu et al. (2020)) are initially revisited, attention is subsequently shifted towards the variational inference topic relying on adaptive sampling schemes methods. A novel result is then presented, asserting that even within this challenging framework, conditioned SGD allows for the recovery of a certain statistical efficiency. 4.1 Maximum likelihood estimation Assume that (Xk)k 1 is an independent sequence of random variables with distribution q . Consider a parametric family {qθ : θ Θ} from which we aim to obtain an estimate of q . We further assume that the model is well-specified, i.e. q = qθ for some θ Θ. The MLE is given by ˆθn arg max θ Θ i=1 log(qθ(Xi)). Under suitable condition (van der Vaart, 1998), it is well known that ˆθn is efficient, meaning it is asymptotically unbiased and has the smallest achievable variance. The Cramer-Rao bound is given by the inverse of the Fisher information matrix, denoted by I 1 and defined as I = Z θ log(qθ ) θ log(qθ ) qθ dλ. Unfortunately, the estimate ˆθn is often unknown in closed-form, requiring the use of a sequential procedure for approximation. This raises the further question of whether the estimate obtained through the sequential procedure achieves the efficiency bound. When using standard SGD without conditioning, the update rule is θk+1 = θk γk+1 log(qθk(Xk+1)). However, the optimal variance bound is not achieved in this case. To recover efficiency, one can rely on conditioned SGD, incorporating a conditioning matrix that estimates the inverse of the Hessian. In light of the definition of the Fisher information matrix, the conditioning matrix can be estimated iteratively using at each step a new sample Xk+1 as follows Ik+1 = (1 γk+1)Ik + γk+1 log(qθk(Xk)) θ log(qθk(Xk)) Published in Transactions on Machine Learning Research (08/2023) and then relying on the CSGD algorithm with update rule θk+1 = θk γk+1I 1 k log(qθk(Xk+1)). As a consequence of Theorem 2, under stipulated assumptions, one can recover the optimal bound I 1 as the asymptotic variance of 4.2 Adaptive importance sampling Consider the variational inference problem where the aim is to approximate a target distribution q = qθ based on a family of density {qθ : θ Θ}. Unlike the previous statistical framework, one does not have access to random variables distributed according to q . Instead, one can usually evaluate the target function q . More background about this type of problem might be found in Zhang et al. (2018). In the following, we show that conditioned SGD methods allow to achieve the same variance as the optimal variance described in the previous statistical setting. To the best of our knowledge, this result is novel and has potential implications in variational inference problems using forward KL (or α-divergence) as described in Jerfel et al. (2021) and Section 5.2 in Zhang et al. (2018). Consider the objective function defined as the Kullback-Liebler divergence between a sampler qθ and the target distribution q , i.e., F(θ) = Z log(qθ/q )q dλ. Under regularity conditions, the gradient and Hessian are respectively written as θF(θ) = Eq [ θ log(qθ)] and 2 θF(θ) = Eq [ 2 θ log(qθ)]. Stochastic gradients can be defined using adaptive importance samplingbased estimate as in Delyon & Portier (2018). Given the current iterate θk, one needs to generate Xk+1 from qθk and compute the (unbiased) stochastic gradient gk+1 = vk+1 k+1 where vk+1 = q (Xk+1)/qθk(Xk+1) and k+1 = θ log(qθk(Xk+1)). Based on our almost sure convergence result, one can obtain that θk θ and then deduce Γ = lim k E[gk+1g k+1] = lim k Z θ log(qθk) θ log(qθk) q 2 qθk dλ = I, where the value I for the limit comes from replacing qθk by its limit qθ = q . The choice of the conditioning matrix Ck may be done using an auxiliary algorithm of the following form Ck+1 = (1 γk+1)Ck + γk+1vk+1 k+1 k+1. It can be shown that the sequence of conditioning matrices (Ck) converges to I. Thus, Theorem 2 implies that conditioned SGD is efficient in this framework as it matches the lower bound of the previous less restrictive statistical framework in which qk = q . Similar computations, left for future work, may be performed to investigate if the same optimal variance can be achieved with more general similarity measures such as α-divergences (Daudel et al., 2021). 5 Conclusion and Discussion We derived an asymptotic theory for Conditioned SGD methods in a general non-convex setting. Compared to standard SGD methods, the only additional assumption required to obtain the weak convergence is the almost sure convergence of the conditioning matrices. The use of appropriate conditioning matrices with the help of Hessian estimates is the key to achieve asymptotic optimality. While our study focuses on the weak convergence of the rescaled sequence of iterates - an appropriate tool to deal with efficiency issues since algorithms can be easily compared through their asymptotic variances - it would be interesting to complement our asymptotic results with concentration inequalities. This research direction, left for future work, may be done at the cost of extra assumptions, e.g., strong convexity of the objective function combined with bounded gradients. Furthermore, by using some recent results on the behavior of adaptive gradient methods in non-convex settings (Daneshmand et al., 2018; Staib et al., 2019; Antonakopoulos et al., 2022), another research direction would be to extend the current weak convergence analysis to edge cases where the objective function possesses saddle points. From a practical standpoint, the approach proposed in Appendix B may not be computationally optimal as it requires eigenvalue decomposition. However, conditioned SGD methods and especially stochastic secondorder methods do not actually require the full computation of a matrix decomposition but rely on matrixvector products which may be performed in O(d2) operations. Futhermore, using low-rank approximation Published in Transactions on Machine Learning Research (08/2023) with BFGS algorithm (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970) and its variant L-BFGS (Liu & Nocedal, 1989), those algorithms approximately invert Hessian matrices in O(d) operations. More recently, this technique was extended to the online learning framework (Schraudolph et al., 2007) and a purely stochastic setting (Moritz et al., 2016). Similarly, the different adaptive optimizers presented in Section 3.1 are concerned with both fast computations and high precision. Designing an efficient conditioned SGD algorithm involves a careful trade-offbetween the low-memory storage of the scaling matrix representation Ck and the quality of its approximation of either the inverse Hessian 2F(θ ) 1 or the information brought in by the underlying geometry of the problem. Acknowledgments The authors are grateful to the Associate Editor and three anonymous Reviewers for their many valuable comments and interesting suggestions. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th {USENIX} symposium on operating systems design and implementation ({OSDI} 16), pp. 265 283, 2016. Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251 276, 1998. Kimon Antonakopoulos, Panayotis Mertikopoulos, Georgios Piliouras, and Xiao Wang. Adagrad avoids saddle points. In International Conference on Machine Learning, pp. 731 771. PMLR, 2022. Jonathan Baxter and Peter L Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319 350, 2001. Michel Benaïm. Dynamics of stochastic approximation algorithms. In Seminaire de probabilites XXXIII, pp. 1 68. Springer, 1999. Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media, 2012. Bernard Bercu, Antoine Godichon, and Bruno Portier. An efficient stochastic newton algorithm for parameter estimation in logistic regressions. SIAM Journal on Control and Optimization, 58(1):348 367, 2020. Dimitri P Bertsekas and John N Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627 642, 2000. M Émile Borel. Les probabilités dénombrables et leurs applications arithmétiques. Rendiconti del Circolo Matematico di Palermo (1884-1940), 27(1):247 271, 1909. Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. Siam Review, 60(2):223 311, 2018. Claire Boyer and Antoine Godichon-Baggioni. On the asymptotic rate of convergence of stochastic newton algorithms and their weighted averaged versions. Computational Optimization and Applications, pp. 1 52, 2022. Charles George Broyden. The convergence of a class of double-rank minimization algorithms 1. general considerations. IMA Journal of Applied Mathematics, 6(1):76 90, 1970. Richard H Byrd, Samantha L Hansen, Jorge Nocedal, and Yoram Singer. A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008 1031, 2016. Published in Transactions on Machine Learning Research (08/2023) Stephan Clémençon, Patrice Bertail, Emilie Chautru, and Guillaume Papa. Optimal survey schemes for stochastic gradient descent with applications to m-estimation. ESAIM: Probability and Statistics, 23: 310 337, 2019. Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. In International Conference on Machine Learning, pp. 1155 1164. PMLR, 2018. Kamélia Daudel, Randal Douc, and François Portier. Infinite-dimensional gradient-based descent for alphadivergence minimisation. The Annals of Statistics, 49(4):2250 2270, 2021. doi: 10.1214/20-AOS2035. URL https://doi.org/10.1214/20-AOS2035. Bernard Delyon. General results on the convergence of stochastic algorithms. IEEE Transactions on Automatic Control, 41(9):1245 1255, 1996. Bernard Delyon and François Portier. Asymptotic optimality of adaptive importance sampling. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3138 3148. Curran Associates Inc., 2018. Bernard Delyon and François Portier. Safe adaptive importance sampling: A mixture approach. The Annals of Statistics, 49(2):885 917, 2021. Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and markov chains. Annals of Statistics, 48(3):1348 1382, 2020. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ ml. John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(Jul):2121 2159, 2011. Marie Duflo. Random iterative models, volume 34. Springer Science & Business Media, 1st edition, 2013. Vaclav Fabian. On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, 39(4):1327 1332, 1968. Vaclav Fabian. Asymptotically efficient stochastic approximation; the rm case. The Annals of Statistics, 1 (3):486 495, 1973. Roger Fletcher. A new approach to variable metric algorithms. The computer journal, 13(3):317 322, 1970. Sébastien Gadat and Fabien Panloup. Optimal non-asymptotic bound of the ruppert-polyak averaging without strong convexity. ar Xiv preprint ar Xiv:1709.03342, 2017. Sébastien Gadat, Fabien Panloup, Sofiane Saadane, et al. Stochastic heavy ball. Electronic Journal of Statistics, 12(1):461 529, 2018. Nidham Gazagnadou, Robert Gower, and Joseph Salmon. Optimal mini-batch and step sizes for saga. In International conference on machine learning, pp. 2142 2150. PMLR, 2019. Donald Goldfarb. A family of variable-metric methods derived by variational means. Mathematics of computation, 24(109):23 26, 1970. Robert Gower, Donald Goldfarb, and Peter Richtárik. Stochastic block bfgs: Squeezing more curvature out of data. In International Conference on Machine Learning, pp. 1869 1878. PMLR, 2016. Robert M Gower, Peter Richtárik, and Francis Bach. Stochastic quasi-gradient methods: Variance reduction via jacobian sketching. Mathematical Programming, 188(1):135 192, 2021. Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. Sgd: General analysis and improved rates. In International Conference on Machine Learning, pp. 5200 5209. PMLR, 2019. Published in Transactions on Machine Learning Research (08/2023) P. Hall and C.C. Heyde. Martingale Limit Theory and Its Application. Probability and mathematical statistics. Academic Press, 1980. ISBN 9781483240244. URL https://books.google.fr/books?id= wd Lajg EACAAJ. David Harrison Jr and Daniel L Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of environmental economics and management, 5(1):81 102, 1978. Ghassen Jerfel, Serena Wang, Clara Wong-Fannjiang, Katherine A Heller, Yian Ma, and Michael I Jordan. Variational refinement for importance sampling using the forward kullback-leibler divergence. In Uncertainty in Artificial Intelligence, pp. 1819 1829. PMLR, 2021. Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pp. 1531 1538, 2002. Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795 811. Springer, 2016. Hassan K Khalil. Nonlinear systems; 3rd ed. Prentice-Hall, Upper Saddle River, NJ, 2002. URL https: //cds.cern.ch/record/1173048. Jack Kiefer, Jacob Wolfowitz, et al. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462 466, 1952. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Harold J Kushner and Hai Huang. Rates of convergence for stochastic approximation type algorithms. SIAM Journal on Control and Optimization, 17(5):607 617, 1979. Harold Joseph Kushner and Dean S Clark. Stochastic approximation methods for constrained and unconstrained systems. 1978. Yann A Le Cun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pp. 9 48. Springer, 2012. Rémi Leluc and François Portier. Sgd with coordinate sampling: Theory and practice. Journal of Machine Learning Research, 23(342):1 47, 2022. URL http://jmlr.org/papers/v23/21-1240.html. Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1):503 528, 1989. Philipp Moritz, Robert Nishihara, and Michael Jordan. A linearly-convergent stochastic l-bfgs algorithm. In Artificial Intelligence and Statistics, pp. 249 258. PMLR, 2016. Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp. 451 459, 2011. Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574 1609, 2009. Arkadi Semenovich Nemirovski and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983. Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013. Mikhail Borisovich Nevelson and Rafail Zalmanovich Khas minski ı. Stochastic approximation and recursive estimation, volume 47. American Mathematical Soc., 1976. Published in Transactions on Machine Learning Research (08/2023) Lam Nguyen, Phuong Ha Nguyen, Marten Dijk, Peter Richtárik, Katya Scheinberg, and Martin Takác. Sgd and hogwild! convergence without the bounded gradients assumption. In International Conference on Machine Learning, pp. 3750 3758. PMLR, 2018. Guillaume Papa, Pascal Bianchi, and Stéphan Clémençon. Adaptive sampling for incremental optimization using stochastic gradient descent. In International Conference on Algorithmic Learning Theory, pp. 317 331. Springer, 2015. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825 2830, 2011. Mariane Pelletier. On the almost sure asymptotic behaviour of stochastic algorithms. Stochastic processes and their applications, 78(2):217 244, 1998a. Mariane Pelletier. Weak convergence rates for stochastic approximation with application to multiple targets and simulated annealing. Annals of Applied Probability, pp. 10 44, 1998b. Boris T Polyak. A new method of stochastic approximation type. Avtomatika i telemekhanika, (7):98 107, 1990. Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838 855, 1992. Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018. Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pp. 400 407, 1951. Herbert Robbins and David Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Optimizing methods in statistics, pp. 233 257. Elsevier, 1971. Jerome Sacks. Asymptotic distribution of stochastic approximation procedures. The Annals of Mathematical Statistics, 29(2):373 405, 1958. Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. ar Xiv preprint ar Xiv:1308.6370, 2013. Nicol N Schraudolph, Jin Yu, and Simon Günter. A stochastic quasi-newton method for online convex optimization. In Artificial intelligence and statistics, pp. 436 443. PMLR, 2007. Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated subgradient solver for svm. Mathematical programming, 127(1):3 30, 2011. David F Shanno. Conditioning of quasi-newton methods for function minimization. Mathematics of computation, 24(111):647 656, 1970. Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2014. Matthew Staib, Sashank Reddi, Satyen Kale, Sanjiv Kumar, and Suvrit Sra. Escaping saddle points with adaptive gradient methods. In International Conference on Machine Learning, pp. 5956 5965. PMLR, 2019. Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. Tijmen Tieleman, Geoffrey Hinton, et al. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26 31, 2012. Published in Transactions on Machine Learning Research (08/2023) A. W. van der Vaart. Asymptotic statistics, volume 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 1998. J. H. Venter. An extension of the robbins-monro procedure. The Annals of Mathematical Statistics, 38(1): 181 190, 1967. Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pp. 1299 1309, 2018. CZ Wei. Multivariate adaptive stochastic approximation. The Annals of Statistics, 15(3):1115 1130, 1987. Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229 256, 1992. Mishael Zedek. Continuity and location of zeros of linear combinations of polynomials. Proceedings of the American Mathematical Society, 16(1):78 84, 1965. Cheng Zhang, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. Advances in variational inference. IEEE transactions on pattern analysis and machine intelligence, 41(8):2008 2026, 2018. Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the twenty-first international conference on Machine learning, pp. 116, 2004. Wanrong Zhu, Xi Chen, and Wei Biao Wu. Online covariance matrix estimation in stochastic gradient descent. Journal of the American Statistical Association, pp. 1 12, 2021. Published in Transactions on Machine Learning Research (08/2023) Appendix: Asymptotic Analysis of Conditioned Stochastic Gradient Descent Appendix A contains the mathematical proofs of the main results while Appendix B is dedicated to a practical procedure and numerical experiments for illustration purposes. Appendix C gathers some technical auxiliary results and additional propositions. A Proofs of main results 16 A.1 Proof of the weak convergence (Theorem 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 A.2 Proof of the almost sure convergence (Theorem 3) . . . . . . . . . . . . . . . . . . . . . . . . 26 A.3 Proof of Corollary 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 B Practical procedure 28 B.1 Construction of the conditioning matrix Ck . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 B.2 Numerical illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 C Auxiliary results 32 C.1 Robbins-Siegmund Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 C.2 Auxiliary lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 C.3 Additional propositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 C.4 Auxiliary results on expected smoothness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 A Proofs of main results A.1 Proof of the weak convergence (Theorem 2) For any matrix A Rd d, we denote by A = max u 2=1 Au 2 the operator norm associated to the Euclidian norm and by ρ(A) the spectral radius of A, i.e., ρ(A) = max{|λ1|, . . . , |λn|} where λ1, . . . , λn are the eigenvalues of A. We also introduce λmin(A) = min{|λ1|, . . . , |λn|}. Note that when A is symmetric A = ρ(A) and recall that the spectral radius is a (submultiplicative) norm on the real linear space of symmetric matrices. Structure of the proof. In virtue of Assumption 5, there exist δ, ε > 0 such that almost surely sup k 0 E[ wk+1 2+δ 2 |Fk]1{ θk θ 2 ε} < . (3) An important event in the following is Ak = { θk θ 2 ε, Ck < 2 C , Γk 2 Γ }. By assumption, this event has probability going to 1. Introduce the difference and remark that k is subjected to the iteration: k+1 = k γk+1Ck F(θk) + γk+1Ckwk+1, k 0, Published in Transactions on Machine Learning Research (08/2023) with wk+1 = F(θk) g(θk, ξk+1). We have by assumption that Ck C almost surely and we can define K = limk Ck H = CH. The proof relies on the introduction of an auxiliary stochastic algorithm which follows the iteration: e k+1 = e k γk+1K e k + γk+1Ckwk+11Ak, k 0 The previous algorithm is a linear approximation of the algorithm that defines k in the sense that F(θk) = F(θ ) + H(θk θ ) + o(θk θ ) H(θk θ ) has been linearly expanded around θ . Writing k = e k + ( k e k), and invoking the Slutsky lemma, the proof will be complete as soon as we obtain that γ 1/2 k e k N(0, Σ), (4) ( k e k) = o P(γ1/2 k ). (5) H the positive square root of the real symmetric positive definite matrix H and consider the transformation Θk = H e k which satisfies Θk+1 = Θk γk+1 e KΘk + γk+1 HCkwk+11Ak, k 1, where e K = H S++ d (R) is a real symmetric positive definite matrix. The sequence (Θk)k 0 is easier to study than e k because contrary to e K, the matrix K = CH is not symmetric in general unless C and H commute. In view of Assumption 3, the eigenvalues of e K are real and positive. Denote by λm (resp. λM) the smallest (resp. the largest) eigenvalue of e K, i.e., λm = λmin( e K), λM = λmax( e K). Because CH is similar to e K, they share the same eigenvalues. Since by assumption, the eigenvalues of (CH ζId) are positive, we have 2αλm > 1{β=1}. For all k 1, introduce the real symmetric matrix Ak = I γk e K. Observe that all these matrices commute, i.e., for any i, j 0, we have Ai Aj = Aj Ai. For any k, n 0, denote the matrices product Πn,k = An . . . Ak+1 if k < n Πn,k = Id if k n, Πn = Πn,0 Since the matrices Ak commute, we have Π n,k = Πn,k is also real symmetric. Step 1. Proof of Equation (4). The random process (Θk)k 0 follows the recursion equation Θk = AkΘk 1 + γk HCk 1wk1Ak 1. We have by induction Θn = ΠnΘ0 + HCk 1wk1Ak 1, and the rescaled process is equal to Θn γn = Πn γn Θ0 | {z } initial error Yn HCk 1wk1Ak 1 | {z } sampling error Mn Published in Transactions on Machine Learning Research (08/2023) Bound on the initial error. Define τn = Pn k=1 γk the partial sum of the learning rates. Since Πn is symmetric, we have ρ(ΠnΘ0) ρ(Πn) Θ0 2. In view of Lemma 5, since γk 0, there exists j 1 such that ρ(Πn) ρ(Πj) exp( λm(τn τj)). Therefore, the initial error is bounded by ρ(Yn) ρ(Πj) exp(λmτj) Θ0 2 exp(dn) with dn = λmτn log( γn). Using Lemma 6, we can treat the two cases β < 1 and β = 1. On the one hand, if β < 1 then we always have dn . On the other hand, if β = 1, we have dn 1 2 γλm log(n) and the condition 2αλm 1 > 0 ensures dn . In both cases we get exp(dn) 0 and the initial error vanishes to 0. Weak convergence of the sampling error. Consider the random process Mn = γ 1/2 n HCk 1wk1Ak 1. Note that θk, Ak and Ck are Fk-measurable. As a consequence, Mn is a sum of martingale increments and we may rely on the following central limit theorem for martingale arrays. Theorem 4. (Hall & Heyde, 1980, Corollary 3.1) Let (Wn,i)1 i n, n 1 be a triangular array of random vectors such that E[Wn,i | Fi 1] = 0, for all 1 i n, (6) i=1 E[Wn,i W n,i | Fi 1] V 0, in probability, (7) i=1 E[ Wn,i 21{ Wn,i >ε} | Fi 1] 0, in probability, (8) then, Pn i=1 Wn,i N(0, V ), as n . We start by verifying (7). Let Dk = HCk 1Γk 1CT k 1 H1Ak 1 Sd(R). The quadratic variation of Mn is given by k=1 γ2 kΠn,k DkΠ n,k. First we can check that Σn is bounded. Using the triangle inequality and since the operator norm is submultiplicative, we have k=1 γ2 k Πn,k DkΠT n,k γ 1 n k=1 γ2 k Dk Πn,k 2 = γ 1 n k=1 γ2 k Dk ρ(Πn,k)2, where we use in the last equality that Πn,k is real symmetric so Πn,k = ρ(Πn,k). On the event Ak 1, the matrices Ck 1 and Γk 1 are bounded as Ck 1 2 C and Γk 1 2 Γ leading to the following bound for the matrix Dk, HCk 1Γk 1CT k 1 H Γk 1 Ck 1 21Ak 1 8 H Γ C 2 = UD. Published in Transactions on Machine Learning Research (08/2023) It follows that k=1 γ2 kρ(Πn,k)2. In view of Lemma 5, we shall split the summation from k = 1, . . . , j and k = j + 1, . . . , n as k=1 γ2 kρ Πn,k 2 = γ 1 n k=1 γ2 kρ Πn,k 2 + γ 1 n k=j+1 γ2 kρ Πn,k 2 k=1 γ2 kρ Πn,k 2 i=k+1 (1 λmγi)2 For the first term an, we have for all k = 1, . . . , j ρ(Πn,k) ρ(Πn,j) i=j+1 (1 λmγi) exp( λm(τn τj)), which implies since (γk) is decreasing with γ1 = α that k=1 γ2 kρ Πn,k 2 ατj exp( 2λm(τn τj)). Therefore, similarly to the initial error term, we get an ατj exp(2λmτj)) exp(dn) with dn = 2λmτn log(γn), and the condition 2αλm 1 > 0 ensures dn so that an goes to 0 and is almost surely bounded by Ua. For the second term bn, we can apply Lemma 3 and need to distinguish between the two cases: (β = 1) If γn = α/n, since 2αλm > 1, we can apply Lemma 3 (p = 1, m = 2, λ = λmα, xj = 0, εk = α2) and obtain 2αλm 1 = Ub. (β < 1) If γn = γ/nβ, we deduce the same as before because λm > 0. Finally in both cases, we get Σn UD (Ua + Ub) . (9) We now derive the limit of Σn. We shall use a recursion equation to recover a stochastic approximation scheme. Note that k=1 γ2 kΠn,k DkΠT n,k (10) = γ2 n Dn + An k=1 γ2 kΠn 1,k DkΠT n 1,k and recognize γnΣn = γ2 n Dn + γn 1AnΣn 1A n . Published in Transactions on Machine Learning Research (08/2023) Replacing the symmetric matrix An = I γn e K, we get (because Σn is bounded almost surely) γnΣn = γ2 n Dn + γn 1(I γn e K)Σn 1(I γn e K) = γ2 n Dn + γn 1 h Σn 1 γnΣn 1 e K γn e KΣn 1 + O(γ2 n) i . Divide by γn to obtain Σn = γn Dn + γn 1 h Σn 1 γn( e KΣn 1 + Σn 1 e K) + O(γ2 n) i , and we recognize a stochastic approximation scheme Σn = Σn 1 γn h e KΣn 1 + Σn 1 e K Dn i + γn 1 γn γn Σn 1 + O(γn 1γn + |γn 1 γn|) Recall that when β < 1 we have 1 γn 1 γn 1 0, i.e., γn 1 γn γn = o(γn). (β = 1) If γn = α/n we get Σn = Σn 1 α e KΣn 1 + Σn 1 e K 1 Σn = Σn 1 α Σn 1 + Σn 1 (β < 1) If γn = α/nβ we get Σn = Σn 1 γn h e KΣn 1 + Σn 1 e K Dn i + o(γn). Recall that ζ = 1{β=1}/(2α) and define e Kζ = e K ζI, so that in both cases, the recursion equation becomes Σn = Σn 1 γn h e KζΣn 1 + Σn 1 e K ζ Dn i + o(γn). We can vectorize this equation. The vectorization of an m n matrix A = (ai,j), denoted vec(A), is the mn 1 column vector obtained by stacking the columns of the matrix A on top of one another: vec(A) = [a1,1, . . . , am,1, a1,2, . . . , am,2, . . . , a1,n, . . . , am,n]T . Applying this operator to our stochastic approximation scheme gives vec(Σn) = vec(Σn 1) γn h vec e KζΣn 1 + Σn 1 e K ζ vec(Dn) i + o(γn). Denote by the Kronecker product, we have the following property vec KζΣn 1 + Σn 1K ζ = Id Kζ + K ζ Id vec(Σn 1). Define D as the almost sure limit of Dn, i.e. D = lim n Dn = Introduce vn = vec(Σn) and Q = Id e Kζ + e Kζ Id . We have almost surely vn = vn 1 γn (Qvn 1 vec(D)) + γnvec(Dn D) + o(γn) = vn 1 γn (Qvn 1 vec(D)) + εnγn Published in Transactions on Machine Learning Research (08/2023) where εn 0 almost surely. This is a stochastic approximation scheme with the affine function h(v) = Qv vec(D) for v Rd2. Let v be the solution of h(v) = 0 which is well defined since Q = Id e Kζ + e K ζ Id is invertible. Indeed, the eigenvalues of Q are µi + µj, 1 i, j d, where the µi, i = 1, . . . , d are the eigenvalues of e Kζ. Equivalently, the eigenvalues of Q are of the form (λi ζ) + (λj ζ) where the λi, i = 1, . . . , d are the eigenvalues of e K. Because λm > ζ, we have that Q 0. As a consequence (vn v ) = (vn 1 v ) γn (h(vn 1) h(v )) + εnγn = (vn 1 v ) γn Q (vn 1 v ) + εnγn = Bn (vn 1 v ) + εnγn, with Bn = (Id2 γn Q). By induction, we obtain (vn v ) = (Bn . . . B1) (v0 v ) + k=1 γk (Bn . . . Bk+1) εk, Define λQ = λmin(Q) > 0 and remark that Bn . . . Bk+1 j=k+1 (1 γjλQ). It follows that vn v 2 Bn . . . B1 v0 v 2 + k=1 γk Bn . . . Bk+1 εk 2 j=1 (1 γjλQ) v0 v 2 + j=k+1 (1 γjλQ) εk 2 Applying Lemma 3 we obtain that the right-hand side term goes to 0. The left-hand side term goes to 0 under the effect of the product by definition of (γk)k 1. We therefore conclude that vn v almost surely. From easy manipulation involving vec( ) and , this is equivalent to Σn Σ, where Σ is the solution of the Lyapunov equation ( e K ζI)Σ + Σ( e K ζI) = D. Now we turn our attention to (8). We need to show that almost surely, k=1 γ2 k E[ Πn,k HCk 1wk 2 21{γk Πn,k HCk 1wk 2>εγ1/2 n } | Fk 1]1Ak 1 0. E[γ 1 n γ2 k Πn,k HCk 1wk 2 21{γk Πn,k HCk 1wk 2>εγ1/2 n } | Fk 1] ε δE[(γ 1/2 n γk Πn,k HCk 1wk 2)2+δ | Fk 1] ε δ(γ 1/2 n γk Πn,k HCk 1 2+δE[ wk 2+δ 2 | Fk 1]. Let U(ω) = supk 1 E[ wk 2+δ 2 | Fk 1]1Ak 1 which is almost surely finite by Assumption 5. We get E[γ 1 n γ2 k Πn,k HCk 1wk 2 21{γk Πn,k HCk 1wk 2>εγ1/2 n } | Fk 1]1Ak 1 H C 2+δ U(ω)(γ 1/2 n γkρ(Πn,k))2+δ Published in Transactions on Machine Learning Research (08/2023) Hence by showing that k=1 (γ 1/2 n γkρ(Πn,k))2+δ 0, we will obtain (8). The previous convergence can be deduced from Lemma 3 with p = 1 + δ/2, m = 2 + δ, ϵk = γδ/2 k , checking that (2 + δ)αλm > 1 + δ/2. Step 2. Proof of Equation (5). A preliminary step to the derivation of Equation (5) is to obtain that e k 0 almost surely. For any θ and η in Rd, we have θ 2 = η 2 + 2η (θ η) + θ η 2 implying that for all k 0 E[ Θk+1 2|Fk] = eΘk 2 2γk+1Θ k e KΘk + γ2 k+1 E[ e KΘk Ckwk+11Ak 2|Fk]. Since (wk) is a martingale increment and because on Ak, ρ(Ck) 2ρ(C), we get E[ e KΘk Ckwk+11Ak 2|Fk] = E[ e KΘk 2|Fk] + E[ Ckwk+11Ak 2|Fk] λ2 M Θk 2 + ρ(Ck)2 E[ wk+11Ak 2|Fk] λ2 M Θk 2 + 4ρ(C)2 E[ wk+11Ak 2|Fk], Injecting this bound in the previous equality yields E[ Θk+1 2|Fk] Θk 2(1 + γ2 k+1λ2 M) 2γk+1Θ k e KΘk + 4ρ(C)2γ2 k+1 E[ wk+1 2|Fk]1Ak. Since, using (3), k 0 γ2 k+1 E[ wk+1 2|Fk]1Ak sup k 0 E[ wk+1 2|Fk]1Ak we are in position to apply the Robbins-Siegmund Theorem 6 and we obtain the almost sure convergence of P k γk+1Θ k e KΘk and Θk 2 2 V . Because e K is positive definite, it gives that, with probability 1, P k 0 γk+1 Θk 2 < + , from which, we deduce lim infk Θk 2 = 0. Therefore one can extract a subsequence Θk such that Θk 2 0. Using the above second condition yields V = 0 and we conclude that e k = H 1/2Θk 0. Define the difference Ek = k e k. Since θ 7 2F(θ) is continous at θ , we can apply a coordinate-wise mean value theorem. Indeed, for any θ Rd, we have F(θ) = ( 1F(θ), . . . , d F(θ)) where for all j = 1, . . . , d, the partial derivatives functions j F : Rd R are Lipschitz continuous. Denote by ( j F) : Rd Rd the gradient of the partial derivative j F, i.e., ( j F)(θ) = ( 2 1,j F(θ), . . . , 2 d,j F(θ)). For any θ, η B(θ , ε), there exists ξj Rd such that j F(θ) j F(η) = ( j F)(ξj)(θ η). We construct a Hessian matrix by rows H(ξ) = H(ξ1, . . . , ξd) where the j-th row is equal to ( j F)(ξj) = ( 2 1,j F(ξj), . . . , 2 d,j F(ξj)) 2 1,1F(ξ1) . . . 2 1,d F(ξ1) ... ... ... 2 d,1F(ξd) . . . 2 d,d F(ξd) Published in Transactions on Machine Learning Research (08/2023) and we can write F(θ) F(η) = H(ξ)(θ η). There exists ξk = (ξ(1) k , . . . , ξ(d) k ) with ξ(j) k [θ + Ek, θk] and ξ k = (ξ (1) k , . . . , ξ (d) k ) with ξ (j) k [θ + Ek, θ ] such that F(θ + Ek) F(θk) = H(ξk)e k (12) F(θ + Ek) = H(ξ k)Ek. (13) Let η > 0 such that 2αλm(1 3η) > 1. This choice will come clear at the end of the reasoning. On the one hand, we have Ck C. On the other hand, using Lemma 4, the spectrum of Ck H is real and positive. Hence, we have the convergence of the eigenvalues of Ck H towards the eigenvalues of K = CH. This follows from the definition of eigenvalues as roots of the characteristic polynomial and the fact that the roots of any polynomial P C[X] are continuous functions of the coefficients (Zedek, 1965). Consequently, there exists n1(ω) such that for all k n1(ω), (1 η)λm λmin(Ck H) λmax(Ck H) (1 + η)λM. (14) We can define n2(ω) such that for all k n2(ω) Ak is realized. (15) H 1 Id 0 as k , there is n3(ω) and n4(ω) such that for all k n3(ω) H 1 Id η 1 + η λm λM , (16) and for all k n4(ω), H 1 1. (17) Since γk 0, there is n5 such that for all k n5 γk+1 2ηλm (1 + η)2λ2 M . (18) To use the previous local properties, define n0(ω) = n1(ω) n2(ω) n3(ω) n4(ω) n5 and introduce the set Ej along with its complement Ec j , defined by Ej = {ω : j n0(ω)}. Let δ > 0 and take j 1 large enough such that P(Ec j ) δ. Invoking the Markov inequality, we have for all a > 0 P(γ 1/2 k Ek > a) = P(γ 1/2 k Ek > a, Ej) + P(γ 1/2 k Ek > a, Ec j ) P(γ 1/2 k Ek > a, Ej) + δ γ 1/2 k a 1E[ Ek 1Ej] + δ Because δ is arbitrary, we only need to show that for any value of j 1, ek := E[ Ek 1Ej] = o(γ1/2 k ). To prove this fact, we shall recognize a stochastic algorithm for the sequence ek. Let k j and assume further that Ej is realized. We have, because of (15), Ek+1 = k e k γk+1Ck F(θk) + γk+1K e k. Published in Transactions on Machine Learning Research (08/2023) Introducing e Ek = HEk, we find e Ek+1 = e Ek γk+1 HCk F(θk) + γk+1 and using (12), it comes that e Ek+1 = e Ek γk+1 HCk F(θ + Ek) γk+1 HCk H(ξk)e k + γk+1 = e Ek γk+1 HCk F(θ + Ek) + γk+1 H(K Ck H(ξk))e k. Using Minkowski inequality, we have e Ek+1 e Ek γk+1 HCk F(θ + Ek) + γk+1 H(K Ck H(ξk))e k . We shall now focus on the first term. Still on the set Ej, we have HCk F(θ + Ek) 2 = e Ek 2 2γk+1 e Ek, HCk F(θ + Ek) + γ2 k+1 HCk F(θ + Ek) 2 (19) We have on the one hand using (13) HCk F(θ + Ek) = e Ek, HCk H(ξ k)Ek HCk HEk + e Ek, HCk(H(ξ k) H)Ek Due to (14), the first term satisfies HCk HEk = e Ek, λmin(Ck H) e Ek 2 (1 η)λm e Ek 2 The second term satisfies HCk(H(ξ k) H)Ek = e Ek, H 1 Id) e Ek H 1 Id) e Ek Using Cauchy-Schwarz inequality, the submultiplicativity of the norm, (14) and (16), we have e Ek, H 1 Id) e Ek H 1 Id e Ek 2 ηλm e Ek 2. Finally, it follows that HCk F(θ + Ek) (1 2η)λm e Ek 2 (20) On the other hand using (13), (14) and (17), HCk F(θ + Ek) 2 = HCk H(ξ k)Ek 2 H 1) e Ek 2 λmax(Ck H)2 H 1 2 e Ek 2 (1 + η)2λ2 M e Ek 2 (21) Published in Transactions on Machine Learning Research (08/2023) Putting together (19), (20), (21) and using (18) gives that, on Ej, HCk F(θ + Ek) 2 e Ek 2(1 2γk+1(1 2η)λm + γ2 k+1(1 + η)2λ2 M) e Ek 2(1 2γk+1(1 3η)λm). By the Minkowski inequality and the fact that (1 x)1/2 1 x/2, on Ej, it holds e Ek+1 e Ek (1 2γk+1(1 3η)λm)1/2 + γk+1 H(K Ck H(ξk))e k e Ek (1 γk+1(1 3η)λm) + γk+1 H (K Ck H(ξk))e k Hence, we have shown that for any k j, e Ek 1Ej e Ek 1Ej(1 γk+1(1 3η)λm) + γk+1 H (K Ck H(ξk))1Ej e k . It follows that, for any k j, ek+1 ek(1 γk+1(1 3η)λm) + γk+1 H E[ Uk e k ], with Uk = (K Ck H(ξk))1Ej. Because with probability 1, Uk is bounded, we can apply the Lebesgue dominated convergence theorem to obtain that εk = E[ Uk 2] 0. From the Cauchy-Schwarz inequality, we get E[ Uk e k ] εk q E[ e k 2 2]. On the other hand, we have already shown in (9) that ρ(Σk) = Σk UD (Ua + Ub). Since e k = H 1 γk(Yk + Mk), we have E[ e k 2 2] 2(γk/λm)( Yk 2 2 + E[ Mk 2 2]), where the last term is the leading term and satisfies E[ Mk 2 2]] = E[Tr(Σk)] d E[ρ(Σk)]. Therefore, we have E[ e k 2 2] γk A for some A > 0. Consequently, for all k j, ek+1 ek(1 γk+1(1 3η)λm) + γ3/2 k+1A The condition 2αλm(1 3η) > 1 ensures that we can apply Lemma 3 with (mλ > p), m = 1, p = 1/2, λ = α(1 3η)λm. we finally get lim sup k (ek/γ1/2 k ) = 0. As a consequence, ek = o( γk), which concludes the proof. Since γ 1/2 k H e k N(0, Σ), we have γ 1/2 k e k N(0, eΣ) where eΣ = H 1. Recall that Σ satisfies the Lyapunov equation H ζId)Σ + Σ( Multiplying on the left and right sides by H 1, we get where we recognize the following Lyapunov equation (CH ζId)eΣ + eΣ(CH ζId) = CΓC. Published in Transactions on Machine Learning Research (08/2023) A.2 Proof of the almost sure convergence (Theorem 3) The idea behind the proof of the almost sure convergence is to apply the Robbins-Siegmund Theorem (Theorem 6) (which can be found in Appendix C) in combination with the following key deterministic result. Lemma 1 (Deterministic result). Let F : Rd R be a L-smooth function and (θt) a random sequence obtained by the SGD update rule θt+1 = θt γt+1Ctgt where (γ)t 1 a positive sequence of learning rates and Ct 1 νt Id are such that P t γtνt = . Let ω Ωsuch that the following limits exist: t 0 γt+1νt+1 F(θt(ω)) 2 2 < (ii) X t 1 γt Ct 1(gt 1(ω) F(θt 1(ω))) < then F(θt(ω)) 0 as t . Proof. The proof (and in particular the reasoning by contradiction) is inspired from the proof of Proposition 1 in Bertsekas & Tsitsiklis (2000). For ease of notation we omit the ω in the proof. Note that condition (i) along with P t γtνt = implie that lim inft F(θt) = 0. Now, by contradiction, let ε > 0 and assume that lim sup t F(θt) > ε We have that there is infinitely many t such that F(θt) < ε/2 and also infinitely many t such that F(θt) > ε. It follows that there is infinitely many crossings between the sets {t N : F(θt) < ε/2} and {t N : F(θt) > ε}. A crossing is a collection of indexes Ik = {Lk, Lk +1, . . . , Uk 1} with Lk Uk (Ik = when Lk = Uk) such that for all t Ik, F(θLk 1) < ε/2 F(θt) ε < F(θUk) . Define the following partial Cauchy sequence Rk = PUk t=Lk γt(gt 1 F(θt 1)) and note that condition (ii) implies that Rk 0 as k . For all k 1, ε/2 F(θUk) 2 F(θLk 1) 2 F(θUk) F(θLk 1) 2 L θUk θLk 1 2, where we use that F is L-Lipschitz. Then using the update rule θt θt 1 = γt Ct 1gt 1, we have by sum t=Lk θt θt 1 2 = L t=Lk γt Ct 1gt 1 2 t=Lk γt Ct 1 F(θt 1) 2 + L t=Lk γt Ct 1(gt 1 F(θt 1)) 2 t=Lk γtνt F(θt 1) 2 + L Rk 2 Since in the previous equation F(θt 1) 2 > ε/2, we get t=Lk γtνt F(θt 1) 2 2 + (ε/2)L Rk 2 But since P t 0 γt+1νt+1 F(θt) 2 is finite and limk Rk = 0, the previous upper bound goes to 0 and implies a contradiction. Published in Transactions on Machine Learning Research (08/2023) It remains to show that points (i) and (ii) in Lemma 1 are valid with probability one. Since θ 7 F(θ) is L-smooth, we have the quadratic bound (see Nesterov (2013)) θ, η Rd F(η) F(θ) + F(θ), η θ + L Using the update rule θk+1 = θk γk+1Ckg(θk, ξk+1), we get F(θk+1) F(θk) + F(θk), θk+1 θk + L 2 θk+1 θk 2 2 = F(θk) γk+1 F(θk), Ckg(θk, ξk+1) + L 2 γ2 k+1 Ckg(θk, ξk+1) 2 2. The last term can be upper bounded using the matrix norm and Assumption 9 as Ckg(θk, ξk+1) 2 2 Ck 2 g(θk, ξk+1) 2 2 ν2 k+1 g(θk, ξk+1) 2 2, and we have the inequality F(θk+1) F(θk) γk+1 F(θk), Ckg(θk, ξk+1) + L 2 (γk+1νk+1)2 g(θk, ξk+1) 2 2. Introduce uk = γkνk and vk = γkµk, we have P k 1 vk = + and P k 1 u2 k < + a.s. in virtue of Assumption 9. The random variables F(θk), Ck are Fk-measurable and the gradient estimate is unbiased with respect to Fk. Taking the conditional expectation denoted by Ek leads to Ek F(θk+1) F(θk) γk+1 F(θk), Ek [Ckg(θk, ξk+1)] + L 2 u2 k+1 Ek g(θk, ξk+1) 2 2 = γk+1 F(θk) Ck F(θk) + L 2 u2 k+1 Ek g(θk, ξk+1) 2 2 . On the one hand for the first term, using Assumption 9 , F(θk) Ck F(θk) λmin(Ck) F(θk) 2 2 µk+1 F(θk) 2 2. On the other hand, using Assumption 8, there exist 0 L, σ2 < such that almost surely k N, Ek g(θk, ξk+1) 2 2 2L (F(θk) F ) + σ2. Inject these bounds in the previous inequality and substract F(θ ) on both sides to have Ek [F(θk+1) F ] (1 + LLu2 k+1)(F(θk) F ) vk+1 F(θk) 2 2 + (L/2)u2 k+1σ2. Introduce Vk = F(θk) F , Wk = vk+1 F(θk) 2 2, ak = LLu2 k+1 and bk = (L/2)u2 k+1σ2. These four random sequences are non-negative Fk-measurable sequences with P k ak < and P k bk < almost surely. Moreover we have k N, E [Vk+1|Fk] (1 + ak)Vk Wk + bk. We can apply Robbins-Siegmund Theorem 6 to have k 0 Wk < a.s. (b) Vk a.s. V , E [V ] < . (c) sup k 0 E [Vk] < . Therefore we have the almost sure convergence of the series P vk+1 F(θk) 2 2 which, given that lim supk νk/µk exists, implies that P uk+1 F(θk) 2 2 is finite. Hence we obtain (i) in Lemma 1. We now show that (ii) in Lemma 1 is also valid. The term of interest is a sum of martingale increments. The quadratic variation is given by X t 1 γ2 t Et[ Ct 1(gt 1(ω) f(θt 1(ω))) 2 2] X t 1 γ2 t ν2 t Et[ (gt 1(ω) F(θt 1(ω))) 2 2] t 1 γ2 t ν2 t Et[ gt 1(ω) 2 2] t 1 γ2 t ν2 t (2L(F(θt 1) F ) + σ2). Published in Transactions on Machine Learning Research (08/2023) Now we can use that Vk = F(θk) F a.s. V (which was deduced from Robbins-Siegmund Theorem) to obtain that the previous series converges. Invoking Theorem 2.17 in Hall & Heyde (1980), we obtain (ii) in Lemma 1. Furthermore we can prove that θk+1 θk 0 almost surely and in L2. Indeed, we have E θk+1 θk 2 2 = E γk+1Ckg(θk, ξk+1 2 2 u2 k+1 2L (F(θk) F ) + σ2 . In virtue of the almost sure convergence of Vk = F(θk) F , the last term in parenthesis is upper bounded by a constant so that in view of the convergence of P u2 k+1, we have the convergence of the series P E θk+1 θk 2 2 . We then deduce that E θk+1 θk 2 2 0 and P θk+1 θk 2 2 < + almost surely. In particular, θk+1 θk 0 in L2 and almost surely. The last point follows from the fact that, for every δ > 0, lim n P sup k n θk+1 θk δ δ 2 lim n k n E θk+1 θk 2 2 = 0. A.3 Proof of Corollary 2 First observe that since F is coercive, the convergence of (F(θk)) obtained by Robbins-Siegmund theorem implies that the sequence of iterates (θk)k 0 remains in a compact subset K Rd. Let ε > 0. Since θ 7 d(θ, S) is continuous, the set D(ε) = {θ Rd : d(θ, S) ε} is closed and the set K(ε) = K D(ε) is compact. On this set, the map θ 7 F(θ) 2 is stricly positive and there exists ηε > 0 such that: θ K(ε) F(θ) 2 > ηε. Thus, P(θ K(ε)) P( F(θ) 2 > ηε) and this last quantity goes to zero which proves the convergence in probability d(θk, S) 0. Actually the almost sure convergence F(θk) 0 implies the convergence of the distances. Define Ak(ε) = {ω : θk(ω) K(ε)} and Bk(ε) = {ω : F(θk(ω)) 2 > ηε}. We have Ak(ε) Bk(ε) then n 1 k n Ak(ε) n 1 k n Bk(ε). Conclude by using the almost sure convergence P( n 1 k n Bk(ε)) = 0 for each ε > 0. If S is finite, it is in particular a compact set so the distance is attained for every k 0, d(θk, S) = mins S d(θk, s) 0. Since θk+1 θk 0, the sequence of iterates can only converge to a single point of S. B Practical procedure For the sake of completeness, the aim of this Section is to derive a feasible procedure that achieves the optimal asymptotic variance described in Corollary 1. First, we present a practical way to compute the conditioning matrix Ck and then we show that the resulting algorithm satisfies the high-level conditions of Theorem 2. This method is considered in a numerical illustration along with a novel variant of Ada Grad. B.1 Construction of the conditioning matrix Ck Similarly to the unavailability of gradients, one may not have access to values of the Hessian matrix but only stochastic versions of it (see details in numerical experiments below). As a consequence, we consider the following framework which involves random Hessian matrices. As for gradients, a policy (P k)k 0 is used at each iteration to produce random Hessians through H(θk, ξ k+1) with ξ k+1 P k. Assumption 10 (Unbiased and bounded Hessians). The Hessian generator H : Rd S Rd d is uniformly bounded around the minimizer and is such that for all θ Rd, H(θ, ) is measurable and we have: k 0, E H(θk, ξ k+1)|Fk = 2F(θk). An estimate of the Hessian matrix H = 2F(θ ) is now introduced as the weighted average j=0 ωj,k H(θj, ξ j+1) with j=0 ωj,k = 1. (22) The previous estimate has two advantages. First, thanks to averaging, the noise associated to each evaluation H(θj, ξ j+1) will eventually vanished due to the sum of martingale increments. Second, the weights ωj,k may help to give more importance to most recent iterates. In the idea that θk lies near θ eventually, it might be helpful to reduce the bias when estimating H = 2F(θ ). Published in Transactions on Machine Learning Research (08/2023) Proposition 2. Let (Φk)k 0 be obtained by (22). Suppose that Assumptions 3 and 10 are fulfilled and that θk θ a.s. If sup0 j k ωj,k = O(1/k), then we have Φk H = 2F(θ ) a.s. A natural choice is to take equal weights ωj,k = (k + 1) 1. However, since the last iterates are more likely to bring more relevant information through their Hessian estimates, we advocate the use of adaptive weights of the form ωj,k exp( η θj θk 1) with a parameter η 0 that recovers equal weights with η = 0. These two weights sequences satisfy the assumption of Proposition 2. They are considered in the numerical illustration of the next Section. While inverting Φk would produce a simple estimate of H 1, such an approach might result in a certain instability in practice caused by large jumps towards wrong directions (large eigenvalues) or a too restrictive visit along other components (vanishing eigenvalues). To overcome this issue, we rely on the following filter which clamps the eigenvalues of a symmetric matrix. For any symmetric matrix S and two positive numbers 0 < a < b, denote by S[a, b] the associated matrix where all the eigenvalues are clamped to [a, b], i.e., any eigenvalue λ of S is modified as λ max{a, min{λ, b}}. Let (λ(m) k )k 1 and (λ(M) k )k 1 be two sequence of positive numbers such that λ(m) k λ(M) k for all k 1. Define the matrices k N, Ck = Φk[(λ(M) k+1) 1, (λ(m) k+1) 1] 1 . (23) Such a definition guarantees two properties. First, Ck S++ d (R) with λ(m) k+1Id Ck λ(M) k+1Id. Second, in virtue of Proposition 2, Φk H a.s. so that, as soon as (λ(m) k )k 1 and (λ(M) k )k 1 go to 0 and + respectively, the matrix Ck converges almost surely to H 1 (as recommended by Corollary 1). Therefore, we obtain a feasible procedure leading to asymptotic optimality. Theorem 5 (Asymptotic optimality of the iterates). Let (θk)k 0 be obtained by conditioned SGD (2) with γk = 1/k, Φk defined by (22), λ(m) k 0, λ(M) k + and Ck given by (23). Suppose that Assumptions 1 to 9 are fulfilled and sup0 j k ωj,k = O(1/k). We have k(θk θ ) N(0, H 1ΓH 1), as k . This algorithm is theoretically asymptotically optimal. However in practice, adaptive gradient methods described in Table 1 have become the workhorse for training deep learning models as they take advantage of low rank-approximations and diagonal scalings. Interestingly, the conditioned matrices involved in these methods are linked to gradient estimates and thus to covariance matrices Γk (see Assumption 4) rather than the Hessian H. Indeed, since θ S, we have for the limiting covariance Γ = Eξ[g(θ , ξ)g(θ , ξ) ]. Consider a variant of Ada Grad which accumulates the average gradients Gk = δI + (1/k) Pk i=1 gig i and Ck = G 1/2 k . Averaging allows to anneal the stochastic noise of the gradient estimate. By the law of large numbers, the limiting matrix in our Theorem 2 will be C = (Γ + δI) 1/2. B.2 Numerical illustration Consider the empirical risk minimization framework applied to Generalized Linear Models. Given a data matrix X = (xi,j) Rn d with labels y Rn and a regularization parameter λ > 0, we are interested in solving minθ Rd F(θ) with i=1 fi(θ), fi(θ) = L(x i θ, yi) + λΩ(θ), L : R R R is smooth loss function and Ω: Rd R+ is a smooth convex regularizer chosen as Tikhonov regularization Ω(θ) = 1 2 θ 2 2. The gradient and Hessian of each component fi are given for all i = 1, . . . , n by fi(θ) = L (x i θ, yi)xi + λθ 2fi(θ) = L (x i θ, yi)xix i + λId, Published in Transactions on Machine Learning Research (08/2023) where L ( , ) and L ( , ) are the first and second derivative of L( , ) with respect to the first argument. Consider two well-known losses, namely least-squares and logistic. These losses are respectively associated to the Ridge regression problem with y Rn and the binary classication task with y { 1, +1}n. The regularization parameter is set to the classical value λ = 1/n. Denote by σ(z) = 1/(1+exp( z)) the sigmoid function, we have the following closed-form equations (Ridge Regression) L(x i θ, yi) = 1 2(yi x i θ)2 L (x i θ, yi) = x i θ yi L (x i θ, yi) = 1 (Logistic Regression) L(x i θ, yi) = log(1 + exp( yix i θ)) L (x i θ, yi) = σ(x i θ) yi L (x i θ, yi) = σ(x i θ)(1 σ(x i θ)) As stated in Example 1 of Section 2, stochastic versions of both the gradient and the Hessian of the objective F can be easily computed using only a batch B {1, . . . , n} of data and BF(θ) = P i B fi(θ)/|B| (resp. 2 BF(θ) = P i B 2fi(θ)/|B|) for the gradient (resp. Hessian) estimate. Note that these random generators meet Assumptions 1 and 10 as they produce unbiased estimates of the gradient and the Hessian matrix respectively. For the sake of completeness and illustrative purposes, we compare the performance of classical stochastic gradient descent (sgd) and the conditionned variant (csgd) presented in Appendix B where the matrix Φk is an averaging of past Hessian estimates as given in Equation (22). We shall compare equal weights ωj,k = (k + 1) 1 and adaptive weights ωj,k exp( η θj θk 1) with η > 0 to give more importance to Hessian estimates associated to iterates which are closed to the current point. Furthermore, for computational reason, we consider a novel adaptive stochastic first-order method which is a variant of Adagrad. Starting from the null vector θ0 = (0, . . . , 0) Rd, we use optimal learning rate of the form γk = α/(k + k0) (Bottou et al., 2018) and set λ(m) k 0, λ(M) k = Λ k in the experiments where γ, k0 and Λ are tuned using a grid search. The means of the optimality ratio k 7 [F(θk) F(θ )]/[F(θ0) F(θ )], obtained over 100 independent runs, are presented in Figures below. Methods in competition. The different methods in the experiments are: sgd: standard stochastic gradient descent. sgd_avg: Polyak-averaging stochastic gradient descent , with a burn-in period (n0 = 15 for d = 20 and n0 = 30 for d = 100) to avoid the poor performance of bad initialization. csgd(η = 0) and csgd(η > 0): conditioned stochastic gradient descent methods with equal and adaptive weights where the matrix Φk is an averaging of past Hessian estimates as given in Equation (22). adafull_avg: The variant of Adagrad presented in Appendix B where the gradient matrix Gk is updated as an average Gk = δI + (1/k) Pk i=1 gig i and Ck = G 1/2 k instead of the cumulative sum provided in the literature of Adagrad. Note that averaging here allows to anneal the stochastic noise whereas classical versions of Adagrad often rely on true gradients and may use cumulative sums. The parameter δ is also tuned using a grid search. We focus on Ridge regression on simulated data with n = 10, 000 samples in dimensions d {20; 100}. Stochastic gradient methods are known to greatly benefit from mini-batch instead of picking a single random sample when computing the gradient estimate. We use a batch-size equal to |B| = 16. In Figure 1, we can see that conditioned SGD outperforms standard SGD. Furthermore, adaptive weights (η > 0) improve the convergence speed of conditioned SGD methods. Interestingly, the novel approach adafull_avg offers great performance at a cheap computing cost. Indeed, the update of Ck+1 relies on the inverse of an average. This operation can be carried out in an efficient way thanks to Woodbury matrix identity. Real-world data. We now turn our attention to real-world data and consider again the Ridge regression problem on the following datasets: Boston Housing dataset (Harrison Jr & Rubinfeld, 1978) (n = 506; d = 14) and Diabetes dataset (Dua & Graff, 2017) (n = 442; d = 10). Published in Transactions on Machine Learning Research (08/2023) 0 20 40 60 80 100 Iterations Optimaliy Ratio sgd sgd_avg csgd( =0) adafull_avg (a) Ridge d = 20 0 50 100 150 200 250 300 Optimaliy Ratio sgd sgd_avg csgd( =0) adafull_avg (b) Ridge d = 100 Figure 1: Optimality ratio k 7 [F(θk) F(θ )]/[F(θ0) F(θ )] for Ridge regression in dimension d {20; 100}. Boston Housing dataset (Harrison Jr & Rubinfeld, 1978): This dataset contains information collected by the U.S Census Service concerning housing in the area of Boston Mass. It contains n = 506 samples in dimension d = 14. Diabetes dataset (Dua & Graff, 2017): Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline. The means of the optimality ratio k 7 [F(θk) F(θ )]/[F(θ0) F(θ )], obtained over 100 independent runs, are presented in Figure 2. Once again, the conditioned SGD methods offer better performance than plain SGD. For these datasets, it is the conditioning matrix with adaptive weights as given in Equation (22) which presents the best results. 0 50 100 150 200 250 300 Optimaliy Ratio sgd sgd_avg csgd( =0) adafull_avg (a) Boston Dataset 0 100 200 300 400 500 Iterations Optimaliy Ratio sgd sgd_avg csgd( =0) adafull_avg (b) Diabetes Dataset Figure 2: Optimality ratio k 7 [F(θk) F(θ )]/[F(θ0) F(θ )] for Ridge regression on datasets Boston and Diabetes. Published in Transactions on Machine Learning Research (08/2023) C Auxiliary results C.1 Robbins-Siegmund Theorem Theorem 6. (Robbins & Siegmund (1971)) Consider a filtration (Fn)n 0 and four sequences of random variables(Vn)n 0 , (Wn)n 0 , (an)n 0 and (bn)n 0 that are adapted and non-negative. Assume that almost surely P k ak < and P k bk < . Assume moreover that E [V0] < and for all n N, E[Vn+1|Fn] (1 + an)Vn Wn + bn. Then it holds k Wk < a.s. (b) Vn a.s. V , E [V ] < . (c) sup n 0 E [Vn] < . C.2 Auxiliary lemmas Lemma 2. Let (un)n 1, (vn)n 1 and (γn)n 1 be non-negative sequences such that γn 0 and P n γn = + . Assume that there exists a real number m 1 and j 1 such that for all n j, un (1 γn)mun 1+γnvn. Then it holds that lim sup n + un lim sup n + vn. Proof. Denote x+ = max(x, 0). One has (x + y)+ x+ + y+. Set ε > 0 and v = lim supn vn + ε. Then there exists an integer N 1 such that (1 γn)m (1 γn) and vn < v, i.e., (vn v)+ = 0 for n N. We have for large enough n N j, un v (1 γn)(un 1 v) + γn(vn v), and taking the positive part gives (un v)+ (1 γn)(un 1 v)+ + γn(vn v)+ = (1 γn)(un 1 v)+. n γn = + , this inequality implies that (un v)+ tends to zero, but this is true for all ε > 0 so v is arbitrarily close to lim supn vn and the result follows. Lemma 3. Let (γn)n 1 be a non-negative sequence converging to zero, and λ, m and p three real numbers with λ > 0, m 1, p 0. Consider two non-negative sequences (xn), (εn) and an integer j 1 such that n j, xn = (1 λγn)mxn 1 + γp+1 n εn, i=j (1 λγi)mxj 1 + i=k+1 (1 λγi)m ! The following holds if γn = n β, β (1/2, 1), then for any p lim sup n + xn γp n 1 mλ lim sup n + εn. if γn = 1/n, then for any p < mλ lim sup n + xn γp n 1 mλ p lim sup n + εn. In particular, when εn 0 with j = 1 and x0 = 0, i=k+1 (1 λγi)mεk = 0, (mλ > 1) lim n + 1 γn i=k+1 (1 λγi)mεk = 0. Published in Transactions on Machine Learning Research (08/2023) Before proving this result, note that if we consider γn = γ/nβ then we can write xn = (1 λγn)mxn 1 + γp+1 n εn = (1 (λγ)n β)mxn 1 + (n β)p+1(γp+1εn) and apply the result with λ = γλ and εn = γp+1εn. Proof. We apply Lemma 2 to the sequence un = xn γp n . We have for all n j, (1 λγn)mxn 1 + γp+1 n εn p (1 λγn)mun 1 + γnεn = exp p log γn 1 + m log(1 λγn) un 1 + γnεn. 1 exp p log γn 1 + m log(1 λγn) , so we get the recursion equation n j, un = (1 λnγn)un 1 + λnγn εn λn . if γn = n β, β (1/2, 1) then 1/γn 1/γn 1 0 and the ratio γn 1/γn tends to 1 with γn 1 (1 + o(1)) = γn 1 (1 + o(1)) = o(γn). Besides, m log(1 λγn) = mλγn + o(γn) when n + and we get γn [1 exp ( mλγn + o(γn))] , which implies that λn converges to mλ. We conclude with Lemma 2. if γn = 1/n then the ratio γn 1/γn tends to 1 with = log 1 + 1 n 1 = γn + o(γn). We still have m log(1 λγn) = mλγn + o(γn) when n + and therefore γn [1 exp ((p mλ)γn + o(γn))] , which implies λn converges to (mλ p) and we conclude in the same way. Lemma 4. Let A, B S++ d (R) then the eigenvalues of AB are real and positive with Sp(AB) [λmin(A)λmin(B); λmax(A)λmax(B)]. Proof. Denote by B the unique positive square root of B. The matrix AB is similar to the real symmetric positive definite matrix B. Therefore its eigenvalues are real and positive. Since A 7 λmax(A) is a sub-multiplicative matrix norm on S++ d (R), λmax(AB) λmax(A)λmax(B) which gives λmax((AB) 1) λmax(A 1)λmax(B 1), i.e., λmin (AB) 1 λmin(A) 1λmin(B) 1, and finally λmin(A)λmin(B) λmin (AB) . Published in Transactions on Machine Learning Research (08/2023) Lemma 5. Let S S++ d (R) be a real symmetric positive definite matrix. Let (γk)k 1 be a positive decreasing sequence converging to 0 such that P k γk = + . Denote by λm the smallest eigenvalue of S. It holds that there exists j 1 such that for any k > j, all the eigenvalues of the real symmetric matrix Ak = I γk S are positive and we have ρ(Πn) = ρ(An . . . A1) n + 0, k > j, ρ(Πn,k) = ρ(An . . . Ak+1) i=k+1 (1 γiλm). Proof. For any k N, the eigenvalues of the real symmetric matrix Ak = I γk S are given by Sp(Ak) = {(1 γkλ), λ Sp(S)}. Since γk 0, there exists j 1 such that γkλm < 1 for all k > j. Therefore for any k > j, we have Sp(Ak) R + and the largest eigenvalue is ρ(Ak) = 1 γkλm. Since ρ is a sub-multiplicative norm for real symmetric matrices, we get ρ(Πn) Qn k=1 ρ(Ak) = Qj k=1 ρ(Ak) Qn k=j+1 ρ(Ak). The second product can be upper bounded with the convexity of exponential, k=j+1 ρ(Ak) = k=j+1 (1 γkλm) k=j+1 exp ( γkλm) = exp ( λm(τn τj)) n + 0. Similarly we have for all k > j, ρ(Πn,k) Qn i=k+1 ρ(Ai) Qn i=k+1(1 γiλm). Lemma 6. Let γn = αn β with β (1/2, 1] then it holds 1 β = α 1 β n1 β, (β = 1) k=1 γk α log(n). Proof. By series-integral comprison, R n+1 1 t βdt Pn k=1 k β 1 + R n 1 t βdt. Theorem 7. (Delyon & Portier, 2021, Theorem 17)(Freedman inequality) Let (Xj)1 j n be random variables such that E[Xj|Fj 1] = 0 for all 1 j n then, for all t 0 and v, m > 0, j=1 Xj t, max j=1,...,n |Xj| m, j=1 E X2 j | Fj 1 v 2 exp t2/2 v + tm/3 Lemma 7. Let A Rn n be a symmetric positive semi-definite matrix. Then for any B Rm n, the matrix BAB Rm m is symmetric positive semi-definite. Proof. First note that (BAB ) = (B ) A B = BAB because A is symmetric. Then for any vector x R, we have x (BAB )x = (B x) A(B x) 0 since A is positive semi-definite. Proposition 3. (Khalil, 2002, Theorem 4.6) Let H be a positive definite matrix and Γ a symmetric positive definite matrix of same dimension. Then there exists a symmetric positive definite matrix Σ, unique solution of the Lyapunov equation HΣ + ΣH = Γ, which is given by Σ = R + 0 e t HΓe t H dt. The results remains true if the matrix Γ is only symmetric positive semi-definite: in that case the matrix Σ is also symmetric positive semi-definite and is the solution of the Lyapunov equation. C.3 Additional propositions This section gathers the proofs of Proposition 1 about the optimal choice for the conditioning matrix and of Proposition 2 about the almost sure convergence of the conditioning matrices. Proposition 1. The choice C = H 1 is optimal in the sense that ΣC ΣC, C CH. Moreover, ΣC = H 1ΓH 1. Published in Transactions on Machine Learning Research (08/2023) Proof. Define C = ΣC H 1ΓH 1 and check that C satisfies (CH Id/2) C + C (CH Id/2) = (C H 1)Γ(C H 1). Because Γ is symmetric positive semi-definite, we have using Lemma 7 that the term on the right side is symmetric positive semi-definite. Therefore, in view of Proposition 3, we get that C is symmetric positive semi-definite C 0 which implies ΣC H 1ΓH 1 for all C CH. The equality is reached for C = H 1 with C = 0, ΣC = H 1ΓH 1. Proposition 2. Let (Φk)k 0 be obtained by (22). Suppose that Assumptions 3 and 10 are fulfilled and that θk θ almost surely . If sup0 j k ωj,k = O(1/k), then we have Φk H = 2F(θ ) almost surely. Proof. We use the decomposition j=0 ωj,k 2F(θj) H + j=0 ωj,k H(θj, ξ j+1) 2F(θj) . The continuity of 2F at θ and the fact that θj θ a.s. implie that 2F(θj) H 0 a.s. Since sup0 j k ωj,k = O(1/k), there exists a > 0 such that j=0 ωj,k 2F(θj) H a k + 1 which goes to 0 in virtue of Cesaro s Lemma, therefore limk Pk j=0 ωj,k 2F(θj) H = 0. The second term is a sum of martingale increments and shall be treated with Freedman inequality and Borel-Cantelli Lemma. Introduce the martingale increments 0 j k, Xj+1,k = ωj,k H(θj, ξ j+1) 2F(θj) . For a fixed k, we have Xj+1,k = x(i,l) j+1 1 i,l d where we remove the index k for the sake of clarity. Because the Hessian generator is unbiased, we have for all coordinates E h x(i,l) j+1|Fj i = 0 for all 0 j k. By definition of the Hessian generator and using that ( 2F(θj)) is bounded, we get that H(θj, ξ j+1) 2F(θj) = O(1) for all j 0. For any b > 0, consider the following event Ωb = sup k 0 max j=0,...,k(k + 1) x(i,l) j+1 b , and note that since ωj,k = O(1/k) we have P(Ωb) 1 as b . On this event, the martingale increments and the variance term are bounded as max j=0,...,k x(i,l) j+1 b(k + 1) 1, j=0 E x(i,l) j+1 2 | Fj b2(k + 1) 1. Using Freedman inequality (Theorem 7), we have for all coordinates i, l = 1, . . . , d, j=0 x(i,l) j+1 2 exp ε2(k + 1) The last term is the general term of a convergent series. Apply Borel-Cantelli Lemma (Borel, 1909) to finally get almost surely on Ωb that limk Pk j=0 x(i,l) j+1 = 0. Since b > 0 is arbitrary and P(Ωb) 1 when Published in Transactions on Machine Learning Research (08/2023) b , we have almost surely limk Pk j=0 x(i,l) j+1 = 0. This is true for all the coordinates of the martingale increments and therefore j=0 ωj,k H(θj, ξ j+1) 2F(θj) = 0 a.s. C.4 Auxiliary results on expected smoothness The following Lemma gives sufficient conditions to meet the weak growth condition on the stochastic noise as stated in Assumption 8. Lemma 8. Suppose that for all k 1, θ Rd, F(θ) = E [f(θ, ξk)|Fk 1] with ξk Pk 1. Assume that for all ξk Pk 1, the function θ 7 f(θ, ξk) is L-smooth almost surely and there exists m R such that for all θ Rd, f(θ, ξk) m. Then a gradient estimate is given by g(θ, ξ) = f(θ, ξ) and the growth condition of Assumption 8 is satisfied with σ2 = 2L(F m) and θ Rd, k N, E g(θ, ξk) 2 2|Fk 1 2L (F(θ) F ) + σ2. Proof. For all ξk Pk 1, Lipschitz continuity of the gradient θ 7 f(θ, ξk) implies (see Nesterov (2013)) f(y, ξk) f(θ, ξk) + f(θ, ξk), y θ + (L/2) y θ 2 2. Plug y = θ (1/L) f(θ, ξk) and use the lower bound f(y, ξk) m to obtain 1 2L f(θ, ξk) 2 2 f(θ, ξk) f(y, ξk) f(θ, ξk) m, which gives, g(θ, ξk) 2 2 2L (f(θ, ξk) f(θ , ξk)) + 2L (f(θ , ξk) m) and conclude by taking the conditional expectation with respect to Fk 1. The next Lemma links our weak growth condition with the notion of expected smoothness as introduced in Gower et al. (2019). In particular, this notion can be extended to our general context where the sampling distribution can evolve through the stochastic algorithm. Lemma 9. (Expected smoothness) Assume that with probability one, sup k 1 sup x =x E g(θ, ξk) g(θ , ξk) 2 2|Fk 1 F(θ) F < and sup k 1 E g(θ , ξk) 2 2|Fk 1 < . Then there exist 0 L, σ2 < such that θ Rd, k N, E g(θ, ξk) 2 2|Fk 1 2L (F(θ) F ) + 2σ2. Proof. For all θ Rd and all k N, we have g(θ, ξk) 2 2 = g(θ, ξk) g(θ , ξk) + g(θ , ξk) 2 2 2 g(θ, ξk) g(θ , ξk) 2 2 + 2 g(θ , ξk) 2 2. Using the expected smoothness, with probability one, there exists 0 L < such that E g(θ, ξk) g(θ , ξk) 2 2|Fk 1 L (F(θ) F ) . Since the noise at optimal point is almost surely finite there exists 0 σ2 < such that E g(θ , ξk) 2 2|Fk 1 σ2, which allows to conclude by taking the conditional expectation.