# Stochastic Newton Proximal Extragradient Method

Ruichen Jiang (ECE Department, UT Austin, rjiang@utexas.edu) · Michał Dereziński (EECS Department, University of Michigan, derezin@umich.edu) · Aryan Mokhtari (ECE Department, UT Austin, mokhtari@austin.utexas.edu)

Abstract: Stochastic second-order methods achieve fast local convergence in strongly convex optimization by using noisy Hessian estimates to precondition the gradient. However, these methods typically reach superlinear convergence only when the stochastic Hessian noise diminishes, increasing per-iteration costs over time. Recent work in [1] addressed this with a Hessian averaging scheme that achieves superlinear convergence without higher per-iteration costs. Nonetheless, the method has slow global convergence, requiring up to $O(\kappa^2)$ iterations to reach the superlinear rate of $O((1/t)^{t/2})$, where $\kappa$ is the problem's condition number. In this paper, we propose a novel stochastic Newton proximal extragradient method that improves these bounds, achieving a faster global linear rate and reaching the same fast superlinear rate in $O(\kappa)$ iterations. We accomplish this by extending the Hybrid Proximal Extragradient (HPE) framework, achieving fast global and local convergence rates for strongly convex functions with access to a noisy Hessian oracle.

1 Introduction

In this paper, we focus on the use of second-order methods for solving the optimization problem
$$\min_{x \in \mathbb{R}^d} f(x), \quad (1)$$
where $f: \mathbb{R}^d \to \mathbb{R}$ is strongly convex and twice differentiable. There is an extensive literature on second-order methods and their fast local convergence properties; see, e.g., [2-5]. However, these results necessitate access to the exact Hessian, which can pose computational challenges. To address this issue, several studies have explored scenarios where only the exact gradient can be queried, while a stochastic estimate of the Hessian is available, similar to the setting we investigate in this paper. This oracle model is commonly encountered in large-scale machine learning problems, as computing the gradient is often much less expensive than computing the Hessian, and approximating the Hessian is a more affordable approach.

Specifically, consider a finite-sum minimization problem $\min_{x \in \mathbb{R}^d} \sum_{i=1}^{n} f_i(x)$, where $n$ denotes the number of data points and $d$ denotes the dimension of the problem. To achieve a fast convergence rate, standard first-order methods need to compute one full gradient in each iteration, resulting in a per-iteration computational cost of $O(nd)$. In contrast, implementing a second-order method such as damped Newton's method involves computing the full Hessian, which costs $O(nd^2)$. An inexact Hessian estimate can be constructed efficiently at a cost of $O(sd^2)$, where $s$ is the sketch size or subsampling size [1, 6]. Hence, when the number of samples $n$ significantly exceeds $d$, the per-iteration cost of stochastic second-order methods becomes comparable to that of first-order methods. Moreover, using second-order information often reduces the number of iterations needed to converge, thereby lowering the overall computational complexity.

A common template among stochastic second-order methods is to combine a deterministic second-order method, such as Newton's method or the cubic regularized Newton method, with techniques such as Hessian subsampling [7-12] or Hessian sketching [4, 6, 13] that only require a noisy estimate
of the Hessian. We refer the reader to [14, 15] for recent surveys and empirical comparisons. In terms of convergence guarantees, the majority of these works, including [4, 6, 8-11, 13], have shown that stochastic second-order methods exhibit global linear convergence and local linear-quadratic convergence, either with high probability or in expectation. The linear-quadratic behavior holds when
$$\|x_{t+1} - x^*\| \le c_1 \|x_t - x^*\| + c_2 \|x_t - x^*\|^2, \quad (2)$$
where $x^*$ denotes the optimal solution of Problem (1) and $c_1, c_2$ are constants depending on the sample/sketch size at each step. In particular, the presence of the linear term in (2) implies that the algorithm can only achieve linear convergence when the iterate is sufficiently close to the optimal solution $x^*$. Consequently, as discussed in [9, 10], to achieve superlinear convergence, the coefficient $c_1 = c_{1,t}$ needs to gradually decrease to zero as $t$ increases. However, since $c_1$ is determined by the magnitude of the stochastic noise in the Hessian estimate, this in turn demands that the sample/sketch size increase across the iterations, leading to a blow-up of the per-iteration computational cost.
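To see why a constant linear term in (2) caps the rate, the following toy iteration of the recursion (our own illustration; the coefficients are arbitrary) contrasts a fixed $c_1$ with a decaying $c_{1,t}$, mimicking a growing sample size:

```python
# Illustration of the linear-quadratic recursion (2): with a constant c1
# the error ratio stalls near c1 (linear rate), while a decaying c1_t
# (shrinking Hessian noise) drives the ratio to zero (superlinear rate).
def run(c1_fn, c2=1.0, r0=0.1, T=20):
    r, errs = r0, [r0]
    for t in range(T):
        r = c1_fn(t) * r + c2 * r**2   # recursion (2)
        errs.append(r)
    return errs

const = run(lambda t: 0.5)             # constant linear coefficient
decay = run(lambda t: 0.5 / (t + 1))   # c1_t -> 0 as t grows

for t in (5, 10, 15):
    # per-step contraction err_{t+1}/err_t: ~0.5 vs tending to 0
    print(t, const[t + 1] / const[t], decay[t + 1] / decay[t])
```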
The only prior work addressing this limitation, achieving a superlinear rate for a stochastic second-order method without requiring the stochastic Hessian noise to converge to zero, is [1]. It uses a weighted average of all past Hessian approximations as the current Hessian estimate. This approach reduces the variance of the stochastic noise in the Hessian estimate, though it introduces bias into the Hessian approximation matrix. When combined with Newton's method, it was shown that the proposed method achieves local superlinear convergence with a non-asymptotic rate of $(\Upsilon\sqrt{\log(t)/t})^t$ with high probability, where $\Upsilon$ characterizes the noise level of the stochastic Hessian oracle (see Assumption 4). However, the method may require many iterations to achieve superlinear convergence. Specifically, with the uniform averaging scheme, it takes $O(\kappa^3)$ iterations before the method starts converging superlinearly and $O(\kappa^6/\Upsilon^2)$ iterations before it reaches the final superlinear rate. Here, $\kappa = L_1/\mu$ denotes the condition number of the function $f$, where $L_1$ is the Lipschitz constant of the gradient and $\mu$ is the strong convexity parameter. To address this, [1] proposed a weighted averaging scheme that assigns more weight to recent Hessian estimates, improving both transition points to $O(\Upsilon^2 + \kappa^2)$ while achieving a slightly slower superlinear rate of $O(\Upsilon\log(t)/\sqrt{t})^t$.

Table 1: Comparison between Algorithm 1 and the stochastic Newton method in [1], in terms of how many iterations it takes to transition to each phase, and the convergence rates achieved. We drop constant factors as well as logarithmic dependence on $d$ and $1/\delta$, and assume $1/\mathrm{poly}(\kappa) \le \Upsilon \le O(\sqrt{\kappa})$.

| Method | Weights | $T_1$ | Linear rate $\phi$ | $T_2$ | Initial superlinear rate $\theta^{(1)}_t$ | $T_3$ | Final superlinear rate $\theta^{(2)}_t$ |
|---|---|---|---|---|---|---|---|
| Stochastic Newton [1] | Uniform | $\Upsilon^2$ | $1-\kappa^{-2}$ | $\kappa^3$ | $\kappa^3/t$ | $\kappa^6/\Upsilon^2$ | $\Upsilon\sqrt{\log(t)/t}$ |
| Stochastic Newton [1] | Non-uniform | $\Upsilon^2$ | $1-\kappa^{-2}$ | $\kappa^2$ | $\kappa^{4\log(\kappa)+1}/t^{\log(t)}$ | $\kappa^2$ | $\Upsilon\log(t)/\sqrt{t}$ |
| Stochastic NPE (ours) | Uniform | $\Upsilon^2/\kappa^2$ | $1-\kappa^{-1}$ | $\kappa^2$ | $\kappa^2/t$ | $\kappa^4/\Upsilon^2$ | $\Upsilon\sqrt{\log(t)/t}$ |
| Stochastic NPE (ours) | Non-uniform | $\Upsilon^2/\kappa^2$ | $1-\kappa^{-1}$ | $\Upsilon^2+\kappa$ | $(\Upsilon^2+\kappa)^{\log(\Upsilon^2+\kappa)+1}/t^{\log(t)}$ | $\Upsilon^2+\kappa$ | $\Upsilon\log(t)/\sqrt{t}$ |

Our contributions. In this paper, we improve the complexity of the stochastic Newton method in [1] with a method that attains a superlinear rate in significantly fewer iterations. As shown in Table 1, our method requires fewer iterations to reach linear convergence, denoted by $T_1$, by a factor of $\kappa^2$ compared to [1]. Additionally, our method achieves a linear convergence rate of $(1 - O(1/\kappa))^t$, outperforming the $(1 - O(1/\kappa^2))^t$ rate in [1]. Thus, our method reaches the local neighborhood of the optimal solution $x^*$ and transitions from linear to superlinear convergence faster. Specifically, the second transition point, $T_2$, is smaller by a factor of $\kappa$ in both uniform and non-uniform averaging schemes when $\Upsilon = O(\sqrt{\kappa})$. Similarly, our method's initial superlinear rate has a better dependence on $\kappa$, leading to fewer iterations, $T_3$, to enter the final superlinear phase.

To achieve this result, we use the hybrid proximal extragradient (HPE) framework [16, 17] instead of Newton's method as the base algorithm. The HPE framework provides a principled approach for designing second-order methods with superior global convergence guarantees [3, 17-20]. However, [16] and subsequent works focus on cases where $f$ is merely convex, not leveraging strong convexity. Thus, we modify the HPE framework to suit our setting. Specifically, we relax the error condition for computing the proximal step in HPE, enabling a larger step size when the iterate is close to the optimal solution, which is crucial for achieving the final superlinear convergence rate.

2 Preliminaries

In this section, we formally present our assumptions.

Assumption 1. The function $f$ is twice differentiable and $\mu$-strongly convex.

Assumption 2. The Hessian $\nabla^2 f$ has bounded variation: $\|\nabla^2 f(x) - \nabla^2 f(y)\| \le M_1$ for all $x, y \in \mathbb{R}^d$.

Assumption 3. The Hessian $\nabla^2 f$ is $L_2$-Lipschitz, i.e., $\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L_2\|x - y\|$ for all $x, y \in \mathbb{R}^d$.

Assumption 2 is more general than the assumption that $\nabla f$ is $L_1$-Lipschitz. In particular, if the latter assumption holds, then $M_1 \le 2L_1$. Moreover, we define $\kappa \triangleq M_1/\mu$ as the condition number. To simplify our notation, we denote the exact gradient $\nabla f(x)$ and the exact Hessian $\nabla^2 f(x)$ of the objective function by $g(x)$ and $H(x)$, respectively. As mentioned earlier, we assume that we have access to the exact gradient, but only to a noisy estimate of the Hessian, denoted by $\hat{H}(x)$. We require only a mild assumption on the stochastic Hessian noise $E(x) \triangleq \hat{H}(x) - H(x)$: it is assumed to be mean zero and sub-exponential.

Assumption 4. If we define $E(x) \triangleq \hat{H}(x) - H(x)$, then $\mathbb{E}[E(x)] = 0$ and $\mathbb{E}[\|E(x)\|^p] \le p!\,\Upsilon_E^p/2$ for all integers $p \ge 2$. Also, define $\Upsilon \triangleq \Upsilon_E/\mu$ to be the relative noise level.

Assumption 5. The Hessian approximation matrix is positive semi-definite, i.e., $\hat{H}(x) \succeq 0$ for all $x \in \mathbb{R}^d$.

Stochastic Hessian construction. The two most popular approaches to construct stochastic Hessian approximations are subsampling and sketching. Hessian subsampling is designed for a finite-sum objective of the form $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, where $n$ is the number of samples. In each iteration, a subset $S \subseteq \{1, 2, \dots, n\}$ is drawn uniformly at random, and the subsampled Hessian at $x$ is constructed as $\hat{H}(x) = \frac{1}{|S|}\sum_{i \in S} \nabla^2 f_i(x)$. In this case, if each $f_i$ is convex, then the condition in Assumption 5 is satisfied. Moreover, if we further assume that $\|\nabla^2 f_i(x)\| \le c M_1$ for some $c > 0$ and all $i$, then Assumption 4 is satisfied with $\Upsilon = O(\sqrt{c\kappa\log(d)/|S|} + c\kappa\log(d)/|S|)$ (see [1, Example 1]).

The other approach is Hessian sketching, applicable when the Hessian $H$ can be easily factorized as $H = M^\top M$, where $M \in \mathbb{R}^{n \times d}$ is the square-root Hessian matrix and $n$ is the number of samples. This is the case for generalized linear models; see [1]. To form the sketched Hessian, we draw a random sketch matrix $S \in \mathbb{R}^{s \times n}$ with sketch size $s$ from a distribution $\mathcal{D}$ that satisfies $\mathbb{E}_{\mathcal{D}}[S^\top S] = I$. The sketched Hessian is then $\hat{H} = M^\top S^\top S M$. In this case, Assumption 5 is automatically satisfied. Moreover, for the Gaussian sketch, Assumption 4 is satisfied with $\Upsilon = O(\kappa(\sqrt{d/s} + d/s))$ (see [1, Example 2]).

Remark 1. The above assumptions are common in the study of stochastic second-order methods, appearing in works on subsampled Newton [7-10, 12], Newton sketch [4, 6, 13], and, notably, [1]. The strong convexity requirement is crucial, as stochastic second-order methods have a clear advantage over first-order methods such as gradient descent when the function is strongly convex. Specifically, stochastic second-order methods attain a superlinear convergence rate, as shown in this paper, which is superior to the linear rate of first-order methods.
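The snippet below sketches both constructions on a synthetic least-squares-type problem where the square-root Hessian is explicit; the data, the sizes, and the choice of the Gaussian sketch are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 10_000, 50, 200            # hypothetical sizes with s << n

# Toy finite sum: f_i(x) = 0.5 * (a_i^T x)^2, so H(x) = (1/n) A^T A
# and the square-root Hessian is M = A / sqrt(n).
A = rng.standard_normal((n, d))
H_exact = A.T @ A / n

# (i) Hessian subsampling: average the Hessians of |S| = s random samples.
idx = rng.choice(n, size=s, replace=False)
H_sub = A[idx].T @ A[idx] / s

# (ii) Hessian sketching with a Gaussian sketch S in R^{s x n},
# scaled so that E[S^T S] = I; then H_hat = (S M)^T (S M) >= 0.
M = A / np.sqrt(n)
S = rng.standard_normal((s, n)) / np.sqrt(s)
H_sketch = (S @ M).T @ (S @ M)

for H_hat in (H_sub, H_sketch):
    # relative spectral-norm error of each stochastic estimate
    print(np.linalg.norm(H_hat - H_exact, 2) / np.linalg.norm(H_exact, 2))
```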
3 Stochastic Newton Proximal Extragradient

Our approach develops a stochastic Newton-type method grounded in the Hybrid Proximal Extragradient (HPE) framework and its second-order variant. Therefore, before introducing our proposed algorithm, we provide a brief overview of the core principles of the HPE framework. Following this, we present our method as it applies to the specific setting addressed in this paper.

Hybrid Proximal Extragradient. We first present the Hybrid Proximal Extragradient (HPE) framework for strongly convex functions. To solve Problem (1), the HPE algorithm consists of two steps. In the first step, given $x_t$, we find a mid-point $\hat{x}_t$ by applying an inexact proximal point update $\hat{x}_t \approx x_t - \eta_t \nabla f(\hat{x}_t)$, where $\eta_t$ is the step size. More precisely, we require
$$\|\hat{x}_t - x_t + \eta_t \nabla f(\hat{x}_t)\| \le \alpha\sqrt{\gamma_t}\,\|\hat{x}_t - x_t\|, \quad (3)$$
where $\gamma_t = 1 + 2\eta_t\mu$, $\mu$ is the strong convexity parameter, and $\alpha \in (0,1)$ is a user-specified parameter. Then, in the second step, we perform the extragradient update and compute $x_{t+1}$ as the convex combination
$$x_{t+1} = \frac{1}{\gamma_t}\left(x_t - \eta_t \nabla f(\hat{x}_t)\right) + \left(1 - \frac{1}{\gamma_t}\right)\hat{x}_t. \quad (4)$$
The weights $\frac{1}{\gamma_t}$ and $1 - \frac{1}{\gamma_t}$ in the above convex combination are chosen to optimize the convergence rate.

Remark 2. When $\mu = 0$, the algorithm outlined above reduces to the original HPE framework studied in [16, 18]. Our modification in (3) is inspired by [21] and allows a larger error when performing the inexact proximal point update, which turns out to be crucial for achieving a fast superlinear convergence rate. Moreover, the modification in (4) has been adopted in [22].
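As a sanity check on the two HPE steps, the following sketch runs them on a strongly convex quadratic, where the proximal subproblem is solvable exactly so that (3) holds with zero error; the matrix $Q$, the step size $\eta$, and the starting point are arbitrary illustrative choices.

```python
import numpy as np

# One HPE iteration on f(x) = 0.5 x^T Q x (minimizer x* = 0), where the
# proximal point x_hat = x - eta * grad(x_hat) has the closed form
# (I + eta Q) x_hat = x, so condition (3) is met exactly.
rng = np.random.default_rng(1)
d, eta = 5, 1.0
B = rng.standard_normal((d, d))
Q = B.T @ B + np.eye(d)                     # strongly convex: mu >= 1
mu = np.linalg.eigvalsh(Q).min()
grad = lambda x: Q @ x

x = rng.standard_normal(d)
for _ in range(10):
    # Step 1: exact proximal point update
    x_hat = np.linalg.solve(np.eye(d) + eta * Q, x)
    assert np.linalg.norm(x_hat - x + eta * grad(x_hat)) < 1e-10   # (3)
    # Step 2: extragradient update (4) with gamma = 1 + 2 eta mu
    gamma = 1.0 + 2.0 * eta * mu
    x = (x - eta * grad(x_hat)) / gamma + (1.0 - 1.0 / gamma) * x_hat
    print(np.linalg.norm(x))                # distance to x* contracts
```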
Stochastic Newton Proximal Extragradient (SNPE). The HPE method described above provides a useful algorithmic framework rather than a directly implementable method. The main challenge comes from implementing the first step in (3), which involves an inexact proximal point update. Specifically, the naive approach is to solve the implicit nonlinear equation $x - x_t + \eta_t \nabla f(x) = 0$, which can be as costly as solving the original problem in (1). To address this issue, [18] proposed to approximate the gradient operator $\nabla f(x)$ by its local linearization $\nabla f(x_t) + \nabla^2 f(x_t)(x - x_t)$, and then compute $\hat{x}_t$ by solving the linear system of equations $\hat{x}_t - x_t + \eta_t(\nabla f(x_t) + \nabla^2 f(x_t)(\hat{x}_t - x_t)) = 0$. This leads to the Newton proximal extragradient method that was proposed and analyzed in [18]. However, in our setting, the exact Hessian $\nabla^2 f(x_t)$ is not available. Thus, we construct a stochastic Hessian approximation $H_t$ from our noisy Hessian oracle as a surrogate for $\nabla^2 f(x_t)$. We will elaborate on the construction of $H_t$ later, but for the present discussion assume that this stochastic Hessian approximation $H_t$ is already provided. Then, in the first step, we compute $\hat{x}_t$ by
$$\hat{x}_t = x_t - \eta_t\left(\nabla f(x_t) + H_t(\hat{x}_t - x_t)\right), \quad (5)$$
where we replace $\nabla f(\hat{x}_t)$ by its local linear approximation $\nabla f(x_t) + H_t(\hat{x}_t - x_t)$. Moreover, (5) is equivalent to solving the linear system of equations $(I + \eta_t H_t)(x - x_t) = -\eta_t \nabla f(x_t)$. For ease of presentation, we set $\hat{x}_t$ as the exact solution of this system, leading to
$$\hat{x}_t = x_t - \eta_t(I + \eta_t H_t)^{-1}\nabla f(x_t). \quad (6)$$
However, we note that an inexact solution to this linear system is also sufficient for our convergence guarantees, so long as $\|(I + \eta_t H_t)(\hat{x}_t - x_t) + \eta_t \nabla f(x_t)\| \le \frac{\alpha}{2}\|\hat{x}_t - x_t\|$; we refer the reader to Appendix A.3 for details. Additionally, since we employed a linear approximation to determine the mid-point $\hat{x}_t$, the condition in (3) may no longer be satisfied. Consequently, it is crucial to verify the accuracy of our approximation after selecting $\hat{x}_t$. To achieve this, we implement a line-search scheme to ensure that the step size is not too large and the linear approximation error is small.

Next, we discuss constructing the stochastic Hessian approximation $H_t$. A simple strategy is using $\hat{H}(x_t)$ instead of $\nabla^2 f(x_t)$, but the Hessian noise would lead to a highly inaccurate approximation of the prox operator, ruining the superlinear convergence rate. To reduce the Hessian noise, we follow [1] and use an averaged Hessian estimate. We consider two schemes: (i) uniform averaging; (ii) non-uniform averaging with general weights. In the first case, $H_t = \frac{1}{t+1}\sum_{i=0}^{t}\hat{H}(x_i)$ uniformly averages past stochastic Hessian approximations. Motivated by the central limit theorem for martingale differences, we expect $H_t$ to have smaller variance than $\hat{H}(x_t)$. It can be implemented online as $H_t = \frac{t}{t+1}H_{t-1} + \frac{1}{t+1}\hat{H}(x_t)$, without storing past Hessian estimates. However, $H_t$ is a biased estimator of $\nabla^2 f(x_t)$, since it incorporates stale Hessian information. To address the bias-variance trade-off, the second scheme uses non-uniform averaging to give more weight to recent Hessian estimates. Given an increasing non-negative weight sequence $\{w_t\}_{t=-1}^{\infty}$ with $w_{-1} = 0$, the running average is
$$H_t = \frac{w_{t-1}}{w_t}H_{t-1} + \left(1 - \frac{w_{t-1}}{w_t}\right)\hat{H}(x_t). \quad (7)$$
Equivalently, with $z_{i,t} = \frac{w_i - w_{i-1}}{w_t}$, $H_t$ can be written as $\sum_{i=0}^{t} z_{i,t}\hat{H}(x_i)$. We discuss uniform averaging in Section 4 and non-uniform averaging in Section 5.

Building on the discussion thus far, we are ready to integrate all the components and present our Stochastic Newton Proximal Extragradient (SNPE) method, summarized in Algorithm 1. Each iteration of SNPE includes two stages. In the first stage, starting with the current point $x_t$, we query the noisy Hessian oracle and compute the averaged stochastic Hessian $H_t$ from (7), as stated in Steps 4 and 5. Then, given the gradient $\nabla f(x_t)$, the Hessian approximation $H_t$, and an initial trial step size $\sigma_t$, we employ a backtracking line search to obtain $\eta_t$ and $\hat{x}_t$, as stated in Step 6 of Algorithm 1. Specifically, in this step, we set $\eta_t \leftarrow \sigma_t$ and compute $\hat{x}_t$ as suggested in (6).

Algorithm 1 Stochastic NPE
1: Input: $x_0 \in \mathbb{R}^d$, weights $\{w_t\}_{t=0}^{\infty}$, line-search parameters $\alpha, \beta \in (0,1)$, initial step size $\sigma_0 > 0$
2: Initialize: $H_{-1} = 0$ and $w_{-1} = 0$
3: for $t = 0, 1, \dots$ do
4:   Obtain a stochastic Hessian $\hat{H}_t = \hat{H}(x_t)$
5:   Compute $H_t = \frac{w_{t-1}}{w_t}H_{t-1} + (1 - \frac{w_{t-1}}{w_t})\hat{H}_t$
6:   $(\eta_t, \hat{x}_t) = \mathrm{BLS}(x_t, \nabla f(x_t), H_t, \alpha, \beta, \sigma_t)$
7:   Let $\gamma_t = 1 + 2\eta_t\mu$ and compute $x_{t+1} = \frac{1}{\gamma_t}(x_t - \eta_t\nabla f(\hat{x}_t)) + (1 - \frac{1}{\gamma_t})\hat{x}_t$
8:   Set $\sigma_{t+1} = \eta_t/\beta$
9: end for

Subroutine 1 $(\eta, \hat{x}) = \mathrm{BLS}(x, g, H, \alpha, \beta, \sigma)$
1: Input: current iterate $x \in \mathbb{R}^d$, gradient $g \in \mathbb{R}^d$, Hessian approximation $H \in \mathbb{R}^{d\times d}$, line-search parameters $\alpha, \beta \in (0,1)$, initial trial step size $\sigma > 0$
2: Set $\eta \leftarrow \sigma$ and $\hat{x} \leftarrow x - \eta(I + \eta H)^{-1}g$
3: Set $\gamma \leftarrow 1 + 2\eta\mu$
4: while $\|\hat{x} - x + \eta\nabla f(\hat{x})\| > \alpha\sqrt{\gamma}\,\|\hat{x} - x\|$ do
5:   Set $\eta \leftarrow \beta\eta$ and $\hat{x} \leftarrow x - \eta(I + \eta H)^{-1}g$
6:   Set $\gamma \leftarrow 1 + 2\eta\mu$
7: end while
8: Output: $\eta$ and $\hat{x}$
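For concreteness, the following is a minimal NumPy sketch of Algorithm 1 and Subroutine 1, assuming callables `grad` for the exact gradient and `hess_oracle` for the noisy Hessian oracle; the helper names and parameter defaults are our own illustrative choices.

```python
import numpy as np

def bls(x, H, mu, alpha, beta, sigma, grad):
    """Subroutine 1: backtrack eta from sigma until condition (3) holds."""
    g, eye = grad(x), np.eye(x.shape[0])
    eta = sigma
    while True:
        x_hat = x - eta * np.linalg.solve(eye + eta * H, g)        # (6)
        gamma = 1.0 + 2.0 * eta * mu
        lhs = np.linalg.norm(x_hat - x + eta * grad(x_hat))
        if lhs <= alpha * np.sqrt(gamma) * np.linalg.norm(x_hat - x):
            return eta, x_hat                                      # accept
        eta *= beta                                                # backtrack

def snpe(x0, grad, hess_oracle, mu, weights, T,
         alpha=0.5, beta=0.5, sigma0=1.0):
    """Algorithm 1 (SNPE): averaging (7), line search, extragradient (4)."""
    x, sigma, H_avg, w_prev = x0.astype(float), sigma0, 0.0, 0.0
    for t in range(T):
        w = weights(t)
        H_avg = (w_prev / w) * H_avg + (1.0 - w_prev / w) * hess_oracle(x)
        eta, x_hat = bls(x, H_avg, mu, alpha, beta, sigma, grad)
        gamma = 1.0 + 2.0 * eta * mu
        x = (x - eta * grad(x_hat)) / gamma + (1.0 - 1.0 / gamma) * x_hat
        sigma, w_prev = eta / beta, w        # warm start (Step 8)
    return x
```

For uniform averaging one would pass `weights=lambda t: t + 1.0`, and for the non-uniform scheme of Section 5, `weights=lambda t: (t + 1.0) ** np.log(t + 4.0)`.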
If $\hat{x}_t$ and its corresponding step size $\eta_t$ satisfy (3), meaning the linear approximation error is small, then the step size $\eta_t$ and the mid-point $\hat{x}_t$ are accepted and we proceed to the second stage of SNPE. If not, we backtrack and try the smaller step size $\beta\eta_t$, where $\beta \in (0,1)$ is a user-specified parameter. We repeat the process until the condition in (3) is satisfied. The details of the backtracking line search scheme are summarized in Subroutine 1. After completing the first stage and obtaining the pair $(\eta_t, \hat{x}_t)$, we proceed to the extragradient step and follow the update in (4), as in Step 7 of Algorithm 1. Finally, before moving to the next iteration, we follow a warm-start strategy and set the next initial trial step size $\sigma_{t+1}$ to $\eta_t/\beta$, as shown in Step 8 of Algorithm 1.

Remark 3. Similar to the analysis in [22], we can show that the total number of line search steps after $t$ iterations is bounded by $2t - 1 + \log_{1/\beta}(\frac{\sigma_0}{\eta_{t-1}})$. Moreover, when $t$ is large enough, the line search requires on average two steps per iteration. We defer the details to Appendix A.4.

Remark 4. Our motivation behind the choice $\sigma_{t+1} = \eta_t/\beta$ is to allow the step size to grow, which is necessary for achieving a superlinear convergence rate. Specifically, as shown in Proposition 1 below, we require the step size $\eta_t$ to go to infinity to ensure that $\lim_{t\to\infty}\frac{\|x_{t+1}-x^*\|}{\|x_t-x^*\|} = 0$. Note that this would not be possible if we simply set $\sigma_{t+1} = \eta_t$, since it would automatically result in $\eta_{t+1} \le \sigma_{t+1} \le \eta_t$. Moreover, the condition $\sigma_{t+1} = \eta_t/\beta$ is explicitly utilized in Lemmas 8 and 16 in the appendix, where we demonstrate that $\eta_t$ can be lower bounded by the minimum of $\sigma_0/\beta^t$ and another term. We should also note that this more aggressive choice of the initial step size at each round could potentially increase the number of backtracking steps. However, as mentioned above, this does not cause a significant issue, since the average number of backtracking steps per iteration can be bounded by a constant close to 2.

3.1 Key properties of SNPE

This section outlines key properties of SNPE, which are applied in Sections 4 and 5 to determine its convergence rates. The first result reveals the connection between SNPE's convergence rate and the step size $\eta_t$.

Proposition 1. Let $\{x_t\}_{t\ge 0}$ and $\{\hat{x}_t\}_{t\ge 0}$ be the iterates generated by Algorithm 1. Then, for any $t \ge 0$, we have
$$\|x_{t+1} - x^*\|^2 \le \|x_t - x^*\|^2(1 + 2\eta_t\mu)^{-1}.$$

Proposition 1 guarantees that the distance to the optimal solution is monotonically decreasing, and it shows that a larger step size implies faster convergence. Hence, we need to provide an explicit lower bound on the step size. This task is accomplished in the next lemma. For ease of notation, we let $\mathcal{B}$ be the set of iteration indices where the line search scheme backtracks, i.e., $\mathcal{B} \triangleq \{t : \eta_t < \sigma_t\}$. Moreover, recall that we use $g(x)$ and $H(x)$ to denote the gradient $\nabla f(x)$ and the Hessian $\nabla^2 f(x)$, respectively.

Lemma 1. For $t \notin \mathcal{B}$, we have $\eta_t = \sigma_t$. For $t \in \mathcal{B}$, let $\tilde{\eta}_t = \eta_t/\beta$ and $\tilde{x}_t = x_t - \tilde{\eta}_t(I + \tilde{\eta}_t H_t)^{-1}\nabla f(x_t)$. Then, $\|\tilde{x}_t - x_t\| \le \frac{1}{\beta}\|\hat{x}_t - x_t\|$. Moreover,
$$\eta_t \ge \max\left\{\frac{\alpha\beta\,\|\tilde{x}_t - x_t\|}{\|g(\tilde{x}_t) - g(x_t) - H_t(\tilde{x}_t - x_t)\|},\ \frac{2\alpha^2\beta\mu\,\|\tilde{x}_t - x_t\|^2}{\|g(\tilde{x}_t) - g(x_t) - H_t(\tilde{x}_t - x_t)\|^2}\right\}.$$

As Lemma 1 demonstrates, in the first case, where $t \notin \mathcal{B}$, we have $\eta_t = \sigma_t$. Moreover, since we set $\sigma_t = \eta_{t-1}/\beta$ for $t \ge 1$, in this case the step size increases by a factor of $1/\beta$. In the second case, where $t \in \mathcal{B}$, our lower bound on the step size $\eta_t$ depends inversely on the normalized approximation error
$$\mathcal{E}_t \triangleq \frac{\|g(\tilde{x}_t) - g(x_t) - H_t(\tilde{x}_t - x_t)\|}{\|\tilde{x}_t - x_t\|}.$$
Also, note that $\mathcal{E}_t$ involves the auxiliary iterate $\tilde{x}_t$ instead of the actual iterate $\hat{x}_t$ accepted by our line search.
We use the first result in Lemma 1 to relate $\|\tilde{x}_t - x_t\|$ to $\|\hat{x}_t - x_t\|$. To shed light on our analysis, we use the triangle inequality and decompose this error into two terms:
$$\mathcal{E}_t \le \frac{\|g(\tilde{x}_t) - g(x_t) - H(x_t)(\tilde{x}_t - x_t)\|}{\|\tilde{x}_t - x_t\|} + \|H_t - H(x_t)\|. \quad (8)$$
The first term in (8) represents the intrinsic error from the linear approximation in the inexact proximal update, while the second term arises from the Hessian approximation error. Using the smoothness properties of $f$, we can upper bound the first term, as shown in the following lemma.

Lemma 2. Under Assumptions 2 and 3, we have
$$\frac{\|g(\tilde{x}_t) - g(x_t) - H(x_t)(\tilde{x}_t - x_t)\|}{\|\tilde{x}_t - x_t\|} \le \min\left\{M_1,\ \frac{L_2}{2\beta\sqrt{1-\alpha^2}}\|x_t - x^*\|\right\}. \quad (9)$$

Lemma 2 shows that the linear approximation error is upper bounded by $M_1$. Moreover, the second upper bound is $O(\|x_t - x^*\|)$. Thus, as Algorithm 1 converges to the optimal solution $x^*$, the second bound in (9) becomes tighter than the first one, and the right-hand side approaches zero.

To analyze the second term in (8), we isolate the noise component in our averaged Hessian estimate. Specifically, recall that $H_t = \sum_{i=0}^{t} z_{i,t}\hat{H}_i$ and $\hat{H}_i = H(x_i) + E(x_i)$. Thus, we have $H_t = \bar{H}_t + \bar{E}_t$, where $\bar{H}_t = \sum_{i=0}^{t} z_{i,t}H(x_i)$ is the aggregated Hessian and $\bar{E}_t = \sum_{i=0}^{t} z_{i,t}E(x_i)$ is the aggregated Hessian noise, and it follows from the triangle inequality that
$$\|H_t - H(x_t)\| \le \|\bar{H}_t - H(x_t)\| + \|\bar{E}_t\|.$$
We refer to the first part, $\|\bar{H}_t - H(x_t)\|$, as the bias of our Hessian estimate, and to the second part, $\|\bar{E}_t\|$, as the averaged stochastic error. There is an intrinsic trade-off between the two error terms. For the fastest error concentration, we assign equal weights to all past stochastic Hessian noises, i.e., $z_{i,t} = 1/(t+1)$ for all $0 \le i \le t$, corresponding to the uniform averaging scheme discussed in Section 4. To eliminate the bias, we would assign all weight to the most recent Hessian matrix, i.e., $z_{t,t} = 1$ and $z_{i,t} = 0$ for all $i < t$, but this incurs a large stochastic error. To balance the two, we present a weighted averaging scheme in Section 5 that gradually assigns more weight to recent stochastic Hessian approximations.
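This trade-off can be made concrete numerically. The sketch below builds a synthetic sequence of drifting exact Hessians with additive zero-mean noise (stand-ins of our own choosing, not the paper's setting) and splits $\|H_t - H(x_t)\|$ into the bias and averaged-noise parts for the two weighting schemes:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, noise = 20, 200, 0.5

# Drifting exact Hessians H(x_i), plus symmetric zero-mean noise E_i.
H_true = [np.eye(d) * (1.0 + 2.0 ** (-i / 10)) for i in range(T)]
H_hat = []
for H in H_true:
    G = rng.standard_normal((d, d))
    H_hat.append(H + noise * (G + G.T) / (2 * np.sqrt(d)))

def errors(z):
    """Split ||H_t - H(x_t)|| into bias and averaged-noise terms."""
    H_bar = sum(zi * H for zi, H in zip(z, H_true))    # aggregated Hessian
    E_bar = sum(zi * (Hh - H) for zi, Hh, H in zip(z, H_hat, H_true))
    t = len(z) - 1
    return (np.linalg.norm(H_bar - H_true[t], 2),      # bias
            np.linalg.norm(E_bar, 2))                  # averaged noise

for t in (10, 50, 199):
    unif = np.full(t + 1, 1.0 / (t + 1))
    w = np.array([(i + 1.0) ** np.log(i + 4.0) for i in range(-1, t + 1)])
    nonunif = (w[1:] - w[:-1]) / w[-1]                 # z_{i,t} from (7)
    print(t, errors(unif), errors(nonunif))
```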
4 Analysis of uniform Hessian averaging

In this section, we present the convergence analysis of the uniform Hessian averaging scheme, where $w_t = t + 1$. In this case, we have $H_t = \frac{1}{t+1}\sum_{i=0}^{t}\hat{H}_i$. As discussed in Section 3.1, our main task is to lower bound the step size $\eta_t$, which requires us to control the approximation error $\mathcal{E}_t$ by analyzing the two error terms in (8). The first term is bounded by Lemma 2, and the second term can be bounded as $\|H_t - H(x_t)\| \le \|\bar{H}_t - H(x_t)\| + \|\bar{E}_t\|$. Next, we establish a bound on $\|\bar{E}_t\|$, referred to as the averaged stochastic error, and a bound on $\|\bar{H}_t - H(x_t)\|$, referred to as the bias term.

Averaged stochastic error. To control the averaged Hessian noise $\|\bar{E}_t\|$, we rely on the concentration of sub-exponential martingale differences, as shown in [1].

Lemma 3 ([1, Lemma 2]). Let $\delta \in (0,1)$ with $d/\delta \ge e$. Then, with probability at least $1 - \delta\pi^2/6$, we have
$$\|\bar{E}_t\| \le 8\Upsilon_E\sqrt{\frac{\log(d(t+1)/\delta)}{t+1}} \quad \text{for any } t \ge 4\log(d/\delta).$$

Lemma 3 shows that, with high probability, the norm of the averaged Hessian noise $\|\bar{E}_t\|$ approaches zero at the rate of $O(\Upsilon_E/\sqrt{t})$. As discussed in Section 4.1, this error eventually becomes the dominant factor in the approximation error $\mathcal{E}_t$ and determines the final superlinear rate of our algorithm.

Remark 5. Our subsequent results are conditioned on the event that the bound on $\|\bar{E}_t\|$ stated in Lemma 3 is satisfied for all $t \ge 4\log(d/\delta)$. Thus, to avoid redundancy, we omit the "with high probability" qualification in the following discussion.

Bias. We proceed to establish an upper bound on $\|\bar{H}_t - H(x_t)\|$. The proof can be found in Appendix B.1.

Lemma 4. If $\bar{H}_t = \frac{1}{t+1}\sum_{i=0}^{t}H(x_i)$, then $\|\bar{H}_t - H(x_t)\| \le \frac{1}{t+1}\sum_{i=0}^{t}\|H(x_t) - H(x_i)\|$. Moreover, for any $i \ge 0$, we have $\|H(x_t) - H(x_i)\| \le \min\{M_1,\ 2L_2\|x_i - x^*\|\}$.

The analysis of the bias term is more involved. Specifically, to obtain the best result, we break the sum in Lemma 4 into two parts, $\frac{1}{t+1}\sum_{i=0}^{I-1}\|H(x_t) - H(x_i)\|$ and $\frac{1}{t+1}\sum_{i=I}^{t}\|H(x_t) - H(x_i)\|$, where $I$ is an integer to be specified later. The first part corresponds to the bias from stale Hessian information and converges to zero at the rate $O(M_1 I/t)$, as shown by the first bound in Lemma 4. The second part is the bias from recent Hessian information when the iterates are near the optimal solution $x^*$. Using the second bound in Lemma 4, we show that this part contributes less to the total bias and is dominated by the first part. Thus, we can conclude that $\|\bar{H}_t - H(x_t)\| = O(\frac{M_1 I}{t})$.

Based on the previous discussion, it is evident that the terms contributing to the upper bound on $\mathcal{E}_t$ all converge to zero, albeit at different rates. Furthermore, the linear approximation error and the bias term display distinct global and local convergence patterns, depending on the distance $\|x_t - x^*\|$. Hence, a multi-phase convergence analysis is necessary, which we undertake in the following section.

4.1 Convergence analysis

Similar to [1], we consider four convergence phases with three transition points $T_1$, $T_2$, and $T_3$, whose expressions will be specified later. Due to space limitations, in the following we provide an overview of the four phases and relegate the details to Appendix B.

Warm-up phase ($0 \le t < T_1$). At the beginning of the algorithm, the averaged Hessian estimate is dominated by stochastic noise and provides little useful information for convergence. Thus, there are generally no guarantees on the convergence rate for $0 \le t < T_1$. However, due to the line search scheme, Proposition 1 ensures that the distance to $x^*$ is non-increasing, i.e., $\|x_{t+1}-x^*\| \le \|x_t-x^*\|$ for all $t \ge 0$. During the warm-up phase, the averaged Hessian noise $\|\bar{E}_t\|$, which contributes most to the approximation error $\mathcal{E}_t$, is gradually suppressed. Once the averaged Hessian noise is sufficiently concentrated, Algorithm 1 transitions to the linear convergence phase at $T_1$.

Linear convergence phase ($T_1 \le t < T_2$). After $T_1$ iterations, Algorithm 1 starts converging linearly to the optimal solution $x^*$. Moreover, during this phase, all three errors discussed in Section 3.1 continue to decrease. Specifically, Lemma 2 shows that the linear approximation error is bounded by $O(\|x_t - x^*\|)$, which converges to zero at a linear rate. Furthermore, Lemma 3 implies that the averaged Hessian noise $\|\bar{E}_t\|$ diminishes at the rate of $O(\frac{\Upsilon_E}{\sqrt{t}})$. Finally, regarding the bias term, it can be shown that $\|\bar{H}_t - H(x_t)\| = O(\frac{M_1 I}{t})$, following the discussion after Lemma 4. Thus, once all three errors are sufficiently small, Algorithm 1 moves to the superlinear phase at $T_2$.

Superlinear phases ($T_2 \le t < T_3$ and $t \ge T_3$). After $T_2$ iterations, Algorithm 1 converges at a superlinear rate. Moreover, the superlinear rate is determined by the averaged noise $\|\bar{E}_t\|$, which decays at the rate of $O(\frac{\Upsilon_E}{\sqrt{t}})$, and by the bias of our averaged Hessian estimate, which decays at the rate of $O(\frac{M_1 I}{t})$. Hence, as the number of iterations $t$ increases, the averaged noise eventually dominates and the algorithm transitions from the initial superlinear rate to the final superlinear rate.

We summarize our convergence guarantees in the following theorem; the proofs are in Appendix B.

Theorem 1. Suppose Assumptions 1-5 hold and the weights for Hessian averaging in SNPE are uniform, and define $C := \frac{1}{\sqrt{2\beta}}\frac{1}{1-\alpha^2} + 5$. Then, the following hold:
(a) Warm-up phase: If $0 \le t < T_1$, then $\|x_{t+1}-x^*\| \le \|x_t-x^*\|$, where $T_1 = O(\frac{\Upsilon^2}{\kappa^2})$.
(b) Linear convergence phase: If $T_1 \le t < T_2$, then $\|x_{t+1}-x^*\|^2 \le \|x_t-x^*\|^2(1+\frac{2\alpha\beta}{3\kappa})^{-1}$, where $T_2 = O(\max\{\frac{\Upsilon^2}{\kappa}+\kappa^2,\ \Upsilon^2\}) = O(\Upsilon^2+\kappa^2)$.
(c) Initial superlinear phase: If $T_2 \le t < T_3$, then $\|x_{t+1}-x^*\| \le C\rho^{(1)}_t\|x_t-x^*\|$, where $\rho^{(1)}_t = \frac{6\kappa I}{2\alpha\sqrt{\beta}(t+1)} = O\left(\frac{\Upsilon^2/\kappa+\kappa^2}{t}\right)$, with $I$ defined in (28) and $T_3 = O\left(\frac{(\Upsilon^2/\kappa+\kappa^2)^2}{\Upsilon^2}\right)$.
(d) Final superlinear phase: Finally, if $t \ge T_3$, then $\|x_{t+1}-x^*\| \le C\rho^{(2)}_t\|x_t-x^*\|$, where $\rho^{(2)}_t = \frac{8\Upsilon}{\alpha\sqrt{\beta}}\sqrt{\frac{\log(d(t+1)/\delta)}{t+1}} = O\left(\Upsilon\sqrt{\frac{\log t}{t}}\right)$.
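To get a feel for the two superlinear factors in Theorem 1, the following evaluates their $O(\cdot)$ expressions with all constants set to one; the values of $\kappa$ and $\Upsilon$ are illustrative, with $\Upsilon = O(\sqrt{\kappa})$:

```python
import numpy as np

# Order-of-magnitude comparison of the superlinear factors in Theorem 1
# (constants dropped, as in the O(.) statements):
#   initial phase: rho1_t ~ (Ups^2/kappa + kappa^2) / t
#   final   phase: rho2_t ~ Ups * sqrt(log(t) / t)
kappa, ups = 100.0, 10.0

rho1 = lambda t: (ups**2 / kappa + kappa**2) / t
rho2 = lambda t: ups * np.sqrt(np.log(t) / t)

T3 = (ups**2 / kappa + kappa**2) ** 2 / ups**2   # balance point, up to logs
for t in (1e3, 1e4, T3, 1e8):
    # the smaller factor governs the rate; rho2 takes over beyond ~T3
    print(int(t), rho1(t), rho2(t), min(rho1(t), rho2(t)))
```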
Comparison with [1]. As shown in Table 1, our method in Algorithm 1 with uniform averaging achieves the same final superlinear convergence rate as the stochastic Newton method in [1]. However, it transitions to the linear and superlinear phases much earlier. Specifically, the initial transition point $T_1$ is improved by a factor of $\kappa^2$, and our linear rate in Lemma 6 is faster. This reduces the number of iterations needed to reach the local neighborhood, cutting the time to reach $T_2$ and $T_3$ by factors of $\kappa$ and $\kappa^2$, respectively.

5 Analysis of weighted Hessian averaging

Previously, we showed that Algorithm 1 with uniform averaging eventually achieves superlinear convergence. However, as per Theorem 1, it requires $O(\frac{\kappa^4}{\Upsilon^2})$ iterations to reach this rate. To achieve a faster transition, we follow [1] and use Hessian averaging with a general weight sequence $\{w_t\}$. We show that this method also outperforms the stochastic Newton method in [1]. Specifically, we set $w_t = w(t)$ for all integers $t \ge 0$, where $w(\cdot): \mathbb{R} \to \mathbb{R}$ satisfies certain regularity conditions as in [1, Assumption 3].

Assumption 6. (i) $w(\cdot)$ is twice differentiable; (ii) $w(-1) = 0$ and $w(t) > 0$ for all $t \ge 0$; (iii) $w'(-1) \ge 0$; (iv) $w''(t) \ge 0$ for all $t \ge -1$; (v) $\max\left\{\frac{w(t+1)}{w(t)}, \frac{w'(t+1)}{w'(t)}\right\} \le \Psi$ for all $t \ge 0$ and some $\Psi \ge 1$.

Choosing $w(t) = (t+1)^p$ for any $p \ge 1$ satisfies Assumption 6. Additionally, as discussed in [1], a suitable choice is $w(t) = (t+1)^{\log(t+4)}$, which allows us to achieve the optimal transition to the superlinear rate.
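The following snippet computes the normalized weights $z_{i,t}$ induced by this choice of $w(t)$ and shows how heavily they concentrate on recent iterations compared with uniform averaging:

```python
import numpy as np

# Weights z_{i,t} induced by w(t) = (t+1)^log(t+4) at t = 100, versus
# uniform averaging: the non-uniform scheme puts most of its mass on
# recent Hessian estimates, trading variance for reduced bias.
t = 100
w = np.array([(i + 1.0) ** np.log(i + 4.0) for i in range(-1, t + 1)])
z = (w[1:] - w[:-1]) / w[-1]       # z_{i,t} = (w_i - w_{i-1}) / w_t

print(z.sum())                     # = 1: a valid averaging scheme
print(z[-10:].sum())               # mass on the 10 most recent estimates
print(10 / (t + 1.0))              # uniform scheme puts only ~0.099 there
```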
Since the analysis in this section closely resembles that of Section 4 on uniform averaging, for brevity we only present the final result here. The four stages of convergence are detailed in the following theorem, with intermediate lemmas and proofs in the appendix. To simplify our bounds, we report results for non-uniform averaging with $w(t) = (t+1)^{\log(t+4)}$.

Theorem 2. Suppose Assumptions 1-5 hold and the weights for Hessian averaging in SNPE are defined by $w(t) = (t+1)^{\log(t+4)}$, and define $\tilde{C} := \frac{1}{\sqrt{10\beta}}\frac{2}{1-\alpha^2} + 12$. Then, the following hold:
(a) Warm-up phase: If $0 \le t < U_1$, then $\|x_{t+1}-x^*\| \le \|x_t-x^*\|$, where $U_1 = O(\frac{\Upsilon^2}{\kappa^2})$.
(b) Linear convergence phase: If $U_1 \le t < U_2$, then $\|x_{t+1}-x^*\|^2 \le \|x_t-x^*\|^2(1+\frac{2\alpha\beta}{3\kappa})^{-1}$, where $U_2 = O(\max\{\frac{\Upsilon^2}{\kappa^2}+\kappa,\ \Upsilon^2\}) = O(\Upsilon^2+\kappa)$.
(c) Initial superlinear phase: If $U_2 \le t < U_3$, then $\|x_{t+1}-x^*\| \le \tilde{C}\theta^{(1)}_t\|x_t-x^*\|$, where $\theta^{(1)}_t = \frac{5\kappa\,w(J)}{\alpha\sqrt{2\beta}\,w(t)} = O\left(\frac{\kappa(\Upsilon^2+\kappa)^{\log(\Upsilon^2+\kappa)}}{t^{\log t}}\right)$, with $J$ defined in (49) and $U_3 = O(\Upsilon^2+\kappa)$.
(d) Final superlinear phase: Finally, if $t \ge U_3$, then $\|x_{t+1}-x^*\| \le \tilde{C}\theta^{(2)}_t\|x_t-x^*\|$, where $\theta^{(2)}_t = O\left(\Upsilon\sqrt{\frac{w'(t)}{w(t)}\log\frac{d\sqrt{t+1}}{\delta}}\right) = O\left(\Upsilon\frac{\log(t)}{\sqrt{t}}\right)$.

In the weighted averaging case, similar to the uniform averaging scenario, we observe four distinct phases of convergence. The warm-up phase for SNPE, during which the distance to the optimal solution does not increase, has the same duration as in the uniform averaging case but is shorter than the warm-up phase of the stochastic Newton method in [1] by a factor of $1/\kappa^2$. The linear convergence rates of both the uniform and weighted Hessian averaging variants are $1 - O(\kappa^{-1})$, improving over the $1 - O(\kappa^{-2})$ rate achieved by the stochastic Newton method in [1]. The number of iterations to reach the initial superlinear phase is $O(\Upsilon^2+\kappa)$, smaller than the $O(\kappa^2)$ needed for uniform averaging in SNPE when we focus on the regime where $\Upsilon = O(\sqrt{\kappa})$. The non-uniform averaging method in [1] requires $\kappa^2$ iterations to reach its initial superlinear phase, whereas non-uniform SNPE achieves an initial superlinear rate of $O(\kappa^{\log(\kappa)+1}/t^{\log(t)})^t$, improving over the rate of $O(\kappa^{4\log(\kappa)+1}/t^{\log(t)})^t$ in [1]. Finally, while the ultimate superlinear rates in all cases are comparable, at roughly $O((1/\sqrt{t})^t)$, the non-uniform version of SNPE requires $O(\Upsilon^2+\kappa)$ iterations to attain this fast rate, whereas the non-uniform version in [1] requires $O(\kappa^2)$ iterations, which is less favorable.

Remark 6. As discussed in [1, Example 1], with a subsampling size of $s$, we have $\Upsilon = O(\kappa/s)$ for subsampled Newton. This implies that when $s = \Omega(\sqrt{\kappa})$, we achieve $\Upsilon = O(\sqrt{\kappa})$.

6 Discussion and complexity comparison

In this section, we compare the complexity of our method with accelerated gradient descent (AGD), damped Newton, and the stochastic Newton method of [1]. The iteration complexities of these methods are summarized in Table 2, which we use to establish their overall complexities.

Table 2: Comparison in terms of overall iteration complexity to find an $\epsilon$-accurate solution.

| Method | Iteration complexity |
|---|---|
| AGD | $O(\sqrt{\kappa}\log(1/\epsilon))$ |
| Damped Newton | $O(\kappa^2 + \log\log(1/\epsilon))$ |
| Stochastic Newton [1] | $O(\Upsilon^2 + \kappa^2 + \frac{\log(1/\epsilon)}{\log\log(1/\epsilon)})$ |
| SNPE (ours) | $O(\Upsilon^2 + \kappa + \frac{\log(1/\epsilon)}{\log\log(1/\epsilon)})$ |

Note that since both stochastic Newton and our proposed SNPE method achieve superlinear convergence, their complexities depend on the target accuracy $\epsilon$ through $O(\frac{\log(1/\epsilon)}{\log\log(1/\epsilon)})$, which is provably better than the $\epsilon$-dependence of AGD by at least a factor of $\log\log(\epsilon^{-1})$. Further details are provided in Appendix D.1.

To better compare the methods, we focus on the finite-sum problem with $n$ functions: $\min_{x\in\mathbb{R}^d}\sum_{i=1}^{n} f_i(x)$. Let $\epsilon$ be the target accuracy, $\kappa$ the condition number, and $\Upsilon$ the noise level in Assumption 4. In this case, computing the exact gradient and Hessian costs $O(nd)$ and $O(nd^2)$, respectively. Thus, the per-iteration cost of AGD is $O(nd)$. Each iteration of damped Newton's method requires computing the full Hessian and solving a linear system, resulting in a total per-iteration cost of $O(nd^2 + d^3)$. For both the stochastic Newton method in [1] and our SNPE method, the per-iteration cost depends on how the stochastic Hessian is constructed. For example, subsampled Newton constructs the Hessian estimate at a cost of $O(sd^2)$, where $s$ denotes the sample size; Newton sketch has a similar computational cost (see [6]). Additionally, it takes $O(nd)$ to compute the full gradient and $O(d^3)$ to solve the linear system. Since the sample/sketch size $s$ is typically chosen as $s = O(d)$, the total per-iteration cost is $O(nd + d^3)$.

Compared to AGD, SNPE achieves better iteration complexity. Specifically, when the noise level $\Upsilon$ and the target accuracy $\epsilon$ are relatively small ($\Upsilon = O(\sqrt{\kappa})$ and $\log\frac{1}{\epsilon} = \Omega(\sqrt{\kappa})$), SNPE converges in fewer iterations. Additionally, when $n \ge d^2$ (indicating many samples), the per-iteration costs of both methods are $O(nd)$, giving our method a better overall complexity. Compared to damped Newton's method, our method's iteration complexity has a better dependence on the condition number $\kappa$, while damped Newton's has a better dependence on $\epsilon$. However, when $n \ge d^2$, the per-iteration cost of damped Newton is $O(nd^2)$, significantly more than our method's $O(nd)$. Compared to the stochastic Newton method in [1], the per-iteration costs of both methods are similar. However, Table 2 shows that our iteration complexity is strictly better. Specifically, when the noise level is relatively small compared to the condition number, i.e., $\Upsilon = O(\sqrt{\kappa})$, the complexity of SNPE improves by an additional factor of $\kappa$ over the stochastic Newton method.
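The following back-of-the-envelope computation (our illustration, with constants dropped) instantiates these per-iteration costs at sizes matching the experiments in Section 7:

```python
# Per-iteration cost estimates (in flops, constants dropped) in the
# n >> d regime; n, d, s are illustrative values with s = O(d).
n, d, s = 100_000, 500, 500

agd            = n * d                    # one full gradient
damped_newton  = n * d**2 + d**3          # full Hessian + linear solve
stochastic_2nd = n * d + s * d**2 + d**3  # gradient + subsampled/sketched
                                          # Hessian + linear solve

print(f"AGD:           {agd:.1e}")        # ~5e7
print(f"Damped Newton: {damped_newton:.1e}")  # ~2.5e10
print(f"SN / SNPE:     {stochastic_2nd:.1e}")  # ~3e8
# With n >> d, the stochastic second-order cost stays within a small
# factor of AGD's O(nd), while damped Newton is roughly d times larger.
```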
7 Numerical experiments

While our focus is on improving theoretical guarantees, we also provide simple experiments to showcase our method's improvements. All simulations are run on a Windows PC with an AMD processor and 16 GB of memory. We consider minimizing the regularized log-sum-exp objective
$$f(x) = \rho\log\left(\sum_{i=1}^{n}\exp\left(\frac{a_i^\top x - b_i}{\rho}\right)\right) + \frac{\lambda}{2}\|x\|^2,$$
a common test function for second-order methods [23, 24] due to its high ill-conditioning. Here, $\lambda > 0$ is a regularization parameter, $\rho > 0$ is a smoothing parameter, and the entries of the vectors $a_1, \dots, a_n \in \mathbb{R}^d$ and $b \in \mathbb{R}^n$ are randomly generated from the standard normal distribution and the uniform distribution over $[0,1]$, respectively. In our experiments, the regularization parameter $\lambda$ is $10^{-3}$, the dimension $d$ is 500, and the number of samples $n$ is chosen from 50,000, 100,000, and 150,000.

We compare our SNPE method with the stochastic Newton method in [1], using both uniform Hessian averaging (Section 4) and weighted averaging (Section 5). In addition, we evaluate it against accelerated gradient descent (AGD), damped Newton's method, and Newton proximal extragradient (NPE), which corresponds to our SNPE method with the exact Hessian. For the stochastic Hessian estimate, we use a subsampling strategy with a subsampling size of $s = 500$. Empirically, we found that the extragradient step (Step 7 in Algorithm 1) tends to slow down convergence. To address this, we consider a variant of our method without the extragradient step, modifying Step 7 to $x_{t+1} = \hat{x}_t$. In Figure 3 of the appendix, we further explore the effect of the extragradient step. We also note that similar observations were made in [19], where the simple iteration $x_{t+1} = x_t - \eta_t(I + \eta_t H_t)^{-1}g_t$ outperformed accelerated second-order methods.

[Figure 1: Iteration complexity comparison for minimizing log-sum-exp on a synthetic dataset. Panels: (a) $n = 50{,}000$, $d = 500$; (b) $n = 100{,}000$, $d = 500$; (c) $n = 150{,}000$, $d = 500$. Each panel plots the error versus the iteration count for SN-Unif Avg, SN-Weight Avg, SNPE-Unif Avg, SNPE-Weight Avg, AGD, Damped Newton, and NPE.]

[Figure 2: Runtime comparison for minimizing log-sum-exp on a synthetic dataset. Panels: (a) $n = 50{,}000$, $d = 500$; (b) $n = 100{,}000$, $d = 500$; (c) $n = 150{,}000$, $d = 500$. Each panel plots the error versus wall-clock time for the same methods as in Figure 1.]
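For reference, a sketch of this objective with its exact gradient and Hessian is given below; the helper `make_logsumexp` and the scaled-down sizes are our own illustrative choices, not the experiment code.

```python
import numpy as np

def make_logsumexp(A, b, rho=1.0, lam=1e-3):
    """Regularized log-sum-exp test objective:
    f(x) = rho * log(sum_i exp((a_i^T x - b_i)/rho)) + (lam/2) ||x||^2.
    Returns callables for f, its gradient, and its exact Hessian.
    """
    def softmax(x):
        z = (A @ x - b) / rho
        z -= z.max()                        # numerically stable softmax
        p = np.exp(z)
        return p / p.sum()
    def f(x):
        z = (A @ x - b) / rho
        m = z.max()
        return rho * (m + np.log(np.exp(z - m).sum())) + 0.5 * lam * (x @ x)
    def grad(x):
        return A.T @ softmax(x) + lam * x
    def hess(x):
        p = softmax(x)
        Ap = A * p[:, None]                 # rows a_i weighted by p_i
        return ((A.T @ Ap - np.outer(A.T @ p, A.T @ p)) / rho
                + lam * np.eye(A.shape[1]))
    return f, grad, hess

rng = np.random.default_rng(3)
n, d = 1000, 50                             # scaled-down illustrative sizes
A, b = rng.standard_normal((n, d)), rng.uniform(0.0, 1.0, n)
f, grad, hess = make_logsumexp(A, b)
```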
From Figures 1 and 2, we make the following observations.

Comparison with stochastic Newton. In all cases, our SNPE method outperforms stochastic Newton in both the number of iterations and the runtime, owing to the problem's highly ill-conditioned nature and our method's better dependence on the condition number.

Comparison with AGD. From Figure 1, we observe that our SNPE method, with either uniform or weighted averaging, requires far fewer iterations to converge than AGD due to the use of second-order information. Consequently, while SNPE has a higher per-iteration cost than AGD, it converges faster overall in terms of runtime, as demonstrated in Figure 2.

Comparison with damped Newton's method and NPE. As expected, since both damped Newton and NPE use the exact Hessian, Figure 1 shows that they exhibit superlinear convergence and converge in fewer iterations than the other algorithms. However, since the exact Hessian matrix is expensive to compute, they incur a high per-iteration computational cost and overall take more time than our proposed SNPE method to converge (see Figure 2). Moreover, the gap between these two methods and SNPE widens as the number of samples $n$ increases, demonstrating the advantage of our method in the large-data regime.

8 Conclusions and limitations

We introduced a stochastic variant of the Newton proximal extragradient method (SNPE) for minimizing a strongly convex and smooth function with access to a noisy Hessian oracle. Our contributions include establishing convergence guarantees under two Hessian averaging schemes: uniform and non-uniform. We characterized the computational complexity in both cases and demonstrated that SNPE outperforms the best-known results for the considered problem. A limitation of our theory is the assumption of strong convexity; extending the theory to the convex setting would make it more general. This extension is left for future work due to space limitations.

Acknowledgments

The research of R. Jiang and A. Mokhtari is supported in part by NSF Grant 2007668, the NSF AI Institute for Foundations of Machine Learning (IFML), and the Wireless Networking and Communications Group (WNCG) Industrial Affiliates Program at UT Austin. The research of M. Dereziński was partially supported by NSF Grant CCF-2338655.

References

[1] Sen Na, Michał Dereziński, and Michael W. Mahoney. Hessian averaging in stochastic Newton methods achieves superlinear convergence. Mathematical Programming, pages 1-48, 2022.
[2] Yurii Nesterov. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159-181, 2008.
[3] Renato D. C. Monteiro and Benar F. Svaiter. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization, 23(2):1092-1125, 2013.
[4] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148-4187, 2017.
[5] Guy Kornowski and Ohad Shamir. High-order oracle complexity of smooth and strongly convex optimization. arXiv preprint arXiv:2010.06642, 2020.
[6] Mert Pilanci and Martin J. Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205-245, 2017.
[7] Richard H. Byrd, Gillian M. Chin, Will Neveitt, and Jorge Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977-995, 2011.
[8] Murat A. Erdogdu and Andrea Montanari. Convergence rates of sub-sampled Newton methods. Advances in Neural Information Processing Systems, 28, 2015.
[9] Raghu Bollapragada, Richard H. Byrd, and Jorge Nocedal. Exact and inexact subsampled Newton methods for optimization. IMA Journal of Numerical Analysis, 39(2):545-578, 2019.
[10] Farbod Roosta-Khorasani and Michael W. Mahoney. Sub-sampled Newton methods. Mathematical Programming, 174:293-326, 2019.
[11] Michał Dereziński and Michael W. Mahoney. Distributed estimation of the inverse Hessian by determinantal averaging. Advances in Neural Information Processing Systems, 32, 2019.
[12] Haishan Ye, Luo Luo, and Zhihua Zhang. Approximate Newton methods. Journal of Machine Learning Research, 22(66):1-41, 2021.
[13] Michał Dereziński, Jonathan Lacotte, Mert Pilanci, and Michael W. Mahoney. Newton-LESS: Sparsification without trade-offs for the sketched Newton update. Advances in Neural Information Processing Systems, 34:2835-2847, 2021.
[14] Albert S. Berahas, Raghu Bollapragada, and Jorge Nocedal. An investigation of Newton-sketch and subsampled Newton methods. Optimization Methods and Software, 35(4):661-680, 2020.
[15] Michał Dereziński and Michael W. Mahoney. Recent and upcoming developments in randomized numerical linear algebra for machine learning. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6470-6479, 2024.
[16] Mikhail V. Solodov and Benar F. Svaiter. A hybrid approximate extragradient-proximal point algorithm using the enlargement of a maximal monotone operator. Set-Valued Analysis, 7(4):323-345, 1999.
[17] Renato D. C. Monteiro and Benar F. Svaiter. Iteration-complexity of a Newton proximal extragradient method for monotone variational inequalities and inclusion problems. SIAM Journal on Optimization, 22(3):914-935, 2012.
[18] Renato D. C. Monteiro and Benar F. Svaiter. On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean. SIAM Journal on Optimization, 20(6):2755-2787, 2010.
[19] Yair Carmon, Danielle Hausler, Arun Jambulapati, Yujia Jin, and Aaron Sidford. Optimal and adaptive Monteiro-Svaiter acceleration. Advances in Neural Information Processing Systems, 35:20338-20350, 2022.
[20] Dmitry Kovalev and Alexander Gasnikov. The first optimal acceleration of high-order methods in smooth convex optimization. Advances in Neural Information Processing Systems, 35:35339-35351, 2022.
[21] Mathieu Barré, Adrien Taylor, and Francis Bach. A note on approximate accelerated forward-backward methods with absolute and relative errors, and possibly strongly convex objectives. Open Journal of Mathematical Optimization, 3:1-15, 2022.
[22] Ruichen Jiang, Qiujiang Jin, and Aryan Mokhtari. Online learning guided curvature approximation: A quasi-Newton method with global non-asymptotic superlinear convergence. arXiv preprint arXiv:2302.08580, 2023.
[23] Konstantin Mishchenko. Regularized Newton method with global convergence. SIAM Journal on Optimization, 33(3):1440-1462, 2023.
[24] Nikita Doikov, Konstantin Mishchenko, and Yurii Nesterov. Super-universal regularized Newton method. SIAM Journal on Optimization, 34(1):27-56, 2024.
[25] Yurii Nesterov. Lectures on Convex Optimization. Springer International Publishing, 2018.

A Missing proofs in Section 3

A.1 Proof of Proposition 1

In this section, we prove Proposition 1. In fact, using the same proof, we can also show an additional result that upper bounds $\|x_t - \hat{x}_t\|$, which will be useful in the proof of Lemma 2. Therefore, we present the full version below for completeness.

Proposition 2 (Full version of Proposition 1). Let $\{x_t\}_{t\ge 0}$ and $\{\hat{x}_t\}_{t\ge 0}$ be the iterates generated by Algorithm 1. Then, for any $t \ge 0$, we have
$$\|x_{t+1} - x^*\|^2 \le \|x_t - x^*\|^2(1 + 2\eta_t\mu)^{-1} \quad \text{and} \quad \|x_t - \hat{x}_t\| \le \frac{1}{\sqrt{1-\alpha^2}}\|x_t - x^*\|.$$

Proof. Our proof is inspired by the approach in [22, Proposition 1]. For any $x \in \mathbb{R}^d$, we first write
$$\langle \eta_t\nabla f(\hat{x}_t), \hat{x}_t - x\rangle = \langle \hat{x}_t - x_t + \eta_t\nabla f(\hat{x}_t), \hat{x}_t - x\rangle + \langle x_t - \hat{x}_t, \hat{x}_t - x\rangle. \quad (10)$$
To begin with, we bound the first term in (10) by
$$\langle \hat{x}_t - x_t + \eta_t\nabla f(\hat{x}_t), \hat{x}_t - x\rangle \le \|\hat{x}_t - x_t + \eta_t\nabla f(\hat{x}_t)\|\,\|\hat{x}_t - x\| \le \alpha\sqrt{1+2\eta_t\mu}\,\|\hat{x}_t - x_t\|\,\|\hat{x}_t - x\| \le \frac{\alpha^2}{2}\|\hat{x}_t - x_t\|^2 + \frac{1+2\eta_t\mu}{2}\|\hat{x}_t - x\|^2, \quad (11)$$
where the first inequality is due to the Cauchy-Schwarz inequality, the second inequality is due to the condition in (3), and the last inequality is due to Young's inequality. Moreover, for the second term in (10), we use the three-point equality to get
$$\langle x_t - \hat{x}_t, \hat{x}_t - x\rangle = \frac{1}{2}\|x_t - x\|^2 - \frac{1}{2}\|x_t - \hat{x}_t\|^2 - \frac{1}{2}\|\hat{x}_t - x\|^2. \quad (12)$$
By combining (10), (11), and (12), we obtain that
$$\langle \eta_t\nabla f(\hat{x}_t), \hat{x}_t - x\rangle \le \frac{1}{2}\|x_t - x\|^2 - \frac{1-\alpha^2}{2}\|x_t - \hat{x}_t\|^2 + \eta_t\mu\|\hat{x}_t - x\|^2. \quad (13)$$
Moreover, it follows from the update rule in (4) that $\eta_t\nabla f(\hat{x}_t) = x_t - x_{t+1} + 2\eta_t\mu(\hat{x}_t - x_{t+1})$. This implies that, for any $x \in \mathbb{R}^d$,
$$\langle \eta_t\nabla f(\hat{x}_t), x_{t+1} - x\rangle = \langle x_t - x_{t+1}, x_{t+1} - x\rangle + 2\eta_t\mu\langle \hat{x}_t - x_{t+1}, x_{t+1} - x\rangle \quad (14)$$
$$= \frac{1}{2}\|x_t - x\|^2 - \frac{1}{2}\|x_t - x_{t+1}\|^2 - \frac{1}{2}\|x_{t+1} - x\|^2 + \eta_t\mu\|\hat{x}_t - x\|^2 - \eta_t\mu\|\hat{x}_t - x_{t+1}\|^2 - \eta_t\mu\|x_{t+1} - x\|^2, \quad (15)$$
where we applied the three-point equality twice in the last equality. Thus, by combining (13) with $x = x_{t+1}$ and (15) with $x = x^*$, we get
$$\langle \eta_t\nabla f(\hat{x}_t), \hat{x}_t - x^*\rangle = \langle \eta_t\nabla f(\hat{x}_t), x_{t+1} - x^*\rangle + \langle \eta_t\nabla f(\hat{x}_t), \hat{x}_t - x_{t+1}\rangle \le \frac{1}{2}\|x_t - x^*\|^2 - \frac{1+2\eta_t\mu}{2}\|x_{t+1} - x^*\|^2 + \eta_t\mu\|\hat{x}_t - x^*\|^2 - \frac{1-\alpha^2}{2}\|x_t - \hat{x}_t\|^2, \quad (16)$$
where the terms $\frac{1}{2}\|x_t - x_{t+1}\|^2$ and $\eta_t\mu\|\hat{x}_t - x_{t+1}\|^2$ appear with opposite signs and cancel. Since $\nabla f(x^*) = 0$ and $f$ is $\mu$-strongly convex, we further have
$$\langle \nabla f(\hat{x}_t), \hat{x}_t - x^*\rangle = \langle \nabla f(\hat{x}_t) - \nabla f(x^*), \hat{x}_t - x^*\rangle \ge \mu\|\hat{x}_t - x^*\|^2. \quad (17)$$
Combining (16) and (17) and rearranging the terms, we obtain that
$$\frac{1+2\eta_t\mu}{2}\|x_{t+1} - x^*\|^2 \le \frac{1}{2}\|x_t - x^*\|^2 - \frac{1-\alpha^2}{2}\|x_t - \hat{x}_t\|^2. \quad (18)$$
Since $\alpha < 1$, the last term in (18) is non-positive and we immediately obtain $\|x_{t+1} - x^*\|^2 \le \|x_t - x^*\|^2(1+2\eta_t\mu)^{-1}$. Moreover, since $\frac{1+2\eta_t\mu}{2}\|x_{t+1} - x^*\|^2 \ge 0$, it follows that $\frac{1-\alpha^2}{2}\|x_t - \hat{x}_t\|^2 \le \frac{1}{2}\|x_t - x^*\|^2$, which leads to $\|x_t - \hat{x}_t\| \le \frac{1}{\sqrt{1-\alpha^2}}\|x_t - x^*\|$. The proof is complete.

A.2 Proof of Lemma 1

Recall that in our backtracking line search scheme in Algorithm 1, the step size $\eta_t$ starts from $\sigma_t$ and keeps backtracking until the condition in (3) is satisfied. Hence, by the definition of $\mathcal{B}$, it immediately follows that $\eta_t = \sigma_t$ if $t \notin \mathcal{B}$. Moreover, if $t \in \mathcal{B}$, then the step size $\tilde{\eta}_t = \eta_t/\beta$ and the corresponding iterate $\tilde{x}_t$ must have failed the condition in (3) (otherwise, our line search scheme would have accepted the step size $\tilde{\eta}_t$). This implies that
$$\|\tilde{x}_t - x_t + \tilde{\eta}_t\nabla f(\tilde{x}_t)\| > \alpha\sqrt{1+2\tilde{\eta}_t\mu}\,\|\tilde{x}_t - x_t\|. \quad (19)$$
Since $\tilde{x}_t = x_t - \tilde{\eta}_t(I + \tilde{\eta}_t H_t)^{-1}\nabla f(x_t)$, we have
$$\tilde{x}_t - x_t + \tilde{\eta}_t H_t(\tilde{x}_t - x_t) = -\tilde{\eta}_t\nabla f(x_t) \quad \Longleftrightarrow \quad \tilde{x}_t - x_t = -\tilde{\eta}_t\left(\nabla f(x_t) + H_t(\tilde{x}_t - x_t)\right),$$
and hence the left-hand side of (19) equals $\tilde{\eta}_t\|\nabla f(\tilde{x}_t) - \nabla f(x_t) - H_t(\tilde{x}_t - x_t)\|$. Thus, we obtain from (19) that
$$\tilde{\eta}_t > \frac{\alpha\sqrt{1+2\tilde{\eta}_t\mu}\,\|\tilde{x}_t - x_t\|}{\|\nabla f(\tilde{x}_t) - \nabla f(x_t) - H_t(\tilde{x}_t - x_t)\|}. \quad (20)$$
By substituting $\tilde{\eta}_t = \eta_t/\beta$, we further have
$$\frac{\eta_t}{\beta} > \frac{\alpha\sqrt{1+2\eta_t\mu/\beta}\,\|\tilde{x}_t - x_t\|}{\|\nabla f(\tilde{x}_t) - \nabla f(x_t) - H_t(\tilde{x}_t - x_t)\|}.$$
Using the fact that $\sqrt{1+2\eta_t\mu/\beta} \ge 1$, we get
$$\eta_t > \frac{\alpha\beta\,\|\tilde{x}_t - x_t\|}{\|\nabla f(\tilde{x}_t) - \nabla f(x_t) - H_t(\tilde{x}_t - x_t)\|}. \quad (21)$$
Moreover, using the fact that $\sqrt{1+2\eta_t\mu/\beta} \ge \sqrt{2\eta_t\mu/\beta}$, we can also conclude that
$$\frac{\eta_t}{\beta} > \frac{\alpha\sqrt{2\eta_t\mu/\beta}\,\|\tilde{x}_t - x_t\|}{\|\nabla f(\tilde{x}_t) - \nabla f(x_t) - H_t(\tilde{x}_t - x_t)\|} \quad \Longrightarrow \quad \eta_t > \frac{2\alpha^2\beta\mu\,\|\tilde{x}_t - x_t\|^2}{\|\nabla f(\tilde{x}_t) - \nabla f(x_t) - H_t(\tilde{x}_t - x_t)\|^2}. \quad (22)$$
By combining (21) and (22), we obtain the lower bound in Lemma 1. Finally, since $H_t \succeq 0$ and $\tilde{\eta}_t \ge \eta_t$, it holds that $I + \tilde{\eta}_t H_t \succeq I + \eta_t H_t \succ 0$ and thus $(I + \eta_t H_t)^{-1} \succeq (I + \tilde{\eta}_t H_t)^{-1} \succ 0$. This further implies that $\|(I + \tilde{\eta}_t H_t)^{-1}\nabla f(x_t)\| \le \|(I + \eta_t H_t)^{-1}\nabla f(x_t)\|$. Hence, we can conclude that
$$\|\tilde{x}_t - x_t\| = \tilde{\eta}_t\|(I + \tilde{\eta}_t H_t)^{-1}\nabla f(x_t)\| \le \frac{\eta_t}{\beta}\|(I + \eta_t H_t)^{-1}\nabla f(x_t)\| = \frac{1}{\beta}\|\hat{x}_t - x_t\|.$$
This completes the proof.

A.3 Extension to inexact linear solving

In this section, we extend our convergence results to the case where the linear system in (6) is solved inexactly, i.e., we find $\hat{x}_t$ such that
$$\|(I + \eta_t H_t)(\hat{x}_t - x_t) + \eta_t\nabla f(x_t)\| \le \frac{\alpha}{2}\|\hat{x}_t - x_t\|. \quad (23)$$
In this case, since the proof of Proposition 2 does not rely on the update rule in (6), Proposition 2 continues to hold. However, we need to modify the proof of Lemma 1 and replace it by the following lemma.
We note that the two results differ only by an absolute constant.

Lemma 5 (Extension of Lemma 1). For $t \notin \mathcal{B}$, we have $\eta_t = \sigma_t$. For $t \in \mathcal{B}$, let $\tilde{\eta}_t = \eta_t/\beta$ and let $\tilde{x}_t$ be the corresponding iterate rejected by our line search scheme. Then, $\|\tilde{x}_t - x_t\| \le \frac{3}{\beta}\|\hat{x}_t - x_t\|$. Moreover,
$$\eta_t \ge \max\left\{\frac{\alpha\beta\,\|\tilde{x}_t - x_t\|}{2\|g(\tilde{x}_t) - g(x_t) - H_t(\tilde{x}_t - x_t)\|},\ \frac{\alpha^2\beta\mu\,\|\tilde{x}_t - x_t\|^2}{2\|g(\tilde{x}_t) - g(x_t) - H_t(\tilde{x}_t - x_t)\|^2}\right\}.$$

Proof. We follow a similar argument as in the proof of Lemma 1. If $t \notin \mathcal{B}$, it immediately follows that $\eta_t = \sigma_t$. Further, if $t \in \mathcal{B}$, then $\tilde{\eta}_t$ and the corresponding iterate $\tilde{x}_t$ must fail to satisfy the condition in (3). This implies that
$$\|\tilde{x}_t - x_t + \tilde{\eta}_t\nabla f(\tilde{x}_t)\| > \alpha\sqrt{1+2\tilde{\eta}_t\mu}\,\|\tilde{x}_t - x_t\|. \quad (24)$$
Moreover, by our inexactness condition in (23), $\tilde{x}_t$ satisfies $\|(I + \tilde{\eta}_t H_t)(\tilde{x}_t - x_t) + \tilde{\eta}_t\nabla f(x_t)\| \le \frac{\alpha}{2}\|\tilde{x}_t - x_t\|$. Hence, by using the triangle inequality, we have
$$\tilde{\eta}_t\|\nabla f(\tilde{x}_t) - \nabla f(x_t) - H_t(\tilde{x}_t - x_t)\| = \left\|\left(\tilde{x}_t - x_t + \tilde{\eta}_t\nabla f(\tilde{x}_t)\right) - \left((I + \tilde{\eta}_t H_t)(\tilde{x}_t - x_t) + \tilde{\eta}_t\nabla f(x_t)\right)\right\|$$
$$\ge \|\tilde{x}_t - x_t + \tilde{\eta}_t\nabla f(\tilde{x}_t)\| - \|(I + \tilde{\eta}_t H_t)(\tilde{x}_t - x_t) + \tilde{\eta}_t\nabla f(x_t)\| \ge \frac{\alpha}{2}\sqrt{1+2\tilde{\eta}_t\mu}\,\|\tilde{x}_t - x_t\|,$$
where the last step uses (24) and $\sqrt{1+2\tilde{\eta}_t\mu} \ge 1$. Thus, we obtain that
$$\tilde{\eta}_t > \frac{\alpha\sqrt{1+2\tilde{\eta}_t\mu}\,\|\tilde{x}_t - x_t\|}{2\|\nabla f(\tilde{x}_t) - \nabla f(x_t) - H_t(\tilde{x}_t - x_t)\|}.$$
Comparing this with (20), we observe that the two bounds differ only by a constant factor of 2. Hence, the rest of the proof of the lower bound follows as in Lemma 1.

Next, we prove the inequality $\|\tilde{x}_t - x_t\| \le \frac{3}{\beta}\|\hat{x}_t - x_t\|$. Define $\hat{x}^*_t = x_t - \eta_t(I + \eta_t H_t)^{-1}\nabla f(x_t)$ and $\tilde{x}^*_t = x_t - \tilde{\eta}_t(I + \tilde{\eta}_t H_t)^{-1}\nabla f(x_t)$, i.e., the exact solutions of the corresponding linear systems. By the argument in the proof of Lemma 1, we have $\|\tilde{x}^*_t - x_t\| \le \frac{1}{\beta}\|\hat{x}^*_t - x_t\|$. In the following, we first prove that
$$\frac{1}{2}\|\hat{x}_t - x_t\| \le \|\hat{x}^*_t - x_t\| \le \frac{3}{2}\|\hat{x}_t - x_t\| \quad \text{and} \quad \frac{1}{2}\|\tilde{x}_t - x_t\| \le \|\tilde{x}^*_t - x_t\| \le \frac{3}{2}\|\tilde{x}_t - x_t\|. \quad (25)$$
It suffices to prove the first set of inequalities, since the second follows similarly. To see this, note that the condition in (23) can be rewritten as $\|(I + \eta_t H_t)(\hat{x}_t - \hat{x}^*_t)\| \le \frac{\alpha}{2}\|\hat{x}_t - x_t\|$. Since $H_t \succeq 0$, this further implies that $\|\hat{x}_t - \hat{x}^*_t\| \le \|(I + \eta_t H_t)(\hat{x}_t - \hat{x}^*_t)\| \le \frac{\alpha}{2}\|\hat{x}_t - x_t\|$. Hence, by the triangle inequality, we obtain that
$$\|\hat{x}^*_t - x_t\| \le \|\hat{x}^*_t - \hat{x}_t\| + \|\hat{x}_t - x_t\| \le \left(1 + \frac{\alpha}{2}\right)\|\hat{x}_t - x_t\|, \qquad \|\hat{x}^*_t - x_t\| \ge \|\hat{x}_t - x_t\| - \|\hat{x}^*_t - \hat{x}_t\| \ge \left(1 - \frac{\alpha}{2}\right)\|\hat{x}_t - x_t\|,$$
which, together with $\alpha < 1$, leads to (25). Finally, we conclude that
$$\|\tilde{x}_t - x_t\| \le 2\|\tilde{x}^*_t - x_t\| \le \frac{2}{\beta}\|\hat{x}^*_t - x_t\| \le \frac{3}{\beta}\|\hat{x}_t - x_t\|.$$
This completes the proof.

A.4 The total complexity of line search

Let $l_t$ denote the number of line search steps in iteration $t$. We first note that $\eta_t = \sigma_t\beta^{l_t-1}$ by our line search subroutine, which implies $l_t = \log_{1/\beta}(\sigma_t/\eta_t) + 1$. Moreover, recall that $\sigma_t = \eta_{t-1}/\beta$ for $t \ge 1$. Hence, the total number of line search steps after $t$ iterations can be bounded by (cf. [22, Lemma 22]):
$$\sum_{s=0}^{t-1} l_s = \sum_{s=0}^{t-1}\left(\log_{1/\beta}\frac{\sigma_s}{\eta_s} + 1\right) = 2t - 1 + \log_{1/\beta}\frac{\sigma_0}{\eta_{t-1}}.$$
Moreover, it can be shown that $\eta_{t-1} \ge \alpha\beta/(3M_1)$ when $t = \Omega(\Upsilon^2/\kappa^2)$, in both the uniform averaging and the non-uniform averaging cases (cf. Corollaries 2 and 3). This implies that the total number of line search steps can be bounded by $2t - 1 + \log_{1/\beta}(3M_1\sigma_0/(\alpha\beta))$.
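The telescoping identity above can be checked numerically; in the sketch below the per-iteration backtrack counts are synthetic, for illustration only, and the final count matches the bound exactly:

```python
import numpy as np

# Empirical check of the line-search count from Section A.4: with the
# warm start sigma_t = eta_{t-1}/beta, the total number of BLS steps is
#   sum_t l_t = 2T - 1 + log_{1/beta}(sigma_0 / eta_{T-1}).
rng = np.random.default_rng(4)
beta, sigma0, T = 0.5, 1.0, 100

total, sigma, etas = 0, sigma0, []
for t in range(T):
    k = int(rng.integers(0, 4))     # synthetic number of backtracks
    eta = sigma * beta**k           # accepted step size
    total += k + 1                  # l_t = log_{1/beta}(sigma_t/eta_t) + 1
    sigma = eta / beta              # warm start for the next iteration
    etas.append(eta)

bound = 2 * T - 1 + np.log(sigma0 / etas[-1]) / np.log(1.0 / beta)
print(total, bound)                 # the two numbers coincide
```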
A.5 Proof of Lemma 2

Recall that we use $g(x)$ and $H(x)$ to denote the gradient $\nabla f(x)$ and the Hessian $\nabla^2 f(x)$, respectively. We first consider the first inequality in (9). By the fundamental theorem of calculus, we can write
$$\nabla f(\tilde{x}_t) - \nabla f(x_t) = \int_0^1 \nabla^2 f(x_t + \tau(\tilde{x}_t - x_t))(\tilde{x}_t - x_t)\,d\tau.$$
Therefore, we can further use the triangle inequality to get
$$\|\nabla f(\tilde{x}_t) - \nabla f(x_t) - \nabla^2 f(x_t)(\tilde{x}_t - x_t)\| = \left\|\int_0^1 \left(\nabla^2 f(x_t + \tau(\tilde{x}_t - x_t)) - \nabla^2 f(x_t)\right)(\tilde{x}_t - x_t)\,d\tau\right\|$$
$$\le \int_0^1 \left\|\left(\nabla^2 f(x_t + \tau(\tilde{x}_t - x_t)) - \nabla^2 f(x_t)\right)(\tilde{x}_t - x_t)\right\|\,d\tau \le \int_0^1 \left\|\nabla^2 f(x_t + \tau(\tilde{x}_t - x_t)) - \nabla^2 f(x_t)\right\|\,\|\tilde{x}_t - x_t\|\,d\tau. \quad (26)$$
Moreover, we have $\|\nabla^2 f(x_t + \tau(\tilde{x}_t - x_t)) - \nabla^2 f(x_t)\| \le M_1$ for any $\tau \in [0,1]$ by Assumption 2. Together with (26), this further implies that
$$\frac{\|\nabla f(\tilde{x}_t) - \nabla f(x_t) - \nabla^2 f(x_t)(\tilde{x}_t - x_t)\|}{\|\tilde{x}_t - x_t\|} \le M_1,$$
which proves the first bound in (9). Next, we consider the second bound in (9). By Assumption 3, it follows from standard arguments that $\|\nabla f(\tilde{x}_t) - \nabla f(x_t) - \nabla^2 f(x_t)(\tilde{x}_t - x_t)\| \le \frac{L_2}{2}\|\tilde{x}_t - x_t\|^2$ (e.g., see [25, Lemma 1.2.4]). This leads to
$$\frac{\|\nabla f(\tilde{x}_t) - \nabla f(x_t) - \nabla^2 f(x_t)(\tilde{x}_t - x_t)\|}{\|\tilde{x}_t - x_t\|} \le \frac{L_2}{2}\|\tilde{x}_t - x_t\| \le \frac{L_2}{2\beta}\|\hat{x}_t - x_t\| \le \frac{L_2}{2\beta\sqrt{1-\alpha^2}}\|x_t - x^*\|,$$
where we used $\|\tilde{x}_t - x_t\| \le \frac{1}{\beta}\|\hat{x}_t - x_t\|$ from Lemma 1 and $\|\hat{x}_t - x_t\| \le \frac{1}{\sqrt{1-\alpha^2}}\|x_t - x^*\|$ from Proposition 2. This completes the proof.

B Missing proofs in Section 4

B.1 Proof of Lemma 4

Recall that in the case of uniform averaging, we have $\bar{H}_t = \frac{1}{t+1}\sum_{i=0}^{t}H(x_i)$. Hence, it follows from Jensen's inequality that $\|\bar{H}_t - H(x_t)\| \le \frac{1}{t+1}\sum_{i=0}^{t}\|H(x_i) - H(x_t)\|$. To prove the second claim, note that Assumption 2 directly implies $\|H(x_i) - H(x_t)\| \le M_1$. Moreover, we can use Assumption 3 and the triangle inequality to bound $\|H(x_t) - H(x_i)\| \le L_2\|x_t - x_i\| \le L_2(\|x_t - x^*\| + \|x_i - x^*\|)$. Since $i \le t$ and $\|x_t - x^*\|$ is non-increasing in $t$ by Proposition 1, we further have $\|x_t - x^*\| \le \|x_i - x^*\|$, which proves $\|H(x_t) - H(x_i)\| \le 2L_2\|x_i - x^*\|$.

B.2 Proof of Theorem 1

Warm-up phase. To determine the transition point $T_1$, recall that both the linear approximation error and the bias term can be bounded by $M_1$, according to Lemmas 2 and 4. Since Lemma 3 shows that $\|\bar{E}_t\| = O(\Upsilon_E/\sqrt{t})$, we will have $\|\bar{E}_t\| \le M_1$ when $t = \Omega(\Upsilon_E^2/M_1^2) = \Omega(\Upsilon^2/\kappa^2)$. More specifically, the transition point is given by
$$T_1 = \max\left\{\frac{256\,\Upsilon^2}{\kappa^2}\log\frac{d}{\kappa\delta},\ 4\log\frac{d}{\delta},\ \log_{1/\beta}\frac{\alpha\beta}{3M_1\sigma_0}\right\}, \quad (27)$$
where $\delta \in (0,1)$ satisfies $d/\delta \ge e$, $\alpha, \beta \in (0,1)$ are the line-search parameters, and $\sigma_0$ is the initial step size.

Linear convergence phase. In the following lemma, we prove the linear convergence of Algorithm 1 with uniform averaging.

Lemma 6. Assume that $\beta \le 1/2$ and recall the definition of $T_1$ in (27). For any $t \ge T_1$, we have
$$\|x_{t+1} - x^*\|^2 \le \|x_t - x^*\|^2\left(1 + \frac{2\alpha\beta}{3\kappa}\right)^{-1},$$
where $\kappa \triangleq M_1/\mu$ is the condition number.

Proof. See Appendix B.3.

Now we discuss the transition point $T_2$. At a high level, the algorithm transitions to the superlinear phase once all three errors discussed in Section 3.1 are reduced from $O(M_1)$ to $O(\mu)$. For this to happen, the iterate $x_t$ first needs to reach a local neighborhood satisfying $\|x_t - x^*\| = O(\mu/L_2)$. As a corollary of Lemma 6, this holds after at most an additional $O(\kappa)$ iterations. Specifically, let $\nu \in (0,1)$ be a parameter and define
$$I = T_1 + 2\left(1 + \frac{3\kappa}{2\alpha\beta}\right)\log\frac{L_2 D}{\nu\mu}, \quad (28)$$
where $D = \|x_0 - x^*\|$ is the initial distance to the optimal solution. Then, Lemma 6 implies that $\|x_t - x^*\| \le \nu\mu/L_2$ for all $t \ge I$. Moreover, Lemma 3 implies that the averaged Hessian noise satisfies $\|\bar{E}_t\| = O(\mu)$ when $t = \Omega(\Upsilon_E^2/\mu^2) = \Omega(\Upsilon^2)$. Finally, regarding the bias term, following the discussion after Lemma 4, it can be shown that $\|\bar{H}_t - H(x_t)\| = O(\frac{M_1 I}{t+1})$. Thus, $\|\bar{H}_t - H(x_t)\| = O(\mu)$ when $t = \Omega(\kappa I)$. Combining all the pieces, we formally define the second transition point by
$$T_2 = \max\left\{\frac{256\,\Upsilon^2}{\nu^2}\log\frac{d}{\delta\nu},\ \frac{\kappa I}{\nu}\right\}. \quad (29)$$
Since $I = O(T_1 + \kappa) = O(\Upsilon^2/\kappa^2 + \kappa)$, we note that $T_2 = O(\max\{\Upsilon^2,\ \Upsilon^2/\kappa + \kappa^2\}) = O(\Upsilon^2 + \kappa^2)$.

Superlinear phase. In the following theorem, we show that after $T_2$ iterations, Algorithm 1 with uniform averaging converges at a superlinear rate. See Appendix B.4 for the proof.

Theorem 3. Let $\nu \in (0,1)$ be a parameter satisfying
$$\left(\frac{5}{\sqrt{2}\,\alpha\beta(1-\alpha^2)} + \frac{25}{\alpha\sqrt{\beta}}\right)\nu \le 1,$$
and recall the definition of $T_2$ in (29). Then, for any $t \ge T_2$,
$$\|x_{t+1} - x^*\| \le \left(\frac{1}{\sqrt{2\beta}}\,\frac{1}{1-\alpha^2} + 5\right)\rho_t\,\|x_t - x^*\|, \quad \text{where} \quad \rho_t = \frac{4\Upsilon}{\alpha\sqrt{\beta}}\sqrt{\frac{\log(d(t+1)/\delta)}{t+1}} + \frac{3\kappa I}{2\alpha\sqrt{\beta}\,(t+1)}. \quad (30)$$

In Theorem 3, we observe that the rate $\rho_t$ in (30) goes to zero as the number of iterations $t$ increases, which implies that the iterates converge to $x^*$ superlinearly. Moreover, $\rho_t$ consists of two terms. The first term comes from the averaged noise $\|\bar{E}_t\|$, which decays at the rate of $O(\Upsilon/\sqrt{t})$. The second term is due to the bias of our averaged Hessian estimate, which decays at the rate of $O(\kappa I/t)$. Hence, when $t$ is sufficiently large, the averaged noise dominates and the superlinear rate settles at the slower rate of $O(\Upsilon/\sqrt{t})$.
Specifically, the algorithm transitions from the initial superlinear rate to the final superlinear rate when the two terms in (30) are balanced. Hence, we define the third transition point $T_3$ as the root of
$$64\,\Upsilon^2\,(T_3+1)\log(d(T_3+1)/\delta) = 9\kappa^2 I^2.$$
Since $I = O(\Upsilon^2/\kappa^2 + \kappa)$, we have $T_3 = O\left(\frac{(\Upsilon^2/\kappa + \kappa^2)^2}{\Upsilon^2}\right)$. We summarize this discussion in the following corollary.

Corollary 1. For $T_2 \le t \le T_3 - 1$, we have $\|x_{t+1}-x^*\| \le \left(\frac{1}{\sqrt{2\beta}}\frac{1}{1-\alpha^2}+5\right)\rho^{(1)}_t\|x_t-x^*\|$, where $\rho^{(1)}_t = \frac{6\kappa I}{2\alpha\sqrt{\beta}(t+1)}$. Moreover, for $t \ge T_3$, we have $\|x_{t+1}-x^*\| \le \left(\frac{1}{\sqrt{2\beta}}\frac{1}{1-\alpha^2}+5\right)\rho^{(2)}_t\|x_t-x^*\|$, where $\rho^{(2)}_t = \frac{8\Upsilon}{\alpha\sqrt{\beta}}\sqrt{\frac{\log(d(t+1)/\delta)}{t+1}}$.

B.3 Proof of Lemma 6

We divide the proof of Lemma 6 into three steps. First, in Lemma 7, we provide a lower bound on the step size $\eta_t$ in those iterations where our line search scheme backtracks, i.e., $t \in \mathcal{B}$. Building on Lemma 7, we use induction in Lemma 8 to prove a lower bound for all $t \ge 0$. This allows us to establish $\eta_t = \Omega(1/M_1)$ for all $t \ge T_1$ in Corollary 2, from which Lemma 6 immediately follows. To simplify the notation, we define the function
$$\phi(t) \triangleq 8\Upsilon_E\sqrt{\frac{\log(d(t+1)/\delta)}{t+1}}. \quad (31)$$
Then Lemma 3 can be equivalently written as $\|\bar{E}_t\| \le \phi(t)$ for all $t \ge 4\log(d/\delta)$. We are now ready to state our first lemma.

Lemma 7. If $t \in \mathcal{B}$, then we have $\eta_t \ge \alpha\beta/(2M_1 + \phi(t))$.

Proof. If $t \in \mathcal{B}$, by Lemma 1 we can lower bound the step size $\eta_t$ by
$$\eta_t \ge \frac{\alpha\beta\,\|\tilde{x}_t - x_t\|}{\|g(\tilde{x}_t) - g(x_t) - H_t(\tilde{x}_t - x_t)\|} = \frac{\alpha\beta}{\mathcal{E}_t}, \quad (32)$$
where $\mathcal{E}_t \triangleq \frac{\|g(\tilde{x}_t) - g(x_t) - H_t(\tilde{x}_t - x_t)\|}{\|\tilde{x}_t - x_t\|}$ is the normalized approximation error. Moreover, as outlined in Section 3.1, we can apply the triangle inequality to upper bound $\mathcal{E}_t$. Specifically, we have
$$\mathcal{E}_t \le \frac{\|g(\tilde{x}_t) - g(x_t) - H(x_t)(\tilde{x}_t - x_t)\|}{\|\tilde{x}_t - x_t\|} + \|\bar{H}_t - H(x_t)\| + \|\bar{E}_t\|. \quad (33)$$
By (9) in Lemma 2, the first term in (33) is upper bounded by $M_1$. Moreover, it also follows from Lemma 4 that $\|\bar{H}_t - H(x_t)\| \le \frac{1}{t+1}\sum_{i=0}^{t}\|H(x_t) - H(x_i)\| \le M_1$. Hence, we further obtain $\mathcal{E}_t \le 2M_1 + \|\bar{E}_t\| \le 2M_1 + \phi(t)$. Combining this with (32), we obtain the desired result.

Lemma 7 provides a lower bound on the step size $\eta_t$, but only for the case where $t \in \mathcal{B}$. In the next lemma, we use induction to show a lower bound on the step sizes in all iterations.

Lemma 8. Assume that $\beta \le \frac{1}{2}$. For any $t \ge 0$, we have
$$\eta_t \ge \min\left\{\frac{\alpha\beta}{2M_1 + \phi(t)},\ \frac{\sigma_0}{\beta^t}\right\}. \quad (34)$$

Proof. We prove this lemma by induction. For the base case $t = 0$, we consider two subcases. If $0 \in \mathcal{B}$, then by Lemma 7 we obtain $\eta_0 \ge \frac{\alpha\beta}{2M_1+\phi(0)}$. Otherwise, if $0 \notin \mathcal{B}$, we have $\eta_0 = \sigma_0$. In both cases, we observe that (34) is satisfied for the base case $t = 0$. Now assume that (34) is satisfied for $t = s$, where $s \ge 0$. For $t = s+1$, we again distinguish two subcases. If $s+1 \in \mathcal{B}$, then by Lemma 7 we obtain $\eta_{s+1} \ge \frac{\alpha\beta}{2M_1+\phi(s+1)}$, which implies that (34) is satisfied. Otherwise, if $s+1 \notin \mathcal{B}$, then we have
$$\eta_{s+1} = \sigma_{s+1} = \frac{\eta_s}{\beta} \ge \frac{1}{\beta}\min\left\{\frac{\alpha\beta}{2M_1 + \phi(s)},\ \frac{\sigma_0}{\beta^s}\right\} = \min\left\{\frac{\alpha}{2M_1 + \phi(s)},\ \frac{\sigma_0}{\beta^{s+1}}\right\}, \quad (35)$$
where we used the induction hypothesis in the last inequality. Furthermore, note that $\phi(s)/\phi(s+1) \le \sqrt{\frac{s+2}{s+1}} \le \sqrt{2} \le \frac{1}{\beta}$, which implies that $\phi(s) \le \phi(s+1)/\beta$. Hence, we have
$$\frac{\alpha}{2M_1 + \phi(s)} = \frac{\alpha\beta}{2\beta M_1 + \beta\phi(s)} \ge \frac{\alpha\beta}{2M_1 + \phi(s+1)}.$$
Therefore, (35) implies that $\eta_{s+1} \ge \min\left\{\frac{\alpha\beta}{2M_1+\phi(s+1)},\ \frac{\sigma_0}{\beta^{s+1}}\right\}$, and thus (34) also holds in this subcase. This completes the induction, and we conclude that (34) holds for all $t \ge 0$.

As a corollary of Lemma 8, we obtain the following lower bound on $\eta_t$ for $t \ge T_1$.

Corollary 2. Recall the definition of $T_1$ in (27). For any $t \ge T_1$, we have $\eta_t \ge \alpha\beta/(3M_1)$.

Proof. As shown in [1, Lemma 2], we have $\phi(t) \le M_1$ when $t \ge \max\left\{\frac{256\,\Upsilon^2}{\kappa^2}\log\frac{d}{\kappa\delta},\ 4\log\frac{d}{\delta}\right\}$. Moreover, we have $\frac{\sigma_0}{\beta^t} \ge \frac{\alpha\beta}{3M_1}$ when $t \ge \log_{1/\beta}\frac{\alpha\beta}{3M_1\sigma_0}$. Hence, by Lemma 8 we conclude that $\eta_t \ge \frac{\alpha\beta}{3M_1}$ when $t \ge T_1$.

Now we are ready to prove Lemma 6.

Proof of Lemma 6. By Proposition 1, we have $\|x_{t+1}-x^*\|^2 \le \|x_t-x^*\|^2(1+2\eta_t\mu)^{-1}$.
Now we are ready to prove Lemma 6.

Proof of Lemma 6. By Proposition 1, we have $\|x_{t+1} - x^*\|^2 \le \|x_t - x^*\|^2(1 + 2\eta_t\mu)^{-1}$. By using Corollary 2, we obtain that
$$\|x_{t+1} - x^*\|^2 \;\le\; \|x_t - x^*\|^2\left(1 + \frac{2\alpha\beta\mu}{3M_1}\right)^{-1} = \|x_t - x^*\|^2\left(1 + \frac{2\alpha\beta}{3\kappa}\right)^{-1}.$$

B.4 Proof of Theorem 3

In this section, we present the proof of Theorem 3. The proof consists of four steps. To begin with, in Lemma 9 we show that the iterate $x_t$ stays in a local neighborhood of $x^*$ when $t \ge \mathcal{I}$, where $\mathcal{I}$ is defined in (28). Next, we use this result in Lemma 10 to upper bound $\frac{1}{\sqrt{2\mu\eta_t}}$ in those iterations where our line-search scheme backtracks the step size, i.e., $t \in \mathcal{B}$. Then we use induction in Lemma 11 to prove an upper bound for all $t \ge \mathcal{I}$. Furthermore, we again use induction in Lemma 12 to establish an improved upper bound on $\frac{1}{\sqrt{2\mu\eta_t}}$ when $t \ge T_2$, where $T_2$ is defined in (29). After proving Lemma 12, Theorem 3 then follows from Proposition 1.

Lemma 9. We have $\|x_t - x^*\| \le \nu\mu/L_2$ for all $t \ge \mathcal{I}$, where $\mathcal{I}$ is given in (28).

Proof. By applying Lemma 6, we have $\|x_{T_1+u} - x^*\|^2 \le \|x_{T_1} - x^*\|^2\big(1 + \frac{2\alpha\beta}{3\kappa}\big)^{-u}$. Moreover, since $\|x_t - x^*\|$ is non-increasing in $t$, we have $\|x_{T_1} - x^*\| \le \|x_0 - x^*\| = D$. Thus, we have $\|x_{T_1+u} - x^*\| \le \frac{\nu\mu}{L_2}$ whenever $D^2\big(1 + \frac{2\alpha\beta}{3\kappa}\big)^{-u} \le \frac{\nu^2\mu^2}{L_2^2}$, which holds for all $u \ge 2\big(1 + \frac{3\kappa}{2\alpha\beta}\big)\log\frac{L_2 D}{\nu\mu}$. This completes the proof.

Note that by Proposition 1, we have $\|x_{t+1} - x^*\|^2 \le \|x_t - x^*\|^2(1 + 2\eta_t\mu)^{-1} \le \|x_t - x^*\|^2/(2\eta_t\mu)$, which further implies that
$$\|x_{t+1} - x^*\| \;\le\; \frac{\|x_t - x^*\|}{\sqrt{2\eta_t\mu}}. \qquad (36)$$
Hence, to characterize the convergence rate of our method, it suffices to upper bound the quantity $1/\sqrt{2\eta_t\mu}$. We achieve this goal in the subsequent lemmas.

Lemma 10. If $t \in \mathcal{B}$ and $t \ge \mathcal{I}$, then
$$\frac{1}{\sqrt{2\eta_t\mu}} \;\le\; \frac{L_2\|x_t - x^*\|}{(1-\alpha^2)\beta\mu} + \frac{\|\bar{E}_t\|}{2\alpha\sqrt{\beta}\,\mu} + \frac{3\kappa\mathcal{I}}{2\alpha\sqrt{\beta}\,(t+1)}. \qquad (37)$$
Moreover, it also holds that
$$\frac{1}{\sqrt{2\eta_t\mu}} \;\le\; \frac{\nu}{(1-\alpha^2)\beta} + \frac{\phi(t)}{2\alpha\sqrt{\beta}\,\mu} + \frac{3\kappa\mathcal{I}}{2\alpha\sqrt{\beta}\,(t+1)}. \qquad (38)$$

Proof. By using the second bound in Lemma 1, we obtain that
$$\frac{1}{\sqrt{2\eta_t\mu}} \;\le\; \frac{\|g(\tilde{x}_t) - g(x_t) - \hat{H}_t(\tilde{x}_t - x_t)\|}{2\alpha\sqrt{\beta}\,\mu\,\|\tilde{x}_t - x_t\|} = \frac{\mathcal{E}_t}{2\alpha\sqrt{\beta}\,\mu}. \qquad (39)$$
Furthermore, by combining (33) and (9) in Lemma 2, we further have
$$\mathcal{E}_t \;\le\; \frac{2\alpha\sqrt{\beta}\,L_2\|x_t - x^*\|}{(1-\alpha^2)\beta} + \|H_t - \bar{H}_t\| + \|\bar{E}_t\|. \qquad (40)$$
It remains to bound the bias term $\|H_t - \bar{H}_t\|$. By Lemma 4, we have
$$\|H_t - \bar{H}_t\| \;\le\; \frac{1}{t+1}\sum_{i=0}^{t}\|H_t - H_i\| \;=\; \frac{1}{t+1}\sum_{i=0}^{\mathcal{I}-1}\|H_t - H_i\| + \frac{1}{t+1}\sum_{i=\mathcal{I}}^{t}\|H_t - H_i\|.$$
For $i = 0, 1, \ldots, \mathcal{I}-1$, we use the first upper bound on $\|H_t - H_i\|$ in Lemma 4 to bound $\|H_t - H_i\| \le M_1$, and thus $\sum_{i=0}^{\mathcal{I}-1}\|H_t - H_i\| \le M_1\mathcal{I}$. Moreover, for $i = \mathcal{I}, \mathcal{I}+1, \ldots, t$, we use the second upper bound in Lemma 4 to get $\|H_t - H_i\| \le 2L_2\|x_i - x^*\|$. Moreover, note that $x_i$ converges linearly to $x^*$ when $i \ge \mathcal{I}$ by Lemma 6. Hence, we further have
$$\sum_{i=\mathcal{I}}^{t}\|H_t - H_i\| \;\le\; 2L_2\sum_{i=\mathcal{I}}^{t}\|x_i - x^*\| \;\le\; 2L_2\|x_{\mathcal{I}} - x^*\|\sum_{i=0}^{\infty}(1+\varphi)^{-i/2} \;\le\; 4\nu\mu\left(1 + \frac{1}{\varphi}\right), \qquad (41)$$
where $\varphi = \frac{2\alpha\beta}{3\kappa}$. In the last inequality, we used the fact that $\|x_{\mathcal{I}} - x^*\| \le \nu\mu/L_2$ and
$$\sum_{i=0}^{\infty}(1+\varphi)^{-i/2} = \frac{1}{1 - (1+\varphi)^{-1/2}} = \frac{(1+\varphi)^{1/2}}{(1+\varphi)^{1/2} - 1} = \frac{(1+\varphi)^{1/2}\big((1+\varphi)^{1/2}+1\big)}{\varphi} \;\le\; 2\left(1 + \frac{1}{\varphi}\right).$$
Moreover, since $\mathcal{I} \ge 2\big(1 + \frac{3\kappa}{2\alpha\beta}\big)$ and $\nu \le 1$, from (41) we further have $\frac{1}{t+1}\sum_{i=\mathcal{I}}^{t}\|H_t - H_i\| \le \frac{2\mu\mathcal{I}}{t+1}$. Combining the above inequalities, we arrive at
$$\|H_t - \bar{H}_t\| \;\le\; \frac{M_1\mathcal{I}}{t+1} + \frac{2\mu\mathcal{I}}{t+1}. \qquad (42)$$
Combining (39), (40), and (42) leads to the first result in (37). Finally, (38) follows from the fact that $\|\bar{E}_t\| \le \phi(t)$ and $\|x_t - x^*\| \le \nu\mu/L_2$ for all $t \ge \mathcal{I}$.
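The geometric-series bound used in (41) is easy to verify numerically; the sketch below compares a truncated sum against the closed form and the final upper bound, for a hypothetical choice of $\alpha$, $\beta$, and $\kappa$.

```python
import numpy as np

# Numeric check of the geometric-series bound from the proof of Lemma 10:
#   sum_{i>=0} (1+phi)^{-i/2} = (1+phi)^{1/2}((1+phi)^{1/2}+1)/phi <= 2(1+1/phi),
# with phi = 2*alpha*beta/(3*kappa) for hypothetical alpha, beta, kappa.
alpha, beta, kappa = 0.25, 0.5, 50.0
phi = 2 * alpha * beta / (3 * kappa)

series = sum((1 + phi) ** (-i / 2) for i in range(200_000))  # truncated sum
closed_form = np.sqrt(1 + phi) * (np.sqrt(1 + phi) + 1) / phi
upper_bound = 2 * (1 + 1 / phi)

print(series, closed_form, upper_bound)
assert series <= closed_form + 1e-6 <= upper_bound
```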
Lemma 11. Assume that $\beta \le 1/2$. For any $t \ge \mathcal{I}$, we have
$$\frac{1}{\sqrt{2\eta_t\mu}} \;\le\; \frac{\nu}{(1-\alpha^2)\beta} + \frac{\phi(t)}{2\alpha\sqrt{\beta}\,\mu} + \frac{3\kappa\mathcal{I}}{2\alpha\sqrt{\beta}\,(t+1)} \;=\; \frac{\nu}{(1-\alpha^2)\beta} + \rho_t, \qquad (43)$$
where $\rho_t$ is defined in (30).

Proof. We shall use induction to prove Lemma 11. For $t = \mathcal{I}$, note that by Corollary 2 we have $\eta_{\mathcal{I}} \ge \alpha\beta/(3M_1)$. Thus, this implies that
$$\frac{1}{\sqrt{2\eta_{\mathcal{I}}\mu}} \;\le\; \sqrt{\frac{3\kappa}{2\alpha\beta}} \;\le\; \frac{3\kappa\mathcal{I}}{2\alpha\sqrt{\beta}\,(\mathcal{I}+1)},$$
where we used the fact that $\kappa \ge 1$, $\alpha < 1$, $\beta \le 1/2$, and $\mathcal{I} \ge 4$. This proves the base case $t = \mathcal{I}$. Now assume that (43) holds for $t = s$, where $s \ge \mathcal{I}$. For $t = s+1$, we distinguish two subcases. If $s+1 \in \mathcal{B}$, then by Lemma 10 we obtain that (43) is satisfied for $t = s+1$. Otherwise, if $s+1 \notin \mathcal{B}$, then we have $\eta_{s+1} = \sigma_{s+1} = \eta_s/\beta$. Hence, by using the induction hypothesis, we have
$$\frac{1}{\sqrt{2\eta_{s+1}\mu}} = \frac{\sqrt{\beta}}{\sqrt{2\eta_s\mu}} \;\le\; \frac{\sqrt{\beta}\,\nu}{(1-\alpha^2)\beta} + \frac{\sqrt{\beta}\,\phi(s)}{2\alpha\sqrt{\beta}\,\mu} + \frac{3\sqrt{\beta}\,\kappa\mathcal{I}}{2\alpha\sqrt{\beta}\,(s+1)}.$$
Since $\beta \le 1/2$ and $\mathcal{I} \ge 2$, we have $\frac{s+2}{s+1} \le \frac{\mathcal{I}+2}{\mathcal{I}+1} \le 1.4 \le \frac{1}{\sqrt{\beta}}$ and $\phi(s) \le \phi(s+1)/\sqrt{\beta}$. Thus, we further have
$$\frac{1}{\sqrt{2\eta_{s+1}\mu}} \;\le\; \frac{\nu}{(1-\alpha^2)\beta} + \frac{\phi(s+1)}{2\alpha\sqrt{\beta}\,\mu} + \frac{3\kappa\mathcal{I}}{2\alpha\sqrt{\beta}\,(s+2)}.$$
This shows that (43) also holds in this subcase. This completes the induction and we conclude that (43) holds for all $t \ge \mathcal{I}$.

Before proving Lemma 12, recall the definition of $\phi$ in (31) and define
$$\mathcal{I}' = \sup_t\{t : \phi(t) \ge \nu\mu\} \qquad\text{and}\qquad \bar{T}_2 = \max\left\{\mathcal{I}',\; \frac{\kappa\mathcal{I}}{\nu} - 1\right\}. \qquad (44)$$
Note that we have $\phi(t) \le \nu\mu$ when $t \ge \frac{256\,\Upsilon^2}{\nu^2}\log\frac{d}{\delta\nu}$. Hence, by the definition in (29), it holds that $T_2 \ge \bar{T}_2$.

Lemma 12. Recall the definition of $\rho_t$ in (30). For any $t \ge \bar{T}_2$, we have
$$\frac{L_2\|x_t - x^*\|}{2\alpha\sqrt{\beta}\,\mu} \;\le\; \rho_t \qquad\text{and}\qquad \frac{1}{\sqrt{2\mu\eta_t}} \;\le\; \frac{1}{\sqrt{2\beta}}\left(\frac{1}{1-\alpha^2} + 5\right)\rho_t. \qquad (45)$$

Proof. By Lemma 10, if $t \in \mathcal{B}$, then $\frac{1}{\sqrt{2\mu\eta_t}} \le \frac{L_2\|x_t - x^*\|}{(1-\alpha^2)\beta\mu} + \rho_t$. We shall prove (45) by induction. First consider the base case $t = \bar{T}_2$, where $\bar{T}_2$ is defined in (44). To begin with, we show that
$$\frac{\nu}{2\alpha\sqrt{\beta}} \;\le\; \rho_{\bar{T}_2} \;\le\; \frac{5\nu}{2\alpha\sqrt{\beta}}.$$
Since $\bar{T}_2$ is the maximum of the two quantities in (44), we have either $\bar{T}_2 = \mathcal{I}'$ or $\bar{T}_2 = \frac{\kappa\mathcal{I}}{\nu} - 1$. In the former case, we can lower bound $\rho_{\bar{T}_2} \ge \frac{\phi(\mathcal{I}')}{2\alpha\sqrt{\beta}\,\mu} \ge \frac{\nu}{2\alpha\sqrt{\beta}}$. In the latter case, we can lower bound $\rho_{\bar{T}_2} \ge \frac{3\kappa\mathcal{I}}{2\alpha\sqrt{\beta}\,(\bar{T}_2+1)} = \frac{3\nu}{2\alpha\sqrt{\beta}}$. Combining both cases leads to the lower bound on $\rho_{\bar{T}_2}$. Furthermore, note that both terms in $\rho_t$ are decreasing functions of $t$, and hence
$$\rho_{\bar{T}_2} \;\le\; \frac{\phi(\mathcal{I}')}{2\alpha\sqrt{\beta}\,\mu} + \frac{3\kappa\mathcal{I}}{2\alpha\sqrt{\beta}\,(\kappa\mathcal{I}/\nu)} \;\le\; \frac{2\phi(\mathcal{I}'+1)}{2\alpha\sqrt{\beta}\,\mu} + \frac{3\nu}{2\alpha\sqrt{\beta}} \;\le\; \frac{5\nu}{2\alpha\sqrt{\beta}}.$$
This proves the upper bound on $\rho_{\bar{T}_2}$. Now we return to the base case $t = \bar{T}_2$. Since $\|x_{\bar{T}_2} - x^*\| \le \nu\mu/L_2$ by Lemma 9, we obtain that
$$\frac{L_2\|x_{\bar{T}_2} - x^*\|}{2\alpha\sqrt{\beta}\,\mu} \;\le\; \frac{\nu}{2\alpha\sqrt{\beta}} \;\le\; \rho_{\bar{T}_2}.$$
Moreover, by Lemma 11, we have
$$\frac{1}{\sqrt{2\eta_{\bar{T}_2}\mu}} \;\le\; \frac{\nu}{(1-\alpha^2)\beta} + \rho_{\bar{T}_2} \;\le\; \frac{\nu}{(1-\alpha^2)\beta} + \frac{5\nu}{2\alpha\sqrt{\beta}} \;\le\; \frac{1}{\sqrt{2\beta}}\left(\frac{1}{1-\alpha^2} + 5\right)\rho_{\bar{T}_2}.$$
This shows that (45) holds for $t = \bar{T}_2$. Now assume that (45) holds for $t = s \ge \bar{T}_2$. For $t = s+1$, by using the induction hypothesis and (36), we obtain that
$$\frac{L_2\|x_{s+1} - x^*\|}{2\alpha\sqrt{\beta}\,\mu} \;\le\; \frac{L_2\|x_s - x^*\|}{2\alpha\sqrt{\beta}\,\mu\,\sqrt{2\eta_s\mu}} \;\le\; \frac{1}{\sqrt{2\beta}}\left(\frac{1}{1-\alpha^2} + 5\right)\rho_s^2.$$
Moreover, since $\rho_s/2 \le \rho_{s+1}$, it suffices to show that $\frac{1}{\sqrt{2\beta}}\big(\frac{1}{1-\alpha^2} + 5\big)\rho_s^2 \le \rho_s/2$, which is equivalent to $\frac{2}{\sqrt{2\beta}}\big(\frac{1}{1-\alpha^2} + 5\big)\rho_s \le 1$. Furthermore, since $\rho_s$ is non-increasing, we further have
$$\frac{2}{\sqrt{2\beta}}\left(\frac{1}{1-\alpha^2} + 5\right)\rho_s \;\le\; \frac{2}{\sqrt{2\beta}}\left(\frac{1}{1-\alpha^2} + 5\right)\frac{5\nu}{2\alpha\sqrt{\beta}} \;=\; \left(\frac{1}{1-\alpha^2} + 5\right)\frac{5\nu}{\sqrt{2}\,\alpha\beta} \;\le\; 1,$$
where we used the condition on $\nu$ stated in Theorem 3 in the last inequality. This proves the first inequality in (45). To prove the second inequality in (45) for $t = s+1$, we distinguish two subcases. If $s+1 \in \mathcal{B}$, then by Lemma 10, we have
$$\frac{1}{\sqrt{2\eta_{s+1}\mu}} \;\le\; \frac{L_2\|x_{s+1} - x^*\|}{(1-\alpha^2)\beta\mu} + \frac{\|\bar{E}_{s+1}\|}{2\alpha\sqrt{\beta}\,\mu} + \frac{3\kappa\mathcal{I}}{2\alpha\sqrt{\beta}\,(s+2)} \;\le\; \frac{1}{\sqrt{2\beta}}\,\frac{\rho_{s+1}}{1-\alpha^2} + \rho_{s+1} \;\le\; \frac{1}{\sqrt{2\beta}}\left(\frac{1}{1-\alpha^2} + 5\right)\rho_{s+1}.$$
Otherwise, if $s+1 \notin \mathcal{B}$, then we have $\eta_{s+1} = \eta_s/\beta$ and hence
$$\frac{1}{\sqrt{2\eta_{s+1}\mu}} = \frac{\sqrt{\beta}}{\sqrt{2\eta_s\mu}} \;\le\; \frac{\sqrt{\beta}}{\sqrt{2\beta}}\left(\frac{1}{1-\alpha^2} + 5\right)\rho_s.$$
Since $\bar{T}_2 \ge \mathcal{I} \ge 2$ and $\beta \le 1/2$, we have $\rho_s/\rho_{s+1} \le \frac{\mathcal{I}+2}{\mathcal{I}+1} \le 1.4 \le 1/\sqrt{\beta}$. Thus, we also obtain $\frac{1}{\sqrt{2\mu\eta_{s+1}}} \le \frac{1}{\sqrt{2\beta}}\big(\frac{1}{1-\alpha^2} + 5\big)\rho_{s+1}$. This completes the induction.

Proof of Theorem 3. It immediately follows from (36) and Lemma 12.

C Missing Proofs in Section 5

In this section, we present the formal version of Theorem 2. Our proof largely mirrors the developments in Section 4.

C.1 Approximation error analysis

Averaged stochastic error. Similar to Lemma 3, we can use tools from concentration inequalities to prove the following upper bound on the averaged stochastic error.

Lemma 13 ([1, Lemma 6]). Let $\delta \in (0,1)$ with $d/\delta \ge e$. Then with probability at least $1 - \delta\pi^2/6$, we have, for any $t \ge 0$,
$$\|\bar{E}_t\| \;\le\; 8\Psi\Upsilon_E\max\left\{\sqrt{\frac{w'(t)}{w(t)}\log\frac{d(t+1)}{\delta}},\; \frac{w'(t)}{w(t)}\log\frac{d(t+1)}{\delta}\right\}. \qquad (46)$$

C.2 Convergence analysis

Warm-up phase. Similar to the case of uniform averaging, during this phase we can only ensure that the distance to $x^*$ is monotonically non-increasing, by Proposition 1. Moreover, Algorithm 1 transitions to the linear phase when $\|\bar{E}_t\| \le M_1$. Specifically, the transition point $U_1$ is given by
$$U_1 = \max\left\{\log_{1/\beta}\frac{\alpha\beta}{3M_1\sigma_0},\; \sup\{t : \phi(t) > M_1\} + 1\right\}, \qquad (47)$$
where $\phi(t)$ denotes the upper bound on $\|\bar{E}_t\|$ in (46). When $w(t) = (t+1)^{\log(t+4)}$, we have $w'(t)/w(t) = O(\log(t)/t)$. Thus, we conclude that $U_1 = O(\Upsilon^2/\kappa^2)$.
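To see what the weighting scheme $w(t) = (t+1)^{\log(t+4)}$ does in practice, the following sketch computes the normalized weights $z_{i,t} = (w(i) - w(i-1))/w(t)$ used later in this appendix (with $w(-1) = 0$); the horizon $t$ is a hypothetical choice.

```python
import numpy as np

# Sketch of the non-uniform Hessian-averaging weights with
# w(t) = (t+1)^{log(t+4)}; values are illustrative.
def w(t):
    return (t + 1) ** np.log(t + 4)

t = 100
ws = np.array([w(i) for i in range(t + 1)])
z = np.diff(np.concatenate(([0.0], ws))) / ws[-1]  # z_{i,t} for i = 0..t

print(np.isclose(z.sum(), 1.0))   # telescoping: sum_i z_{i,t} = 1
print(z[-1], z[: t // 2].sum())   # most weight sits on recent Hessians
```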
Linear convergence phase. In the following lemma, we prove the linear convergence of Algorithm 1 with weighted averaging.

Lemma 14. Assume that $\beta \le 1/\Psi^2$ and recall the definition of $U_1$ in (47). For any $t \ge U_1$, we have
$$\|x_{t+1} - x^*\|^2 \;\le\; \|x_t - x^*\|^2\left(1 + \frac{2\alpha\beta}{3\kappa}\right)^{-1},$$
where $\kappa \triangleq M_1/\mu$ is the condition number.

Proof. See Appendix C.3.

Superlinear convergence phase. Define
$$\mathcal{J}' = \sup_t\{t : \phi(t) > \nu\mu\} + 1. \qquad (48)$$
Moreover, let $\nu \in (0,1)$ be a parameter and define
$$\mathcal{J} = \max\left\{\mathcal{J}',\; U_1 + 2\left(1 + \frac{3\kappa}{2\alpha\beta}\right)\log\frac{L_2 D}{\nu\mu}\right\}, \qquad (49)$$
where $D = \|x_0 - x^*\|$ as before. Finally, let
$$U_2 = \sup_t\left\{t : w(t) \le \frac{\kappa}{\nu}\,w(\mathcal{J})\right\}. \qquad (50)$$
When $w(t) = (t+1)^{\log(t+4)}$, we remark that $\mathcal{J}' = O(\Upsilon^2)$ and thus $\mathcal{J} = O(\kappa + \Upsilon^2)$. Moreover, similar to the derivation in [1], it can be shown that $U_2 = O(\mathcal{J}) = O(\kappa + \Upsilon^2)$.

Theorem 4. Let $\nu \in (0,1)$ be a parameter satisfying
$$\left(\frac{1}{2(1-\alpha^2)\beta} + \frac{5}{\alpha\sqrt{2\beta}}\right)\nu \;\le\; \frac{1}{2\Psi}, \qquad (51)$$
and recall the definition of $U_2$ in (50). For any $t \ge 0$, we have
$$\|x_{U_2+t+1} - x^*\| \;\le\; \frac{1}{\sqrt{2\beta}}\left(\frac{1}{2(1-\alpha^2)} + 1\right)\theta_t\,\|x_{U_2+t} - x^*\|, \qquad \theta_t = \frac{\phi(U_2+t)}{\alpha\sqrt{2\beta}\,\mu} + \frac{5\kappa\,w(\mathcal{J})}{\alpha\sqrt{2\beta}\,w(U_2+t)}. \qquad (52)$$

Proof. See Appendix C.4.

C.3 Proof of Lemma 14

To simplify the notation, define the function
$$\phi(t) \;=\; 8\Psi\Upsilon_E\max\left\{\sqrt{\frac{w'(t)}{w(t)}\log\frac{d(t+1)}{\delta}},\; \frac{w'(t)}{w(t)}\log\frac{d(t+1)}{\delta}\right\}.$$
Then we can rewrite (46) as $\|\bar{E}_t\| \le \phi(t)$ for all $t \ge 0$. Similar to Lemma 7, we have the following result.

Lemma 15. If $t \in \mathcal{B}$, then we have $\eta_t \ge \alpha\beta/(2M_1 + \|\bar{E}_t\|) \ge \alpha\beta/(2M_1 + \phi(t))$.

Lemma 16. Assume that $\beta \le 1/\Psi$. For any $t \ge 0$, we have
$$\eta_t \;\ge\; \min\left\{\frac{\alpha\beta}{2M_1 + \phi(t)},\; \frac{\sigma_0}{\beta^t}\right\}. \qquad (53)$$

Proof. We prove this lemma by induction. For $t = 0$, we distinguish two subcases. If $0 \in \mathcal{B}$, then by Lemma 15 we obtain $\eta_0 \ge \frac{\alpha\beta}{2M_1 + \phi(0)}$. Otherwise, if $0 \notin \mathcal{B}$, we have $\eta_0 = \sigma_0$. In both cases, we observe that (53) is satisfied for the base case $t = 0$. Now assume that (53) is satisfied for $t = s$. For $t = s+1$, we again distinguish two subcases. If $s+1 \in \mathcal{B}$, then by Lemma 15 we obtain $\eta_{s+1} \ge \frac{\alpha\beta}{2M_1 + \phi(s+1)}$, which implies that (53) is satisfied. Otherwise, if $s+1 \notin \mathcal{B}$, then we have
$$\eta_{s+1} = \sigma_{s+1} = \frac{\eta_s}{\beta} \;\ge\; \frac{1}{\beta}\min\left\{\frac{\alpha\beta}{2M_1 + \phi(s)},\; \frac{\sigma_0}{\beta^{s}}\right\} \;\ge\; \min\left\{\frac{\alpha}{2M_1 + \phi(s)},\; \frac{\sigma_0}{\beta^{s+1}}\right\}, \qquad (54)$$
where we used the induction hypothesis in the last inequality. Furthermore, note that
$$\frac{\phi(s)}{\phi(s+1)} \;\le\; \frac{w'(s)\,w(s+1)}{w'(s+1)\,w(s)} \;\le\; \Psi \;\le\; \frac{1}{\beta},$$
and hence
$$\frac{\alpha}{2M_1 + \phi(s)} \;\ge\; \frac{\alpha\beta}{2\beta M_1 + \phi(s+1)} \;\ge\; \frac{\alpha\beta}{2M_1 + \phi(s+1)}.$$
Therefore, (54) implies that $\eta_{s+1} \ge \min\big\{\frac{\alpha\beta}{2M_1 + \phi(s+1)}, \frac{\sigma_0}{\beta^{s+1}}\big\}$ and thus (53) also holds in this case. This completes the induction and we conclude that (53) holds for all $t \ge 0$.

Corollary 3. Recall the definition of $U_1$ in (47). For any $t \ge U_1$, we have $\eta_t \ge \alpha\beta/(3M_1)$.

Proof. By definition, we have $U_1 \ge \log_{1/\beta}\frac{\alpha\beta}{3M_1\sigma_0}$, and thus $\frac{\sigma_0}{\beta^{U_1}} \ge \frac{\alpha\beta}{3M_1}$. Moreover, for $t \ge U_1$ we also have $\phi(t) \le M_1$. Hence, by Lemma 16 we conclude that $\eta_t \ge \frac{\alpha\beta}{3M_1}$ when $t \ge U_1$.

Now we are ready to prove Lemma 14.

Proof of Lemma 14. It follows from Proposition 1 and Corollary 3.

C.4 Proof of Theorem 4

Lemma 17. We have $\|\bar{E}_t\| \le \nu\mu$ and $\|x_t - x^*\| \le \nu\mu/L_2$ for all $t \ge \mathcal{J}$.

Proof. This follows from Lemmas 13 and 14.

Lemma 18. If $t \in \mathcal{B}$ and $t \ge \mathcal{J}$, then
$$\frac{1}{\sqrt{\mu\eta_t}} \;\le\; \frac{L_2\|x_t - x^*\|}{2(1-\alpha^2)\beta\mu} + \frac{\|\bar{E}_t\|}{\alpha\sqrt{2\beta}\,\mu} + \frac{\kappa\,w(\mathcal{J}-1)}{\alpha\sqrt{2\beta}\,w(t)} + \frac{2\nu}{\alpha\sqrt{2\beta}} \qquad (55)$$
and also
$$\frac{1}{\sqrt{\mu\eta_t}} \;\le\; \frac{\nu}{2(1-\alpha^2)\beta} + \frac{3\nu}{\alpha\sqrt{2\beta}} + \frac{\kappa\,w(\mathcal{J}-1)}{\alpha\sqrt{2\beta}\,w(t)}. \qquad (56)$$

Proof. Similar to the proof of Lemma 10, note that
$$\frac{1}{\sqrt{\mu\eta_t}} \;\le\; \frac{L_2\|x_t - x^*\|}{2(1-\alpha^2)\beta\mu} + \frac{\|H_t - \bar{H}_t\|}{\alpha\sqrt{2\beta}\,\mu} + \frac{\|\bar{E}_t\|}{\alpha\sqrt{2\beta}\,\mu}.$$
For the second term, note that $\bar{H}_t = \sum_{i=0}^{t} z_{i,t}H_i$, where $z_{i,t} = \frac{w(i) - w(i-1)}{w(t)}$ (with $w(-1) = 0$). Hence, by Jensen's inequality, we have
$$\|H_t - \bar{H}_t\| \;\le\; \sum_{i=0}^{t} z_{i,t}\|H_t - H_i\| \;=\; \sum_{i=0}^{\mathcal{J}-1} z_{i,t}\|H_t - H_i\| + \sum_{i=\mathcal{J}}^{t} z_{i,t}\|H_t - H_i\|.$$
When $0 \le i \le \mathcal{J}-1$, we use Assumption 2 to bound $\|H_t - H_i\| \le M_1$ and thus
$$\sum_{i=0}^{\mathcal{J}-1} z_{i,t}\|H_t - H_i\| \;\le\; M_1\sum_{i=0}^{\mathcal{J}-1}\frac{w(i) - w(i-1)}{w(t)} \;=\; M_1\,\frac{w(\mathcal{J}-1)}{w(t)}.$$
Moreover, for $\mathcal{J} \le i \le t$, we use Assumption 3 to get $\|H_t - H_i\| \le L_2\|x_t - x_i\| \le L_2(\|x_t - x^*\| + \|x_i - x^*\|) \le 2L_2\|x_i - x^*\| \le 2\nu\mu$, where the last step uses Lemma 17. Thus, $\sum_{i=\mathcal{J}}^{t} z_{i,t}\|H_t - H_i\| \le 2\nu\mu\sum_{i=\mathcal{J}}^{t} z_{i,t} \le 2\nu\mu$. Combining the above inequalities, we arrive at
$$\|H_t - \bar{H}_t\| \;\le\; M_1\,\frac{w(\mathcal{J}-1)}{w(t)} + 2\nu\mu.$$
This leads to the first result in (55). To show (56), we note that $\|\bar{E}_t\| \le \phi(t) \le \nu\mu$ and $\|x_t - x^*\| \le \nu\mu/L_2$ for all $t \ge \mathcal{J}$.
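The telescoping identity behind the first part of the bias bound above can be checked numerically, as in this small sketch with hypothetical $\mathcal{J}$ and $t$.

```python
import math

# Check the telescoping identity used in the proof of Lemma 18:
#   sum_{i=0}^{J-1} z_{i,t} = w(J-1)/w(t),  z_{i,t} = (w(i)-w(i-1))/w(t), w(-1)=0.
def w(t):
    return (t + 1) ** math.log(t + 4)

J, t = 30, 100
z = [(w(i) - (w(i - 1) if i > 0 else 0.0)) / w(t) for i in range(t + 1)]
assert math.isclose(sum(z[:J]), w(J - 1) / w(t))
print(sum(z[:J]), w(J - 1) / w(t))
```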
Lemma 19. Assume that $\beta \le 1/\Psi^2$. For any $t \ge \mathcal{J}$, we have
$$\frac{1}{\sqrt{\mu\eta_t}} \;\le\; \frac{\nu}{2(1-\alpha^2)\beta} + \frac{3\nu}{\alpha\sqrt{2\beta}} + \frac{2\kappa\,w(\mathcal{J})}{\alpha\sqrt{2\beta}\,w(t)}. \qquad (57)$$

Proof. We shall use induction to prove Lemma 19. For $t = \mathcal{J}$, note that by Corollary 3 we have $\eta_{\mathcal{J}} \ge \alpha\beta/(3M_1)$. Thus, this implies that
$$\frac{1}{\sqrt{\mu\eta_{\mathcal{J}}}} \;\le\; \sqrt{\frac{3\kappa}{\alpha\beta}} \;\le\; \frac{2\kappa\,w(\mathcal{J})}{\alpha\sqrt{2\beta}\,w(\mathcal{J})},$$
where we used $\kappa \ge 1$ and $\alpha, \beta < 1$. This proves the base case $t = \mathcal{J}$. Now assume that (57) holds for $t = s$, where $s \ge \mathcal{J}$. For $t = s+1$, we distinguish two cases. If $s+1 \in \mathcal{B}$, then by Lemma 18 we obtain that (57) is satisfied for $t = s+1$. Otherwise, if $s+1 \notin \mathcal{B}$, then we have $\eta_{s+1} = \sigma_{s+1} = \eta_s/\beta$. Hence, by using the induction hypothesis, we have
$$\frac{1}{\sqrt{\mu\eta_{s+1}}} = \frac{\sqrt{\beta}}{\sqrt{\mu\eta_s}} \;\le\; \frac{\sqrt{\beta}\,\nu}{2(1-\alpha^2)\beta} + \frac{3\sqrt{\beta}\,\nu}{\alpha\sqrt{2\beta}} + \frac{2\sqrt{\beta}\,\kappa\,w(\mathcal{J})}{\alpha\sqrt{2\beta}\,w(s)}.$$
Since $\beta \le 1/\Psi^2$, we have $w(s+1)/w(s) \le \Psi \le 1/\sqrt{\beta}$. Thus, we further have
$$\frac{1}{\sqrt{\mu\eta_{s+1}}} \;\le\; \frac{\nu}{2(1-\alpha^2)\beta} + \frac{3\nu}{\alpha\sqrt{2\beta}} + \frac{2\kappa\,w(\mathcal{J})}{\alpha\sqrt{2\beta}\,w(s+1)}.$$
This shows that (57) also holds in this case.

Recall that $w(U_2) = \frac{\kappa}{\nu}w(\mathcal{J})$. Then by Lemma 19, for $t \ge U_2$ we have
$$\frac{1}{\sqrt{\mu\eta_t}} \;\le\; \frac{\nu}{2(1-\alpha^2)\beta} + \frac{3\nu}{\alpha\sqrt{2\beta}} + \frac{2\kappa\,w(\mathcal{J})}{\alpha\sqrt{2\beta}\,w(U_2)} \;=\; \frac{\nu}{2(1-\alpha^2)\beta} + \frac{5\nu}{\alpha\sqrt{2\beta}}.$$
We choose $\nu$ such that (51) holds. Therefore, this further implies that $\|x_{t+1} - x^*\| \le \|x_t - x^*\|/(2\Psi)$ for $t \ge U_2$.

Lemma 20. If $t \in \mathcal{B}$ and $t \ge U_2$, then
$$\frac{1}{\sqrt{\mu\eta_t}} \;\le\; \frac{L_2\|x_t - x^*\|}{2(1-\alpha^2)\beta\mu} + \frac{\phi(t)}{\alpha\sqrt{2\beta}\,\mu} + \frac{5\kappa\,w(\mathcal{J})}{\alpha\sqrt{2\beta}\,w(t)}. \qquad (59)$$

Proof. Recall that we have
$$\frac{1}{\sqrt{\mu\eta_t}} \;\le\; \frac{L_2\|x_t - x^*\|}{2(1-\alpha^2)\beta\mu} + \frac{\|H_t - \bar{H}_t\|}{\alpha\sqrt{2\beta}\,\mu} + \frac{\|\bar{E}_t\|}{\alpha\sqrt{2\beta}\,\mu}.$$
For the second term, we can write
$$\|H_t - \bar{H}_t\| \;\le\; \sum_{i=0}^{\mathcal{J}-1} z_{i,t}\|H_t - H_i\| + \sum_{i=\mathcal{J}}^{U_2} z_{i,t}\|H_t - H_i\| + \sum_{i=U_2+1}^{t} z_{i,t}\|H_t - H_i\|.$$
For the first part, $\sum_{i=0}^{\mathcal{J}-1} z_{i,t}\|H_t - H_i\| \le M_1\frac{w(\mathcal{J}-1)}{w(t)} \le M_1\frac{w(\mathcal{J})}{w(t)}$. For the second part,
$$\sum_{i=\mathcal{J}}^{U_2} z_{i,t}\|H_t - H_i\| \;\le\; 2\nu\mu\sum_{i=\mathcal{J}}^{U_2} z_{i,t} \;\le\; 2\nu\mu\,\frac{w(U_2)}{w(t)} \;=\; 2M_1\,\frac{w(\mathcal{J})}{w(t)},$$
where we used the fact that $w(U_2) = \frac{\kappa}{\nu}w(\mathcal{J})$. For the third part, since $\|x_{t+1} - x^*\| \le \|x_t - x^*\|/(2\Psi)$ for $t \ge U_2$,
$$\sum_{i=U_2+1}^{t} z_{i,t}\|H_t - H_i\| \;\le\; \sum_{j=1}^{t-U_2}\frac{w(U_2+j) - w(U_2+j-1)}{w(t)}\,2L_2\|x_{U_2+j} - x^*\| \;\le\; \sum_{j=1}^{t-U_2}\frac{w(U_2+j)}{w(t)}\cdot\frac{2L_2\|x_{U_2} - x^*\|}{(2\Psi)^j}.$$
Moreover, since $w(U_2+j) \le \Psi^j\,w(U_2)$ and $\|x_{U_2} - x^*\| \le \nu\mu/L_2$, we further have
$$\sum_{i=U_2+1}^{t} z_{i,t}\|H_t - H_i\| \;\le\; \frac{2\nu\mu\,w(U_2)}{w(t)}\sum_{j=1}^{\infty}\frac{\Psi^j}{(2\Psi)^j} \;\le\; \frac{2\nu\mu\,w(U_2)}{w(t)} \;=\; 2M_1\,\frac{w(\mathcal{J})}{w(t)}.$$
Therefore, we conclude that
$$\|H_t - \bar{H}_t\| \;\le\; \frac{3M_1\,w(\mathcal{J})}{w(t)} + \frac{2M_1\,w(\mathcal{J})}{w(t)} \;=\; \frac{5M_1\,w(\mathcal{J})}{w(t)}.$$
This completes the proof.

Lemma 21. Recall the definition of $\theta_t$ in (52). For any $t \ge 0$, we have
$$\frac{5L_2\|x_{U_2+t} - x^*\|}{\alpha\sqrt{2\beta}\,\mu} \;\le\; \theta_t \qquad\text{and}\qquad \frac{1}{\sqrt{\mu\eta_{U_2+t}}} \;\le\; \frac{1}{\sqrt{\beta}}\left(\frac{1}{2(1-\alpha^2)} + 1\right)\theta_t. \qquad (60)$$

Proof. Note that by Proposition 1, we have
$$\|x_{U_2+t+1} - x^*\| \;\le\; \|x_{U_2+t} - x^*\|(1 + 2\eta_{U_2+t}\mu)^{-1/2} \;\le\; \frac{\|x_{U_2+t} - x^*\|}{\sqrt{2\eta_{U_2+t}\mu}}.$$
By Lemma 20, if $U_2+t \in \mathcal{B}$, then $\frac{1}{\sqrt{\mu\eta_{U_2+t}}} \le \frac{L_2\|x_{U_2+t} - x^*\|}{2(1-\alpha^2)\beta\mu} + \theta_t$. We will prove (60) by induction. First consider the base case $t = 0$. We note that $\theta_0 \ge \frac{5\kappa\,w(\mathcal{J})}{\alpha\sqrt{2\beta}\,w(U_2)} = \frac{5\nu}{\alpha\sqrt{2\beta}}$. On the other hand, since $\|x_{U_2} - x^*\| \le \nu\mu/L_2$, we obtain that
$$\frac{5L_2\|x_{U_2} - x^*\|}{\alpha\sqrt{2\beta}\,\mu} \;\le\; \frac{5\nu}{\alpha\sqrt{2\beta}} \;\le\; \theta_0.$$
Moreover, by Lemma 19, we have
$$\frac{1}{\sqrt{\mu\eta_{U_2}}} \;\le\; \frac{\nu}{2(1-\alpha^2)\beta} + \frac{5\nu}{\alpha\sqrt{2\beta}} \;\le\; \frac{1}{\sqrt{\beta}}\left(\frac{1}{2(1-\alpha^2)} + 1\right)\theta_0.$$
This shows that (60) holds for $t = 0$. Now assume that (60) holds for $t = s \ge 0$. For $t = s+1$, by using the induction hypothesis, we obtain
$$\frac{5L_2\|x_{U_2+s+1} - x^*\|}{\alpha\sqrt{2\beta}\,\mu} \;\le\; \frac{5L_2\|x_{U_2+s} - x^*\|}{\alpha\sqrt{2\beta}\,\mu\,\sqrt{2\eta_{U_2+s}\mu}} \;\le\; \frac{1}{\sqrt{2\beta}}\left(\frac{1}{2(1-\alpha^2)} + 1\right)\theta_s^2.$$
Note that $\theta_s/\Psi \le \theta_{s+1}$. Thus, it suffices to show that $\frac{1}{\sqrt{2\beta}}\big(\frac{1}{2(1-\alpha^2)} + 1\big)\theta_s^2 \le \theta_s/\Psi$, which is equivalent to $\frac{\Psi}{\sqrt{2\beta}}\big(\frac{1}{2(1-\alpha^2)} + 1\big)\theta_s \le 1$. Furthermore, note that $\theta_s$ is non-increasing and $\theta_0 \le \frac{\nu}{\alpha\sqrt{2\beta}} + \frac{5\nu}{\alpha\sqrt{2\beta}} = \frac{6\nu}{\alpha\sqrt{2\beta}}$. Thus, we only need to require
$$\left(\frac{1}{2(1-\alpha^2)\beta} + \frac{3}{\alpha\sqrt{2\beta}}\right)\nu \;\le\; \frac{1}{2\Psi},$$
which is satisfied due to (51). This proves the first inequality in (60). To prove the second inequality in (60) for $t = s+1$, we distinguish two cases. If $U_2+s+1 \in \mathcal{B}$, then by Lemma 20, we have
$$\frac{1}{\sqrt{\mu\eta_{U_2+s+1}}} \;\le\; \frac{L_2\|x_{U_2+s+1} - x^*\|}{2(1-\alpha^2)\beta\mu} + \frac{\phi(U_2+s+1)}{\alpha\sqrt{2\beta}\,\mu} + \frac{5\kappa\,w(\mathcal{J})}{\alpha\sqrt{2\beta}\,w(U_2+s+1)} \;\le\; \frac{\theta_{s+1}}{2(1-\alpha^2)\sqrt{\beta}} + \theta_{s+1} \;\le\; \frac{1}{\sqrt{\beta}}\left(\frac{1}{2(1-\alpha^2)} + 1\right)\theta_{s+1}.$$
Otherwise, if $U_2+s+1 \notin \mathcal{B}$, then we have $\eta_{U_2+s+1} = \eta_{U_2+s}/\beta$ and hence
$$\frac{1}{\sqrt{\mu\eta_{U_2+s+1}}} = \frac{\sqrt{\beta}}{\sqrt{\mu\eta_{U_2+s}}} \;\le\; \left(\frac{1}{2(1-\alpha^2)} + 1\right)\theta_s.$$
Since $\theta_s \le \Psi\theta_{s+1} \le \theta_{s+1}/\sqrt{\beta}$, this implies that $\frac{1}{\sqrt{\mu\eta_{U_2+s+1}}} \le \frac{1}{\sqrt{\beta}}\big(\frac{1}{2(1-\alpha^2)} + 1\big)\theta_{s+1}$. This completes the induction.

Proof of Theorem 4. By Proposition 1, we have $\|x_{U_2+t+1} - x^*\| \le \|x_{U_2+t} - x^*\|(1 + 2\eta_{U_2+t}\mu)^{-1/2} \le \frac{\|x_{U_2+t} - x^*\|}{\sqrt{2\eta_{U_2+t}\mu}}$. The rest follows from Lemma 21.
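The constant $\Psi$ enters the proofs above through ratios such as $w(s+1)/w(s) \le \Psi$. The sketch below estimates the largest consecutive ratio for $w(t) = (t+1)^{\log(t+4)}$ over a finite range; this is only an illustration, not the formal definition of $\Psi$ from the main text.

```python
import math

# Rough numeric look at the consecutive-weight ratio for w(t) = (t+1)^{log(t+4)}.
def w(t):
    return (t + 1) ** math.log(t + 4)

ratios = [w(t + 1) / w(t) for t in range(10_000)]
print(max(ratios))  # attained at small t; suggests a modest constant bound
```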
D Additional discussions

D.1 Iteration complexity of SNPE

Note that for $t \ge U_3 = O(\Upsilon^2 + \kappa)$, we have $\|x_{t+1} - x^*\| = O\big(\frac{\Upsilon}{\sqrt{t}}\big)\|x_t - x^*\|$ by Theorem 2. By unrolling the inequality, this implies that
$$\|x_{t+1} - x^*\| \;\le\; \|x_{U_3} - x^*\|\prod_{s=U_3}^{t} O\Big(\frac{\Upsilon}{\sqrt{s}}\Big).$$
Further, for $t \ge 2U_3$, we can upper bound $\frac{\Upsilon}{\sqrt{s}}$ as follows: (i) $\frac{\Upsilon}{\sqrt{s}} \le \frac{\Upsilon}{\sqrt{U_3}} \le 1$ for any $s \in [U_3, t/2]$; (ii) $\frac{\Upsilon}{\sqrt{s}} \le \frac{\Upsilon}{\sqrt{t/2}}$ for any $s \in [t/2, t]$. Thus, we have
$$\|x_{t+1} - x^*\| \;\le\; \left(\Upsilon\sqrt{\frac{2}{t}}\right)^{t/2}\|x_{U_3} - x^*\|.$$
To derive a complexity bound, we upper bound the number of iterations $t$ required so that $\big(\Upsilon\sqrt{2/t}\big)^{t/2} = \epsilon$. Taking the logarithm of both sides and after some algebraic manipulation, we obtain
$$\frac{t}{2\Upsilon^2}\log\frac{t}{2\Upsilon^2} \;=\; \frac{2}{\Upsilon^2}\log\frac{1}{\epsilon}.$$
Using the Lambert W function¹, the solution can be expressed as
$$\log\frac{t}{2\Upsilon^2} = W\Big(\frac{2}{\Upsilon^2}\log\frac{1}{\epsilon}\Big) \quad\Longrightarrow\quad t = 2\Upsilon^2 e^{W\big(\frac{2}{\Upsilon^2}\log\frac{1}{\epsilon}\big)}.$$
Finally, by applying the bound $e^{W(x)} \le \frac{2x+1}{1+\log(x+1)}$ for any $x \ge 0$, we conclude that
$$t = O\left(\frac{\log(\epsilon^{-1})}{\log\big(\Upsilon^{-2}\log(\epsilon^{-1})\big)}\right).$$
Note that in the above derivation, we ignored the additional logarithmic factor $\log(t)$ in our superlinear convergence rate; a more careful analysis shows that it does not affect the final complexity bound. We also refer the reader to a similar derivation in [22, Appendix D.2], where the authors obtain the same complexity bound for a similar superlinear convergence rate.

D.2 The effect of the extragradient step

In Figure 3, we examine the effect of the extragradient step in our proposed SNPE method. We observe that in all cases, the variant without the extragradient step outperforms the original version, suggesting that the extragradient step may not be beneficial for minimization problems. Nevertheless, the SNPE method with the extragradient step, which is the variant analyzed in our paper, still outperforms the stochastic Newton method in [1].

Figure 3: The effect of the extragradient step in stochastic NPE. Each panel plots error against iteration (0 to 400) for SN-Unif Avg, SN-Weight Avg, SNPE-Unif Avg (w/o EG), SNPE-Weight Avg (w/o EG), SNPE-Unif Avg (w/ EG), and SNPE-Weight Avg (w/ EG): (a) n = 50,000, d = 500; (b) n = 100,000, d = 500; (c) n = 150,000, d = 500.

¹https://en.wikipedia.org/wiki/Lambert_W_function
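As a numeric companion to the derivation in Appendix D.1, the following sketch solves the fixed-point equation with SciPy's Lambert W implementation and checks the closed-form upper bound; $\Upsilon$ and $\epsilon$ are hypothetical values.

```python
import numpy as np
from scipy.special import lambertw

# Solve (Upsilon*sqrt(2/t))^{t/2} = eps via t = 2*Upsilon^2*exp(W((2/Upsilon^2)*log(1/eps))).
# Upsilon and eps are hypothetical values for illustration.
Upsilon, eps = 10.0, 1e-12

x = (2 / Upsilon**2) * np.log(1 / eps)
t = 2 * Upsilon**2 * np.exp(lambertw(x).real)

# Check that t satisfies the fixed-point equation from Appendix D.1.
lhs = (t / (2 * Upsilon**2)) * np.log(t / (2 * Upsilon**2))
print(t, np.isclose(lhs, x))

# The bound e^W(x) <= (2x+1)/(1+log(x+1)) yields
# t = O(log(1/eps) / log(Upsilon^{-2} log(1/eps))).
t_bound = 2 * Upsilon**2 * (2 * x + 1) / (1 + np.log(x + 1))
print(t <= t_bound)
```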
NeurIPS Paper Checklist

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

Justification: In the abstract and introduction of the paper, we claim that we can improve the complexity of the stochastic Newton method presented in [1], and this is exactly what we demonstrate. This improvement is achieved by introducing a novel stochastic variant of the Newton HPE framework, as promised in the abstract and introduction.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: In Section 8, we explicitly mention the limitations of our paper. Specifically, we note that our theory assumes strong convexity; extending the theory to the general convex setting would make it more broadly applicable, and this extension is left for future work due to space limitations. We also highlight in the introduction that we focus solely on the setting where the Hessian oracle is noisy while the gradient can be queried without error.
3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: All the theorems, formulas, and proofs in the paper are numbered and cross-referenced. All assumptions for each presented result are clearly stated or referenced in the statements of the lemmas, propositions, or theorems. The proofs of all results are presented in the supplemental material, and high-level proof ideas are included in the main text whenever possible.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: The experimental results and the implementation details are included in Section 7.
5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: The code and data are attached in the supplementary material.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: We explain how we performed the experiments and include the implementation details in Section 7.
7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [No]

Justification:

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: The compute resources are reported in Section 7.

9. Code of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?

Answer: [Yes]

Justification: The research conducted in the paper conforms, in every respect, with the NeurIPS Code of Ethics.
10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: There is no societal impact of the work performed.

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: The paper poses no such risks.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA]

Justification: The paper does not use existing assets.
13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification: The paper does not release new assets.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: The paper does not involve crowdsourcing nor research with human subjects.