A consistently adaptive trust-region method

Fadi Hamad
Department of Industrial Engineering
University of Pittsburgh
Pittsburgh, PA 15261
fah33@pitt.edu

Oliver Hinder
Department of Industrial Engineering
University of Pittsburgh
Pittsburgh, PA 15261
ohinder@pitt.edu

Abstract

Adaptive trust-region methods attempt to maintain strong convergence guarantees without depending on conservative estimates of problem properties such as Lipschitz constants. However, on close inspection, one can show existing adaptive trust-region methods have theoretical guarantees with severely suboptimal dependence on problem properties such as the Lipschitz constant of the Hessian. For example, TRACE, developed by Curtis et al., obtains a $O(\Delta_f L^{3/2} \epsilon^{-3/2}) + \tilde{O}(1)$ iteration bound, where $L$ is the Lipschitz constant of the Hessian. Compared with the optimal $O(\Delta_f L^{1/2} \epsilon^{-3/2})$ bound, this is suboptimal with respect to $L$. We present the first adaptive trust-region method which circumvents this issue and requires at most $O(\Delta_f L^{1/2} \epsilon^{-3/2}) + \tilde{O}(1)$ iterations to find an $\epsilon$-approximate stationary point, matching the optimal iteration bound up to an additive logarithmic term. Our method is a simple variant of a classic trust-region method and in our experiments performs competitively with both ARC and a classical trust-region method.

1 Introduction

Second-order methods are known to quickly and accurately solve sparse nonconvex optimization problems that, for example, arise in optimal control [1], truss design [2], AC optimal power flow [3], and PDE-constrained optimization [4]. Recently, there has also been a large push to extend second-order methods to tackle machine learning problems by coupling them with carefully designed subproblem solvers [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19].

Much of the early theory for second-order methods focused on showing fast local convergence and (eventual) global convergence [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]. These proofs of global convergence, unsatisfactorily, rested on showing that at each iteration second-order methods reduce the function value almost as much as gradient descent [22, Theorem 4.5], [30, Theorem 3.2], despite the fact that in practice second-order methods require far fewer iterations.

In 2006, Nesterov and Polyak [31] partially resolved this inconsistency by introducing a new second-order method: cubic regularized Newton's method (CRN). Their method can be used to find stationary points of multivariate and possibly nonconvex functions $f : \mathbb{R}^n \to \mathbb{R}$. Their convergence results assume the optimality gap $\Delta_f := f(x_1) - \inf_{x \in \mathbb{R}^n} f(x)$ is finite and that the Hessian of $f$ is $L$-Lipschitz:

$\|\nabla^2 f(x) - \nabla^2 f(x')\| \le L \|x - x'\| \quad \forall x, x' \in \mathbb{R}^n$  (1)

where $\|\cdot\|$ is the spectral norm for matrices and the Euclidean norm for vectors. If $L$ is known, they guarantee their algorithm terminates with an $\epsilon$-approximate stationary point, i.e., a point $x$ with $\|\nabla f(x)\| \le \epsilon$, after at most

$O(\Delta_f L^{1/2} \epsilon^{-3/2})$  (2)

iterations. For sufficiently small $\epsilon$, this improves on the classic guarantee that gradient descent terminates after at most $O(\Delta_f S \epsilon^{-2})$ iterations for $S$-smooth functions, thereby partially resolving this inconsistency between theory and practice. Bound (2) is also known to be the best possible for second-order methods [32]. However, CRN only achieves (2) if the Lipschitz constant of the Hessian is known. In practice, we rarely know the Lipschitz constant of the Hessian, and if we do it is likely to be a conservative estimate.
With this in mind, many authors have developed practical algorithms that achieve the convergence guarantees of CRN without needing to know the Lipschitz constant of the Hessian. We list these adaptive second-order methods in Table 1 along with their worst-case iteration bounds. Despite the fact that all these algorithms match the $\epsilon$-dependence of (2), the majority of them are suboptimal in their dependence on the Lipschitz constant $L$. For example, only our method and [33, 34] are optimal in terms of $L$ scaling, whereas [35] is suboptimal as its bound scales proportionally to $L^{3/2}$ instead of $L^{1/2}$. Moreover, all the trust-region methods have suboptimal $L$ scaling. In particular, inspection of these bounds shows scaling with respect to $L$ of $L^{3/2}$ for [36] and $L^2$ for [37] instead of the optimal scaling of $L^{1/2}$. An ideal algorithm would not incur this cost for adaptivity. This motivates the following definition.

Definition 1. A method is consistently adaptive on a problem class if, without knowing problem parameters, it achieves the same worst-case complexity bound as one obtains if problem parameters were known, up to a problem-independent constant factor and an additive polylogarithmic term.

Clearly, based on our discussion above, there do not exist consistently adaptive trust-region methods. Indeed, despite the extensive literature on trust-region methods [9, 10, 22, 25, 27, 36, 38, 39, 40, 41] and their worst-case iteration bounds [36, 37, 42], none of these methods are consistently adaptive. As we mentioned earlier and according to Table 1, [33, 34] are cubic regularization based methods which scale optimally with respect to the problem parameters. However, they are not quite consistently adaptive because $\sigma_0$ appears outside the additive polylogarithmic term.

Table 1: Adaptive second-order methods along with their worst-case bounds on the number of gradient, function and Hessian evaluations. $\sigma_{\min} \in (0, \infty)$ is the smallest regularization parameter used by ARC [35]. $\sigma_0 \in (0, \infty)$ is the initial regularization parameter for cubic regularized methods.

Algorithm | Type | Worst-case iteration bound
ARC [35]¹ | cubic regularized | $O(\Delta_f L^{3/2} \sigma_{\min}^{-1} \epsilon^{-3/2} + \Delta_f \sigma_{\min}^{1/2} \epsilon^{-3/2})$
Nesterov et al. [33, Eq. 5.13 and 5.14]² | cubic regularized | $O(\Delta_f \max\{L, \sigma_0\}^{1/2} \epsilon^{-3/2}) + \tilde{O}(1)$
ARp [34, Section 4.1]³ | cubic regularized | $O(\Delta_f \max\{L, \sigma_0\}^{1/2} \epsilon^{-3/2}) + \tilde{O}(1)$
TRACE [36, Section 3.2] | trust-region | $O(\Delta_f L^{3/2} \epsilon^{-3/2}) + \tilde{O}(1)$
Toint et al. [37, Section 2.2] | trust-region | $O(\Delta_f \max\{L^2, 1 + 2L\} \epsilon^{-3/2})$
Our method | trust-region | $O(\Delta_f L^{1/2} \epsilon^{-3/2}) + \tilde{O}(1)$

Our contributions:

1. We present the first consistently adaptive trust-region method for finding stationary points of nonconvex functions with $L$-Lipschitz Hessians and bounded optimality gap. In particular, we prove our method finds an $\epsilon$-approximate stationary point after at most $O(\Delta_f L^{1/2} \epsilon^{-3/2}) + \tilde{O}(1)$ iterations.
2. We show our trust-region method converges quadratically when it enters a region around a point satisfying the second-order sufficient conditions for local optimality.
3. Our method appears promising in experiments. We test our method on the CUTEst test set [43] against other methods, including ARC and a classic trust-region method. These tests show our method is competitive with the other methods in terms of the total number of iterations required until convergence.

Paper outline. The paper is structured as follows. Section 2 presents our trust-region method and contrasts it with existing trust-region methods.
Section 3 presents our main result: a convergence bound for finding $\epsilon$-approximate stationary points that is consistently adaptive on problems with Lipschitz continuous Hessians. Section 4 shows quadratic convergence of the method. Section 5 discusses the experimental results.

Notation. Let $\mathbb{N}$ be the set of natural numbers (starting from one), $I$ the identity matrix, and $\mathbb{R}$ the set of real numbers. Throughout this paper we assume that $n \in \mathbb{N}$ and $f : \mathbb{R}^n \to \mathbb{R}$ is bounded below and twice differentiable. We define $f^* := \inf_{x \in \mathbb{R}^n} f(x)$ and $\Delta_f := f(x_1) - f^*$.

2 Our trust-region method

2.1 Trust-region subproblems

As is standard for trust-region methods [22], at each iteration $k$ of our algorithm we build a second-order Taylor series approximation at the current iterate $x_k$:

$M_k(d) := \frac{1}{2} d^T \nabla^2 f(x_k) d + \nabla f(x_k)^T d$  (3)

and minimize that approximation over a ball with radius $r_k > 0$:

$\min_{d \in \mathbb{R}^n} M_k(d) \quad \text{s.t.} \quad \|d\| \le r_k$  (4)

to generate a search direction $d_k$. One important practical question is: given a candidate search direction $d_k$, how can we verify that it solves (4)? For this one can use the following well-known fact.

Fact 1 (Theorem 4.1 [44]). The direction $d_k$ exactly solves (4) if and only if there exists $\delta_k \in [0, \infty)$ such that:

$\nabla M_k(d_k) + \delta_k d_k = 0$  (5a)
$\delta_k r_k \le \delta_k \|d_k\|$  (5b)
$\|d_k\| \le r_k$  (5c)
$\nabla^2 f(x_k) + \delta_k I \succeq 0.$  (5d)

Footnotes:
¹ Obtaining this bound does require careful inspection of Cartis, Gould and Toint [35] (who highlighted only the $\epsilon$-dependence of their bound). For simplicity of discussion we assume the ARC subproblems are solved exactly (i.e., C = 0, $\kappa_\theta = 0$), and that the initial regularization parameter satisfies $\sigma_0 = O(L + \sigma_{\min})$ (the bound only gets worse otherwise). We also consider only the bound on the number of Hessian evaluations; including the unsuccessful iterations (where cubic regularized subproblems are still solved) makes this bound even worse. Finally, we ignore the problem-independent parameters $\gamma_1$, $\gamma_2$, $\gamma_3$, and $\eta_1$.
² Since by our assumption the function $f$ has $L$-Lipschitz Hessian, we only consider the case when the Hölder exponent $\nu = 1$. Note also that the algorithm description [33, Eq. 5.12] requires that the initial regularization parameter $\sigma_0$ ($H_0$ using their notation) satisfies $H_0 \in (0, H_f(v)]$ where $H_f(v)$ is defined in [33, Eq. 2.1]. Technically this condition is not verifiable as $H_f(v)$ is unknown in practice. However, one can readily modify [33] by redefining $H_f(v)$ to be the maximum of $H_0$ and the RHS of [33, Eq. 2.1] to remove the requirement that $H_0 \in (0, H_f(v)]$. This gives the bound stated in Table 1.
³ We only consider the case for the cubic regularized model when $p = 2$ and $r = p + 1 = 3$. Also, since by our assumption the function $f$ has $L$-Lipschitz Hessian, we only consider the case when the Hölder exponent $\beta_2 = 1$.

In practice, it is not possible to exactly solve the trust-region subproblem defined in (4); instead we only require that the trust-region subproblem is approximately solved. For our method, it suffices to find a direction $d_k$ satisfying:

$\|\nabla M_k(d_k) + \delta_k d_k\| \le \gamma_1 \|\nabla f(x_k + d_k)\|$  (6a)
$\gamma_2 \delta_k r_k \le \delta_k \|d_k\|$  (6b)
$\|d_k\| \le r_k$  (6c)
$M_k(d_k) \le -\gamma_3 \frac{\delta_k}{2} \|d_k\|^2$  (6d)

where $\delta_k$ denotes the solution of the above system and $\gamma_1 \in [0, 1)$, $\gamma_2 \in (1/\omega, 1]$, $\gamma_3 \in (0, 1]$. Setting $\gamma_1 = 0$, $\gamma_2 = 1$, $\gamma_3 = 1$ recovers the exact version of these conditions. As Lemma 1 shows, exactly solving the trust-region subproblem gives a solution to the system (6). However, the converse is not true: an exact solution to (6) does not necessarily solve the trust-region subproblem. Nonetheless, these conditions are all we need to prove our results, and they are easier to verify than a relaxation of (4) that includes a requirement like (5d), which needs a computationally expensive eigenvalue calculation.
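To make conditions (6) concrete, the following simplified Julia sketch produces a candidate pair $(d_k, \delta_k)$ for small dense problems and tests (6). This is only an illustrative sketch, not our actual implementation (see Appendix C and our code repository): it uses a dense eigendecomposition for simplicity, ignores the hard case of the subproblem, the helper names are illustrative, and the $\gamma$ defaults are the experimental values from Section 5.

```julia
using LinearAlgebra

# Sketch: solve min M(d) s.t. ‖d‖ ≤ r in the spirit of Fact 1, by searching
# for δ ≥ max{0, -λmin(H)} with ‖(H + δI)⁻¹g‖ = r. The "hard case" (g
# orthogonal to the bottom eigenspace of H) is not handled here.
function solve_tr_subproblem(g::Vector{Float64}, H::Symmetric{Float64}, r::Float64)
    λ = eigmin(H)
    if λ > 0
        d = -(H \ g)                    # unconstrained Newton step, δ = 0
        norm(d) <= r && return d, 0.0   # satisfies (5a)-(5d) with δ = 0
    end
    # ‖(H + δI)⁻¹g‖ is decreasing in δ on (max{0, -λ}, ∞), so bisect on δ.
    δ_lo = max(0.0, -λ) + 1e-12
    δ_hi = max(1.0, 2δ_lo)
    while norm((H + δ_hi * I) \ g) > r
        δ_hi *= 2
    end
    for _ in 1:100
        δ_mid = (δ_lo + δ_hi) / 2
        norm((H + δ_mid * I) \ g) > r ? (δ_lo = δ_mid) : (δ_hi = δ_mid)
    end
    return -((H + δ_hi * I) \ g), δ_hi
end

# Test the inexact acceptance conditions (6) for a candidate pair (d, δ).
function satisfies_conditions_6(g, H, d, δ, r, g_next; γ₁=0.0, γ₂=0.8, γ₃=1.0)
    M = 0.5 * dot(d, H, d) + dot(g, d)                      # model (3) at d
    return norm(H * d + g + δ * d) <= γ₁ * norm(g_next) &&  # (6a), ∇M(d) = Hd + g
           γ₂ * δ * r <= δ * norm(d) &&                     # (6b)
           norm(d) <= r &&                                  # (6c)
           M <= -γ₃ * (δ / 2) * norm(d)^2                   # (6d)
end
```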
Lemma 1. Any solution to (5) is a solution to (6) with $\gamma_1 = 0$, $\gamma_2 = 1$, $\gamma_3 = 1$.

Proof. The only tricky part is proving (6d). However, this can be shown using standard arguments:

$M_k(d_k) = \frac{1}{2} d_k^T \nabla^2 f(x_k) d_k + \nabla f(x_k)^T d_k = -\frac{1}{2} d_k^T (\nabla^2 f(x_k) + 2\delta_k I) d_k \le -\frac{\delta_k}{2} \|d_k\|^2$

where the second equality uses (5a) and the inequality uses (5d). □

2.2 Our trust-region method

An important component of a trust-region method is how the radius $r_k$ is computed at each iteration. This choice is based on whether the observed function value reduction $f(x_k) - f(x_k + d_k)$ is comparable to the reduction predicted by the second-order Taylor series expansion $M_k$. In particular, given a search direction $d_k$, existing trust-region methods compute the ratio

$\rho_k := \frac{f(x_k) - f(x_k + d_k)}{-M_k(d_k)}$  (7)

and then increase $r_k$ if $\rho_k \ge \beta$ or decrease $r_k$ if $\rho_k < \beta$ [22]. Unfortunately, while intuitive, this criterion is provably bad, in the sense that one can construct examples of functions with Lipschitz continuous Hessians where any trust-region method that uses this criterion will have a convergence rate proportional to $\epsilon^{-2}$ [45, Section 3].

Instead of (7), we introduce a variant of this ratio by adding the term $\frac{\theta}{2}\|\nabla f(x_k + d_k)\|\,\|d_k\|$ to the predicted reduction, where $\theta \in (0, \infty)$ is a problem-independent hyperparameter (we use $\theta = 0.1$ in our implementation). This requires the algorithm to reduce the function value more when the gradient norm at the candidate solution $x_k + d_k$ and the search direction norm are big. In particular, we define our new ratio as:

$\hat\rho_k := \frac{f(x_k) - f(x_k + d_k)}{-M_k(d_k) + \frac{\theta}{2}\|\nabla f(x_k + d_k)\|\,\|d_k\|}$  (8)

Our trust-region method is presented in Algorithm 1. The algorithm includes some other minor modifications of classic trust-region methods [22]: we accept all search directions that reduce the function value, and we update $r_{k+1}$ using $\|d_k\|$ instead of $r_k$ (see [46, Equation 13.6.13] for a similar update rule). We recommend contrasting our algorithm with [36, Algorithm 1], a trust-region method with an iteration bound proportional to $\epsilon^{-3/2}$ that is more complex and not consistently adaptive. For the remainder of this paper, $x_k$ and $d_k$ refer to the iterates of Algorithm 1.

Algorithm 1: Consistently Adaptive Trust Region Method (CAT)
  Input requirements: $r_1 \in (0, \infty)$, $x_1 \in \mathbb{R}^n$;
  Problem-independent parameter requirements: $\theta \in (0, 1)$, $\beta \in (0, 1)$, $\omega \in (1, \infty)$, $\gamma_1 \in [0, 1)$, $\gamma_2 \in (1/\omega, 1]$, $\gamma_3 \in (0, 1]$, $\frac{\beta\theta}{\gamma_3(1 - \beta)} + \gamma_1 < 1$;
  for $k = 1, 2, \dots$ do
    Approximately solve the trust-region subproblem, i.e., obtain $d_k$ that satisfies (6);
    $x_{k+1} \leftarrow x_k + d_k$ if $f(x_k + d_k) < f(x_k)$, and $x_{k+1} \leftarrow x_k$ otherwise;
    $r_{k+1} \leftarrow \omega \|d_k\|$ if $\hat\rho_k \ge \beta$, and $r_{k+1} \leftarrow \|d_k\| / \omega$ otherwise;
  end
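To make the control flow concrete, here is a minimal Julia sketch of Algorithm 1's main loop. Again this is an illustrative sketch rather than our actual implementation; it reuses the hypothetical `solve_tr_subproblem` from Section 2.1, and `f`, `grad`, and `hess` are assumed user-supplied callbacks for $f$, $\nabla f$, and $\nabla^2 f$ (defaults are the parameter values reported in Section 5).

```julia
using LinearAlgebra

# Sketch of Algorithm 1 (CAT): accept any step that decreases f; grow or
# shrink the radius based on the modified ratio ρ̂ in (8); the new radius is
# a multiple of ‖d‖ rather than of the previous radius.
function cat_method(f, grad, hess, x; r=1.0, θ=0.1, β=0.1, ω=8.0, ϵ=1e-5, max_iter=10_000)
    for _ in 1:max_iter
        g, H = grad(x), Symmetric(hess(x))
        d, δ = solve_tr_subproblem(g, H, r)
        g_next = grad(x + d)
        norm(g_next) <= ϵ && return x + d    # ϵ-approximate stationary point found
        M = 0.5 * dot(d, H, d) + dot(g, d)   # model value (3) at d
        fx, fxd = f(x), f(x + d)
        # Ratio (8): θ/2·‖∇f(x+d)‖·‖d‖ is added to the predicted reduction -M.
        ρ̂ = (fx - fxd) / (-M + (θ / 2) * norm(g_next) * norm(d))
        fxd < fx && (x = x + d)              # accept any decrease in f
        r = ρ̂ >= β ? ω * norm(d) : norm(d) / ω
    end
    return x
end
```

Note that the radius update is a multiple of $\|d_k\|$ rather than of $r_k$, and any step that decreases $f$ is accepted; these are the two departures from classic trust-region methods discussed above.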
3 Proof of full adaptivity on Lipschitz continuous functions

This section proves that our method is consistently adaptive for finding approximate stationary points on functions with $L$-Lipschitz Hessians. The core idea behind our proof is to get a handle on the size of $\|d_k\|$. In particular, if we can bound $\|d_k\|$ from below and $\hat\rho_k \ge \beta$, then the $\frac{\theta}{2}\|\nabla f(x_k + d_k)\|\,\|d_k\|$ term guarantees that at iteration $k$ the function value is reduced by a large amount relative to the gradient norm $\|\nabla f(x_k + d_k)\|$. Lemma 2 guarantees that, under certain conditions, the norm of the gradient at the candidate solution $x_k + d_k$ lower bounds the size of $\|d_k\|$. Note this bound on the gradient, i.e., (11), holds without us needing to know the Lipschitz constant of the Hessian $L$. The proof of Lemma 2 appears in Section A.1 and heavily leverages Fact 2.

Fact 2 (Nesterov & Polyak 2006, Lemma 1 [31]). If $\nabla^2 f$ is $L$-Lipschitz, then

$\|\nabla f(x_k + d_k) - \nabla M_k(d_k)\| \le \frac{L}{2} \|d_k\|^2$  (9)
$f(x_k + d_k) \le f(x_k) + M_k(d_k) + \frac{L}{6} \|d_k\|^3.$  (10)

Lemma 2. Suppose $\nabla^2 f$ is $L$-Lipschitz. If $\|d_k\| < \gamma_2 r_k$ or $\hat\rho_k < \beta$, then

$\|\nabla f(x_k + d_k)\| \le c_1 L \|d_k\|^2$  (11)

where $c_1 > 0$ is a problem-independent constant:

$c_1 := \max\left\{ \frac{5 - 3\beta}{6(\gamma_3 (1 - \gamma_1)(1 - \beta) - \beta\theta)},\ \frac{1}{2(1 - \gamma_1)} \right\}.$

For the remainder of this section we will find the following quantities useful:

$\underline{d}_\epsilon := \gamma_2 \omega^{-1} c_1^{-1/2} L^{-1/2} \epsilon^{1/2}, \qquad \bar{d}_\epsilon := \frac{2\omega \Delta_f}{\beta\theta\epsilon}.$

As we will show shortly in Lemma 4, after a short warm-up period $\underline{d}_\epsilon$ and $\bar{d}_\epsilon$ represent lower and upper bounds on $\|d_k\|$ (i.e., $\underline{d}_\epsilon \le \|d_k\| \le \bar{d}_\epsilon$) as long as $\|\nabla f(x_k + d_k)\| \ge \epsilon$. But before presenting and proving Lemma 4 we develop Lemma 3, which is a stepping stone to proving Lemma 4. Lemma 3 shows that if $\|d_k\|$ is almost above $\bar{d}_\epsilon$ then the trust-region radius will shrink, and if $\|d_k\|$ is almost below $\underline{d}_\epsilon$ then the trust-region radius will grow (recall from Algorithm 1 that $\omega \in (1, \infty)$).

Lemma 3. Suppose $\nabla^2 f$ is $L$-Lipschitz. Let $\epsilon \in (0, \infty)$ and $\|\nabla f(x_k + d_k)\| \ge \epsilon$. Then:

1. If $\|d_k\| > \bar{d}_\epsilon / \omega$ then $r_{k+1} = \|d_k\| / \omega$.
2. If $\|d_k\| < \omega \gamma_2^{-1} \underline{d}_\epsilon$ then $\gamma_2 r_k \le \|d_k\| \le r_k$ and $r_{k+1} = \omega \|d_k\|$.

Proof. Proof of 1. We have

$\|d_k\| > \bar{d}_\epsilon / \omega = \frac{2\Delta_f}{\beta\theta\epsilon} \ge \frac{2(f(x_k) - f(x_{k+1}))}{\beta\theta\epsilon}$  (12)

where the equality uses the definition of $\bar{d}_\epsilon$ and the inequality uses $f(x_{k+1}) \ge \inf_{x \in \mathbb{R}^n} f(x)$ and $f(x_1) \ge f(x_k)$. Furthermore,

$\hat\rho_k = \frac{f(x_k) - f(x_k + d_k)}{-M_k(d_k) + \frac{\theta}{2}\|\nabla f(x_k + d_k)\|\,\|d_k\|} \le \frac{f(x_k) - f(x_{k+1})}{\frac{\theta}{2}\|\nabla f(x_k + d_k)\|\,\|d_k\|} \le \frac{2(f(x_k) - f(x_{k+1}))}{\theta\epsilon\|d_k\|} < \beta$

where the first inequality follows from $M_k(d_k) \le 0$ and $f(x_{k+1}) \le f(x_k + d_k)$, the second inequality follows from the fact that $\|\nabla f(x_k + d_k)\| \ge \epsilon$, and the third inequality uses (12). By inspection of Algorithm 1, if $\hat\rho_k < \beta$ then $r_{k+1} = \|d_k\| / \omega$.

Proof of 2. We prove the result by contrapositive. In particular, suppose that $\neg(\gamma_2 r_k \le \|d_k\| \le r_k)$ or $r_{k+1} \ne \omega \|d_k\|$. Let us consider these two cases. If $\neg(\gamma_2 r_k \le \|d_k\| \le r_k)$, then as $\|d_k\| \le r_k$ we have $\gamma_2 r_k > \|d_k\|$. If $r_{k+1} \ne \omega \|d_k\|$, then by inspection of Algorithm 1 we have $\hat\rho_k < \beta$. Therefore, in both cases the premise of Lemma 2 holds. Now, by $\|\nabla f(x_k + d_k)\| \ge \epsilon$ and Lemma 2 we get $\epsilon \le \|\nabla f(x_k + d_k)\| \le c_1 L \|d_k\|^2$, which implies $\|d_k\| \ge c_1^{-1/2} L^{-1/2} \epsilon^{1/2} = \gamma_2^{-1} \omega (\gamma_2 \omega^{-1} c_1^{-1/2} L^{-1/2} \epsilon^{1/2}) = \gamma_2^{-1} \omega \underline{d}_\epsilon$. □

Now we show that the norm of the direction $d_k$, after some finite iteration $k_\epsilon$, is bounded below and above by $\underline{d}_\epsilon$ and $\bar{d}_\epsilon$ respectively. To that end, we first define

$K_\epsilon := \min\{\{k \in \mathbb{N} : \|\nabla f(x_k + d_k)\| \le \epsilon\} \cup \{\infty\}\}$

as the first iteration for which $\|\nabla f(x_k + d_k)\| \le \epsilon$, and we also define

$k_\epsilon := \min\{\{k \in \mathbb{N} : \underline{d}_\epsilon \le \|d_k\| \le \bar{d}_\epsilon\} \cup \{K_\epsilon - 1\}\}$

as the first iteration for which $\underline{d}_\epsilon \le \|d_k\| \le \bar{d}_\epsilon$. An illustration of Lemma 4 is given in Figure 1. In particular, after a certain warm-up period the direction norms can no longer rise above $\bar{d}_\epsilon$ or fall below $\underline{d}_\epsilon$. Broadly speaking, the idea behind the proof is that if $\|d_k\|$ is above $\bar{d}_\epsilon / \omega$ then at the next iteration it decreases, and conversely if $\|d_k\|$ is below $\omega \gamma_2^{-1} \underline{d}_\epsilon$ then at the next iteration it must increase.

Lemma 4. Suppose $\nabla^2 f$ is $L$-Lipschitz and let $\epsilon \in (0, \infty)$. If $\underline{d}_\epsilon \le \|d_k\| \le \bar{d}_\epsilon$ then $\underline{d}_\epsilon \le \|d_j\| \le \bar{d}_\epsilon$ for all $j \in [k, K_\epsilon) \cap \mathbb{N}$. Furthermore, $k_\epsilon \le 1 + \log_{\gamma_2\omega}(\max\{1, \bar{d}_\epsilon / r_1, r_1 / \underline{d}_\epsilon\})$.

Proof. We begin by proving that $\underline{d}_\epsilon \le \|d_k\| \le \bar{d}_\epsilon$ implies $\underline{d}_\epsilon \le \|d_j\| \le \bar{d}_\epsilon$ for $j \in [k, K_\epsilon) \cap \mathbb{N}$. We assume that $k < K_\epsilon$, otherwise our desired conclusion clearly holds. We split this proof into two claims.

Our first claim is that $\|d_k\| \le \bar{d}_\epsilon$ implies $\|d_{k+1}\| \le \bar{d}_\epsilon$. We split $\|d_k\| \le \bar{d}_\epsilon$ into two subcases. If $\|d_k\| \le \bar{d}_\epsilon / \omega$, then inspection of Algorithm 1 shows that $\|d_{k+1}\| \le r_{k+1} \le \omega \|d_k\| \le \bar{d}_\epsilon$. If $\bar{d}_\epsilon / \omega < \|d_k\| \le \bar{d}_\epsilon$, then Lemma 3.1 implies that $\|d_{k+1}\| \le r_{k+1} = \|d_k\| / \omega \le \bar{d}_\epsilon$.

Our second claim is that $\underline{d}_\epsilon \le \|d_k\|$ implies $\underline{d}_\epsilon \le \|d_{k+1}\|$. We split $\underline{d}_\epsilon \le \|d_k\|$ into three subcases. If $\|d_{k+1}\| < \gamma_2 r_{k+1}$, then the contrapositive of Lemma 3.2 implies that $\|d_{k+1}\| \ge \omega \gamma_2^{-1} \underline{d}_\epsilon \ge \underline{d}_\epsilon$.
If $\gamma_2 r_{k+1} \le \|d_{k+1}\| \le r_{k+1}$ and $\underline{d}_\epsilon \le \|d_k\| < \gamma_2^{-1} \omega \underline{d}_\epsilon$, then

$\underline{d}_\epsilon \le \gamma_2 \omega \underline{d}_\epsilon \le \gamma_2 \omega \|d_k\| \stackrel{\star}{=} \gamma_2 r_{k+1} \le \|d_{k+1}\|$

where $\star$ uses Lemma 3.2. If $\gamma_2 r_{k+1} \le \|d_{k+1}\| \le r_{k+1}$ and $\omega \gamma_2^{-1} \underline{d}_\epsilon \le \|d_k\|$, then

$\underline{d}_\epsilon \le \gamma_2 \|d_k\| / \omega \stackrel{\star}{\le} \gamma_2 r_{k+1} \le \|d_{k+1}\|$

where $\star$ follows from the update rule for $r_{k+1}$ in Algorithm 1 (which guarantees $r_{k+1} \ge \|d_k\| / \omega$). By induction on the previous two claims we deduce that if $\underline{d}_\epsilon \le \|d_k\| \le \bar{d}_\epsilon$ then $\underline{d}_\epsilon \le \|d_j\| \le \bar{d}_\epsilon$ for $j \in [k, K_\epsilon) \cap \mathbb{N}$.

Next, we prove that if $k \ge 1 + \log_{\omega\gamma_2}(\underline{d}_\epsilon / r_1)$ then $\|d_k\| \ge \underline{d}_\epsilon$. If $\underline{d}_\epsilon \le \|d_j\|$ for some $j \le k$, then the result holds because, as we already established, $\underline{d}_\epsilon \le \|d_k\|$ implies $\underline{d}_\epsilon \le \|d_{k+1}\|$. On the other hand, if $\|d_j\| < \underline{d}_\epsilon$ for all $j \le k$, then Lemma 3.2 implies that $r_{j+1} = \omega \|d_j\| \ge \omega\gamma_2 r_j$, which by induction gives $\|d_k\| \ge (\omega\gamma_2)^{k-1} r_1 \ge (\omega\gamma_2)^{\log_{\omega\gamma_2}(\underline{d}_\epsilon / r_1)} r_1 = \underline{d}_\epsilon$.

Finally, we prove that if $k \ge 1 + \log_\omega(r_1 / \bar{d}_\epsilon)$ then $\|d_k\| \le \bar{d}_\epsilon$. If $\|d_j\| \le \bar{d}_\epsilon$ for some $j \le k$, then the result holds because, as we already established, $\|d_k\| \le \bar{d}_\epsilon$ implies $\|d_{k+1}\| \le \bar{d}_\epsilon$. On the other hand, if $\|d_j\| > \bar{d}_\epsilon$ for all $j \le k$, then Lemma 3.1 implies that $r_{j+1} = \|d_j\| / \omega \le r_j / \omega$, which by induction gives $\|d_k\| \le \omega^{1-k} r_1 \le \omega^{-\log_\omega(r_1 / \bar{d}_\epsilon)} r_1 = \bar{d}_\epsilon$. □

[Figure 1: An example of a plausible sequence of iterates and the norms of their directions. Each red dot represents an iterate and its search direction norm. Horizontal lines mark the maximum ($\bar{d}_\epsilon$) and minimum ($\underline{d}_\epsilon$) for $\|d_k\|$; above the upper line $\|d_k\|$ always decreases on the next iteration, while below the lower line it always increases on the next iteration. This illustrates Lemma 4.]

Let $P_\epsilon := \{k \in \mathbb{N} : \hat\rho_k \ge \beta,\ k_\epsilon \le k < K_\epsilon\}$, which represents the set of iterations, before we find an $\epsilon$-approximate stationary point, where the function value is reduced by a large amount compared with our target reduction, i.e., $\hat\rho_k \ge \beta$. Lemma 5 shows that there is a finite number of these iterations until the gradient drops below the target threshold $\epsilon$. The proof of Lemma 5 appears in Appendix A.2. Roughly, the idea of the proof is that, due to our definition of $\hat\rho_k$, when $\hat\rho_k \ge \beta$ we always reduce the function value by at least $\frac{\beta\theta}{2}\|\nabla f(x_k + d_k)\|\,\|d_k\|$, and $\|d_k\|$ can be lower bounded by $\underline{d}_\epsilon$ using Lemma 4. As we cannot reduce the function value by a constant amount indefinitely, we must eventually have $\|\nabla f(x_k + d_k)\| \le \epsilon$.

Lemma 5. Suppose $\nabla^2 f$ is $L$-Lipschitz and $\epsilon \in (0, \infty)$. Then

$|P_\epsilon| \le \frac{\bar{d}_\epsilon}{\omega \underline{d}_\epsilon} + 1 = \frac{2\omega c_1^{1/2}}{\beta\theta\gamma_2} \Delta_f L^{1/2} \epsilon^{-3/2} + 1.$

With Lemma 5 in hand we are now ready to prove our main result, Theorem 1. We have already provided a bound on the length of the warm-up period $k_\epsilon$ (Lemma 4) and on the number of iterations with $\hat\rho_k \ge \beta$ (Lemma 5). Therefore, the only obstacle is to bound the number of iterations with $\hat\rho_k < \beta$. However, on these iterations we always decrease the radius by at least a factor of $\omega$ (see the update rules in Algorithm 1), and therefore, as $\|d_k\|$ is bounded below by $\underline{d}_\epsilon$, there must be iterations where we increase the radius $r_k$, which by definition of Algorithm 1 only occurs if $\hat\rho_k \ge \beta$. Consequently, the number of iterations where $\hat\rho_k < \beta$ can be bounded by the number of iterations where $\hat\rho_k \ge \beta$ plus an $O(1)$ term. This is the crux of the proof of Theorem 1, which appears in Section A.3.

Theorem 1. Suppose that $\nabla^2 f$ is $L$-Lipschitz and $f$ is bounded below with $\Delta_f = f(x_1) - f^*$. Then for all $\epsilon \in (0, \infty)$ there exists some iteration $k$ with $\|\nabla f(x_k + d_k)\| \le \epsilon$ and

$k = O\!\left( \Delta_f L^{1/2} \epsilon^{-3/2} + \log\!\left( L^{1/2} r_1 \epsilon^{-1/2} + \frac{\Delta_f}{r_1 \epsilon} \right) \right)$

where $O(\cdot)$ hides problem-independent constant factors and $r_1$ is the initial trust-region radius.

One drawback of Theorem 1 is that it only bounds the number of iterations to find a first-order stationary point. Many second-order methods in the literature show convergence to points satisfying the second-order optimality conditions [47, 35, 36, 18]. Of course, these methods are not consistently adaptive.
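To summarize how the pieces of Theorem 1 fit together (a sketch only; the precise accounting appears in Section A.3): between iterations $k_\epsilon$ and $K_\epsilon$ there are at most $|P_\epsilon|$ iterations with $\hat\rho_k \ge \beta$ and, by the radius-counting argument above, at most $|P_\epsilon| + O(1)$ iterations with $\hat\rho_k < \beta$. Hence

$K_\epsilon \le k_\epsilon + 2|P_\epsilon| + O(1) \le 1 + \log_{\gamma_2\omega}\!\big(\max\{1,\ \bar{d}_\epsilon / r_1,\ r_1 / \underline{d}_\epsilon\}\big) + \frac{4\omega c_1^{1/2}}{\beta\theta\gamma_2} \Delta_f L^{1/2} \epsilon^{-3/2} + O(1),$

and since $\bar{d}_\epsilon / r_1 \propto \Delta_f / (r_1 \epsilon)$ and $r_1 / \underline{d}_\epsilon \propto L^{1/2} r_1 \epsilon^{-1/2}$, this matches the $O\big(\Delta_f L^{1/2} \epsilon^{-3/2} + \log(L^{1/2} r_1 \epsilon^{-1/2} + \Delta_f / (r_1 \epsilon))\big)$ bound of Theorem 1.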
An interesting direction for future work would be to develop a method that provides a consistently adaptive convergence guarantee for finding second-order stationary points.

4 Quadratic convergence when sufficient conditions for local optimality hold

Theorem 2. Suppose $f$ is twice differentiable and for some $x^* \in \mathbb{R}^n$ the second-order sufficient conditions for local optimality hold ($\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$). Under these conditions there exists a neighborhood $N$ around $x^*$ and a constant $c > 0$ such that if $x_i \in N$ for some $i \in \mathbb{N}$, then there exists $\hat{k} \in \mathbb{N}$ such that for all $k \ge \hat{k}$ we have

$\|x_{k+1} - x^*\| \le c \|x_k - x^*\|^2.$

The proof of Theorem 2 appears in Section B. It is a little trickier than typical quadratic convergence proofs for trust-region methods because our method may have $r_k \to 0$ as $k \to \infty$, whereas classical trust-region methods keep $r_k$ bounded away from zero [22, Proof of Theorem 4.14]. Fortunately, one can show that asymptotically $r_k \ge \omega \|x_k - x^*\|$, so the decaying radius does not interfere with quadratic convergence. In particular, the crux of proving Theorem 2 is establishing the premise of Lemma 6 (as the conclusion of Lemma 6 is $r_k \ge \omega \|x_k - x^*\|$). For this lemma we define $\mathrm{diam}(X) := \sup_{x, x' \in X} \|x - x'\|$.

Lemma 6. Let $N$ be a bounded set such that for all $x_k \in N$ we have $x_{k+1} \in N$, $\hat\rho_k \ge \beta$, and $\min\{\gamma_2 r_k, \|x_{k+1} - x^*\|\} \le \|d_k\| \le \omega\gamma_2 \|x_k - x^*\|$. Suppose there exists $x_k \in N$ and let $i$ be the smallest index with $x_i \in N$. Then $r_k \ge \omega \|x_k - x^*\|$ for all

$k \ge 2 + i + \log_{\gamma_2\omega}\!\left( \frac{\mathrm{diam}(N)}{\|d_i\|} \right).$

Proof. Let $k \ge i$. By induction $x_k \in N$. By $\hat\rho_k \ge \beta$ and inspection of Algorithm 1 we have $r_{k+1} = \omega \|d_k\|$. Suppose that $\|d_k\| \ge \gamma_2 r_k$ for all $i \le k \le i + \ell$. Then

$\mathrm{diam}(N) \ge \|d_{k+1}\| \ge \gamma_2 r_{k+1} = \gamma_2 \omega \|d_k\| \ge \dots \ge (\gamma_2\omega)^\ell \|d_i\|.$

Rearranging gives $\ell \le \log_{\gamma_2\omega}(\mathrm{diam}(N) / \|d_i\|)$. Next, observe that if $\|d_k\| < \gamma_2 r_k$ then $\|x_{k+1} - x^*\| \le \|d_k\|$ and therefore $\|d_{k+1}\| \le \omega\gamma_2 \|x_{k+1} - x^*\| \le \omega\gamma_2 \|d_k\| = (\omega\gamma_2 / \omega) r_{k+1} = \gamma_2 r_{k+1}$. By induction, $\|d_k\| < \gamma_2 r_k$ for all $k > i + \ell$. Finally, observe that if $\|d_k\| < \gamma_2 r_k$ then $\|x_{k+1} - x^*\| \le \|d_k\|$ and hence $r_{k+1} = \omega \|d_k\| \ge \omega \|x_{k+1} - x^*\|$. □

5 Experimental results

We test our algorithm on learning linear dynamical systems [48], matrix completion [49], and the CUTEst test set [50]. Appendix D contains the complete set of results from our experiments. Our method is implemented in an open-source Julia module available at https://github.com/fadihamad94/CAT-NeurIPS. The implementation uses a factorization and eigendecomposition approach to solve the trust-region subproblems (i.e., satisfy (6)). We perform our experiments using Julia 1.6 on a Linux virtual machine with 8 CPUs and 16 GB RAM. The CAT code repository provides instructions for reproducing the experiments and detailed tables of results.

For these experiments, the selection of the parameters (unless otherwise specified) is as follows: $r_1 = 1.0$, $\beta = 0.1$, $\theta = 0.1$, $\omega = 8.0$, $\gamma_1 = 0.0$, $\gamma_2 = 0.8$, and $\gamma_3 = 1.0$. When implementing Algorithm 1 with some target tolerance $\epsilon$, we immediately terminate when we observe a point $x_k$ with $\|\nabla f(x_k + d_k)\| \le \epsilon$; this also includes the case when we check the inner termination criteria for the trust-region subproblem. The full details of the implementation are described in Appendix C.

5.1 Learning linear dynamical systems

We test our method on learning linear dynamical systems [48] to see how efficient our method is compared to a trust-region solver. We synthetically generate an example with noise both in the observations and in the evolution of the system, and then recover the parameters using maximum likelihood estimation. Details are provided in Appendix D.1. On this problem we compare our algorithm with a Newton trust-region method that is available through the Optim.jl package [51]. The comparisons are summarized in Table 2.
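For reference, the Newton trust-region baseline can be invoked through Optim.jl along the following lines. This is a minimal sketch of the comparison setup rather than our full benchmarking harness; `f`, `g!`, `h!`, and `x0` are assumed user-supplied objective, in-place gradient and Hessian callbacks, and starting point.

```julia
using Optim

# Newton trust-region baseline [51] with the stopping rule from Section 5:
# gradient tolerance 1e-5 and a 10_000-iteration cap (runs exceeding the
# cap are counted as failures).
result = optimize(f, g!, h!, x0, NewtonTrustRegion(),
                  Optim.Options(g_tol = 1e-5, iterations = 10_000))
Optim.converged(result) || @warn "run counted as a failure"
```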
5.2 Matrix completion

We also demonstrate the effectiveness of our algorithm against a trust-region solver on the matrix completion problem. The matrix completion formulation can be written as the regularized squared-error function of an SVD model [49, Equation 10]. For our experiment, we use the public data set of Ausgrid, but we only use the data from a single substation. Details are provided in Appendix D.2. Again we compare our algorithm with a Newton trust-region method [51]. The comparison is summarized in Table 3.

Table 2: Geometric mean of the total number of iterations, and of the function and gradient evaluations, per solver on 60 randomly generated instances with $\epsilon = 10^{-5}$ termination tolerance. Failures are counted as the maximum number of iterations (10000) when computing the geometric mean.

 | #iter | #function | #gradient
Newton trust-region [51] | 480.1 | 482.3 | 482.3
Our method | 308.1 | 309.6 | 309.6
95% CI for ratio | [1.24, 1.95] | [1.24, 1.94] | [1.24, 1.94]

Table 3: Geometric mean of the total number of iterations per solver on 10 instances generated by randomly sampling the measurements from the matrix D using data from Ausgrid, with $\epsilon = 10^{-5}$ termination tolerance. Failures are counted as the maximum number of iterations (1000) when computing the geometric mean.

Newton trust-region [51] | 1000
Our method | 216.4

5.3 Results on the CUTEst test set

The CUTEst test set [50] is a standard test set for nonlinear optimization algorithms. To run the benchmarks we use https://github.com/JuliaSmoothOptimizers/CUTEst.jl (the license can be found at https://github.com/JuliaSmoothOptimizers/CUTEst.jl/blob/main/LICENSE.md). We compare with the results for ARC reported in [52, Table 1]. As the CUTEst benchmark has changed since [52] was written, we select only the problems in CUTEst that remain the same (some of the problem sizes have changed). This gives 67 instances. A table with our full results can be found in the results/CUTEst subdirectory in the git repository for this paper.

Cartis et al. [52] report the results for three different implementations of their ARC algorithm. We limit our comparison to the ARC g-rule algorithm since it performs better than the other ARC approaches. We also run the Newton trust-region method from the Optim.jl package [51]. Our algorithm is stopped as soon as $\|\nabla f(x_k + d_k)\|$ is smaller than $10^{-5}$. For the Newton trust-region method [51] we also use $10^{-5}$ as the gradient termination tolerance in the stopping criterion. We use 10000 as an iteration limit, and any run exceeding this is considered a failure. This choice of parameters is consistent with [52].

As we can see from Table 4, our algorithm offers significant promise, requiring a similar number of function evaluations (and therefore subproblem solves) to converge as the Newton trust-region method of [51] and ARC, although the number of gradient evaluations is slightly higher than ARC. In addition, the comparison between these algorithms in terms of the total number of iterations and the total number of gradient evaluations is summarized in Figure 2.

5.4 Convergence of our method with different theta values

To demonstrate the role of the additional $\frac{\theta}{2}\|d_k\|\,\|\nabla f(x_k + d_k)\|$ term, we run experiments with different $\theta$ values. In particular, we contrast our default value of $\theta = 0.1$ with $\theta = 0$, which corresponds to not adding the $\frac{\theta}{2}\|d_k\|\,\|\nabla f(x_k + d_k)\|$ term to $\hat\rho_k$ (recall the discussion in Section 2.2).
Table 4: Number of failures and geometric mean of the total number of iterations, function evaluations, and gradient evaluations per solver on 67 unconstrained instances from the CUTEst benchmark. Failures are counted as the maximum number of iterations (10000) when computing the geometric mean.

 | #failures | #iter | #function | #gradient
Our method | 3 | 41.5 | 44.4 | 44.4
ARC with the g-rule [52] | 1 | 38.1 | 38.1 | 26.6
Newton trust-region [51] | 4 | 44.5 | 47.2 | 47.2

[Figure 2: Fraction of problems solved on the CUTEst benchmark.]

[Figure 3: Performance of our algorithm with different $\theta$ values. (a) Fraction of problems solved versus total number of iterations on the CUTEst test set. (b) Gradient norm versus total number of iterations on the example of [45] with $r_1 = 1.5$.]

In Figure 3a we rerun on the CUTEst test set (as per Section 5.3) and compare these two options. One can see the algorithm performs similarly with either $\theta = 0$ or $\theta = 0.1$. In Figure 3b we test on the hard example from [45], which is designed to exhibit the poor worst-case complexity of trust-region methods (i.e., a convergence rate proportional to $\epsilon^{-2}$) if the initial radius $r_1$ is chosen sufficiently large (to achieve this we set $r_1 = 1.5$). We run this example with $\epsilon = 10^{-3}$. One can see that while for the first $10^4$ iterations the methods follow identical trajectories, thereafter $\theta = 0.1$ rapidly finds a stationary point whereas $\theta = 0.0$ requires two orders of magnitude more iterations to terminate. This crystallizes the importance of $\theta$ in circumventing the worst-case $\epsilon^{-2}$ convergence rate of trust-region methods.

Acknowledgments and Disclosure of Funding

The authors were supported by the Pitt Momentum Funds. The authors would like to thank Coralia Cartis and Nicholas Gould for their helpful feedback on a draft of this paper. The authors would also like to thank Xiaoyi Qu for finding errors in the proof of an early draft of the paper and for helpful discussions.

References

[1] John T Betts. Practical Methods for Optimal Control and Estimation Using Nonlinear Programming. SIAM, 2010.
[2] Klaus Schittkowski, Christian Zillober, and Rainer Zotemantel. Numerical comparison of nonlinear programming algorithms for structural optimization. Structural Optimization, 7(1):1–19, 1994.
[3] Stephen Frank, Steffen Rebennack, et al. A primer on optimal power flow: Theory, formulation, and practical examples. Colorado School of Mines, Tech. Rep., 2012.
[4] Lorenz T Biegler, Omar Ghattas, Matthias Heinkenschloss, and Bart van Bloemen Waanders. Large-scale PDE-constrained optimization: an introduction. In Large-Scale PDE-Constrained Optimization, pages 3–13. Springer, 2003.
[5] Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In International Conference on Machine Learning, pages 1895–1904. PMLR, 2017.
[6] Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic variance-reduced cubic regularized Newton methods. In International Conference on Machine Learning, pages 5990–5999. PMLR, 2018.
[7] Ching-pei Lee, Cong Han Lim, and Stephen J Wright. A distributed quasi-Newton algorithm for empirical risk minimization with nonsmooth regularization. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1646–1655, 2018.
[8] Frank E Curtis, Michael J O'Neill, and Daniel P Robinson. Worst-case complexity of an SQP method for nonlinear equality constrained stochastic optimization. arXiv preprint arXiv:2112.14799, 2021.
[9] Frank E Curtis and Rui Shi. A fully stochastic second-order trust region method.
Optimization Methods and Software, pages 1–34, 2020.
[10] Frank E Curtis, Katya Scheinberg, and Rui Shi. A stochastic trust region algorithm based on careful step normalization. INFORMS Journal on Optimization, 1(3):200–220, 2019.
[11] Vipul Gupta, Avishek Ghosh, Michał Dereziński, Rajiv Khanna, Kannan Ramchandran, and Michael W Mahoney. LocalNewton: Reducing communication rounds for distributed learning. In Uncertainty in Artificial Intelligence, pages 632–642. PMLR, 2021.
[12] Peng Xu, Fred Roosta, and Michael W Mahoney. Newton-type methods for non-convex optimization under inexact Hessian information. Mathematical Programming, 184(1):35–70, 2020.
[13] Chih-Hao Fang, Sudhir B Kylasa, Fred Roosta, Michael W Mahoney, and Ananth Grama. Newton-ADMM: A distributed GPU-accelerated optimizer for multiclass classification problems. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–12. IEEE, 2020.
[14] Shusen Wang, Fred Roosta, Peng Xu, and Michael W Mahoney. GIANT: Globally improved approximate Newton method for distributed optimization. Advances in Neural Information Processing Systems, 31, 2018.
[15] Sen Na, Michał Dereziński, and Michael W Mahoney. Hessian averaging in stochastic Newton methods achieves superlinear convergence. arXiv preprint arXiv:2204.09266, 2022.
[16] Peng Xu, Jiyan Yang, Fred Roosta, Christopher Ré, and Michael W Mahoney. Sub-sampled Newton methods with non-uniform sampling. Advances in Neural Information Processing Systems, 29, 2016.
[17] Rixon Crane and Fred Roosta. DINO: Distributed Newton-type optimization method. In International Conference on Machine Learning, pages 2174–2184. PMLR, 2020.
[18] Junyu Zhang, Lin Xiao, and Shuzhong Zhang. Adaptive stochastic variance reduction for subsampled Newton method with cubic regularization. INFORMS Journal on Optimization, 4(1):45–64, 2022.
[19] Dmitry Kovalev, Konstantin Mishchenko, and Peter Richtárik. Stochastic Newton and cubic Newton methods with simple local linear-quadratic rates. arXiv preprint arXiv:1912.01597, 2019.
[20] Philip E Gill and Walter Murray. Newton-type methods for unconstrained and linearly constrained optimization. Mathematical Programming, 7(1):311–350, 1974.
[21] Jorge J Moré and Danny C Sorensen. On the use of directions of negative curvature in a modified Newton method. Mathematical Programming, 16(1):1–20, 1979.
[22] Danny C Sorensen. Newton's method with a model trust region modification. SIAM Journal on Numerical Analysis, 19(2):409–426, 1982.
[23] Stefan Ulbrich. On the superlinear local convergence of a filter-SQP method. Mathematical Programming, 100(1):217–245, 2004.
[24] Luís N Vicente and Stephen J Wright. Local convergence of a primal-dual method for degenerate nonlinear programming. Computational Optimization and Applications, 22(3):311–328, 2002.
[25] Richard H Byrd, Jean Charles Gilbert, and Jorge Nocedal. A trust region method based on interior point techniques for nonlinear programming. Mathematical Programming, 89(1):149–185, 2000.
[26] Lifeng Chen and Donald Goldfarb. Interior-point ℓ2-penalty methods for nonlinear programming with strong global convergence properties. Mathematical Programming, 108(1):1–36, 2006.
[27] Andrew R Conn, Nicholas IM Gould, Dominique Orban, and Philippe L Toint. A primal-dual trust-region algorithm for non-convex nonlinear programming. Mathematical Programming, 87(2):215–249, 2000.
[28] Nicholas IM Gould, Dominique Orban, and Philippe L Toint.
An interior-point ℓ1-penalty method for nonlinear optimization. Citeseer, 2002.
[29] Andreas Wächter and Lorenz T Biegler. Line search filter methods for nonlinear programming: Motivation and global convergence. SIAM Journal on Optimization, 16(1):1–31, 2005.
[30] Stephen Wright and Jorge Nocedal. Numerical Optimization. Springer Science & Business Media, 2006.
[31] Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
[32] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I. Mathematical Programming, 2020.
[33] Geovani Nunes Grapiglia and Yu Nesterov. Regularized Newton methods for minimizing functions with Hölder continuous Hessians. SIAM Journal on Optimization, 27(1):478–506, 2017.
[34] Coralia Cartis, Nick I Gould, and Philippe L Toint. Universal regularization methods: varying the power, the smoothness and the accuracy. SIAM Journal on Optimization, 29(1):595–615, 2019.
[35] Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Mathematical Programming, 130(2):295–319, 2011.
[36] Frank E Curtis, Daniel P Robinson, and Mohammadreza Samadi. A trust region algorithm with a worst-case iteration complexity of $O(\epsilon^{-3/2})$ for nonconvex optimization. Mathematical Programming, 162(1–2):1–32, 2017.
[37] Frank E Curtis, Daniel P Robinson, Clément W Royer, and Stephen J Wright. Trust-region Newton-CG with strong second-order complexity guarantees for nonconvex optimization. SIAM Journal on Optimization, 31(1):518–544, 2021.
[38] Andrew R Conn, Nicholas IM Gould, and Philippe L Toint. Trust Region Methods. SIAM, 2000.
[39] MJD Powell. On the global convergence of trust region algorithms for unconstrained minimization. Mathematical Programming, 29(3):297–303, 1984.
[40] Richard G Carter. On the global convergence of trust region algorithms using inexact gradient information. SIAM Journal on Numerical Analysis, 28(1):251–265, 1991.
[41] Michael JD Powell. On trust region methods for unconstrained minimization without derivatives. Mathematical Programming, 97(3):605–623, 2003.
[42] Frank E Curtis, Zachary Lubberts, and Daniel P Robinson. Concise complexity analyses for trust region methods. Optimization Letters, 12(8):1713–1724, 2018.
[43] Nicholas IM Gould, Dominique Orban, and Philippe L Toint. CUTEst: a constrained and unconstrained testing environment with safe threads for mathematical optimization. Computational Optimization and Applications, 60(3):545–557, 2015.
[44] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.
[45] Coralia Cartis, Nicholas IM Gould, and Ph L Toint. On the complexity of steepest descent, Newton's and regularized Newton's methods for nonconvex unconstrained optimization problems. SIAM Journal on Optimization, 20(6):2833–2852, 2010.
[46] Wenyu Sun and Ya-Xiang Yuan. Optimization Theory and Methods: Nonlinear Programming, volume 1. Springer Science & Business Media, 2006.
[47] Ernesto G Birgin and José Mario Martínez. The use of quadratic regularization with a cubic descent condition for unconstrained optimization. SIAM Journal on Optimization, 27(2):1049–1074, 2017.
[48] Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. arXiv preprint arXiv:1609.05191, 2016.
[49] Chijie Zhuang, Jianwei An, Zhaoqiang Liu, and Rong Zeng. Power load data completion method considering low rank property. CSEE Journal of Power and Energy Systems, 2020.
[50] D. Orban, A. S. Siqueira, and contributors. CUTEst.jl: Julia's CUTEst interface. https://github.com/JuliaSmoothOptimizers/CUTEst.jl, October 2020.
[51] Patrick Kofod Mogensen and Asbjørn Nilsen Riseth. Optim: A mathematical optimization package for Julia. Journal of Open Source Software, 3(24), 2018.
[52] Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011.
[53] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 2015.
[54] Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 426–434, 2008.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] The comparisons with other algorithms show the limitations of our work.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] See Sections 3 and 4.
   (b) Did you include complete proofs of all theoretical results? [Yes] See Sections 3 and 4.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Code and instructions to reproduce the results are attached in the supplementary materials.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] We use CUTEst.jl and Optim.jl. For both we cite the creators.
   (b) Did you mention the license of the assets? [Yes] See Section 5.3 and Appendix D.1.
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] Our code is attached in the supplementary materials.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [No]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?
   [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]