# an_enhanced_levenbergmarquardt_method_via_gram_reduction__dd6eade0.pdf

An Enhanced Levenberg Marquardt Method via Gram Reduction

Chengchang Liu1, Luo Luo* 2, 3, John C.S. Lui1

1Department of Computer Science and Engineering, The Chinese University of Hong Kong 2School of Data Science, Fudan University 3Shanghai Key Laboratory for Contemporary Applied Mathematics 7liuchengchang@gmail.com, luoluo@fudan.edu.cn, cslui@cse.cuhk.edu.hk

This paper studies the problem of solving the system of nonlinear equations. We propose the Gram-reduced Levenberg Marquardt method, which reuses the Gram matrix. Our method has a global convergence guarantee without relying on any step of line-search or solving sub-problems. We show that our method takes a smaller computational complexity than existing Levenberg Marquardt methods to find the stationary point of the square norm of the equations. We also show that the proposed method enjoys a local superlinear convergence rate under the non-degenerate assumption. Experiments are conducted on real-world applications in scientific computing and machine learning, which validate the efficiency of our method.

1 Introduction

We consider solving the system of nonlinear equations

F(x) = 0, (1)

where x Rd and F(x) [F1(x), . . . , Fd(x)] : Rd Rd with differentiable Fi(x) : Rd R for all i [d]. This problem can also be reformulated by the nonlinear leastsquare problem:

min x Rd ϕ(x) 1

2 F(x) 2. (2)

Solving nonlinear equations is one of the most fundamental problems in scientific computing (Nesterov 2007; Yuan 2011). It has wide applications in machine learning (D efossez and Bach 2015; Botev, Ritter, and Barber 2017; Bai, Kolter, and Koltun 2019; Liu, Zhu, and Belkin 2022), control system (Berthier, Carpentier, and Bach 2021), data assimilation (Tr emolet 2007), and game theory (Frehse and Bensoussan 1984; Nourian and Caines 2013). First-order methods are popular to solve nonlinear equations. Specifically, one can perform the gradient descent on problem (2), that is

xt+1 = xt η ϕ(xt) = xt ηJ(xt) F(xt), (3)

*Corresponding author. Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

where η > 0 is the step-size, ϕ(xt) = J(xt) F(xt) and J( ) [ F1( ), . . . , Fd( )] Rd d is the Jacobian of F( ). The iteration scheme (3) takes O(d2) flops in general. Taking the Lipschitz continuity on ϕ( ), it requires O(ϵ 2) iterations to find an ϵ-stationary point of ϕ( ), then the overall computation cost is O(d2ϵ 2). Gauss Newton methods (Ben-Israel 1966; Nocedal and Wright 1999) consider the estimation 2ϕ( ) J( ) J( ) and establish the iteration

xt+1 = xt J(xt) J(xt) J(xt) F(xt)

with local superlinear convergence, while such needs the flops of O(d3) to compute the matrix (pseudo) inverse. Recently, Liu and Luo (2022) introduced Broyden family updates (Rodomanov and Nesterov 2021; Lin, Ye, and Zhang 2022) to estimate J( ) J( ), which leads to Quasi-Newton methods with local superlinear convergence and reduces the computation cost in per iteration to O(d2) flops. However, both Gauss Newton method and its quasi-Newton variants lack global convergence guarantees. In this paper, we are interested in Levenberg Marquardt (LM) methods (Levenberg 1944; Marquardt 1963), which iterates according to

xt+1 = xt J(xt) J(xt) + λt I 1J(xt) F(xt), (4)

where λt > 0 is the regularization parameter. This method globalizes the Gauss Newton iteration and also maintains local superlinear convergence by properly choosing the regularization term λt (Yamashita and Fukushima 2001; Fan and Yuan 2005). However, the iteration scheme (4) takes O(d3) flops since it contains matrix inversion such as the Gauss Newton update. Moreover, the upper bound of the iteration numbers for the existing global convergent LM methods is not better than O(ϵ 2) (Ueda and Yamashita 2010; Zhao and Fan 2016; Huang and Fan 2018; Bergou, Diouane, and Kungurtsev 2020; Tran-Dinh, Pham, and Nguyen 2020), which makes the total computation cost O(d3ϵ 2). In addition, these methods require the line search step or solving the trust region subproblem (Yuan 1994) to achieve a descent direction, which makes their implementation complicated. Recently, Mishchenko (2023) proposed an adaptive LM method without any line-search step nor sub-problem solver, which determine the regularization term by

J(xt) F(xt) . (5)

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

This method can find an ϵ-stationary within at most O(ϵ 2.5) iterations under the cubic growth condition, which is worse than line-search or trust-region based LM methods. Furthermore, its overall computation cost O(d3ϵ 2.5) is worse than the classical LM methods, and the existence of its local superlinear convergence is unknown. Building upon this, there arises the natural question that: Can we design a computation efficient and globally convergent LM method with local superlinear convergence? We give an affirmative answer to the above question by proposing the Gram-Reduced Levenberg Marquardt (GRLM) method. Instead of existing LM methods that compute the Gram matrix J(xt) J(xt) at every iteration, our method only updates J( ) J( ) at the snapshot point and reuses it in the next m iterations. The update of our method can be formulated by

xt+1 =xt J xπ(t) J xπ(t) +λt I 1 J(xt) F(xt), (6)

where π(t) m t/m and λt is chosen according to equation (5). Note that the scheme (6) only needs to compute the matrix J(xπ(t)) J(xπ(t)) in the case of t 0 (mod m), which significantly reduces the time required to compute the Gram matrix and leads to a better algorithmic complexity. For global behavior, our method takes the overall computational cost of O(d3ϵ 1 + d2ϵ 2) to find an ϵ-stationary point under the cubic-growth condition, which improves the results of O(d3ϵ 2) of existing LM methods. For local behavior, our method has the superlinear rate when the solution has a non-degenerate Jacobian, which goes beyond the theory of Mishchenko (2023). We summarize the main theoretical results of GRLM and related work in Table 1.

Paper Organization The remainder of the paper is organized as follows. In Sections 2 and 3, we introduce the preliminaries and related work, respectively. In Section 4, we provide details of the algorithms and convergence analysis of GRLM. In Section 5, we validate our methods by numerical experiments. We conclude our work and discuss the future directions in Section 6. 1

2 Preliminaries We use the notation to present the Euclidean norm of a vector and the spectral norm of a matrix, respectively. We use the notation I to present the identity matrix. We denote σmin( ) as the smallest singular value of a matrix. For the ease of presentation, we denote the Gram matrix of F( ) by

G(x) J(x) J(x) 0, (7)

where J(x) Rd d is the Jacobian of F : Rd Rd at point x Rd. We also define the approximate stationary point of the function ϕ( ) = 1

2 F( ) 2 as follows.

Definition 1. We say x Rd is an ϵ-stationary point of ϕ( ) if it satisfies J(x) F(x) ϵ. We make the following standard assumptions on the Jacobian of F( ).

1The proof of this paper can be found in https://arxiv.org/pdf/ 2412.08561.

Assumption 2. We assume the Jacobian of F( ) is bounded and Lipschitz continuous, that is, we have J(x) L1 and

J(x) J(y) L2 x y (8)

hold for all x, y Rd and some constants L1, L2 0. Proposition 3. If the Jacobian J( ) satisfies Assumption 2, then it holds that F(x) F(y) L1 x y and

F(x) F(y) J(y)(x y) L2

2 x y 2 (9)

for all x, y Rd. The following proposition means that Assumption 2 leads to the Lipschitz continuity of G( ). Proposition 4. If the function F( ) satisfies Assumption 2, we have

G(x) L2 1 and G(y) G(x) 2L1L2 x y (10)

for all x, y Rd.

3 Related Work The idea of reusing second-order information dates back to the 1960s, where Shamanskii (1967) presented a variant of the Newton method that constructs a new Jacobian every m iterations and analyzed its local behavior. Later, such an idea has been generalized to different types of Newton methods (Fan 2013; Wang, Chen, and Du 2006; Lampariello and Sciandrone 2001; Adler, Hu, and Lin 2020) and to training the large language models (Liu et al. 2023c; Elbakary et al. 2024). Particularly, Fan (2013) proposed Shamanskii Levenberg Marquardt method in the form of

= xt J(xπ(t)) J(xπ(t))+µt F(xπ(t)) αI 1 J(xπ(t)) F(xt),

where µt > 0 and α [1, 2], with a fast local superlinear rate. However, these methods do not have global theoretical advantage compared to the traditional methods. Recently, Doikov, Chayti, and Jaggi (2023) proposed the regularized Newton method with lazy Hessians for general minimization problem minx Rd ϕ(x), which iterates with

xt+1 =xt 2ϕ(xπ(t))+ p

c ϕ(xt) I 1 ϕ(xt) (11)

for some c > 0. However, this method is not suitable to solve our system of nonlinear equations (or its nonlinear least-square formulation) in the following aspects: The update (11) requires accessing the second-order information of the objective. In the view of nonlinear leastsquare formulation (2), we have

= J(xπ(t)) J(xπ(t)) +

i=1 Fi(xπ(t)) 2Fi(xπ(t)).

Compared with the cost of LM methods mainly depends on J( ), accessing the Hessians 2Fi( ), . . . , 2Fd( ) in above equation may be much more expensive.

Method Globally Computation Cost Local superlinear Line-search Free References

Gradient Descent O(d2ϵ 2) % ! Nesterov (2018)

Levernberg Marquardt (v1) O(d3ϵ 2 log(ϵ 1)) ! % Bergou, Diouane, and Kungurtsev (2020)

Levernberg Marquardt (v2) O(d3ϵ 2.5 log(ϵ 1)) % ! Mishchenko (2023)

GRLM (ours) O(d2ϵ 2 + d3ϵ 1) ! ! Algorithm 1

This results can be improved to O(d3ϵ 2.5) by using our analysis framework. We present a discussion in Remark 10. Our analysis can also provide the local superlinear convergence rate for Levernberg Marquardt (v2), which is not proved in Mishchenko (2023). We present the result in Remark 16.

Table 1: We summarize the globally convergent algorithms for solving the nonlinear equations by their total computation cost for finding the ϵ-stationary point of ϕ( ) (Globally Computation Cost), if they enjoy local superlinear rates (Local Superlinear), and if they are line-search free (Line-search Free).

Algorithm 1: Gram-Reduced Levenberg Marquardt (GRLM)

1: Input: x0, c, T, and m

2: for t = 0, 1, . . . T 1

3: π(t) = m t/m , zt = xπ(t)

c J(xt) F(xt)

5: xt+1 = xt G(zt) + λt I 1J(xt) F(xt)

The convergence guarantees of (11) require the convexity of ϕ( ) and the Lipschitz continuity of 2ϕ( ), while the function ϕ( ) = 1 2 F( ) 2 in our problem (2) is nonconvex in general and the popular setting of nonlinear equations (Assumption 2) cannot guarantee the Lipschitz continuity on ϕ( ) nor 2ϕ( ). It is also notable that Doikov, Chayti, and Jaggi (2023) also studied the cubic regularized Newton methods (Nesterov 2006) with lazy Hessians for the general non-convex case, however, the algorithm also requires accessing the Hessian 2ϕ( ) and its analysis is based on the Lipschitz continuity of Hessian. Thus, their theory is not applicable to solving the system of nonlinear equations.

4 The Gram-Reduced LM Method We propose the Gram-Reduced Levenberg Marquardt (GRLM) method in Algorithm 1. Our GRLM method takes the Gram matrix G(zt) = J(zt) J(zt) as the approximation to 2ϕ(zt), which avoids the exact 2ϕ( ) in regularized Newton with Lazy Hessians (Doikov, Chayti, and Jaggi 2023). Furthermore, setting zt in GRLM does not require accessing J( ) J( ) in all iterations, making the algorithm more efficient than existing LM methods. We then consider the convergence of our GRLM method. For the ease of presentation, we denote

c J(xt) F(xt) and rt xt+1 xt .

where the constant c > 0 follows the input of Algorithm 1. In contrast to the analysis of Mishchenko (2023) which

only lower bounds rt by λt+1, the following lemma provides both the lower bound and the upper bound of rt by λt.

Lemma 5. Under Assumption 2, Algorithm 1 holds that

λ2 t c(L2 1 + λt) rt λt

Compared to the ordinary LM method, our GRLM use G(zt) instead of G(xt) at each iteration, which leads to some additional error terms. In the following lemma, we show that such terms can be bounded by the distance between the current iteration point to the snapshot point due to the Lipschitz continuity of G( ) we have proved in Proposition 4.

Lemma 6. Under Assumption 2, Algorithm 1 holds that G(xt)(xt+1 xt) + J(xt) F(xt)

λtrt + 2L1L2rt zt xt ,

and J(xt) F(xt) + G(xt)(xt+1 xt), xt+1 xt

λtr2 t + 2L1L2r2 t xt zt . (13)

4.1 Global Convergence Analysis

We establish the global convergence for Algorithm 1 in this subsection. We adopt the following cubic-growth assumption which has been well studied by Mishchenko (2023).

Assumption 7. We assume the function F : Rd Rd satisfies that

F(y) 2 F(x) + J(x)(y x) 2+M y x 3 (14)

for all x, y Rd, where M > 0 is some constant.

The following lemma shows that, by properly setting the regularization parameter c > 0, we can guarantee the descent property of F(xt) at the snapshot point xt Rd such that t 0 (mod m).

Lemma 8. Under Assumption 2 and 7, if we run Algorithm 1 with c max{4L1L2m, M}, then it holds that

F(x(k+1)m) 2 F(xkm) 2 F(x0) 2

F(xkm) 2 F(x(k+1)m) 2

for all k = 0, 1, . Now, we present the global convergence results of our GRLM (Algorithm 1).

Theorem 9. Following the settings of Lemma 8, Algorithm 1 takes at most

T = O(c2 + c 0.5ϵ 2.5)

iterations to find an ϵ-stationary point of ϕ( ), i.e., we have

min t=0, ,T 1 J(xt) F(xt) ϵ.

Remark 10. Taking m = 1 such that c = max{4L1L2, M}, Algorithm 1 reduces to the ordinary Levernberg Mardquardt method proposed by Mishchenko (2023). Compared with the iteration complexity of O(ϵ 2.5 log(1/ϵ)) established in the previous work, Theorem 1 provides an improved iteration complexity of O(ϵ 2.5), which removes the logarithmic factor. We achieve this by lower bounding rt by λt rather than λt+1 in the existing work and using a novel proof framework that divides the iterations according to the value of λt. Recall that Algorithm 1 needs to compute a new G(zt) in every m iterations. W.L.O.G, we let m M 4L1L2 such that c = 4L1L2m and use K to denote the numbers of the snapshot points, then Theorem 9 means

= O(m + m 1.5ϵ 2.5).

For reusing the Gram matrix G(zt) and reducing the computation cost, we can implement Algorithm 1 by performing singular value decomposition on the Jacobian at the snapshot point, which generally takes O(d3) flops and results

J(zt) = U(zt)Σ(zt)V(zt) ,

where U(zt), V(zt) Rd d are orthogonal and Σ(zt) is diagonal. Based on the SVD of J(zt), we can update xt according to

xt+1 = xt (G(zt) + λt I) 1 J(xt) F(xt)

= xt V(zt) (Σ(zt))2 + λt I 1V(zt) J(xt) F(xt),

which can be done in O(d2) flops. Thus, the total computation cost of Algorithm 1 is

#flops = O(d3K + d2T)

=O(d3m+d3m 1.5ϵ 2.5+d2m2+d2m 0.5ϵ 2.5). (15)

Now we desire to select an appropriate m to minimize the overall flops. We let m = O(dαϵ β) for some α, β 0 and plug it into equation (15), then the total computation cost is

= O d3+αϵ β + d3 1.5αϵ 2.5+1.5β

+ d2+2αϵ 2β + d2 0.5αϵ 2.5+0.5β

O d3 0.25αϵ 1.25+0.25β + d2+1.25αϵ 1.25 0.75β ,

where the inequality is due to the AM GM and the equality holds when α = 0 and β = 1. Therefore, we should take m = Θ(ϵ 1), which leads to #flops = O(d3ϵ 1 + d2ϵ 2). We formally present the above result in the following corollary.

Corollary 11. Following the setting of Lemma 8, running Algorithm 1 with m = Θ(ϵ 1) can find an ϵ-stationary point of ϕ( ) within O(d3ϵ 1 + d2ϵ 2) flops.

As comparison, the globally convergent LM method proposed by Mishchenko (2023) requires O(d3ϵ 2.5 log(1/ϵ)) flops to find an ϵ-stationary point of ϕ( ), which is more expensive than our complexity of O(d3ϵ 1 + d2ϵ 2). Our result is also sharper than the complexity of O(d3ϵ 2) for other LM methods (Ueda and Yamashita 2010; Zhao and Fan 2016; Huang and Fan 2018; Bergou, Diouane, and Kungurtsev 2020; Tran-Dinh, Pham, and Nguyen 2020).

4.2 Local Convergence Analysis

We establish the local superlinear rates for Algorithm 1 in this subsection, with the following assumption for our problem.

Assumption 12. We assume that there exists a solution x for problem (1) which has non-degenerate Jacobian, i.e., we have

F(x ) = 0 and σmin(J(x )) = µ (16)

for some µ > 0.

The following proposition says that Assumption 12 guarantees that the points in the local neighbor of solution x have nondegenerate J( ) and G( ).

Proposition 13 ((Liu et al. 2023a, Proposition 2.3)). Under Assumption 2 and 12, for all x Rd satisfying x x µ2/(6L1L2), we have σmin(J(x)) µ

2 and G(x) µ2

We first provide the recursion for the distances between the points in the local region to the solution.

Lemma 14. Under Assumptions 2 and 12, we assume that some points xt and zt generated by Algorithm 1 are sufficiently close to the solution x such that

n x:x Rd, x x min µ2

18L1L2 , µ4

64L2 1c o , (17)

then it holds that

xt+1 x α1 xt x 2 + α2 xt x 1.5

+ 2α1 zt x xt x , (18)

where α1 L1L2/µ2 and α2 2L1c0.5/µ2. In addition, we have xt+1 S.

Now, we show the explicit local superlinear convergence rate of GRLM.

Theorem 15. Under Assumptions 2 and 12, we run Algorithm 1 with x0 such that

x0 x 1 32(α1 + α2 2), (19)

where α1 and α2 follow the definitions in Lemma 14. Then it holds that

xt x 1 2(α1 + α2 2)

2(1+(1+m/2)π(t))(1+(t%m)/2) ,

where t%m = t π(t). Remark 16. When m = 1, Theorem 15 provides the superlinear rate of xt x 1 2(α1+α2 2) 1

2 4(1+t/2) , for the LM method in Mishchenko (2023, Algorithm 2.2). The local convergence behavior has been widely studied for LM methods, while most of the work focuses on selecting the regularization parameter in the order of λt F(xt) α (Yamashita and Fukushima 2001; Fan and Yuan 2005; Bergou, Diouane, and Kungurtsev 2020). This paper first investigates local convergence using λt p

J(xt) F(xt) . The local superlinear rate in Theorem 15 is comparable to the rate achieved by regularized Newton with lazy Hessians (Doikov, Chayti, and Jaggi 2023). Compared with Doikov, Chayti, and Jaggi (2023) where the objective is supposed to be strongly convex, we only assume the Jacobian at the solution is nondegenerate, which allows the objective ϕ( ) = F( ) 2 to be non-convex. Furthermore, we characterize the superlinear convergence of GRLM by the measure of distance to the solution, while the analysis of Doikov, Chayti, and Jaggi (2023) considers the gradient norm of the objective. The GRLM method (Algorithm 1) takes an average computation cost of O(d2) per iteration when we choose m = Ω(d), which matches the cost of per iteration in quasi-Newton methods or incremental Newton methods to solve the nonlinear equations (Lin, Ye, and Zhang 2021; Ye, Lin, and Zhang 2021; Liu and Luo 2022; Liu et al. 2023a; Zhou et al. 2024). However, these quasi-Newton methods lack global convergence guarantees.

5 Experiment In this section, we conduct experiments on scientific computing and machine learning applications of Chandrasekhar H-equation and non-convex regularized logistic regression to verify our theory. We choose the gradient descent method (GD) and the Levernberg Marquardt method with gradient regularization (Mishchenko 2023) (LM) as baselines. We do not compare our methods with quasi-Newton type methods (Ye, Lin, and Zhang 2021; Lin, Ye, and Zhang 2021; Liu et al. 2023a) nor Jacobian-free Newton Krylov methods (Knoll and Keyes 2004; Ashrafizadeh, Devaud, and Aydemir 2015), since they do not have global convergence guarantees. For all experiments, we tuned the step size η for GD from {0.1, 0.2, , 1}. We tune the regularized parameter c in LM and GRLM from {1, 10, 100, 1000}. Our experiments are conducted on a PC with Apple M1 and all algorithms are implemented in Python 3.8.12.

0 7000 14000

0.0 0.3 0.6

(a) N = 100 (#JV) (d) N = 100 (time)

0 8000 16000

0.0 1.5 3.0

(b) N = 200 (#JV) (e) N = 200 (time)

0 10000 20000

(c) N = 300 (#JV) (f) N = 300 (time)

Figure 1: We demonstrate Jacobian-vector products computing times (#JV) and CPU time (second) vs. J(x) F(x) 2 for H-equation with different equation numbers N.

5.1 Chandrasekhar H-equation Chandrasekhar H-equation plays an important role in scientific computing (Chandrasekhar 1960; Leggett 1976; Kelley 1982) and has been well studied in the previous literature (Kelley 1995; Lin, Ye, and Zhang 2021; Ye, Lin, and Zhang 2021; Liu et al. 2023a). It is defined by

Fi(x) xi 1 c 2N

µixj µi + µj

where F(x) = [F1(x), , FN(x)] RN and x = [x1, , x N] RN. We compare GRLM (m = 50) with baselines. We test the cases N = 100, N = 200, and N = 300. In all cases, we set c = 1 10 10. We randomize an x0 as the initial points for all the methods. The results of the total number of the Jacobian-vector products (#JV) computed in the algorithms against J(xt) F(xt) and the running time against J(xt) F(xt) is presented in Figure 1.2 We observe that the proposed Gram-reduced Levernberg Mardquardt method (GRLM) outperforms the baselines in all cases. We also provide experiments to study the impact of choosing different m in GRLM (Algorithm 1). We choose m = {1, 50, 100, 500} for all cases and present the results in Figure 2.

2The number of the Jacobian-vector products for computing a full Jacobian is d.

0 4000 8000

m = 1 m = 50 m = 100 m = 500

0.0 0.2 0.4

m = 1 m = 50 m = 100 m = 500

(a) N = 100 (iteration) (d) N = 100 (time)

0 3000 6000

m = 1 m = 50 m = 100 m = 500

m = 1 m = 50 m = 100 m = 500

(b) N = 200 (iteration) (e) N = 200 (time)

0 4000 8000

m = 1 m = 50 m = 100 m = 500

m = 1 m = 50 m = 100 m = 500

(c) N = 300 (iteration) (f) N = 300 (time)

Figure 2: We demonstrate the iteration numbers (iteration) and CPU time (second) vs. J(x) F(x) 2 for H-equation with different equation numbers N.

We find that a smaller m leads to a faster iteration in (a), (b), and (c) of Figure 2, which matches our theoretical results obtained in Theorem 9. However, there exists a trade-off between the iteration complexity and the computation cost per iteration.

5.2 Non-Convex Regularized Logistic Regression We further validate the GRLM methods on the non-convex regularized logistic regression model (Antoniadis, Gijbels, and Nikolova 2011):

min x Rd f(x) 1

i=1 ln(1 + exp( bia i x))+λ

x2 (p) 1 + x2 (p) ,

where x(p) is the p-th coordinate of x Rd, λ > 0 is the regularized parameter, ai Rd and bi { 1, +1} are the feature and the corresponding label of the i-th sample. This model corresponds to solving the following nonlinear equations F(x) f(x). We compare GRLM (m = 100) with baselines in three real-world datasets: a1a (n = 1, 605, d = 123), w1a (n = 2, 477, d = 300), and splice (n = 1, 000, d = 60). All of these data sets can be downloaded from the LIBSVM repository (Chang and Lin 2011). We present the results of the number of Jacobian-vector products (#JV) against J(xt) F(xt) and the running time against

0 4000 8000

10 2 GD LM GRLM

10 2 GD LM GRLM

(a) a1a (#JV) (d) a1a (time)

0 15000 30000

(b) w1a (#JV) (e) w1a (time)

0.0 0.4 0.8

(c) splice (#JV) (f) splice (time)

Figure 3: We demonstrate Jacobian-vector products computing times (#JV) and CPU time (second) vs. J(x) F(x) 2 for non-convex logistic regression on datasets a1a , w1a , and splice .

J(xt) F(xt) in Figure 3. GRLM outperforms the baselines for all datasets in terms of both the total number of Jacobian vector products and the CPU time.

6 Conclusion and Future Work

In this paper, we have proposed Gram-Reduced Levenberg Marquardt Method (GRLM) to solve nonlinear equations. GRLM improves the behavior of existing LM methods by reducing the computation times of Gram matrices. It is globally convergent with a simple iteration scheme and a low computational cost. Furthermore, it exhibits local superlinear convergence, which cannot be achieved by any of the firstorder methods. To the best of our knowledge, this is the first method for solving nonlinear equations that can achieve the best of both worlds.

For future work, it is interesting to design sketched, stochastic, and distributed variants of GRLM (Yuan 2009; Yuan, Lazaric, and Gower 2022; Chayti, Doikov, and Jaggi 2023; Liu et al. 2023b). It is also possible to incorporate the idea of super universal Newton methods (Doikov, Mishchenko, and Nesterov 2024) to make GRLM parameterfree.

Acknowledgments Chengchang Liu is supported by the National Natural Science Foundation of China (624B2125). Luo Luo is supported by the National Natural Science Foundation of China (No. 62206058), Shanghai Sailing Program (22YF1402900), Shanghai Basic Research Program (23JC1401000), and the Major Key Project of PCL under Grant PCL2024A06. John C.S. Lui is supported in part by the RGC GRF-142202923.

References Adler, I.; Hu, Z. T.; and Lin, T. 2020. New proximal Newtontype methods for convex optimization. In 2020 59th IEEE Conference on Decision and Control (CDC), 4828 4835. IEEE. Antoniadis, A.; Gijbels, I.; and Nikolova, M. 2011. Penalized likelihood regression for generalized linear models with nonquadratic penalties. Annals of the Institute of Statistical Mathematics, 63: 585 615. Ashrafizadeh, A.; Devaud, C.; and Aydemir, N. 2015. A Jacobian-free Newton Krylov method for thermalhydraulics simulations. International Journal for Numerical Methods in Fluids, 77(10): 590 615. Bai, S.; Kolter, J. Z.; and Koltun, V. 2019. Deep equilibrium models. Advances in Neural Information Processing Systems, 32. Ben-Israel, A. 1966. A Newton-Raphson method for the solution of systems of equations. Journal of Mathematical analysis and applications, 15(2): 243 252. Bergou, E. H.; Diouane, Y.; and Kungurtsev, V. 2020. Convergence and complexity analysis of a Levenberg Marquardt algorithm for inverse problems. Journal of Optimization Theory and Applications, 185: 927 944. Berthier, E.; Carpentier, J.; and Bach, F. 2021. Fast and Robust Stability Region Estimation for Nonlinear Dynamical Systems. In 2021 European Control Conference (ECC), 1412 1419. IEEE. Botev, A.; Ritter, H.; and Barber, D. 2017. Practical Gauss Newton optimisation for deep learning. In International Conference on Machine Learning, 557 565. PMLR. Chandrasekhar, S. 1960. Radiative transfer. Courier Corporation. Chang, C.-C.; and Lin, C.-J. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2: 27:1 27:27. Software and datasets available at http://www.csie.ntu.edu.tw/ cjlin/ libsvm. Chayti, E. M.; Doikov, N.; and Jaggi, M. 2023. Unified Convergence Theory of Stochastic and Variance-Reduced Cubic Newton Methods. ar Xiv preprint ar Xiv:2302.11962. D efossez, A.; and Bach, F. 2015. Averaged least-meansquares: Bias-variance trade-offs and optimal sampling distributions. In Artificial Intelligence and Statistics, 205 213. PMLR. Doikov, N.; Chayti, E. M.; and Jaggi, M. 2023. Second-order optimization with lazy Hessians. In International Conference on Machine Learning, 8138 8161. PMLR.

Doikov, N.; Mishchenko, K.; and Nesterov, Y. 2024. Superuniversal regularized Newton method. SIAM Journal on Optimization, 34(1): 27 56. Elbakary, A.; Issaid, C. B.; Shehab, M.; Seddik, K.; El Batt, T.; and Bennis, M. 2024. Fed-Sophia: A Communication Efficient Second-Order Federated Learning Algorithm. ar Xiv preprint ar Xiv:2406.06655. Fan, J. 2013. A Shamanskii-like Levenberg-Marquardt method for nonlinear equations. Computational Optimization and Applications, 56(1): 63 80. Fan, J.; and Yuan, Y.-X. 2005. On the quadratic convergence of the Levenberg Marquardt method without nonsingularity assumption. Computing, 74: 23 39. Frehse, J.; and Bensoussan, A. 1984. Nonlinear elliptic systems in stochastic game theory. Journal f ur die reine und angewandte Mathematik, 350: 23 67. Huang, J.-C.; and Fan, J.-Y. 2018. Global Complexity Bound of the Inexact Levenberg Marquardt Method. Journal of the Operations Research Society of China, 6: 417 428. Kelley, C. 1982. Approximate methods for the solution of the Chandrasekhar H-equation. Journal of Mathematical Physics, 23(11): 2097 2100. Kelley, C. T. 1995. Iterative methods for linear and nonlinear equations. SIAM. Knoll, D. A.; and Keyes, D. E. 2004. Jacobian-free Newton Krylov methods: a survey of approaches and applications. Journal of Computational Physics, 193(2): 357 397. Lampariello, F.; and Sciandrone, M. 2001. Global convergence technique for the Newton method with periodic Hessian evaluation. Journal of optimization theory and applications, 111: 341 358. Leggett, R. W. 1976. A new approach to the H-equation of Chandrasekhar. SIAM Journal on Mathematical Analysis, 7(4): 542 550. Levenberg, K. 1944. A method for the solution of certain non-linear problems in least squares. Quarterly of applied mathematics, 2(2): 164 168. Lin, D.; Ye, H.; and Zhang, Z. 2021. Explicit superlinear convergence rates of Broyden s methods in nonlinear equations. ar Xiv preprint ar Xiv:2109.01974. Lin, D.; Ye, H.; and Zhang, Z. 2022. Explicit convergence rates of greedy and random quasi-Newton methods. Journal of Machine Learning Research, 23(162): 1 40. Liu, C.; Chen, C.; Luo, L.; and Lui, J. C. 2023a. Block Broyden s Methods for Solving Nonlinear Equations. In Thirty-seventh Conference on Neural Information Processing Systems. Liu, C.; Chen, L.; Luo, L.; and Lui, J. C. 2023b. Communication efficient distributed Newton method with fast convergence rates. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1406 1416. Liu, C.; and Luo, L. 2022. Quasi-Newton methods for saddle point problems. Advances in Neural Information Processing Systems, 35: 3975 3987.

Liu, C.; Zhu, L.; and Belkin, M. 2022. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 59: 85 116. Liu, H.; Li, Z.; Hall, D.; Liang, P.; and Ma, T. 2023c. Sophia: A scalable stochastic second-order optimizer for language model pre-training. ar Xiv preprint ar Xiv:2305.14342. Marquardt, D. W. 1963. An algorithm for least-squares estimation of nonlinear parameters. Journal of the society for Industrial and Applied Mathematics, 11(2): 431 441. Mishchenko, K. 2023. Regularized Newton Method with Global O(1/k2) Convergence. SIAM Journal on Optimization, 33(3): 1440 1462. Nesterov, Y. 2006. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1): 177 205. Nesterov, Y. 2007. Modified Gauss Newton scheme with worst case guarantees for global performance. Optimisation methods and software, 22(3): 469 483. Nesterov, Y. 2018. Lectures on convex optimization, volume 137. Springer. Nocedal, J.; and Wright, S. J. 1999. Numerical optimization. Springer. Nourian, M.; and Caines, P. E. 2013. ϵ-Nash mean field game theory for nonlinear stochastic dynamical systems with major and minor agents. SIAM Journal on Control and Optimization, 51(4): 3302 3331. Rodomanov, A.; and Nesterov, Y. 2021. Greedy quasi Newton methods with explicit superlinear convergence. SIAM Journal on Optimization, 31(1): 785 811. Shamanskii, V. 1967. A modification of Newton s method. Ukrainian Mathematical Journal, 19(1): 118 122. Tran-Dinh, Q.; Pham, N.; and Nguyen, L. 2020. Stochastic Gauss-Newton algorithms for nonconvex compositional optimization. In International Conference on Machine Learning, 9572 9582. PMLR. Tr emolet, Y. 2007. Model-error estimation in 4D-Var. Quarterly Journal of the Royal Meteorological Society: A journal of the atmospheric sciences, applied meteorology and physical oceanography, 133(626): 1267 1280. Ueda, K.; and Yamashita, N. 2010. On a global complexity bound of the Levenberg-Marquardt method. Journal of optimization theory and applications, 147: 443 453. Wang, C.-y.; Chen, Y.-y.; and Du, S.-q. 2006. Further insight into the Shamanskii modification of Newton method. Applied mathematics and computation, 180(1): 46 52. Yamashita, N.; and Fukushima, M. 2001. On the rate of convergence of the Levenberg-Marquardt method. In Topics in Numerical Analysis: With Special Emphasis on Nonlinear Problems, 239 249. Springer. Ye, H.; Lin, D.; and Zhang, Z. 2021. Greedy and Random Broyden s Methods with Explicit Superlinear Convergence Rates in Nonlinear Equations. ar Xiv preprint ar Xiv:2110.08572.

Yuan, R.; Lazaric, A.; and Gower, R. M. 2022. Sketched Newton Raphson. SIAM Journal on Optimization, 32(3): 1555 1583. Yuan, Y.-X. 1994. Trust region algorithms for nonlinear equations. Citeseer. Yuan, Y.-X. 2009. Subspace methods for large scale nonlinear equations and nonlinear least squares. Optimization and Engineering, 10(2): 207 218. Yuan, Y.-X. 2011. Recent advances in numerical methods for nonlinear equations and nonlinear least squares. Numerical algebra, control & optimization, 1(1): 15. Zhao, R.; and Fan, J. 2016. Global complexity bound of the Levenberg Marquardt method. Optimization Methods and Software, 31(4): 805 814. Zhou, Z.; Liu, Z.; Liu, C.; and Luo, L. 2024. Incremental Gauss Newton Methods with Superlinear Convergence Rates. ar Xiv preprint ar Xiv:2407.03195.