# quantum_optimization_via_gradientbased_hamiltonian_descent__2ea798e9.pdf Quantum Optimization via Gradient-Based Hamiltonian Descent Jiaqi Leng 1 2 Bin Shi 3 4 With rapid advancements in machine learning, first-order algorithms have emerged as the backbone of modern optimization techniques, owing to their computational efficiency and low memory requirements. Recently, the connection between accelerated gradient methods and damped heavyball motion, particularly within the framework of Hamiltonian dynamics, has inspired the development of innovative quantum algorithms for continuous optimization. One such algorithm, Quantum Hamiltonian Descent (QHD), leverages quantum tunneling to escape saddle points and local minima, facilitating the discovery of global solutions in complex optimization landscapes. However, QHD faces several challenges, including slower convergence rates compared to classical gradient methods and limited robustness in highly nonconvex problems due to the non-local nature of quantum states. Furthermore, the original QHD formulation primarily relies on function value information, which limits its effectiveness. Inspired by insights from high-resolution differential equations that have elucidated the acceleration mechanisms in classical methods, we propose an enhancement to QHD by incorporating gradient information, leading to what we call gradient-based QHD. Gradient-based QHD achieves faster convergence and significantly increases the likelihood of identifying global solutions. Numerical simulations on challenging problem instances demonstrate that gradient-based QHD outperforms existing quantum and classical methods by at least an order of magnitude. 1Simons Institute for the Theory of Computing, University of California, Berkeley, USA 2Department of Mathematics, University of California, Berkeley, USA 3Center for Mathematics and Interdisciplinary Sciences, Fudan University, Shanghai, China 4Shanghai Institute for Mathematics and Interdisciplinary Sciences, Shanghai, China. Correspondence to: Jiaqi Leng . Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). 1. Introduction In modern machine learning, a central challenge lies in unconstrained optimization, particularly the task of minimizing a continuous objective function without any constraints. Mathematically, this problem is formulated as: min x Rd f(x). Efficiently solving such optimization problems is fundamental to a wide range of machine learning applications. First-order optimization algorithms have emerged as the cornerstone of this endeavor due to their computational efficiency and low memory requirements. One of the simplest yet most widely used first-order methods is the vanilla gradient descent, which updates iteratively according to: xk+1 = xk s f(xk), where s > 0 denotes the step size. This method, though simple, serves as the foundation for many modern optimization techniques. In the early 1980s, a groundbreaking advancement was introduced by Nesterov (1983): the accelerated gradient method, now widely known as Nesterov s accelerated gradient descent method (NAG). This method revolutionized first-order optimization by achieving a faster convergence rate compared to vanilla gradient descent. The iterative update rules for NAG are as follows: xk = yk 1 s f(yk 1), yk = xk + k 1 k + 2(xk xk 1), where s > 0 is the step size. The key innovation of NAG lies in the introduction of momentum, which effectively reduces oscillations in the optimization trajectory and speeds up progress towards the optimal solution. Recent advancements have shed light on the mechanisms underlying the acceleration of NAG, thereby effectively bridging the gap between its discrete updates and the continuous dynamics of damped heavy-ball motion. One pivotal contribution in this area is the introduction of the low-resolution ordinary differential equation (ODE) by Su et al. (2016), which characterizes the continuous limit of NAG as: t X + f(X) = 0, Quantum Optimization via Gradient-Based Hamiltonian Descent Figure 1. Numerical comparison of successful probability across iterations for both QHD and gradient-based QHD applied to the Styblinski-Tang function. P k denotes the success probability at iteration k. where the first derivative X represents velocity in classical mechanics. By transforming this equation into its canonical form, we obtain: This canonical form establishes the foundation for a variational perspective on the acceleration phenomenon, which is articulated through the Bregman Lagrangian, L(X, V, t) = 1 2t3 V 2 t3f(X), (1) as introduced by Wibisono et al. (2016). Furthermore, employing the Legendre transformation, we can convert this Lagrangian into its Hamiltonian form: H(X, P, t) = 1 2t3 P 2 + t3f(X), (2) which paves the way for extending the analysis from classical dynamics to quantum dynamics. By transforming the classical momentum variable P to the quantum momentum operator i within the Hamiltonian (2), Leng et al. (2023a) have pioneered a groundbreaking algorithm, known as Quantum Hamiltonian Descent (QHD), which defines the quantum dynamics through the following Schr odinger equation as i tΨ(t, x) = ˆH(t)Ψ(t, x), (3) where the time-dependent Hamiltonian is articulated as1: t 3/2( i ) 2 + t3f(x). (4) 1The original formulation of QHD allows a more general Hamiltonian: ˆH(t) = eαt γt( /2) + eαt+βt+χtf(x), where the Laplacian = , as given in Eq. (A.24) of (Leng et al., 2023a). For simplicity, we specialize to the parameter choices corresponding to the classical NAG, namely αt = log(t) and βt = γt = 2 log(t). Let Ψ(t, x): [0, ) Rd C denote a quantum wave function, whose squared modulus |Ψ(t, x)|2 represents the probability distribution of a hypothetical quantum particle in Rd at any time t 0. For sufficiently large evolution time t, the probability distribution is expected to concentrate near the low-energy configurations of the potential f, particularly around its global minimum. Measuring the quantum state in the computational basis at such times yields a random vector X |Ψ(t, x)|2, which is likely to lie close to the global minimizer of f, thereby approximately solving the associated optimization problem. As a quantum algorithm, QHD is implemented by simulating the time-dependent Hamiltonian (4), which relies only on oracle access to the function values of f. Thus, QHD can be viewed as a quantum zeroth-order method. A natural extension of QHD is to develop its higher-order variants that leverage additional information such as the gradient of f, and to analyze whether such extensions can enhance QHD s efficiency on various continuous optimization problems. Inspired by the high-resolution ODE framework introduced by Shi et al. (2022), where the Lyapunov function is conceptualized as a form of energy or Hamiltonian involving the interplay of kinetic energy and gradient, we propose a novel time-dependent Hamiltonian as t 3/2( i ) + αt3/2 f 2 2 t3/2 f 2 + (t3 + γt2)f(x). (5) In this paper, we mainly investigate the Schr odinger equation (3) with the gradient-based Hamtiltonian (5), termed as gradient-based QHD. 1.1. Warm-up: gradient-based QHD v.s. QHD We provide a numerical example to illustrate the differences between gradient-based QHD and standard QHD, both qualitatively and quantitatively. Figure 1 visualizes the probability distribution across iterations for both QHD and Quantum Optimization via Gradient-Based Hamiltonian Descent gradient-based QHD, applied to the non-convex Styblinski Tang function, which features three local minima alongside a global minimum. (a) Function value (b) Success probability Figure 2. Numerical performance comparison of various algorithms on the Styblinski-Tang function. Furthermore, Figure 2 demonstrates the numerical performance involving function values and success probability. While QHD does not depict an obvious advantage against stochastic gradient descent with momentum (SGDM) (Shi, 2024; Shi et al., 2023) and NAG, gradient-based QHD demonstrates a much more concentrated solution distribution as the iterations progress, leading to an improved global convergence. These findings motivate us to conduct a detailed investigation into gradient-based QHD and its potential in continuous optimization. 1.2. Overview of contributions Our contributions are listed as follows: We propose gradient-based QHD for continuous optimization problems. With a novel Lyapunov function approach involving quantum operators, we provide a convergence analysis of gradient-based QHD in continuous time. In particular, we establish the convergence rate of gradient-based QHD in both function values (Theorem 1) and gradient norms (Theorem 4). We develop a quantum algorithm that simulates discrete-time gradient-based QHD to solve optimization problems (Algorithm 1). With a gate complexity linear in problem dimension d, this quantum algorithm is readily scalable to handle large-scale problems in practice. In addition to the theoretical analysis, we conduct a numerical study to evaluate the performance of gradient-based QHD in both convex and non-convex optimization. Our results show that gradient-based QHD achieves an enhanced performance compared to standard QHD and other prominent classical optimization algorithms. In some cases, gradient-based QHD yields solutions that are an order of magnitude better than those obtained by other methods. Organization. This work is structured as follows. First, we survey related classical and quantum optimization algorithms in Section 2. Next, we formulate gradient-based QHD in Section 3, with several continuous-time convergence results established in Section 4. In the subsequent Section 5, we show that gradient-based QHD can be efficiently implemented using a quantum computer. Finally, in Section 6, we present the numerical experiments comparing gradient-based QHD with several other quantum and classical optimization algorithms. We conclude this paper in Section 7. 2. Related work NAG-related algorithms and ODEs. There has been a long history of analyzing NAG-related optimization algorithms (Giselsson & Boyd, 2014; O donoghue & Candes, 2015). Su et al. (2016) sheds new light on the understanding and design of NAG using an ODE perspective. In (Betancourt et al., 2018; Wibisono et al., 2016; Wilson et al., 2021), a Lagrangian (or Hamiltonian) framework is used to describe a larger class of ODEs that provides a unified perspective for the acceleration phenomenon in first-order optimization. Notably, accelerated gradient descent has been investigated in non-Euclidean settings, including mirror descent (Krichene et al., 2015; Lin et al., 2019) and more generally, Riemannian manifolds (Ahn & Sra, 2020; Han et al., 2023; Kong & Tao, 2024; Siegel, 2019). Quantum algorithms for unconstrained optimization. Using quantum computers to accelerate bottleneck steps in classical optimization algorithms has shown promise in achieving quantum advantage (Kerenidis & Prakash, 2020; Liu et al., 2024; Rebentrost et al., 2019). However, their practical performance requires further investigation due to the non-trivial overhead involved in extracting classical information from quantum states (Yuen, 2023). Motivated by the interplay between NAG and ODEs, another line of research proposes leveraging quantum Hamiltonian dynamics as an algorithmic surrogate for addressing unconstrained optimization problems (Leng et al., 2023a; Liu et al., 2023; Zhang et al., 2021), with recent extensions to open quantum systems (Chen et al., 2023), constrained optimization (Augustino et al., 2023), and discrete optimization (Cheng et al., 2024). This approach is particularly effective for highly non-convex problems (Leng et al., 2023b) and well-suited for hardware implementation (Kushnir et al., 2024; Leng et al., 2024). More discussions are available in Appendix B. Quantum Optimization via Gradient-Based Hamiltonian Descent 3. Gradient-based Hamiltonian dynamics 3.1. Classical Hamiltonian flows with gradient Inspired by the Bregman Lagrangian (Wibisono et al., 2016) and the high-resolution ODE framework (Shi et al., 2022), we propose to study the following Lagrangian function: L(t, X, X) = t3 2 | X|2 αt3 X f(X) 2 | f(X)|2 (t3 + γt2)f(X), (6) where α, β, γ R are real-valued parameters that will be specified later. Compared with the standard Bregman Lagrangian (1), our new Lagrangian function explicitly incorporates the gradient f into the Lagrangian. This design is motivated by the convergence analysis in the high-resolution ODE, where the Lyapunov function can be interpreted as a generalized energy functional that includes gradient information. More details are provided in Appendix A.2. By applying the Legendre transformation, we obtain the Hamiltonian function associated with (6): H(t, X, P) = sup Y P Y L(t, X, Y ) 2 t 3/2P + αt3/2 f 2 2 f(X) 2 + (t3 + γt2)f(X). Thus, we derive the Hamiltonian dynamics: P = 1 2t3 P(t) + α f(X(t)), (8) X = 2f(X) αP + (α2 + β)t3 f(X) (t3 + γt2) f(X). (9) Connection with high-resolution ODEs. It is worth noting that while our Lagrangian function shares certain similarities with high-resolution ODEs, they are not equivalent. By substituting (9) into (8), and choosing β/α = s, γ 3α = 3 s/2, (10) we can transform the Hamiltonian dynamics to a secondorder ODE: t X(t) + s 2f(X(t)) X f(X(t)) = s 2t3 2f(X(t))P. (11) Formally, the left-hand side corresponds to the highresolution ODE derived by Shi et al. (2022) (for details, see Appendix A). The right-hand side of (11) is asymptotically vanishing as the momentum P eventually decays to 0.2 Therefore, we expect the Hamiltonian dynamics to exhibit long-term behavior similar to that of high-resolution ODEs; however, we leave a detailed analysis for future research. Due to its distinctive properties, the proposed Lagrangian function is of independent theoretical interest. In this work, we deliberately do not restrict the parameters α, β, and γ to the specific values associated with the high-resolution ODE case (10). This flexibility allows us to explore a broader class of dynamical systems, potentially leading to novel insights and improved algorithms for continuous optimization. 3.2. Canonical quantization We introduce canonical quantization, a standard procedure that maps a classical Hamiltonian function to a quantum Hamiltonian operator. The Hamiltonian operator serves as an infinitesimal generator of a quantum evolution, which will be the core of our quantum optimization algorithms. A classical-mechanical system is described by a Hamiltonian function H(X, P, t). In contrast, a quantummechanical system is governed by a quantum Hamiltonian operator ˆH : L2(Rd) L2(Rd). The canonical quantization procedure allows us to translate a known classical Hamiltonian function to a corresponding quantum Hamiltonian by the mapping: xj 7 ˆxj, pj 7 ˆpj := i Here, xj and pj are the position and momentum variables describing a classical object living in a d-dimensional space Rd, respectively, with the dimension indices i [d]. Correspondingly, ˆxj and ˆpj are the quantum position and momentum operators acting on wave functions ψ(x) L2(Rd): (ˆxjψ)(x) = xjψ(x), (ˆpjψ)(x) = i Using this dictionary, we obtain the quantum Hamiltonian operator corresponding to the Hamiltonian function (7): j=1 A2 j + β 2 t3 f 2 + (t3 + γt2)f, (13) where for j = 1, . . . , d, the operator Aj is defined by Aj = t 3/2ˆpj + αt3/2ˆvj, ˆvjψ := f with ˆvj a multiplicative operator acting on a wave function ψ. Due to the non-commutativity of quantum operators, the 2Details are available in Section 4. Quantum Optimization via Gradient-Based Hamiltonian Descent square of the operator Aj is expressed as A2 j = t 3ˆp2 j + α{ˆpj, ˆvj} + α2t3ˆv2 j , where {A, B} := AB + BA denotes the anti-commutator of operators. Given a quantum Hamiltonian operator ˆH(t), the quantum evolution generated by the Hamiltonian operator is governed by the Schr odinger equation: i tΨ(t, x) = ˆH(t)Ψ(t, x), (15) for time 0 < T0 t T, subject to an initial condition Ψ(T0, x) = Ψ0(x). The quantum wave function Ψ(t) is complex-valued, and its modulus squared |Ψ(t)|2 corresponds to a probability density that characterizes the distribution of the quantum particle in the real space Rd. Connection with the original QHD. In Leng et al. (2023a), the Hamiltonian was derived from the Bregman Lagrangian via Feymann s path integral technique. Our derivation relying on canonical quantization takes a different yet complementary approach. The resulting Hamiltonian operator ˆH(t) naturally encompasses the original QHD as a special case by choosing the parameters α = β = γ = 0. 4. Convergence analysis In this section, we focus on the convergence results of the newly derived quantum dynamics. Throughout this section, we assume f(x ) = 0 and x = 0. This can always be achieved by considering the translated objective function f(x) f(x + x ) f(x ). 4.1. Case 1: convergence to global minimum First, we consider a simple case where no gradient norm appears in the Hamiltonian (13), i.e., β = 0. In this case, we can prove that the dynamics converge to the global minimum of f. Theorem 1. Let β = 0 and γ max(3α, 0) for any α R. For any 1/α T0 > 0, we denote Ψ(t, x) as the solution to the PDE (15) for t T0. Let Xt be a random variable distributed according to the probability density |Ψ(t, x)|2. Then, for a convex and continuously differentiable f, we have E[f(Xt)] K0 + D0 t2 + ωt , ω = γ 3α 0, where K0 = T 4 0 Ψ(T0)|( )|Ψ(T0) and D0 = E f(XT0) 2 + 4 XT0 2 + (T 2 0 + ωT0)f(XT0) . In other words, E[f(Xt)] O(t 2). Remark 1. K0 represents the initial kinetic energy (rescaled by T 4 0 ). Its value is independent of f and typically does not depend on the dimension d, e.g., when the initial state Ψ0 is a standard Gaussian wave. In general, D0 can scale linearly in d due to the presence of f 2. The convergence rate is proved by constructing a Lyapunov function E(t) that is non-increasing in time. The Lyapunov function is defined by E(t) = ˆO(t) t := Ψ(t)| ˆO(t)|Ψ(t) , t 2ˆpj + αtˆvj + 2ˆxj 2 + t2 + ωt f. Here, ω = γ 3α 0 because γ max(3α, 0). Lemma 2. Let β = 0 and γ max(3α, 0). For any t > 0, we have E(t) 0. In Lemma 2, we prove that the function E(t) is nonincreasing in time, as a result, t2 f t E(t) E(T0) = f t E(T0) Moreover, we note that E(T0) Ψ(t) 1 T 4 0 ( ) + α2T 2 0 f 2 + 4 x 2 Ψ(t) + (T 2 0 + ωT0) Ψ(t)|f|Ψ(t) , which proves Theorem 1. The details of Lemma 2 is presented in Appendix C.2. The technical proof heavily relies on the commutation relations between various non-commuting quantum operators that appeared in the Lyapunov function. We summarize the common commutation relations used in this work in Lemma 3, which might be of independent interest in future work. Lemma 3 (Commutation relations). Let Aj, pj, and xj be the same as above. For any 1 j, k d, we have the following identities: 1. i[A2 j, f] = t 3{pj, vj} + 2αv2 j , 2. i[f, {Aj, xj}] = 2t 3/2xjvj, 3. i[A2 j, x2 k] = ( 2 t3 {pj, xj} + 4αxjvj (j = k) 0 (j = k), 4. i[A2 j, {Ak, xk}] = ( 4 t3/2 A2 j (j = k) 0 (j = k), 5. i[v2 j , {Ak, xk}] = 4t 3/2 2f xjxk For the proof, please refer to Appendix C.1. Quantum Optimization via Gradient-Based Hamiltonian Descent 4.2. Case 2: convergence to first-order stationary point We denote the function G(x) as the square of the gradient norm of f, i.e., G(x) := f(x) f(x) = Theorem 4. Let γ max(3α, 0) and β > 0. For any min(1/α, p 2/β) T0 > 0, we denote Ψ(t, x) as the solution to the PDE (15). Let Xt be a random variable distributed according to the probability density |Ψ(t, x)|2. Then, for a convex and continuously differentiable f such that its gradient norm satisfies the following identity: G(x) G(x) x 0, (17) E[ f(Xt) 2] 2 (K0 + D 0) βt2 , where K0 is the same as in Theorem 1 and D 0 = E 2 f(XT0) 2 + 4 XT0 2 + (T 2 0 + ωT0)f(XT0) , where ω = γ 3α. In other words, E[ f(Xt) 2] O(t 2). Remark 2. A sufficient condition for the identity (17) is that G(x) is convex. In this case, the global minimizer of G(x) must be x and (17) holds. However, this does not always require the objective function f to be convex. For example, consider f(x) = x for x > 0. While f is a concave function, G(x) = (f )2 = 1 4x is a convex function for x > 0. Similarly, the proof of Theorem 4 exploits a Lyapunov function approach. We define F(t) = ˆJ(t) t := Ψ(t)| ˆJ(t)|Ψ(t) , t 2ˆpj + αtˆvj + 2ˆxj 2 + β 2 t2G + (t2 + ωt)f, with ω = γ 3α 0. Due to the positivity of (t 2pj + αtvj + 2xj)2 and f, we have f 2 t 2 βt2 F(t) 2 βt2 F(0), where the last step follows from Lemma 5. This proves Theorem 4. Lemma 5. Let γ > 0 and α max(β, 0). If (17) holds, we have F(t) 0 for any t > 0. The proof is left in Appendix C.3. 5. Quantum algorithms and complexity analysis In this section, we study the time discretization of gradientbased QHD, which facilitates the simulation of the quantum dynamics in a (fault-tolerant) quantum computer. 5.1. Time discretization of the quantum Hamiltonian dynamics Recall that the gradient-based QHD dynamics are governed by the differential equation (15). Let U(t) be the timeevolution operator that maps an initial state |Ψ0 to the solution state |Ψ(t) at time t [0, T], i.e., U(t)Ψ(0) = Ψ(t) t [0, T]. Formally, the time-evolution operator can be obtained by a sequence of infinitesimal time evolution of the quantum Hamiltonian ˆH(t): U(t) = lim K e ih ˆ H(t K)e ih ˆ H(t K 1) . . . e ih ˆ H(t1), where K is a positive integer, h = t/K and tk = kh for 1 k N. Note that the gradient-based QHD Hamiltonian can be decomposed in the form ˆH(tk) = Hk,1 + Hk,2 + Hk,3, where 2t3 k , Hk,2 = α 2 { i , f}, 2 t3 f 2 + (t3 + γt2)f. Therefore, we can further decompose a short-time evolution step using the product formula (i.e., operator splitting): e ih ˆ H(tk) e ih Hk,1e ih Hk,2e ih Hk,3. (18) Since all the Hamiltonians Hk,1, Hk,2, and Hk,3 can be efficiently simulated using a quantum computer, we obtain a quantum algorithm that implements gradient-based QHD to solve large-scale optimization problems, as summarized in Algorithm 1. On the choice of step size h. It is shown in Childs et al. (2021) that the product formula will introduce a simulation error such that e ih ˆ H(tk) e ih Hk,1e ih Hk,2e ih Hk,3 1 i =j 3 [Hk,i, Hk,j] . A formal calculation shows that the commutator norm scales as O(t3 k), which suggests h t 3/2 k may be needed to control the simulation error in each time step. However, in Quantum Optimization via Gradient-Based Hamiltonian Descent Algorithm 1 Gradient-based QHD with fixed step size Classical Input: Hamiltonian parameters α, β, γ, step size h, number of iterations K. Quantum Input: an initial guess state |Ψ0 Output: a classical point ξ Rd. Initialize the quantum register to |Ψ0 . for k = 1 to K do Determine tk = kh. Implement a quantum circuit Uk as described in (18). Compute |Ψk = Uk |Ψk 1 . end for Measure the final quantum state |ΨK with the position observable ˆx to obtain a sample point ξ Rd. the numerical experiments, we observe that a much larger step size h can still result in the convergence of the discretetime gradient-based QHD. This observation aligns with our experience with the NAG method, where convergence is achieved with a step size proportional to 1/L, irrespective of the continuous-time dynamics. As a result, we treat the step size h as an independent parameter in the complexity analysis. A complete understanding of the convergence of the discrete-time algorithm, however, is left for future study. Remark 3. Quantum simulations of time-dependent Hamiltonians constitute an active research area, with a growing body of literature addressing this topic (e.g., (An et al., 2021; 2022; Berry et al., 2020; Childs et al., 2022; Mizuta et al., 2024)). These developments pave the way for more advanced implementations of gradient-based QHD, potentially offering improved asymptotic complexity. A detailed exploration of such implementations is left for future work. 5.2. Complexity analysis Now, we analyze the computational cost of Algorithm 1. In our analysis, we assume the quantum computer has access to the function f and its gradient via the following quantum circuits: Of : |x |z 7 |x |f(x) + z , O f : |x |z 7 |x | f(x) + z . The quantum circuits Of and O f are often called quantum zerothand first-order oracles. They can be efficiently constructed by quantum arithmetic circuits when the expressions of f and f are known. Remark 4. The requirement for a quantum first-order oracle O f can potentially be eliminated by leveraging Jordan s algorithm (Jordan, 2005), which estimates gradients using only a zeroth-order oracle Of. However, without astrong smoothness assumptions on the objective function f, the query complexity of obtaining an ϵ-approximate gradient typically scales as O( d/ϵ) (Gily en et al., 2019a). In this work, we focus on the convergence properties of gradientbased QHD, leaving the incorporation of quantum gradient estimation techniques for future research. A crucial step in Algorithm 1 is to implement the quantum unitary operator Uk based on the operator splitting formula (18). We note that the sub-Hamiltonians Hk,1 and Hk,3 are fast-forwardable, and the operator Hk,2 can be simulated by invoking Quantum Singular Value Transformation (QSVT). Combining these technical results together, we end up with the overall complexity of the quantum algorithm, as summarized in Theorem 6. Theorem 6. Let f be L-Lipschitz and |Ψ0 be a sufficiently smooth function. Then, we can implement Algorithm 1 for K iterations using O(K) queries to the quantum zerothorder oracle Of and e O (αdh KL) queries to the quantum first-order oracle O f and its inverses.3 The details proof of Theorem 6, including the efficient simulation of Hk,2 via QSVT, is presented in Appendix D. 6. Numerical experiments In this section, we conduct extensive numerical experiments to evaluate the performance of gradient-based QHD and compare it with other prominent optimization algorithms. 6.1. Experiment setup and implementation details Let f : Rd R be an objective function with gradient f(x). Given an optimization algorithm initialized with a random sample drawn from a fixed distribution ρ0, the algorithm s output after k iterations can be represented by a random variable Xk Rd. We denote E[f(Xk)] as the expectation value of the objective function and E[ f(Xk) 2] as expected gradient norm at iteration k. To assess the algorithm s performance, we define the success probability after k iterations as Pk := P[f(Xk) f(x ) δ]. where δ > 0 is a pre-defined error threshold. For all the subsequent experiments, we set δ = 1. We remark that the iteration steps in gradient-based QHD (as shown in Algorithm 1) are more intricate than those in classical methods such as SGDM and NAG. As demonstrated in the proof of Lemma 9, the query and gate complexity per iteration of gradient-based QHD scales as e O(d). In contrast, each iteration of SGDM/NAG involves only a single query to f, with a time complexity of O(d). Therefore, in terms 3Here, the e O notation suppresses poly-logarithmic factors in the error parameter ϵ. The parameter ϵ > 0 represents the error budget in the Hamiltonian simulation, as detailed in Lemma 9. Quantum Optimization via Gradient-Based Hamiltonian Descent of overall runtime, gradient-based QHD remains asymptotically comparable to NAG, which justifies our comparison based on the iteration count. To evaluate the classical methods such as SGDM and NAG, we estimate the expectation values and success probabilities using a sample of 1000 independent runs. Each run begins with a uniformly random initial guess and proceeds for k iterations. For the quantum methods, since the probability density function can be explicitly derived from the quantum state vector, expectation values and success probabilities are computed via numerical integration. The numerical simulations of the quantum algorithms, including QHD and gradient-based QHD, are performed on a Mac Book equipped with an M4 chip. Additional details on the numerical methods employed are provided Appendix E.1. 6.2. Convex optimization To evaluate performance, we conduct a numerical comparison of gradient-based QHD against three baseline algorithms, including SGDM, NAG, and QHD, for convex optimization. The test function used is f(x, y) = (x + y)4 256 + (x y)4 which is a convex yet non-strongly convex function, with a singular Hessian at its unique minimum (0, 0). Notably, the gradient of this function does not satisfy the Lipschitz continuity condition. This flat geometry presents significant challenges for classical methods that rely heavily on curvature information, making it a suitable benchmark for comparative evaluation. All methods are executed with a fixed step size h = 0.2.4 For the quantum variants, the initial evolution time is set to t0 = 1. The parameters of gradient-based QHD are configured as α = 0.1, β = 0, and γ = 5. The performance of these optimization algorithms is visualized in Figure 3, where two key metrics are employed to access convergence: the average function values E[f(Xk)] (depicted in the left subplot) and the average gradient norm E[ f(Xk) 2] (depicted in the right subplot). Both quantities are tracked over iterations 1 k 25. The results reveal distinct convergence behaviors. While the (classical) QHD exhibits a slower convergence rate compared to NAG, the gradient-based QHD stands out by achieving a remarkably faster convergence rate, outperforming all other algorithms. This superior performance highlights the effectiveness of incorporating gradient-based techniques into QHD, particularly for challenging optimization landscapes. 4We have tested various step sizes (h [0.05, 0.5]) for gradient-based QHD and observed similar convergence behavior. To maintain consistency, we fix h = 0.2 in all comparisons. (a) Function value (b) Gradient norm Figure 3. Numerical comparison of various optimization algorithms on the convex objective function (19), including function values and success probability. 6.3. Non-convex optimization More details on these test problems are available in Appendix E.2. We now turn our attention to the numerical comparison of gradient-based QHD against three baseline algorithms, including SGDM, NAG, and QHD, in non-convex optimization settings. Non-convexity introduces significant challenges for classical first-order methods, as local gradient information alone is often insufficient to distinguish the global minimum from other spurious local optima. To illustrate these challenges, we evaluate a variety of nonconvex optimization problem instances characterized by diverse landscape features: (i) Michalewicz function (Figure 4): This function features a flat plateau and a unique global minimum hidden within a sharp valley, posing a difficult search problem. (ii) Cube-Wave function (Figure 5): With over ten local minima (four of which are global minima) concentrated within the cube [ 2, 2]2, this function exemplifies a rugged landscape. (iii) Rastrigin function (Figure 6): This function presents a highly oscillatory landscape with a global minimum at the origin, making it notoriously challenging for optimization algorithms. Due to these intricate characteristics, all three problems are recognized as particularly difficult for classical first-order methods. Additional details about these test functions are provided in Appendix E.2. For the two quantum algorithms, the evolution time starts from t0 = 0. In gradient-based QHD, the parameters are set to α = 0.05, β = 0, and γ = 5. Despite the diversity of non-convex test problems, gradientbased QHD consistently delivers robust and favorable performance. Compared to both QHD and the classical algorithms, it achieves a significantly faster convergence rate Quantum Optimization via Gradient-Based Hamiltonian Descent Figure 4. Numerical comparison of various optimization algorithms on the Michalewicz function, including function values and success probability. Figure 5. Numerical comparison of various optimization algorithms on the Cube-Wave function, including function values and success probability. and yields notably lower terminal objective function values. In the Cube-Wave function, for instance, the final objective value obtained by gradient-based QHD is nearly an order of magnitude lower than that of QHD and two orders of magnitude lower than those achieved by SGDM and NAG. Further numerical analysis highlights that gradient-based QHD attains a higher success probability across all problem instances, indicating that its final states are tightly concentrated around the global minimizer. In summary, by leveraging gradient information within the quantum Hamiltonian Figure 6. Numerical comparison of various optimization algorithms on the Rastrigin function, including function values and success probability. framework, gradient-based QHD demonstrates enhanced global convergence properties, outperforming QHD and classical optimization methods. 7. Conclusion and Future Work In this paper, we propose gradient-based QHD for continuous optimization problems without constraints. We prove the convergence of the gradient-based QHD dynamics in both function values and gradient norms via a Lyapunov function approach. We also discuss an efficient implementation of discrete-time gradient-based QHD using a faulttolerant quantum computer. Our numerical results show that gradient-based QHD achieves improved convergence with a higher chance of identifying the global minimum in a sophisticated optimization landscape. Our theoretical analysis has primarily focused on the convergence of gradient-based QHD in continuous time, while the long-term behavior of the discrete-time algorithm deserves further investigation. The numerical experiments are limited to 2D problems due to the exponential growth of computational overhead. Developing new numerical techniques could help evaluate the advantages of quantum Hamiltonianbased algorithms for high-dimensional optimization. Software and Data The source code of the experiments is available at https://github.com/jiaqileng/ Gradient-Based-QHD. Quantum Optimization via Gradient-Based Hamiltonian Descent Acknowledgements J. L. is partially supported by the Simons Quantum Postdoctoral Fellowship and a Simons Investigator Award in Mathematics through Grant No. 825053. B. S. is partially supported by a startup fund from SIMIS and Grant No.12241105 from NSFC. Most of this work was completed at the University of California, Berkeley, during B.S. s visit in the fall of 2024. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. Ahn, K. and Sra, S. From nesterov s estimate sequence to riemannian acceleration. In Conference on Learning Theory, pp. 84 118. PMLR, 2020. An, D., Fang, D., and Lin, L. Time-dependent unbounded hamiltonian simulation with vector norm scaling. Quantum, 5:459, 2021. An, D., Fang, D., and Lin, L. Time-dependent hamiltonian simulation of highly oscillatory dynamics and superconvergence for schr odinger equation. Quantum, 6:690, 2022. Augustino, B., Leng, J., Nannicini, G., Terlaky, T., and Wu, X. A quantum central path algorithm for linear optimization. ar Xiv preprint ar Xiv:2311.03977, 2023. Berry, D. W., Childs, A. M., Su, Y., Wang, X., and Wiebe, N. Time-dependent hamiltonian simulation with L1-norm scaling. Quantum, 4:254, 2020. Betancourt, M., Jordan, M. I., and Wilson, A. C. On symplectic optimization. ar Xiv preprint ar Xiv:1802.03653, 2018. Chen, Z., Lu, Y., Wang, H., Liu, Y., and Li, T. Quantum langevin dynamics for optimization. ar Xiv preprint ar Xiv:2311.15587, 2023. Cheng, J., Zhou, R., Gan, Y., Qian, C., and Liu, J. Quantum hamiltonian descent for graph partition. ar Xiv preprint ar Xiv:2411.14696, 2024. Childs, A. M., Su, Y., Tran, M. C., Wiebe, N., and Zhu, S. Theory of trotter error with commutator scaling. Physical Review X, 11(1):011020, 2021. Childs, A. M., Leng, J., Li, T., Liu, J.-P., and Zhang, C. Quantum simulation of real-space dynamics. Quantum, 6:860, 2022. Gily en, A., Arunachalam, S., and Wiebe, N. Optimizing quantum optimization algorithms via faster quantum gradient computation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1425 1444. SIAM, 2019a. Gily en, A., Su, Y., Low, G. H., and Wiebe, N. Quantum singular value transformation and beyond: exponential improvements for quantum matrix arithmetics. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pp. 193 204, 2019b. Giselsson, P. and Boyd, S. Monotonicity and restart in fast gradient methods. In 53rd IEEE Conference on Decision and Control, pp. 5058 5063. IEEE, 2014. Han, A., Mishra, B., Jawanpuria, P., and Gao, J. Riemannian accelerated gradient methods via extrapolation. In International Conference on Artificial Intelligence and Statistics, pp. 1554 1585. PMLR, 2023. Jordan, S. P. Fast quantum algorithm for numerical gradient estimation. Physical review letters, 95(5):050501, 2005. Kerenidis, I. and Prakash, A. Quantum gradient descent for linear systems and least squares. Physical Review A, 101 (2):022316, 2020. Kong, L. and Tao, M. Quantitative convergences of lie group momentum optimizers. ar Xiv preprint ar Xiv:2405.20390, 2024. Krichene, W., Bayen, A., and Bartlett, P. L. Accelerated mirror descent in continuous and discrete time. Advances in neural information processing systems, 28, 2015. Kushnir, S., Leng, J., Peng, Y., Fan, L., and Wu, X. Qhdopt: A software for nonlinear optimization with quantum hamiltonian descent. INFORMS Journal on Computing, 2024. Leng, J., Hickman, E., Li, J., and Wu, X. Quantum Hamiltonian Descent. ar Xiv preprint ar Xiv:2303.01471, 2023a. Leng, J., Zheng, Y., and Wu, X. A quantum-classical performance separation in nonconvex optimization. ar Xiv preprint ar Xiv:2311.00811, 2023b. Leng, J., Li, J., Peng, Y., and Wu, X. Expanding hardwareefficiently manipulable hilbert space via hamiltonian embedding. ar Xiv preprint ar Xiv:2401.08550, 2024. Lessard, L., Recht, B., and Packard, A. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57 95, 2016. Quantum Optimization via Gradient-Based Hamiltonian Descent Li, H., Ni, H., and Ying, L. On efficient quantum block encoding of pseudo-differential operators. Quantum, 7: 1031, 2023. Lin, T., Ho, N., and Jordan, M. On efficient optimal transport: An analysis of greedy and accelerated mirror descent algorithms. In International Conference on Machine Learning, pp. 3982 3991. PMLR, 2019. Liu, J., Liu, M., Liu, J.-P., Ye, Z., Wang, Y., Alexeev, Y., Eisert, J., and Jiang, L. Towards provably efficient quantum algorithms for large-scale machine-learning models. Nature Communications, 15(1):434, 2024. Liu, Y., Su, W. J., and Li, T. On quantum speedups for nonconvex optimization via quantum tunneling walks. Quantum, 7:1030, 2023. Mizuta, K., Ikeda, T. N., and Fujii, K. Explicit error bounds with commutator scaling for time-dependent product and multi-product formulas. ar Xiv preprint ar Xiv:2410.14243, 2024. Nesterov, Y. A method for solving the convex programming problem with convergence rate o (1/k2). In Dokl akad nauk Sssr, volume 269, pp. 543, 1983. O donoghue, B. and Candes, E. Adaptive restart for accelerated gradient schemes. Foundations of computational mathematics, 15:715 732, 2015. Rebentrost, P., Schuld, M., Wossnig, L., Petruccione, F., and Lloyd, S. Quantum gradient descent and newton s method for constrained polynomial optimization. New Journal of Physics, 21(7):073023, 2019. Shi, B. On the hyperparameters in stochastic gradient descent with momentum. Journal of Machine Learning Research, 25(236):1 40, 2024. Shi, B., Du, S. S., Jordan, M. I., and Su, W. J. Understanding the acceleration phenomenon via high-resolution differential equations. Mathematical Programming, pp. 1 70, 2022. Shi, B., Su, W., and Jordan, M. I. On learning rates and schr odinger operators. Journal of Machine Learning Research, 24(379):1 53, 2023. Siegel, J. W. Accelerated optimization with orthogonality constraints. ar Xiv preprint ar Xiv:1903.05204, 2019. Su, W., Boyd, S., and Candes, E. J. A differential equation for modeling nesterov s accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17(153):1 43, 2016. Wibisono, A., Wilson, A. C., and Jordan, M. I. A variational perspective on accelerated methods in optimization. proceedings of the National Academy of Sciences, 113(47): E7351 E7358, 2016. Wilson, A. C., Recht, B., and Jordan, M. I. A lyapunov analysis of accelerated methods in optimization. Journal of Machine Learning Research, 22(113):1 34, 2021. Yuen, H. An improved sample complexity lower bound for (fidelity) quantum state tomography. Quantum, 7:890, 2023. Zhang, C., Leng, J., and Li, T. Quantum algorithms for escaping from saddle points. Quantum, 5:529, 2021. Quantum Optimization via Gradient-Based Hamiltonian Descent A. Review of accelerated gradient descent A.1. Accelerated gradient descent as differential equations Accelerated gradient descent methods are fundamental in both theory and practice. Nesterov (Nesterov, 1983) proposed the first accelerated gradient method that has the following update rules (where s > 0 is the step size): xk = yk 1 s f(yk 1), (20a) yk = xk + k 1 k + 2(xk xk 1), (20b) It is known that Nesterov s gradient descent achieves the optimal convergence rate among all gradient-based methods. On the other hand, there has been a long-lasting research attempting to relate gradient-based optimization algorithms with differential equations. A seminal work by Su et al. (Shi et al., 2022) proposed a second-order differential equation to capture the acceleration phenomenon in Nesterov s algorithm. For sufficiently small step size s, the continuous-time limit of (20) is given by the following ordinary differential equation, t X + f(X) = 0, (21) for t > 0, with initial conditions X(0) = x0 and X(0) = 0. The convergence rate of the ODE is O(t 2), which matches that of the discrete-time algorithm (20). The ODE framework of accelerated gradient descent was later generalized via a variational formulation of the underlying dynamics. Wibisono, Wilson, and Jordan (Wibisono et al., 2016) proposed to consider the Bregman Lagrangian, L(X, V, t) = eαt+γt 1 e αt V 2 eβtf(X) , (22) where t 0 is the time, X Rd is the state vector, and V Rd is the velocity.5 Given the Lagrangian function L(X, V, t), we can consider the following variational problem: min Xt J(Xt) = Z 0 L(X, Xt, t)dt, (23) where J(Xt) is a functional defined on smooth curves {Xt : t [0, )}. From the calculus of variations, a curve that minimizes the functional J(Xt) necessarily satisfies the Euler-Lagrange equation: V (Xt, Xt, t) = L X (Xt, Xt, t). (24) Specifically, if we choose αt = log(t), βt = γt = 2 log(t), the resulting Euler-Langrage equation is exactly the continuous-time limit of Nesterov s accelerated gradient descent (21). It is also shown that, if αt, βt, and γt satisfies the following ideal scaling conditions, βt eαt, γt = eαt, (25) the solutions to the Euler-Lagrange equation satisfy f(Xt) f(x ) O(e βt), (26) which gives a convergence rate of the dynamical system in continuous time. 5Here, we give a simplified version of the Bregman Lagrangian in which the Bregman divergence is given by the standard Euclidean distance; for details, see (Wibisono et al., 2016). Quantum Optimization via Gradient-Based Hamiltonian Descent A.2. Understanding acceleration via high-resolution ODEs While the continuous-time formulations of accelerated gradient descent provide a more transparent perspective on the acceleration phenomenon and allow us to introduce the rich toolbox from ODE theory, they offer little understanding of different accelerated gradient descent algorithms with the same continuous-time limit. For example, Polyak s heavy-ball method and NAG have the same continuous-time limit, however, they exhibit strikingly different behaviors in practice: the heavy-ball method generally only achieves local acceleration, while NAG is an acceleration method applicable to all initial values of the iterate (Lessard et al., 2016). The difference between the two algorithms lies in a gradient correction step that only exists in NAG. Inspired by the dimensional-analysis strategies in fluid mechanics, Shi et al. (2022) developed a high-resolution ODE framework to reflect the gradient correction effect in different algorithms with the same low-resolution continuous-time limit. The high-resolution ODEs for NAG are as follows. t X(t) + s 2f(X(t)) X(t) + 1 + 3 s f(X(t)) = 0, (27) for t 3 s/2, with X(3 s/2) = x0, X(3 s/2) = s f(x0). In contrast, the high-resolution ODE for the heavy-ball method does not have the higher-order correction term s 2f(X(t)) X(t), which explains how the gradient correction step improves the overall convergence performance of NAG over the heavy-ball method. The high-resolution ODE framework also motivates the design of a new family of accelerated gradient descent algorithms that maintain the convergence rate of NAG. To prove the convergence of the high-resolution ODEs, Shi et al. (2022) employs the following Lyapunov function (see Shi et al. (2022, Eq. (4.36))): E(t) = t t + s (f(X) f(x )) + 1 2 t X + 2(X x ) + t s f(X) 2. (28) Let X(t) be the solution to (27), it is proven in Shi et al. (2022, Lemma 5) that for all t 3 s/2 f(X) 2 < 0. (29) As a direct consequence, for any t 3 s/2, we have f(X(t)) f(x ) (4 + 3s L) x0 x 2 t(2t + s) . (30) B. Two paradigms of quantum optimization Based on how the solution is encoded in a quantum state, there are two major paradigms in designing quantum algorithms for continuous optimization problems. In this section, we briefly discuss the two paradigms and compare their respective pros and cons. Solution vector as a quantum state. The first paradigm uses amplitude encoding, where an n-dimensional vector v is encoded into a q-qubit state with q = log2(n) : j=1 vj |j , where {|j }2q 1 j is the set of computational basis. This approach encompasses a vast majority of works in quantum optimization, including (Kerenidis & Prakash, 2020; Liu et al., 2024; Rebentrost et al., 2019). In this encoding scheme, the solution vector can be represented using O(log(n)) qubits and it allows us to exploit the rich quantum numerical linear algebra toolbox to accelerate existing classical algorithms. Nevertheless, the downside is that the recovery of the classical vector v from its amplitude-encoded state |v , a task known as quantum state tomography, would inevitably incur a Θ(n/ϵ) overhead due to the Heisenberg limit, where ϵ is the readout precision (Yuen, 2023). Therefore, the quantum state tomography can nullify the potential exponential quantum speedup in the computation. Quantum Optimization via Gradient-Based Hamiltonian Descent Superposition of all possible solutions. Another paradigm uses basis encoding, where an n-dimensional vector v corresponds to a unique computational basis |bv . To see how this works, we assume that each element in the real-valued vector v is represented by a fixed-point number vj with bit length q. Therefore, we can uniquely enumerate all possible solutions (corresponding to all possible fixed-point numbers) in the n-dimensional space using (2q)n computational basis, or equivalently, nq qubits. This encoding scheme is similar to how modern computers store an array with fixed/floating-point arrays. Nevertheless, the difference is that quantum computers can produce a superposition of basis states, i.e., where ρ is a probability distribution over the whole space. In this case, measuring the quantum state |Ψ is equivalent to sampling a point from the distribution ρ. Solving an optimization problem amounts to preparing an approximation of the Dirac-delta distribution at the minimizer x , i.e., the state |x . Compared to the first paradigm, there are two major advantages of this approach: First, there is no obvious fundamental limitation on extracting information from the quantum register, as we can prepare a Dirac-delta-like state for which the probability of obtaining a fixed solution can be arbitrarily close to 1. Second, the superposition of solutions |Ψ is a natural quantum wave function, so we can design a solution path by exploiting the toolbox of continuous-space quantum mechanics, which is historically less explored in the quantum computation literature. The drawback, however, is that we will not have exponentially improved space/qubit complexity to represent the solution. The first approach in principle only uses O(log2(n)) qubits to represent a solution vector, while representing the superposition state |Ψ requires O(n) qubits for an n-dimensional vector. C. Technical details of convergence analysis C.1. Commutation relations in gradient-based QHD This section proves the commutation relations in Lemma 3. Lemma 7. Let g: Rd R be a smooth function. We have i[ˆp2 j, g] = {ˆpj, jg}. Proof. Let φ be a test function. Note that (ˆp2 jg)(φ) = 2 x2 j (gφ) = ( jjgφ + 2 jg jφ + g jjφ) , (gˆp2 j)(φ) = g i[ˆp2 j, g]φ = i ( jjgφ + 2 jg jφ) . Meanwhile, we find that {ˆpj, jg}φ = ( i j)( jgφ) + jg ( i jφ) = i ( jjgφ + 2 jg jφ) , which concludes the proof. Lemma 8. Let g: Rd R, h: Rd R be two smooth functions. We have i[{ˆpj, h}, g] = 2h( jg). Proof. Let φ be a test function. Direct calculation shows that i[{ˆpj, h}, g]φ = i [(pjh + hpj) (gφ) g (pjh + hpj) φ] = i [pj(hgφ) + h(pjg)φ gpj(hφ) gh(pjφ)] = 2ih(pjg)φ = 2h( jg)φ, which implies i[{ˆpj, h}, g] = 2h( jg). This operator is again a multiplicative operator that commutes with both g and h. Quantum Optimization via Gradient-Based Hamiltonian Descent Now, we are ready to prove Lemma 3. Proof. 1. Recall that Aj = t 3/2pj + αt3/2vj, vj = f xj , A2 j = t 3p2 j + α{pj, vj} + α2t3v2 j . i[A2 j, f] = i[t 3p2 j + α{pj, vj}, f] = t 3{pj, vj} + 2αv2 j . The last identity invokes Lemma 7 and Lemma 8. 2. Since the vj part in Aj commutates with x and f, we can drop it from the commutator: i[f, {Aj, xj}] = i[f, {t 3/2pj, xj}] = it 3/2[{pj, xj}, f] = t 3/2xjvj, where we use Lemma 8 in the last step. 3. By dropping the v2 j part in A2 j, we get i[A2 j, x2 k] = i[t 3p2 j + α{pj, vj}, x2 k] = t 3i[p2 j, x2 k] + α[{pj, vj}, x2 k]. By Lemma 7 and Lemma 8, we obtain the following: i[A2 j, x2 k] = ( 0 (j = k) 2t 3{pj, xj} + 4αxjvj (j = k). 4. It can be readily verified that [Aj, Ak] = 0 and [Aj, xk] = 0 for any j = k. Therefore, if j = k, we will have i[A2 j, {Ak, xk}] = 0. When j = k, we first observe that i[Aj, xj] = i[t 3/2pj, xj] = t 3/2i[pj, xj] = t 3/2 due to the canonical commutation relation i[pj, xj] = 1. By leveraging the commutation relation between Aj and xj, i[A2 j, {Aj, xj}] = i A3 jxj + A2 jxj Aj Ajxj A2 j xj A3 j = i A2 j(xj Aj it 3/2) + A2 jxj Aj Ajxj A2 j (Ajxj + it 3/2)A2 j = i 2Aj(xj Aj it 3/2)Aj 2Ajxj A2 j 2it 3/2 = 4t 3/2A2 j. 5. This commutation relation is a direct consequence of Lemma 8. By dropping the vk part in Ak, we have i[v2 j , {Ak, xk}] = it 3/2[{pj, xk}, v2 j ] = 4t 3/2xkvj( kvj) = 4t 3/2 2f Quantum Optimization via Gradient-Based Hamiltonian Descent C.2. Proof of Lemma 2 Proof. By the definition of the Lyapunov function, we have d dt E(t) = t ˆO(t) + i[ ˆH(t), ˆO(t)] t, (31) where [A, B] := AB BA denotes the commutator of operators. In the following calculation, we omit the hat over quantum operators to simplify the notation. First, we calculate the t O(t) part. Direct calculations yield that t5 p2 j + α2tv2 j + 2αxjvj α 2t2 {pj, vj} 2 t3 {pj, xj} + (2t + ω)f. (32) As for the commutator part, it is worth noting that j=1 (t 1/2Aj + 2ˆxj)2 + t2 + ωt f 2x2 j + 1 t1/2 {Aj, xj} 3αtf. Therefore, we have i[H(t), O(t)] = i 2x2 j + {Aj, xj} Invoking the commutation relations 1-4 in Lemma 3 to simplify (34) and combining it with (32), we obtain the following identity: t O + i[H, O] = (2t + ω)(f(x) x f(x)) 2ωf(x). (35) Since f is convex, we have f(x) x f(x) 0 for any x Rd. Since ω = γ 3α 0, it follows that t O+i[H, O] 0 and f(x) 0 for all x Rd, which implies that d dt E(t) = t ˆO(t) + i[ ˆH(t), ˆO(t)] t 0. C.3. Proof of Lemma 5 Proof. Similar to the proof of Lemma 2, we have d dt F(t) = t ˆJ(t) + i[ ˆH(t), ˆJ(t)] t. (36) Direct calculation shows that t ˆJ(t) + i[ ˆH(t), ˆJ(t)] = I1 + I2(t), (37) where I1(t) = 2ωx f(x) + 2t(f(x) x f(x)) 0, (38) j,k=1 [v2 j , {Ak, xk}] = βt G(x) x G(x) 0. Quantum Optimization via Gradient-Based Hamiltonian Descent The second equation uses commutation relation 5 in Lemma 3, and the last inequality is deduced from the convexity of G(x). Combining (38) and (39), we prove the lemma. D. Technical details of complexity analysis Lemma 9. Assume that f : Rd R is a L-Lipschitz function. For sufficiently smooth wave function |Φ , we can prepare a quantum state |Ψ such that Ψ e ih Hk,2Φ ϵ using e O (αdh L) queries to the first-order oracle O f. Here, the e O( ) notation suppressed poly-logarithmic terms in 1/ϵ. Proof. Recall that 2 { i , f} = α j=1 {pj, vj}. Note that this operator is independent of time and thus of k. To simulate the Hamiltonian Hk,2, we need to perform spatial discretization for the operators pj and vj. The standard approach is to consider a d-dimensional regular mesh with N grid points on each dimension, e.g., (An et al., 2022; Childs et al., 2022). The momentum operators can be implemented by applying Quantum Fourier Transform 2 times (with overall gate complexity d poly log(N)), as discussed in (Li et al., 2023). The discretized Hamiltonian operator takes the following form: j=1 ( Pj Vj + Vj Pj), (40) where Pj O(N), and Vj L, with L the Lipschitz constant of f. By using O(1) queries to the first-order oracle O f, we can implement a block-encoding of the matrix e Hk,2 with a normalization factor a O(αd NL), and an additional O(d) ancilla qubits Gily en et al. (2019b, Lemma 29,30). With the block-encoded operator Hk,2, we can perform optimal Hamiltonian simulation by QSVT Gily en et al. (2019b, Corollary 32). The total number of queries to the block-encoding is O (ah + log(1/ϵ)) , with an additional O(d(ah + log(1/ϵ))) elementary gates. Given that the input wave function Φ is sufficiently smooth, the discretization number N can be chosen as N = poly log(1/ϵ) since the spatial discretization can be regarded as a pseudo-spectral method. It turns out that the overall query complexity of the Hamiltonian simulation reads e O (dαh L), where the e O( ) notation suppresses poly-logarithmic terms in 1/ϵ. Now, we are ready to prove Theorem 6. Proof. In Algorithm 1, each iteration requires the implementation of the quantum circuit Uk = e ih Hk,1e ih Hk,2e ih Hk,3. Note that Hk,1 = /(2t3 k) and the Laplacian operator can be diagonalized by Fourier transform, so we can implement e ih Hk,1 using O(d log2(N)) elementary gates. The Hamiltonian Hk,3 is a multiplicative operator with two commuting terms, i.e., e ih Hk,3 = e ih(α2+β)t3 k f 2/2e ih(t3 k+γt2)f. Since the functions f and f 2 are multiplicative operators and reduce to diagonal matrices after spatial discretization. Therefore, the Hamiltonian Hk,3 is fast-forwardable and can be implemented using O(1) uses of the zerothand first-order oracle of f, respectively. Finally, by Lemma 9, the Hamiltonian Hk,2 can be simulated using e O(αdh L) queries to the first-order oracle O f. By iterating these steps for K times, we can implement the quantum algorithm using O(K) queries to the zeroth-order oracle Of and e O(αdh KL) queries to the first-order oracle O f. Quantum Optimization via Gradient-Based Hamiltonian Descent E. Details of numerical experiments E.1. Numerical implementations of optimization algorithms In the numerical experiments, we test four optimization algorithms: Stochastic Gradient Descent with momentum (SGDM), Nesterov s accelerated gradient descent (NAG), Quantum Hamiltonian Descent (QHD), and Gradient-based QHD. Our Python implementation of the numerical algorithms can be found in the supplementary materials. SGDM. The iterative update rules for SGDM are as follows: vk = ηkvk 1 (1 ηk)skgk, xk = xk + vk, where 1 k K is the iteration number, ηk is the momentum coefficient, sk is the step size, gk is an unbiased gradient estimator at xk. We use ηk = 0.5 + 0.4k K , sk = s0 with s0 = 0.01, v0 = 0, and a uniformly random initial guess x0. The gradient estimator gk is obtained by adding a unit Gaussian random noise to the exact gradient f(xk). NAG. The update rules of NAG are as follows: xk = yk 1 s f(yk 1), yk = xk + k 1 k + 2(xk xk 1), for 1 k K. We choose y0 = 0 and a uniformly random initial guess x0. The step size is chosen as s = 0.01. QHD and gradient-based QHD. Both QHD and gradient-based QHD are simulated following Algorithm 1. Note that QHD is a special case of gradient-based QHD with α = β = γ = 0. The simulation is performed in a mesh grid with N = 128 grid points per dimension, with the momentum and kinetic operators implemented using FFT, as discussed in Appendix D. The step size varies with the test problems: We use h = 0.01 for the Styblinski-Tang and Michalewicz function, h = 0.02 for the Cube-Wave function, and h = 0.005 for the Rastrigin function. E.2. Non-convex test problems The test problems used in this paper are defined as follows: 1. Styblinski-Tang function: f(x, y) = 0.2 x4 16x2 + 5x + y4 16y + 5y , where we introduce a normalization factor of 0.2 for a better illustration. This function has a unique global minimizer at (x , y ) = ( 2.9, 2.9), with the minimal function value f(x , y ) 31.33. The numerical algorithms are implemented over the square region { 5 x, y 5}. 2. Michalewicz function: f(x, y) = sin(x) sin x2/π 20 sin(y) sin 2y2/π 20. This function has a unique global minimizer at (x , y ) = (2.2, 1.57), with the minimal function value f(x , y ) 1.8. The numerical algorithms are implemented over a square region {0 x, y π}. 3. Cube-Wave function: f(x, y) = cos(πx)2 + 0.25x4 + cos(πy)2 + 0.25y4. This function has 4 global minima, namely, (x , y ) = ( 0.5, 0.5). The minimal function value is f(x ) 0.03. The numerical algorithms are implemented over a square region { 2 x, y 2}. Quantum Optimization via Gradient-Based Hamiltonian Descent 4. Rastrigin function: f(x, y) = x2 10 cos(2πx) + y2 10 cos(2πy) + 20. This function has a unique global minimizer at (x , y ) = (0, 0), with the minimal function value f(x , y ) = 0. The numerical algorithms are implemented over a square region { 3 x, y 3}.