# Refining Adaptive Zeroth-Order Optimization at Ease

Yao Shu 1, Qixin Zhang 2, Kun He 3, Zhongxiang Dai 4

Recently, zeroth-order (ZO) optimization has played an essential role in scenarios where gradient information is inaccessible or unaffordable, such as black-box systems and resource-constrained environments. While existing adaptive methods such as ZO-AdaMM have shown promise, they are fundamentally limited by their underutilization of moment information during optimization, usually resulting in underperforming convergence. To overcome these limitations, this paper introduces Refined Adaptive Zeroth-Order Optimization (R-AdaZO). Specifically, we first show the untapped variance reduction effect of the first moment estimate on ZO gradient estimation, which improves the accuracy and stability of ZO updates. We then refine the second moment estimate based on these variance-reduced gradient estimates to better capture the geometry of the optimization landscape, enabling a more effective scaling of ZO updates. We present rigorous theoretical analysis showing (I) the first analysis of the variance reduction effect of the first moment estimate in ZO optimization, (II) the improved second moment estimate with a more accurate approximation of its variance-free ideal, (III) the first variance-aware convergence framework for adaptive ZO optimizers, which may be of independent interest, and (IV) the faster convergence of R-AdaZO over existing baselines like ZO-AdaMM. Our extensive experiments, including synthetic problems, black-box adversarial attack, and memory-efficient fine-tuning of large language models (LLMs), further verify the superior convergence of R-AdaZO, indicating that R-AdaZO offers an improved solution for real-world ZO optimization challenges.
1 Hong Kong University of Science and Technology (Guangzhou), 2 Nanyang Technological University, 3 Huazhong University of Science and Technology, 4 The Chinese University of Hong Kong, Shenzhen. Correspondence to: Zhongxiang Dai.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

## 1. Introduction

Zeroth-order (ZO) optimization has emerged as an indispensable technique at the forefront of machine learning, addressing critical challenges where gradient information is either unavailable or computationally prohibitive. This necessity stems from the prevalence of black-box optimization problems, such as adversarial attacks (Ru et al., 2020; Hiranandani et al., 2021), and resource-constrained environments, like fine-tuning large language models (LLMs) on memory-limited devices (Malladi et al., 2023; Zhang et al., 2024b). Consequently, ZO optimization algorithms, which rely solely on function evaluations, have become a crucial alternative to traditional gradient-based methods. Despite the growing body of research in ZO optimization, a significant portion of existing methods adapt stochastic gradient descent (SGD) updates to the ZO setting (Liu et al., 2018a;b; Shu et al., 2023; 2024). This reliance on SGD, however, often leads to performance limitations, especially in complex and non-convex optimization landscapes. The need for more adaptive and versatile update mechanisms is hence evident. However, the exploration of adaptive strategies beyond SGD-based updates remains surprisingly limited. While adaptive methods such as ZO-AdaMM (Chen et al., 2019; Nazari et al., 2020) have demonstrated potential in addressing the missing adaptivity in zeroth-order optimization, they are fundamentally limited by their underutilization of moment information, often resulting in suboptimal convergence rates.
This limitation in fact arises from their reliance on noisy and high-variance gradient estimates derived solely from function evaluations, a stark contrast to first-order (FO) methods that leverage direct and more stable gradients. This issue becomes even more pronounced in high-dimensional and complex settings. To address this critical limitation, we introduce Refined Adaptive Zeroth-Order Optimization (R-AdaZO), a novel approach that effectively capitalizes on moment information through two key innovations. First, R-AdaZO is the first to analyze the untapped but inherent variance reduction effect of the first moment estimates on the gradient estimates in ZO optimization, leading to more accurate and stable ZO updates. This is accomplished through the integration of historical gradient estimates, which effectively averages out the estimation noise (Sec. 4.1). Second, R-AdaZO refines the second moment using these variance-reduced gradient estimates, enabling better adaptation to the underlying geometry of the optimization landscape and facilitating a more effective scaling of ZO updates (Sec. 4.2). Beyond simply presenting R-AdaZO, we provide a thorough analysis that combines rigorous theoretical guarantees with extensive empirical validation, demonstrating its effectiveness. Specifically, we first present the assumptions used in our theoretical analysis (Sec. 5.1). We then theoretically show that incorporating first moment estimates into ZO optimization significantly reduces the variance, leading to more stable and reliable ZO updates, and theoretically demonstrate that our refined second moment estimates provide a more accurate approximation of their corresponding variance-free ideal (Sec. 5.2).
We further introduce the first variance-aware framework to prove the convergence of adaptive ZO optimization methods, which is not limited to our specific method and can be used to analyze a wider range of similar algorithms, and theoretically prove that R-AdaZO converges faster than established baseline methods, such as ZO-AdaMM, demonstrating its efficiency in optimization (Sec. 5.3). Through extensive experiments, including synthetic problems (Sec. 6.1), black-box adversarial attack (Sec. 6.2), and memory-efficient LLM fine-tuning (Sec. 6.3), we demonstrate that R-AdaZO consistently outperforms existing methods in practice, exhibiting superior convergence. To summarize, our contributions in this work include:

- We propose R-AdaZO to enhance the utilization of moment information in ZO optimization and significantly improve the convergence of adaptive ZO optimizers.
- We theoretically show (I) the first analysis of the variance reduction of first moment estimates in ZO optimization, (II) the effects of our refined second moment estimates, (III) the first variance-aware convergence framework for adaptive ZO methods, which may be of independent interest, and (IV) the improved convergence of R-AdaZO.
- We use extensive empirical validation to show the consistent performance gains of R-AdaZO over baselines.

## 2. Related Work

Recent ZO optimization research focuses on two key areas: ZO gradient estimation and ZO update rules.

**ZO Gradient Estimation.** Since ZO optimization relies only on function values, gradient estimation is essential for effective optimization. A common approach is to use finite-difference approximations under input perturbations. Nesterov & Spokoiny (2017) propose to use Gaussian random noise perturbations, demonstrating theoretical convergence with smooth perturbations. Other methods propose to use uniform sampling from the unit sphere (Flaxman et al., 2005) or coordinate-wise perturbations (Lian et al., 2016).
These methods often yield noisy gradient estimates. To address this, Cheng et al. (2021) introduce prior-guided gradient estimation, which leverages previous estimates to improve the current one, effectively smoothing the estimation noise. Recently, Shu et al. (2023; 2024) propose using kernel methods to learn a surrogate model of the objective function from historical function values, allowing for more accurate gradient estimation. Another line of work has focused on linear interpolation strategies for more accurate estimates, reusing queries from prior iterations to reduce complexity while maintaining sample quality (Wang et al., 2024b). Note that this paper does not aim to introduce a new gradient estimation approach, but focuses on developing advanced update rules that are applicable to all these existing estimation methods.

**ZO Update Rules.** Building upon the estimated gradients from various ZO estimation methods, ZO optimizers often directly adopt update rules from first-order (FO) optimization. E.g., a large portion of existing ZO optimizers use stochastic gradient descent (SGD) and its variants as their update mechanism (Ghadimi & Lan, 2013; Ghadimi et al., 2016; Nesterov & Spokoiny, 2017; Liu et al., 2018b;a; 2019; Cheng et al., 2021; Shu et al., 2023). While simple to apply, the slow convergence of SGD has motivated a few efforts (Chen et al., 2019; Nazari et al., 2020; Jiang et al., 2024) to explore the use of adaptive methods, such as Adam (Kingma & Ba, 2015), as the ZO update rule. However, these attempts often under-utilize the moment information inherent in adaptive methods when applied to ZO optimization, leading to suboptimal convergence. This paper addresses this critical issue by proposing refined update rules that are specifically designed to better leverage moment information, ultimately leading to more efficient ZO optimization.
## 3. Background

This paper tackles a stochastic zeroth-order (ZO) optimization problem, aiming to minimize the expected value of a function:

$$\min_{\theta \in \mathbb{R}^d} F(\theta) \triangleq \mathbb{E}_{\xi}\left[f(\theta; \xi)\right] \tag{1}$$

where θ ∈ ℝ^d and f(θ; ξ) is a scalar-valued function whose evaluation depends on the parameters θ and a random variable ξ sampled from an underlying distribution. Crucially, we have access only to function evaluations f(θ; ξ) and not the gradient ∇_θ f(θ; ξ). Throughout this paper, we adopt the following notational conventions. Vectors are represented in boldface, e.g., θ, and scalar constants are denoted by uppercase letters, e.g., L. All vector operations are assumed to be element-wise unless explicitly stated otherwise. We denote by ∇_i F the partial derivative of the function F with respect to the i-th coordinate.

**ZO Gradient Estimation.** In ZO optimization, the absence of direct access to the gradients ∇_θ f(θ; ξ) necessitates gradient estimation techniques that rely solely on function evaluations. A widely used method is to approximate gradients using finite differences. E.g., let random vectors {u_k}_{k=1}^{K} be drawn uniformly from the sphere S^{d−1} of a unit ball B^d; a common ZO gradient estimator, which is used throughout this paper, can be formed as:

$$\hat{\nabla} f(\theta; \xi) \triangleq \frac{d}{K\mu} \sum_{k=1}^{K} \big( f(\theta + \mu u_k; \xi) - f(\theta; \xi) \big)\, u_k \tag{2}$$

where µ > 0 is a smoothing parameter and K is the number of random vectors. While this paper utilizes this specific ZO gradient estimator as its foundation, the proposed method is extensible to other ZO gradient estimators as well.

**Adaptive ZO Optimization.** ZO optimization methods with a fixed step size typically suffer from slow convergence. To address this, adaptive methods like ZO-AdaMM (Chen et al., 2019) are used, which incorporate momentum via first moment estimates and per-parameter learning rates via second moment estimates. Specifically, in ZO-AdaMM, the parameter updates are computed as follows for every iteration t (see also Algo.
1):

$$m_t \leftarrow \beta_1 m_{t-1} + (1-\beta_1)\, g_t \quad \text{(First Moment Est.)}$$
$$v_t \leftarrow \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \quad \text{(Second Moment Est.)}$$
$$\theta_t \leftarrow \theta_{t-1} - \eta\, \frac{m_t}{\sqrt{v_t} + \zeta} \quad \text{(ZO Update)} \tag{3}$$

where g_t = ∇̂f(θ_{t−1}) is defined in (2), β1, β2 ∈ (0, 1) are exponential decay rates for the moment estimates, and ζ is a small constant to prevent division by zero. However, while these adaptive ZO approaches have shown promise, they often underutilize the moment information in the context of ZO optimization: (a) They typically treat the first moment estimate m_t as the standard velocity accumulation in FO optimization, failing to consider its underlying variance reduction effect in ZO optimization obtained by accumulating information from previous gradient estimates. (b) They fail to apply these variance-reduced gradient estimates to refine the second moment estimate v_t, causing a less effective scaling of ZO updates.

## 4. Refined Adaptive ZO Optimization

To address the underutilization of moment information in existing adaptive ZO optimization methods, we introduce R-AdaZO (Refined Adaptive Zeroth-Order Optimization). Specifically, we first analyze the untapped variance reduction effect of first moment estimates on ZO gradient estimation, which is important for accurate and stable ZO updates (Sec. 4.1). We then apply these variance-reduced estimates to construct a refined second moment, enabling more effective scaling of ZO updates (Sec. 4.2).

```
Algorithm 1 ZO-AdaMM
Input: β1, β2, ζ, η, f;  Initialize: θ0, m0, v0
for iteration t ∈ [T] do
    g_t ← ∇̂f(θ_{t−1}, ξ_t)
    m_t ← β1 m_{t−1} + (1 − β1) g_t
    v_t ← β2 v_{t−1} + (1 − β2) g_t²
    θ_t ← θ_{t−1} − η m_t / (√v_t + ζ)
Output: θ_T
```

```
Algorithm 2 R-AdaZO
Input: β1, β2, ζ, η, f;  Initialize: θ0, m0, v0
for iteration t ∈ [T] do
    g_t ← ∇̂f(θ_{t−1}, ξ_t)
    m_t ← β1 m_{t−1} + (1 − β1) g_t
    v_t ← β2 v_{t−1} + (1 − β2) m_t²
    θ_t ← θ_{t−1} − η m_t / (√v_t + ζ)
Output: θ_T
```

### 4.1. Variance Reduction in First Moment Estimates

First moment estimation, while conventionally used for convergence speedup, inherently serves as a variance reduction mechanism for noisy gradients.
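As a concrete illustration, the sphere-sampling estimator in (2) that both algorithms consume can be sketched in a few lines of NumPy (the function name and interface here are ours, not the paper's):

```python
import numpy as np

def zo_gradient(f, theta, mu=1e-3, K=10, rng=None):
    """Finite-difference ZO gradient estimate in the style of Eq. (2)."""
    rng = np.random.default_rng() if rng is None else rng
    d = theta.size
    grad = np.zeros(d)
    for _ in range(K):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)  # uniform direction on the unit sphere S^{d-1}
        grad += (f(theta + mu * u) - f(theta)) / mu * u
    return (d / K) * grad
```

Averaged over many directions, the estimate concentrates around the gradient of the smoothed objective; with small K it remains unbiased for that smoothed gradient but noisy, which is exactly the high-variance regime that the first moment estimate below tames.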
To show this, consider the following standard first moment estimate with β1 ∈ (0, 1):

$$m_t \leftarrow \beta_1 m_{t-1} + (1-\beta_1)\, g_t \tag{4}$$

where m_t is the estimated first moment at iteration t and g_t is the gradient estimate at θ_{t−1} via (2). Intuitively, this update averages the current noisy gradient estimate with past, correlated estimates. This averaging process effectively smooths out noise in the gradient estimates, thereby reducing variance. For example, averaging two independent noisy gradient estimates (i.e., m_{t−1} and g_t) of variance σ² with weights β1 and 1 − β1 results in a variance of [β1² + (1 − β1)²]σ², which is less than σ². While current and past gradient estimates are not fully independent in practice, their local correlation still enables variance reduction through this averaging, which we will show theoretically in Sec. 5. While this variance reduction effect has been proven in FO optimization (Liu et al., 2020), it is significantly more crucial in ZO optimization. Unlike FO methods that compute gradients directly with relatively low variance, ZO optimization approximates gradients using function evaluations (as in (2)), resulting in inherently noisier estimates. This disparity underscores the critical importance of the variance reduction effect of first moment estimates in ZO optimization, a connection we are the first to identify. We further provide theoretical support for this in Sec. 5.

### 4.2. Refinement to Second Moment Estimates

The second key innovation of R-AdaZO lies in its refined second moment estimate, which is crucial for the adaptivity in ZO optimization. Existing adaptive ZO methods (Chen et al., 2019; Nazari et al., 2020) update the second moment estimate directly using the squared noisy gradient estimates:

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \,. \tag{5}$$

However, this approach can be suboptimal in the ZO setting owing to the inherent high variance of the gradient estimates in (2), which can lead to unstable and unreliable second moment estimates.
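The two-estimate averaging example above is easy to verify numerically; in this toy simulation (ours) the two estimates are drawn independently, matching the text's idealized illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, beta1, n = 1.0, 0.9, 200_000

# Two independent noisy estimates of the same (here zero) true gradient.
m_prev = sigma * rng.normal(size=n)
g = sigma * rng.normal(size=n)

# One exponential-moving-average step in the form of Eq. (4).
m = beta1 * m_prev + (1 - beta1) * g

# Empirical variance ~ (beta1**2 + (1 - beta1)**2) * sigma**2 = 0.82 < 1.
print(m.var())
```

In the real recursion m_{t−1} is itself an average of earlier estimates, so the cumulative reduction is stronger than this single step suggests; the theoretical factor (1−β1)/(1+β1) appears in Sec. 5.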
We thus propose to address this issue by simply leveraging the variance-reduced gradient information from the first moment. That is, we update the second moment estimate as below, which interestingly shares a similar form with RMSProp (Hinton, 2012):

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, m_t^2 \,. \tag{6}$$

The first moment estimate, as revealed in Sec. 4.1, acts as a variance reduction mechanism by averaging historical gradient information. Using the squared first moment estimate therefore likely provides a smoother and more stable second moment estimate. This refinement may thus enable a more accurate representation of the underlying geometry of the optimization landscape, resulting in more effective scaling of ZO updates and thus accelerated convergence. Specifically, consider a scenario where E[m_t] = E[g_t] but m_t has significantly lower variance than g_t due to the smoothing effect; given the same v_{t−1}, the refined v_t in (6) then achieves a smaller expected value than the standard one in (5). Hence, the update step (see (7)) using this refined v_t is likely to be larger, allowing the algorithm to move faster towards the optimum. This claim will be rigorously established in Sec. 5.

### 4.3. Final Algorithm

Given the first and second moment estimates in (4) and (6) respectively, R-AdaZO updates parameters by

$$\theta_t = \theta_{t-1} - \eta\, \frac{m_t}{\sqrt{v_t} + \zeta} \tag{7}$$

where η is the base learning rate and ζ is a small constant for numerical stability. This update rule adaptively scales the effective learning rate based on the local geometry while incorporating the variance-reduced gradient estimates. The complete R-AdaZO algorithm is detailed in our Algo. 2.

**Computational and Memory Complexity.** R-AdaZO incurs the same computational cost of O(Kd) per iteration for moment estimates and ZO updates (excluding function evaluations), and the same memory footprint of O(d) as ZO-AdaMM for moment estimates, where K is the number of function evaluations and d is the dimension of the parameters θ.
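Putting (4), (6), and (7) together, a minimal NumPy sketch of the full R-AdaZO loop (interface, hyperparameter defaults, and the unit initialization of v are our choices); the commented line is the only place it departs from ZO-AdaMM:

```python
import numpy as np

def r_adazo(f, theta0, T=500, eta=0.1, beta1=0.9, beta2=0.999,
            zeta=1e-8, mu=1e-3, K=10, seed=0):
    """Sketch of Algorithm 2 (R-AdaZO) with the sphere estimator of Eq. (2)."""
    rng = np.random.default_rng(seed)
    d = theta0.size
    theta, m = theta0.astype(float).copy(), np.zeros(d)
    v = np.ones(d)  # v_{0,i} > 0 as the theory requires; the scale is ours
    for _ in range(T):
        g = np.zeros(d)
        for _ in range(K):  # ZO gradient estimate, Eq. (2)
            u = rng.normal(size=d)
            u /= np.linalg.norm(u)
            g += (f(theta + mu * u) - f(theta)) / mu * u
        g *= d / K
        m = beta1 * m + (1 - beta1) * g       # first moment, Eq. (4)
        v = beta2 * v + (1 - beta2) * m ** 2  # refined second moment, Eq. (6)
        # ZO-AdaMM would instead use: v = beta2 * v + (1 - beta2) * g**2
        theta = theta - eta * m / (np.sqrt(v) + zeta)  # ZO update, Eq. (7)
    return theta
```

For instance, `r_adazo(lambda x: 0.5 * ((x - 1.0) ** 2).sum(), np.zeros(5))` drives a 5-dimensional quadratic close to its minimum at the all-ones vector.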
**Ease of Implementation.** A key advantage of R-AdaZO is its simple implementation. The core change involves updating the second moment estimate using the squared first moment estimate, a one-line change for existing adaptive ZO optimizers. This minimal change enables easy integration and fast deployment, while improving convergence.

## 5. Theoretical Analysis

This section presents a theoretical foundation for the efficacy of R-AdaZO. We structure our analysis as follows: First, we introduce the required assumptions and preliminaries (Sec. 5.1). Second, we prove the variance reduction in the first moment estimate and the improvement of our refined second moment in R-AdaZO (Sec. 5.2). Finally, we present the first variance-aware convergence framework for adaptive ZO methods and demonstrate the improved convergence of R-AdaZO over other baselines (Sec. 5.3). To ease our proof, we primarily provide the theoretical analysis for (2) with u ∼ Unif(S^{d−1}).

### 5.1. Assumptions and Preliminaries

Our theoretical framework is built upon two fundamental assumptions concerning the non-convex function F. We impose a bounded function value as well as coordinate-wise Lipschitz smoothness (Assump. 1), together with a bounded variance of function values (Assump. 2). Of note, coordinate-wise Lipschitz smoothness is commonly used in the analysis of FO adaptive gradient methods, e.g., (Zhang et al., 2024a; Wang et al., 2024a).

**Assumption 1.** ∀θ, θ′ ∈ ℝ^d and ∀i ∈ [d],

$$|f(\theta; \xi)| \le C \,, \tag{8}$$
$$|\nabla_i F(\theta) - \nabla_i F(\theta')| \le L\, \|\theta - \theta'\| \,. \tag{9}$$

**Assumption 2.** ∀θ ∈ ℝ^d,

$$\mathbb{E}_{\xi}\big[ |f(\theta; \xi) - F(\theta)|^2 \big] \le \sigma^2 \,. \tag{10}$$

Directly establishing the convergence of R-AdaZO through the function F presents a primary challenge for adaptive ZO methods, due to the bias (i.e., E[∇̂f(θ; ξ)] ≠ ∇F(θ)) arising from the gradient estimation in (2). Thus, we propose to prove the convergence of R-AdaZO with respect to the randomized smoothing function F_µ (Duchi et al., 2012) defined in (11), where u is a random vector drawn uniformly from a unit ball B^d and µ > 0 is a smoothing parameter.
Of note, the ZO gradient (2) with u ∼ Unif(S^{d−1}) in fact leads to an unbiased gradient estimation for this smoothing function, which we will show in Lemma 5.1 below.

$$F_\mu(\theta) \triangleq \mathbb{E}_{u \sim B^d}\left[ F(\theta + \mu u) \right] \,. \tag{11}$$

We introduce the following Lemma 5.1 (proof in Appx. A.1) to justify why F_µ, instead of F, serves as a better choice for the convergence framework of adaptive ZO methods.

**Lemma 5.1.** Given (2) with u ∼ Unif(S^{d−1}), let Assump. 1 hold. Then ∀θ ∈ ℝ^d and ∀i ∈ [d],

$$\mathbb{E}\big[\hat{\nabla} f(\theta; \xi)\big] = \nabla F_\mu(\theta) \,, \tag{12}$$
$$|\nabla_i F(\theta) - \nabla_i F_\mu(\theta)| \le \mu L \,. \tag{13}$$

*Remark.* Lemma 5.1 establishes that (a) ∇F_µ is the expectation of the gradient estimate in (2), thereby overcoming the bias challenge mentioned above, and (b) the discrepancy between ∇F_µ and ∇F is bounded above by O(µ), implying that the convergence of R-AdaZO with respect to F can be easily derived after obtaining the results with respect to F_µ. In light of these, F_µ is a good choice for the convergence framework of adaptive ZO methods. In addition, we provide the following Lemma 5.2 (proof in Appx. A.2) to ease our proof.

**Lemma 5.2.** Given (2) with u ∼ Unif(S^{d−1}), let Assump. 1 hold. Then ∀θ, θ′ ∈ ℝ^d and ∀i ∈ [d],

$$|\nabla_i F_\mu(\theta) - \nabla_i F_\mu(\theta')| \le L\, \|\theta - \theta'\| \,, \tag{14}$$
$$\mathbb{E}\big[ |\hat{\nabla}_i f(\theta; \xi) - \nabla_i F_\mu(\theta)|^2 \big] \le \frac{8(\sigma^2 + C^2)\, d}{K \mu^2} \,. \tag{15}$$

*Remark.* Lemma 5.2 establishes that (a) F_µ exhibits the same Lipschitz smoothness as F, and (b) the gradient variance associated with ZO optimization can be substantially large, particularly when K ≪ d and µ is small. Therefore, variance reduction is critical for improved ZO optimization.

### 5.2. Analysis on First and Second Moment Estimates

We first theoretically show the underlying variance reduction effect of the first moment estimate in (4), using the variance Σ² defined below, in Thm. 5.3 (proof in Appx. A.3).

$$\Sigma^2 \triangleq \frac{8(\sigma^2 + C^2)\, d}{K \mu^2} \,. \tag{16}$$

**Theorem 5.3.** Given the first and second moment estimates (4) and (6) respectively, with Assump. 1, 2, ∀t ≥ 1 and ∀i ∈ [d],

$$\mathbb{E}\big[ |m_{t,i} - \nabla_i F_\mu(\theta_{t-1})|^2 \big] \le \underbrace{\frac{1-\beta_1}{1+\beta_1}\, \Sigma^2}_{\text{Variance}} + \underbrace{\frac{\beta_1 (1+\beta_1)\, L^2 \eta^2 d}{(1-\beta_1)^2 (1-\beta_2)} + \beta_1^t\, \mathbb{E}\big[ |\nabla_i F_\mu(\theta_{t-1})|^2 \big]}_{\text{Squared Bias}} \,. \tag{17}$$
*Remark.* To the best of our knowledge, this theorem provides the first fundamental variance-bias decomposition for the first moment estimate in adaptive ZO algorithms. The variance term, (1−β1)/(1+β1) Σ², arises from the randomness in the gradient estimator (2) and reduces Σ² in (15) by a factor of (1−β1)/(1+β1), which can be further improved with a large β1. This theoretically demonstrates the variance reduction effect of the first moment estimate in (4), which goes beyond increasing K to reduce variance. The bias, stemming from the difference between current and past estimates, can be reduced by using a small learning rate η, which limits the magnitude of update steps, or a small β1, which reduces the influence of past estimates. So, this decomposition unveils a fundamental trade-off, controlled by the utilization (i.e., β1) of past estimates, between variance and bias. Particularly, when β1 = 0, (17) simplifies to (15).

We then theoretically show that our refined second moment update in (6) is likely to be a more accurate approximation to its variance-free ideal in (18), and hence may better capture the underlying geometry of the optimization landscape than (5) used in ZO-AdaMM, with the following Thm. 5.4 (proof in Appx. A.4) and Cor. 5.5 (proof in Appx. A.5).

$$v_{t,i} = \beta_2^t\, v_{0,i} + \sum_{\tau=1}^{t} (1-\beta_2)\, \beta_2^{t-\tau}\, |\nabla_i F_\mu(\theta_{\tau-1})|^2 \,. \tag{18}$$

**Theorem 5.4.** Given the second moment estimate (6), with Assump. 1, 2, ∀t ≥ 1 and ∀i ∈ [d],

$$\mathbb{E}[v_{t,i}] \le \beta_2^t\, v_{0,i} + (1-\beta_1)\, \Sigma^2 + \frac{\beta_1 (1+\beta_1)^2 L^2 \eta^2 d}{(1-\beta_1)^2 (1-\beta_2)} + \sum_{\tau=1}^{t} (1-\beta_2)\, \beta_2^{t-\tau}\, \mathbb{E}\big[ |\nabla_i F_\mu(\theta_{\tau-1})|^2 \big] \,. \tag{19}$$

**Corollary 5.5.** Given the second moment estimate (5), with Assump. 1, 2, ∀t ≥ 1 and ∀i ∈ [d],

$$\mathbb{E}[v_{t,i}] \le \beta_2^t\, v_{0,i} + (1+\beta_1)\, \Sigma^2 + \sum_{\tau=1}^{t} (1-\beta_2)\, \beta_2^{t-\tau}\, \mathbb{E}\big[ |\nabla_i F_\mu(\theta_{\tau-1})|^2 \big] \,. \tag{20}$$

*Remark.* Thm. 5.4 introduces a novel variance-dependent upper bound for our refined second moment estimate (6). Compared with the bound (20) in Cor. 5.5 for the conventional estimate (5), our (6) reduces the influence of the gradient estimation variance Σ² by a factor of (1−β1)/(1+β1).
This is crucial in ZO optimization, where Σ² is typically large. While our estimate introduces a bias term, it is small with a small learning rate η. Note that (18) represents the variance-free ideal, which the conventional estimate (5) and our refined estimate (6) aim to approximate. Comparing the bounds in (19) and (20) with (18), our refined estimate (6) better approaches this ideal than (5), particularly when Σ² dominates, thanks to its reduced impact of Σ². This thus enables a better capture of geometry information during optimization and likely leads to improved optimization.

### 5.3. Variance-Aware Convergence Analysis

This section presents the first variance-aware convergence framework for adaptive ZO methods, particularly focusing on the convergence of R-AdaZO and ZO-AdaMM. We first bound the averaged gradient norm of the smoothed function F_µ as a step towards bounding the averaged gradient norm of the original function F. Inspired by (Zhang et al., 2024a), the core proof idea lies in applying Hölder's inequality to decompose this target into two components (Lemma 5.6): one involving the averaged square root norm of the second moment estimate, which will be variance-dependent, and another involving a gradient norm normalized by the second moment estimate. The subsequent analysis then focuses on bounding these two components using Lemma 5.7 and Thm. 5.8, respectively. The first component, i.e., the averaged square root norm of the second moment, can be upper-bounded with Thm. 5.4 (see Lemma 5.7). The second component, i.e., the gradient norm normalized by the second moment, is shown to be of order O(ε²) (see Thm. 5.8). By combining these bounds and incorporating the connection between F and F_µ in Lemma 5.1, we arrive at the final convergence results for R-AdaZO (Thm. 5.9) and ZO-AdaMM (Thm. 5.10). We first introduce Lemma 5.6 (proof in Appx. A.6).

**Lemma 5.6.**
∀T ≥ 1, we have that

$$\left( \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\, \|\nabla F_\mu(\theta_t)\| \right)^{\!2} \le \underbrace{\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[ \sum_{i=1}^{d} \left( \sqrt{\beta_2\, v_{t,i}} + \zeta \right) \right]}_{A} \cdot \underbrace{\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[ \sum_{i=1}^{d} \frac{|\nabla_i F_\mu(\theta_t)|^2}{\sqrt{\beta_2\, v_{t,i}} + \zeta} \right]}_{B} \,. \tag{21}$$

*Remark.* Chen et al. (2019); Nazari et al. (2020) bound B alone to demonstrate the convergence of adaptive ZO methods. However, we argue that this bound alone fails to include the effects of the second moment estimate and therefore provides incomplete convergence information. In contrast, Lemma 5.6 allows us to include the effects of the second moment (i.e., A) and directly bound a more relevant quantity, (1/T) Σ_{t=0}^{T−1} E‖∇F_µ(θ_t)‖. Note that this metric is a more widely accepted convergence criterion in optimization theory, directly measuring the distance to a stationary point (Arjevani et al., 2023; Zhang et al., 2024a). Overall, Lemma 5.6 enables us to provide a variance-aware convergence analysis, strengthening the understanding of the convergence behavior of adaptive ZO methods.

Leveraging Lemma 5.6, we then proceed to bound the terms A and B in Lemma 5.7 (proof in Appx. A.7) and Thm. 5.8 (proof in Appx. A.8), respectively. These results rely on the following definition of V, resulting from Thm. 5.4:

$$V^2 \triangleq v_0 + \underbrace{(1-\beta_1)\, \Sigma^2}_{\text{Variance}} + \underbrace{\frac{\beta_1 (1+\beta_1)^2 L^2 \eta^2 d}{(1-\beta_1)^2 (1-\beta_2)}}_{\text{Squared Bias}} \,. \tag{22}$$

**Lemma 5.7.** Given the first and second moment estimates (4) and (6) respectively, with Assump. 1, 2, ∀T ≥ 1,

$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left[ \sum_{i=1}^{d} \left( \sqrt{\beta_2\, v_{t,i}} + \zeta \right) \right] \le d\, (\zeta + V) + (1+\beta_1)\, \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\, \|\nabla F_\mu(\theta_t)\| \,. \tag{23}$$

*Remark.* Lemma 5.7 demonstrates that A in Lemma 5.6 is variance-dependent. Specifically, A is asymptotically dominated by V as T → ∞, because (1/T) Σ_{t=0}^{T−1} E‖∇F_µ(θ_t)‖ gradually decreases during optimization. This highlights that the asymptotic behavior of A is governed by both the bias and the variance present in the first moment estimate (4).

**Theorem 5.8 (Informal).** Given Assump. 1, 2, let 1 − β2 = O(ε²), η = O(ε²), and T = O(ε⁻⁴). The following holds for R-AdaZO if β1 ≤ β2, β2 ≥ 1/2, m_{0,i} = 0, v_{0,i} > 0:

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[ \sum_{i=1}^{d} \frac{|\nabla_i F_\mu(\theta_{t-1})|^2}{\sqrt{\beta_2\, v_{t,i}} + \zeta} \right] \le O(\epsilon^2) \,. \tag{24}$$

*Remark.* Of note, Thm.
5.8 attains the same rate of O(1/√T) as (Chen et al., 2019; Nazari et al., 2020) for this normalized gradient norm. By incorporating Lemma 5.1, Lemma 5.7, and Thm. 5.8 into Lemma 5.6, we derive the following convergence results for R-AdaZO (Thm. 5.9, proof in Appx. A.9) and ZO-AdaMM (Thm. 5.10, proof in Appx. A.10), respectively.

**Theorem 5.9 (Informal).** Given Assump. 1, 2, let 1 − β2 = O(ε²), η = O(ε²), and T = O(ε⁻⁴). We have the following for R-AdaZO if β1 ≤ β2, β2 ≥ 1/2 and ∀i ∈ [d], m_{0,i} = 0, v_{0,i} > 0:

$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\, \|\nabla F(\theta_t)\| \le O\!\left( \frac{1+\beta_1}{\beta_1 (1-\beta_2)}\, \epsilon^2 + \sqrt[4]{d}\, \sqrt{V}\, \epsilon \right) \tag{25}$$

where V is defined in (22).

Figure 1: Convergence comparison among different adaptive ZO optimizers (ZO-SGD, ZO-signSGD, ZO-RMSProp, ZO-AdaMM, and R-AdaZO) for various synthetic functions, in which the y-axis represents the log-scale optimality gap F(θ) − min_{θ′} F(θ′) and the x-axis is the number of iterations T. Each curve denotes the mean from 3 independent runs.

**Theorem 5.10 (Informal).** Given Assump. 1, 2, let 1 − β2 = O(ε²), η = O(ε²), and T = O(ε⁻⁴). We have the following for ZO-AdaMM if β1 ≤ β2, β2 ≥ 1/2 and ∀i ∈ [d], m_{0,i} = 0, v_{0,i} > 0:

$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\, \|\nabla F(\theta_t)\| \le O\!\left( \frac{1+\beta_1}{\beta_1 (1-\beta_2)}\, \epsilon^2 + \sqrt[4]{d}\, \sqrt{\hat{V}}\, \epsilon \right) \tag{26}$$

where

$$\hat{V}^2 \triangleq v_0 + \underbrace{(1+\beta_1)\, \Sigma^2}_{\text{Variance}} \,. \tag{27}$$

*Remark.* To the best of our knowledge, our Thm. 5.9 and Thm. 5.10 are the first analyses to explicitly incorporate the variance-dependent impact of the second moment estimate (measured by V or V̂) into the convergence of adaptive ZO methods. Specifically, Thm. 5.9 and Thm. 5.10 demonstrate that the convergence of (1/T) Σ_{t=0}^{T−1} E‖∇F(θ_t)‖ typically exhibits a dependence of O(√V ε) in adaptive ZO methods, highlighting the importance of an improved second moment estimate with reduced variance.
This explains the advantage of R-AdaZO over other adaptive ZO methods like ZO-AdaMM, thanks to our refined second moment estimate (6) achieving a reduction of at most (1−β1)/(1+β1) on V. Moreover, Thm. 5.9 implies that a larger β1 typically leads to improved convergence of R-AdaZO, especially when the variance Σ² dominates. Comparing Thm. 5.9 and Thm. 5.10, we observe that R-AdaZO achieves a speedup for the convergence of the averaged gradient norm primarily when the variance Σ² dominates. These findings will be further empirically supported by the experiments in our Sec. 6 below.

## 6. Experiments

In this section, we conduct extensive experiments on various practical tasks, including synthetic functions (Sec. 6.1), black-box adversarial attack (Sec. 6.2), as well as memory-efficient fine-tuning of large language models (Sec. 6.3), to empirically support the superior efficacy of our R-AdaZO (Algo. 2).

Figure 2: Effects of (a) first and (b) second moment estimates under varying β1 during the convergence of the Quadratic function. Here, g_t and m_t correspond to the results from the estimated gradient in (2) and the first moment in (4), while v_t (ori) and v_t (ours) are the second moment estimates defined in (5) and (6), respectively. The y-axis in (a) represents the cosine similarity between g_t or m_t and the true gradient ∇F(θ_{t−1}), while the y-axis in (b) denotes the relative error between v_t in (5) or (6) and the v_t computed using the square of the true gradient ∇F(θ_{t−1}).

### 6.1. Synthetic Functions

**On Convergence.** We first compare the convergence of R-AdaZO with ZO-SGD, ZO-signSGD (Liu et al., 2019), ZO-AdaMM, and ZO-RMSProp, an integration of RMSProp (Hinton, 2012) and the ZO gradient estimator, using four synthetic functions with d = 10⁴: the Quadratic, Cubic, Levy, and Rosenbrock functions. We refer to Appx.
B.1 for more details. The results are in Fig. 1, showing that R-AdaZO generally achieves significantly faster convergence while maintaining competitive optimality gaps compared to the other baselines. The consistent gain of R-AdaZO across most synthetic functions suggests that R-AdaZO is robust to the structure of the underlying problem. Furthermore, Fig. 1 reveals a notable similarity in the convergence behavior of ZO-AdaMM and ZO-RMSProp across all four benchmark functions. In contrast, R-AdaZO consistently demonstrates a substantial speedup compared to ZO-RMSProp. These results imply that the first moment itself contributes minimally to the convergence gains in adaptive ZO optimization, and underscore the critical role of our refined second moment estimate in achieving the superior performance of R-AdaZO.

Figure 3: (a-c) Training loss comparison among various adaptive ZO optimizers (ZO-RMSProp, ZO-AdaMM, and R-AdaZO) for the fine-tuning of LLMs under different model sizes (OPT-1.3B and OPT-13B) and datasets (SST-2 and Copa). (d) Training loss comparison of varying β1 in R-AdaZO on OPT-1.3B and SST-2 (Socher et al., 2013). Each curve denotes the mean from 3 independent runs.

**On First Moment.** We further conduct an experimental analysis to understand how β1 affects first moment estimates during the optimization process of the Quadratic function. In Fig. 2(a), we present the results, using cosine similarity to measure the alignment between the estimated gradient g_t or the estimated first moment m_t and the true gradient ∇F(θ_{t−1}). The results indicate that the estimated first moment m_t exhibits better cosine similarity than g_t, resulting from its variance reduction effect, as proven in Thm. 5.3.
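The cosine-similarity probe behind Fig. 2(a) is easy to replicate in miniature; the toy setup below (dimensions, step sizes, and the plain momentum step are our choices, not the paper's exact protocol) checks that m_t aligns better with the true gradient than g_t does:

```python
import numpy as np

rng = np.random.default_rng(1)
d, mu, K, beta1, eta = 100, 1e-3, 2, 0.9, 0.01
f = lambda x: 0.5 * x @ x  # Quadratic; the true gradient at x is x itself
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

theta, m = rng.normal(size=d), np.zeros(d)
cos_g, cos_m = [], []
for t in range(300):
    g = np.zeros(d)
    for _ in range(K):  # noisy ZO estimate of the gradient, Eq. (2)
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)
        g += (f(theta + mu * u) - f(theta)) / mu * u
    g *= d / K
    m = beta1 * m + (1 - beta1) * g  # first moment, Eq. (4)
    cos_g.append(cos(g, theta))      # alignment of the raw estimate
    cos_m.append(cos(m, theta))      # alignment of the first moment
    theta = theta - eta * m
print(np.mean(cos_g), np.mean(cos_m))  # m_t aligns markedly better
```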
Moreover, we observe that increasing β1 generally enhances this variance reduction. However, excessively high values of β1 result in a minor decrease in similarity. This trend is consistent with the trade-off discussed in Thm. 5.3.

On Second Moment. We further conduct an experimental analysis to understand how β1 affects the second moment estimates during the optimization of the Quadratic function. Fig. 2(b) compares the second moment estimates, vt (ori) from (5) and vt (ours) from (6), using the relative error against the second moment estimate based on the squared ground truth (∇F(θt−1))². The results demonstrate that our refined second moment estimate, vt (ours), significantly reduces the relative error compared to the standard second moment estimate, vt (ori), and therefore captures more accurate geometry information during optimization. Interestingly, increasing values of β1 generally lead to a lower relative error, a trend that contrasts with the behavior of the first moment estimates. This lack of a trade-off is likely due to the loose bound we derived for our refined second moment.

Table 1: Comparison of the number of iterations to achieve a successful black-box adversarial attack. Each cell represents mean ± standard deviation from five independent runs.

Measurement       ZO-RMSProp    ZO-AdaMM     R-AdaZO
# Iters (×10³)    15.6 ± 3.2    15.5 ± 4.1   2.9 ± 0.8
Speedup           1.0×          1.0×         5.4×

6.2. Black-Box Adversarial Attack

We further investigate the performance of R-AdaZO in the realm of black-box adversarial attacks, a key application area for zeroth-order optimization (Cheng et al., 2021; Shu et al., 2023). In this context, the objective is to find an optimal perturbation δ that causes a black-box model to misclassify an input image x when perturbed to x + δ.
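For concreteness, a minimal sketch of such an attack objective is given below. This is our illustration with a hypothetical margin-style loss and a clipping-based L∞ projection; the actual loss and constraint handling follow the setup of Shu et al. (2023) and are not reproduced here.

```python
def attack_objective(delta, model_logits, x, true_label, eps=0.2):
    """Hypothetical margin-style black-box attack loss: the value drops below
    zero exactly when the model misclassifies x + delta. `model_logits` is
    treated as a black box (any callable returning a list of class scores)."""
    # enforce the L-infinity constraint by projecting delta into [-eps, eps]
    delta = [max(-eps, min(eps, di)) for di in delta]
    x_adv = [xi + di for xi, di in zip(x, delta)]
    logits = model_logits(x_adv)
    true_score = logits[true_label]
    best_other = max(s for j, s in enumerate(logits) if j != true_label)
    return true_score - best_other  # minimize via ZO: success once negative
```

Since only function values of `attack_objective` are needed, any ZO optimizer (ZO-RMSProp, ZO-AdaMM, or R-AdaZO) can minimize it without access to the model's gradients.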
Following the practice of (Shu et al., 2023), we present a comparative analysis of the number of iterations required for successful black-box adversarial attacks on an image from the MNIST dataset (LeCun et al., 1998), using ZO-RMSProp, ZO-AdaMM, and R-AdaZO, in Tab. 1 (experimental setup in Appx. B.2). As shown in the table, ZO-RMSProp and ZO-AdaMM exhibit similar performance, requiring an average of approximately 15.6 and 15.5 thousand iterations, respectively, with similar standard deviations of about 3,200 and 4,100 iterations. These results align with our findings in Sec. 6.1. In contrast, R-AdaZO requires a significantly lower number of iterations, with an average of only 2,900 and a smaller standard deviation of 800 iterations, suggesting faster and more stable convergence. The resulting 5.4× speedup of R-AdaZO over the baseline ZO-RMSProp is also highlighted in Tab. 1. These findings further underscore the superior efficacy of R-AdaZO.

6.3. Memory-Efficient LLM Fine-Tuning

Recent interest in memory-efficient fine-tuning of large language models using ZO optimization (Malladi et al., 2023; Zhang et al., 2024b) motivates our use of this setting to further demonstrate the superiority of R-AdaZO over other adaptive ZO optimization algorithms (experimental setup in Appx. B.3). The results in Fig. 3(a-c) show that, for both the OPT-1.3B and OPT-13B models (Zhang et al., 2022) and both the SST-2 (Socher et al., 2013) and Copa (Roemmele et al., 2011) datasets, R-AdaZO converges significantly faster than ZO-RMSProp and ZO-AdaMM, achieving a 4.29× speedup for OPT-1.3B and a 3.75× speedup for OPT-13B on SST-2 to reach the same training loss. The optimization curves of ZO-RMSProp and ZO-AdaMM are indistinguishable, echoing the similar convergence behavior observed in Sec. 6.1 and Sec. 6.2.
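As background for why ZO optimization is memory-efficient here, the following minimal sketch shows the seed-replay trick popularized by MeZO (Malladi et al., 2023): the random perturbation is regenerated from a stored seed rather than kept in memory, so only the parameters themselves occupy memory. The toy list-of-floats "model", the function name, and the hyperparameters are ours, not the authors' implementation.

```python
import random

def zo_sgd_step(params, loss_fn, lr=0.01, mu=1e-3, seed=None):
    """One memory-efficient ZO-SGD step in the style of MeZO (Malladi et al.,
    2023): the perturbation z is never stored, only its RNG seed."""
    seed = random.randrange(2**31) if seed is None else seed

    def perturb(scale):
        rng = random.Random(seed)  # replaying the seed reproduces the same z
        for i in range(len(params)):
            params[i] += scale * mu * rng.gauss(0.0, 1.0)

    perturb(+1.0)              # theta + mu * z
    loss_plus = loss_fn(params)
    perturb(-2.0)              # theta - mu * z
    loss_minus = loss_fn(params)
    perturb(+1.0)              # restore the original parameters

    grad_scale = (loss_plus - loss_minus) / (2 * mu)
    rng = random.Random(seed)  # replay the seed once more for the update
    for i in range(len(params)):
        params[i] -= lr * grad_scale * rng.gauss(0.0, 1.0)
    return params
```

Only two forward evaluations and one stored integer per step are required, which is what makes fine-tuning of models like OPT-13B feasible on memory-limited hardware.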
These empirical results strongly support R-AdaZO as a more efficient and effective adaptive ZO optimizer. In addition, the results in Fig. 3(d) show that R-AdaZO typically enjoys improved convergence when a larger β1 is applied, which aligns with the insights provided by our Thm. 5.9.

7. Conclusion

In conclusion, this work introduces R-AdaZO, a novel approach that addresses the critical limitations of existing adaptive ZO methods by effectively leveraging moment information. Through rigorous theoretical analysis, we have demonstrated the inherent variance reduction effect of first moment estimates on ZO gradient estimates, leading to more stable and accurate updates, as well as the improved accuracy of our refined second moment estimates. Furthermore, we have established the first variance-aware convergence framework for adaptive ZO methods and proven the superior convergence rate of R-AdaZO. The consistent empirical performance gains of R-AdaZO across diverse applications underscore its potential as a powerful and practical solution for real-world ZO optimization challenges.

Impact Statement

This paper presents work whose goal is to advance the field of zeroth-order optimization and machine learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. E. Lower bounds for non-convex stochastic optimization. Math. Program., 199(1):165–214, 2023.

Chen, X., Liu, S., Xu, K., Li, X., Lin, X., Hong, M., and Cox, D. ZO-AdaMM: Zeroth-order adaptive momentum method for black-box optimization. In Proc. NeurIPS, 2019.

Cheng, S., Wu, G., and Zhu, J. On the convergence of prior-guided zeroth-order optimization algorithms. In Proc. NeurIPS, 2021.

Duchi, J. C., Bartlett, P. L., and Wainwright, M. J. Randomized smoothing for stochastic optimization. SIAM J. Optim., 22(2):674–701, 2012.

Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proc. SODA, 2005.

Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. arXiv:cs/0408007, 2004.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim., 23(4):2341–2368, 2013.

Ghadimi, S., Lan, G., and Zhang, H. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program., 155(1-2):267–305, 2016.

Hinton, G. Neural networks for machine learning - lecture 6a, 2012. http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.

Hiranandani, G., Mathur, J., Narasimhan, H., Fard, M. M., and Koyejo, S. Optimizing black-box metrics with iterative example weighting. In Proc. ICML, 2021.

Jiang, S., Chen, Q., Pan, Y., Xiang, Y., Lin, Y., Wu, X., Liu, C., and Song, X. ZO-AdaMU optimizer: Adapting perturbation by the momentum and uncertainty in zeroth-order optimization. In Proc. AAAI, 2024.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proc. ICLR, 2015.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, pp. 2278–2324, 1998.

Lian, X., Zhang, H., Hsieh, C., Huang, Y., and Liu, J. A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. In Proc. NIPS, 2016.

Liu, S., Kailkhura, B., Chen, P., Ting, P., Chang, S., and Amini, L. Zeroth-order stochastic variance reduction for nonconvex optimization. In Proc. NeurIPS, 2018a.

Liu, S., Li, X., Chen, P., Haupt, J. D., and Amini, L. Zeroth-order stochastic projected gradient descent for nonconvex optimization. In Proc. GlobalSIP, 2018b.

Liu, S., Chen, P., Chen, X., and Hong, M. signSGD via zeroth-order oracle. In Proc. ICLR, 2019.

Liu, Y., Gao, Y., and Yin, W. An improved analysis of stochastic gradient descent with momentum. In NeurIPS, 2020.

Malladi, S., Gao, T., Nichani, E., Damian, A., Lee, J. D., Chen, D., and Arora, S. Fine-tuning language models with just forward passes. In Proc. NeurIPS, 2023.

Nazari, P., Tarzanagh, D. A., and Michailidis, G. Adaptive first- and zeroth-order methods for weakly convex stochastic optimization problems. arXiv:2005.09261, 2020.

Nesterov, Y. E. and Spokoiny, V. G. Random gradient-free minimization of convex functions. Found. Comput. Math., 17(2):527–566, 2017.

Roemmele, M., Bejan, C. A., and Gordon, A. S. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, 2011.

Ru, B., Cobb, A. D., Blaas, A., and Gal, Y. BayesOpt adversarial attack. In Proc. ICLR, 2020.

Shu, Y., Dai, Z., Sng, W., Verma, A., Jaillet, P., and Low, B. K. H. Zeroth-order optimization with trajectory-informed derivative estimation. In Proc. ICLR, 2023.

Shu, Y., Lin, X., Dai, Z., and Low, B. K. H. Federated zeroth-order optimization using trajectory-informed surrogate gradients. In Workshop on Differentiable Almost Everything (ICML), 2024.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. EMNLP, 2013.

Wang, B., Fu, J., Zhang, H., Zheng, N., and Chen, W. Closing the gap between the upper bound and lower bound of Adam's iteration complexity. In Proc. NeurIPS, 2024a.

Wang, X., Qin, X., Yang, X., and Yan, J. ReLIZO: Sample reusable linear interpolation-based zeroth-order optimization. In Proc. NeurIPS, 2024b.

Zhang, Q., Zhou, Y., and Zou, S. Convergence guarantees for RMSProp and Adam in generalized-smooth non-convex optimization with affine noise variance. arXiv:2404.01436, 2024a.
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. OPT: Open pre-trained transformer language models. arXiv:2205.01068, 2022.

Zhang, Y., Li, P., Hong, J., Li, J., Zhang, Y., Zheng, W., Chen, P., Lee, J. D., Yin, W., Hong, M., Wang, Z., Liu, S., and Chen, T. Revisiting zeroth-order optimization for memory-efficient LLM fine-tuning: A benchmark. In Proc. ICML, 2024b.

A.1. Proof of Lemma 5.1

Based on the definition of $\hat{\nabla} f(\theta, \xi)$ in (2), we first prove (12) in Lemma 5.1 as below:
$$\mathbb{E}\big[\hat{\nabla} f(\theta, \xi)\big] = \frac{d}{K}\sum_{k=1}^{K} \mathbb{E}_{u_k \sim \mathbb{S}^{d-1}} \mathbb{E}_{\xi}\left[\frac{f(\theta + \mu u_k; \xi) - f(\theta; \xi)}{\mu} u_k\right] = \frac{d}{K}\sum_{k=1}^{K} \mathbb{E}_{u_k \sim \mathbb{S}^{d-1}}\left[\frac{F(\theta + \mu u_k) - F(\theta)}{\mu} u_k\right] = \nabla F_\mu(\theta) \,,$$
where the last equality is due to the fact that $\mathbb{E}_{u_k \sim \mathbb{S}^{d-1}}\big[\frac{F(\theta + \mu u_k)}{\mu} u_k\big] = \frac{\nabla F_\mu(\theta)}{d}$, which comes from Lemma 1 in (Flaxman et al., 2004) with $F_\mu$ defined on a unit ball $\mathbb{B}^d$ as shown in (11), and the fact that $\mathbb{E}_{u_k \sim \mathbb{S}^{d-1}}\big[\frac{F(\theta)}{\mu} u_k\big] = 0$ by symmetry.

We then prove (13) in Lemma 5.1 as below:
$$\big\|\nabla F(\theta) - \nabla F_\mu(\theta)\big\| \overset{(a)}{=} \big\|\mathbb{E}_{u \sim \mathbb{B}^d}\left[\nabla F(\theta) - \nabla F(\theta + \mu u)\right]\big\| \overset{(b)}{\le} \mathbb{E}\big\|\nabla F(\theta) - \nabla F(\theta + \mu u)\big\| \overset{(c)}{\le} \mu L\, \mathbb{E}\|u\| \overset{(d)}{\le} \mu L \,,$$
where (a) comes from Leibniz's rule, (b) results from Jensen's inequality, (c) is based on Assump. 1, and (d) is due to the fact that $\|u\| \le 1$. We therefore conclude our proof for Lemma 5.1.

A.2. Proof of Lemma 5.2

With Leibniz's rule, Jensen's inequality, and Assump.
1, the following holds for (14) in Lemma 5.2:
$$\big|\nabla_i F_\mu(\theta) - \nabla_i F_\mu(\theta')\big| = \big|\nabla_i \mathbb{E}_{u \sim \mathbb{B}^d}\left[F(\theta + \mu u) - F(\theta' + \mu u)\right]\big| = \big|\mathbb{E}_{u \sim \mathbb{B}^d}\left[\nabla_i F(\theta + \mu u) - \nabla_i F(\theta' + \mu u)\right]\big| \le \mathbb{E}_{u \sim \mathbb{B}^d}\big|\nabla_i F(\theta + \mu u) - \nabla_i F(\theta' + \mu u)\big| \le L\,\|\theta - \theta'\| \,.$$

We finally prove (15) in Lemma 5.2. Expanding $\hat{\nabla}_i f(\theta, \xi) = \frac{d}{K}\sum_{k=1}^{K} \frac{f(\theta + \mu u_k; \xi) - f(\theta; \xi)}{\mu} u_{k,i}$ against $\nabla_i F_\mu(\theta) = d\, \mathbb{E}_{u_k \sim \mathbb{S}^{d-1}}\big[\frac{F(\theta + \mu u_k) - F(\theta)}{\mu} u_{k,i}\big]$ (which again follows from Lemma 1 in (Flaxman et al., 2004) together with $\mathbb{E}_{u_k \sim \mathbb{S}^{d-1}}\big[\frac{F(\theta)}{\mu} u_{k,i}\big] = 0$), splitting each summand into the stochastic part $\frac{f(\cdot\,; \xi) - F(\cdot)}{\mu} u_{k,i}$ and the deviation part $\frac{F(\cdot)}{\mu} u_{k,i} - \mathbb{E}_{u_k}\big[\frac{F(\cdot)}{\mu} u_{k,i}\big]$, and then applying (b) the independence among $\{u_k\}_{k=1}^{K}$, (c), (d) the Cauchy–Schwarz inequality, (e) Assump. 2 with the definition of variance, and (f) Assump. 1 together with $\mathbb{E}[u_{k,i}^2] = 1/d$, we obtain
$$\mathbb{E}\Big[\big(\hat{\nabla}_i f(\theta, \xi) - \nabla_i F_\mu(\theta)\big)^2\Big] \le \frac{8(\sigma^2 + C^2)d}{K\mu^2} \,. \tag{30}$$
We therefore conclude our proof for Lemma 5.2.

A.3. Proof of Thm. 5.3

We first show the following variance reduction effect in the first moment estimate based on the definition of $\Sigma^2$ in (16):
$$\mathbb{E}\Big[\big(m_{t,i} - \mathbb{E}[m_{t,i}]\big)^2\Big] \overset{(a)}{=} \mathbb{E}\bigg[\Big((1-\beta_1)\sum_{\tau=1}^{t} \beta_1^{t-\tau}\big(\hat{\nabla}_i f(\theta_{\tau-1}, \xi_\tau) - \nabla_i F_\mu(\theta_{\tau-1})\big)\Big)^2\bigg] \overset{(b)}{=} (1-\beta_1)^2 \sum_{\tau=1}^{t} \beta_1^{2(t-\tau)}\, \mathbb{E}\Big[\big(\hat{\nabla}_i f(\theta_{\tau-1}, \xi_\tau) - \nabla_i F_\mu(\theta_{\tau-1})\big)^2\Big] \overset{(c)}{\le} \frac{(1-\beta_1)(1-\beta_1^{2t})}{1+\beta_1}\, \Sigma^2 \,, \tag{31}$$
where (b) comes from the independence among $\{\xi_\tau\}_{\tau=1}^{t}$ and (c) results from Lemma 5.2 and the sum of a geometric series.

Remark. As suggested by (31), the standard bias correction term (i.e., $1 - \beta_1^t$) in Adam (Kingma & Ba, 2015) is intentionally excluded to avoid compromising the variance reduction effect.
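The variance-reduction factor (1 − β1)/(1 + β1) arising from this geometric series can be checked numerically in a simplified setting with iid estimates (our toy check, not part of the proof):

```python
import random

def ema_variance_ratio(beta1, T=2000, n=400, seed=0):
    """Empirical steady-state variance of m_t = beta1*m_{t-1} + (1-beta1)*g_t
    for iid unit-variance g_t, relative to Var[g_t]; the geometric series
    predicts the factor (1 - beta1) / (1 + beta1)."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n):
        m = 0.0
        for _ in range(T):  # run to steady state
            m = beta1 * m + (1 - beta1) * rng.gauss(0.0, 1.0)
        finals.append(m)
    mean = sum(finals) / n
    return sum((x - mean) ** 2 for x in finals) / n
```

For β1 = 0.9, the empirical ratio lands close to (1 − 0.9)/(1 + 0.9) ≈ 0.053, i.e., roughly a 19-fold variance reduction over the raw estimates.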
Refining Adaptive Zeroth-Order Optimization at Ease We then show the bias in the first moment estimate as below, " 1 1 βt 1 E[mt,i] i Fµ(θt 1) τ=1 βt τ 1 i Fµ(θτ 1) i Fµ(θt 1) τ,τ =1 E h D βt τ 1 ( i Fµ(θτ 1) i Fµ(θt 1)), βt τ 1 ( i Fµ(θτ 1) i Fµ(θt 1)) Ei E h i Fµ(θτ 1) i Fµ(θt 1)) 2i + E h i Fµ(θτ 1) i Fµ(θt 1)) 2i βt τ 1 (1 βt 1) 1 β1 E h i Fµ(θτ 1) i Fµ(θt 1)) 2i (e) (1 β1)L2 τ=1 βt τ 1 E h θτ 1 θt 1 2i (f) (1 β1)L2η2 τ=1 βt τ 1 (t τ) " m2 s,i vs,i + ζ (g) (1 β1)L2η2d (1 βt 1)(1 β2) τ=1 βt τ 1 (t τ)2 (h) β1(1 + β1)L2η2d (1 βt 1)(1 β1)2(1 β2) where (c) is from Cauchy-Schwarz inequality, (d) is from the sum of geometric series, (e) is from (14) in Lemma 5.2, (f) is based on the update rule in (7) and Cauchy-Schwarz inequality, (g) is due to the fact that m2 s,i vs,i+ζ m2 s,i (1 β2)m2 s,i . Finally, (h) results from the following: τ=1 τ 2βτ 1 = β1 1 + β1 (t + 1)2βt 1 + (2t2 + 2t 1)βt+1 1 t2βt+2 1 (1 β1)3 β1(1 + β1) (1 β1)3 . (33) By putting the results above together, we then conclude our proof for Thm. 5.3 as below: E h mt,i i Fµ(θt 1) 2i (a) = E h mt,i E mt,i + E mt,i i Fµ(θt 1) 2i (b) = E h mt,i E mt,i 2i + E h E mt,i i Fµ(θt 1) 2i (c) E h mt,i E mt,i 2i + 1 βt 1 E " 1 1 βt 1 E[mt,i] i Fµ(θt 1) + βt 1E h i Fµ(θt 1) 2i 1 + β1 Σ2 + β1(1 + β1)L2η2d (1 β1)2(1 β2) + βt 1E h i Fµ(θt 1) 2i where (b) comes from the independence between mt,i E mt,i and E mt,i i Fµ(θt 1) with respect to {ξτ}t τ, (c) is due the fact that (a + b)2 1 + 1 βt 1 βt 1 a2 + 1 + βt 1 1 βt 1 Refining Adaptive Zeroth-Order Optimization at Ease A.4. Proof of Thm. 5.4 Based on our Thm. 5.3, we naturally can bound m2 t,i as below E h mt,i 2i = E h mt,i i Fµ(θt 1) + i Fµ(θt 1) 2i (1 + β1)E h mt,i i Fµ(θt 1) 2i + 1 + 1 E h i Fµ(θt 1) 2i (1 β1)Σ2 + β1(1 + β1)2L2η2d (1 β1)2(1 β2) + (1 + β1)2 β1 E h i Fµ(θt 1) 2i . As vt,i is the moving average of mt,i, we conclude our proof for Thm. 
5.4 as below E vt,i = E h β2vt 1,i + (1 β2)m2 t,i i E β2vt 1,i + (1 β2) (1 β1)Σ2 + β1(1 + β1)2L2η2d (1 β1)2(1 β2) + (1 + β1)2 β1 E h i Fµ(θt 1) 2i! τ=1 (1 β2)βt τ 2 (1 β1)Σ2 + β1(1 + β1)2L2η2d (1 β1)2(1 β2) + (1 + β1)2 β1 E h i Fµ(θt 1) 2i! βt 2v0,i + (1 β1)Σ2 + β1(1 + β1)2L2η2d (1 β1)2(1 β2) + (1 + β1)2 τ=1 (1 β2)βt τ 2 E h i Fµ(θτ 1) 2i . A.5. Proof of Cor. 5.5 Similar to the proof in Appx. A.4, let gt = ˆ f(θt 1) we have E h gt,i 2i = E h gt,i i Fµ(θt 1) + i Fµ(θt 1) 2i 1 + β1 1 + β1 + β2 1 E h gt,i i Fµ(θt 1) 2i + 1 + 1 + β1 + β2 1 β1 E h i Fµ(θt 1) 2i 1 + β1 + β2 1 Σ2 + (1 + β1)2 β1 E h i Fµ(θt 1) 2i . Consequently, E vt,i = E h β2vt 1,i + (1 β2)g2 t,i i E β2vt 1,i + (1 β2) 1 + β1 + β2 1 Σ2 + (1 + β1)2 β1 E h i Fµ(θt 1) 2i! βt 2v0,i + (1 + β1)2 1 + β1 + β2 1 Σ2 + (1 + β1)2 τ=1 (1 β2)βt τ 2 E h i Fµ(θτ 1) 2i βt 2v0,i + (1 + β1)Σ2 + (1 + β1)2 τ=1 (1 β2)βt τ 2 E h i Fµ(θτ 1) 2i , which concludes our proof for Cor. 5.5. Refining Adaptive Zeroth-Order Optimization at Ease A.6. Proof of Lemma 5.6 By applying H older s inequality twice, we have the following t=0 E Fµ(θt) !2 β2 vt + ζ i 1 " Fµ(θt) 2 p β2 vt + ζ i! 1 T " Fµ(θt) 2 p which concludes our proof. A.7. Proof of Lemma. 5.7 Based on (34), (35), and (36), we have vt,i = β2vt 1,i + (1 β2)m2 t,i β2vt 1,i + (1 β2) (1 + β1) mt,i i Fµ(θt 1) 2 + 1 + 1 i Fµ(θt 1) 2 = β2vt 1,i + (1 β2) (1 + β1) mt,i E mt,i 2 + 2 mt,i E mt,i E mt,i i Fµ(θt 1) + 1 βt 1 1 1 βt 1 E[mt,i] i Fµ(θt 1) 2 + βt 1 i Fµ(θt 1) 2 + 1 + 1 i Fµ(θt 1) 2 ! τ=1 (1 β2)βt τ 2 (1 + β1) mt,i E mt,i 2 + 2 mt,i E mt,i E mt,i i Fµ(θt 1) + 1 βt 1 1 1 βt 1 E[mt,i] i Fµ(θt 1) 2 + (1 + β1)2 i Fµ(θt 1) 2 ! Consequently, β2 vt + ζ i β2V + 1 + β1 β1 1 β2β(t τ)/2 2 E i Fµ(θτ 1) ! ζ + V d + (1 + β1) t=0 E Fµ(θt) where (a), (b) comes from the fact that q Pd i=1 ai Pd i=1 ai. 
In addition, (b) also results from Jensen s inequality by getting the square root of i Fµ(θt 1) 2 within the upper bound of vt,i first (i.e., (40)) and taking expectation over it later, and (c) is due to the sum of geometric series. Finally, (c) is also the consequence of the sum of geometric series. Refining Adaptive Zeroth-Order Optimization at Ease A.8. Proof of Thm. 5.8 Inspired by the proof of Adam (Kingma & Ba, 2015) in FO optimization (Wang et al., 2024a; Zhang et al., 2024a), we focus on the study of the potential function Fµ(xt) with xt defined as below: xt θt β1/ β2θt 1 1 β1/ β2 . (42) Consequently, xt θt = β1/ β2 1 β1/ β2 (θt θt 1) , (43) xt+1 xt = θt+1 θt β1/ β2(θt θt 1) vt+1 + ζ + ηβ1mt/ β2vt + β2ζ According to the Lipschitz smoothness of function Fµ, the following holds conditions on Ft, i.e., the stochastics up to iteration t: E Fµ(xt+1)|Ft Fµ(xt) + E Fµ(xt), xt+1 xt |Ft + d L 2 E h xt+1 xt 2 |Ft i . (45) We first reframe the second term on the RHS of (45) as below using the update rule in (7): E Fµ(xt), xt+1 xt |Ft = E Fµ(θt), xt+1 xt |Ft + E Fµ(xt) Fµ(θt), xt+1 xt |Ft Fµ(θt), η(1 β1) ˆ f(θt, ξt+1) (1 β1/ β2) β2vt + ζ Fµ(θt), ηmt+1/ p vt+1 + ζ + ηmt+1/ β2vt + ζ " Fµ(θt), ηβ1mt/ β2vt + β2ζ ηβ1mt/ β2vt + ζ + E Fµ(xt) Fµ(θt), xt+1 xt Ft We then bound the 1 , 2 , 3 , and 4 term above separately. 
To begin with, we have the following based on the expectation of ˆ f(θt, ξt+1): Fµ(θt), η(1 β1) ˆ f(θt, ξt+1)/ β2vt + ζ " i Fµ(θt) 2 pβ2vt,i + ζ Refining Adaptive Zeroth-Order Optimization at Ease In addition, 2 can be upper bounded as below: Fµ(θt), ηmt+1/ p vt+1 + ζ + ηmt+1/ β2vt + ζ η 1 β1/ β2 E " i Fµ(θt) (1 β2)m2 t+1,i mt+1,i pvt+1,i + ζpβ2vt,i + ζ(pβ2vt,i + ζ + pvt+1,i + ζ) i Fµ(θt) pβ2vt,i + ζ E " 1 β2m2 t+1,i pβ2vt,i + ζ + pvt+1,i + ζ 2γ0 pβ2vt,i + ζ + γ0 2pβ2vt,i + ζ " 1 β2m2 t+1,i pβ2vt,i + ζ + pvt+1,i + ζ 2γ0 pβ2vt,i + ζ + γ0E h m2 t+1,i|Ft i 2pβ2vt,i + ζ E " (1 β2)m2 t+1,i pβ2vt,i + ζ + pvt+1,i + ζ 2 2γ0 pβ2vt,i + ζ + γ0E h m2 t+1,i|Ft i " vt+1,i β2vt,i pvt+1,i + ζpβ2vt,i + ζ pβ2vt,i + ζ + pvt+1,i + ζ 2γ0 pβ2vt,i + ζ + γ0E h m2 t+1,i|Ft i " 1 pβ2vt,i + ζ 1 pvt+1,i + ζ η(1 β1) 4(1 β1/ β2) i Fµ(θt) 2 pβ2vt,i + ζ + (1 β1/ β2)(1 β1)µ2 E " 1 pβ2vt,i + ζ 1 pvt+1,i + ζ where (a) is due to the update rule of second moment estimate in (6), (b) results from |mt+1,i| vt+1,i+ζ |mt+1,i| 1 β2|mt+1,i|, (c) is 2 , (d) is from the update rule in (6), (f) can be obtained by choosing γ0 = 2 1 β1 and pβ2vt,i + ζ > ζ, and (g) results from (35) and (50) below. ˆ if(θ, ξ) = f(θ + µuk; ξ) f(θ; ξ) f(θ + µuk; ξ) f(θ; ξ) τ=1 βt τ 1 ˆ if(θτ 1, ξτ) τ=1 βt τ 1 ˆ if(θτ 1, ξτ) Refining Adaptive Zeroth-Order Optimization at Ease Let 2β2 1, 3 can be bounded as below: 3 E Fµ(θt), ηβ1mt/ β2vt + β2ζ ηβ1mt/ β2vt + ζ (a) = ηβ1 1 β1/ β2 i Fµ(θt) (1 β2)ζ mt,i pβ2vt,i + β2ζpβ2vt,i + ζ(pβ2vt,i + β2ζ + pβ2vt,i + ζ) (b) ηβ1 1 β1/ β2 i Fµ(θt) 1 β2 ζ pβ2vt,i + ζ (c) ηβ1 ζ 1 β1/ β2 2γ1 pβ2vt,i + ζ + γ1(1 β2) 2pβ2vt,i + ζ (d) ηβ1 ζ 1 β1/ β2 2γ1 pβ2vt,i + ζ + ηβ1γ1(1 β2)d 2(1 β1/ β2) (e) = η(1 β1) 4(1 β1/ β2) i Fµ(θt) 2 pβ2vt,i + ζ + ηβ2 1(1 β2)d ζ (1 β1/ β2)(1 β1) where (b) results from |mt,i| vt,i+ζ |mt,i| 1 β2|mt,i| and 2β2 1, (c) is from ab a2 2 , and (e) is obtained by choosing γ1 = 2β1 ζ 1 β1 . 
Finally, 4 is bounded as below: 4 E Fµ(xt) Fµ(θt), xt+1 xt |Ft β1L/ β2 1 β1/ β2 E θt θt 1 θt+1,i θt,i β1/ β2(θt,i θt 1,i) β1L/ β2 (1 β1/ β2)2 E " θt θt 1 2 d θt+1,i θt,i 2 d θt,i θt 1,i 2 d/ β2 2(1 β1/ β2)2 1 + 2β1/ p β2 E h θt θt 1 2 Ft i + E h θt+1 θt 2 Ft i (d) = β1Lη2 d/ β2 2(1 β1/ β2)2 " m2 t,i vt,i + ζ " m2 t+1,i vt+1,i + ζ (52) where (a) is from (43), (44), Cauchy-Schwarz inequality, and the Lipschitz smoothness of Fµ in Lemma 5.2. In addition, (b) is from ab a2 2 , and (d) is based on the update rule in (7). we finally bound the last term on the RHS of (45) as below: d L 2 E h xt+1 xt 2 Ft i = 2(1 β1/ β2)2 E θt+1,i θt,i β1/ p β2(θt,i θt 1,i) 2 Ft 2(1 β1/ β2)2 E h 2 θt+1,i θt,i 2 Ft i + 2β2 1/β2E h θt,i θt 1,i 2 Ft i (1 β1/ β2)2 " m2 t+1,i vt+1,i + ζ " m2 t,i vt,i + ζ Refining Adaptive Zeroth-Order Optimization at Ease By introducing (47), (48), (51), (52), (53) into (45), let β1 β2, m0,i = 0, v0,i > 0, we have the following E Fµ(xt+1) E Fµ(xt) η(1 β1) 2(1 β1/ β2) " i Fµ(θt) 2 pβ2vt,i + ζ 2(1 β1/ β2)2 " m2 t+1,i vt+1,i + ζ 2(1 β1/ β2)2 " m2 t,i vt,i + ζ (1 β1/ β2)(1 β1)µ2 E " 1 pβ2vt,i + ζ 1 pvt+1,i + ζ T ηβ2 1(1 β2)d ζ (1 β1/ β2)(1 β1) η(1 β1) 2(1 β1/ β2) " i Fµ(θt) 2 β2vt + ζ (1 β1/ β2)2 ln βT 2 v0,i + 4C2d2/µ2 /v0,i (1 β1/ β2)(1 β1)µ2 1 ζ + T(1 β2) ζ + T ηβ2 1(1 β2)d ζ (1 β1/ β2)(1 β1) (54) where the last inequality comes from the following (55) and (57). " 1 pβ2vt,i + ζ 1 pvt+1,i + ζ = 1 pβ2v0,i + ζ + " 1 pβ2vt+1,i + ζ 1 pvt+1,i + ζ " 1 pv T,i + ζ " 1 pβ2vt+1,i + ζ β2 pβ2vt+1,i + ζ 1 ζ + T(1 β2) ζ . Moreover, due to the fact that a 1+a ln(1 + a), the following holds: (1 β2)m2 t,i vt,i = (1 β2)m2 t,i vt,i (1 β2)m2 t,i 1 + (1 β2)m2 t,i vt,i (1 β2)m2 t,i 1 + (1 β2)m2 t,i vt,i (1 β2)m2 t,i = ln vt,i β2vt 1,i (56) Given the results above and 2β2 1, we have m2 t,i vt,i + ζ E ln v T,i ln v0,i T ln β2 βT 2 v0,i + 4C2d2/µ2 where the second inequality is due to ln a a 1 and last inequality comes from (50). 
Refining Adaptive Zeroth-Order Optimization at Ease Define Fµ(x1) F µ, by re-arranging (54), we have " i Fµ(θt) 2 pβ2vt,i + ζ 2(1 β1/ β2) ηT(1 β1) + 8Lη d (1 β1/ β2)(1 β1) ln βT 2 v0,i + 4C2d2/µ2 /v0,i T(1 β2) + 2 1 T ζ + 1 β2 ζ + 2β2 1(1 β2)d ζ By choosing 1 β2 = 8C2d3 (1 β1)2µ2 ζ + 2β2 1d ζ (1 β1)2 1 ϵ2/3 O(ϵ2), η = 16Ld 1 ϵ2/3 O(ϵ2) and T = 3ϵ 2 2(1 β1/ β2) η(1 β1) + 8C2d3 (1 β1)2µ2 ξ + 8Lη β2)(1 β1) Pd i=1 ln((βT 2 v0,i+4C2d2/µ2)/v0,i) can simply have the following, " Fµ(θt) 2 p " i Fµ(θt) 2 pβ2vt,i + ζ where the first inequality is due to the fact that P i bi P ai/bi. We therefore conclude our proof of Thm. 5.8. A.9. Proof of Thm. 5.9 By introducing (41) and (59) into Lemma 5.6, we have t=0 E Fµ(θt) !2 β2 vt + ζ i! 1 T " Fµ(θt) 2 p ζ + V d + (1 + β1) t=0 E Fµ(θt) ! By applying the formula for the root of square equation, we have the following t=0 E Fµ(θt) (1 + β1) β1(1 β2) ϵ2 + 4p V d ϵ , (61) which concludes our proof for Thm. 5.9. A.10. Proof of Thm. 5.10 We first introduce Lemma A.1 and Lemma A.2 from (Wang et al., 2024a): Lemma A.1 (Lemma 6 in (Wang et al., 2024a)). The following inequality holds for ZO-Ada MM if 0 < β2 1 < β2 < 1, mt,i pvt,i + ζ 1 β1 1 β2 q 1 β2 1/β2 . (62) Refining Adaptive Zeroth-Order Optimization at Ease Proof. Let gτ,i ˆ if(θτ, ξτ+1), we have mt,i pvt,i + ζ mt,i vt,i = (1 β1) Pt 1 τ=0 βt τ 1 gτ,i q (1 β2) Pt 1 τ=0 βt τ 2 g2 τ,i (a) 1 β1 1 β2 q Pt 1 τ=0 βt τ 2 g2 τ,i r Pt 1 τ=0 β2 1/β2 t τ q Pt 1 τ=0 βt τ 2 g2 τ,i where (a) is due to the Cauchy-Schwarz inequality. Lemma A.2 (Lemma 5 in (Wang et al., 2024a)). The following inequality holds for ZO-Ada MM if 0 < β2 1 < β2 < 1, (1 β2) m2 t,i vt,i + ζ (1 β1)2 1 β1/ β2 2 Proof. Let gτ,i ˆ if(θτ, ξτ+1), similar to (56), we have, (1 β2) g2 t,i vt + ζ ln v T T ln β2. (65) mt,i pvt,i + ζ mt,i vt,i (1 β1) τ=0 βt τ 1 gτ,i vt,i (1 β1) t τ gτ,i vτ,i . 
(66) Therefore, we have, (1 β2) m2 t,i vt,i + ζ (1 β2) (1 β1)2 T 1 X t τ gτ,i vτ,i (a) (1 β2) (1 β1)2 T 1 X t τ g2 τ,i vτ,i (1 β2) (1 β1)2 t τ g2 τ,i vτ,i (b)= (1 β2) (1 β1)2 t τ g2 τ,i vτ,i (1 β1)2 1 β1/ β2 2 (1 β2) g2 t,i vt,i (c) (1 β1)2 1 β1/ β2 2 where (a) comes from Cauchy-Schwarz inequality and (c) is due to (65). In (b) we exchange the order of summation over t and τ. Refining Adaptive Zeroth-Order Optimization at Ease Following the proof idea in Appx. A.7, we have the following for ZO-Ada MM, β2 vt + ζ i β2 ˆV + 1 + β1 β1 1 β2β(t τ)/2 2 E i Fµ(θτ 1) ! ζ + ˆV d + (1 + β1) t=0 E Fµ(θt) where ˆV 2 v0 + (1 + β1) 8(σ2+C2)d In addition, following the proof idea in Appx. A.8, we have E Fµ(xt), xt+1 xt |Ft Fµ(θt), η(1 β1) ˆ f(θt, ξt+1) (1 β1/ β2) β2vt + ζ Fµ(θt), ηmt+1/ p vt+1 + ζ + ηmt+1/ β2vt + ζ " Fµ(θt), ηβ1mt/ β2vt + β2ζ ηβ1mt/ β2vt + ζ + E Fµ(xt) Fµ(θt), xt+1 xt Ft (69) where each item can be upper bounded as below. 1 = η(1 β1) " i Fµ(θt) 2 pβ2vt,i + ζ η(1 β1) 4(1 β1/ β2) i Fµ(θt) 2 pβ2vt,i + ζ + 4η(1 β1)C2d2 (1 β1/ β2)(1 β2 1/β2)µ2 E " 1 pβ2vt,i + ζ 1 pvt+1,i + ζ 3 η(1 β1) 4(1 β1/ β2) i Fµ(θt) 2 pβ2vt,i + ζ + ηβ2 1(1 β1)(1 β2)d ζ (1 β1/ β2)(1 β2 1/β2) , d/ β2 2(1 β1/ β2)2 " m2 t,i vt,i + ζ " m2 t+1,i vt+1,i + ζ Here, both 2 and 3 results from Lemma A.1. Refining Adaptive Zeroth-Order Optimization at Ease Finally, we have E Fµ(xt+1) E Fµ(xt) η(1 β1) 2(1 β1/ β2) " i Fµ(θt) 2 β2vt + ζ d Lη2(1 β1)2 (1 β1/ β2)4 ln βT 2 v0,i + 4C2d2/µ2 /v0,i 4η(1 β1)C2d3 (1 β1/ β2)(1 β2 1/β2)µ2 1 ζ + T(1 β2) ζ + T ηβ2 1(1 β1)(1 β2)d ζ (1 β1/ β2)(1 β2 1/β2) . Define Fµ(x1) F µ, by re-arranging the result above, we have " i Fµ(θt) 2 pβ2vt,i + ζ 2(1 β1/ β2) ηT(1 β1) + 8Lη (1 β1/ β2)3 ln βT 2 v0,i + 4C2d2/µ2 /v0,i T(1 β2) + 2 (1 β2 1/β2)µ2 1 T ζ + 1 β2 ζ + 2β2 1(1 β2)d ζ 1 β2 1/β2 . 
By choosing $1-\beta_2 = O(\epsilon^2)$, $\eta = O(\epsilon^2)$, and $T = O(\epsilon^{-4})$ analogously to Appx. A.8 (with the factors $(1-\beta_1)^2$ replaced by $1-\beta_1^2/\beta_2$), we can simply have the following:
$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\Bigg[\sum_{i=1}^{d} \frac{\big(\nabla_i F_\mu(\theta_t)\big)^2}{\sqrt{\beta_2 v_{t,i} + \zeta}}\Bigg] \le \epsilon^2 \,. \tag{73}$$
By introducing (68) and (73) into Lemma 5.6 and applying the formula for the root of a quadratic equation, we have
$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\big\|\nabla F_\mu(\theta_t)\big\| \le \frac{1+\beta_1}{\beta_1(1-\beta_2)}\,\epsilon^2 + \sqrt[4]{\hat{V}d}\;\epsilon \,, \tag{75}$$
which concludes our proof for Thm. 5.10.

B. Experiments

B.1. Experimental Setup of Synthetic Functions

Let the input be $\theta = [\theta_i]_{i=1}^{d}$. The Quadratic, Cubic, Levy, and Rosenbrock functions applied in our synthetic experiments are given below:
$$F(\theta) = \sum_{i=1}^{d} \theta_i^2 \,, \quad \text{(Quadratic)}$$
$$F(\theta) = \sum_{i=1}^{d} \Big(|\theta_i|^3 + \frac{\theta_i^2}{2}\Big) \,, \quad \text{(Cubic)}$$
$$F(\theta) = \sin^2(\pi w_1) + \sum_{i=1}^{d-1} (w_i - 1)^2\big[1 + 10\sin^2(\pi w_i + 1)\big] + (w_d - 1)^2\big[1 + \sin^2(2\pi w_d)\big] \,, \quad \text{(Levy)}$$
$$F(\theta) = \sum_{i=1}^{d-1} \big[100(\theta_{i+1} - \theta_i^2)^2 + (1 - \theta_i)^2\big] \,, \quad \text{(Rosenbrock)}$$
where $w_i = 1 + \frac{\theta_i - 1}{4}$. Note that all functions share the same minimum of zero, i.e., $\min F(\theta) = 0$. For a fair comparison, we employ the same initialization and hyperparameters for all methods: β1 = 0.9, β2 = 0.99, K = 10, η = 0.001, and µ = 0.005.

B.2. Experimental Setup of Black-Box Adversarial Attack

For the black-box adversarial attack experiment on the MNIST dataset, we use the same fully trained deep neural networks from (Shu et al., 2023) and adopt an L∞ constraint of 0.2 on the input perturbation. For a fair comparison, we employ the same hyperparameters for all methods: β1 = 0.9, β2 = 0.99, K = 2, η = 0.01, and µ = 0.005.

B.3. Experimental Setup of Memory-Efficient LLM Fine-Tuning

For the memory-efficient LLM fine-tuning of OPT-1.3B and OPT-13B on the SST-2 dataset (Socher et al., 2013), we adopt the same configurations as in (Malladi et al., 2023) and choose to fine-tune LLMs with LoRA adapters.
For a fair comparison, we employ the same hyperparameters for all methods: β1 = 0.9, β2 = 0.99, K = 1, η = 0.00005, and µ = 0.001.
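As a lightweight sanity check of the Appx. B.1 objectives, the following sketch implements them directly. The Cubic form is our best reading of a partly garbled formula; Levy and Rosenbrock follow the standard benchmark definitions. All four attain their minimum of zero as stated.

```python
import math

def quadratic(theta):
    return sum(t * t for t in theta)

def cubic(theta):
    # exact constant is hard to read from the extraction; one plausible reading
    return sum(abs(t) ** 3 + t * t / 2 for t in theta)

def levy(theta):
    # standard Levy function with w_i = 1 + (theta_i - 1) / 4, minimum at theta = 1
    w = [1 + (t - 1) / 4 for t in theta]
    head = math.sin(math.pi * w[0]) ** 2
    mid = sum((wi - 1) ** 2 * (1 + 10 * math.sin(math.pi * wi + 1) ** 2)
              for wi in w[:-1])
    tail = (w[-1] - 1) ** 2 * (1 + math.sin(2 * math.pi * w[-1]) ** 2)
    return head + mid + tail

def rosenbrock(theta):
    # standard Rosenbrock function, minimum at theta = 1
    return sum(100 * (theta[i + 1] - theta[i] ** 2) ** 2 + (1 - theta[i]) ** 2
               for i in range(len(theta) - 1))
```

Since these objectives only need to be evaluated (never differentiated), they drop directly into any of the ZO optimizers compared in Sec. 6.1.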