# FLOPS: Forward Learning with Optimal Sampling

Published as a conference paper at ICLR 2025

Tao Ren¹,², Zishi Zhang¹,², Jinyang Jiang¹,², Guanghao Li², Zeliang Zhang³, Mingqian Feng⁴, Yijie Peng¹,²

¹ Guanghua School of Management, Peking University
² Xiangjiang Laboratory
³ Tsinghua Shenzhen International Graduate School, Tsinghua University
⁴ Huazhong University of Science and Technology
⁵ Johns Hopkins University

{rtkenny,zishizhang,jinyang.jiang}@stu.pku.edu.cn, ligh24@mails.tsinghua.edu.cn, hust0426@gmail.com, mfeng10@jh.edu, pengyijie@pku.edu.cn

**Abstract.** Given the limitations of backpropagation, perturbation-based gradient computation methods have recently gained attention for learning with only forward passes, also referred to as queries. Conventional forward learning consumes an enormous number of queries on each data point for accurate gradient estimation through Monte Carlo sampling, which hinders the scalability of those algorithms. However, not all data points deserve equal queries for gradient estimation. In this paper, we study the problem of improving forward-learning efficiency from a novel perspective: how can the gradient estimation variance be reduced at minimum cost? To this end, we allocate the optimal number of queries within a set budget during training to balance estimation accuracy and computational efficiency. Specifically, with a simplified proxy objective and a reparameterization technique, we derive a novel plug-and-play query allocator with minimal parameters. Theoretical results verify its optimality. We conduct extensive experiments for fine-tuning Vision Transformers on various datasets and further deploy the allocator in two black-box applications: prompt tuning and multimodal alignment for foundation models. All findings demonstrate that our proposed allocator significantly enhances the scalability of forward-learning algorithms, paving the way for real-world applications.
The implementation is available at https://github.com/RTkenny/FLOPS-Forward-Learning-with-OPtimal-Sampling.

## 1 Introduction

Ever since the success of backpropagation (BP) (Rumelhart et al., 1986), researchers have sought alternative methods that bypass the iterative computation of the backward pass for the sake of efficiency, interpretability, and biological plausibility (Ma et al., 2020; Nøkland, 2016; Jacot et al., 2018). Moreover, practical scenarios call for scalable forward learning methods when a black-box component renders taking derivatives infeasible or difficult. For example, as model parameters grow, machine learning (ML) systems increasingly integrate with third-party black-box APIs (Achiam et al., 2023). There have been continuous efforts to develop learning algorithms that rely solely on the forward pass (Lillicrap et al., 2020). Forward learning uses multiple queries on one data point to estimate the gradient (Spall, 1992; Zhang et al., 2024). The high dimensionality and the rugged loss landscape of ML problems pose arduous challenges for deriving a low-variance gradient estimator. The current literature has demonstrated that a sufficient number of queries on each data point is necessary for successful training (Chen et al., 2023; Zhang et al., 2024). To achieve good gradient estimation and task performance, all existing methods equally assign a large number of queries to different data points, neglecting the varying difficulty of gradient estimation across the data.

\* These authors contributed equally to this work. Corresponding author.

However, increasing the number of queries is not always helpful for forward learning algorithms. Training a large model on a large dataset in particular requires an extensive number of queries, which poses great challenges to memory and computation cost.
For example, the forward-forward (FF) algorithm (Hinton, 2022) and feedback alignment (FA) (Nøkland, 2016) are only capable of training multilayer perceptrons on MNIST (LeCun, 1998) or CIFAR (Krizhevsky et al., 2009). To scale forward algorithms to large models and datasets, DeepZero (Chen et al., 2023) and MeZO (Malladi et al., 2023) explore the use of simultaneous perturbation stochastic approximation (SPSA) (Spall, 1992) for training convolutional networks from scratch and fine-tuning language models, respectively. But even at moderate scale, DeepZero requires 192 queries to train a pruned ResNet-20 with only 12K parameters, and MeZO needs meticulously selected prompts to ensure performance; its gradient estimator is too noisy due to insufficient query numbers, so it suffers from highly unstable performance.

A trade-off must therefore be considered carefully in forward learning. On the one hand, consuming large numbers of queries on every data point results in high computational cost and long training time. On the other hand, using insufficient queries saves memory and computation but leads to unstable performance. Efficiently utilizing queries within a limited memory and computation budget, so as to reduce gradient estimation variance and improve training efficiency, is at the core of the scalability of forward learning. To achieve a good balance between query cost and gradient estimation variance, we study the query allocation problem under the forward learning framework. We first unify the different estimators under a perturbation-based framework and formulate the allocation problem. Then, we derive the optimal allocation by constructing a query budget allocator that assigns queries adaptively following a data-aware paradigm. An overview of our method and a comparison with previous work can be found in Figure 1.
Specifically, our contributions can be summarized as follows:

1. We unify different BP-free methods under the framework of perturbation-based optimization. We are the first in the literature to formulate the query allocation problem for forward learning.
2. We find a simplified objective for the allocation problem and solve it via the likelihood ratio method in a lightweight style. With the optimal allocation, we can compress the query budget to as few as 20 queries.
3. We theoretically show that the improvement from allocation is guaranteed by a derived lower bound. The allocator assigns queries to different data in a mini-batch according to the variance of their gradient estimates: intuitively, an estimator with high variance is assigned more queries, and vice versa.
4. We conduct extensive experiments fine-tuning Vision Transformers (Dosovitskiy, 2020) on challenging datasets, including ImageNet-1K (Deng et al., 2009), achieving state-of-the-art performance in forward learning. Furthermore, we apply the proposed method to two real-world black-box problems: black-box prompt tuning and alignment between foundation models. Our approach demonstrates scalability and efficiency improvements over all baselines, including backpropagation (BP).

## 2 Related Work

**Backpropagation-free learning.** Various algorithms have been explored that avoid backpropagating the error (Wierstra et al., 2014; Frenkel et al., 2021; Journé et al., 2022; Kao & Hariharan, 2024). Frenkel et al. (2021) eliminate the feedback pathway by leveraging fixed random projections of target labels to hidden layers. Kao & Hariharan (2024) employ a dual-network structure to propagate input and target signals in anti-parallel directions. Forward gradient (Baydin et al., 2022; Silver et al., 2021), applying the chain rule in a different direction than BP, utilizes forward-mode automatic differentiation (AD) to train ML models.
Forward-mode AD needs specific AD software and full access to the model structure; therefore, training deep models in black-box scenarios is not feasible with the forward gradient (Ren et al., 2022).

*Figure 1: Illustration of allocating the query budget of the forward learning paradigm. As shown in (a), previous methods allocate queries equally across different data. Our method, as shown in (b), adaptively allocates the queries with theoretically guaranteed optimality.*

**Perturbation-based forward learning.** Built on the theory of stochastic optimization, researchers have devised surrogate methods to approximate first-order gradients, such as SPSA (Spall, 1992). SPSA injects noise with opposite signs into the model parameters to form a finite-difference-style estimator. Recently, MeZO (Malladi et al., 2023) applied SPSA to fine-tune language models, and DeepZero (Chen et al., 2023) integrates SPSA with sparsity to train networks from scratch with many queries. However, these methods still have limitations, and their query budgets are equally allocated, neglecting differences across data. Jiang et al. (2023) extend the likelihood ratio (LR) estimator to train a wide range of neural networks, but LR has not been verified on modern-scale architectures such as Transformers (Vaswani, 2017), leaving a significant gap for further research on the scalability of forward learning.

**Machine learning for black-box scenarios.** Applications in real-world black-box scenarios call for scalable BP-free algorithms.
When deploying ML algorithms on specific hardware systems, computation resources are limited and BP becomes infeasible. In the fields of physics and chemistry, ML models have to interact with environments that include non-differentiable operations (Momeni et al., 2023; Gu et al., 2021). Moreover, extremely large models are often only accessible through third-party APIs (Sun et al., 2022). Evolutionary algorithms and reinforcement learning have been applied to tune prompts for black-box language models (Diao et al., 2022).

**Advanced techniques for data sampling and reshuffling.** In the stochastic optimization literature, apart from uniform sampling, various methods have been designed to sample data points from the dataset via importance sampling (Zhao & Zhang, 2015), reshuffling (Lu et al., 2022), etc., which can reduce the variance of the stochastic gradient. These methods share some similarities with our OPtimal Sampling. However, the key distinction lies in the probability spaces considered: while the existing methods reduce variance in the probability space associated with sampling data points from the dataset, our approach targets perturbation-based forward learning, where the variance lies in the probability space of the noise injected for gradient estimation.

## 3 A Unified Perspective of Perturbation-Based Forward Learning

Given a generic neural network with no assumption on its architecture, the input is denoted as $x \in \mathbb{R}^{d_x}$, and the output is given by $y = \phi_2(\varphi(\phi_1(x); \theta)) \in \mathbb{R}^{d_y}$, where $\phi_1(\cdot)$ and $\phi_2(\cdot)$ are black-box parts, and $\varphi(\cdot\,; \theta)$ is the targeted module with trainable parameters $\theta \in \Theta \subseteq \mathbb{R}^{d_\theta}$. Conducting backpropagation through $\phi_1(\cdot)$ and $\phi_2(\cdot)$ is infeasible or difficult. Training the neural network amounts to solving the optimization problem $\min_{\theta \in \Theta} \frac{1}{|X|}\sum_{x \in X} \mathcal{L}(y)$, where $X$ is the training dataset and $\mathcal{L}(\cdot)$ is an evaluation procedure yielding the loss feedback.
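As a concrete toy instance of this setup (all module definitions below are illustrative stand-ins, not the paper's models), the following sketch composes two frozen black-box stages around a trainable module and exposes only forward loss evaluations:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi1(x):                       # frozen black-box pre-processing phi_1(.)
    return np.tanh(x)

def varphi(h, theta):              # trainable target module varphi(.; theta)
    return theta @ h

def phi2(h):                       # frozen black-box post-processing phi_2(.)
    return h ** 2

def loss(y):                       # evaluation procedure L(.)
    return float(np.sum(y))

def query(x, theta, z):
    """One forward query: inject noise z at the output of varphi (LR-style)."""
    return loss(phi2(varphi(phi1(x), theta) + z))

x = rng.normal(size=3)
theta = rng.normal(size=(2, 3))
value = query(x, theta, rng.normal(size=2))
print(value >= 0.0)  # squaring in phi2 makes this toy loss non-negative
```

Only `query` is visible to the learner; gradients of `phi1` and `phi2` are never computed.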
The growing scale of parameters in modern neural networks and datasets makes gradient-based methods the only feasible solution. Since forward learning does not rely on the chain rule to compute the true gradient, the primary challenge is estimating the gradient accurately and efficiently.

Unlike approaches requiring computation graphs, such as BP, forward-learning algorithms estimate gradients by perturbing the neural activity (the intermediate output or parameters of the targeted module $\varphi(\cdot\,;\theta)$) and observing the impact of the injected perturbation on the final loss value, thereby solving the stochastic version of the problem $\min_{\theta \in \Theta} \frac{1}{|X|}\sum_{x \in X} \mathbb{E}_{z \sim f(\cdot)}[\mathcal{L}(y)]$, where $z$ is the injected noise with density $f(\cdot)$. Forward learning algorithms can be divided into two categories: the LR family and the SPSA/ES family. The LR family perturbs the intermediate output of the targeted module (Jiang et al., 2023), i.e., $y = \phi_2(\varphi(\phi_1(x); \theta) + z)$. The gradient can be estimated by

$$\nabla_\theta \mathbb{E}[\mathcal{L}(y)] = \mathbb{E}_{z \sim f(\cdot)}[G_{LR}(x, \theta, z)] \triangleq \mathbb{E}_{z \sim f(\cdot)}\big[\mathcal{L}(y)\, (D_\theta^{\varphi(\phi_1(x);\theta)})^{\top} \nabla_z \ln f(z)\big], \tag{1}$$

where $D_b^a \in \mathbb{R}^{d_a \times d_b}$ denotes the Jacobian matrix of $a \in \mathbb{R}^{d_a}$ with respect to $b \in \mathbb{R}^{d_b}$. The SPSA/ES family injects noise into the parameters (Spall, 1992; Salimans et al., 2017), i.e., $y = \phi_2(\varphi(\phi_1(x); \theta + z))$. For continuous noise distributions, the gradient estimation can be derived as a special case of equation (1):

$$\nabla_\theta \mathbb{E}[\mathcal{L}(y)] = \mathbb{E}_{z \sim f(\cdot)}[G_{ES}(x, \theta, z)] \triangleq \mathbb{E}_{z \sim f(\cdot)}[\mathcal{L}(y)\, \nabla_z \ln f(z)], \tag{2}$$

which takes the same form as the evolution strategies (ES) estimator (Salimans et al., 2017). SPSA is a special case of ES that uses Gaussian noise $z \sim \mathcal{N}(0, \sigma^2 I_d)$ and applies an antithetic-variable trick to equation (2) for variance reduction; it can be interpreted as randomized finite differences in high-dimensional space. It takes the form

$$\nabla_\theta \mathbb{E}[\mathcal{L}(y)] = \mathbb{E}_{z \sim \mathcal{N}}[G_{SPSA}(x, \theta, z)] = \mathbb{E}_{z \sim \mathcal{N}}\Big[\frac{z}{2\sigma^2}\big(\mathcal{L}(\phi_2(\varphi(\phi_1(x); \theta + z))) - \mathcal{L}(\phi_2(\varphi(\phi_1(x); \theta - z)))\big)\Big]. \tag{3}$$
To estimate the gradient expectation according to equations (1), (2), and (3), both the SPSA/ES and LR families use Monte Carlo sampling. With $A > 0$ queries of Monte Carlo sampling to estimate the gradient on data point $x$, the estimates take the sample-average form $\nabla_\theta \hat{\mathcal{L}}(y) = \frac{1}{A}\sum_{i=1}^{A} G(x, \theta, z_i)$, where the $z_i$ are sampled independently from $f(\cdot)$ and $G(x, \theta, z_i)$ is the $i$-th sample of the corresponding gradient estimator ($G$ can be $G_{LR}$, $G_{ES}$, or $G_{SPSA}$). The parameters are updated iteratively with a mini-batch of data points randomly drawn from $X$, denoted $\{x_1, \ldots, x_B\}$, where $B$ is the batch size. We denote the gradient estimate on data $x_j$ as $\nabla_\theta \hat{\mathcal{L}}_j(\theta) = \frac{1}{A_j}\sum_{i=1}^{A_j} G(x_j, \theta, z_{i,j})$, where $A_j$ is the allocated query size, and the batch-wise gradient estimate is $\nabla_\theta \hat{\mathcal{L}}(\theta_t) = \frac{1}{B}\sum_{j=1}^{B} \nabla_\theta \hat{\mathcal{L}}_j(\theta_t)$. The update recursion can be written as

$$\theta_{t+1} = \theta_t - \eta_t \nabla_\theta \hat{\mathcal{L}}(\theta_t) = \theta_t - \frac{\eta_t}{B}\sum_{j=1}^{B} \nabla_\theta \hat{\mathcal{L}}_j(\theta_t) = \theta_t - \frac{\eta_t}{B}\sum_{j=1}^{B} \frac{1}{A_j}\sum_{i=1}^{A_j} G(x_j, \theta_t, z_{i,j}), \tag{4}$$

where $\theta_t$ is the network parameter at the $t$-th step and $\eta_t$ is the learning rate. By default, the query amount $A_j$ allocated to estimate the gradient at each data point $x_j$ is equal. However, the difficulty of gradient estimation and the model performance differ across data points. To conserve computational resources while maintaining the quality of gradient estimation, the number of queries should be properly allocated across data within the mini-batch. Utilizing the likelihood ratio technique, we propose a universal accelerator for all perturbation-based training methods from a data-aware perspective.

## 4 Optimal Sampling via Likelihood Ratio Method

### 4.1 Gaussian Allocator and Bernoulli Allocator

**Definition.** The query allocator is defined as a random vector $A = (A_1, \ldots, A_B)$, where each component $A_j$ represents the (relative) number of queries allocated to the $j$-th data point in the batch.
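To make the recursion concrete, here is a minimal sketch (illustrative names and sizes, not the paper's implementation) that plugs the antithetic SPSA estimator into the mini-batch update with a per-point query allocation $A_j$:

```python
import numpy as np

rng = np.random.default_rng(0)
B, d, sigma, eta = 4, 5, 0.05, 0.05
X = rng.normal(size=(B, d))               # mini-batch of data points

def loss(x, theta):
    """Per-example black-box loss: only forward evaluations are available."""
    return float(np.sum((theta - x) ** 2))

def spsa_grad(x, theta, A):
    """Antithetic SPSA estimate on one data point with A queries (2A forward passes)."""
    g = np.zeros(d)
    for _ in range(A):
        z = sigma * rng.normal(size=d)
        g += z / (2 * sigma ** 2) * (loss(x, theta + z) - loss(x, theta - z))
    return g / A

def step(theta, allocation):
    """One step of the update recursion: batch-average the per-point estimates."""
    grad = np.mean([spsa_grad(X[j], theta, allocation[j]) for j in range(B)], axis=0)
    return theta - eta * grad

theta = np.zeros(d)
# Same total budget as equal allocation [20, 20, 20, 20], redistributed unevenly:
for _ in range(50):
    theta = step(theta, [50, 10, 10, 10])
print(np.allclose(theta, X.mean(axis=0), atol=0.5))  # optimum is the batch mean
```

The unequal allocation spends the same total number of queries per step; which split is best is exactly the question the allocator answers.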
The allocator is assumed to follow a probability measure $A \sim P(A \mid \lambda; \Phi)$, parameterized by $\lambda \in \Lambda$ and conditioned on features $\Phi$, such as the loss. At each step of neural network training, a sample of $A$ is drawn from this distribution, and queries are allocated to data points based on the sample. In this work, we select a Gaussian Allocator (GA)

$$A \sim \mathcal{N}(\mu, \Sigma \mid \lambda; \Phi), \tag{5}$$

where $\lambda = (\mu, \Sigma) \in \mathbb{R}^B \times \mathbb{R}^{B \times B}$. The matrix $\Sigma$ captures the correlation structure between data points and facilitates the allocation of queries. Intuitively, structurally similar data points possess similar levels of importance, leading to similar trends in query allocation. Note that the GA has a positive probability of generating negative samples; in practice, any data point receiving a negative allocation is assigned zero queries.

The Bernoulli Allocator (BA) is another lightweight alternative, similar to the idea in Qin et al. (2023). In an equal-allocation scenario, each data point in the batch receives $A$ queries. The BA introduces a probabilistic mechanism where, with probability $p$, the number of queries assigned to a data point is reduced to $A/2$; otherwise, it remains at $A$. The probability $p$ is the only parameter of the allocator. The BA takes the form of a conditional probability distribution

$$P(A_j = A/2 \mid \Phi = \mathcal{L}_j) = \begin{cases} p, & \mathcal{L}_j < \bar{\mathcal{L}}, \\ 0, & \mathcal{L}_j \ge \bar{\mathcal{L}}, \end{cases} \tag{6}$$

where $\mathcal{L}_j$ is the clean loss on data $x_j$ (the loss value without noise injected into the network) and $\bar{\mathcal{L}}$ is the mean loss in the batch. Note that $A$ represents the relative number of queries; the actual allocation proportion for the $j$-th data point is $A_j / \sum_{i=1}^{B} A_i$. This definition eliminates the need for a fixed total budget constraint in the allocator optimization problem below. As a consequence, however, we must specify that the total query budget per step is fixed at $A_0 = A \cdot B$ during training.
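A minimal sketch of the Bernoulli Allocator (the helper name is hypothetical; the thresholding follows Eq. (6): only points with below-average clean loss are eligible for halving):

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_allocate(clean_losses, A, p):
    """With probability p, halve the budget of data points whose clean loss is
    below the batch mean; all other points keep the full budget A."""
    alloc = np.full(len(clean_losses), A)
    easy = clean_losses < np.mean(clean_losses)        # low-loss points
    halved = rng.random(len(clean_losses)) < p
    alloc[easy & halved] = A // 2
    return alloc

losses = np.array([0.2, 0.9, 1.5, 0.1])                # batch mean is 0.675
alloc = bernoulli_allocate(losses, A=20, p=1.0)
print(alloc.tolist())  # [10, 20, 20, 10]
```

With `p=1.0` every below-average point is halved deterministically; smaller `p` prunes only a random subset.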
Note that, since the data points selected in the mini-batch differ at each step, $A_j^t$ is actually a function of $t$. For simplicity of notation, we omit this dependency in the remainder of the paper.

### 4.2 Optimization of the Query Allocator

We employ a gradient-based method to optimize the allocator parameters $\lambda$, introducing an additional optimization step before each iteration of neural network training. However, by formulating an equivalent optimization objective and utilizing reparameterization techniques, we ensure that this auxiliary optimization remains computationally lightweight and incurs minimal overhead. Our objective is to maximize the performance improvement at this step, i.e.,

$$\arg\max_{\lambda \in \Lambda} PI_t(\lambda) \triangleq \mathbb{E}_{A \sim P(\cdot \mid \lambda)}\Big[\sum_{j=1}^{B}\big(\mathcal{L}_j(\theta_t) - \mathcal{L}_j(\theta_{t+1})\big)\Big], \tag{7}$$

where $\mathcal{L}_j$ is the loss corresponding to the $j$-th data point. Assuming that each $\mathcal{L}_j$ is $L$-smooth, $j = 1, \ldots, B$, and that $\Theta$ is a convex set, we can derive a lower bound of the optimization objective $PI_t(\lambda)$:

$$PI_t(\lambda) \ge LB_t(\lambda) \triangleq \mathbb{E}_{A \sim P(\cdot \mid \lambda)}\Big[\eta_t\, \nabla_\theta \hat{\mathcal{L}}(\theta_t)^{\top} \sum_{j=1}^{B}\nabla_\theta \mathcal{L}_j(\theta_t) - \frac{1}{2}\eta_t^2 L\, \big\|\nabla_\theta \hat{\mathcal{L}}(\theta_t)\big\|^2\Big]. \tag{8}$$

Inequality (8) (proof in Appendix A.1) closely resembles the descent lemma commonly used in the analysis of gradient descent. The lower bound $LB_t$ is often utilized as a surrogate optimization objective in the literature on adaptive step sizes (Pirotta et al., 2013) and adaptive batch sizes (Papini et al., 2017). Since our decision variable is now $\lambda$, we can further simplify the form of $LB_t(\lambda)$. Before proceeding, by the Central Limit Theorem, it is canonical to assume that the perturbation-based gradient estimator $\nabla_\theta \hat{\mathcal{L}}_j(\theta_t) = \frac{1}{A_j}\sum_{i=1}^{A_j} G(x_j, \theta_t, z_{i,j})$, $z_{i,j} \sim f(\cdot)$, $j = 1, \ldots, B$, approximately follows a $d$-dimensional Gaussian distribution.

**Assumption 1.** For $j = 1, \ldots, B$, $\nabla_\theta \hat{\mathcal{L}}_j(\theta_t) \sim \mathcal{N}\big(\mu_G^{j,t}, \frac{\Sigma_G^{j,t}}{A_j}\big)$, where $\mu_G^{j,t}$ and $\Sigma_G^{j,t}$ are the expectation and covariance matrix of $G(x_j, \theta_t, z)$, $z \sim f(\cdot)$, respectively.

**Theorem 1 (Equivalent Objective).**
Suppose that Assumption 1 holds. Then maximizing the lower bound $LB_t(\lambda)$ over $\lambda \in \Lambda$ is equivalent to minimizing

$$J_t(\lambda) \triangleq \mathbb{E}_{A \sim P(\cdot \mid \lambda)}\Big[\sum_{j=1}^{B} \frac{\mathrm{Tr}(\Sigma_G^{j,t})}{A_j}\Big] \tag{9}$$

over $\lambda \in \Lambda$, i.e., $\max_{\lambda \in \Lambda} LB_t(\lambda) \Leftrightarrow \min_{\lambda \in \Lambda} J_t(\lambda)$.

The equivalent expression (9) of the optimization objective makes the optimization of the allocator both computationally and statistically tractable (see the proof in Appendix A.2). First, it shows that the optimization of the allocator is independent of both the smoothness parameter $L$ and the expectation $\mu_G^{j,t}$ of the estimator samples; consequently, we avoid estimating these unknown and bias-inducing high-dimensional parameters. More importantly, the optimization does not require estimating the off-diagonal elements of the covariance matrix $\Sigma_G^{j,t}$, which can be statistically and computationally challenging (Fan et al., 2008), especially since $\Sigma_G^{j,t}$ has $d^2$ entries. It only requires estimating the trace of $\Sigma_G^{j,t}$, i.e., the sum of the diagonal variances, which significantly reduces the computational complexity. Moreover, we adopt a subsampling technique in which the trace of the covariance matrix corresponding to the parameters of the deep layers of the neural network serves as an approximation for the overall trace; this is justified because these layers typically account for the majority of the variance in the model. In practice, prior to optimizing the allocator, we draw a small number of initial queries for each data point to estimate $\mathrm{Tr}(\Sigma_G^{j,t})$; in subsequent computations, $\mathrm{Tr}(\Sigma_G^{j,t})$ is replaced by its estimate in a plug-in manner. In addition, Theorem 1 shows that, in general, data points with higher variance should be allocated more query resources. The gradient of $J_t(\lambda)$ with respect to the allocator parameter $\lambda$ is given by

$$\nabla_\lambda J_t(\lambda) = \nabla_\lambda \mathbb{E}\Big[\sum_{j=1}^{B} \frac{\mathrm{Tr}(\Sigma_G^{j,t})}{A_j}\Big] = \mathbb{E}\Big[\sum_{j=1}^{B} \frac{\mathrm{Tr}(\Sigma_G^{j,t})}{A_j}\, \nabla_\lambda \ln P(A)\Big]. \tag{10}$$
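This score-function gradient can be sketched for a simplified allocator (an independent Gaussian over $A$ rather than the full correlated GA; the trace values and all sizes below are hypothetical plug-in estimates):

```python
import numpy as np

rng = np.random.default_rng(0)
B, K = 4, 4096
# Plug-in estimates of Tr(Sigma_G^{j,t}); point 0 is assumed much noisier.
trace_hat = np.array([4.0, 1.0, 1.0, 1.0])

def lr_grad_mu(mu, sig2):
    """LR estimate of grad_mu J(lambda) for A ~ N(mu, sig2*I) (diagonal case)."""
    A = mu + np.sqrt(sig2) * rng.normal(size=(K, B))
    A = np.maximum(A, 1.0)                       # keep sampled allocations positive
    cost = (trace_hat / A).sum(axis=1)           # sum_j Tr_j / A_j for each sample
    score = (A - mu) / sig2                      # grad_mu ln N(A | mu, sig2*I)
    # A mean-cost baseline reduces the variance of the score-function estimator.
    return ((cost - cost.mean())[:, None] * score).mean(axis=0)

mu = np.full(B, 20.0)
g = lr_grad_mu(mu, sig2=4.0)
# Descending J steers queries toward the high-variance point 0:
print(g[0] < min(g[1], g[2], g[3]))
```

The component for the noisy point has the most negative gradient, so a descent step on the allocator mean shifts budget toward it, as Theorem 1 predicts.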
The LR gradient estimator for the allocator takes the sample-average form

$$\nabla_\lambda \hat{J}(\lambda) = \frac{1}{K}\sum_{k=1}^{K}\Big[\sum_{j=1}^{B} \frac{\widehat{\mathrm{Tr}}(\Sigma_G^{j,t})}{A_j^{(k)}}\, \nabla_\lambda \ln P(A^{(k)})\Big], \tag{11}$$

where $A^{(k)} = (A_1^{(k)}, \ldots, A_B^{(k)})$ are i.i.d. samples from $P(\cdot \mid \lambda)$ and $K$ is the number of samples drawn from $P(\cdot \mid \lambda)$. At each step $t$ of neural network training, we employ this gradient estimator to search for the optimal allocator parameter $\lambda_t^* \in \arg\min_{\lambda \in \Lambda} J_t(\lambda)$. Let $LB^*_{1:T}$ and $LB^{equal}_{1:T}$ denote the lower bounds on the cumulative performance improvement of the neural network after $T$ training steps when using the optimal allocator $A \sim P(\cdot \mid \lambda_t^*)$ and an equal allocator $A^{equal}$, respectively. The difference between the two allocation strategies is lower bounded by the following expression.

**Theorem 2 (Theoretical Improvement).** Suppose that Assumption 1 holds. Then

$$\mathbb{E}\big[LB^*_{1:T} - LB^{equal}_{1:T}\big] \ge \sum_{t=1}^{T} \frac{\eta_t^2 L}{2 B A_0} \sum_{j < k} \Big(\sqrt{\mathrm{Tr}(\Sigma_G^{j,t})} - \sqrt{\mathrm{Tr}(\Sigma_G^{k,t})}\Big)^2 \ge 0.$$

Theorem 2 (see the proof in Appendix A.3) quantifies the superiority of our proposed optimal allocator over the vanilla equal-allocation strategy. The greater the differences in $\mathrm{Tr}(\Sigma_G^{j,t})$ across data points (i.e., the stronger the heterogeneity), the more pronounced the superiority of the optimal allocator; the difference equals zero if and only if all $\mathrm{Tr}(\Sigma_G^{j,t})$ are equal.

### 4.3 Reparameterization of the Gaussian Allocator

The original parameter dimension of the Gaussian allocator is $B + \frac{B(B+1)}{2}$, which can be problematic due to its high dimensionality. To reduce the dimensionality of the allocator's optimization problem and to incorporate more relevant features, we adopt a reparameterization approach, i.e., we impose a specific structure on the parameter $\lambda$. Drawing inspiration from the data pruning literature (Qin et al., 2023), the loss value serves as a strong indicator of a data point's characteristics.
Therefore, we assume that the mean of the allocator follows a linear structure: $\mu = \beta_0 + \beta_1 \Phi$, where $\Phi = \tanh(\mathcal{L}(x; \theta))$ is the clean loss without injected noise. The original loss value is passed through a $\tanh$ function to constrain $\Phi$ within the range $[-1, 1]$. The initial values of $\beta_0$ and $\beta_1$ are set to $A$ and $A/2$, respectively. For the covariance structure $\Sigma$, based on the observation that similar data may have similar influences on training, we reparameterize the covariance matrix as follows: the covariance between the allocations of the $i$-th and $j$-th data points is

$$\Sigma(i, j) = \sigma^2 \exp\Big(-\frac{d_{i,j}}{\gamma^2}\Big),$$

where $d_{i,j}$ is the cosine distance between the embeddings of the $i$-th and $j$-th data points, and $\sigma^2$ and $\gamma^2$ are two parameters that control the scale and decay of the covariance, respectively. $\sigma$ is initialized as $A/5$ and $\gamma$ is initialized as 1. This kernel-based reparameterization, also known as the Radial Basis Function (RBF) form, is widely used in the Gaussian process regression and Bayesian optimization literature (Shahriari et al., 2015); it enables us to model dependencies between data based on their pairwise distances. By applying these reparameterization techniques to the covariance matrix $\Sigma$ and mean vector $\mu$, the original high-dimensional optimization problem is transformed into a manageable optimization over just four variables $\lambda = (\beta_0, \beta_1, \sigma, \gamma)$. The pseudocode for forward learning under optimal allocation is given in Appendix B.
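A sketch of the reparameterized allocator (the exact kernel form and the use of cosine distance are assumptions based on the description above; all data below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
A0 = 20                                            # nominal per-point budget A

def gaussian_allocator(clean_losses, embeddings, beta0, beta1, sig, gamma):
    """Sample relative allocation proportions from the reparameterized GA."""
    mu = beta0 + beta1 * np.tanh(clean_losses)     # mean linear in tanh(loss)
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - E @ E.T                           # pairwise cosine distance
    Sigma = sig ** 2 * np.exp(-dist / gamma ** 2)  # RBF-style covariance
    Sigma += 1e-6 * np.eye(len(mu))                # jitter for numerical stability
    A = rng.multivariate_normal(mu, Sigma)
    A = np.maximum(A, 0.0)                         # negative draws get zero queries
    return A / A.sum()                             # relative proportions

losses = np.array([0.1, 0.5, 2.0, 1.2])
emb = rng.normal(size=(4, 8))
props = gaussian_allocator(losses, emb, beta0=A0, beta1=A0 / 2, sig=A0 / 5, gamma=1.0)
print(props)  # four non-negative proportions summing to 1
```

Only the four scalars `(beta0, beta1, sig, gamma)` would be optimized with the LR gradient; the mean and covariance are rebuilt from them at every step.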
## 5 Experiments

### 5.1 Fine-tuning Vision Transformers

**Experimental setting.** We evaluate our method on a diverse set of widely used benchmark datasets, each chosen for unique characteristics and challenges that contribute to a comprehensive analysis of our method's generalization across domains: ImageNet (Deng et al., 2009), Caltech101 (Fei-Fei et al., 2004), Food101 (Bossard et al., 2014), Flowers102 (Nilsback & Zisserman, 2008), CIFAR10/100 (Krizhevsky et al., 2009), and EuroSAT (Helber et al., 2019).

**Evaluation metrics.** We focus on image classification and evaluate performance by accuracy under different training schemes: BP, LR, and SPSA, comparing performance with and without query allocation. The GA is employed in all experiments. Using the different gradient estimators, we conduct full-parameter fine-tuning of the Vision Transformer (both base and large models). We include DeepZero (SPSA with $n$ queries, denoted $n$-SPSA), MeZO (SPSA with only 2 queries), and LR with equally allocated query budgets as baselines. For the SPSA-based methods, we include a version with the optimal allocator (OPS-SPSA) for comparison with the corresponding baselines, and we apply the same setting to LR, comparing the optimally allocated version (OPS-LR) with the vanilla version. We train the networks for 10 epochs with a learning rate of $1 \times 10^{-4}$ and the Adam optimizer; the batch size is 128. All methods use the same query budget, except MeZO, which uses only 2 queries per data point in accordance with its original memory-efficient setting.

As shown in Table 1, OPS-LR outperforms all other methods. In contrast, MeZO with only 2 queries performs poorly in the experiment, whereas DeepZero achieves acceptable accuracy compared with MeZO.
The phenomenon suggests that an adequate number of queries is crucial for the success of perturbation-based methods, since the gradient's variance must be properly controlled. OPS-SPSA achieves better accuracy than DeepZero with its equally allocated budget, and our proposed optimal allocator provides universal acceleration across all perturbation-based methods. Moreover, LR achieves higher accuracy than DeepZero, suggesting that LR is a better gradient estimator than SPSA in this experimental context. Please refer to Appendix C.1 for details and further analysis.

Table 1: Classification accuracy (%) for fine-tuning ViT on downstream datasets.

| Model | Method | ImageNet | Caltech101 | Food101 | Flowers102 | CIFAR10/100 | EuroSAT |
|---|---|---|---|---|---|---|---|
| ViT-Base | BP (upper bound) | 80.5 | 92.2 | 92.4 | 93.5 | 96.3 / 90.7 | 91.9 |
| | MeZO (2 queries) | 9.3 | 42.6 | 37.3 | 38.2 | 42.1 / 37.5 | 39.8 |
| | MeZO (n-SPSA) | 54.8 | 77.2 | 76.7 | 80.2 | 84.4 / 71.5 | 78.4 |
| | DeepZero (n-SPSA) | 55.6 | 78.4 | 77.8 | 78.4 | 85.6 / 70.2 | 78.8 |
| | LR | 57.3 | 85.5 | 87.6 | 86.5 | 87.7 / 80.8 | 82.9 |
| | OPS-SPSA (Ours) | 60.8 | 87.3 | 87.6 | 85.6 | 89.5 / 82.2 | 83.2 |
| | OPS-LR (Ours) | 65.8 | 88.9 | 88.4 | 91.6 | 93.2 / 88.7 | 89.3 |
| ViT-Large | BP (upper bound) | 81.7 | 93.4 | 93.3 | 94.2 | 98.2 / 93.5 | 93.4 |
| | MeZO (2 queries) | 10.7 | 45.6 | 38.8 | 40.5 | 45.9 / 38.3 | 41.7 |
| | MeZO (n-SPSA) | 56.7 | 78.8 | 78.4 | 78.5 | 86.7 / 74.3 | 79.2 |
| | DeepZero (n-SPSA) | 56.4 | 79.5 | 78.9 | 79.6 | 86.3 / 73.7 | 80.1 |
| | LR | 58.2 | 87.2 | 87.7 | 86.8 | 88.4 / 81.4 | 85.7 |
| | OPS-SPSA (Ours) | 62.2 | 88.6 | 88.1 | 87.3 | 91.3 / 83.9 | 87.6 |
| | OPS-LR (Ours) | 67.6 | 90.2 | 89.5 | 92.3 | 95.5 / 90.8 | 90.1 |

### 5.2 Black-Box Tuning for Model-as-a-Service

*Figure 2: Illustration of the paradigm of fine-tuning prompts for a black-box vision-language model.*

We can tune the prompts for vision-language models (VLMs) in a Model-as-a-Service (MaaS) environment using the OPS-LR framework, as shown in Figure 2.
We address the challenge of fine-tuning VLMs without access to model internals or gradient information through a scalable black-box optimization approach. We use CLIP (Radford et al., 2021), with embedding dimensions $D_T = 512$ for the text encoder and $D_I = 756$ for the image encoder, as the backbone foundation model. Both the text encoder and the image encoder are kept frozen as black-box modules. Tunable prefix or suffix prompt tokens are injected into the text and image token sequences, respectively: $S_T = [\mathrm{cls}, p_T, e_T] \in \mathbb{R}^{(1 + n_T + m_T) \times D_T}$ is the input sequence for the text encoder, and $S_I = [\mathrm{cls}, e_I, p_I] \in \mathbb{R}^{(1 + m_I + n_I) \times D_I}$ is the image counterpart. Previous research has suggested that large foundation models have low intrinsic dimensions. It is therefore plausible to project the original embedding space onto a low-dimensional subspace: $p_T = z_T W_T$ with $W_T \in \mathbb{R}^{d_T \times D_T}$, $z_T \in \mathbb{R}^{m_T \times d_T}$, and $p_I = z_I W_I$ with $W_I \in \mathbb{R}^{d_I \times D_I}$, $z_I \in \mathbb{R}^{m_I \times d_I}$. $W_T$ and $W_I$ are projection matrices, kept frozen after being initialized from a Gaussian distribution. With the projection, our black-box tuning problem is reduced to dimension $d_T + d_I \ll D_T + D_I$. The optimization problem, using the cross-entropy loss, takes the form

$$Z^* = \arg\min_{z_T, z_I} \mathcal{L}([e_T, e_I]; \theta = [z_T W_T, z_I W_I]), \tag{12}$$

where $Z$ is the unified notation for $[z_T, z_I]$. By applying the OPS-LR framework, we can optimize the black-box prompt tuning problem without access to the model structure. We achieve an accuracy of 69.7% on ImageNet by tuning only a few thousand parameters, surpassing other black-box methods and Manual Prompting (Radford et al., 2021). The performance is shown in Table 2. Please refer to Appendix C.2 for more details.

### 5.3 Alignment Between Black-Box Foundation Models

Table 2: Classification accuracy (%) for black-box prompt tuning.
| Methods | ImageNet | Caltech101 | Food101 |
|---|---|---|---|
| OPS-LR | 65.1 | 90.5 | 82.7 |
| OPS-SPSA | 64.2 | 89.3 | 81.5 |
| LR | 62.5 | 87.4 | 79.2 |
| SPSA | 60.7 | 85.3 | 77.6 |
| Manual Prompt | 57.9 | 83.3 | 76.4 |

Foundation models such as GPT (Brown, 2020), LLaMA (Touvron et al., 2023), Vicuna (Chiang et al., 2023), and CLIP (Radford et al., 2021) contain rich domain knowledge from pretraining. Multimodal tasks, such as video captioning (Krishna et al., 2017) and video grounding (Gao et al., 2017), require models to have sufficient cross-domain understanding, and proper alignment between different modalities is essential. Aligning foundation models from single modalities is a more economical approach than pretraining from scratch on multimodal data. It is common practice to add an adapter between the visual encoder and the large language model: a linear projector, $g(\cdot)$, that projects visual embeddings into the textual embedding space. Since foundation models contain billions of parameters, the backward pass requires extra memory for storing the computation graph, and recursive computation through such a large pre-trained model is time-consuming. Our OPS-LR framework efficiently bypasses the recursive computation needed to estimate the gradient with a large foundation model in between, as shown in Figure 3.

*Figure 3: Illustration of alignment between foundation models for video understanding.*

The training paradigm for multimodal alignment follows an autoregressive style using image-text pairs $\langle I, T \rangle$. We use a frozen CLIP ViT-L/14 as the visual encoder, denoted $ViT$. The CLS tokens of the output sequence of $ViT$ are utilized as the global feature for the frame.
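The adapter described above is just a linear map between embedding spaces; a minimal sketch (the widths are hypothetical placeholders, not the real CLIP/LLM dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_llm = 24, 32                        # hypothetical embedding widths

W_adapter = rng.normal(size=(d_vis, d_llm))  # the trainable linear projector g(.)

def g(visual_embedding):
    """Project a visual embedding into the textual embedding space."""
    return visual_embedding @ W_adapter

cls_feature = rng.normal(size=d_vis)         # CLS feature from the frozen encoder
z_v = g(cls_feature)                         # visual token in the LLM's space
print(z_v.shape)  # (32,)
```

Only `W_adapter` is trained; in the forward-learning setting its gradient is estimated from loss queries alone, with the frozen LLM treated as a black box.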
We denote the visual token after projection as Z_v = g(ViT(I)_CLS). The visual token is then concatenated with the text token sequence to form [Z_v, t_1, t_2, ..., t_m], where [t_1, t_2, ..., t_m] is the text token sequence of length m. We can then train the adapter on the concatenated sequence with the autoregressive objective of the LLM. We report the wall time for the feature-alignment procedure on the LCS558K dataset (Liu et al., 2024). Our OPS-LR framework cuts the training time almost in half, since the recursive backward computation is skipped. As shown in Figure 5, the wall time with and without the allocator is almost the same, which further illustrates that our allocator is lightweight and has minimal impact on runtime. Please refer to Appendix C.3 for more details.

5.4 ABLATION STUDY

We conduct an ablation study using ViT-base on the CIFAR100 dataset. We investigate the necessity of the allocation policy for forward-learning training and determine whether GA or BA is the better allocator. Furthermore, we examine the varying degrees of estimation difficulty across different network components. The results are shown in Figure 4.

Figure 5: Wall-clock time for feature alignment across different methods (BP, OPS-LR, LR, OPS-SPSA, SPSA).

Impact of different query budgets. To further investigate the necessity of query-budget allocation, we compare the performance increment under various query budgets for different allocation policies. Traditionally, queries are equally allocated (EA) across data points. The performance improvements offered by GA and BA are considerably more significant than those of EA. Even when the number of queries per data point is small, the benefit from allocation remains non-negligible, and the larger the query budget, the greater the improvement gained from allocation.
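The allocation benefit can be illustrated with a toy numerical sketch of the closed-form deterministic allocator derived in Appendix A.3, which assigns queries in proportion to the square root of each sample's gradient-variance trace. The per-sample trace values below are hypothetical; the objective J(A) = Σ_j Tr(Σ_j)/A_j is the variance term the allocator minimizes:

```python
import numpy as np

def variance_objective(traces, alloc):
    # J(A) = sum_j Tr(Sigma_j) / A_j, the variance term minimized by the allocator
    return float(np.sum(traces / alloc))

def optimal_allocation(traces, budget):
    # Deterministic allocator: A_j proportional to sqrt(Tr(Sigma_j))
    s = np.sqrt(traces)
    return budget * s / s.sum()

# Hypothetical per-sample variance traces for a mini-batch of B = 5 data points
traces = np.array([0.1, 0.5, 1.0, 2.0, 8.0])
budget = 100.0  # total query budget A_0

equal = np.full_like(traces, budget / len(traces))
opt = optimal_allocation(traces, budget)

j_equal = variance_objective(traces, equal)
j_opt = variance_objective(traces, opt)
assert j_opt <= j_equal  # non-uniform allocation never increases the objective
```

In practice the continuous allocation is rounded to integer query counts; the sketch ignores rounding for clarity.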
Meanwhile, the ablation on different transformer layers indicates that the variance of the gradient increases as the layers go deeper, which validates our subsampling technique.

Figure 4: Ablation study on the effect of different allocators (BA-LR, BA-SPSA, GA-LR, GA-SPSA, LR, SPSA) and on the estimation difficulty at different layers, measured as the cosine similarity between the estimated and true gradients versus the number of queries per data point. Panels (a), (b), (d), and (e) show the Key matrix in multi-head attention (MHA) at layers 1 and 12; panels (c) and (f) show the first linear layer of the feed-forward network (FFN). Layer 1 is adjacent to the embedding and layer 12 is adjacent to the classification head.

Impact of different allocators, Gaussian or Bernoulli. We compare different allocators in Figures 4(a), (b), (d), and (e). GA controls the allocation with the mean vector and the covariance matrix. The covariance matrix captures the similarity between data points: structurally similar data points possess similar levels of importance, so the allocation decisions for those data points should be correlated. If the covariance matrix were diagonal, the similarity between data points would be ignored, which could undermine the performance of the allocation strategy. Compared with GA, BA is a simpler strategy: intuitively, it prunes the queries on unimportant data points to eliminate unnecessary computation. Our ablation study suggests that GA is superior to BA.
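To make the two allocator families concrete, the following sketch contrasts them on made-up importance scores; the similarity kernel, covariance scaling, and softmax normalization are illustrative choices, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
B, budget = 6, 120  # mini-batch size and total query budget

# Hypothetical per-sample importance scores and a data-similarity kernel
mu = rng.uniform(0.5, 2.0, size=B)
feats = rng.normal(size=(B, 4))
unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
K = unit @ unit.T                      # cosine-similarity kernel (PSD)
cov = 0.1 * K + 0.06 * np.eye(B)       # illustrative positive-definite covariance

# Gaussian allocator (GA): correlated scores -> normalized weights -> query counts
scores = rng.multivariate_normal(mu, cov)
weights = np.exp(scores) / np.exp(scores).sum()
ga_alloc = np.maximum(1, np.round(weights * budget)).astype(int)

# Bernoulli allocator (BA): independently keep or prune each data point,
# then split the budget equally among the kept ones
keep = rng.random(B) < mu / mu.max()
ba_alloc = np.where(keep, budget // max(1, int(keep.sum())), 0)
```

Because GA draws its scores jointly from a multivariate Gaussian, similar data points receive correlated allocations, whereas BA's independent keep/prune decisions ignore that structure.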
Estimation on different components of the Transformer. We present the cosine similarity results on the multi-head attention (MHA) and the feed-forward network (FFN) in Figures 4(a)-(c) and (f). Estimation on the FFN is easier than on the MHA. In the Transformer block, the MHA precedes the FFN, which may explain this phenomenon. The low cosine similarity on the MHA of deep layers may explain the performance gap on large-scale datasets like ImageNet. It is also worth noting that a large number of queries can improve the estimation quality when training Transformers. However, the memory cost of increasing queries is extremely high, since the complexity of the attention module is O(n^2). Integrating efficient MHA techniques, such as FlashAttention (Dao et al., 2022), is left for future work.

6 CONCLUSIONS

In this paper, we present a unified perspective on forward learning through a perturbation-based framework and propose an optimal query-allocation strategy to enhance its efficiency. By leveraging a likelihood-ratio-based objective and a reparameterized Gaussian allocator, our method adaptively distributes queries according to data-specific gradient-estimation difficulty, balancing computational cost against estimation accuracy. Theoretical analysis guarantees the optimality of our allocation strategy, while extensive experiments demonstrate its effectiveness in fine-tuning Vision Transformers and in real-world black-box optimization problems such as prompt tuning and multimodal model alignment. Our results show that the proposed method significantly improves the scalability of forward learning, achieving state-of-the-art performance compared to existing approaches. While our method significantly improves scalability, closing the gap with backpropagation remains an open challenge for future research.
REFERENCES

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Atılım Güneş Baydin, Barak A. Pearlmutter, Don Syme, Frank Wood, and Philip Torr. Gradients without backpropagation. arXiv preprint arXiv:2202.08587, 2022.

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI, pp. 446–461. Springer, 2014.

Tom B. Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Aochuan Chen, Yimeng Zhang, Jinghan Jia, James Diffenderfer, Jiancheng Liu, Konstantinos Parasyris, Yihua Zhang, Zheng Zhang, Bhavya Kailkhura, and Sijia Liu. DeepZero: Scaling up zeroth-order optimization for deep model training. arXiv preprint arXiv:2310.02025, 2023.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Shizhe Diao, Zhichao Huang, Ruijia Xu, Xuechun Li, Yong Lin, Xiao Zhou, and Tong Zhang. Black-box prompt learning for pre-trained language models. arXiv preprint arXiv:2201.08531, 2022.

Alexey Dosovitskiy.
An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Jianqing Fan, Yingying Fan, and Jinchi Lv. High dimensional covariance matrix estimation using a factor model. Journal of Econometrics, 147(1):186–197, 2008.

Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pp. 178–178. IEEE, 2004.

Charlotte Frenkel, Martin Lefebvre, and David Bol. Learning without feedback: Fixed random learning signals allow for feedforward training of deep neural networks. Frontiers in Neuroscience, 15:629892, 2021.

Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. TALL: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5267–5275, 2017.

Jiaqi Gu, Hanqing Zhu, Chenghao Feng, Zixuan Jiang, Ray Chen, and David Pan. L2ight: Enabling on-chip learning for optical neural networks via efficient in-situ subspace optimization. Advances in Neural Information Processing Systems, 34:8649–8661, 2021.

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.

Geoffrey Hinton. The forward-forward algorithm: Some preliminary investigations. arXiv preprint arXiv:2212.13345, 2022.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.

Jinyang Jiang, Zeliang Zhang, Chenliang Xu, Zhaofei Yu, and Yijie Peng. One forward is enough for neural network training via likelihood ratio method.
arXiv preprint arXiv:2305.08960, 2023.

Adrien Journé, Hector Garcia Rodriguez, Qinghai Guo, and Timoleon Moraitis. Hebbian deep learning without feedback. arXiv preprint arXiv:2209.11883, 2022.

Chia-Hsiang Kao and Bharath Hariharan. Counter-current learning: A biologically plausible dual network approach for deep learning. arXiv preprint arXiv:2409.19841, 2024.

Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 706–715, 2017.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, 2009.

Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Timothy P. Lillicrap, Adam Santoro, Luke Marris, Colin J. Akerman, and Geoffrey Hinton. Backpropagation and the brain. Nature Reviews Neuroscience, 21(6):335–346, 2020.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.

Yucheng Lu, Si Yi Meng, and Christopher De Sa. A general analysis of example-selection for stochastic gradient descent. In International Conference on Learning Representations (ICLR), volume 10, 2022.

Wan-Duo Kurt Ma, JP Lewis, and W. Bastiaan Kleijn. The HSIC bottleneck: Deep learning without back-propagation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 5085–5092, 2020.

Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36:53038–53075, 2023.

Ali Momeni, Babak Rahmani, Matthieu Malléjac, Philipp del Hougne, and Romain Fleury. Backpropagation-free training of deep physical neural networks. Science, 382(6676):1297–1303, 2023.

Maria-Elena Nilsback and Andrew Zisserman.
Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. IEEE, 2008.

Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. Advances in Neural Information Processing Systems, 29, 2016.

Matteo Papini, Matteo Pirotta, and Marcello Restelli. Adaptive batch size for safe policy gradients. Advances in Neural Information Processing Systems, 30, 2017.

Matteo Pirotta, Marcello Restelli, and Luca Bascetta. Adaptive step-size for policy gradient methods. Advances in Neural Information Processing Systems, 26, 2013.

Ziheng Qin, Kai Wang, Zangwei Zheng, Jianyang Gu, Xiangyu Peng, Zhaopan Xu, Daquan Zhou, Lei Shang, Baigui Sun, Xuansong Xie, et al. InfoBatch: Lossless training speed up by unbiased dynamic data pruning. arXiv preprint arXiv:2303.04947, 2023.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Mengye Ren, Simon Kornblith, Renjie Liao, and Geoffrey Hinton. Scaling forward gradient with local losses. arXiv preprint arXiv:2210.03310, 2022.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.

David Silver, Anirudh Goyal, Ivo Danihelka, Matteo Hessel, and Hado van Hasselt.
Learning by directional gradient descent. In International Conference on Learning Representations, 2021.

James C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332–341, 1992.

Tianxiang Sun, Yunfan Shao, Hong Qian, Xuanjing Huang, and Xipeng Qiu. Black-box tuning for language-model-as-a-service. In International Conference on Machine Learning, pp. 20841–20855. PMLR, 2022.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

A. Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. The Journal of Machine Learning Research, 15(1):949–980, 2014.

Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin-Yu Chen, Jason D. Lee, Wotao Yin, Mingyi Hong, et al. Revisiting zeroth-order optimization for memory-efficient LLM fine-tuning: A benchmark. arXiv preprint arXiv:2402.11592, 2024.

Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning, pp. 1–9. PMLR, 2015.

A THEORETICAL DETAILS

A.1 PROOF OF INEQUALITY (8)

Inequality (8) states

PI_t(λ) ≥ E_{A∼P(·|λ)} [ η_t ⟨∇_θ L̂(θ_t), Σ_{j=1}^B ∇_θ L_j(θ_t)⟩ - (B/2) η_t^2 L ‖∇_θ L̂(θ_t)‖^2 ].

Proof. For each loss function L_j, by exploiting Taylor's theorem, we have

L_j(θ_{t+1}) - L_j(θ_t) = ∫_0^1 (d/dτ) L_j(θ_t + τ(θ_{t+1} - θ_t)) dτ = ∫_0^1 ⟨∇_θ L_j(θ_t + τ(θ_{t+1} - θ_t)), θ_{t+1} - θ_t⟩ dτ.

By adding and subtracting ∇_θ L_j(θ_t),

= ⟨∇_θ L_j(θ_t), θ_{t+1} - θ_t⟩ + ∫_0^1 ⟨∇_θ L_j(θ_t + τ(θ_{t+1} - θ_t)) - ∇_θ L_j(θ_t), θ_{t+1} - θ_t⟩ dτ.
(14)

Now we focus on the second term in (14). By the Cauchy-Schwarz inequality, we have

|⟨∇_θ L_j(θ_t + τ(θ_{t+1} - θ_t)) - ∇_θ L_j(θ_t), θ_{t+1} - θ_t⟩| ≤ ‖∇_θ L_j(θ_t + τ(θ_{t+1} - θ_t)) - ∇_θ L_j(θ_t)‖ ‖θ_{t+1} - θ_t‖,

and according to the L-smoothness assumption, we have

‖∇_θ L_j(θ_t + τ(θ_{t+1} - θ_t)) - ∇_θ L_j(θ_t)‖ ≤ L ‖τ(θ_{t+1} - θ_t)‖ = Lτ ‖θ_{t+1} - θ_t‖.

Therefore, the second term in (14) is bounded from below:

∫_0^1 ⟨∇_θ L_j(θ_t + τ(θ_{t+1} - θ_t)) - ∇_θ L_j(θ_t), θ_{t+1} - θ_t⟩ dτ ≥ -∫_0^1 Lτ ‖θ_{t+1} - θ_t‖^2 dτ = -(L/2) ‖θ_{t+1} - θ_t‖^2.    (15)

Combining (14) and (15), and substituting θ_{t+1} = θ_t + η_t ∇_θ L̂(θ_t), we have

L_j(θ_t + η_t ∇_θ L̂(θ_t)) - L_j(θ_t) ≥ η_t ⟨∇_θ L_j(θ_t), ∇_θ L̂(θ_t)⟩ - (L/2) η_t^2 ‖∇_θ L̂(θ_t)‖^2.

Then, taking the expectation over A ∼ P(·|λ) and summing over j,

PI_t(λ) ≥ E_{A∼P(·|λ)} [ η_t ⟨∇_θ L̂(θ_t), Σ_{j=1}^B ∇_θ L_j(θ_t)⟩ - (B/2) η_t^2 L ‖∇_θ L̂(θ_t)‖^2 ],

which completes the proof.

A.2 PROOF OF THEOREM 1

Proof. Writing the batch estimator as ∇_θ L̂(θ_t) = (1/B) Σ_{j=1}^B ∇_θ L̂_j(θ_t), the lower bound in (8) becomes

PI_t(λ) ≥ E_{A∼P(·|λ)} [ (η_t/B) ⟨Σ_{j=1}^B ∇_θ L̂_j(θ_t), Σ_{j=1}^B ∇_θ L_j(θ_t)⟩ - (η_t^2 L / 2B) ‖Σ_{j=1}^B ∇_θ L̂_j(θ_t)‖^2 ],    (17)

where ∇_θ L̂_j(θ_t) ∼ N(μ_G^{j,t}, Σ_G^{j,t}/A_j), j = 1, ..., B. Then, taking the conditional expectation given A, the expected inner product on the LHS of (17) is

E[⟨Σ_{j=1}^B ∇_θ L̂_j(θ_t), Σ_{j=1}^B ∇_θ L_j(θ_t)⟩ | A] = ⟨Σ_{j=1}^B μ_G^{j,t}, Σ_{j=1}^B ∇_θ L_j(θ_t)⟩.    (18)

Next, we focus on the expected squared norm on the RHS of (17):

E[‖Σ_{j=1}^B ∇_θ L̂_j(θ_t)‖^2 | A] = Σ_{j=1}^B Σ_{k=1}^B E[∇_θ L̂_j(θ_t)^⊤ ∇_θ L̂_k(θ_t) | A].    (19)

For j ≠ k, by independence of the estimators, we have

E[∇_θ L̂_j(θ_t)^⊤ ∇_θ L̂_k(θ_t) | A] = (μ_G^{j,t})^⊤ μ_G^{k,t}.    (20)

For j = k, we express ∇_θ L̂_j(θ_t) as the sum of its mean and a zero-mean normal random vector: ∇_θ L̂_j(θ_t) = μ_G^{j,t} + Z, where Z ∼ N(0, Σ_G^{j,t}/A_j). Then we have

E[∇_θ L̂_j(θ_t)^⊤ ∇_θ L̂_j(θ_t) | A] = E[(μ_G^{j,t} + Z)^⊤ (μ_G^{j,t} + Z) | A] = ‖μ_G^{j,t}‖^2 + 2 (μ_G^{j,t})^⊤ E[Z | A] + E[Z^⊤ Z | A] = ‖μ_G^{j,t}‖^2 + 0 + E[Z^⊤ Z | A].    (21)
The expected value of the quadratic form Z^⊤ Z in (21), for Z ∼ N(0, Σ_G^{j,t}/A_j), is given by the trace of the covariance matrix, i.e.,

E[Z^⊤ Z | A] = Tr(Σ_G^{j,t}/A_j) = Tr(Σ_G^{j,t})/A_j.    (22)

Therefore, with (20), (21) and (22), we have

E[‖Σ_{j=1}^B ∇_θ L̂_j(θ_t)‖^2 | A] = ‖Σ_{j=1}^B μ_G^{j,t}‖^2 + Σ_{j=1}^B Tr(Σ_G^{j,t})/A_j.    (23)

Then, with (17), (18) and (23), we have

LB_t(λ) = E_{A∼p(·|λ)} [ (η_t/B) ⟨Σ_{j=1}^B μ_G^{j,t}, Σ_{j=1}^B ∇_θ L_j(θ_t)⟩ - (η_t^2 L / 2B) ( ‖Σ_{j=1}^B μ_G^{j,t}‖^2 + Σ_{j=1}^B Tr(Σ_G^{j,t})/A_j ) ].    (24)

Notice that μ_G^{j,t} is independent of the allocation A; it depends only on the estimator type and the parameters of the noise injected in the perturbation. Therefore, it is easy to see that maximizing the lower bound LB_t(λ) over λ ∈ Λ is equivalent to minimizing

J_t(λ) ≜ E_{A∼P(·|λ)} [ Σ_{j=1}^B Tr(Σ_G^{j,t})/A_j ].    (25)

A.3 PROOF OF THEOREM 2

Proof. Based on the optimality of λ*_t at step t, we know that for any GA following N(μ, Σ), the corresponding E(LB_t^GA) ≤ E(LB_t^*). Now, if we set Σ = 0_{B×B}, the Gaussian distribution becomes deterministic, and the query allocation becomes a fixed, non-random strategy. To establish a lower bound for E(LB_t^*) - E(LB_t^equal), we construct an optimal deterministic allocator A^det as the intermediary strategy between A^* and A^equal, which satisfies E(LB_t^equal) ≤ E(LB_t^det) ≤ E(LB_t^*), where LB_t^det is the lower bound of the PI corresponding to A^det.

For the optimal deterministic allocator A^det, we aim to minimize the objective Σ_{j=1}^B Tr(Σ_G^{j,t})/A_j subject to the constraint Σ_{j=1}^B A_j = A_0. The Lagrangian function is

L = Σ_{j=1}^B Tr(Σ_G^{j,t})/A_j + λ ( Σ_{j=1}^B A_j - A_0 ),

where λ is the Lagrange multiplier. Taking the partial derivative of L with respect to each A_j and setting it to zero,

∂L/∂A_j = -Tr(Σ_G^{j,t})/A_j^2 + λ = 0,

we obtain

A_j^det = A_0 √Tr(Σ_G^{j,t}) / Σ_{k=1}^B √Tr(Σ_G^{k,t}).

The corresponding objective is

J_t(A^det) = Σ_{j=1}^B Tr(Σ_G^{j,t})/A_j^det = ( Σ_{j=1}^B √Tr(Σ_G^{j,t}) )^2 / A_0.

As for the equal allocator A_j^equal = A_0/B, the corresponding objective function becomes

J_t(A^equal) = (B/A_0) Σ_{j=1}^B Tr(Σ_G^{j,t}).

It follows that

J_t(A^equal) - J_t(A^det) = (1/A_0) [ B Σ_{j=1}^B Tr(Σ_G^{j,t}) - ( Σ_{j=1}^B √Tr(Σ_G^{j,t}) )^2 ].

Notice that

B Σ_{j=1}^B Tr(Σ_G^{j,t}) - ( Σ_{j=1}^B √Tr(Σ_G^{j,t}) )^2 = Σ_{j<k} ( √Tr(Σ_G^{j,t}) - √Tr(Σ_G^{k,t}) )^2.
Therefore,

J_t(A^equal) - J_t(A^det) = (1/A_0) Σ_{j<k} ( √Tr(Σ_G^{j,t}) - √Tr(Σ_G^{k,t}) )^2.

This difference is always non-negative and equals zero if and only if all Tr(Σ_G^{j,t}) are equal. Then we have

E[LB_{1:T}^* - LB_{1:T}^equal] ≥ E[LB_{1:T}^det - LB_{1:T}^equal] = Σ_{t=1}^T (η_t^2 L / 2B) (J_t(A^equal) - J_t(A^det)) = Σ_{t=1}^T (η_t^2 L / 2BA_0) Σ_{j<k} ( √Tr(Σ_G^{j,t}) - √Tr(Σ_G^{k,t}) )^2,

which completes the proof.

B PSEUDOCODE

Algorithm 1: Perturbation-based training via optimal allocation
Input: Target module φ(·; θ), loss function L(·), dataset X, noise density f(·), allocator P(A|λ, Φ), update interval M.
1: Initialize the network parameter θ and the allocator parameter λ.
2: repeat
3:   t ← 0.
4:   for each mini-batch {x_i}_{i=0}^{B-1} sampled without overlap from X do
5:     Compute the loss, Φ = L(y), without injected noise.
6:     Sample initial queries to estimate Tr(Σ_G^{j,t}).
7:     repeat
8:       Update λ by the estimated gradient following (11).
9:     until λ converges
10:    Sample the allocation decision A ∼ P(A|λ, Φ) for the mini-batch.
11:    Augment {x_i}_{i=0}^{B-1} according to A = (A_1, A_2, ..., A_B).
12:    for θ_j ∈ θ do
13:      Compute L(y|z) with z ∼ f(·) injected into φ(·; θ_j).
14:      Update θ_j by the estimated gradient following (1) or (3) with A queries.
15:    end for
16:    t ← t + 1.
17:  end for
18: until the network parameter converges.
Output: Network parameter θ.

C EXPERIMENT DETAILS

Platform: All experiments are conducted on a machine with 8 NVIDIA A800 GPUs; each A800 GPU has 80GB of memory.

C.1 EXPERIMENTS FOR VIT

Datasets: ImageNet: over 1.2 million images in 1,000 classes, a standard large-scale challenge for object classification. Caltech101: 9,144 images across 101 categories, focusing on object recognition with varying perspectives and poses. Food101: 101,000 images in 101 food categories, presenting challenges in texture, lighting, and fine-grained classification. Flower102: 8,189 images across 102 flower species, testing subtle feature recognition.
CIFAR-10/100: 60,000 32x32 images; CIFAR-10 has 10 classes and CIFAR-100 has 100 classes, offering small-scale image classification with increasingly fine category distinctions. EuroSAT: 27,000 satellite images in 10 land-use categories, challenging models on remote sensing tasks. All input images are resized to 224 before entering the ViT backbone.

ViT-base: The base model has 12 transformer layers. The hidden dimension is 768, with 12 heads for the multi-head attention. The intermediate dimension of the feed-forward MLP is 3072. The 12 layers are equally divided into three groups from bottom to top. The standard deviation of the injected Gaussian noise for each group is initialized as 1 × 10^-5, 1 × 10^-4, and 1 × 10^-3, respectively. The query budget for each group is 20 queries per data point, as is the query budget for the classification head.

ViT-large: The large model has 24 transformer layers. The hidden dimension is 1024, with 16 heads for the multi-head attention. The intermediate dimension of the feed-forward MLP is 4096. The 24 layers are equally divided into four groups from bottom to top. The standard deviation of the injected Gaussian noise for each group is initialized as 5 × 10^-6, 1 × 10^-5, 1 × 10^-4, and 1 × 10^-3. The query budget for each group is 20 queries per data point, as is the query budget for the classification head.

We use the CLS token on top of the Transformer backbone to compute the cosine similarity between data points, d_{i,j}, in the kernel for reparameterization. The queries pruned by BA are equally allocated to the remaining unpruned data points to keep the total number of queries the same.

Table 3: Results of ViT-base under natural noise and adversarial noise.
Dataset    Method    Original  Gaussian  Uniform  Poisson  FGSM  I-FGSM  MI-FGSM
ImageNet   BP        80.5      78.4      65.9     44.2     15.2  10.9    9.9
ImageNet   OPS-LR    65.8      65.4      65.4     60.3     34.5  15.6    13.8
ImageNet   OPS-SPSA  60.8      58.2      59.3     55.2     33.7  13.5    12.3
Food101    BP        92.4      88.3      84.7     66.4     19.6  12.3    10.5
Food101    OPS-LR    88.4      85.3      82.1     70.3     37.2  25.2    18.6
Food101    OPS-SPSA  87.6      84.7      80.3     70.4     36.4  23.6    15.3
CIFAR100   BP        90.7      83.8      75.2     43.5     18.3  9.3     7.5
CIFAR100   OPS-LR    88.7      84.3      77.6     52.9     25.4  14.8    12.3
CIFAR100   OPS-SPSA  82.2      79.3      76.8     48.2     22.5  13.8    9.4

Evaluation of robustness: We further assess the robustness of BP, OPS-LR, and OPS-SPSA on three datasets (ImageNet, Food101, and CIFAR100) using three criteria: 1) primary task efficacy: classification accuracy on unaltered samples (Original); 2) endurance against natural corruption: accuracy on datasets corrupted with natural noise (Gaussian, uniform, and Poisson); and 3) robustness to adversarial attacks: accuracy on samples perturbed by the fast gradient sign method (FGSM), the iterative FGSM (I-FGSM), and the momentum-based iterative FGSM (MI-FGSM). For the adversarial attacks, we cap the allowable perturbation per pixel at 8/255, and for I-FGSM and MI-FGSM we limit the iteration count to at most 5. We evaluate on the entire test set for the corruption assessment. As shown in Table 3, BP achieves the best accuracy on the original data, but OPS-LR and OPS-SPSA are more robust than BP; under adversarial attacks in particular, the perturbation methods hold a significant advantage.

C.2 EXPERIMENTS FOR BLACK-BOX TUNING

We use the open-source CLIP with ViT-B/32 as the visual encoder. The intrinsic dimension, d_I + d_T, is 1000. The visual prompt length is 10 and the text prompt length is 12. The batch size is 64. We use 60 queries per data point to tune the intrinsic dimensions.
We use the Adam optimizer with a learning rate of 1 × 10^-4 and train for 50 epochs with early stopping. All methods use the same 16-shot split for training and are evaluated on the full test sets.

C.3 EXPERIMENTS FOR MULTIMODAL ALIGNMENT

We use Vicuna v1.5 7B as the large language model. We train the feature alignment for one epoch with the Adam optimizer, a learning rate of 5 × 10^-4, and cosine learning-rate decay. The batch size is 64 and the query budget is 60 queries per data point.

C.4 EXTENDED ABLATION

Figure 6: Cosine similarity between the estimated and true gradients on ViT-Large, compared across allocators (BA-LR, BA-SPSA, GA-LR, GA-SPSA, LR, SPSA) as the number of queries per data point grows. Panels (a)-(c) and (d)-(f) show the Key matrix of the multi-head attention at layers 1, 12, and 24 for the LR and SPSA estimators, respectively; panels (g)-(i) show the feed-forward network at layers 1, 12, and 24. Layer 1 is the Transformer layer adjacent to the embedding and layer 24 is adjacent to the classification head.

We provide a more extensive ablation study on ViT-Large following the same setting as the main text. Figures 6(a)-(c) and (d)-(f) show the results of the LR and SPSA estimators, respectively.
Both estimators benefit from the allocation, and the Gaussian allocator performs better than the Bernoulli allocator. The gradient of deep layers such as layer 1 is more difficult to estimate than that of shallow layers, and the corresponding variance accounts for most of the variance of the whole model. Comparing Figures 6(a)-(c) to (g)-(i), estimation on the feed-forward network is easier than on the multi-head attention. In practice, we recommend adding gradient clipping to the multi-head attention layers.
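The gradient-clipping recommendation can be applied per module. Below is a minimal sketch (the gradient values and the norm cap are illustrative) that rescales an estimated gradient whenever its L2 norm exceeds a threshold, which is how clipping would typically be applied to a noisy forward-learning estimate for an attention Key matrix:

```python
import numpy as np

def clip_grad_norm(grad, max_norm):
    """Rescale grad so its L2 norm does not exceed max_norm (no-op otherwise)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# Illustrative: a high-variance gradient estimate for a 64x64 Key matrix
rng = np.random.default_rng(0)
key_grad = 5.0 * rng.normal(size=(64, 64))
clipped = clip_grad_norm(key_grad, max_norm=1.0)
assert np.linalg.norm(clipped) <= 1.0 + 1e-9  # norm is capped
```

Clipping only bounds the magnitude of the update; the direction of the estimate is preserved, so it limits the damage of occasional high-variance MHA estimates without biasing the direction of descent.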