Published as a conference paper at ICLR 2025

DEBORA: EFFICIENT BILEVEL OPTIMIZATION-BASED LOW-RANK ADAPTATION

Emanuele Zangrando1, Sara Venturini2, Francesco Rinaldi3, Francesco Tudisco1,4,5
1Gran Sasso Science Institute, L'Aquila, Italy
2MOBS Lab, Northeastern University, Boston, US
3Department of Mathematics, University of Padova, Padova, Italy
4School of Mathematics and Maxwell Institute, University of Edinburgh, UK
5Miniml.AI Ltd, UK

Low-rank adaptation methods are a popular approach for parameter-efficient fine-tuning of large-scale neural networks. However, selecting the optimal rank for each layer remains a challenging problem that significantly affects both performance and efficiency. In this paper, we introduce a novel bilevel optimization strategy that simultaneously trains both matrix and tensor low-rank adapters, dynamically selecting the optimal rank for each layer. Our method avoids the use of implicit differentiation in the computation of the hypergradient, and integrates a stochastic away-step variant of the Frank-Wolfe algorithm, eliminating the need for projection and providing identifiability guarantees for the optimal rank structure. This results in a highly efficient and cost-effective training scheme that adaptively allocates the parameter budget across the network layers. On top of a detailed theoretical analysis of the method, we provide several numerical experiments showcasing its effectiveness.

1 INTRODUCTION

Parameter-efficient fine-tuning methods have become essential for adapting large-scale pre-trained models to various downstream tasks without incurring prohibitive computational costs. Low-rank adaptation (LoRA) methods, which involve the addition of low-rank matrices to the existing weights of neural networks, have emerged as a popular solution due to their efficiency and simplicity.
These methods keep the main model weights frozen, significantly reducing the number of trainable parameters and making the fine-tuning process more resource-efficient. However, a persistent challenge in this area is the selection of the optimal rank for each low-rank adapter, which is crucial for balancing performance and efficiency. Traditional LoRA methods often rely on a static rank configuration across all layers, leading to suboptimal results. Recent approaches such as DyLoRA (Valipour et al., 2023) and AdaLoRA (Zhang et al., 2023) have attempted to address this by dynamically adapting the rank during training, but they still require considerable heuristic tuning or involve expensive steps that can limit their applicability and generalizability.

In this paper, we propose dEBORA (Efficient Bilevel Optimization-based low-Rank Adaptation), a novel approach that leverages bilevel optimization to simultaneously train matrix and tensor low-rank adapters while dynamically selecting the optimal rank for each layer. Unlike previous bilevel optimization-based methods for low-rank adaptation (Qiang et al., 2024), dEBORA eliminates the need for implicit differentiation in the computation of the hypergradient, thereby greatly reducing the resources required by the optimization process. The upper-level rank-adaptation step is performed by means of a stochastic away-step variant of the Frank-Wolfe algorithm, which not only removes the need for costly projection steps but also provides identifiability guarantees for the optimal rank structure.

We provide a detailed theoretical analysis of our method, establishing its convergence properties and optimality guarantees. Additionally, we conduct extensive numerical experiments across a range of benchmarks, including natural language understanding and generation tasks, to demonstrate the effectiveness of dEBORA.
Our results show that dEBORA outperforms existing low-rank adaptation methods in both efficiency and performance, particularly in settings with stringent parameter budgets.

Our method addresses the key limitations of existing low-rank adaptation techniques in several ways. First, by dynamically adjusting the rank for each layer during training, dEBORA ensures an adaptive allocation of the parameter budget across the network, thereby improving both computational efficiency and model performance. Second, the use of bilevel optimization provides a principled framework for handling the interaction between the model parameters and the adaptation budget hyperparameters, resulting in a more robust and theoretically grounded approach to low-rank adaptation.

The rest of this paper is organized as follows: Section 2 reviews related work on parameter-efficient fine-tuning and low-rank adaptation methods. Section 3 presents the proposed bilevel optimization framework and the dynamic rank adaptation mechanism, while Section 4 shows how the specific structure of the problem is used to create an efficient closed-form approximation for the hypergradient. Sections 5 and 6 introduce the stochastic version of the away-step Frank-Wolfe algorithm for solving the upper-level problem, along with its theoretical properties. Section 7 details the experimental setup and results. Finally, Section 8 concludes the paper and discusses potential future directions.

2 RELATED WORK

Parameter-Efficient Fine-Tuning. Large-scale pre-trained models have demonstrated significant improvements across numerous tasks. However, fine-tuning these models for specific downstream tasks often involves updating millions, if not billions, of parameters, leading to substantial computational and memory overhead.
This has motivated the development of parameter-efficient fine-tuning (PEFT) techniques, which aim to reduce the number of trainable parameters while maintaining or even enhancing the performance of the model. One of the seminal works in this area is Low-Rank Adaptation (LoRA) (Hu et al., 2021), which introduces low-rank incremental matrices to the frozen pre-trained weights, significantly reducing the trainable parameters. Inspired by recent findings in, e.g., (Li et al., 2018) and (Aghajanyan et al., 2021), which suggest that pre-trained models possess a low intrinsic dimension, LoRA leverages this property to achieve efficient adaptation without compromising performance.

Several variants of LoRA have been proposed to further improve its efficiency and adaptability. DyLoRA (Valipour et al., 2023) dynamically adjusts the ranks of the low-rank matrices during training based on the importance of learned representations, while QLoRA (Dettmers et al., 2024) introduces quantization techniques to reduce the memory footprint of LoRA, making fine-tuning accessible on hardware with even more limited resources. Another extension is LoraHub (Huang et al., 2023), which facilitates the modular composition of multiple LoRA modules across different tasks, enhancing the cross-task generalization capability of fine-tuned models. Similar to our approach, AdaLoRA (Zhang et al., 2023) takes this a step further by adaptively allocating the parameter budget based on the importance of different modules within the model. By parameterizing the low-rank updates using a singular value decomposition (SVD) framework, AdaLoRA iteratively prunes less significant singular values, thus optimizing the allocation of the limited parameter budget across the network. BiLoRA (Qiang et al., 2024) employs a bi-level optimization framework, where the learning of pseudo singular values and vectors is decoupled and assigned to different levels of the optimization hierarchy.
This separation mitigates the risk of overfitting by allowing each component to be optimized independently on separate subsets of the training data. This approach is inspired by differentiable architecture search (DARTS) (Liu et al., 2019), where the architecture and the weights are optimized on different datasets, preventing overfitting to any particular training set. Unlike the proposed dEBORA, BiLoRA computes the hypergradients directly using implicit differentiation, resulting in a high computational demand.

Low-Rank Methods for Pre-Training. Low-rank methods have also been successfully applied during the pre-training and training phases of neural networks, capitalizing on the observation that large models often possess a low intrinsic dimensionality. Techniques like Pufferfish (Wang et al., 2021), intrinsic dimension reduction (Aghajanyan et al., 2020), and DLRT (Schotthöfer et al., 2022) reduce the number of parameters during training, potentially improving both model efficiency and generalization. Recent developments such as (Zangrando et al., 2024) use Riemannian optimization to explore the parameter space, dynamically adapting the rank during training and ensuring model accuracy and substantial memory reduction, even with initially incorrect rank estimates. ReLoRA (Lialin et al., 2023) introduces a method for training large models efficiently by applying multiple low-rank updates, achieving significant memory savings while maintaining performance. GaLore (Zhao et al., 2024) is based on projecting the gradients onto a low-rank subspace, allowing for memory reductions without sacrificing model accuracy.

Bilevel Optimization in Deep Learning.
Bilevel optimization, with its origins in classical optimization theory (Colson et al., 2007), has gained traction in machine learning for a variety of applications, including hyperparameter optimization (Franceschi et al., 2018), meta-learning (Finn et al., 2017), and neural architecture search (Liu et al., 2019). The core challenge in bilevel optimization lies in the computation of the hypergradient, which captures the dependence of the upper-level objective on the lower-level variables. This often scales the complexity of first-order optimization methods to that of second-order ones, making it computationally prohibitive for large-scale problems. To address this challenge, numerous techniques have been proposed to approximate the hypergradient efficiently. Backpropagation through time (BPTT) (Franceschi et al., 2017) is one such approach, along with methods that use efficient inverse Hessian approximations (Lorraine et al., 2020) and approximate implicit differentiation (AID) techniques (Grazzi et al., 2020). These methods have made bilevel optimization feasible for high-dimensional problems, albeit with trade-offs in terms of computational accuracy and convergence speed.

Our work introduces a novel bilevel optimization strategy for parameter-efficient fine-tuning, specifically tailored for low-rank adaptation. By avoiding implicit differentiation and integrating a stochastic away-step variant of the Frank-Wolfe algorithm, our approach eliminates the need for costly gradient propagations and offers a more scalable solution. This framework allows for adaptive rank selection across network layers, providing a robust and efficient mechanism for fine-tuning.

3 LOW-RANK ADAPTATION VIA BILEVEL OPTIMIZATION

In this section, we introduce the bilevel optimization framework for fine-tuning low-rank adapters in tensor format, which provides a parameter-efficient adaptation strategy for large pre-trained models.
Consider the adaptation of pre-trained model weights $W_0 \in \mathbb{R}^{n \times n}$ through a low-rank update $\Psi = USV^\top$, where $U, V \in \mathbb{R}^{n \times r}$ are the low-rank basis matrices, $S$ is a diagonal matrix containing the singular values $s_i$, and $r$ is the rank of the adaptation. This low-rank formulation can be equivalently expressed as

$$\Psi = \sum_{i=1}^{r} s_i \, u_i \otimes v_i,$$

where $u_i$ and $v_i$ denote the $i$-th columns of $U$ and $V$, respectively, and $\otimes$ refers to the outer product of tensors. The rank $r$ directly controls the number of trainable parameters, making this formulation particularly suitable for rank-adaptive parameter-efficient fine-tuning (Zhang et al., 2023).

This low-rank adaptation naturally extends to tensor-based models using the CP (canonical polyadic) decomposition. For a layer weight tensor with $d$ modes, $W_0 \in \mathbb{R}^{n_1 \times \cdots \times n_d}$, the low-rank adaptation is defined as

$$W_0 + \Psi(B, s), \qquad \Psi(B, s) = \sum_{i=1}^{r} s_i \, u^{(1)}_i \otimes \cdots \otimes u^{(d)}_i,$$

where $B = (U^{(1)}, \ldots, U^{(d)})$ represents the collection of low-rank matrices $U^{(i)} \in \mathbb{R}^{n_i \times r}$, $u^{(j)}_i$ denotes the $i$-th column of the $j$-th basis matrix, and $s = (s_1, \ldots, s_r) \geq 0$ is a vector of nonnegative parameters representing a form of tensor singular values. In this setting, the low-rank adaptation is compactly represented by the basis matrices $B$ and the vector of scaling factors $s$.

To fine-tune $\Psi(B, s)$ while simultaneously minimizing the rank $r$, we introduce a bilevel optimization strategy. Given a fine-tuning loss function $\ell(B, s; x, y)$ and a dataset split into two parts $D_1$ and $D_2$, we consider the objective functions for the two data sets

$$f_i(B, s) = \frac{1}{|D_i|} \sum_{(x,y) \in D_i} \ell(B, s; x, y), \qquad i = 1, 2.$$

The fine-tuning process is then formulated as the following bilevel optimization problem:

$$\text{(upper-level)} \quad \min_{s \in \mathbb{R}^r :\, s \geq 0,\ \|s\|_1 \leq \tau} f_1(B^*(s), s)$$
$$\text{(lower-level)} \quad \text{s.t.} \quad B^*(s) \in \arg\min_{B \in \mathcal{V} \subseteq \mathbb{R}^{n_1 \times r} \times \cdots \times \mathbb{R}^{n_d \times r}} f_2(B, s), \tag{1}$$

where $\tau > 0$ is a regularization parameter that controls the sparsity of the singular values.
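To make the parameterization concrete, the following NumPy sketch (our own illustration, not the authors' code) assembles the CP update $\Psi(B, s)$ from the factor matrices and checks that the matrix case $d = 2$ reduces to $U \,\mathrm{diag}(s)\, V^\top$:

```python
import numpy as np

def cp_adapter(B, s):
    """Assemble Psi(B, s) = sum_i s_i u_i^(1) o ... o u_i^(d) from the
    factor matrices B = (U^(1), ..., U^(d)), with U^(j) of shape (n_j, r)."""
    shape = tuple(U.shape[0] for U in B)
    Psi = np.zeros(shape)
    for i in range(len(s)):
        term = s[i]
        for U in B:
            term = np.multiply.outer(term, U[:, i])  # successive outer products
        Psi += term
    return Psi

# Matrix case (d = 2): Psi must equal U diag(s) V^T
rng = np.random.default_rng(0)
U, V = rng.standard_normal((5, 3)), rng.standard_normal((4, 3))
s = np.array([1.0, 0.5, 0.1])
Psi = cp_adapter((U, V), s)
assert np.allclose(Psi, U @ np.diag(s) @ V.T)
```

The rank bound is immediate from the construction: $\Psi$ is a sum of $r$ rank-one terms, so pruning entries of $s$ (and the corresponding factor columns) directly shrinks the adapter.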
The bilevel nature of this problem arises from the fact that the optimal basis matrices $B^*(s)$ for a given $s$ are determined by minimizing a second objective function, $f_2(B, s)$. Moreover, given the nature of the parameterization $\Psi$, none of the objective functions is in general convex, even if $\ell$ is.

A critical aspect of solving bilevel optimization problems is computing the hypergradient, which captures how changes in the upper-level parameters $s$ affect the lower-level solution $B^*(s)$. By the chain rule, the hypergradient of the upper-level objective $f_1(B^*(s), s)$ is given by

$$\frac{d}{ds} f_1(B^*(s), s) = \nabla_B f_1(B^*(s), s) \, \partial_s B^*(s) + \nabla_s f_1(B^*(s), s). \tag{2}$$

The key computational challenge lies in determining $\partial_s B^*(s)$, which represents how the optimal basis $B^*(s)$ changes with $s$. Using first-order stationarity conditions, this dependency is governed by the implicit gradient system

$$\nabla^2_B f_2(B^*(s), s) \, \partial_s B^*(s) = -\nabla_s \nabla_B f_2(B^*(s), s). \tag{3}$$

Solving this linear system is necessary to compute the implicit gradient in Equation (2). Given the computational complexity of solving the implicit gradient system exactly, we explore efficient approximation techniques that exploit the structure of the problem. As discussed in Section 2, various methods, such as inverse Hessian approximation and conjugate gradient techniques, have been proposed in the literature. In the following section, we introduce a new closed-form approximation that leverages the tensor structure of the problem to significantly reduce computational overhead while maintaining accuracy.

For the sake of simplicity, we limit the theoretical discussion to the Euclidean case. However, everything transfers with minor adjustments to the Riemannian setting by interpreting all the derivatives through their Riemannian counterparts. A similar derivation is done for example in (Li & Ma, 2024), and the computations in our case would be similar to those presented there.
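The chain-rule identity can be verified on a toy problem. In the sketch below (a hedged stand-in: we invent a quadratic lower level so that $B^*(s)$ and $\partial_s B^*(s)$ are available in closed form) the analytic hypergradient of Equation (2) is checked against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
c = rng.standard_normal(3)

# Toy lower level: f2(b, s) = 0.5 ||b - A s||^2  ->  b*(s) = A s,  db*/ds = A
# Toy upper level: f1(b, s) = 0.5 ||b||^2 + c . s
b_star = lambda s: A @ s
f_total = lambda s: 0.5 * np.sum(b_star(s) ** 2) + c @ s

def hypergradient(s):
    # Chain rule (2): grad_b f1 . db*/ds + grad_s f1 = A^T b*(s) + c
    return A.T @ b_star(s) + c

s0 = rng.standard_normal(3)
eps = 1e-6
fd = np.array([(f_total(s0 + eps * e) - f_total(s0 - eps * e)) / (2 * eps)
               for e in np.eye(3)])
assert np.allclose(hypergradient(s0), fd, atol=1e-4)
```

Here the Hessian $\nabla^2_b f_2$ is the identity, so the implicit system (3) is trivial; in the general low-rank setting this is exactly the expensive step that the next section avoids.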
In particular, if we have a Riemannian manifold only in the lower-level problem (the case we considered in our experiments), the stationarity condition should be interpreted as $\nabla_B f_2|_{B^*(s), s} = 0$, where $\nabla_B f_2|_{B^*(s), s}$ is now the differential restricted to the tangent space $T_{B^*(s)}\mathcal{V}$ to the manifold $\mathcal{V}$ at the point $B^*(s)$ (with $\mathcal{V}$ either the Stiefel or the oblique manifold in our case). The implicit gradient equation (3) is derived in the same way by interpreting the Hessian as the Riemannian Hessian on $\mathcal{V}$.

4 EFFICIENT HYPERGRADIENT APPROXIMATION

As discussed in the previous section, the primary challenge in solving the bilevel optimization problem in Equation (1) lies in the fact that first-order methods require second-order information, particularly the calculation of the implicit derivative, which is obtained as the solution to Equation (3). While this can be computationally intensive, structured numerical approximations of the Hessian $\nabla^2_B f_2$ (such as diagonal or low-rank approximations) have been employed in prior work (Zhang et al., 2022; Lorraine et al., 2020) to simplify the hypergradient computation. Nonetheless, these approaches come with computational overhead.

In our setting, we exploit the specific parameterization structure of the problem to derive an efficient closed-form approximation for the hypergradient. This leads to a computationally tractable solution without fully solving the implicit gradient system. The following theorem formalizes this result. For brevity, we will use a superscript $*$ to denote the evaluation of variables and functions at the optimal pair $(B^*(s), s)$, solution to the lower-level problem.

Theorem 4.1 (Hypergradient Approximation). Denote by $\mathcal{L}_2$ the lower-level loss viewed as a function of $\Psi$, i.e., $\mathcal{L}_2(\Psi(B, s)) := f_2(B, s)$. In the setting of Section 3, assume that the gradient is locally approximately constant; that is, there exists a constant $K \geq 0$ such that

$$\|\nabla^2 \mathcal{L}_2(\Psi(B^*(s), s))\| \leq K \qquad \forall s :\ \|s\|_1 \leq \tau.$$
Additionally, assume that the following bilinear operator is uniformly invertible and bounded:

$$\sup_{\|B\|_2 = 1} \left\| \left( \nabla_\Psi \mathcal{L}^*_2 \, \nabla^2_B \Psi \right)^{-1} [B, \cdot\,] \right\| \leq \beta.$$

Then, for $d = 2$ (the matrix case), we derive the following closed-form, Hessian-free solution for the approximate hypergradient:

$$G(s) := \nabla_s f^*_1 - \mathrm{diag}\left[ \langle U^{(1)}, \nabla_{U^{(1)}} f^*_1 \rangle + \langle U^{(2)}, \nabla_{U^{(2)}} f^*_1 \rangle \right] \odot s^{-1},$$

where $\odot$ refers to the Hadamard product (entrywise). Moreover, the numerical solution obtained by neglecting the Hessian satisfies the following error bound for $d = 2$:

$$\left\| G(s) - \frac{d}{ds} f_1(B^*(s), s) \right\| \leq K\beta.$$

For $d > 2$ (tensor case), a tridiagonal approximation of $\nabla^2_B \Psi$ leads to the following closed-form approximation:

$$G(s) := \nabla_s f^*_1 - \mathrm{diag}\left[ \sum_{i=1}^{d} \langle U^{(i)}, \nabla_{U^{(i)}} f^*_1 \rangle \right] \odot s^{-1},$$

with the corresponding error bound:

$$\left\| G(s) - \frac{d}{ds} f_1(B^*(s), s) \right\| \leq K\beta + \Big\| \sum_{|j - i| \geq 2} \nabla^2_{U^{(i)}, U^{(j)}} \Psi \Big\|.$$

We provide a detailed proof of this theorem in Appendix B. We would moreover like to remark that, if the constraint set $\mathcal{V}$ on the basis $B$ is compact and $f_2$ is twice continuously differentiable, then the first assumption in Theorem 4.1 is automatically satisfied by the Weierstrass theorem. This naturally holds if we impose orthonormality constraints on the basis, as we will discuss in Section 7.

Given the approximation error in Theorem 4.1, it is important to modify the constraint in the upper-level problem to $\|s\|_1 \leq \tau + r\varepsilon = \tilde{\tau}$ and $s \geq \varepsilon$ entrywise, where $\varepsilon > 0$ is a small constant that prevents numerical instability due to division by $s$. This adjustment ensures that the approximation remains well-behaved, particularly in cases where the singular values $s_i$ become small. More precisely, one can consider the perturbed L1 simplex

$$S := \{ s \in \mathbb{R}^r \mid s \geq \varepsilon,\ \|s\|_1 \leq \tilde{\tau} \},$$

which is still convex and thus amenable to projection-free methods, motivating our choice of stochastic Frank-Wolfe for the upper-level problem in the next Section 5.

5 OPTIMIZATION WITH STOCHASTIC AWAY-STEP FRANK-WOLFE

Our goal is to minimize an objective function defined as the finite sum of a set of functions.
The main challenge here lies in the high computational cost: calculating the objective value or its gradient requires aggregating information across all functions in the set, which is substantially more expensive than evaluating these quantities for a single function or a small subset of them. This is especially problematic when the set of functions is large, which can quickly make optimization infeasible in practice.

In order to solve the bilevel optimization problem in Equation (1), we hence proceed as follows: first, we compute an approximate solution to the lower-level problem using a standard stochastic gradient-based method; then we compute a stochastic approximation $\tilde{G}(s)$ of $G(s)$ by sampling a minibatch of data to represent the loss $f_1$; finally, we use $\tilde{G}(s)$ to update the solver for the upper-level problem.

The hypergradient approximation $G(s)$ obtained in Theorem 4.1 can be efficiently computed using automatic differentiation. Specifically, to calculate $G(s)$, we first approximate the optimal basis $B^*(s)$ for a fixed $s$ by optimizing for a finite number of steps; then we compute the partial derivative $\nabla_s f_1(B^*(s), s)$, leveraging the fact that automatic differentiation can efficiently compute $\nabla_B f_1$ in a single backpropagation step. By exploiting the structure of the parameterization, this method bypasses the need for the full Hessian and allows for scalable fine-tuning even in large-scale settings.

As the upper level is a constrained optimization problem with convex simplex constraints, we approach it with a stochastic version of the away-step Frank-Wolfe algorithm. This allows us to employ a projection-free approach, which reduces the cost of the upper-level solver and allows us to obtain important theoretical guarantees of convergence to a sparse structure in a finite number of steps (see Section 6).
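The closed-form approximation of Theorem 4.1 is assembled from first-order quantities alone. The NumPy sketch below (our own illustration; a toy Frobenius loss $f_1 = \tfrac{1}{2}\|\Psi - T\|^2$ stands in for the fine-tuning loss) computes $G(s)$ for the matrix case $d = 2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 6, 3
U, V = rng.standard_normal((n, r)), rng.standard_normal((n, r))
s = np.array([2.0, 1.0, 0.5])
T = rng.standard_normal((n, n))   # toy target defining f1

Psi = U @ np.diag(s) @ V.T
g_Psi = Psi - T                   # grad of f1 = 0.5 ||Psi - T||^2 w.r.t. Psi

# Partial gradients, via the chain rule through Psi(U, s, V)
g_s = np.diag(U.T @ g_Psi @ V)
g_U = g_Psi @ V @ np.diag(s)
g_V = g_Psi.T @ U @ np.diag(s)

# Closed-form, Hessian-free hypergradient approximation (matrix case d = 2):
# G = grad_s f1 - diag[<U, grad_U f1> + <V, grad_V f1>] (Hadamard) 1/s
G = g_s - (np.sum(U * g_U, axis=0) + np.sum(V * g_V, axis=0)) / s

# For a loss depending on (U, s, V) only through Psi, each column-wise inner
# product equals s_i * (g_s)_i, so in this toy example G reduces to -g_s.
assert np.allclose(G, -g_s)
```

Everything above costs one gradient evaluation plus $O(nr)$ reductions, in contrast with solving the implicit system (3).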
At each iteration $s_n$, the main steps of the Frank-Wolfe algorithm (Guélat & Marcotte, 1986; Beck & Shtern, 2017; Bomze et al., 2020) are as follows:

Frank-Wolfe direction. Compute a search direction $h_n$ that minimizes a linear local approximation of the upper-level loss function: $h_n = z_n - s_n$, where $z_n = \arg\min_{s \in S} \tilde{G}(s_n)^\top s$. Note that, as $S$ is a sparse simplex, the variable $z_n$ can be computed in an efficient way by simply sweeping through the entries of $\tilde{G}(s_n)$; see Appendix C.

Convergence criterion. Stop if $-\tilde{G}(s_n)^\top h_n \leq \varepsilon_p$.

Away-step direction. Compute a search direction $b_n$ that maximizes a linear local approximation of the upper-level loss function: $b_n = s_n - y_n$, where $y_n = \arg\max_{s \in C_n} \tilde{G}(s_n)^\top s$, with $C_n = \{ s \in S : \mathrm{supp}_\varepsilon(s) = \mathrm{supp}_\varepsilon(s_n) \}$ and $\mathrm{supp}_\varepsilon(s) = \{ i : s_i > \varepsilon \}$. Note that this problem can be solved as efficiently as the FW direction computation by sweeping through the entries of $\tilde{G}(s_n)$; see Appendix C.

Steepest direction. Set $d_n = h_n$ if $-\tilde{G}(s_n)^\top h_n \geq -\tilde{G}(s_n)^\top b_n$; set $d_n = b_n$ otherwise.

Line search. Choose $\alpha^{\max}_n = 1$ if $d_n = h_n$. Otherwise, choose $\alpha^{\max}_n$ as the largest $\alpha$ such that $s_n + \alpha d_n \in S$. Then choose $\alpha_n \in (0, \alpha^{\max}_n]$ using a line search.

Update variables. Update the current iterate as $s_{n+1} = s_n + \alpha_n d_n$. If $n \geq n_0$, truncate $s_{n+1}$ by removing the entries that are smaller than $\varepsilon$ and remove the corresponding columns from $U$ and $V$, effectively reducing the rank of the problem.

The resulting low-rank adaptation scheme is summarized in Algorithms 1 and 2. In the next section, we will see that the proposed algorithm converges in expectation with a sublinear rate (see Theorem 6.2 for further details) and is also able to identify the surface containing a stationary point once it gets close enough to it (see Theorem 6.5 for further details). Identifying the surface containing a solution is of paramount importance in our framework, since it allows us to shrink the dimension of the problem under analysis and eventually apply more sophisticated solution techniques in the end.
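Both direction subproblems are linear over (a face of) the perturbed simplex $S$, so each reduces to a single sweep over the entries of $\tilde{G}(s_n)$. The following NumPy sketch is our own reconstruction under the vertex structure of $S$ described above (the actual routines are given in Appendix C):

```python
import numpy as np

def fw_direction(g, s, tau_t, eps):
    """Linear minimizer of g.z over S = {z >= eps, ||z||_1 <= tau_t}:
    keep every entry at eps and spend the slack tau_t - r*eps on the most
    negative gradient entry, if any (one sweep of g)."""
    r = len(s)
    z = np.full(r, eps)
    i = int(np.argmin(g))
    if g[i] < 0:
        z[i] += tau_t - r * eps
    return z - s                    # h_n = z_n - s_n

def away_direction(g, s, tau_t, eps):
    """Away step restricted to the current eps-support: move the slack to
    the largest gradient entry within supp_eps(s) and step away from it."""
    r = len(s)
    supp = s > eps
    y = np.full(r, eps)
    j = int(np.argmax(np.where(supp, g, -np.inf)))
    y[j] += tau_t - r * eps
    return s - y                    # b_n = s_n - y_n

# Toy instance with tau_tilde = 1 and eps = 0.01
g = np.array([0.5, -1.0, 0.2, 0.1])   # stochastic hypergradient estimate
s = np.array([0.4, 0.3, 0.01, 0.25])
h = fw_direction(g, s, 1.0, 0.01)
b = away_direction(g, s, 1.0, 0.01)
```

On this instance both directions have negative inner product with $g$, i.e., both are descent candidates for the linearized loss, and no projection is ever needed.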
In Appendix G we provide a detailed per-iteration cost analysis of dEBORA, together with the real GPU consumption compared with (Zhang et al., 2023).

6 THEORETICAL PROPERTIES OF THE STOCHASTIC AWAY-STEP FRANK-WOLFE ALGORITHM

In this section, we propose a convergence analysis of the fine-tuning strategy. In order to simplify the notation, we let $f(s) = f_1(B^*(s), s)$, where $B^*(s)$ is a solution to the lower-level problem. We hence consider, in place of Problem (1), the following optimization problem:

$$\min_{s \in S} f(s), \tag{4}$$

and denote with $\Delta$ the diameter of $S$.

Algorithm 1 dEBORA: Efficient Bilevel Optimization-based low-Rank Adaptation
1: Input: Choose $\tau, \varepsilon > 0$, precision $\varepsilon_p > 0$, and truncation step $n_0$. Initialize adapters $B$ and $s_0 \in S$
2: For $n = 0, 1, \ldots$
3:   Compute $\tilde{G}(s_n)$, a stochastic estimation of the hypergradient, as in Algorithm 2
4:   $h_n$ = FW-direction($\tilde{G}(s_n), \tau, \varepsilon$)
5:   If Convergence-criterion($\tilde{G}(s_n), h_n, \varepsilon_p$) then STOP
6:   $b_n$ = Away-step-direction($\tilde{G}(s_n), s_n, \tau, \varepsilon$)
7:   $d_n$ = Steepest-direction$\{h_n, b_n\}$
8:   $\alpha_n$ = Linesearch($s_n, d_n$)
9:   Set $s_{n+1} = s_n + \alpha_n d_n$
10:  If $n \geq n_0$ then truncate $s_{n+1}$ by removing entries smaller than $\varepsilon$ and reduce rank accordingly
11: End for

Algorithm 2 Stochastic approximation of the hypergradient $\tilde{G}(s)$
1: Given $s \in S$
2: Compute an approximation $B^*(s)$ to $\arg\min_B f_2(B, s)$
3: Sample a data minibatch and define $\tilde{f}_1$ as the estimation of $f_1$ on it
4: Compute $\nabla_B \tilde{f}_1(B^*, s)$ and $\nabla_s \tilde{f}_1(B^*, s)$
5: Assemble $\tilde{G}(s) := \nabla_s \tilde{f}_1(B^*, s) - \mathrm{diag}[\langle B^*, \nabla_B \tilde{f}_1(B^*, s) \rangle] \odot s^{-1}$

Moreover, we assume that a stochastic approximation $\tilde{G}$ of the hypergradient $\nabla f$ is available, such that the following assumption is satisfied.¹

Assumption 6.1. Let us denote $e(\bar{s}, s) = (\nabla f(\bar{s}) - \tilde{G}(\bar{s}))^\top (s - \bar{s})$. We assume that for any $\bar{s} \in S$ there exists $\chi \geq 0$ such that

$$\mathbb{E}\left[ e(\bar{s}, s)^2 \right] \leq \chi^2 \qquad \forall s \in S. \tag{5}$$

Remark 6.4 will discuss how the proposed analysis transfers to the chosen biased approximation $\tilde{G}(s)$ from Algorithm 2.
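Algorithm 2 can be mimicked end-to-end on a small synthetic problem. In the NumPy sketch below (a hedged stand-in: the two data splits are replaced by two target matrices and the lower level is solved by plain gradient descent), line 5 is assembled exactly as stated:

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 8, 4
T1, T2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))  # stand-ins for the D1/D2 splits
s = np.abs(rng.standard_normal(r)) + 0.5                           # current upper-level variable

def f2_loss(U, V):
    return 0.5 * np.linalg.norm(U @ np.diag(s) @ V.T - T2) ** 2

# Line 2: approximate B*(s) = (U, V) with a few gradient steps on f2
U = 0.1 * rng.standard_normal((n, r))
V = 0.1 * rng.standard_normal((n, r))
f2_start = f2_loss(U, V)
lr = 0.01
for _ in range(300):
    R = U @ np.diag(s) @ V.T - T2
    U, V = U - lr * R @ V @ np.diag(s), V - lr * R.T @ U @ np.diag(s)
f2_end = f2_loss(U, V)

# Lines 3-4: gradients of the "mini-batch" upper-level loss f1 at (B*, s)
R1 = U @ np.diag(s) @ V.T - T1
g_s = np.diag(U.T @ R1 @ V)
g_U, g_V = R1 @ V @ np.diag(s), R1.T @ U @ np.diag(s)

# Line 5: assemble the stochastic hypergradient estimate
G_tilde = g_s - (np.sum(U * g_U, axis=0) + np.sum(V * g_V, axis=0)) / s
```

In practice the two gradient evaluations in lines 3-4 come from a single backpropagation pass; here they are written out analytically for the toy quadratic losses.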
Notice that, by the Jensen and Hölder inequalities respectively, if Assumption 6.1 holds, we have

$$|\mathbb{E}[e(\bar{s}, s)]| \leq \mathbb{E}[|e(\bar{s}, s)|] \leq \mathbb{E}[e(\bar{s}, s)^2]^{1/2} \leq \chi. \tag{6}$$

Note moreover that, in the stochastic optimization literature, it is standard to quantify the accuracy of the stochastic first-order oracle by assuming bounded variance of the stochastic gradient estimates (Braun et al., 2022): for every $s \in S$, there exists $\sigma^2 > 0$ such that $\mathbb{E}\|\nabla f(s) - \tilde{G}(s)\|^2 \leq \sigma^2$. It is easy to see that this assumption on the gradient estimates implies Assumption 6.1. Indeed, we have

$$\mathbb{E}[e(\bar{s}, s)^2] \leq \mathbb{E}\left[ \|\nabla f(\bar{s}) - \tilde{G}(\bar{s})\|^2 \, \|s - \bar{s}\|^2 \right] \leq \mathbb{E}\left[ \|\nabla f(\bar{s}) - \tilde{G}(\bar{s})\|^2 \right] \Delta^2 \leq \sigma^2 \Delta^2.$$

Let us define $g_n = -\nabla f(s_n)^\top h_n$, $\tilde{g}_n = -\tilde{G}(s_n)^\top h_n$, and $g^{FW}_n = -\nabla f(s_n)^\top h^{FW}_n$, with $h^{FW}_n \in \arg\min_{s \in S} \{\nabla f(s_n)^\top (s - s_n)\} - s_n$ the Frank-Wolfe direction obtained using the exact gradient. Since $S$ is a convex set, a point $s^* \in S$ is said to be stationary for the upper-level problem in (1) when $\nabla f(s^*)^\top (s - s^*) \geq 0$ for all $s \in S$. Then, $g^{FW}_n$ is an optimality measure, i.e., $g^{FW}_n = 0$ if and only if $s_n \in S$ is a stationary point.

Now we show that Assumption 6.1 is satisfied with a sufficiently small $\chi$ if the stepsize $\alpha_n$ is generated with a suitable line search. Thus, Algorithm 1 converges in expectation to a stationary point at a sublinear rate on non-convex objectives with a Lipschitz continuous gradient. The constant in the convergence rate depends on the quality of the gradient estimate (the more precise the estimate, the smaller the constant). The proof can be found in Appendix D.

¹We have a biased stochastic estimate of the original hypergradient, so $\mathbb{E}[\tilde{G}] \neq \mathbb{E}[\nabla f]$.

Theorem 6.2. Let $\{s_n\}$ be a sequence generated by Algorithm 1 applied to $f$ on $S$, where $\nabla f$ is Lipschitz continuous with constant $M$, $\tilde{G}$ satisfies Assumption 6.1 with

$$\chi \leq \frac{\eta}{2 + 2\eta}\, \varepsilon_p, \qquad 0 \leq \eta < 1, \tag{7}$$

and the step size $\alpha_n$ satisfies

$$\alpha_n \leq \bar{\alpha}_n = \min\left\{ \alpha^{\max}_n, \frac{\tilde{g}_n}{M \Delta^2} \right\}, \tag{8}$$

$$f(s_n) - f(s_n + \alpha_n d_n) \geq \rho \, \bar{\alpha}_n \tilde{g}_n, \tag{9}$$

with some fixed $\rho$ s.t.

$$0 < \rho \leq \frac{1 - 3\eta}{2 M \Delta^2 (1 - \eta)^2}\, g^{FW}_0.$$
(10)

Then for every $T \in \mathbb{N}$,

$$\bar{g}_T^{\,2} \leq \frac{2 M \Delta^2 \left( f(s_0) - f^* \right)}{T \rho (1 - \eta)^2}, \tag{11}$$

where $\bar{g}_T = \min_{0 \leq n \leq T-1} g^{FW}_n$ and $f^* = \min_{s \in S} f(s)$.

Equations (8) and (9) can be satisfied with suitable line searches/stepsize rules (see, e.g., (Bomze et al., 2020; 2021; Rinaldi & Zeffiro, 2022)). In particular, Lemma 6.3 below shows that this is the case for the fixed step size $\alpha_n = \bar{\alpha}_n$ and for the modified Armijo line search rule, which sets

$$\alpha_n = \delta^j \bar{\alpha}_n, \tag{12}$$

where $j$ is the smallest non-negative integer such that

$$f(s_n) - f(s_n + \alpha_n d_n) \geq \gamma \alpha_n \tilde{g}_n, \tag{13}$$

with $\gamma \in (0, 1/2)$ and $\delta \in (0, 1)$ being two fixed parameters. The proof can be found in Appendix E.

Lemma 6.3. Let Assumption 6.1 hold with $\chi \leq \frac{\eta}{2 + 2\eta} \varepsilon_p$, $0 \leq \eta < 1$, and let $\bar{\alpha}_n$ be defined as in Equation (8). At iteration $n$, if $\alpha_n$ is determined by the fixed step size rule $\alpha_n = \bar{\alpha}_n$, then Equation (8) holds and Equation (9) is satisfied with

$$0 < \rho \leq \min\left\{ \frac{1 - 3\eta}{2 M \Delta^2 (1 - \eta)^2}\, g^{FW}_0,\ \frac{1 - \eta}{2(1 + \eta)} \right\}. \tag{14}$$

If instead $\alpha_n$ is determined by the Armijo line search described in Equations (12) and (13), then $\alpha_n \geq \min\{1, 2\delta(1 - \gamma - \eta)\}\, \bar{\alpha}_n$, and Equations (8) and (9) hold.

Remark 6.4. It is easy to see that Assumption 6.1 holds with a $\chi$ that satisfies Equations (7) and (14) when the batch size chosen to build the approximate stochastic hypergradient $\tilde{G}(s)$ is large enough. We also observe that it is possible to decompose $\chi$ in Equation (6) into a first term depending entirely on the norm of the Hessian, and a second term controlled entirely by the stochastic part. In particular, one can decompose the error as

$$e(\bar{s}, s) = (\nabla f(\bar{s}) - G(\bar{s}))^\top (s - \bar{s}) + (G(\bar{s}) - \tilde{G}(\bar{s}))^\top (s - \bar{s}),$$

where $G(s)$ is again the closed-form estimate of the hypergradient presented in Theorem 4.1, and $\tilde{G}(s)$ its stochastic counterpart where the objective function is evaluated on a data batch. Under the hypothesis that the constant $\beta$ in Theorem 4.1 works across different batches, by using the Cauchy-Schwarz and triangular inequalities we get

$$\mathbb{E}[|e(\bar{s}, s)|] \leq K\beta + \mathbb{E}\left[ |(G(\bar{s}) - \tilde{G}(\bar{s}))^\top (s - \bar{s})| \right] \leq K\beta + \chi',$$

where now we define $\chi'$ as an upper bound of the latter, purely stochastic term.
In particular, this allows one to obtain potentially sharper bounds when estimating the minimal batch size needed to satisfy Equations (7) and (14).

Table 1: DeBERTaV3-base fine-tuning on the GLUE benchmark.

| Method | # Params | MNLI (Acc) | SST-2 (Acc) | CoLA (Mcc) | QQP (Acc/F1) | QNLI (Acc) | RTE (Acc) | MRPC (Acc) | STS-B (Corr) |
|---|---|---|---|---|---|---|---|---|---|
| Full FT | 184M | 89.90 | 95.63 | 69.19 | 92.40/89.80 | 94.03 | 83.75 | 89.46 | 91.60 |
| HAdapter | 1.22M | 90.13 | 95.53 | 68.64 | 89.27 | 94.11 | 84.48 | 89.95 | 91.48 |
| PAdapter | 1.18M | 90.33 | 95.61 | 68.77 | 89.40 | 94.29 | 85.20 | 89.46 | 91.54 |
| LoRA r = 8 | 1.33M | 90.29 | 95.29 | 68.57 | 90.62/90.61 | 93.91 | 85.5 | 89.75 | 89.10 |
| AdaLoRA | 1.27M | 90.44 | 95.64 | 68.76 | 90.59/90.65 | 94.11 | 86.00 | 89.44 | 91.41 |
| dEBORA τ = 16 | 0.4M | 90.01 | 95.29 | 68.72 | 91.88/89.20 | 93.42 | 83.75 | 90.16 | 90.84 |
| dEBORA (Oblique) τ = 16 | 1.03M | 90.0 | 94.83 | 69.15 | 88.62/85.09 | 87.33 | 83.03 | 91.42 | 91.24 |
| dEBORA (Stiefel) τ = 16 | 0.8M | 89.79 | 95.65 | 68.39 | 89.74/86.58 | 93.83 | 84.12 | 91.18 | 91.54 |

We now report a local identification result related to Algorithm 1. More specifically, we state that our method successfully identifies the manifold containing a stationary point once it is sufficiently close to it. As we already noted, the guarantee of identifying this manifold in a finite number of steps is of critical importance within our framework, as it facilitates the reduction of the problem dimensionality and ultimately enables the application of more sophisticated solution techniques in the final stages. In other words, once the optimal face of the simplex is found, we are guaranteed that optimization can be continued on that face alone without any repercussion, potentially drastically reducing the number of parameters. The proof of this result can be found in Appendix F.

To ease the analysis, we first introduce some useful theoretical tools and notation. We define the face of $S$ exposed by $-\nabla f(s)$, with $s \in S$, as

$$F_e(-\nabla f(s)) = \arg\min_{z \in S} \nabla f(s)^\top z,$$

and the minimal face of $S$ containing $s \in S$ as $F(s)$.
We let $\lambda_v(s) = \nabla f(s)^\top (v - s)$ for $v, s \in S$, and notice that for a stationary point $s^*$ of Problem (4) it holds that $\lambda_v(s^*) \geq 0$ for all $v \in V$, where $V$ is the set of extreme points of $S$. Furthermore, we define the value

$$\lambda^{MIN}(s^*) = \min_{v \in V^+(S)} \lambda_v(s^*), \tag{15}$$

with $V^+(S) = \{ v \in V : \lambda_v(s^*) > 0 \}$. Finally, we indicate with $B_{\Gamma(x)}(x)$ the ball with radius $\Gamma(x) > 0$ centered at $x$. We have:

Theorem 6.5. Let $\{s_n\}$ be a sequence generated by Algorithm 1. Let $s^* \in S$ be a stationary point of Problem (4) s.t. $\lambda^{MIN}(s^*) - 2\chi > 0$ and $F_e(-\nabla f(s^*)) = F(s^*)$, that is, strict complementarity holds at $s^*$. Then, there exists $\Gamma(s^*) > 0$ s.t. if $s_n \in B_{\Gamma(s^*)}(s^*) \cap F(s^*)$ then $s_{n+1} \in F(s^*)$. Furthermore, if the level set $L(f(s_n))$ is such that $L(f(s_n)) \subseteq B_{\Gamma(s^*)}(s^*)$, then $s_{n+1} \in B_{\Gamma(s^*)}(s^*) \cap F(s^*)$.

7 EXPERIMENTAL EVALUATION

In this section, we present numerical experiments to showcase the effectiveness of the proposed bilevel optimization method for fine-tuning low-rank adapters. For a fair comparison, the validation set was not used in either of the two losses during training. To create the two loss functions $f_1, f_2$, we randomly partitioned the dataset into equally sized subsets, using one partition for the upper-level loss and the other for the lower-level loss. All experiments were run on an 80GB NVIDIA A100 GPU.

7.1 GLUE BENCHMARK

In our first experiment, we fine-tuned DeBERTaV3 (He et al., 2023) on the GLUE benchmark (Wang et al., 2019), comparing the proposed approach with state-of-the-art methods for fine-tuning large language models (LLMs). These include AdaLoRA (Zhang et al., 2023), LoRA (Hu et al., 2022), the Pfeiffer adapter (Pfeiffer et al., 2021), the Houlsby adapter (Houlsby et al., 2019), and full fine-tuning.

Table 2: Fine-tuning performance on tensor layers. Left: ResNet50 fine-tuning on CIFAR-10. Right: Fine-tuning of Dreambooth Stable Diffusion. The first rank refers to the LoRA rank of the UNet layers, the second to the text encoder. In the parameter count, M stands for millions and K for thousands.
Initial ranks for AdaLoRA are 8 for each layer.

ResNet50 on CIFAR-10 (Table 2, left):

| Method | Val. Accuracy (%) | # Params |
|---|---|---|
| LoRA (r = 16) | 89.66 | 1.17M |
| LoRA (r = 8) | 87.5 | 588K |
| LoRA (r = 2) | 70.03 | 147K |
| AdaLoRA | 93.28 | 316K |
| AdaLoRA | 90.97 | 79K |
| dEBORA τ = 8 | 93.74 | 64K |
| dEBORA (Oblique) τ = 16 | 85.65 | 66K |
| dEBORA (Stiefel) τ = 16 | 85.83 | 72K |

Dreambooth Stable Diffusion (Table 2, right):

| Method | Loss | # Params |
|---|---|---|
| LoRA (r = 2, r = 8) | 0.245 | 3.4M |
| LoRA (r = 4, r = 8) | 0.247 | 5.5M |
| LoRA (r = 10, r = 8) | 0.241 | 11.3M |
| AdaLoRA | 0.245 | 4.7M |
| AdaLoRA | 0.247 | 1.78M |
| dEBORA τ = 16 | 0.262 | 0.4M |
| dEBORA (Oblique) τ = 16 | 0.231 | 0.4M |
| dEBORA (Stiefel) τ = 16 | 0.252 | 0.4M |

As in (Zhang et al., 2023), the adapters were applied to each attention layer, including the query, key, and value matrices, and to the feed-forward layers. Results are presented in Table 1. Additionally, we tested our approach in both unconstrained and constrained settings, forcing the basis matrices to have columns of unit norm (Oblique) or orthonormal columns (Stiefel). As both constraints form a manifold with an explicit retraction, we compute $B^*$ by performing first-order stochastic Riemannian optimization (Sato, 2021) at the lower level. Given the product nature of the constraints and the relatively small starting rank, retractions in these cases remained computationally affordable.

7.2 FINE-TUNING OF HIGHER-ORDER TENSOR LAYERS (d > 2)

In this experiment, we evaluate our bilevel optimizer on fine-tuning higher-order tensor layers. We tested ResNet50 (He et al., 2015) on CIFAR-10 (Krizhevsky & Hinton, 2009) and Stable Diffusion (Rombach et al., 2021), with results reported in Table 2 (left) and Table 2 (right), respectively. For ResNet50, we applied adapters to each convolutional layer while retraining the final fully connected layer with an appropriately sized one. All methods used the same training settings: a constant learning rate of $5 \times 10^{-1}$, weight decay of $1 \times 10^{-3}$, LoRA $\alpha = 32$, and no dropout.
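Both constraint sets above admit cheap explicit retractions, which is what keeps the Riemannian lower-level steps affordable. A minimal numpy sketch of the two standard retractions (column normalization for the Oblique manifold, a QR-based retraction for the Stiefel manifold; function names are ours, not from the paper's code):

```python
import numpy as np

def retract_oblique(U):
    """Oblique manifold: rescale each column back to unit Euclidean norm."""
    return U / np.linalg.norm(U, axis=0, keepdims=True)

def retract_stiefel(U):
    """Stiefel manifold: QR-based retraction onto orthonormal columns."""
    Q, R = np.linalg.qr(U)
    # Fix column signs (positive diagonal of R) so the map is well defined.
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
U = rng.standard_normal((16, 4))
assert np.allclose(np.linalg.norm(retract_oblique(U), axis=0), 1.0)
W = retract_stiefel(U)
assert np.allclose(W.T @ W, np.eye(4), atol=1e-10)
```

Both operations act columnwise or via a thin QR of a tall matrix, so for the small starting ranks used here their cost is negligible next to a forward/backward pass.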
We note that the original LoRA implementation (Hu et al., 2022) for convolutional layers uses a matrix factorization of the weight tensor, so the number of parameters scales as $O(r(F + Ck^2))$. In contrast, our CP-like factorization scales more efficiently as $O(r(2k + C + F))$, where $F$ is the number of output features, $C$ is the number of input channels, and $k$ is the kernel size. This more efficient representation is reflected in the total number of trainable parameters reported in Table 2 (left). Despite using fewer parameters, our approach competes well with the baselines, particularly in the unconstrained setting, which is computationally less demanding due to the absence of retractions. For the Stable Diffusion experiment, we inserted adapters into the convolutional layers of the UNet and the linear layers of the text encoder. All models were trained under the same conditions as in (Mangrulkar et al., 2022), and the results are reported in Table 2 (right). For AdaLoRA, we implemented the same tensor-based convolutional adapter representation used in our method, which led to higher compression compared to LoRA, which does not use tensor factorization. Even with a similar parameterization, our method consistently outperforms the other baselines, achieving better results with fewer effective parameters.

8 CONCLUSION

In this paper, we introduced a novel bilevel optimization framework for low-rank adaptation, providing a parameter-efficient fine-tuning strategy for large-scale neural networks. Our approach leverages a dynamic rank selection mechanism within a bilevel structure, enabling efficient adaptation while minimizing the number of trainable parameters. A key contribution of our work is the theoretical analysis of the proposed stochastic bilevel optimizer, where we established convergence guarantees.
Through a variety of experiments on the GLUE benchmark, ResNet50, and Stable Diffusion, we demonstrated that our method consistently matches or outperforms existing state-of-the-art approaches such as AdaLoRA and LoRA, while significantly reducing the parameter count.

FUNDING ACKNOWLEDGEMENTS

The work of Francesco Rinaldi has been partially funded by the European Union - Next Generation EU under the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.1 - Call PRIN 2022 No. 104 of February 2, 2022 of the Italian Ministry of University and Research; Project 2022BMBW2A (subject area: PE - Physical Sciences and Engineering) "Large-scale optimization for sustainable and resilient energy systems", CUP I53D23002310006. The work of E. Zangrando was funded by the MUR-PNRR project "Low-parametric machine learning". Francesco Tudisco is partially funded by the PRIN-MUR project MOLE, code 2022ZK5ME7, MUR D.D. financing decree n. 20428 of November 6th, 2024, CUP B53C24006410006; and by the PRIN-PNRR project FIN4GEO within the European Union's Next Generation EU framework, Mission 4, Component 2, CUP P2022BNB97.

REFERENCES

Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. Intrinsic dimensionality explains the effectiveness of language model fine-tuning, 2020.

Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 7319-7328, 2021.

Amir Beck and Shimrit Shtern. Linearly convergent away-step conditional gradient for non-strongly convex functions. Mathematical Programming, 164:1-27, 2017.

Dimitri Bertsekas. Nonlinear Programming, volume 4. Athena Scientific, 2016.

Immanuel M Bomze, Francesco Rinaldi, and Damiano Zeffiro.
Active set complexity of the away-step Frank-Wolfe algorithm. SIAM Journal on Optimization, 30(3):2470-2500, 2020.

Immanuel M Bomze, Francesco Rinaldi, and Damiano Zeffiro. Frank-Wolfe and friends: a journey into projection-free first-order optimization methods. 4OR, 19(3):313-345, 2021.

Gábor Braun, Alejandro Carderera, Cyrille W Combettes, Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Sebastian Pokutta. Conditional gradient methods. arXiv preprint arXiv:2211.14103, 2022.

Benoît Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimization. Annals of Operations Research, 153(1):235-256, 2007.

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36, 2024.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1126-1135. PMLR, 06-11 Aug 2017. URL https://proceedings.mlr.press/v70/finn17a.html.

Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1165-1173. PMLR, 06-11 Aug 2017. URL https://proceedings.mlr.press/v70/franceschi17a.html.

Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pp. 1568-1577. PMLR, 2018.

Riccardo Grazzi, Luca Franceschi, Massimiliano Pontil, and Saverio Salzo. On the iteration complexity of hypergradient computation.
In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 3748-3758. PMLR, 13-18 Jul 2020. URL https://proceedings.mlr.press/v119/grazzi20a.html.

Jacques Guélat and Patrice Marcotte. Some comments on Wolfe's "away step". Mathematical Programming, 35(1):110-119, 1986.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015. URL https://arxiv.org/abs/1512.03385.

Pengcheng He, Jianfeng Gao, and Weizhu Chen. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing, 2023. URL https://arxiv.org/abs/2111.09543.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2790-2799. PMLR, 09-15 Jun 2019. URL https://proceedings.mlr.press/v97/houlsby19a.html.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.

Chenghao Huang, Qing Liu, Bingyu Lin, Tong Pang, Changlong Du, and Ming Lin. LoraHub: Efficient cross-task generalization via dynamic LoRA composition. arXiv preprint arXiv:2307.13269, 2023.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images.
Technical Report 0, University of Toronto, Toronto, Ontario, 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.

Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. In International Conference on Learning Representations, 2018.

Jiaxiang Li and Shiqian Ma. Riemannian bilevel optimization. arXiv preprint arXiv:2402.02019, 2024.

Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, and Anna Rumshisky. ReLoRA: High-rank training through low-rank updates, 2023.

Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search, 2019. URL https://arxiv.org/abs/1806.09055.

Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicit differentiation. In Silvia Chiappa and Roberto Calandra (eds.), Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 1540-1552. PMLR, 26-28 Aug 2020. URL https://proceedings.mlr.press/v108/lorraine20a.html.

Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.

Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. AdapterFusion: Non-destructive task composition for transfer learning, 2021. URL https://arxiv.org/abs/2005.00247.

Rushi Qiang, Ruiyi Zhang, and Pengtao Xie. BiLoRA: A bi-level optimization framework for overfitting-resilient low-rank adaptation of large pre-trained models. arXiv preprint arXiv:2403.13037, 2024.

Francesco Rinaldi and Damiano Zeffiro. Avoiding bad steps in Frank-Wolfe variants. Computational Optimization and Applications, pp. 1-40, 2022.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.
High-resolution image synthesis with latent diffusion models, 2021.

Hiroyuki Sato. Riemannian Optimization and Its Applications, volume 670. Springer, 2021.

Steffen Schotthöfer, Emanuele Zangrando, Jonas Kusch, Gianluca Ceruti, and Francesco Tudisco. Low-rank lottery tickets: finding efficient low-rank neural networks via matrix differential equations. In Advances in Neural Information Processing Systems, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/7e98b00eeafcdaeb0c5661fb9355be3a-Paper-Conference.pdf.

Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. DyLoRA: Parameter-efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 3274-3287, 2023.

Sara Venturini, Andrea Cristofari, Francesco Rinaldi, and Francesco Tudisco. Learning the right layers: a data-driven layer-aggregation strategy for semi-supervised learning on multilayer graphs. In International Conference on Machine Learning, pp. 35006-35023. PMLR, 2023.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding, 2019. URL https://arxiv.org/abs/1804.07461.

Hongyi Wang, Saurabh Agarwal, and Dimitris Papailiopoulos. Pufferfish: Communication-efficient models at no extra cost. Proceedings of Machine Learning and Systems, 3:365-386, 2021.

Emanuele Zangrando, Steffen Schotthöfer, Gianluca Ceruti, Jonas Kusch, and Francesco Tudisco. Rank-adaptive spectral pruning of convolutional layers during training. In Advances in Neural Information Processing Systems, 2024.

Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning, 2023.
Yihua Zhang, Yuguang Yao, Parikshit Ram, Pu Zhao, Tianlong Chen, Mingyi Hong, Yanzhi Wang, and Sijia Liu. Advancing model pruning via bi-level optimization. In Advances in Neural Information Processing Systems, volume 35, pp. 18309-18326, 2022.

Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-efficient LLM training by gradient low-rank projection, 2024.

In the next sections, we include the proofs of the results in the main text and a schematic implementation of the training procedure.

B PROOF OF THEOREM 4.1

The hypergradient equation given in Equation (2) only requires the computation of $\partial_s B^*(s)$, i.e., a solution of the implicit gradient Equation (3)
$$\nabla^2_B f_2 \, \partial_s B^* + \partial_s \nabla_B f_2 = 0. \qquad (16)$$
Using the chain rule and the fact that $f_2(B, s) = L_2(\Psi(B, s))$, we have that
$$\nabla^2_B f_2 = \partial_B \nabla_B f_2 = \partial_B\big(\nabla_\Psi L_2 \, \partial_B \Psi\big) = \nabla^2_\Psi L_2 \, \partial_B \Psi \otimes \partial_B \Psi + \nabla_\Psi L_2 \, \partial^2_B \Psi. \qquad (17)$$
In a similar manner,
$$\partial_s \nabla_B f_2 = \partial_B\big(\nabla_\Psi L_2 \, \partial_s \Psi\big) = \nabla^2_\Psi L_2 \, \partial_B \Psi \, \partial_s \Psi + \nabla_\Psi L_2 \, \partial_B \partial_s \Psi. \qquad (18)$$
Plugging Equations (17) and (18) into Equation (16), we get
$$\big(\nabla^2_\Psi L_2 \, \partial_B \Psi \otimes \partial_B \Psi + \nabla_\Psi L_2 \, \partial^2_B \Psi\big)\, \partial_s B^* + \nabla^2_\Psi L_2 \, \partial_B \Psi \, \partial_s \Psi + \nabla_\Psi L_2 \, \partial_B \partial_s \Psi = 0,$$
and by suitably rearranging this expression,
$$\nabla^2_\Psi L_2 \, \partial_B \Psi \big(\partial_B \Psi \, \partial_s B^* + \partial_s \Psi\big) + \nabla_\Psi L_2 \big(\partial^2_B \Psi \, \partial_s B^* + \partial_s \partial_B \Psi\big) = 0. \qquad (19)$$
Assuming for the moment that $\nabla^2_\Psi L_2 \approx 0$, the solution of Equation (19) is given by the solution of $\partial^2_B \Psi \, \partial_s B^* = -\partial_s \partial_B \Psi$. Since $\Psi(B, s)$ is multilinear in the basis matrices $B = (U^{(1)}, \dots, U^{(d)})$, the diagonal second-derivative blocks vanish, i.e. $\partial^2_{U^{(i)}} \Psi = 0$, and we get the system of equations
$$\sum_{j=1,\, j \ne i}^{d} \partial^2_{U^{(i)}, U^{(j)}} \Psi \; \partial_s U^{(j)*} = -\partial_s \partial_{U^{(i)}} \Psi, \qquad i = 1, \dots, d.$$
We hence just need to calculate
$$\partial^2_{U^{(i)}, U^{(j)}} \Psi = \partial_{U^{(i)}, U^{(j)}} \sum_{k=1}^{r} s_k \, U^{(1)} e_k \otimes \cdots \otimes U^{(i)} e_k \otimes \cdots \otimes U^{(j)} e_k \otimes \cdots \otimes U^{(d)} e_k.$$
For $d \ge 3$, this system cannot be solved in closed form, but given the symmetry of the Hessian one can exploit the structure through the use of Krylov methods.
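To make the structure of $\Psi$ concrete: since $\Psi(B, s) = \sum_k s_k \, U^{(1)} e_k \otimes \cdots \otimes U^{(d)} e_k$ is a CP-style construction, it is linear in each factor $U^{(i)}$ separately, which is exactly why the diagonal blocks $\partial^2_{U^{(i)}} \Psi$ vanish. A small numpy sketch for $d = 3$ (the shapes are illustrative, not from the paper):

```python
import numpy as np

def psi(Us, s):
    """CP-style adapter for d = 3: Psi(B, s) = sum_k s_k U1[:,k] x U2[:,k] x U3[:,k]."""
    U1, U2, U3 = Us
    return np.einsum('k,ik,jk,lk->ijl', s, U1, U2, U3)

rng = np.random.default_rng(0)
r = 4
Us = [rng.standard_normal((n, r)) for n in (5, 6, 7)]
s = rng.random(r)
T = psi(Us, s)
assert T.shape == (5, 6, 7)
# Multilinearity: scaling a single factor (or s) scales Psi linearly, so the
# second derivative with respect to that same factor is identically zero.
assert np.allclose(psi([2 * Us[0], Us[1], Us[2]], s), 2 * T)
assert np.allclose(psi(Us, 2 * s), 2 * T)
```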
For $d = 2$ instead, the last calculation can be carried out in closed form: testing the resulting identity on the left by $e_i$ and on the right by $e_j$ componentwise leads to
$$\partial_{s_\eta} U^{(2)*}_{j\beta} = -\delta_{\beta\eta}\, U^{(2)}_{j\beta}\, s_\beta^{-1}, \qquad \partial_{s_\eta} U^{(1)*}_{j\beta} = -\delta_{\beta\eta}\, U^{(1)}_{j\beta}\, s_\beta^{-1}.$$
We hence get the final hypergradient expression by means of Equation (2):
$$\frac{d}{ds} f_1(B^*(s), s) = \partial_s f_1 - \mathrm{diag}\big[\langle U^{(1)}, \partial_{U^{(1)}} f_1\rangle + \langle U^{(2)}, \partial_{U^{(2)}} f_1\rangle\big] \odot s^{-1}.$$
We are left to estimate the error between this approximate solution and the true hypergradient. The structure of Equation (19) can be summarized by an error estimate for a linear system of the form
$$A(Bx + b) + C(Dx + d) = 0,$$
with $A = \nabla^2_\Psi L_2$, $B = \partial_B \Psi \otimes \partial_B \Psi$, $C = \nabla_\Psi L_2$, $D = \partial^2_B \Psi$, $b = \partial_s \Psi$, $d = \partial_s \partial_B \Psi$, $x = \partial_s B^*$.

Now let $x^* = -(AB + CD)^{-1}(Ab + Cd)$ be the solution of the above linear system, and let $y^* = -D^{-1}d$ be the solution of $Dx + d = 0$, which exactly represents our approximation. Let us define $M = AB + CD$ and estimate the distance between the two solutions:
$$\|y^* - x^*\| = \big\|(D^{-1} - M^{-1}C)\,d + M^{-1}Ab\big\| \le \|D^{-1} - M^{-1}C\|\,\|d\| + \|M^{-1}\|\,\|A\|\,\|b\|.$$
Using the fact that for two invertible linear operators $P, Q$ the identity $(P + Q)^{-1} = P^{-1} - P^{-1}Q(P + Q)^{-1}$ holds, we get
$$D^{-1} - M^{-1}C = D^{-1} - \big((CD)^{-1} - (CD)^{-1}(AB)M^{-1}\big)C = (CD)^{-1}AB\,M^{-1}C. \qquad (20)$$
Inserting Equation (20) into the last estimate gives
$$\|y^* - x^*\| \le \|(CD)^{-1}\|\,\|A\|\,\|B\|\,\|M^{-1}C\|\,\|d\| + \|M^{-1}\|\,\|A\|\,\|b\| = \alpha\,\|A\|, \qquad \alpha = \|(CD)^{-1}\|\,\|B\|\,\|M^{-1}C\|\,\|d\| + \|M^{-1}\|\,\|b\|.$$
Using the hypothesis of uniform invertibility and boundedness of $CD$ together with the control on the Hessian norm, a Neumann expansion with $\|AB(CD)^{-1}\| \le K\beta < 1$ yields
$$\|M^{-1}\| \le \frac{\|(CD)^{-1}\|}{1 - \|(CD)^{-1}AB\|} \le \frac{\beta}{1 - K\beta}.$$
Finally, we get
$$\|y^* - x^*\| \lesssim \Big(\frac{\beta}{1 - K\beta}\,\|B\|\,\|C\|\,\|d\| + \frac{\beta}{1 - K\beta}\,\|b\|\Big)\|A\| \lesssim \beta K,$$
where the omitted constant is bounded because the bases lie in a product of Stiefel manifolds, which are compact, and thus all the terms in parentheses can be bounded.
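The operator identity $(P + Q)^{-1} = P^{-1} - P^{-1}Q(P + Q)^{-1}$ used in the estimate above is easy to sanity-check numerically; a quick sketch with a random, well-conditioned $P$ (the matrices are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
# A well-conditioned P (close to the identity) and a small perturbation Q,
# so that both P and P + Q are invertible.
P = np.eye(n) + 0.1 * rng.standard_normal((n, n))
Q = 0.1 * rng.standard_normal((n, n))
lhs = np.linalg.inv(P + Q)
rhs = np.linalg.inv(P) - np.linalg.inv(P) @ Q @ np.linalg.inv(P + Q)
assert np.allclose(lhs, rhs, atol=1e-10)
```

The identity follows directly by right-multiplying both sides by $(P + Q)$, and only requires $P$ and $P + Q$ to be invertible.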
For the general tensor case $d > 2$, following the previous calculation, a tridiagonal approximation $\tilde D$ of $D$ immediately leads to the error bound
$$\|y^* - x^*\| \lesssim \beta K + \|D^{-1} - \tilde D^{-1}\| \lesssim \beta K + \|D^{-1}\|\,\|\tilde D^{-1}\|\,\|D - \tilde D\|.$$
Taking the difference $\|G(s) - \frac{d}{ds} f_1(s)\| \le \|\partial_B f_1\|\,\|x^* - y^*\|$ and using the boundedness of $\partial_B f_1$, we get the desired result.

C ADDITIONAL DETAILS FOR SECTION 5

In this section, we report additional details on the computational cost of the stochastic away-step Frank-Wolfe algorithm presented in Section 5. The optimization problems used to determine the Frank-Wolfe and away-step directions can be solved efficiently, since we are minimizing a linear function over the perturbed $\ell_1$ simplex. In the Frank-Wolfe step, we indeed minimize a linear function over a polytope $S$. By the fundamental theorem of linear programming, one of the vertices of $S$ is a solution of such a linear program (see, e.g., Bertsekas, 2016). Therefore, we simply need to evaluate the objective on this finite set of vertices, a task that can be performed at linear cost. In particular,
$$\tilde G(s_n)^\top v_i = \begin{cases} -\varepsilon \sum_{j=1}^{r} \tilde G(s_n)_j & \text{if } i = 0,\\[2pt] -\varepsilon \sum_{j=1}^{r} \tilde G(s_n)_j + \tau\, \tilde G(s_n)_i & \text{otherwise,} \end{cases}$$
with $V = \{v_0, v_1, \dots, v_r\}$ the set of vertices of the feasible set $S$. Therefore, if $\tilde G(s_n)_i > 0$ for all $i = 1, \dots, r$, then $\hat\imath = 0$; otherwise we choose $\hat\imath \in \arg\min_i \tilde G(s_n)_i$. Similarly, in the away step we simply need to evaluate the objective on the set of active vertices $S_n$ used to describe the iterate: if $0 \in S_n$ and $\tilde G(s_n)_j < 0$ for all $j \in S_n$, then $\hat\jmath = 0$; otherwise we choose $\hat\jmath \in \arg\max_{j \in S_n} \tilde G(s_n)_j$.

D PROOF OF THEOREM 6.2

We first notice that, using barycentric coordinates, Problem (4) can be written in the form
$$\min f(s) \quad \text{s.t.} \quad s = Aw, \; e^\top w = 1, \; w \ge 0,$$
with $A = [v_0, v_1, \dots, v_r] \in \mathbb{R}^{r \times (r+1)}$.
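To illustrate the barycentric reformulation, here is a small numpy sketch with stand-in vertices (the origin plus scaled canonical basis vectors; the actual vertex set of the perturbed simplex $S$ differs, and these shapes are purely illustrative):

```python
import numpy as np

r = 3
tau = 2.0
# Illustrative vertex matrix A = [v0, v1, ..., vr]: the origin plus the
# canonical basis scaled by tau, standing in for the vertices of S.
A = np.hstack([np.zeros((r, 1)), tau * np.eye(r)])
# Barycentric weights: nonnegative and summing to one.
w = np.array([0.4, 0.6, 0.0, 0.0])
s = A @ w
# Active vertices are exactly those carrying positive barycentric weight.
active = {j for j in range(r + 1) if w[j] > 0}
assert np.isclose(w.sum(), 1.0) and np.all(w >= 0)
assert active == {0, 1}
assert np.allclose(s, [1.2, 0.0, 0.0])
```

The set `active` plays the role of $S_n = \{j : w_j > 0\}$ introduced next, and is what the away step shrinks by dropping one vertex at a time.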
Given a feasible iterate $s_n \in \mathbb{R}^r$ for Problem (4), we can define the non-negative weights $w \in \mathbb{R}^{r+1}$ of the barycentric coordinates and the set $S_n = \{j : w_j > 0\}$ of indices of the so-called active vertices, i.e. the vertices with a positive weight $w_j$ in the barycentric coordinates. It is then easy to see that, in this case, the linearized problem defined in the away-step-direction procedure (Step 6 of our algorithm) simply reduces to selecting the vertex $v_{\hat\jmath}$ that maximizes the scalar product $\tilde G(s_n)^\top v_{\hat\jmath}$, with $\hat\jmath \in S_n = \{j : w_j > 0\}$. The following chain of inequalities holds:
$$\nabla f(s_n)^\top h^{FW}_n \le \nabla f(s_n)^\top h_n \le \tilde G(s_n)^\top h_n + \tilde\epsilon \le \tilde G(s_n)^\top h^{FW}_n + \tilde\epsilon \le \nabla f(s_n)^\top h^{FW}_n + 2\tilde\epsilon, \qquad (21)$$
where we used Equations (5) and (6) in the second and the last inequality with $\tilde\epsilon = \chi + \chi = 2\chi$, while the first and the third inequality follow from the definitions of $h^{FW}_n$ and $h_n$. In particular, using the definitions of $\tilde g_n$, $g_n$, and $g^{FW}_n$, from Equation (21) we can write
$$g^{FW}_n \ge \tilde g_n - \tilde\epsilon, \qquad (22)$$
$$\tilde g_n \ge g^{FW}_n - \tilde\epsilon, \qquad (23)$$
$$g_n \ge \tilde g_n - \tilde\epsilon. \qquad (24)$$
Using Equations (7) and (22), we also have
$$\tilde\epsilon \le \eta(\tilde g_n - \tilde\epsilon) \le \eta\, g^{FW}_n. \qquad (25)$$
Now, let us distinguish three cases.

A) If $\tilde\alpha_n < \alpha^{\max}_n$, from Equation (8) it follows that $\frac{\tilde g_n}{M\|d_n\|^2} < \alpha^{\max}_n$ and $\alpha_n = \frac{\tilde g_n}{M\|d_n\|^2}$. Let $1_A$ denote the indicator function for this case. From Equations (23) and (25), we have
$$\tilde g_n \ge (1 - \eta)\, g^{FW}_n. \qquad (26)$$
Using Equations (9) and (26), we can write
$$1_A\{f(s_n) - f(s_n + \alpha_n d_n)\} \ge 1_A\,\rho\,\alpha_n \tilde g_n \ge 1_A\, \frac{\rho(1-\eta)^2}{M\|d_n\|^2}\, (g^{FW}_n)^2. \qquad (27)$$
As for $S_n$, by hypothesis we have either $d_n = h_n = v_{\hat\imath} - s_n$ or $d_n = b_n = s_n - v_{\hat\jmath}$ for some $\hat\imath, \hat\jmath \in [0:r]$. In particular, $S_{n+1} \subseteq S_n \cup \{\hat\imath\}$, so that $|S_{n+1}| \le |S_n| + 1$.

B) If $\alpha_n = \tilde\alpha_n = \alpha^{\max}_n$ and $d_n = h_n$, then $\alpha_n = \alpha^{\max}_n = 1$. Let $1_B$ denote the indicator function for this case.
By the standard descent lemma, we can write
$$1_B\{f(s_n) - f(s_{n+1})\} = 1_B\{f(s_n) - f(s_n + d_n)\} \ge 1_B\Big\{g_n - \frac{M\|d_n\|^2}{2}\Big\}.$$
Using Equation (24) and the fact that $\frac{\tilde g_n}{M\|d_n\|^2} \ge 1$ in this case, we obtain
$$1_B\{f(s_n) - f(s_{n+1})\} \ge 1_B\Big\{\frac{\tilde g_n}{2} - \tilde\epsilon\Big\}.$$
From Equations (23) and (25), we also have
$$\frac{\tilde g_n}{2} - \tilde\epsilon \ge \frac{g^{FW}_n}{2} - \frac{3}{2}\tilde\epsilon \ge \frac{g^{FW}_n}{2} - \frac{3\eta}{2}\, g^{FW}_n = \frac{1 - 3\eta}{2}\, g^{FW}_n.$$
Hence, we can write
$$1_B\{f(s_n) - f(s_{n+1})\} \ge 1_B\, \frac{\rho(1-\eta)^2}{M\|d_n\|^2}\, (g^{FW}_n)^2, \qquad (28)$$
where in the last inequality we used Equation (10). Reasoning as in the case above, we also have $|S_{n+1}| \le |S_n| + 1$.

C) If $\alpha_n = \tilde\alpha_n = \alpha^{\max}_n$ and $d_n = b_n = s_n - v_{\hat\jmath}$ for $\hat\jmath \in S_n$, then $w^i_n = 0$ for $i \in [1:r] \setminus (S_n \cup \{\hat\jmath\})$ and $w^i_{n+1} > 0$ for $i \in S_n \setminus \{\hat\jmath\}$. In particular, $|S_{n+1}| = |S_n| - 1$.

Now, based on the three cases analyzed above, we partition the iterations $\{0, 1, \dots, T-1\}$ into three subsets defined as follows:
$$N_A(T) = \{n < T : \tilde\alpha_n < \alpha^{\max}_n\}, \quad N_B(T) = \{n < T : \tilde\alpha_n = \alpha^{\max}_n,\, d_n = h_n\}, \quad N_C(T) = \{n < T : \tilde\alpha_n = \alpha^{\max}_n,\, d_n = b_n\}.$$
By induction on the recurrence relation proved for $|S_n|$, we have
$$|S_T| \le |S_0| + |N_A(T)| + |N_B(T)| - |N_C(T)| \quad \text{for every } T \in \mathbb{N}. \qquad (29)$$
Since $|N_C(T)| = T - |N_A(T)| - |N_B(T)|$, from Equation (29) we get
$$|N_A(T)| + |N_B(T)| \ge \frac{T + |S_T| - |S_0|}{2}, \qquad (30)$$
where we used $|S_0| = \ell \ge |S_T|$. Considering just the good cases, where we have a bound on the decrease, using Equations (27) and (28) we can write
$$\mathbb{E}_n[f(s_n) - f(s_{n+1})] \ge \mathbb{E}_n\big[(1_A + 1_B)\{f(s_n) - f(s_{n+1})\}\big] \ge \frac{\rho(1-\eta)^2}{M\|d_n\|^2}\, (g^{FW}_n)^2. \qquad (31)$$
Using Equation (31) and taking full expectations, we obtain
$$\sum_{n=0}^{T-1} \mathbb{E}[f(s_n) - f(s_{n+1})] \ge \sum_{n \in N_A(T) \cup N_B(T)} \mathbb{E}[f(s_n) - f(s_{n+1})] \ge \big(|N_A(T)| + |N_B(T)|\big)\, \frac{\rho(1-\eta)^2}{M\|d_n\|^2}\, \min_{0 \le n \le T-1} \mathbb{E}\big[g^{FW}_n\big]^2 \ge \big(|N_A(T)| + |N_B(T)|\big)\, \frac{\rho(1-\eta)^2}{M\|d_n\|^2}\, \mathbb{E}[g^*_T]^2,$$
where Jensen's inequality was used to pass from $\mathbb{E}[(g^{FW}_n)^2]$ to $\mathbb{E}[g^{FW}_n]^2$, the second-to-last inequality uses the definition of $g^*_T$, and the last inequality is due to Equation (30).
Rearranging, and using $\sum_{n=0}^{T-1} \mathbb{E}[f(s_n) - f(s_{n+1})] \le f(s_0) - f^*$, we obtain a bound on $\mathbb{E}[g^*_T]^2$ proportional to $2M(f(s_0) - f^*)$ divided by the number of good iterations, leading to the desired result.

E PROOF OF LEMMA 6.3

The proof partially follows the one in (Venturini et al., 2023). Reasoning as in the proof of Theorem 6.2, Equation (24) holds. By the standard descent lemma, we can write
$$f(s_n) - f(s_n + \alpha d_n) \ge \alpha g_n - \frac{\alpha^2 M \|d_n\|^2}{2} \ge \alpha(\tilde g_n - \tilde\epsilon) - \frac{\alpha^2 M \|d_n\|^2}{2}, \quad \alpha \in \mathbb{R}, \qquad (32)$$
where the last inequality follows from Equation (24). At iteration $n$, if the stepsize is $\alpha = \tilde\alpha_n = \frac{\tilde g_n}{M\|d_n\|^2}$, substituting into the right-hand side of Equation (32) gives
$$f(s_n) - f(s_n + \tilde\alpha_n d_n) \ge \frac{\tilde g_n(\tilde g_n - \tilde\epsilon)}{M\|d_n\|^2} - \frac{\tilde g_n^2}{2M\|d_n\|^2} = \frac{\tilde g_n^2}{2M\|d_n\|^2} - \frac{\tilde g_n \tilde\epsilon}{M\|d_n\|^2}.$$
Using Equation (14), we obtain
$$f(s_n) - f(s_n + \tilde\alpha_n d_n) \ge \frac{\tilde g_n^2}{2M\|d_n\|^2} - \frac{\eta}{1+\eta}\,\frac{\tilde g_n^2}{M\|d_n\|^2} = \frac{1-\eta}{2(1+\eta)}\, \frac{\tilde g_n^2}{M\|d_n\|^2}.$$
Thus, we have
$$f(s_n) - f(s_n + \tilde\alpha_n d_n) \ge \rho\, \frac{\tilde g_n^2}{M\|d_n\|^2} = \rho\, \tilde\alpha_n \tilde g_n.$$
If instead $\alpha_n$ is computed by the Armijo line search, then
$$f(s_n) - f(s_n + \alpha d_n) \ge \gamma \alpha \tilde g_n \quad \text{for all } \alpha \in \Big[0,\; \frac{2\big((1-\gamma)\tilde g_n - \tilde\epsilon\big)}{M\|d_n\|^2}\Big].$$
Since $\alpha_n$ is computed by Equations (12) and (13), we can write
$$\alpha_n \ge \min\Big\{1,\; 2\delta\,\frac{(1-\gamma)\tilde g_n - \tilde\epsilon}{M\|d_n\|^2}\Big\} \ge \min\Big\{1,\; 2\delta\,\frac{(1-\gamma-\eta)\tilde g_n}{M\|d_n\|^2}\Big\} \ge \min\big(1,\, 2\delta(1-\gamma-\eta)\big)\, \tilde\alpha_n,$$
where the second inequality follows from Equation (14). We hence have $\alpha_n \ge \min\{1, c\,\tilde\alpha_n\}$ for some $c > 0$. We now consider two cases: when $c \ge 1$, $\tilde\alpha_n$ is of course a lower bound for the stepsize $\alpha_n$; when $c < 1$, we can still recover Equation (8) by using $\tilde M = M/c$ in place of $M$ as Lipschitz constant.

F PROOF OF THEOREM 6.5

By the stationarity of $s^*$ and by the definition of exposed face, we can write
$$\lambda_v(s^*) \ge 0 \quad \forall v \in V, \qquad (33)$$
$$\lambda_v(s^*) = 0 \iff v \in F(s^*). \qquad (34)$$
By continuity, we can choose
$$\Gamma(s^*) = \frac{\lambda^{\min}_v(s^*) - 2\chi}{2} > 0 \qquad (35)$$
such that, for every $s_n \in B_\Gamma(s^*)$,
$$\lambda_v(s_n) > \lambda_v(s^*) - \Gamma(s^*) \quad \forall v \in V \setminus (V \cap F(s^*)). \qquad (36)$$
Using Equations (5) and (6) together with Equation (36), we can write
$$\tilde G(s_n)^\top (v - s_n) + 2\chi \ge \nabla f(s_n)^\top (v - s_n) > \lambda_v(s^*) - \Gamma(s^*), \qquad (37)$$
and hence
$$\tilde G(s_n)^\top (v - s_n) > \lambda_v(s^*) - \Gamma(s^*) - 2\chi \ge \frac{\lambda^{\min}_v(s^*) - 2\chi}{2} > 0 \quad \forall v \in V \setminus (V \cap F(s^*)),$$
where in the first inequality we used Equation (37) and in the second we used Equation (34) and the definitions of $\lambda^{\min}_v(s^*)$ in Equation (15) and $\Gamma(s^*)$ in Equation (35).
The algorithm will then only choose a Frank-Wolfe or an away direction that guarantees the new iterate $s_{n+1}$ to stay in the face, so $s_{n+1} \in F(s^*)$. If we further have that the level set $L(f(s_n))$ satisfies $L(f(s_n)) \subseteq B_\Gamma(s^*)$ then, since the sequence $\{f(s_n)\}$ is non-increasing, we easily see that $s_{n+1} \in B_\Gamma(s^*) \cap F(s^*)$.

G PER-ITERATION COST OF DEBORA

We can give an upper bound on the number of arithmetic operations carried out at every dEBORA iteration: each iteration requires $T_{\max} C_r$ operations, where $C_r$ is the cost of one lower-level optimization step at rank $r$ and $T_{\max}$ is the maximal number of iterations of the lower-level optimization. Note that $C_r$ includes the cost of the forward pass, the backward pass, and the optimizer step. In particular, $C_r$ equals the cost of one optimization step of (Zhang et al., 2023) without rank adaptation. Moreover, we need to add the cost of computing the hypergradient, which is $O(Lrn^2)$, where $L$ is the number of layers in the neural network (assuming, for simplicity, square layers of dimension $n$). Note that once the correct face is identified, both the cost $C_r$ of a single lower-level iteration and the cost of the upper-level step decrease. Finally, we have a linear minimization oracle cost of $O(r)$ and an optional line search with cost $O(1)$. The cost of computing the Frank-Wolfe direction (the linear minimization oracle) is $O(r)$ because it requires solving the subproblem $\min_{s \in S} \tilde G(s_n)^\top s$. This is a linear minimization problem over a convex polytope, so by the fundamental theorem of linear programming its solution lies at one of the vertices: since the vertices are the null vector and the vectors of the canonical basis, the solution can be found by simply scanning the entries of $\tilde G(s_n)$ for the smallest one.
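This $O(r)$ vertex scan takes only a few lines; a sketch (the integer encoding of vertices, with 0 for the null vertex and $i+1$ for the $i$-th canonical vertex, is our own convention):

```python
import numpy as np

def fw_vertex(grad):
    """O(r) linear minimization oracle sketch: since every vertex shares a
    common term, the Frank-Wolfe vertex is the null vertex (index 0) when all
    gradient entries are positive, and otherwise the canonical vertex whose
    gradient entry is the most negative (encoded as i + 1)."""
    i = int(np.argmin(grad))
    return 0 if grad[i] > 0 else i + 1

# The most negative entry (index 1) selects canonical vertex 2 = 1 + 1.
assert fw_vertex(np.array([0.3, -1.2, 0.5])) == 2
# All entries positive: the null vertex wins.
assert fw_vertex(np.array([0.2, 0.1, 0.4])) == 0
```

A single `argmin` over $r$ entries is all that is needed, which is why the oracle cost is negligible compared to the lower-level training steps.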
Summing up, each iteration of the method costs $O(T_{\max} C_r + Lrn^2)$, against $O(C_r)$ for (Hu et al., 2022) and $O(C_r + Lrn^2)$ for (Zhang et al., 2023) (including their cost of computing the sensitivity metric for rank truncation). While a single iteration of dEBORA is more expensive than one of (Zhang et al., 2023) by a factor $T_{\max}$ (which can be predefined by the user as the maximal number of lower-level iterations), in Figure 4 we report GPU statistics collected during training on the MRPC task (GLUE benchmark), showing that the effective computing time of dEBORA is competitive. As we can observe from Figure 4, the effective memory consumption is lower, together with the average GPU power usage. Moreover, while the cost of a single iteration is larger, the effective time to reach the same accuracy is comparable.

[Figure omitted.] Figure 1: Time vs. accuracy for the MRPC task.
[Figure omitted.] Figure 2: GPU memory consumption for the MRPC task.
[Figure omitted.] Figure 3: Energy consumption on the MRPC task.
Figure 4: GPU statistics during training (A100 80GB).