Parameter-Efficient Fine-Tuning with Controls

Chi Zhang* 1  Jingpu Cheng* 1  Yanyu Xu 2  Qianxiao Li 1

Abstract

In contrast to the prevailing interpretation of Low-Rank Adaptation (LoRA) as a means of simulating weight changes in model adaptation, this paper introduces an alternative perspective by framing it as a control process. Specifically, we conceptualize the lightweight matrices in LoRA as control modules tasked with perturbing the original, complex, yet frozen blocks on downstream tasks. Building upon this new understanding, we conduct a thorough analysis of the controllability of these modules, where we identify and establish sufficient conditions that facilitate their effective integration into downstream controls. Moreover, the control modules are redesigned by incorporating nonlinearities through a parameter-free attention mechanism. This modification allows for the intermingling of tokens within the controllers, enhancing the adaptability and performance of the system. Empirical findings substantiate that, without introducing any additional parameters, this approach surpasses the LoRA algorithms across all assessed datasets and rank configurations.

1. Introduction

Large-scale deep neural networks, in particular Transformers (Vaswani et al., 2017), have demonstrated unprecedented performance in areas such as computer vision (Dosovitskiy et al., 2020), natural language processing (Vaswani et al., 2017) and speech recognition (Radford et al., 2023). The prevailing methodology for attaining optimal model performance typically entails pre-training with an extensive text or image corpus, followed by subsequent fine-tuning on a compact task-specific dataset.
For example, the Vision Transformer (ViT) (Dosovitskiy et al., 2020) is first pre-trained on the ImageNet-21K (Deng et al., 2009) or JFT (Sun et al., 2017) dataset, and then undergoes fine-tuning to facilitate adaptation to various downstream tasks.

*Equal contribution. 1Department of Mathematics, National University of Singapore, Singapore. 2The Joint SDU-NTU Research Center of Artificial Intelligence, Shandong University, China. Correspondence to: Qianxiao Li. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

The downside of such an approach is that large models often consist of millions or even billions of parameters, leading to exhaustive consumption of GPU memory during the full-tuning process. To alleviate this issue, a series of parameter-efficient fine-tuning (PEFT) algorithms (Bapna et al., 2019; Houlsby et al., 2019; Jia et al., 2022; Lian et al., 2022) have been proposed in recent studies. One such example is Low-Rank Adaptation (LoRA) (Hu et al., 2021), which integrates pairs of additional trainable low-rank matrices whilst keeping the original model fixed during the training process. Despite its broad success in practical scenarios, the underlying mechanism of these PEFT algorithms remains underexplored. For instance, the general rationale behind LoRA algorithms is often framed as: the change in weights during model adaptation has a low intrinsic rank (Aghajanyan et al., 2020; Hu et al., 2021). Consequently, to accurately replicate weight changes, low-rank matrices should be applied to all parameters within the model, given that they are all trainable in model adaptation. Yet, empirical results frequently suggest that it is sufficient to apply LoRA solely to specific matrices within each block, such as the query matrices alone. In particular, we notice that the recent AdaptFormer algorithm (Chen et al., 2022) opens the door to a new direction.
Unlike traditional approaches where LoRA matrices are nested within the attention blocks, AdaptFormer positions these low-rank matrices in parallel to the feed-forward layer. This arrangement allows for an alternative interpretation of PEFT algorithms in this paper: the LoRA matrices should not be merely regarded as the weight differences after model adaptation, but rather as control modules designed to perturb the original model. Such an interpretation draws parallels to classical control theory (Bishop, 2011; Kwakernaak & Sivan, 1972), wherein lightweight control modules are frequently positioned to steer a complex system towards desired states. In the field of robotics (Slotine & Li, 1987; Lewis et al., 2003), for example, a controller may be employed to govern the movements of a robotic arm, orchestrating its precise positioning and motion to accomplish specific tasks. Analogously, transfer learning for large models could be performed in a similar way: a series of lightweight modules (e.g., low-rank matrices) can be designed to control the ViT model, aiming to achieve minimal adaptation loss. Following this new interpretation, we study the controllability of these low-rank modules in the general setting of a continuous-time analogue of multi-layer models (E, 2017; Haber & Ruthotto, 2017). Specifically, we establish sufficient conditions by requiring the controller to span the state space with full rank at any given time. In light of these conditions, we illustrate that the existing (almost) linear controls employed by AdaptFormer may face challenges in achieving full controllability, particularly when dealing with large datasets. Moreover, when considering transformers that incorporate attention mechanisms, our analysis indicates that there exist cases where the absence of cross-patch dynamics leads to failures of control.
These theoretical analyses further motivate us to devise new nonlinear controllers tailored for transformers on downstream tasks. To achieve this goal, we delve into the underlying mechanism of the original ViT model and propose a new nonlinear control featuring a parameter-free attention mechanism. Such a design allows the patches to be intermingled within the controllers, thereby enhancing the overall controllability of the low-rank modules. Empirical findings consistently support that, without introducing any additional parameters, this approach outperforms existing LoRA-like algorithms by a large margin on all assessed datasets and rank configurations. In summary, our contributions encompass two key aspects. (1) Following the control-oriented perspective, we establish sufficient conditions for the controllability of perturbation functions. We scrutinize existing algorithms from this control viewpoint and demonstrate that they may fail to meet these conditions in certain cases. (2) In response to this, we redesign the controller module to align with the transformer architecture by incorporating a cross-patch attention mechanism. Numerical verification confirms that the proposed algorithm satisfies the controllability condition, and its effectiveness is further demonstrated across multiple datasets.

2. Related Works

Parameter-Efficient Fine-Tuning. With the emergence of large-scale deep neural networks, transfer learning (Pan & Yang, 2009) with pre-trained models has become the de facto approach for adaptation to downstream tasks. Full-tuning the entire model often necessitates substantial GPU memory and suffers from a slow training process. Consequently, recent studies in transfer learning have concentrated on optimizing pre-trained models by selecting a limited subset of parameters or introducing extra lightweight parameters.
In particular, prompt-based algorithms (Radford et al., 2018; Brown et al., 2020) advocate for the incorporation of extra trainable tokens to guide the behavior of language models. While this tuning method has also found applications in vision-related downstream tasks (Jia et al., 2022), its drawback lies in the significant drop in accuracy once the prompt number is increased beyond certain values, as demonstrated in (Chen et al., 2022). The LoRA algorithm (Hu et al., 2021), inspired by studies of Adapters (Houlsby et al., 2019; Karimi Mahabadi et al., 2021), offers an alternative solution by injecting low-rank matrices into the original attention block. Subsequently, AdaptFormer (Chen et al., 2022) shifts the perturbation to the feed-forward layer and places it in parallel to the original ViT block. This departs from the setting in LoRA, where low-rank matrices are nested within the attention module. Neural architecture search methods have also been utilized in subsequent studies (Zhang et al., 2022; Chavan et al., 2023), in order to search for a PEFT architecture that maximizes downstream performance. Compared with LoRA-like algorithms, these search methods typically require more training time to identify a proper architecture for each specific downstream task.

Control for Machine Learning. Control theory (Franklin et al., 2002; Ogata, 2010) focuses on the analysis and design of functions that regulate the behavior of dynamical systems. A cohort of recent studies has utilized machine learning methods to solve classical control problems, such as stability analysis (Chang et al., 2019; Dai et al., 2021). But using control to solve practical machine learning problems remains a relatively less explored area. In particular, the early works (E, 2017; Haber & Ruthotto, 2017; Chang et al., 2018) consider the machine learning process, especially learning with ResNets (He et al., 2016), as function approximation via a control system.
Following this understanding, a series of studies (Li et al., 2017; Li & Hao, 2018; Zhang et al., 2019; Kerimkulov et al., 2021) has been working on providing an optimal control viewpoint on the development and understanding of optimization methods for deep learning tasks. In addition to this optimal control view, controllability analysis (Ogata, 2010) concerns the ability to match arbitrary input and target states with certain admissible manipulations. This leads to a connection to classical studies on the expressive ability of continuous-time neural networks (Raghu et al., 2017; Lu et al., 2017). As such, some universal approximation results for deep ResNets have been established based on controllability analysis (Cheng et al., 2023; Ruiz-Balet & Zuazua, 2023; Li et al., 2022; Cuchiero et al., 2020). However, the controllability analysis of transfer learning with PEFT algorithms has not been explored in the existing literature.

3. A Control Formulation of PEFT Algorithms

We present a control-oriented view of parameter-efficient fine-tuning algorithms in this part, followed by discussions on the principles guiding practical controller design.

3.1. Preliminary and Notations

We begin by revisiting the widely-used Vision Transformer (ViT). Given an image x_0 ∈ R^{C×H×W}, the ViT model first splits and embeds the sample image into a series of visual tokens x̃_0 ∈ R^{m×d}, where m denotes the number of tokens and d refers to the length of each token. Subsequently, an additional class token x^cls ∈ R^{1×d} is concatenated with these tokens, followed by the addition of a positional embedding to each token to form x_1 ∈ R^{(m+1)×d}. These visual tokens are then fed into a set of transformer layers, with the t-th block defined as:

x_{t+1/2} = MHSA(LN(x_t)) + x_t,   (1)
x_{t+1} = FFN(LN(x_{t+1/2})) + x_{t+1/2},   (2)

for t ∈ [1, …, T−1]. Here MHSA, FFN and LN denote the multi-head self-attention, feed-forward network and layer normalization, respectively.
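As a concrete reference, the block dynamics (1)-(2) can be sketched in NumPy. This is a minimal single-head version with illustrative names; the actual ViT uses multi-head attention and a GELU activation inside the FFN:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LN: normalize each token over its feature dimension.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # One attention head; the MHSA in (1) runs several such heads in parallel.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(x.shape[-1])) @ v

def vit_block(x, Wq, Wk, Wv, W1, W2):
    # Eq. (1): x_{t+1/2} = MHSA(LN(x_t)) + x_t
    x_half = attention(layer_norm(x), Wq, Wk, Wv) + x
    # Eq. (2): x_{t+1} = FFN(LN(x_{t+1/2})) + x_{t+1/2}  (ReLU in place of GELU)
    return np.maximum(layer_norm(x_half) @ W1, 0.0) @ W2 + x_half
```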
The encoded class token x^cls_T then goes through a linear layer to produce the final prediction.

3.2. Dynamics of Controlled ViT Systems

To leverage a pre-trained Vision Transformer (ViT) for downstream tasks without incurring the computational burden associated with full-tuning, a suite of parameter-efficient fine-tuning (PEFT) algorithms has been developed in prior research. In broad terms, these studies concentrate on training specific parts of the original network or incorporating additional lightweight parameters, thereby customizing the pre-trained model for targeted downstream tasks. In particular, the LoRA algorithm (Hu et al., 2021) endeavors to find a sequence of functions {g_t} to be applied to the ViT blocks:

x_{t+1} = f_t(x_t, θ_t, g_t(x_t, u_t)).

Here g_t can be construed as a control function that takes the original x_t as input and contains some new parameters u_t. The purpose of this function is to introduce perturbations to the original attention parameter θ_t, which remains constant throughout the learning process. For efficiency, LoRA employs the linear function g_t(x_t, u_t) = x_t u_t, and imposes the constraint that the weight matrix u_t possesses the low-rank property:

u_t = A_t B_t,  A_t ∈ R^{d×d'},  B_t ∈ R^{d'×d},  d' ≪ d.

From a control perspective, the controls within LoRA are integrated into the attention blocks, and analyzing such controls tends to be non-trivial. In contrast, the subsequent AdaptFormer (Chen et al., 2022) relocates the low-rank matrices to the feed-forward network (FFN) layer, positioning them in parallel with the original ViT block. The controlled dynamics can be expressed as:

x_{t+1} = f_t(x_t, θ_t) + g_t(x_t, u_t).

A notable advantage of this approach is that the control function g_t is no longer embedded within the original f_t. This decoupling of the control module significantly simplifies the control analysis. As such, we shall adhere to this additive formulation throughout the paper, but consider more general controller designs for g_t.
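The additive placement can be contrasted with the low-rank control itself in a short sketch. Names here are illustrative, and `block` merely stands in for the frozen f_t:

```python
import numpy as np

def lora_control(x, A, B):
    # g_t(x, u) = x A B with u = A B, A in R^{d x d'}, B in R^{d' x d}, d' << d.
    return x @ A @ B

def adaptformer_step(x, block, A, B):
    # Additive placement x_{t+1} = f_t(x_t, theta_t) + g_t(x_t, u_t): the control
    # runs in parallel to the frozen block instead of being nested inside it.
    return block(x) + lora_control(x, A, B)
```

The low-rank constraint is visible directly: the product A B can never exceed rank d', however large d is.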
The overall goal is to design and optimize the parameters of g_t such that the terminal loss on downstream tasks is minimized:

min_{{u_t}, θ_T}  (1/N) Σ_{i=1}^N L(x_{Pred,i}, y_i)   (3)
s.t.  x_{t+1,i} = f_t(x_{t,i}, θ_t) + g_t(x_{t,i}, u_t),  t ∈ [1, …, T−1],
      x_{Pred,i} = f_T(x^cls_{T,i}, θ_T).

3.3. Principles of Controller Design

In general, the control function g_t can take arbitrary forms, linear or nonlinear. But for practical implementations, the design of such a control module should adhere to two principles: efficiency and controllability. For efficiency considerations, the parameter count of the control module should be significantly smaller than that of the original block. This ensures that the introduction of extra blocks does not offset the gains from freezing the original ViT blocks. In practice, this reduces GPU memory consumption and often accelerates the whole training process. On the other hand, controllability defines whether the control blocks possess the capability to steer the ViT system to the desired states. This condition becomes more challenging when operating under the constraints imposed by limited parameters. Owing to its inherent low rank, the AdaptFormer algorithm naturally functions as a parameter-efficient method during the downstream adaptation process. This characteristic enables effective adaptation to specific tasks without the need to tune an excessive number of parameters. But the question remains whether these control blocks possess sufficient controllability to steer the ViT system to the desired states.

4. Controllability Analysis

In this section, we present the controllability analysis for the fine-tuning of pre-trained, multi-layer models within a continuous-time context. This analysis aims to provide mathematical perspectives on designing effective controllers for PEFT algorithms.

4.1.
A Sufficient Condition for Controllability

We begin by considering a general multi-layer pre-trained model incorporating skip connections between layers (He et al., 2016). The model's dynamics are described as follows:

x_{t+1} = x_t + h_t(x_t),  t = 0, …, T−1,  x_t ∈ R^D,   (4)

where x_0 is the input, x_T is the output of the model, and h_t denotes the map represented by the t-th layer of the model. Since the parameters of the original model are frozen in the tuning process, we simply use h_t(x) to represent the original dynamics. Viewing the layer index t as a temporal variable transforms our model into a continuous-time analogue, as explored in prior studies (E, 2017; Haber & Ruthotto, 2017):

x′(s) = h(x(s), s),  s ∈ [0, S],   (5)

where the time s is the continuous analogue of the layer index t. Let φ: x(0) ↦ x(S) denote the input-output relation of dynamics (5). In this framework, we consider the effect of introducing a small-scale control function to perturb the model dynamics:

x̃′(s) = h(x̃(s), s) + ε g(x̃(s), u(s)),  x̃(0) = x(0),   (6)

where g: U × R^D → R^D is the controller with control parameter u ∈ U, and ε > 0 is a scaling factor. In the following discussion, we will assume U to be a compact set. Notice that compared to the general case in (3), where the controller g_t can have different structures across layers, here we consider the special case where the controller g keeps the same structure but takes different control parameters u across layers. This resembles the setting adopted in current PEFT algorithms. Let φ_{ε,u}: x̃(0) ↦ x̃(S) denote the perturbed input-output map, which is effectively the feature map of the perturbed model in the deep transformer case. Ideally, a good controller should enable us to adjust the feature map of the perturbed model across a specific dataset. Specifically, for a given set of data samples X = {x_i}_{i=1}^N ⊂ R^D, we expect a good controller to be able to perturb the effect of φ over X arbitrarily, at least on a small scale.
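The passage from the residual update (4) to the ODE (5) is the standard forward-Euler correspondence; a minimal numerical check of this identification (illustrative names, assuming a frozen right-hand side h):

```python
import numpy as np

def euler_rollout(x0, h, n_steps, dt):
    # x_{t+1} = x_t + dt * h(x_t, s): the residual update (4) is exactly one
    # forward-Euler step of the ODE (5), with dt = 1 recovering the discrete model.
    x = np.array(x0, dtype=float)
    for t in range(n_steps):
        x = x + dt * h(x, t * dt)
    return x
```

For the linear test dynamics h(x, s) = −x, the rollout over [0, 1] with a small step should approach the exact flow x(1) = e^{−1} x(0).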
This leads us to the concept of local controllability, which we now define:

Definition 4.1. We say the system (5) with controller g(u, x) is locally controllable over a dataset X = {x_i}_{i=1}^N ⊂ R^D if there exists ε > 0 such that the set {(φ_{ε,u}(x_1), …, φ_{ε,u}(x_N)) | u ∈ L^∞([0, S], U)} is an open neighborhood of (φ(x_1), …, φ(x_N)) in (R^D)^N.

Intuitively, suppose the feature map for the downstream task differs from that of the pre-trained model on a small scale; then a controller meeting Definition 4.1 enables the perturbed feature map to generate the required output features over X, regardless of how the perturbation is distributed over the data points in X. Assume that there is a partition 0 = s_0 < … < s_L = S of [0, S] such that h(x, s) is C^2 on each [s_i, s_{i+1}] × Ω, where Ω is the domain of the data. Then, the following theorem gives a sufficient condition on g for local controllability over X to hold:

Theorem 4.2. Assume that g(u, x) is locally Lipschitz continuous in both u and x. Also, assume that g(0, x) ≡ 0, and that for any u and x there exists v such that g(v, x) = −g(u, x) (the image set of g is symmetric). For a given dataset X = {x_i}_{i=1}^N ⊂ R^D, suppose that the set

{(g(x_1(s), u), …, g(x_N(s), u)) ∈ (R^D)^N | u ∈ U}   (7)

spans (R^D)^N for each s ∈ [0, S], where x_i(s) denotes the state of the original dynamics (5) at time s with initial value x_i. Then, the original system with controller g is locally controllable over X.

Proof idea. Set x̃_i(s) = x_i(s) + ε z̃_i(s); then an asymptotic analysis gives a uniform constant C > 0 such that the solution z_i(s) of

z_i′(s) = ∂_x h(x_i(s), s) z_i(s) + g(u(s), x_i(s)),  z_i(0) = 0,   (8)

satisfies ‖z_i(s) − z̃_i(s)‖ ≤ Cε for all initial values x_i, s ∈ [0, S] and u(·) ∈ L^∞([0, S], U). By the theory of linear ODEs, the solution of (8) at the terminal time S is given by

z_i(S) = μ_i(S) ∫_0^S μ_i^{-1}(s) g(u(s), x_i(s)) ds,   (9)

where μ_i(s) ∈ R^{D×D} is a fundamental solution matrix of the homogeneous equation of (8), i.e.

μ_i′(s) = ∂_x h(x_i(s), s) μ_i(s),  μ_i(0) = Id.   (10)
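The variation-of-constants formula (9) can be sanity-checked numerically in the simplest scalar case, where the fundamental solution is an exponential. All constants below are chosen arbitrarily for illustration:

```python
import numpy as np

# Scalar instance of (8): z'(s) = a z(s) + u, z(0) = 0, with constant a and u.
# Formula (9) gives z(S) = mu(S) * integral_0^S mu(s)^{-1} u ds with mu(s) = e^{a s},
# which evaluates in closed form to z(S) = (u / a) * (e^{a S} - 1).
a, u, S, n = 0.7, 0.3, 1.0, 100_000
dt = S / n

z = 0.0
for _ in range(n):                   # forward Euler on the controlled scalar ODE
    z += dt * (a * z + u)

closed = (u / a) * (np.exp(a * S) - 1.0)
```

The numerically integrated z and the closed-form value agree to within the Euler discretization error.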
Therefore, z_i(S) depends linearly on g(u(s), x_i(s)) for each s ∈ [0, S]. We then split the time interval [0, S] into N′ small subintervals [s_j, s_{j+1}], j = 0, …, N′−1. When N′ is large enough, for any given perturbation vector V ∈ (R^D)^N, since μ_i(s) is continuous in s and g satisfies the condition in the theorem, one can choose a piecewise-constant u such that the integral of

(μ_1(S)μ_1^{-1}(s) g(u(s), x_1(s)), …, μ_N(S)μ_N^{-1}(s) g(u(s), x_N(s)))

over each [s_j, s_{j+1}] points in the same direction as V_j. This yields that (z_1(S), …, z_N(S)) can take any value in some neighborhood of the origin in (R^D)^N. Since z_i(S) differs from z̃_i(S) by only O(ε), we conclude that local controllability holds for small ε.

Remark 4.3. The full proof of Theorem 4.2 is given in the appendix. The insight of Theorem 4.2 is straightforward: if the controller makes all directions in (R^D)^N admissible for perturbation at each time, then one has the freedom to perturb the final outputs of the dynamics arbitrarily on a small scale.

4.2. Insights on the Benefits of Nonlinear Controllers

While the theorem provides just a sufficient condition for local controllability in a continuous-time setting, it yields valuable insights into controller design. Observations drawn from the condition in Theorem 4.2 reveal that a linear controller is unable to satisfy this condition when the dataset X is large. In particular, we have the following proposition.

Proposition 4.4. Suppose g(u, x) = A(u)x + B(u), which is linear in x; then the condition in Theorem 4.2 cannot hold when N > D + 1.

A recent study (Luo et al., 2023) demonstrates that it is empirically safe to eliminate the ReLU function from AdaptFormer, resulting in a linear controller.
But the above proposition indicates that the expressive ability of a linear controller is closely tied to the original dynamics: it can be compromised if the original dynamics exhibit near-linear behavior across a specific dataset. Nonlinear controllers, on the other hand, have the potential to satisfy the condition in Theorem 4.2 for more general datasets X, indicating stronger controllability. This observation motivates us to consider nonlinear controllers for PEFT algorithms. Moreover, in the case when the original dynamics function h(x, s) is linear in x, a linear controller does not add any nonlinear characteristics to the original dynamics. Consequently, the controllability defined in Definition 4.1 cannot be achieved.

5. Nonlinear Controller Design

Motivated by the above insights suggesting potential benefits of nonlinear controllers, we now turn to the design of a practical nonlinear controller for ViT.

5.1. Cross-Patch Attention is What You Need

Linear control nevertheless offers a straightforward perturbation mechanism for the frozen ViT blocks and is often simple to design and analyze. However, as explored in Section 4.2, a linear controller alone cannot ensure robust controllability for general models; its effectiveness depends on the complexity and nonlinearity inherent in the original dynamics. Nonlinear control, on the other hand, holds the promise of a more complex perturbation mechanism, but the form of such a nonlinear control should be meticulously designed. As an example, the pioneering work AdaptFormer attempts to incorporate nonlinearities by considering g_t = σ(x_t A_t) B_t, where σ represents an activation function such as ReLU. But subsequent research (Luo et al., 2023) demonstrates that such an activation function has minimal effect in practice, and the performance of AdaptFormer closely resembles that of its linear counterpart.
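The rank ceiling behind Proposition 4.4 can be observed numerically: the stacked control vectors of a linear controller all lie in a subspace of dimension at most D(D+1), so they cannot span (R^D)^N once N > D + 1. A sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, samples = 4, 10, 500           # N > D + 1, so ND = 40 exceeds D(D+1) = 20
X = rng.standard_normal((N, D))      # dataset x_1, ..., x_N

rows = []
for _ in range(samples):
    A = rng.standard_normal((D, D))
    b = rng.standard_normal(D)
    # stacked control vector (g(u, x_1), ..., g(u, x_N)) for g(u, x) = A x + b
    rows.append((X @ A.T + b).ravel())
M = np.stack(rows)

# Every row is linear in the D^2 + D parameters (A, b), so the span of all such
# vectors has dimension at most D(D+1) = 20 < ND = 40: condition (7) must fail.
rank = np.linalg.matrix_rank(M)
```

However many control samples are drawn, the rank saturates at D(D+1), far below the ND needed to span (R^D)^N.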
To gain more insight into the design of nonlinearity, let us delve into the original ViT system. In the dynamics of ViT, the state x = (x^{(1)}, …, x^{(m)})^T ∈ R^{m×d} consists of a sequence of tokens. Note each ViT block comprises two consecutive components: MHSA and FFN. In particular, each head within the MHSA block conducts an attention mechanism via:

Attn(Q, K, V) = Softmax(QK^T / √d) V.

A key insight for this attention module is that the tokenized patches are intermingled to generate the new tokens. The following FFN block merely offers a patch-independent linear transformation. Yet, such a cross-patch attention (CPA) mechanism is missing in the design of algorithms like AdaptFormer: each tokenized patch is projected downward and upward through linear transformations, with no inclusion of a token-mixing process. Such a token-wise controller design can limit controllability over general models and data, as shown in the following proposition.

Proposition 5.1. Suppose the controller has the form

g(u, x) = (ḡ(u, x^{(1)}), …, ḡ(u, x^{(m)}))^T,   (13)

which is applied token-wise to x; then the condition in Theorem 4.2 cannot hold whenever there exist patches x^{(j1)}_{i1} of x_{i1} and x^{(j2)}_{i2} of x_{i2} in X such that x^{(j1)}_{i1} = x^{(j2)}_{i2}. In addition, if the original dynamics h(x, s) are also applied token-wise, there exist cases where the controllability defined in Definition 4.1 is compromised.

This observation indicates the necessity of cross-patch information when tuning states that share common patches under the original dynamics.

5.2. Nonlinear Controller Design

The above proposition motivates us to devise a nonlinear control mechanism well-suited to the transformer. In particular, the key step in designing an effective nonlinear controller lies in how to introduce the CPA mechanism while at the same time minimizing the introduction of additional parameters.

Figure 1. ViT with nonlinear controls.
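The obstruction in Proposition 5.1 can be seen directly: a token-wise controller perturbs identical patches identically, regardless of which state they belong to. An illustrative sketch (all names and sizes hypothetical):

```python
import numpy as np

def tokenwise(g, x):
    # A controller of the form (13): the same map g hits every token (row) of x.
    return np.stack([g(tok) for tok in x])

rng = np.random.default_rng(0)
x1 = rng.standard_normal((3, 4))     # state with 3 patches of dimension 4
x2 = rng.standard_normal((3, 4))
x2[2] = x1[0]                        # the two states share one identical patch

A = rng.standard_normal((4, 4))
g = lambda tok: np.tanh(tok @ A)     # any token-wise map, linear or not

v1, v2 = tokenwise(g, x1), tokenwise(g, x2)
# The shared patch receives exactly the same perturbation in both states, so the
# stacked control vectors are confined to a hyperplane and condition (7) fails.
```

No choice of control parameter can move the two shared patches independently, which is precisely the failure mode that cross-patch mixing removes.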
This paper explores using the LoRA matrices as the base and proposes a series of heads with minimal parameters. The objective is to ensure the intermingling of tokens from different patches through these heads. In particular, we explore both linear and nonlinear heads, as depicted in Figure 1. For linear heads, a straightforward approach is to let the controller directly output the LoRA results. For nonlinear heads, we examine the following mechanism:

x′_{p(j)} = CPA(x)_{p(j)} = Σ_i [ exp(⟨x_{p(j)}, x_{p(i)}⟩) / Σ_m exp(⟨x_{p(j)}, x_{p(m)}⟩) ] x_{p(i)},   (14)

where x_{p(j)} denotes the j-th patch of x. Note in the above CPA design, we refrain from introducing a linear transformation of x to obtain Q, K, V, and we also omit the scaling factor √d. This deliberate omission allows the attention to be performed in a parameter-free manner, with only matrix multiplications of x being necessary to compute the attention. The nonlinear output is subsequently combined with the linear output to generate the overall control, resulting in the following control formulation:

Control(x) = LoRA(x) + CPA(LoRA(x)),   (15)

and the dynamics of the controlled ViT can be written as:

x_{t+1/2} = MHSA(LN(x_t)) + x_t,   (16)
x_{t+1} = FFN(LN(x_{t+1/2})) + Control(x_{t+1/2}) + x_{t+1/2}.   (17)

Let us make a few comments on the control design. The nonlinear head (14) within the controller has no parameters, ensuring that the total number of parameters in our approach remains identical to that of previous work. From a control perspective, incorporating such a head introduces both nonlinearities and cross-patch information into the controller. From the machine learning aspect, such a parameter-free head allows the patches to be mixed to generate the new tokens, akin to the standard attention mechanism. Note the control is applied in parallel to the FFN layer in the above formulation, following the pioneering work (Chen et al., 2022).
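A minimal sketch of the parameter-free head (14) and the combined control (15), assuming LoRA(x) = xAB (names illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def cpa(x):
    # Parameter-free cross-patch attention (14): effectively Q = K = V = x, with
    # no projection matrices and no 1/sqrt(d) scaling, so nothing is trainable.
    return softmax(x @ x.T) @ x

def control(x, A, B):
    # Eq. (15): Control(x) = LoRA(x) + CPA(LoRA(x)), with LoRA(x) = x A B.
    lora = x @ A @ B
    return lora + cpa(lora)
```

Only the LoRA pair (A, B) carries parameters; the CPA head mixes patches purely through matrix products of the state with itself.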
Alternatively, it could be applied to the entire ViT block, and we observe only very minimal differences in practical scenarios. Finally, while proving this controller's effectiveness in general cases is challenging, we provide numerical demonstrations that illustrate its satisfaction of the sufficient condition on ViT examples, as discussed later in Section 6.2.

5.3. Multi-Head Controller

One may let the attention share the weights of the LoRA matrices in the above formulation (14), namely:

Q = K = V = A_t B_t.   (18)

This allows the controller to utilize the existing parameters of LoRA and refrains from introducing extra parameters. A simple extension of this parameter-sharing mechanism is to increase the number of heads, in order to boost the complexity of the controller. In Figure 1, we consider such a scenario by utilizing a multi-head controller. Note that such a setting introduces extra parameters, and we limit the head number to 2 in this paper for efficiency reasons.

6. Experiment

In this part, we evaluate the effectiveness of the nonlinear controllers by conducting a series of experiments on vision datasets.

6.1. Preliminary

Experimental Settings. For fair comparison, we mirror the experimental settings in AdaptFormer (Chen et al., 2022). This involves utilizing the same pre-trained Vision Transformer (ViT) backbone, choosing identical downstream tasks, and configuring parameters based on the specifications provided in their original study.

Competing Algorithms.
The proposed Attention-augmented Nonlinear Control algorithm, denoted ANC, is compared with several commonly used tuning algorithms: (1) Full-Tuning: all parameters are trainable; (2) Linear Probing: appending an additional trainable linear layer on top of the pre-trained model while keeping the remaining parameters fixed; (3) Visual Prompt Tuning (VPT) (Jia et al., 2022): concatenating a set of trainable tokens with the existing image tokens; (4) Low-Rank Adaptation (LoRA) (Hu et al., 2021): injecting trainable low-rank matrices into W_Q and W_V; (5) AdaptFormer (Chen et al., 2022): a vision-specific LoRA algorithm that perturbs the FFN layer with (almost) linear controls.

6.2. A Toy Example on Controllability

We commence with a numerical verification of the condition outlined in Theorem 4.2 through a small-size example. In particular, we consider a scenario wherein the original model is a randomly initialized 10-layer ViT model. The state x comprises 4 tokens of dimension 5. Subsequently, we introduce our controller, defined as:

g(u, x) := Ax + b + CPA(Ax + b),

and proceed to numerically evaluate the condition outlined in Theorem 4.2 at each layer of the model. We randomly generate 20 tokens x_1, …, x_20 as the input dataset X, and 4000 samples {u_j = (A_j, b_j) | j = 1, …, 4000} as the control parameters. Subsequently, at a fixed layer t ∈ {1, …, 10}, we compute the vectors (g(u_j, x_{1,t}), …, g(u_j, x_{20,t})) for each j = 1, …, 4000. The generated vectors are normalized and stacked into a 4000 × 400 matrix. Next, we compute its singular values and contrast them with the singular values of the matrix obtained using the linear controller. The comparison at the 5-th transformer block is presented in Figure 2.

Figure 2.
Comparison between the singular values of the matrix obtained by our controller (left) and the linear controller (right) at the 5-th transformer block.

At each block of the model, all singular values with index greater than 30 vanish in the case of the linear controller, indicating that the condition in Theorem 4.2 cannot be satisfied. In contrast, for our controller, the singular values exhibit a much slower decay, with a minimum value around 0.04. This signifies that the condition in Theorem 4.2 holds for our controller.

6.3. Experiments on Vision Benchmarks

With the above observation, we now proceed to validate our approach on various vision benchmarks, including CIFAR-100 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011) and Food-101 (Bossard et al., 2014). Table 1 reports the performance of all algorithms with the same pre-trained ViT backbone. Notably, the Full-Tuning algorithm consistently achieves the highest test accuracy across all three datasets, establishing itself as a practical upper limit for LoRA-like algorithms. In contrast, the Linear-Probing algorithm experiences a considerable decline in accuracy, ranging from 18.07% to 30.76% across these datasets, indicating that the mere insertion of a linear layer may not yield satisfactory results. This observation also highlights the necessity of perturbing the lower levels of a ViT model to enhance performance. The PEFT algorithms, on the other hand, surpass the Linear-Probing algorithm by a large margin. In particular, we compare the LoRA, AdaptFormer and ANC algorithms with the same number of parameters in Table 1. Note the LoRA algorithm applies a low-rank perturbation to both W_Q and W_V, hence its rank is half that of the other two algorithms at equivalent parameter counts. The findings reveal that the Attention-augmented Nonlinear Control algorithm (ANC) consistently outperforms all competing PEFT algorithms across diverse datasets and for all rank configurations.
As an illustration, on the Food-101 dataset the performance improves notably from 85.70% for LoRA-16 to 88.06% for ANC-32, achieved without introducing any additional parameters. The gap to Full-Tuning is thereby decreased from 4.39% to 2.03%, while the PEFT algorithms only need to tune a small fraction (0.78%) of the parameters.

Table 1. Comparison of algorithm performance. We reuse the data reported in AdaptFormer and present single-run results; repeated experiments for ANC are provided in Appendix B.3. (The learning rate for the Full-Tuning algorithm is decreased by 0.1 to maximize its performance on the CIFAR-100 dataset.)

Algorithm        | #Params (M) | CIFAR-100 | SVHN  | Food-101
Full-Tuning      | 86.04       | 87.90     | 97.67 | 90.09
Linear-Probing   | 0.07        | 69.83     | 66.91 | 69.74
VPT              | 0.08        | 82.44     | 94.02 | 82.98
LoRA-16          | 0.67        | 85.31     | 96.29 | 85.70
AdaptFormer-32   | 0.67        | 85.42     | 96.45 | 86.21
ANC-32           | 0.67        | 86.69     | 96.94 | 88.06
LoRA-32          | 1.26        | 85.42     | 96.42 | 86.09
AdaptFormer-64   | 1.26        | 85.90     | 96.89 | 87.61
ANC-64           | 1.26        | 87.06     | 97.03 | 88.33
LoRA-64          | 2.44        | 85.88     | 96.58 | 86.42
AdaptFormer-128  | 2.44        | 86.12     | 96.92 | 87.78
ANC-128          | 2.44        | 87.17     | 97.11 | 88.50

By examining the training curves in Figure 3, we notice that the ANC-64 algorithm consistently attains the lowest training loss during learning. Its terminal training loss is 19.23% smaller than that of AdaptFormer-64. This superior fitting capability is reflected in a relatively higher test accuracy, as demonstrated in Table 1, notwithstanding that the models share identical parameter counts. The phenomenon persists across all datasets and rank configurations, suggesting the superior approximation ability of this nonlinear control in practical applications.

Figure 3. Comparison of training loss for different algorithms (LoRA-32, AdaptFormer-64, ANC-64) on the CIFAR-100 dataset. Similar results on SVHN and Food-101 are available in Appendix B.2.

6.4.
Multi-Head Controller Experiment

The incorporation of a nonlinear transformation in the ANC algorithm allows the perturbations to be performed in a more complex way, where different patches can be mixed through a parameter-free attention mechanism. The reuse of $A_t B_t$ as $Q$, $K$, $V$ nevertheless limits its complexity. To address this limitation, we may increase the number of heads by adopting a multi-head controller, as illustrated in Figure 1. Note that such an approach is no longer parameter-free, and we generally need to increase the number of trainable variables to boost the complexity.

Table 2. Algorithm performance of multi-head controllers. 2H denotes the 2-head controller.

Algorithm    | CIFAR-100      | SVHN           | Food-101
ANC-32       | 86.69          | 96.94          | 88.06
ANC-32-2H    | 87.14 (+0.45)  | 97.03 (+0.09)  | 88.43 (+0.37)
ANC-64       | 87.06          | 97.03          | 88.33
ANC-64-2H    | 87.35 (+0.29)  | 97.12 (+0.09)  | 88.65 (+0.32)
ANC-128      | 87.17          | 97.11          | 88.50
ANC-128-2H   | 87.52 (+0.35)  | 97.18 (+0.07)  | 89.01 (+0.51)

Table 2 presents the performance of this multi-head control strategy. The results reveal that increasing the number of heads further boosts the overall performance across all datasets and ranks. Notably, this control strategy narrows the performance gap to the Full-Tuning algorithm: for instance, ANC-128-2H attains a test accuracy of 87.52% on the CIFAR-100 dataset, while Full-Tuning achieves 87.90%. It is noteworthy that the former only needs to tune 4.82% of the parameters, whereas the latter requires tuning all 86.04 million trainable variables. Moreover, by comparing algorithms with the same number of parameters (e.g., ANC-32-2H and ANC-64), it becomes evident that augmenting the nonlinearity yields marginally superior performance compared to increasing the rank. Consequently, elevating the rank is no longer the exclusive avenue for enhancing the performance of PEFT algorithms.

6.5.
Ablation Studies on Nonlinearity

This paper explores an architectural design that takes the LoRA matrices as the foundation and supplements them with a Cross-Patch Attention (CPA) head to introduce nonlinearity. This departs from prevailing methodologies, which typically introduce nonlinearity through an activation function between $A_t$ and $B_t$. We highlight the necessity of token intermingling by contrasting with the following algorithms: (1) the standard AdaptFormer utilizing a ReLU function; (2) a purely linear control obtained by excluding the ReLU function; (3) a nonlinear control obtained by injecting a sigmoid function into AdaptFormer.

Table 3. Ablation study on the forms of nonlinearity.

Algorithm               | CIFAR-100 | SVHN  | Food-101
ANC-64                  | 87.06     | 97.03 | 88.33
AdaptFormer-64          | 85.90     | 96.89 | 87.61
Linear-64               | 86.01     | 96.85 | 87.68
AdaptFormer-64-Sigmoid  | 84.61     | 95.99 | 85.80

Table 3 indicates that the incorporation of the nonlinear ReLU function has only a marginal impact on the final performance of AdaptFormer, in comparison with its linear counterpart. Moreover, the incorporation of a sigmoid function leads to performance declines across all datasets, notably a significant decrease of 1.4% on the CIFAR-100 dataset. Consequently, the form of the nonlinearity has to be meticulously designed in order to outperform purely linear controls. The attention-augmented algorithm consistently outperforms both the linear controls and the nonlinear controls employing activation functions. Note that the rank is set to 64 in Table 3, but the observed phenomenon persists across all rank configurations.

7. Conclusion

This paper bridges controllability analysis with recent investigations into PEFT algorithms. Specifically, we recast LoRA-like algorithms as a control problem and conduct a comprehensive analysis of the controllability of low-rank modules, thereby establishing sufficient conditions for downstream controls.
The controller modules are further redesigned by introducing nonlinearities through a parameter-free attention mechanism, enabling token intermingling within the controllers. Empirical results demonstrate that this approach outperforms existing LoRA-like algorithms across all evaluated datasets and rank configurations, without introducing additional parameters.

Impact Statement

This paper aims to bridge parameter-efficient algorithms with control theory. There are minor potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgements

This research is supported by the National Research Foundation, Singapore, under the NRF fellowship (project No. NRF-NRFF13-2021-0005).

References

Aghajanyan, A., Zettlemoyer, L., and Gupta, S. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. arXiv preprint arXiv:2012.13255, 2020.

Bapna, A., Arivazhagan, N., and Firat, O. Simple, scalable adaptation for neural machine translation. arXiv preprint arXiv:1909.08478, 2019.

Dorf, R. C. and Bishop, R. H. Modern control systems. 2011.

Bossard, L., Guillaumin, M., and Van Gool, L. Food-101 – mining discriminative components with random forests. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part VI, pp. 446–461. Springer, 2014.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Chang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., and Holtham, E. Reversible architectures for arbitrarily deep residual neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Chang, Y.-C., Roohi, N., and Gao, S. Neural Lyapunov control.
Advances in Neural Information Processing Systems, 32, 2019.

Chavan, A., Liu, Z., Gupta, D., Xing, E., and Shen, Z. One-for-all: Generalized LoRA for parameter-efficient fine-tuning. arXiv preprint arXiv:2306.07967, 2023.

Chen, S., Ge, C., Tong, Z., Wang, J., Song, Y., Wang, J., and Luo, P. AdaptFormer: Adapting vision transformers for scalable visual recognition. Advances in Neural Information Processing Systems, 35:16664–16678, 2022.

Chen, T., Zhang, Z., Ouyang, X., Liu, Z., Shen, Z., and Wang, Z. "BNN-BN=?": Training binary neural networks without batch normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4619–4629, 2021.

Cheng, J., Li, Q., Lin, T., and Shen, Z. Interpolation, approximation and controllability of deep neural networks. arXiv preprint arXiv:2309.06015, 2023.

Cuchiero, C., Larsson, M., and Teichmann, J. Deep neural networks, generic universal interpolation, and controlled ODEs. SIAM Journal on Mathematics of Data Science, 2(3):901–919, 2020.

Dai, H., Landry, B., Yang, L., Pavone, M., and Tedrake, R. Lyapunov-stable neural-network control. arXiv preprint arXiv:2109.14152, 2021.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

E, W. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.

Franklin, G. F., Powell, J. D., Emami-Naeini, A., and Powell, J. D. Feedback control of dynamic systems, volume 4. Prentice Hall, Upper Saddle River, 2002.

Haber, E. and Ruthotto, L. Stable architectures for deep neural networks.
Inverse Problems, 34(1):014004, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009, 2022.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pp. 2790–2799. PMLR, 2019.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

Jia, M., Tang, L., Chen, B.-C., Cardie, C., Belongie, S., Hariharan, B., and Lim, S.-N. Visual prompt tuning. In European Conference on Computer Vision, pp. 709–727. Springer, 2022.

Karimi Mahabadi, R., Henderson, J., and Ruder, S. Compacter: Efficient low-rank hypercomplex adapter layers. Advances in Neural Information Processing Systems, 34:1022–1035, 2021.

Kerimkulov, B., Šiška, D., and Szpruch, L. A modified MSA for stochastic control problems. Applied Mathematics & Optimization, pp. 1–20, 2021.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Kwakernaak, H. and Sivan, R. Linear optimal control systems, volume 1. Wiley-Interscience, New York, 1972.

Lewis, F. L., Dawson, D. M., and Abdallah, C. T. Robot manipulator control: Theory and practice. CRC Press, 2003.

Li, Q. and Hao, S.
An optimal control approach to deep learning and applications to discrete-weight neural networks. In International Conference on Machine Learning, pp. 2985–2994. PMLR, 2018.

Li, Q., Chen, L., Tai, C., et al. Maximum principle based algorithms for deep learning. arXiv preprint arXiv:1710.09513, 2017.

Li, Q., Lin, T., and Shen, Z. Deep learning via dynamical systems: An approximation perspective. Journal of the European Mathematical Society, 25(5):1671–1709, 2022.

Lian, D., Zhou, D., Feng, J., and Wang, X. Scaling & shifting your features: A new baseline for efficient model tuning. Advances in Neural Information Processing Systems, 35:109–123, 2022.

Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. The expressive power of neural networks: A view from the width. Advances in Neural Information Processing Systems, 30, 2017.

Luo, G., Huang, M., Zhou, Y., Sun, X., Jiang, G., Wang, Z., and Ji, R. Towards efficient visual adaption via structural re-parameterization. arXiv preprint arXiv:2302.08106, 2023.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. 2011.

Ogata, K. Modern control engineering, fifth edition. 2010.

Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2009.

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. Improving language understanding by generative pre-training. 2018.

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pp. 28492–28518. PMLR, 2023.

Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. In International Conference on Machine Learning, pp. 2847–2854. PMLR, 2017.

Ruiz-Balet, D. and Zuazua, E. Neural ODE control for classification, approximation, and transport.
SIAM Review, 65(3):735–773, 2023.

Slotine, J.-J. E. and Li, W. On the adaptive control of robot manipulators. The International Journal of Robotics Research, 6(3):49–59, 1987.

Sun, C., Shrivastava, A., Singh, S., and Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pp. 843–852, 2017.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Zhang, D., Zhang, T., Lu, Y., Zhu, Z., and Dong, B. You only propagate once: Accelerating adversarial training via maximal principle. Advances in Neural Information Processing Systems, 32, 2019.

Zhang, Y., Zhou, K., and Liu, Z. Neural prompt search. arXiv preprint arXiv:2206.04673, 2022.

A. Proofs for Theorems and Propositions

A.1. Proof of Theorem 4.2

Set $\tilde{x}^i(s) = x^i(s) + \varepsilon z^i(s)$ and substitute it into the perturbed dynamics (6); we have
$$\dot{x}^i(s) + \varepsilon \dot{z}^i(s) = h(\tilde{x}^i(s), s) + \varepsilon g(\tilde{x}^i(s), u(s)). \quad (19)$$
Subtracting the original dynamics and expanding with respect to $\varepsilon$ gives
$$\dot{z}^i(s) = \nabla_x h(x^i(s), s)\, z^i(s) + g(x^i(s), u(s)) + \delta^i(s), \qquad z^i(0) = 0. \quad (20)$$
Since $h$ is piecewise $C^2$, $g$ is locally Lipschitz and $U$ is compact, we deduce that there exists a constant $C$, depending only on $g$ and $h$, such that $\|\delta^i(s)\| \le C\varepsilon$ for all $x^i$ and $s \in [0, S]$. Therefore, if we denote by $\bar{z}$ the solution of the following linear ODE:
$$\dot{\bar{z}}^i(s) = \nabla_x h(x^i(s), s)\, \bar{z}^i(s) + g(x^i(s), u(s)), \qquad \bar{z}^i(0) = 0, \quad (21)$$
then the difference between $z(s)$ and $\bar{z}(s)$ is in $O(\varepsilon)$. Let $\mu(s) \in \mathbb{R}^{d \times d}$ be a fundamental solution matrix of the homogeneous equation of (8), i.e.
$$\dot{\mu}(s) = \nabla_x h(x(s), s)\, \mu(s), \qquad \mu(0) = I_d. \quad (22)$$
By the theory of linear ODEs, the solution of (8) is given by
$$\bar{z}(s) = \mu(s) \int_0^s \mu^{-1}(\tau)\, g(x(\tau), u(\tau))\, d\tau.$$
For the proof of Theorem 4.2, we need the following technical lemma.
The lemma states that the spatial average of trajectories can be realized via an average in time.

Lemma A.1. Let $x^k(\cdot) : [0, S] \to \mathbb{R}^D$, $k = 1, \dots, N$, be $N$ different trajectories of the original ODE. Suppose $\gamma(x, s) : \mathbb{R}^D \times [0, S] \to \mathbb{R}^D$ is a continuous map such that for any $s \in [0, S]$ there exist $\lambda_1, \dots, \lambda_m \in \mathbb{R}$ and $u_1, \dots, u_m \in U$ with $|\lambda_1| + \dots + |\lambda_m| \le 1$ and
$$\gamma(x^k(s), s) = \sum_{i=1}^m \lambda_i\, g(x^k(s), u_i), \qquad \text{for all } k = 1, \dots, N.$$
Then, for any $\delta > 0$, there exists a control $u(\cdot) : [0, S] \to U$ such that
$$\Big| \int_0^S \mu_k^{-1}(s)\, \gamma(x^k(s), s)\, ds - \int_0^S \mu_k^{-1}(s)\, g(x^k(s), u(s))\, ds \Big| < \delta, \qquad \text{for all } k = 1, \dots, N,$$
where $\mu_k(\cdot)$ denotes the solution of (22) in which the trajectory $x(\cdot)$ is given by $x^k(\cdot)$.

Proof of lemma. Let $M_1 := \sup_{k,s} \|h(x^k(s), s)\|$, $M_2 := \sup_{k,s,u} \|g(x^k(s), u)\|$, and $M_3 := \sup_{s,k} \|\mu_k^{-1}(s)\|_2$. Since $h$ is piecewise $C^1$, $g$ is continuous and $U$ is bounded, we have $M_1, M_2, M_3 < \infty$. For an integer $n > 0$, we split the interval $[0, S]$ into $n$ subintervals $[t_{j-1}, t_j]$, $j = 1, \dots, n$, where $t_j = \frac{j}{n} S$. At each $t_j$, there exist $\lambda_{j,1}, \dots, \lambda_{j,m_j} \in \mathbb{R}$ and $u_{j,1}, \dots, u_{j,m_j} \in U$ such that $|\lambda_{j,1}| + \dots + |\lambda_{j,m_j}| \le 1$ and
$$\gamma(x^k(t_j), t_j) = \sum_{i=1}^{m_j} \lambda_{j,i}\, g(x^k(t_j), u_{j,i}), \qquad \text{for all } k = 1, \dots, N.$$
For convenience, let $\beta_l := \sum_{i=1}^{l} \lambda_{j,i}$ for $l = 0, \dots, m_j$, and let $\beta_{m_j+1} = 1$. Then, we define
$$u(s) = u_{j,l}, \qquad \text{for } s \in \big[t_{j-1} + \beta_{l-1}(t_j - t_{j-1}),\; t_{j-1} + \beta_l(t_j - t_{j-1})\big), \quad l = 1, \dots, m_j,$$
and $u(s) = 0$ for $s \in [t_{j-1} + \beta_{m_j}(t_j - t_{j-1}),\, t_j]$. Then, we have
$$\Big| \int_{t_{j-1}}^{t_j} \mu_k^{-1}(s)\, \gamma(x^k(s), s)\, ds - \int_{t_{j-1}}^{t_j} \mu_k^{-1}(s)\, g(x^k(s), u(s))\, ds \Big|$$
$$\le \Big| \int_{t_{j-1}}^{t_j} \mu_k^{-1}(s)\, \big(\gamma(x^k(s), s) - \gamma(x^k(t_{j-1}), t_{j-1})\big)\, ds \Big| + \Big| \int_{t_{j-1}}^{t_j} \mu_k^{-1}(s)\, \big(\gamma(x^k(t_{j-1}), t_{j-1}) - g(x^k(s), u(s))\big)\, ds \Big|$$
$$\le (t_j - t_{j-1})\, M_3\, \omega\big(\gamma, \tfrac{S}{n}(1 + M_1)\big) + \Big| \int_{t_{j-1}}^{t_j} \mu_k^{-1}(s)\, \big(\gamma(x^k(t_{j-1}), t_{j-1}) - g(x^k(s), u(s))\big)\, ds \Big|, \quad (23)$$
where $\omega(f, \delta) := \sup\{ |f(x) - f(y)| : |x - y| \le \delta \}$ denotes the modulus of continuity of a function.
For the second term in the last line, we have the estimate
$$\Big| \int_{t_{j-1}}^{t_j} \mu_k^{-1}(s)\, \big(\gamma(x^k(t_{j-1}), t_{j-1}) - g(x^k(s), u(s))\big)\, ds \Big|$$
$$= \Big| \sum_{l=1}^{m_j} \lambda_{j,l} \int_{t_{j-1}}^{t_j} \mu_k^{-1}(s)\, g(x^k(t_{j-1}), u_{j,l})\, ds - \sum_{l=1}^{m_j} \int_{t_{j-1}+\beta_{l-1}(t_j - t_{j-1})}^{t_{j-1}+\beta_l(t_j - t_{j-1})} \mu_k^{-1}(s)\, g(x^k(s), u_{j,l})\, ds \Big|$$
$$\le \sum_{l=1}^{m_j} \sum_{p=1}^{m_j} |\lambda_{j,l}|\,|\lambda_{j,p}|\,(t_j - t_{j-1}) \big( M_3\, \omega(g, M_2 \tfrac{S}{n}) + M_2\, \omega(\mu_k^{-1}, \tfrac{S}{n}) \big) \le (t_j - t_{j-1}) \big( M_3\, \omega(g, M_2 \tfrac{S}{n}) + M_2\, \omega(\mu_k^{-1}, \tfrac{S}{n}) \big). \quad (24)$$
Combining (23) and (24), we have
$$\Big| \int_0^S \mu_k^{-1}(s)\, \gamma(x^k(s), s)\, ds - \int_0^S \mu_k^{-1}(s)\, g(x^k(s), u(s))\, ds \Big| \le S \Big( M_3\, \omega\big(\gamma, \tfrac{S}{n}(1+M_1)\big) + M_3\, \omega(g, M_2 \tfrac{S}{n}) + M_2\, \omega(\mu_k^{-1}, \tfrac{S}{n}) \Big). \quad (25)$$
Since $\gamma$, $g$ and $\mu_k^{-1}$ are continuous, the right-hand side vanishes as $n \to \infty$, which proves the lemma.

Proof of Theorem 4.2. By the assumption of the theorem, for any $s \in [0, S]$ there exists $r(s) > 0$ such that the ball $B_{r(s)} := \{x \in \mathbb{R}^{ND} \mid \|x\| < r(s)\}$ is contained in the set
$$\Big\{ \sum_{i=1}^m \lambda_i\, \big(g(x^1(s), u_i), g(x^2(s), u_i), \dots, g(x^N(s), u_i)\big) \;\Big|\; m \in \mathbb{Z}_+,\; u_i \in U,\; |\lambda_1| + \dots + |\lambda_m| \le 1 \Big\}.$$
Therefore, for any vector $V = (v^1, \dots, v^N) \in \mathbb{R}^{ND}$ (where $v^i \in \mathbb{R}^D$) with unit norm, one can construct a continuous function $\gamma(x, s)$ such that
$$\big(\mu_1(S)\mu_1^{-1}(s)\gamma(x^1(s), s),\; \mu_2(S)\mu_2^{-1}(s)\gamma(x^2(s), s),\; \dots,\; \mu_N(S)\mu_N^{-1}(s)\gamma(x^N(s), s)\big)$$
is a.e. non-zero for $s \in [0, S]$ and points in the same direction as $V$. Therefore, the vector
$$\Big( \mu_1(S) \int_0^S \mu_1^{-1}(s)\gamma(x^1(s), s)\, ds,\; \dots,\; \mu_N(S) \int_0^S \mu_N^{-1}(s)\gamma(x^N(s), s)\, ds \Big)$$
equals $cV$ for some positive constant $c$. According to Lemma A.1, this implies that
$$\Big( \mu_1(S) \int_0^S \mu_1^{-1}(s)\, g(x^1(s), u(s))\, ds,\; \dots,\; \mu_N(S) \int_0^S \mu_N^{-1}(s)\, g(x^N(s), u(s))\, ds \Big)$$
can be arbitrarily close to $cV$ as $u(\cdot)$ varies. Since $V$ is arbitrary and $\bar{z}(S) = \mu(S) \int_0^S \mu^{-1}(s)\, g(x(s), u(s))\, ds$ is the leading term of $\varphi_{\varepsilon,u} - \varphi$, we deduce that there exists $\varepsilon > 0$ such that the set $\{(\varphi_{\varepsilon,u}(x^1), \dots, \varphi_{\varepsilon,u}(x^N)) \mid u \in L^\infty([0, S])\}$ is an open neighborhood of $(\varphi(x^1), \dots, \varphi(x^N))$.

A.2. Proof of Proposition 4.4

Proof. When the dataset $X$ is fixed, the expression
$$A(u)x + B(u) = \sum_{1 \le i,j \le D} A_{ij}(u)\,(E_{ij}x) + \sum_{i=1}^{D} B_i(u)\, e_i$$
is a linear combination of no more than $D^2 + D$ fixed vectors, where $E_{ij}$ denotes the $(i,j)$-th matrix unit and $e_i$ the standard basis vectors of $\mathbb{R}^D$.
Therefore, the space spanned by (7) has dimension at most $D^2 + D$. Then, when $N > D + 1$, $ND$ surpasses $D^2 + D$ and the condition in Theorem 4.2 cannot hold.

A.3. Proof of Proposition 5.1

Proof. For convenience, suppose that the first tokens $x^1_1$ and $x^2_1$ of $x^1$ and $x^2$ are the same. Since $g$ is applied token-wise, the first tokens of $g(x^1, u)$ and $g(x^2, u)$ will always coincide. Therefore, the set in (7) is restricted to a subspace of co-dimension at least $d$.

Figure 4. Training loss (LoRA-32, AdaptFormer-64, ANC-64) on the SVHN (left) and Food-101 (right) datasets.

B. Additional Experiments

B.1. Experimental Settings

In general, the experimental configurations adhere to the settings employed in the prior AdaptFormer (Chen et al., 2022) study. A plain Vision Transformer (ViT-Base) model serves as the underlying architecture, pre-trained on the ImageNet-21K dataset (Deng et al., 2009) with MAE (He et al., 2022). The down-projection layer weights in the controls are initialized with the Kaiming normal scheme (He et al., 2015), while the up-projection layer weights are set to 0. Analogously, all biases in the controls are initialized to 0. The Stochastic Gradient Descent (SGD) algorithm with a momentum of 0.9 is employed for optimizing the controls during training. The batch size is set to 128 and the learning rate to 0.05. All experiments are conducted on Nvidia 3090 GPUs.

B.2. Training Loss for the SVHN and Food-101 Datasets

We report the training curves on the CIFAR-100 dataset in Figure 3. For completeness, the corresponding curves for the SVHN and Food-101 datasets are depicted in Figure 4. The observed results indicate that the ANC algorithm consistently achieves the lowest training losses on these two datasets.

B.3. Repeated Experiments for ANC

Table 4.
Repeated experiments on the ANC algorithm.

Algorithm    | CIFAR-100     | SVHN          | Food-101
ANC-32       | 86.68 ± 0.03  | 96.96 ± 0.03  | 88.08 ± 0.05
ANC-64       | 87.06 ± 0.02  | 97.04 ± 0.02  | 88.33 ± 0.03
ANC-128      | 87.18 ± 0.03  | 97.12 ± 0.02  | 88.50 ± 0.05
ANC-32-2H    | 87.15 ± 0.03  | 97.05 ± 0.02  | 88.42 ± 0.04
ANC-64-2H    | 87.35 ± 0.03  | 97.12 ± 0.02  | 88.66 ± 0.04
ANC-128-2H   | 87.51 ± 0.07  | 97.18 ± 0.02  | 89.03 ± 0.05

We present single-run algorithm performance in Table 1. In particular, we set the seed to 42 whenever possible. As such, the up- and down-projections in all LoRA-like algorithms are initialized with the same values, so that the only difference lies in the control architecture. This effectively mitigates the impact of divergent initializations across different algorithms. For completeness, we repeat the ANC experiments three times with different seeds and report the performance in Table 4. The results demonstrate that the performance of ANC remains generally stable across seed values, exhibiting minimal variance.
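The initialization and seeding protocol described in Appendices B.1 and B.3 can be sketched as follows. This is an illustrative sketch, not the authors' released code: `make_controller` is a hypothetical helper, the scale $\sqrt{2/d}$ is our assumption for the reported Kaiming normal initialization, and the shapes follow ViT-Base. It illustrates two properties the setup relies on: a zero up-projection makes the low-rank perturbation vanish at the start of training, and a fixed seed gives every LoRA-like variant identical starting weights.

```python
import numpy as np

def make_controller(d, r, seed=42):
    """Hypothetical helper mirroring the reported settings: Kaiming-style
    normal down-projection, zero up-projection and zero biases."""
    rng = np.random.default_rng(seed)
    down = rng.normal(0.0, np.sqrt(2.0 / d), size=(d, r))  # Kaiming-style fan-in scale (assumed)
    up = np.zeros((r, d))                                  # up-projection initialized to 0
    bias = np.zeros(d)                                     # all biases initialized to 0
    return down, up, bias

d, r = 768, 64                                             # ViT-Base width, rank 64
down, up, bias = make_controller(d, r)

# With a zero up-projection, the low-rank perturbation x @ down @ up + bias
# is exactly zero at initialization: fine-tuning starts from the frozen model.
x = np.random.default_rng(0).standard_normal((197, d))     # one image: 196 patches + CLS token
perturbation = x @ down @ up + bias

# A fixed seed yields identical starting weights across controller variants,
# isolating the effect of the control architecture itself.
down2, up2, _ = make_controller(d, r, seed=42)
```

The zero-output start matches standard LoRA practice: at the first optimization step the adapted model is exactly the pre-trained one, so training cannot degrade it before any gradient signal arrives.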