# On the Role of Attention in Prompt-tuning

Samet Oymak*1, Ankit Singh Rawat*2, Mahdi Soltanolkotabi*3, Christos Thrampoulidis*4

*In alphabetical order. 1University of Michigan & UC Riverside, USA 2Google Research NYC, USA 3University of Southern California, USA 4University of British Columbia, Canada. Correspondence to: Christos Thrampoulidis. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Prompt-tuning is an emerging strategy to adapt large language models (LLMs) to downstream tasks by learning a (soft-)prompt parameter from data. Despite its success in LLMs, there is limited theoretical understanding of the power of prompt-tuning and of the role of the attention mechanism in prompting. In this work, we explore prompt-tuning for one-layer attention architectures and study contextual mixture models where each input token belongs to a context-relevant or context-irrelevant set. We isolate the role of prompt-tuning through a self-contained prompt-attention model. Our contributions are as follows: (1) We show that softmax-prompt-attention is provably more expressive than softmax-self-attention and linear-prompt-attention under our contextual data model. (2) We analyze the initial trajectory of gradient descent and show that it learns the prompt and prediction head with near-optimal sample complexity, and we demonstrate how the prompt can provably attend to sparse context-relevant tokens. (3) Assuming a known prompt but an unknown prediction head, we characterize the exact finite-sample performance of prompt-attention, which reveals the fundamental performance limits and the precise benefit of the context information. We also provide experiments that verify our theoretical insights on real datasets and demonstrate how prompt-tuning enables the model to attend to context-relevant information.

1. Introduction

Transformer models have achieved remarkable success in a wide array of machine learning domains spanning language modeling, vision, and decision making. Recently, one of the key techniques that has helped pave the way for the deployment of transformers to ever increasing application areas is their ability to adapt to multiple unseen tasks by conditioning their predictions through their inputs, a technique known as prompt-tuning (Lester et al., 2021; Li & Liang, 2021). Concretely, prompt-tuning provides a more efficient (cheaper/faster) alternative to fine-tuning the entire weights of the transformer by instead training (fewer) so-called prompt parameters that are appended to the input and can be thought of as an input interface. In fact, several recent works have demonstrated experimentally that prompt-tuning is not only more efficient, but often even becomes competitive with fine-tuning in terms of accuracy (Lester et al., 2021; Liu et al., 2023). However, there is currently limited formal justification of such observations. This motivates the first question of this paper:

How does prompt-tuning compare to fine-tuning in terms of expressive power? Are there scenarios where prompt-tuning outperforms fine-tuning in that regard?

The core constituent of a transformer, and thus of prompt-tuning, is the attention mechanism. Through the attention layer, prompts get to interact with other input features, create/modify attention weights, and help the model attend to latent task-specific information. The standard attention layer relies on softmax nonlinearities.
Operationally, the softmax function allows a model to selectively focus on certain parts of the input tokens when generating attention outputs. However, there is little rigorous understanding of attention-based prompt-tuning. Concretely:

What is the role of softmax-attention in prompt-tuning in terms of optimization and generalization? How does it locate and extract relevant contextual information?

Contributions. Our contributions are as follows:

- We show that a particular form of attention, which we refer to as prompt-attention, naturally arises from the self-attention model with prompt-tuning. We identify provable scenarios where it is more expressive than self-attention and linear-prompt-attention.1 This separation result reveals insightful data models where prompt-tuning can be superior to fine-tuning with self-attention.

- We develop new statistical foundations for gradient-based prompt-tuning: we study the optimization and generalization dynamics of the initial trajectory of gradient descent for optimizing prompt-attention. Concretely, we show the first few iterations learn the prompt and prediction head with near-optimal sample complexity while achieving high accuracy. Our results provide insights into the critical role of softmax in facilitating attention: we show how the initial trajectory of gradient descent utilizes softmax to provably attend to sparse context-relevant tokens, ignoring noisy/nuisance tokens.

- We also characterize the exact finite-sample performance of prompt-attention assuming a known prompt but unknown prediction head. This reveals the fundamental performance limits and precisely quantifies the benefits of context information. Our results highlight various trade-offs among different model parameters: (i) the role of sparsity, i.e., the fraction of context-relevant tokens, and (ii) the relative effects of the different constituents of context-relevant tokens.

- Finally, we empirically validate our theoretical insights on both synthetic contextual-mixture datasets and image-classification datasets. Specifically, we compare multiple variants of prompt-tuning against standard fine-tuning on the latter. Our results highlight the role of prompt-attention in selecting relevant tokens in the image classification setting.

1Our emphasis is on the role of attention (whether prompt- or self-attention). However, we analyze the general problem where attention weights are optimized jointly with the linear classifier head.

Related works. Attention, specifically the so-called self-attention, is the backbone mechanism of transformers (Vaswani et al., 2017). It differs from conventional models (e.g., multi-layer perceptrons and convolutional neural networks) in that it computes feature representations by globally modeling interactions between different parts of an input sequence. Despite tremendous empirical success (see, e.g., Vaswani et al., 2017; Brown et al., 2020; Saharia et al., 2022; Ramesh et al., 2022; cha; Narayanan et al., 2021; Reed et al., 2022, and references therein), the underlying mechanisms of the attention layer remain largely unknown: How does it learn? What makes it better (and when) compared to conventional architectures? Yun et al. (2020) show that self-attention based transformers with large enough depth are universal approximators of seq2seq functions. Focusing on the self-attention component, Edelman et al. (2021) show that self-attention can efficiently represent sparse functions of its input space, while Sahiner et al.
(2022); Ergen et al. (2022) analyze convex relaxations of self-attention, and Baldi & Vershynin (2022); Dong et al. (2021) study the expressive ability of attention layers. However, these works do not characterize the optimization and generalization dynamics of attention. To the best of our knowledge, the only prior works attempting this are Jelassi et al. (2022) and Li et al. (2023). Jelassi et al. (2022) assume a simplified attention structure in which the attention matrix is not directly parameterized in terms of the input sequence. Our paper also distinguishes itself from the contemporaneous work of Li et al. (2023) in several ways: (1) Unlike their data model, ours incorporates a context vector and employs a sub-Gaussian noise model instead of assuming bounded noise. (2) We provide a precise asymptotic analysis that elucidates the role of various problem parameters. (3) While Li et al. (2023) primarily focus on vanilla self-attention, our study centers on understanding the potential benefits of prompt-tuning through prompt-attention.

2. Problem setting

2.1. Motivation: Prompt-tuning

Consider a single-head self-attention layer

O_pre = ϕ(X W_Q W_K^⊤ X^⊤) X W_V, (1)

with input X ∈ R^{T×d} consisting of T tokens of dimension d each, trainable parameters (W_K, W_Q, W_V), and a softmax nonlinearity ϕ : R^T → R^T, [ϕ(v)]_t = e^{v_t} / Σ_{t'∈[T]} e^{v_{t'}}, that acts row-wise when its argument is a T×T matrix. We scalarize the output of the self-attention layer with a trainable linear head U, which yields

y_pre = ⟨U, O_pre⟩ = ⟨U, ϕ(X W_Q W_K^⊤ X^⊤) X⟩. (2)

Note here that we implicitly subsume the value matrix W_V in the linear head via U ← U W_V^⊤. We assume that the model above is pre-trained so that W_K, W_Q, U are fixed. Our goal is to use the pre-trained transformer on (potentially) new classification tasks. Towards this goal, we explore the use of prompt-tuning, introduced in Li & Liang (2021); Lester et al. (2021) as an alternative to fine-tuning the existing transformer weights. Prompt-tuning appends a trainable prompt P ∈ R^{m×d} to the input features X ∈ R^{T×d} with the goal of conditioning the transformer to solve the new classification task. Let X_P = [P; X] ∈ R^{(T+m)×d} be the new transformer input. The output of the attention layer is thus of the form O = ϕ(X_P W_Q W_K^⊤ X^⊤) X. Note that this is slightly different from (1) in that the layer now computes a cross-attention between the augmented input X_P and the original input X. This is also equivalent to self-attention on X_P after masking the prompt P from the keys. This masking is used to cleanly isolate the residual contribution of the prompt without impacting the frozen attention output. Concretely, let W_head be the prediction head associated with the prompt tokens. As before, we scalarize the output by projecting with a linear head of size (T+m)×d as follows:

y = ⟨[W_head; U], ϕ(X_P W_Q W_K^⊤ X^⊤) X⟩ (3)
  = ⟨W_head, ϕ(P W_Q W_K^⊤ X^⊤) X⟩ + ⟨U, ϕ(X W_Q W_K^⊤ X^⊤) X⟩,

where the first term is the prompt-attention output y_new and the second term is the frozen self-attention output y_pre. Here, y_new captures the additive impact of prompt-tuning on the prediction. We denote the trainable parameters in the model above as θ = (W_head, P).2 Since the y_new term becomes a self-contained module and the features attend directly to the prompt vector, we will refer to it as prompt-attention. Our goal is understanding the expressivity, training dynamics, and generalization properties of the above model.
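To make the decomposition in (3) concrete, the following NumPy sketch (ours, not the authors' code; all sizes and random parameter values are illustrative assumptions) forms the cross-attention between the prompt-augmented input X_P and X with the prompt masked from the keys, and checks numerically that the scalarized output splits into the prompt-attention term y_new plus the frozen self-attention term y_pre.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, m = 6, 4, 2  # tokens, token dimension, number of prompt vectors (illustrative)

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)  # numerical stability
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

# frozen pre-trained parameters and a trainable prompt / head
X = rng.normal(size=(T, d))
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
U = rng.normal(size=(T, d))         # frozen linear head (value matrix absorbed)
P = rng.normal(size=(m, d))         # trainable prompt, appended to the input
W_head = rng.normal(size=(m, d))    # trainable head for the prompt rows

# cross-attention: queries come from [P; X], keys/values come from X only
X_P = np.vstack([P, X])
A = softmax_rows(X_P @ W_Q @ W_K.T @ X.T)      # (T+m) x T attention weights
y = np.sum(np.vstack([W_head, U]) * (A @ X))   # scalarized output, Eq. (3)

# the same output, split into prompt-attention and frozen self-attention
y_new = np.sum(W_head * (softmax_rows(P @ W_Q @ W_K.T @ X.T) @ X))
y_pre = np.sum(U * (softmax_rows(X @ W_Q @ W_K.T @ X.T) @ X))
print(np.isclose(y, y_new + y_pre))  # True: y decomposes as in Eq. (3)
```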
To simplify our analysis, we consider the following setting.

1. We focus our attention on the novel component y_new of the model output in (3) so as to isolate and fully understand the capabilities of prompt-attention.
2. We assume W_K, W_Q ∈ R^{d×d} are full-rank.
3. We assume a single trainable prompt q ∈ R^d, i.e., m = 1.

Prompt-attention model. Using these assumptions and setting q = W_K W_Q^⊤ P ∈ R^d and w = W_head ∈ R^d, we arrive at our core prompt-attention model f^ATT_θ (or simply f^ATT):

f^ATT_θ(X) = ⟨w, X^⊤ ϕ(Xq)⟩, θ = (w, q). (4)

We shall see that this model exhibits interesting properties to learn rich contextual relationships within the data and can even be more expressive than a single self-attention layer. We remark that the model above is of interest even beyond the context of prompting: the prompt-attention model in (4) is reminiscent of the simplified model proposed in earlier seq2seq architectures (Bahdanau et al., 2014; Xu et al., 2015; Chan et al., 2015) preceding self-attention and Transformers (Vaswani et al., 2017). Indeed, in the simplified attention mechanism of (Bahdanau et al., 2014; Xu et al., 2015; Chan et al., 2015), the tokens' relevance scores and corresponding attention weights are determined by a = ϕ(Xq), in which q is a trainable vector and ϕ is the softmax-score transformation. Note here that the trainable parameter q corresponds exactly to the trainable prompt vector in (4).

2In our model, we train the classifier head W_head in addition to the prompt vectors P. Despite the additional training for the classifier head, the computational overhead remains minimal, and the overall scheme remains significantly more efficient compared to updating the entire model W_Q, W_K, U.

2.2. Contextual data model

Consider supervised classification on i.i.d. data (X, y) ∼ D with features X ∈ R^{T×d} and binary label y ∈ {±1}.

Dataset model. We assume the following about an example (X, y) drawn from D:

- The labels y are distributed as P(y = 1) = 1 − P(y = −1) = π; for simplicity, we set π = 1/2 so that E[y] = 0.
- The tokens x_t, t ∈ [T], of input example X = [x_1, ..., x_T]^⊤ are split into a context-relevant set R ⊂ [T] and a context-irrelevant set R^c = [T] − R. Specifically, conditioned on the label and the relevance set R, the tokens x_t, t ∈ [T], within X are i.i.d. as follows:

x_t = q_* + y w_*, t ∈ R (relevant token),
x_t = −δ_q q_* − δ_w y w_* + z_t, t ∉ R (irrelevant token). (DATA)

In the above, q_* is a context-vector that indicates token relevance and w_* is a regressor vector. The variables y, δ = (δ_q, δ_w), and (z_t)_{t=1}^T are independent and satisfy the following:

- δ = (δ_q, δ_w) ∈ R^2 is a bounded random variable that obeys δ_q ≥ 0. Thus, δ reflects out-of-context information within irrelevant tokens. However, δ is allowed to expose label information through δ_w. When δ = (0, 0) almost surely, we call the resulting distribution the core dataset model.
- z_t are independent centered sub-Gaussian random variables with covariance Σ (see Ass. 1.a). When Σ = 0, we call the resulting distribution the discrete dataset model.
- We allow the relevance set R to be non-stochastic. This includes R being adversarial to the classification model. We assume a constant fraction ζ = |R|/T ∈ (0, 1) of label-relevant tokens for each input example X drawn from D. Thus, ζ represents the sparsity of relevant signal tokens.

Training dataset S = (X_i, y_i)_{i=1}^n. We draw n i.i.d. samples from D to form our training dataset S = (X_i, y_i)_{i=1}^n. For the i-th example (X_i, y_i), we denote the tokens by (x_{i,t})_{t=1}^T, the noise by (z_{i,t})_{t=1}^T, the relevance set by R_i, and the out-of-context variable by δ_i = (δ_q^i, δ_w^i).
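The following is a minimal sketch of sampling from the core dataset model (δ = (0, 0) almost surely) and evaluating the prompt-attention model (4); it is our own illustration, and the particular values of T, d, ζ, σ, q_*, w_* and the oracle prompt scaling are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, zeta, sigma = 20, 8, 0.3, 0.5   # illustrative values

# context vector q_* and regressor w_*, chosen orthogonal (Assumption 3.b)
q_star = np.zeros(d); q_star[0] = 2.0
w_star = np.zeros(d); w_star[1] = 1.0

def sample_example():
    """Draw (X, y) from the core dataset model (DATA) with delta = (0, 0)."""
    y = rng.choice([-1, 1])
    R = rng.choice(T, size=int(zeta * T), replace=False)   # relevance set
    X = sigma * rng.normal(size=(T, d))                    # irrelevant tokens: pure noise
    X[R] = q_star + y * w_star                             # relevant tokens
    return X, y

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def f_att(X, w, q):
    """Prompt-attention model (4): <w, X^T softmax(X q)>."""
    return w @ (X.T @ softmax(X @ q))

# with the "oracle" parameters (w_*, Gamma * q_*), attention concentrates on R
X, y = sample_example()
print(y, np.sign(f_att(X, w_star, 5.0 * q_star)))  # predictions typically agree
```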
Ideally, for the i-th example, we would like to identify its context-relevant set R_i and discard the rest. This would especially help when the signal-to-noise ratio is small, i.e., ζ = |R_i|/T ≪ 1. This is precisely the role of the context-vector q_*: observe that, per our construction, relevant tokens have positive correlation with q_* whereas irrelevant tokens have non-positive correlation with q_* in expectation. Thus, by focusing attention onto tokens based on their q_* correlation, we can potentially select the relevant set.

Remark 1 (Model interpretation). (DATA) can be thought of as a simplified model for binary image classification with tokens being image patches of two types: ones revealing information about the label (set R) and uninformative ones containing noise. Tokens in R contain information indicating: (i) class-membership via the signed regressor y w_* and (ii) context-relevance via the context-vector q_*. The signed regressor differs across tokens of examples belonging to different classes y ∈ {±1}, while the context-vector is common for all context-relevant tokens across classes. For a concrete example, consider images each depicting multiple, say |R|, birds of one type surrounded by label-irrelevant/noisy background. The goal is to classify images according to one of two types of birds. Here, think of the context as feature-information indicating that the corresponding pixels belong to a bird (of either type) rather than background, while the regressor represents feature information useful to distinguish between the two bird types. Alternatively, (DATA) may be modeling deep representations (rather than raw pixels) of the original images. Overall, simplified models similar to (DATA) have been used previously to analyze optimization and generalization dynamics of training fully-connected (Frei et al., 2022) and convolutional models (Cao et al., 2022). Specifically, (DATA) is an extension of the commonly used (sub-)Gaussian mixture model customized to the nature of attention: each example is tokenized and context-relevant information is described in terms of both a regressor (differing between classes) and a context (common across classes).

2.3. Baseline Models

We compare the performance of the prompt-attention model in (4) with the following three baseline models.

The linear model is parameterized by θ = w and outputs

f^LIN(w) = (1/T) w^⊤ X^⊤ 1_T = (1/T) Σ_{t∈[T]} w^⊤ x_t. (5)

Note this corresponds to a prompt-attention model with uniform attention weights [a]_t = 1/T, t ∈ [T].

The self-attention model is a strict generalization of the linear model. Recalling (2), let us merge the key-query weights W = W_Q W_K^⊤ (without loss of generality) and gather the weights into θ = (U, W); we then write it as

f^SATT(U, W) = (1/T) ⟨U, ϕ(X W X^⊤) X⟩. (6)

Rather than using a T·d dimensional U, we will also consider the simpler token-pooling via U = 1_T u^⊤ for u ∈ R^d.

The linear-attention model, parameterized by θ = (w, q), replaces the softmax score transformation in (4) with a linear function and outputs

f^LIN-ATT(w, q) = w^⊤ X^⊤ X q / T. (7)

2.4. Training

Given S = (X_i, y_i)_{i=1}^n drawn i.i.d. from D, we solve square-loss empirical risk minimization to obtain θ̂ = (ŵ, q̂):

θ̂ = arg min_θ L_S(θ) = (1/n) Σ_{i=1}^n (y_i − f_θ(X_i))². (8)

Within our theoretical investigation, we are interested in the following performance criteria for models f ∈ {f^ATT, f^LIN-ATT, f^SATT}:

Classification error: For a model f_θ̂ this is defined as ERR(θ̂) = P(y f_θ̂(X) < 0).
Test risk: L(θ̂) = E_{(y,X)∼D}[(y − f_θ̂(X))²].
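For completeness, here is a small sketch (our own; shapes follow the notation above) of the three baselines (5)-(7) and of the square-loss ERM objective (8); `softmax_rows` applies ϕ row-wise as in (6).

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

# Baseline models of Section 2.3 (our own sketch, not the authors' code).
def f_lin(X, w):
    """Linear model (5): average the tokens, then apply w."""
    return w @ X.mean(axis=0)

def f_satt(X, U, W):
    """Self-attention model (6) with a full (T x d) head U."""
    return np.sum(U * (softmax_rows(X @ W @ X.T) @ X)) / X.shape[0]

def f_lin_att(X, w, q):
    """Linear-attention model (7): softmax replaced by the identity map."""
    return w @ X.T @ X @ q / X.shape[0]

def empirical_risk(f, theta, data):
    """Square-loss ERM objective (8) for a model f with parameters theta."""
    return np.mean([(y - f(X, *theta)) ** 2 for X, y in data])
```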
2.5. Assumptions and notations

First, we formally state our assumptions on the noisy tokens. The more general condition is that the noise is sub-Gaussian and satisfies a mild zero third-moment condition.

Assumption 1.a. The noise vector z ∼ SN(σ) is centered σ-sub-Gaussian, i.e., ‖z‖_ψ2 = σ. Moreover, its distribution is symmetric and has zero third moment, i.e., E[z ⊗ z ⊗ z] = 0. Let Σ = E[zz^⊤] denote the noise covariance.

For some of our results it will be convenient to further assume that the noise is Gaussian, since this leads to precise formulas that are easily interpretable.

Assumption 1.b. The noise vector z ∼ N(0, σ²I) is isotropic Gaussian with variance σ².

Second, we require a mild assumption on the correlation between the context q_* and classifier w_* to guarantee that pure signal tokens q_* + y w_* are correctly classified by the true regressor w_*, i.e., y w_*^⊤(q_* + y w_*) > 0. For convenience, we denote W = ‖w_*‖, Q = ‖q_*‖, ρ = q_*^⊤ w_* / (‖q_*‖ ‖w_*‖).

Assumption 3.a. The correlation satisfies |ρ| < W/Q.

We will also often consider the special case of zero correlation ρ and thus state it as a separate assumption below. This orthogonality assumption is useful for more tractable analysis as it helps decouple feature selection and prediction.

Assumption 3.b. The context and classifier vectors are orthogonal, i.e., q_* ⊥ w_*.

Notation. We use boldface letters for vectors and matrices. 1_m represents an m-dimensional all-ones vector. For a vector v, ‖v‖ denotes its Euclidean norm and v̄ = v/‖v‖ its normalization. ϕ(·) denotes the softmax transformation. Q(·) denotes the tail function of the standard normal distribution. ∧ and ∨ denote the minimum and maximum of two numbers, respectively. The Õ(·) and Ω̃(·) notations suppress logarithmic dependencies. Finally, ∝ denotes proportionality.

[Figure 1: panel (a) depicts the failure modes of LIN-ATT along the label-relevance and context-relevance directions; panels (b) and (c) plot accuracy of linear attention, self-attention, and prompt attention as Δ/Δ_crit varies.] Figure 1. This figure summarizes and verifies the outcomes of Theorem 1. Fig. (a) depicts the outcome of our Theorem 1. The relevant token is at position y w_* + q_*, whereas red tokens (irrelevant) are at positions δ(q_* − y w_*) with δ ∈ {0, Δ}. Figures (b) & (c) plot the performance of our attention models under the contextual dataset with δ equally likely over {0, Δ} for a synthetic setup (cf. Section 5.1). We set n = 100, d = T = 10, ζ = 0.4 and train with 100 SGD epochs. We report the median test accuracy over 20 trials. Fig. (b) sets δ = δ_q = δ_w and verifies that self-attention has 50% error (for Δ_crit = (1 − ζ)^{-2}). Fig. (c) sets δ = δ_q = −δ_w and verifies that linear-prompt-attention has 25% error (when δ ≥ Δ_crit = ζ/(1 − ζ) in this case).

3. Contrasting prompt-attention to baselines

In this section, we establish separation results between prompt-attention (cf. (4)) and the baselines of self-attention (cf. (6)) and linear attention (cf. (7)). For this, we focus on the discrete dataset model with noiseless irrelevant tokens (z_t = 0, t ∈ [T]). We first observe that if δ = (δ_q, δ_w) admits a single value, even a linear model can solve the discrete dataset model.

Observation 1 (Linear model solves singleton). Suppose (δ_q, δ_w) = (Δ_q, Δ_w) almost surely for Δ_q, Δ_w ∈ R. Set w_⊥ = (I − q̄_* q̄_*^⊤) w_*. As long as Δ_w ≠ ζ/(1 − ζ) and w_⊥ ≠ 0, f^LIN(w_⊥) or f^LIN(−w_⊥) achieves perfect accuracy.

Thus, to investigate the expressivity of f^ATT, f^SATT, f^LIN, (δ_q, δ_w) would need to admit two or more values.
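As a quick numerical sanity check of Observation 1 and of the sentence above, the sketch below (ours; the specific values of T, d, ζ, Δ and the class of δ distributions are arbitrary assumptions) evaluates the linear model f^LIN(±w_⊥) on the discrete dataset model, first with a singleton δ and then with δ taking two values.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, zeta = 10, 6, 0.4
q_star = np.array([2.0, 0, 0, 0, 0, 0])
w_star = np.array([0, 1.0, 0, 0, 0, 0])

def sample_discrete(delta_q, delta_w):
    """Discrete dataset model: Sigma = 0, delta = (delta_q, delta_w)."""
    y = rng.choice([-1, 1])
    X = np.tile(-delta_q * q_star - delta_w * y * w_star, (T, 1))  # irrelevant tokens
    R = rng.choice(T, size=int(zeta * T), replace=False)
    X[R] = q_star + y * w_star                                     # relevant tokens
    return X, y

def f_lin(X, w):
    return w @ X.mean(axis=0)

w_perp = w_star - (w_star @ q_star) / (q_star @ q_star) * q_star   # (I - qq^T) w_*

def accuracy(delta_sampler, n=2000):
    correct_pos = correct_neg = 0
    for _ in range(n):
        dq, dw = delta_sampler()
        X, y = sample_discrete(dq, dw)
        s = f_lin(X, w_perp)
        correct_pos += (y * s > 0)
        correct_neg += (y * s < 0)
    return max(correct_pos, correct_neg) / n   # best of f_LIN(w_perp), f_LIN(-w_perp)

print(accuracy(lambda: (1.0, 1.0)))   # singleton delta: perfect accuracy (Observation 1)
print(accuracy(lambda: (0.0, 0.0) if rng.random() < 0.5 else (3.0, 3.0)))  # two values: ~1/2
```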
Perhaps surprisingly, we prove that, as soon as, (δq,δw) comes from a binary distribution, then both f SATT and f LIN can indeed provably fail. Importantly, this happens in the regime δq 0 where prompt-attention thrives. Theorem 1 (Separation of population accuracies). Consider the discrete dataset model where we set Σ = 0 in (DATA). The following statements hold: 1. Prompt-attention: Suppose ρ2 < 1, δq 0, and δw C almost surely. Define q = (I w w )q ,w = (I q q )w . For Γ > log(C(1/ζ 1)) Q2(1 ρ2) , choosing θ = (w ,Γq ), f ATT θ achieves perfect classification accuracy on (DATA). 2. Self-attention: In (DATA), choose (δq,δw) to be (0,0) or ( , ) equally-likely with > (1 ζ) 2. For any choice of (U = 1u ,W ), f SATT(1u ,W ) achieves 50% accuracy (i.e. random guess). For any choice of (U,W ), there exists a (DATA) distri- bution with adversarial relevance set choices such that f SATT(U,W ) achieves 50% accuracy. 3. Linear-attention: In (DATA), choose (δq,δw) to be (0,0) or ( , ) equally-likely with > ζ/(1 ζ). For any choice of (w,q), f LIN-ATT(w,q) achieves at most 75% accuracy. See Fig. 1 for an illustration of the main takeaways from Thm. 1 and a numerical validation of its conclusion on synthetic data. While surprising, the reason prompt-attention can provably beat self-attention is because it is optimized for context-retrieval and can attend perfectly on the relevant contextual information. In contrast, self-attention scores are fully feature-based; thus, context information is mixed with other features and can be lost during aggregation of the output. Also note that all results, with the exception of self-attention for general U, hold for arbitrary choices of the relevance sets (including adversarial ones). The reason is that tokens are pooled and the particular choice of R does not matter. Only for f SATT(U,W ) we need to adapt the relevance set R to the output layer U (as well as (y,δ) variables) to promote misclassification. Otherwise, with the hindsight knowledge of the relevance set, U can intelligently process individual tokens of the self-attention output to filter out confusing tokens. In fact, for the same failure dataset model, self-attention can achieve perfect accuracy by choosing U = 1Rw where 1R is the vector of ones over the (known!) relevance set R (see Lemma 17). However, this is of course only known in hindsight. On the Role of Attention in Prompt-tuning 4. Gradient-based analysis of prompt-attention This section investigates how gradient-descent optimization of the prompt-attention model learns (DATA). Concretely, it shows that a few gradient steps can provably attend to the context-relevant tokens leading to high-classification accuracy. Our results capture requirements on sample complexity in terms of all problem parameters, i.e. dimension d, correlation ρ, context / signal energies Q / W, number of tokens T, and sparsity ζ. This allows studying tradeoffs in different regimes. Our analysis in this section concerns the prompt-attention model f ATT θ , so we simply write fθ. Also, without any further explicit reference, we focus on the core dataset model, i.e. (DATA) with δ = (0,0). All our results here hold under the mild noise and correlation assumptions: Assumption 1.a and Assumption 3.a (we will not further state these). Finally, for simplicity of presentation we assume here isotropic noise Σ = σ2I and handle the general case in the appendix. 4.1. 
Gradient-based algorithm For data generated from (DATA), we show the three-step gradient-based algorithm described below achieves high test accuracy. Our analysis also explains why three appropriately chosen steps suffice. Algorithm: We split the train set in three separate subsets S1,S2,S3 of size n each. Starting from w0 = 0,q0 = 0, the algorithm proceeds in three gradient steps for step sizes η > 0 and γ > 0 and a final debiasing step as follows: w1 = η w LS1(0,0), (9a) q1 = γ q LS2(0, w1), (9b) w2 = η w LS3( q1, w1), (9c) where LSj,j = 1,2,3 is the loss in (8) evaluated on sets Sj. The debiasing step is defined in Section 4.3. 4.2. Population analysis To gain intuition we first present results on the population counterpart of the algorithm, i.e., substituting L(w,q) with its population version L(w,q) = E[ L(w,q)] in all three steps in (9). It is convenient to introduce the following shorthand notation for the negative gradient steps Gq(q,w) = q LD(θ) = E(X,y) D[(y fθ(X)) qfθ(X)] and Gw(q,w) = w LD(θ) = E(X,y) D[(y fθ(X)) wfθ(X)]. The first gradient step (cf. (9a)) is easy to calculate and returns a classifier estimate that is already in the direction of w . Lemma 1 (Population first step). The first population gradi- ent step w1 = ηGw(0,0) satisfies w1 = ηζw since under (DATA), Gw(0,0) = E(X,y)[y X 1]/T = ζw . The second gradient step q1 = γGw(w1,0) is more intricate: unless q w , q1 also has nonzero components in both directions q and w . Lemma 2 (Population second step). The second population gradient step q1 = γGw(w1,0) satisfies for α = ηζ q1 = γαW 2(ζ ζ2)(1 + ασ2 T αζ(W 2 + ρ2Q2))q (10) + γαρQW(ζ ζ2)(1 2ζαW 2 (1 + 2 Proof. Since this computation involves several terms, we defer complete proof to Appendix D.1. The above simplification is made possible by leveraging the assumption on the third-moment of noise (cf. Assumption 1.a). Lemma 2 highlights the following key aspects: (i) As mentioned, q1 also picks up the w direction unless ρ = 0. However, we can control the magnitude of this undesired term by choosing small step-size η (see Cor. 2). (ii) As αW 2 grows, the gradient component in the q direction might end up pointing in the direction of q . This is because large signal along the w direction might still allow to predict 1 label. However, this can always be avoided by choosing sufficiently small step-size η (see Cor. 2). (iii) Similarly, as the noise strength σ2 grows, gradient in the q direction grows as well. This is because, going along q direction attenuates the noise and cleans up the prediction. (iv) Finally, as ζ 1 and ζ ζ2 0 the magnitude of the gradient decays because all tokens contain signal information and there is no need for q . To see how q1 selects good tokens, we investigate the relevance scores (normalized by the step size γ) rt = x t q1/γ of relevant vs irrelevant tokens. Attending to context-relevant tokens requires their relevance scores to be larger than those of the noisy ones. Concretely, suppose we have B = min t R {rt = (q + yw )q1 γ } 2 max t Rc{rt = z t q1 Note above that the relevance scores are the same for each t R. Thus, R eγB + Rc eγB/2 S = t [T ] eγrt R eγB, which implies the following for the attention weights as step size increases γ : at = [ϕ(Xq1)]t = eγrt/S S 0 t Rc . (12) Provided (11) holds, a large enough second gradient step (i.e. large γ) finds q1 that attends (nearly) perfectly to contextrelevant tokens in R and attenuates (almost) all irrelevant tokens in Rc. The following theorem formalizes the above intuition. 
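Before stating the formal result, a compact sketch of the three-step procedure (9a)-(9c) applied to the empirical square loss (8) may help; it is our own illustration (the step sizes, the sample splits, and the evaluation point of the third step are illustrative choices), not the authors' implementation, and the gradients are written up to the constant factor of 2, which is absorbed into the step sizes.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def f_att(X, w, q):
    return w @ (X.T @ softmax(X @ q))

def grad_w(data, w, q):
    """Gradient of the square loss (8) with respect to w (up to a factor of 2)."""
    return -np.mean([(y - f_att(X, w, q)) * (X.T @ softmax(X @ q)) for X, y in data], axis=0)

def grad_q(data, w, q):
    """Gradient of the square loss (8) with respect to q (up to a factor of 2)."""
    g = np.zeros_like(q)
    for X, y in data:
        a = softmax(X @ q)
        jac = X.T @ (np.diag(a) - np.outer(a, a)) @ X   # d f_att / d q
        g += -(y - f_att(X, w, q)) * (jac @ w)
    return g / len(data)

def three_step(S1, S2, S3, eta, gamma, d):
    """Three gradient steps (9a)-(9c) starting from (w0, q0) = (0, 0)."""
    w0, q0 = np.zeros(d), np.zeros(d)
    w1 = -eta * grad_w(S1, w0, q0)        # (9a): first step aligns with the w_* direction
    q1 = -gamma * grad_q(S2, w1, q0)      # (9b): second step learns the prompt direction
    w2 = -eta * grad_w(S3, w1, q1)        # (9c): head step under the learned attention
    return w2, q1
```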
We defer the complete proof to Appendix D.3. On the Role of Attention in Prompt-tuning Theorem 2 (Main theorem: Population). Consider the model θγ = (wγ 2,qγ 1 ) where qγ 1 = γGq(w1,0), wγ 2 = Gw(0,qγ 1 ) and w1 = ηGw(0,0) for step-size η small enough (see Eq. (66) for details). Then, there exists an absolute constant c > 0, sufficiently large context strength Q and step-size γ > 0 such that ERR (fθγ) 2Te c α2Q2 (1 ρ2/2)Q 2 ρ W 1 + 3ρ2 αQ. (13) Eq. (13) guarantees the desired condition (11) holds. When ρ = 0 (q w ), α can be as large as 1 in (13) in which case the rate is 2Te c Q2/σ2. For ρ 0, (13) imposes ρ < Q 2W , in which the role of Q,W is reversed compared to ρ W Q in Assumption 3.a: the latter guarantees classifier energy is larger so that signal yw dominates q , while for promptattention to attend to relevant tokens it is favorable that energy of q dominates w . Finally, we compare the theo- rem s error to the error Q( 1 ζ ) of the linear model in Fact A.1. For concreteness, consider a setting of extreme sparsity ζ = O(1/ T) and W = O(1). Then the error of linear model is O(1), while the (population) algorithm in (9) for prompt-attention achieves an error of e O(Q2), which is exponentially decreasing in Q. 4.3. Finite-sample analysis Here, we investigate the behavior of the algorithm in (9) with finite sample-size n. For convenience, we first introduce an additional de-biasing step after calculating the three gradients in (9). Specifically, for a sample S4 of size n we compute a bias variable b = 1 n n i=1 f( q1, w2)(Xi), and use it to de-bias the model s prediction by outputting f(θ,b)(X) = fθ(X) b. While this extra step is not necessary, it simplifies the statement of our results. Intuitively, b helps with adjusting the decision boundary by removing contributions of the context vector in the final prediction (the context vector is useful only for token-selection rather than final prediction). Below we provide a simplified version of our main result where noise variance σ 1 and subsumes constants. Refer to Theorem 6 in the appendix for precise details. Theorem 3 (Main theorem: Finite-sample). Suppose Q,W and ρ are such that there exists α (0,3/16) for which (3/16 ρ2/8)Q (9/4 ρ + 1/16)W α Q Fix any ϵ > 0. For sufficiently small step-size η Q 2, sufficiently large step-size γ = γ(ϵ), and n d(Q/ζW)4 log5(nd), the following statements hold with high probability (see Eq. (63)) over the training set: 1. Prompt attends to relevant tokens. Concretely, for any fresh sample (X,y), with probability at least 1 2Te cα2Q2, the attention coefficients at = [ϕ(X q1)]t satisfy: ϵ (1 ζ)T ,t R. 2. Prompt-attention learns relevant features. Concretely, for some absolute constant c > 0, P(X,y) ( X ϕ(X q1) (q + yw ) < ϵ) 2Te cα2Q2 . 3. The test error of the model f θ similarly satisfies ERR(f θ) 2Te cα2Q2 . Assuming small correlation coefficient ρ and W < Q, we can set α = O(1). Then, similar to Theorem 2, prompt attention achieves a test error rate of e O(Q2) which is a strict improvement over the linear baseline of Fact A.1 whenever Q2 ζ2W 2T. Secondly, our bound achieves a sample complexity of n d(Q/ζW)4. The linear growth in d is intuitive from counting degrees of freedom. Interestingly, large Q does improve the test error, however, it degrades sample complexity. This is because it makes the estimation of parameters more challenging. 
Finally, larger ζW improves sample complexity since ζW (combining sparsity level and magnitude) captures the strength of the label-relevant information within the relevant tokens t ∈ R.

Sharp error rates: Finally, in Appendix A we provide an exact analysis of the classification error when q_* is known and only w is estimated from data. This analysis exactly quantifies the value of context-information and how prompt-tuning retrieves it. Specifically, we prove a sharp asymptotic error rate of Q(e^{Q²/4} · rate_LIN / √(1 + ISNR(n/d))), where ISNR(α) = α^{-1}(1 − ζ)e^{-Q²/2}/(ζ²W²T) and rate_LIN is as in Fact A.1. This uniformly (for all Q ≥ 0 values) improves the optimal rates for (context-free) Gaussian mixture models thanks to the context information.

5. Experiments

First, we verify the utility of prompt-attention via experiments on a synthetic setting that precisely follows the contextual data model from Section 2.2. Subsequently, we explore prompt-tuning on various image classification tasks that are motivated by the contextual data model and compare it with the standard fine-tuning method. Finally, we validate the utility of prompt vectors in distinguishing relevant tokens from irrelevant tokens via prompt-attention under an image classification setting.

[Figure 2: classification error versus ‖q_*‖ for prompt-attention, the linear model, and two oracle prompt-attention variants (with q_*, and with (q_*, w_*)).] Figure 2. Performance of prompt-attention on the synthetic setting described in Section 5.1. For prompt-attention, we employ the algorithm in (9) to obtain q̂ and ŵ. For the baseline linear model and the two oracle settings, we have closed-form expressions for their asymptotic test error (cf. Theorem 1), which are depicted by solid lines. On the other hand, markers show the finite sample performance of these three methods. All finite sample performances are obtained by averaging over 20 independent runs.

5.1. Synthetic experiments

Here, we generate a synthetic dataset according to the core dataset model, i.e., we have δ = (δ_q, δ_w) = (0, 0) for all examples in the dataset. In particular, we consider a setting with T = 500, d = 50, and ζ = 0.1, i.e., each example has 500 tokens out of which 10% are relevant. As for the noisy tokens, they consist of i.i.d. N(0, I) vectors. Assuming that q_* ⊥ w_* and TW = 3, we generate n = 10·d training examples from the core dataset model for varying Q. Fig. 2 showcases the performance of prompt-attention (cf. (4)) when combined with the estimates q̂ and ŵ produced by the gradient-based algorithm in (9). We also showcase the performance of the linear model (cf. (5)) and two oracle methods where we assume access to the true q_* and the true (q_*, w_*), respectively, while applying prompt-attention. Note that prompt-attention achieves a vanishing classification test error in this setting whereas a natural baseline (the linear model) can fail to achieve good performance. On the other hand, the prompt-attention enabled by (9) successfully achieves high accuracy as the context energy (defined by Q) increases, validating the utility of prompt-attention as well as our gradient-based algorithm in (9). Finally, we also consider a stochastic δ = (δ_q, δ_w) ≠ (0, 0) to validate Theorem 1, as described in Fig. 1.

5.2. Image classification experiments

Dataset. Motivated by our contextual data model, we construct three datasets based on CIFAR-10 (Krizhevsky et al., 2009) to conduct our evaluation (see Fig. 3 for examples). Due to space constraints, we defer additional details regarding the datasets to Appendix H.1.
[Figure 3: example images labeled "bird" and "ship" from (a) FULL-TILED, (b) PARTIAL-TILED, and (c) EMBED-IN-IMAGENET.] Figure 3. Illustration of the different CIFAR-10 based datasets utilized in the image classification experiments (cf. Section 5.2). Note that all three variants correspond to 10-way multiclass classification tasks corresponding to the 10 original classes in CIFAR-10.

We also refer the reader to Appendix H.1 for details regarding the model architecture and training procedure.

Methods. In our fine-tuning experiments, we update all pre-trained model parameters. As for prompt-tuning, we only update the newly introduced (prompt) variables and keep the pre-trained network frozen. We consider three variants of prompt-tuning: 1) PROMPT-TUNING-I (Lester et al., 2021), where we add trainable vectors between the CLS token embedding and the first image (patch) embedding only at the input; 2) PROMPT-TUNING-II (Li & Liang, 2021), where we add the same trainable vectors between the CLS embedding and the first image embedding at the input of every transformer layer; and 3) PROMPT-TUNING-III, where we add different trainable vectors between the CLS embedding and the first image patch embedding at the input of every transformer layer. Note that the number of trainable parameters in PROMPT-TUNING-I and PROMPT-TUNING-II does not scale with the number of layers, whereas we have linear scaling with the number of layers in PROMPT-TUNING-III. Interestingly, all three prompt-tuning variants are identical when the number of layers is 1, which corresponds to the setup we theoretically analyzed in the paper. However, they exhibit remarkably different behavior for a multi-layer transformer model, as we show next. (In Appendix H.2, we also compare prompt-tuning with fine-tuning only the first-layer self-attention parameters for the underlying ViT model, as per the single-layer nature of our theoretical results.)

Results. Here, the main goal of our exploration is to highlight the different behavior of fine-tuning and prompt-tuning. We utilize a model trained on the FULL-TILED dataset as the pre-trained model. This model achieves a top-1 (in-domain) accuracy of 80.43 on the FULL-TILED test set. In contrast, it achieves zero-shot top-1 test accuracies of 56.35 and 17.97 on PARTIAL-TILED and EMBED-IN-IMAGENET, respectively. This alludes to the fact that EMBED-IN-IMAGENET corresponds to a larger distribution shift from the pre-training distribution (FULL-TILED), as compared to PARTIAL-TILED. Fig. 4 and Fig. 5 (cf. Appendix H.2) showcase the performance of the fine-tuning and prompt-tuning approaches on EMBED-IN-IMAGENET and PARTIAL-TILED, respectively. Note that fine-tuning outperforms prompt-tuning in a data-rich setting (cf. Fig. 4a and 5a). This is due to fine-tuning having the ability to update a large number of model parameters (5.4M in our case). In contrast, with 2000 prompt vectors, PROMPT-TUNING-III (the most expensive prompt-tuning method out of all three) only updates 460.8K parameters. Interestingly, in the data-limited regimes, prompt-tuning becomes more competitive. In fact, the best performing prompt-tuning method outperforms fine-tuning (cf. Fig. 4b and 4c) on EMBED-IN-IMAGENET, where fine-tuning can easily overfit as it cannot leverage the benefits of the pre-training data due to a large distribution shift between FULL-TILED and EMBED-IN-IMAGENET. Part of the performance gap between PROMPT-TUNING-III and PROMPT-TUNING-II can be attributed to the larger
number of trainable parameters available to PROMPT-TUNING-III. Note that PROMPT-TUNING-II consistently outperforms PROMPT-TUNING-I, even with the same number of trainable parameters. This alludes to the fact that optimization and architecture choices play a major role beyond just the number of trainable parameters. As mentioned earlier, our theoretical treatment for a single-layer model cannot distinguish among these different prompt-tuning approaches. As a result, we believe that our empirical observations point to multiple interesting avenues for future work.

[Figure 4: top-1 accuracy versus the number of prompt vectors (0, 50, 100, 200) for fine-tuning and the three prompt-tuning variants on (a) the full dataset, (b) capped 10%, and (c) capped 2%.] Figure 4. Performance of fine-tuning vs. prompt-tuning on 10-way classification tasks defined by the EMBED-IN-IMAGENET dataset. The full dataset has 50K training examples. Capped 10% and 2% correspond to sub-sampled train sets where we select exactly 500 and 100 examples per class from the full dataset. Note that a number of prompt vectors equal to 0 corresponds to zero-shot performance.

5.3. Attention weights for prompt vectors

Finally, we explore what role prompt-attention, i.e., the attention weights with prompt vectors as keys and image patches/tokens as values, plays towards the underlying task. In Fig. 6 (cf. Appendix H.2), we illustrate one representative example. The figure shows how the average attention weights from prompt vectors to image tokens/patches evolve across transformer layers when we employ PROMPT-TUNING-III. Indeed, the figure verifies that prompt-attention helps distinguish the relevant tokens/patches from the irrelevant patches, validating our starting hypothesis in Sections 2.1 and 2.2.

6. Discussion

In light of the remarkable success of attention architectures, we initiate a theoretical investigation of one of its core mechanisms: prompt-tuning. For a one-layer attention model, we motivate and analyze a simplified model for prompt-tuning, which we coin prompt-attention. Through this model, we developed new statistical foundations for gradient-based prompt-tuning, characterized its optimization and generalization dynamics, and explored how it facilitates attending to context-relevant information. We also showed that under (DATA) one-layer softmax-prompt-tuning can provably be superior to alternatives including one-layer self-attention. Thorough experiments support our theory on how prompt-tuning attends to the context and how it can potentially outperform full fine-tuning. Our results also suggest many interesting future directions, including 1) extension to deeper architectures by characterizing the role of softmax-attention in individual layers, 2) developing a stronger theoretical and empirical understanding of when/if prompt-tuning is superior to fine-tuning, 3) extending our model to include multiple prompt vectors (and perhaps extending (DATA) to include multiple context vectors), and 4) investigating the role of multi-head attention in prompt-tuning.

Acknowledgement

We thank the reviewers for their feedback and for suggesting additional experiments involving fine-tuning only the first-layer self-attention weights. SO was supported by the NSF grants CCF-2046816 and CCF-2212426, a Google Research Scholar award, and Army Research Office grant W911NF2110312.
MS is supported by the Packard Fellowship in Science and Engineering, a Sloan Fellowship in Mathematics, an NSFCAREER under award #1846369, DARPA Learning with Less Labels (Lw LL) and Fast NICS programs, and NSF-CIF awards #1813877 and #2008443. CT was partially supported by the NSERC Discovery Grant RGPIN-2021-03677 and by the NSF grant CCF-2009030. On the Role of Attention in Prompt-tuning https://openai.com/blog/chatgpt/. Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. ar Xiv preprint ar Xiv:1409.0473, 2014. Baldi, P. and Vershynin, R. The quarks of attention. ar Xiv preprint ar Xiv:2202.08371, 2022. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877 1901, 2020. Cao, Y., Chen, Z., Belkin, M., and Gu, Q. Benign overfitting in two-layer convolutional neural networks. ar Xiv preprint ar Xiv:2202.06526, 2022. Chan, W., Jaitly, N., Le, Q. V., and Vinyals, O. Listen, attend and spell. ar Xiv preprint ar Xiv:1508.01211, 2015. Dehghani, M., Gritsenko, A., Arnab, A., Minderer, M., and Tay, Y. Scenic: A jax library for computer vision research and beyond. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 21393 21398, 2022. Dong, Y., Cordonnier, J.-B., and Loukas, A. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In International Conference on Machine Learning, pp. 2793 2803. PMLR, 2021. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https:// openreview.net/forum?id=Yicb Fd NTTy. Edelman, B. L., Goel, S., Kakade, S., and Zhang, C. Inductive biases and variable creation in self-attention mechanisms. ar Xiv preprint ar Xiv:2110.10090, 2021. Ergen, T., Neyshabur, B., and Mehta, H. Convexifying transformers: Improving optimization and understanding of transformer networks. ar Xiv preprint ar Xiv:2211.11052, 2022. Frei, S., Chatterji, N. S., and Bartlett, P. L. Random feature amplification: Feature learning and generalization in neural networks. ar Xiv preprint ar Xiv:2202.07626, 2022. Jelassi, S., Sander, M. E., and Li, Y. Vision transformers provably learn spatial structure. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https: //openreview.net/forum?id=e MW9Ak Xa REI. Karp, S., Winston, E., Li, Y., and Singh, A. Local signal adaptivity: Provable feature learning in neural networks beyond kernels. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 24883 24897. Curran Associates, Inc., 2021. URL https://proceedings. neurips.cc/paper/2021/file/ d064bf1ad039ff366564f352226e7640-Paper. pdf. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009. Ledoux, M. and Talagrand, M. Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer Science & Business Media, 1991. 
Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. ar Xiv preprint ar Xiv:2104.08691, 2021. Li, H., Wang, M., Liu, S., and Chen, P.-Y. A theoretical understanding of shallow vision transformers: Learning, generalization, and sample complexity. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum? id=j Cl Gv3Qjhb. Li, X. L. and Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. ar Xiv preprint ar Xiv:2101.00190, 2021. Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1 35, 2023. Mohammadi, H., Zare, A., Soltanolkotabi, M., and Jovanovi c, M. R. Convergence and sample complexity of gradient methods for the model-free linear quadratic regulator problem. ar Xiv preprint ar Xiv:1912.11899, 2019. Narayanan, D., Shoeybi, M., Casper, J., Le Gresley, P., Patwary, M., Korthikanti, V., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., et al. Efficient large-scale language model training on gpu clusters using megatronlm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1 15, 2021. On the Role of Attention in Prompt-tuning Oymak, S. Stochastic gradient descent learns state equations with nonlinear activations. In conference on Learning Theory, pp. 2551 2579. PMLR, 2019. Pollard, D. Empirical processes: theory and applications. Ims, 1990. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. ar Xiv preprint ar Xiv:2204.06125, 2022. Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Gimenez, M., Sulsky, Y., Kay, J., Springenberg, J. T., et al. A generalist agent. ar Xiv preprint ar Xiv:2205.06175, 2022. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. Image Net Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211 252, 2015. doi: 10.1007/s11263-015-0816-y. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., et al. Photorealistic text-to-image diffusion models with deep language understanding. ar Xiv preprint ar Xiv:2205.11487, 2022. Sahiner, A., Ergen, T., Ozturkler, B., Pauly, J., Mardani, M., and Pilanci, M. Unraveling attention via convex duality: Analysis and interpretations of vision transformers. International Conference on Machine Learning, 2022. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pp. 2048 2057. PMLR, 2015. Yun, C., Bhojanapalli, S., Rawat, A. S., Reddi, S., and Kumar, S. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020. URL https: //openreview.net/forum?id=Byx RM0Ntvr. On the Role of Attention in Prompt-tuning A. 
Sharp characterization of accuracy under known context While the discrete dataset model is insightful, incorporating noise is crucial for understanding the fundamental limits of the benefits of context in attention. To this end, let us focus on the core dataset model where we set δq = δw = 0 and explore the role of noise in population accuracy. Also assume that noise is standard normal, i.e. Assumption 1.b. Linear model. The linear model aggregates tokens to obtain a simple Gaussian mixture distribution. Specifically, aggregated tokens are exactly distributed as 1 T X 1T N(ζw , 1 ζ T I), leading to the following well-known result. Fact A.1. For linear models, optimal accuracy obeys minw ERR(f LIN(w)) = Q( 1 ζ ) where Q( ) is the tail function of the standard normal distribution. Prompt-attention model. Since prompt-attention strictly generalizes the linear model, its accuracy is at least as good. The theorem below quantifies this and demonstrates how context vector can enable an optimal weighting of relevant and irrelevant tokens to maximize accuracy. A general version of this theorem is proven under a non-asymptotic setting (finite T,d) as Theorem 8. Theorem 4. Consider the prompt-attention model f ATT θ . Suppose w q and let τ, τ > 0 be hyperparameters. Consider the following algorithm which uses the hindsight knowledge of q to estimate w and make prediction: Set ˆw = (I q q ) Lw(0,τ q ) and θ = ( ˆw, τ q ). Suppose ζ2W 2T,1 ζ,α = n/d,e Q2,eτ each lie between two positive absolute constants. Suppose T is polynomially large in n and these constants and O( ) hides polynomial terms in n. Define inverse-signal-to-noise-ratio: ISNR(α,τ) = (1 ζ)e2τ(τ Q) αζ2W 2T . In the limit T,d , the test error converges in probability to Q( e Q τ τ2 1+ISNR(α,τ) 1 ζ ). In this limit, optimal hyperparameters are τ = τ = Q/2 and leads to optimal ISNR(α) = (1 ζ)e Q2/2 αζ2W 2T and the error ERR(α,Q,W) = Q e Q2/4 1 + ISNR(α) Here, a few remarks are in place. Note that rate LIN = 1 ζ term is the population error rate of f LIN from Fact A.1. In the limit α = n/d , the rate of f ATT is simply given by e Q/4rate LIN demonstrating the strict superiority of prompt-attention. Moreover setting Q = 0 (no prompt info), since feature-output of f LIN (i.e. X ϕ(X1)) is (essentially) a binary Gaussian mixture distribution, our error-rate recovers the Bayes-optimal f LIN classifier which has a finite-sample rate of rate LIN/ 1 + (1 ζ)/(αζ2W 2T 2). Prompt-tuning also strictly improves this because our ISNR(α) introduces an additional e Q2/2 multiplier. B. Gradient calculations and concentration In this section, we focus on finite-sample analysis of Algorithm 9. Introduce the following shorthand notation analogous to the population counterparts in Section 4.2: Gq(q,w) = q LD(θ) = 1 n i [n] (yi fθ(Xi))X i ϕ (Xiq)Xiw , Gw(q,w) = w LD(θ) = 1 n i [n] (yi fθ(Xi))X i ϕ((Xiq)). (14) B.1. Gradient Calculations We begin with the gradient calculations for the first two steps of the algorithm. For convenience, we make use of the following shorthands Rq = Rw,q = w q , Rw = Rw,w = w w , αi = α(w,yi) = Rq + yi Rw , γi = γ(Zi) = 1 T Z i 1, βi = β(Zi;w) = γ i w, ˆΣi = 1 On the Role of Attention in Prompt-tuning where Zi R(1 ζ)T d is the matrix of irrelevant tokens zi,t,t Rc for sample i [n]. Lemma 3. Under dataset model (DATA) and Assumption 1.a, we have Gw(0,0) = ζw + ζq 1 n i [n] yi + 1 n i yiγi. (16) Lemma 4. 
Under dataset model (DATA) and Assumption 1.a, we have that Gq(0,w) = 1 n i=1 (yi ζαi βi)[((ζ ζ2)αi ζβi)(q + yiw ) + ˆΣiw (ζαi + βi)γi] . (17) B.1.1. PROOF OF LEMMA 3 By direct computation, Gw(0,0) = w LD(θ) = 1 n T i [n] yi X i 1T = 1 n T i [n] t [T ] yixi,t n( i [n] yi)q + ζ n i [n] yiγi . B.1.2. PROOF OF LEMMA 4 Note that ϕ (0) = 1 T I 1 T 2 11 ; hence, for θ = (0,w): Gq(0,w) = q LD(θ) = 1 1 T (yi fθ(Xi))X i Xiw ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ Term1,i 1 T 2 (yi fθ(Xi))X i 11 Xiw ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ Term2,i Moreover, note that, X i Xi = ζT(q + yiw )(q + yiw ) + Z i Zi, X i 1 = ζT(q + yiw ) + Z i 1, where recall the notation in Lemma 3 for Zi. Hence, using the lemma s notation (repeated here for convenience) αi = Rq + yi Rw , βi = β(Zi;w) = 1 T 1 Ziw, γi = γ(Zi) = 1 T Z i 1, ˆΣi = 1 we find that yi fθ(Xi) = yi ζαi βi 1 T Xi X i w = ζαiq + ζyiαiw + ˆΣiw 1 T 2 X i 11 Xiw = ζ (ζαi + βi)q + ζ (ζαi + βi)yiw + (ζαi + βi)γi. With the above, each one of the two terms becomes: Term1,i = (yi ζαi βi)ζαiq + (yi ζαi βi)ζαiyiw + (yi ζαi βi) ˆΣiw Term2,i = ζ (yi ζαi βi)(ζαi + βi)q + ζ (yi ζαi βi)(ζαi + βi)yiw + (yi ζαi βi)(ζαi + βi)γi . Combining the above: Term1,i Term2,i = (yi ζαi βi)[ζ ((1 ζ)αi βi)(q + yiw ) + ˆΣiw (ζαi + βi)γi] . (18) On the Role of Attention in Prompt-tuning B.2. Concentration of Gradient Gq(0,w) in the q direction The main result of this section is the following lemma about concentration of gradient with respect to q. Lemma 5 (Concentration of Gq(0,w)). Fix any vectors v,w Rd. For convenience define Rv,q = v q and Rv,w = v w and recall Rw ,Rq notations from Lemma 4. Then, we can decompose v Gq(0,w) = v Gq(0,w) + v Gq(0,w), where the expectation term is given by v Gq(0,w) = E[v Gq(0,w)] = ((ζ ζ2)(Rw + w Σw/T) (ζ2 ζ3)(R2 w + R2 q )) Rv,q + (((ζ ζ2) 2(ζ2 ζ3)Rw )Rq ) Rv,w ((1 + 2/T)(ζ ζ2)Rq )v Σw , (19) and the deviation term obeys v Gq(0,w) = [((ζ ζ2) 2(ζ2 ζ3)Rw )Rq Rv,q + ((ζ ζ2)Rw (ζ2 ζ3)(R2 w + R2 q ))Rv,w ]( 1 + [( ζ + 2ζ2)Rq Rv,q + ( ζ + ( ζ + 2ζ2)Rw )Rv,w + (1 ζ)w Σv]( 1 n i=1 (γ i w)) + [ζRw ζ2(R2 q + R2 w ) + (1 ζ) n i=1 v γi) + [( ζ + ( ζ + 2ζ2)Rw )Rv,q + ( ζ + 2ζ2)Rq Rv,w ]( 1 n i=1 yi (γ i w)) + [ζRq 2ζ2Rq Rw ]( 1 n i=1 yi (v γi)) + ζRv,q ( 1 n i=1 (γ i w) 2 (1 ζ) + ζRv,w ( 1 n i=1 (γ i w) 2 yi) + [1 2ζRw ]( 1 n i=1 yi (w γi)(v γi)) + (1 ζRw )( 1 n i=1 yiv ˆΣiw) n i=1 v ˆΣiw (1 ζ)v Σw) n i=1 (w γi)(v γi) 1 ζ n i=1 (v γi)((w γi) 2 (1 ζ) n i=1 (γ i w)(v ˆΣiw (1 ζ)w Σv)). (20) Moreover, all random terms in (20) are zero-mean and concentrate as prescribed by Lemma 6 below . Lemma 6 (Main concentration lemma). Let yi,i [n] be iid Rademacher random variables. Let Zi R(1 ζ)T d,i [n] be iid copies of a random matrix Z. Each row zt,t [(1 ζ)T] of Z is an iid copy of a random vector z satisfying Assumption 1.a. For convenience denote γi = Z i 1/T and ˆΣi = Z i Zi/T. 
Then, the following statements are true for all vectors w,v Rd: XXXXXXXXXXXX 1 n i [n] yi XXXXXXXXXXXXψ2 C n On the Role of Attention in Prompt-tuning XXXXXXXXXXXX 1 n i [n] γ i w XXXXXXXXXXXXψ2 XXXXXXXXXXXX 1 n i [n] yiγ i w XXXXXXXXXXXXψ2 Cσ 1 ζ w 2 XXXXXXXXXXXX 1 n i [n] (γ i w)(γ i v) 1 ζ T v Σw XXXXXXXXXXXXψ1 XXXXXXXXXXXX 1 n i [n] yi (γ i w)(γ i v) XXXXXXXXXXXXψ1 Cσ2(1 ζ) w 2 v 2 XXXXXXXXXXXX 1 n i [n] w ˆΣiv (1 ζ)w Σv XXXXXXXXXXXXψ1 XXXXXXXXXXXX 1 n i [n] yiw ˆΣiv XXXXXXXXXXXXψ1 Cσ2 1 ζ w 2 v 2 XXXXXXXXXXXX 1 n i [n] ((γ i w) 2 1 ζ T w Σw)γ i v XXXXXXXXXXXXψ2/3 Cσ3(1 ζ)3/2 w 2 2 v 2 T 3/2 n log n XXXXXXXXXXXX 1 n i [n] (γ i w)(w ˆΣiv (1 ζ)w Σv) XXXXXXXXXXXXψ2/3 Cσ3(1 ζ) w 2 2 v 2 T n log n. Also, all the random variables that appear above are zero mean. B.2.1. PROOF OF LEMMA 5 We split (17) in four terms and handle each of them separately. n n i=1 (yi ζαi βi)((ζ ζ2)αi ζβi)q We first focus on n i=1 (yi(1 ζRw ) βi ζRq )(yi(ζ ζ2)Rw ζβi + (ζ ζ2)Rq )q = Aq . We we can express A above conveniently as follows (recall y2 i = 1): A = ζ(ζ ζ2)R2 q + (1 ζRw )(ζ ζ2)Rw + ((1 ζRw )(ζ ζ2)Rq ζ(ζ ζ2)Rw Rq )( 1 + ( (ζ ζ2)Rq + ζ2Rq )( 1 n i=1 βi) + ( (1 ζRw )ζ (ζ ζ2)Rw )( 1 n i=1 yiβi) + ζ ( 1 n i=1 β2 i ) = (ζ ζ2)Rw (ζ2 ζ3)(R2 w + R2 q ) + ((ζ ζ2) 2(ζ2 ζ3)Rw )Rq ( 1 + ( ζ + 2ζ2)Rq ( 1 n i=1 βi) + ( ζ + ( ζ + 2ζ2)Rw )( 1 n i=1 yiβi) n i=1 β2 i (1 ζ) T w Σw) + (ζ ζ2) From Lemma 6, all random terms above are zero mean. Hence, E[A] = (ζ2 ζ3)R2 q + (1 ζRw )(ζ ζ2)Rw + (ζ ζ2)w Σw T = (ζ2 ζ3)R2 q + (ζ ζ2)(Rw + w Σw/T) (ζ2 ζ3)R2 w = (ζ ζ2)(Rw + w Σw/T) (ζ2 ζ3)(R2 w + R2 q ). (21) On the Role of Attention in Prompt-tuning Term II = 1 n n i=1 (yi ζαi βi)((ζ ζ2)αi ζβi)yiw Term II = 1 n i=1 (yi(1 ζRw ) βi ζRq )(yi(ζ ζ2)Rw ζβi + (ζ ζ2)Rq )yiw = Bw . We we can express B above conveniently as: B = ((ζ ζ2)Rw (ζ2 ζ3)(R2 w + R2 q )) 1 n i [n] yi + ((ζ ζ2) 2(ζ2 ζ3)Rw )Rq + ( ζ + 2ζ2)Rq ( 1 n i=1 βiyi) + ( ζ + ( ζ + 2ζ2)Rw )( 1 n i=1 βi) + ζ ( 1 n i=1 β2 i yi). All the random terms above are zero-mean. Hence, E[B] = ((ζ ζ2) 2(ζ2 ζ3)Rw )Rq . (22) Term III = 1 n n i=1 (yi ζαi βi) ˆΣiw Fix any vector v v {Term III} = ( 1 n i=1 yiv ˆΣiw) ζ (Rq + yi Rw )( 1 n i=1 v ˆΣiw) ( 1 n i=1 (γ i w)v ˆΣiw) = (1 ζRw )( 1 n i=1 yiv ˆΣiw) ζRq ( 1 n i=1 v ˆΣiw) ( 1 n i=1 (γ i w)v ˆΣiw) = (1 ζRw )( 1 n i=1 yiv ˆΣiw) ζRq ( 1 n i=1 v ˆΣiw (1 ζ)v Σw) (ζ ζ2)Rq v Σw n i=1 (γ i w)(v ˆΣiw (1 ζ)w Σv) + (1 ζ)(w Σv) 1 n i=1 (γ i w) From Lemma 6 all random terms above are zero mean. Hence, E[v {Term III}] = (ζ ζ2)Rq w Σv. (23) Term IV = 1 n n i=1 (yi ζαi βi)(ζαi + βi)γi For fixed vector v, v {Term IV} = 1 n n i=1 (yi ζαi βi)(ζαi + βi)v γi. Reorganizing, note that (yi ζαi βi)(ζαi + βi) = ζyi(Rq + yi Rw ) ζ2(R2 q + R2 w + 2yi Rq Rw ) 2ζRq βi 2ζRw yiβi + yiβi β2 i = (ζRq 2ζ2Rq Rw )yi + (ζRw ζ2(R2 q + R2 w )) (2ζRq )βi + (1 2ζRw )yiβi β2 i v {Term IV} = (ζRw ζ2(R2 q + R2 w ))( 1 n i=1 v γi) + (ζRq 2ζ2Rq Rw )( 1 n i=1 yi (v γi)) + (1 2ζRw )( 1 n i=1 yi (w γi)(v γi)) n i=1 (w γi)(v γi) 1 ζ T v Σw) (2ζRq ) 1 ζ n i=1 (v γi)((w γi) 2 (1 ζ) T w Σw) + (1 ζ) n i=1 (v γi)) . On the Role of Attention in Prompt-tuning According to Lemma 6 all random terms above are zero mean. Thus, E[v {Term IV}] = 2ζRq 1 ζ T w Σv (24) Combined The desired identities (19) and (20) follow by combining all the terms above. B.2.2. PROOF OF LEMMA 6 First bound: Obvious by boundedness (hence, sub-gaussianity) of yi and Fact G.1. Second bound: For convenience set T = (1 ζ)T and assume wlog that Rc = [ T]. Recall that T t=1 z i,tw = 1 ζ T t=1 z i,tw. Also for all t: z i,tw ψ2 K w 2. 
Thus, from Fact G.1: βi ψ2 Cσ(1 ζ) w 2 The bound then follows by applying Fact G.1 once more. For the second term in this bound recall that yi { 1} and βi = t z i,tw/T. Also, for all i [n]: yi,{zi,t}t are zero-mean and independent. Thus (i) E[yiβi] = 0, and (ii) {yizi,t D zi,t and yizi,t yizi,t } Ô yiβi D βi. Thus, the same bound as the first term holds. Third bound: It is easy to compute E[(γ i w)(γ i v)] = 1 T t =1 E[w zi,tzi,t v] = 1 ζ T v Σw, (26) and, using (25) (γ i w)(γ i v) E[(γ i w)(γ i v)] ψ1 C (γ i w)(γ i v) ψ1 C γ i w ψ2 γ i v ψ2 = Cσ2(1 ζ) w 2 v 2 Since (γ i w)(γ i v),i [n] are independent, the desired bound on the first term follows from Fact G.1. Consider now the second term. By independence of yi,γi it holds that E[yi (γ i w)(γ i v)] = 0. Arguing as we did above for the second bound, yiγ i w D γ i w. Hence, the subexponential bound is the same as for the first term. Fourth bound: First, it is easy to compute that for all i [n]: E[w Σiv] = 1 T E[w Z i Ziv] = 1 T t=1 E[w z i,tzi,tv] = T T w Σv = (1 ζ)w Σv. w ˆΣiv E[w ˆΣiv] = w ˆΣiv (1 ζ)w Σv = 1 T t=1 ((z i,tw)(z i,tv) w Σv) n i=1 w ˆΣiv (1 ζ)w Σv = 1 T t=1 ((z i,tw)(z i,tv) w Σv). On the Role of Attention in Prompt-tuning Now, each random variable in the double sum above is independent and such that (z i,tw)(z i,tv) w Σv ψ1 2 (z i,tw)(z i,tv) ψ1 2 z i,tw ψ2 z i,tv ψ2 Cσ2 w 2 v 2. (28) Hence, from Fact G.1, n i=1 w ˆΣiv (1 ζ)w Σv ψ1 Cσ2 1 ζ w 2 v 2 The bound for the second term follows along the same lines. The two key observations are that (i) E[yiw ˆΣiv] = 0 because yi and zi,t are independent, and, (ii) yiw ˆΣiv = 1 T w yi Z i Ziv D 1 T w Z i Ziv = 1 T t=1 ( z i,tw)(z i,tv) where Zi is an independent copy of Zi. Fifth bound: From (29) and (26), we have for all i [n] that (γ i w)(γ i v) 1 ζ T w Σw ψ1 Cσ2(1 ζ) w 2 2 T . Moreover, recall from Eq. (25) that γ i v ψ2 Cσ 1 ζ v 2 Combining the above two displays and applying Fact G.2 we find for all i [n] that ((γ i w) 2 1 ζ T w Σw)γ i v ψ2/3 Cσ3(1 ζ)3/2 w 2 2 v 2 T 3/2 . (29) The desired bound follows from the above after using Fact G.3. Sixth bound: From Eq. (28): w ˆΣiv (1 ζ)w Σv ψ1 Cσ2 1 ζ w 2 v 2 T and from Eq. (25) γ i w ψ2 Cσ 1 ζ w 2 Next we use Fact G.2 with α = 2 and β = 1 to find that (γ i w)(w ˆΣiv (1 ζ)w Σv) ψ2/3 CK3(1 ζ) w 2 2 v 2 T . Next we use Fact G.3 which allows us to conclude that n i=1 (γ i w)(w ˆΣiv (1 ζ)w Σv) ψ2/3 CK3(1 ζ) w 2 2 v 2 T n log n. Finally, the zero-mean property follows since E[(1 Ziw)(w Zi Z i v)] = T t =1 E[(z i,tw)(w zi,t z i,t v)] = T t =1 E[w zi,t tr(zi,t z i,t vw )] = T 2E[tr(z w)tr(zz vw )] = T 2E[tr((z w) (zz vw ))] = T 2 tr(E[(z zz )](w vw )) = 0, where the last equality follows by the zero third moment property in Assumption 1.a. On the Role of Attention in Prompt-tuning C. Finite-sample gradient analysis In the following, we assume without explicit further reference that Ass. 3.a holds (i.e. ρ < W/Q) and additionally that Q > W. To simplify the results we further assume σ 1 and W 1. C.1. First gradient step The lemma below studies the deviation of the first-step of GD w1 with respect to its population counterpart w1. Provided that n and nζT/d are larger than appropriate functions of other problem parameters, then the deviations are of small multiplicative order. Lemma 7 (First gradient step). Consider the one-step population and finite updates w1 = ηGw(0,0) and w1 = η Gw(0,0), respectively. For convenience denote Rw = w 1w , Rq = w 1q , Rw = w 1w , Rq = w 1q . 
For any u > 0 and any small constant c0 > 0, there exist absolute constants c,c > 0 and large enough constant C = C(c0) > 0 such that if ζ 1 1, (30) then, with probability at least 1 c e cu2 Rw Rw c0Rw and Rq Rq c0ηζQW . (31) Additionally, if nζT C(1 + u) σ then with probability 1 e cu2 e cdu2 w1 w1 c0ηζW. (33) Proof. Note that the conclusions of the lemma are all homogeneous in η. Hence, without loss of generality, set η = 1. Also recall by Lemma 3 that w1 = Gw(0,0) = ζw + ζq 1 n i [n] yi + 1 n i yiγi , (34) and w1 = Gw(0,0) = ζw ; thus, Rw = ζW 2. From these, and also using Lemma 6 , for any u > 0 with probability at least 1 2e cu2 Rw Rw = w ( w1 w1) ζ ρ WQ RRRRRRRRRRRR 1 n i [n] yi RRRRRRRRRRRR + 1 n i yiγ i w Cuζ ρ WQ n + Cuσ 1 ζ ζW nζT c0ζW 2 = c0Rw . where the last inequality follows by assuming n,ζT large enough as in the condition of the lemma and using ρ 1. Similarly, Rq Rq = q ( w1 w1) ζQ2 RRRRRRRRRRRR 1 n i [n] yi RRRRRRRRRRRR + 1 n i yiγ i q CuζQ2 n + Cuσ 1 ζ ζQ nζT c0ζQW for sufficiently large n per (30). Finally, with probability at least 1 e cu2 e cdu2 w1 w1 w1 w1 CuζQ n + C(1 + u)σ 1 ζ ζ d nζT c0ζW . (35) where, again, the last inequality follows by assuming n,ζT large enough as stated in the lemma. In the second inequality, we used from Lemma 6 that i [n] yiγi/n is Cσ 1 ζ/ n T-sub Gaussian and applied Fact G.4 to get a high-probability bound on its euclidean norm. On the Role of Attention in Prompt-tuning C.2. Second gradient step Next, we move on to the second gradient update in the direction of Gq(0, w1). Recall our goal of controlling the relevance scores of signal and noisy tokens. The first lemma below takes a step in this direction by computing the signal and noise relevance scores assuming access to the population gradient Gq(0, w1) = E[ Gq(0, w1)]. Lemma 8 ( Gq(0, ) control: Expectation term). Let Gq(0,w1) = E[ Gq(0,w1)] be the expectation of a gradient step in the q-direction evaluated at (0, w1) and recall that w1 = η Gw(0,0) for η > 0. Suppose w1 satisfies (31) and (33) for sufficiently small enough constant c0 > 0. Further assume that the step-size η satisfies the following for sufficiently small absolute constant cη > 0: η = cη σ2Q2 . (36) Then, for y { 1} it holds that (q + yw ) Gq(0, ˆw1) ηζ(ζ ζ2)W 2Q((1/4 ρ2/8 3c0)Q (9/4 ρ + c0)W) , (37) Gq(0, ˆw1) ηζ (ζ ζ2) W 2Q(13/4 + 2c0) . (38) Proof. Fix any v and recall the notation of Lemma 7. With these, we have from Lemma 5 that v Gq(0, w1) = ((ζ ζ2)( Rw + σ2 w1 2/T) (ζ2 ζ3)( R2 w + R2 q )) v q + ((ζ ζ2) 2(ζ2 ζ3) Rw ) Rq v w σ2 (1 + 2/T)(ζ ζ2) Rq v w1 = (ζ ζ2) Rw (1 + σ2 w1 2/(T Rw ) ζ ( Rw + R2 q / Rw )) v q + (ζ ζ2) Rq (1 2ζ Rw ) v w σ2 (1 + 2/T)(ζ ζ2) Rq v w1 = (ζ ζ2) Rw (1 + σ2 w1 2/(T Rw ) ζ ( Rw + R2 q / Rw )) ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ + (ζ ζ2) Rq (1 2ζ Rw ηζσ2 (1 + 2/T)) ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ v w ησ2 (1 + 2/T) ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ (ζ ζ2) Rq v δ1 (39) where, in the last line we used (34) and set δ1 = ( w1 w1)/η = ζq 1 n i [n] yi + 1 Recall from (31) and (33) that Rw /2 Rw 3Rw /2, Rq Rq c0ηζQW and w1 w1 + c0ηζW. Also, from (33) we have that δ1 c0ζW,. Here and onwards, c0 > 0 is a small enough absolute constant (smaller than 1/2) whose value may change from line to line. Further recall Rw = ηζW 2, Rq = ηζρWQ and w1 = ηζW. 
With these, we can set small enough step size η σ 2Q 2 such that C1 [1/2,3/2], C2 [ 1/8,1], C3 [0,(c3/ζ)/Q2], for constant c3 > 0 to be made small enough later in the proof. From the above, we can compute C3 Rq δ1 c3 ζQ2 ηζ(c0QW + ρWQ)(c0ζQ) c3c0ηζ Q. On the Role of Attention in Prompt-tuning ˆC1 Rw Q2 ηζW 2Q2/4 ˆC2 Rq ρWQ = ˆC2ρWQRq + C2ρWQ( Rq Rq ) ηζρ2Q2W 2/8 c0ηζQ2W 2 ηζρ2Q2W 2/8 c0ηζW 2Q2 . Putting the above displays together we find that q Gq(0, ˆw1) ηζ(ζ ζ2)(W 2Q2/4 ρ2Q2W 2/8 c0W 2Q2 c3c0Q2) . Similarly, we can compute that ˆC1 Rw ρQW (9/4)ηζW 3Q ρ ˆC2 Rq W 2 ηζ( ρ W 3Q + c0QW 3). Hence, for y { 1} yw Gq(0, ˆw1) ηζ (ζ ζ2)4W 3Q ρ + c0QW 3 + c3c0QW . The above two displays put together yield (37). Specifically, we also use the simplifying assumption Q W 1 and pick c3 small enough so that c3W 2Q 1 1. The norm bound in (38) follows again starting from (39) and using similar arguments as above to show: Gq(0, ˆw1) ηζ(ζ ζ2)( C1Rw Q + C2 Rq W + C3 Rq δ1 ) ηζ(ζ ζ2)((9/4)W 2Q + ( ρ + c0)W 2Q + c3c0Q) . The next lemma controls the effect on the relevance scores of the deviation term Gq(0, w) = Gq(0, w) Gq(0, w). Lemma 9 ( Gq(0, ) control: Deviation term). Let Gq(0, w1) = Gq(0, w1) Gq(0, w1) and suppose w1 satisfies (31) and (33). Also assume σ = 1. Fix any u > 0 and any small constant c1 > 0. Then, there exists small enough constant cη (dependent on c1) such that if step-size η is small enough as per (36) the following statements hold. With probability at least 1 c e cu2/3 for positive constants C,c ,c > 0 it holds for signal tokens that (q + yw ) Gq(0, w1) u C c1 Q(Q σ σ2 T ) log(n) n , y { 1} (40) Moreover, with probability at least 1 c de cu2/3 it holds that Gq(0, ˆw1) u C c1 (Q σ σ2 Proof. We study each one of the terms of Gq(0, w1) in (20) separately. We repeat the terms here for convenience, also On the Role of Attention in Prompt-tuning noting the substitutions Rw Rw = w 1w and Rq Rq = w 1q . v Gq(0, w1) = [((ζ ζ2) 2(ζ2 ζ3) Rw ) Rq Rv,q + ((ζ ζ2) Rw (ζ2 ζ3)( R2 w + R2 q ))Rv,w ]( 1 + [( ζ + 2ζ2) Rq Rv,q + ( ζ + ( ζ + 2ζ2) Rw )Rv,w + (1 ζ) w 1Σv]( 1 n i=1 (γ i w1)) + [ζ Rw ζ2( R2 q + R2 w ) + (1 ζ) T w 1Σ w1]( 1 n i=1 v γi) + [( ζ + ( ζ + 2ζ2) Rw )Rv,q + ( ζ + 2ζ2) Rq Rv,w ]( 1 n i=1 yi (γ i w1)) + [ζ Rq 2ζ2 Rq Rw ]( 1 n i=1 yi (v γi)) + ζRv,q ( 1 n i=1 (γ i w1) 2 (1 ζ) + ζRv,w ( 1 n i=1 (γ i w1) 2 yi) + [1 2ζ Rw ]( 1 n i=1 yi ( w 1γi)(v γi)) + (1 ζ Rw )( 1 n i=1 yiv ˆΣi w1) n i=1 v ˆΣi w1 (1 ζ)v Σ w1) [2ζ Rq ]( 1 n i=1 ( w 1γi)(v γi) 1 ζ n i=1 (v γi)(( w 1γi) 2 (1 ζ) T w 1Σ w1)) n i=1 (γ i w1)(v ˆΣi w1 (1 ζ) w 1Σv)) . (42) Recall from the lemma assumption that (31) holds and from Rw = ηζW 2 and Rq = ηζρWQ, that 1/2ηζW 2 Rw 3/2ηζW 2 and Rq ηζ( ρ + c0)WQ. The observation is that we can choose step-size η small enough (as stated in (36)) to bound (in absolute value) all the coefficients in (42) (aka all terms in square brackets) that include Rw , Rq . Therefore, for any small positive constant c1 > 0 it can be checked that there is sufficiently small constant cη that determines step-size η On the Role of Attention in Prompt-tuning in (36) such that v Gq(0, w1) c1 [ Rv,q + Rv,w ] 1 + [c1 Rv,q + (1 + c1) Rv,w + (1 ζ)σ2 w 1v ] 1 n i=1 (γ i w1) + [c1 + (1 ζ) T σ2 w1 2] 1 + [(1 + c1) Rv,q + c1 Rv,w ] 1 n i=1 yi (γ i w1) n i=1 yi (v γi) n i=1 (γ i w1) 2 (1 ζ) n i=1 (γ i w1) 2 yi + [1 + c1] 1 n i=1 yi ( w 1γi)(v γi) + [1 + c1] 1 n i=1 yiv ˆΣi w1 n i=1 v ˆΣi w1 (1 ζ)v Σ w1 n i=1 ( w 1γi)(v γi) 1 ζ n i=1 (v γi)(( w 1γi) 2 (1 ζ) n i=1 (γ i w1)(v ˆΣi w1 (1 ζ) w 1Σv) . 
(43) Now, we use successively Lemma 6 to bound the random terms. Also note that Rv,q Q v , Rv,w W v and w1 ηζ(1 + c0)W = ηζM. Here, we denote M = (1+c0)W for convenience. With these, for any u > 0, with probability at least 1 c e cu2/3 we have On the Role of Attention in Prompt-tuning v Gq(0, w1) u C n v c1 (W + Q) n T v (((1 + c1)(2W + 2Q) + c1ηζ(1 ζ)σ2M)ηζM + (2c1 + σ2(1 ζ)η2ζ2M 2/T)) + u Cσ2(1 ζ) T n v (η2ζ3M 2Q + η2ζ3WM 2 + (1 + c1)ηζM) + u Cσ2 1 ζ n T v ((1 + 2c1)ηζM) + u Cσ2(1 ζ) T n v (c1ηζM) + u Cσ3(1 ζ)3/2 log(n) T 3/2 n v (η2ζ2M 2) + u Cσ3(1 ζ)log(n) T n v η2ζ2M 2 . (44) Now, again using small step size η as per (36) (recall that M = (1 + c0)W Q), this can be further simplified to the following (here the value of constant c1 might be different from (44)) v Gq(0, w1) u C n v c1 (W + Q) + u Cσ 1 ζ n T v c1 + u Cσ2(1 ζ) + u Cσ2 1 ζ n T v c1 + u Cσ3(1 ζ)3/2 log(n) T 3/2 n v c1 + u Cσ3(1 ζ)log(n) u v C n c1 (Q σ σ2 T σ3 log(n) T 3/2 σ3 log(n) u v C log(n) n c1 (Q σ σ2 Now, we can compute the deviation of the relevance scores. For signal tokens we have for both y { 1}, and all u > 0 with probability at least 1 c e cu2/3, there exist constant C > 0 such that (q + yw ) Gq(0, w1) u C c1 Q(Q σ σ2 T ) log(n) n (46) Similarly, since (45) holds for all v, we can apply it for all standard basis vectors v = ej,j [n] and union bounding yields for all u > 0 with probability at least 1 c de cu2/3, Gq(0, w1) u C c1 (Q σ σ2 C.3. First and second gradient steps combined: Learning the relevant features With lemmas 7, 8, and 9 at hand, we are now ready to put things together stating our final bounds for relevance scores. The finding is presented as a stand-alone lemma below. Lemma 10 (Put things together). Consider the finite-sample gradient step q1 = Gq(0, w1), where recall that w1 = η Gw(0,0). Fix any u0,u1,u2,u3 > 0 and any small constants c0, c1 > 0. Suppose step-size η of first gradient step satisfies (36) for sufficiently small constant cη = cη( c1) > 0 and further assume n u0 C0 Q W and d (1 + u0) C0 σ W 1/ζ 1. (47) On the Role of Attention in Prompt-tuning for some large enough constant C0 = C0( c0) > 0. Finally, make the following mild assumption (for simplicity), σ σ2 Q. For a fresh dataset (Xi,yi)i [n] consider the signal relevance scores ˆri,t = ri,0 = (q + yiw ) Gq(0, ˆw1) for t R. There exist positive absolute constants ci,c i,i = 0,1,2,3 such that the following statements hold. With probability at least 1 c 0e c0u2 0 c 1ec1u2/3 1 , the signal relevance scores satisfy min i [n]ri,0 ηζ(ζ ζ2)W 2Q((1/4 ρ2/8 2 c0)Q (9/4 ρ + 2 c0)W) u1 c1 Q2 log(n) n = B(u1). (48) With probability at least 1 c 0e c0u2 0 c 3dec3u2/3 3 , the norm of Gq(0, ˆw1) satisfies Gq(0, ˆw1) ηζ (ζ ζ2) W 2Q(13/4 + 2 c0) + u3 c1σQlog(n) Proof. The lemma follows by collecting (37), (40), (38), (41), and applying union bound for the noise terms over i [n],t Rc. From the lemma above, we can show that provided Q is large enough with respect to W and ρ is small enough (see (50) below) then the normalized signal relevance score is with high probability over the training set proportional to Q. Corollary 1. Suppose there exists positive constant α (0,3/16) such that AQ = (3/16 ρ2/8)Q (9/4 ρ + 1/16)W α Q. (50) Then, for sufficiently small step-size η Q 2 and large enough n there exist constants c 0,c0,c 1,c1,c3,c 3 such that with probability at least 1 c 0 exp( c0n ((W/Q)2 ζ2W 2T d )) c 1 exp( c1n1/3(W/Q)4/3ζ4/3/log2/3(n)) c 3dexp( c3(n/d)1/3(W/Q)4/3ζ4/3/log2/3(n)) , (51) it holds that (q + yw ) ( Gq(0, w1) Gq(0, w1) ) (α/14) Q. 
Proof. Suppose (50) holds and recall the notation AQ defined therein. Further suppose n is large enough such that (47) holds so that we can invoke Lemma 10. Therein set u1 ηW 2ζ2 n log(n) (W/Q)2ζ2 n log(n), u3 ηW 2ζ2 n d (W/Q)2ζ2 n and u0 n((W/Q) ζW Note that the latter condition is consistent with (47). On the other hand, the former two conditions are chosen so that the following two hold. ηζ(ζ ζ2)W 2Q AQ > 2u1 c1Q2 log n n . Thus, (48) and setting 2 c0 = 1/16 give (q + yw ) Gq(0, w1) 1 2ηζ(ζ ζ2)W 2QAQ On the Role of Attention in Prompt-tuning ηζ(ζ ζ2)W 2Q > u3 c1Qlog n Thus, from (49) Gq(0, ˆw1) 2ηζ (ζ ζ2) W 2Q(13/4 + 1/16). Combining the above we conclude with the desired: the signal correlation is lower bounded by (q + yw ) Gq(0, w1)/ Gq(0, w1) AQ/14 (α/14) Q. Having shown that the second gradient step has high signal relevance score, we can use it to show in the lemma below that the attention features are close to the signal features with high-probability. This property will play a key role in finalizing the finite-sample analysis. Indeed, the rate of the event in (55) will turn out to govern the error rate of our algorithm. Lemma 11 (From learning the context to learning good features). Suppose the second gradient step q1 has correlation coefficient α with q + yw for any y { 1}, i.e., (q + yw ) q1/ q1 αQ. Then, for any ϵ > 0 there exists sufficiently large γ (ϵ) and absolute constant c > 0 such that setting qγ 1 = γ q1 q1 for any γ γ (ϵ), we have P(X,y)( X ϕ(Xqγ 1 ) (q + yw ) ϵ) 1 2Te c α2Q2 Proof. Recall X consists or ζT relevant and (1 ζ)T irrelevant tokens. For relevant tokens, we have by assumption that B = (q + yw ) q1 αQ, where we denote q1 = q1/ q1 , for convenience. Let z1,...,z(1 ζ)T denote the irrelevant tokens, which are σ-subgaussian. In order to ensure that softmax-attention perfectly selects the relevant tokens, we require M = max 1 t (1 ζ)T z T t q1 αQ/2 = B/2. Condition on this event, which holds with at least probability 1 Te cα2Q2 σ2 . Then, the attention coefficients at = [ϕ(γXq1)]t are as follows: t relevant: at = 1 ζT + (1 ζ)Teγ(M B) 1 ζT + (1 ζ)Te γB/2 = 1 t irrelevant: at = 1 ζTeγ(B M) + (1 ζ)T 1 ζTeγB/2 + (1 ζ)T = 1 (1 ζ)T a I = 1 (1 ζ)T 1 1 + ζ 1 ζ eγB/2 . Therefore, X ϕ(γXq1) (q + yw ) (Q + W)(1 a R) + a I max 1 t (1 ζ)T zt . To continue, further condition on the event max 1 t (1 ζ)T zt Cσ d + αQ (53) which holds with probability at least 1 (1 ζ)Te cα2Q2/σ2. Also note that 1 a R = a I. Hence, with probability at least 1 2Te cα2Q2/(8σ2) X ϕ(γXq1) (q + yw ) ((1 + α)Q + W + Cσ The right hand-side above can be made smaller than ϵ by choosing γ large enough (depending on ϵ,α,Q,W,σ and d). This completes the proof. Combining Lemma 11 and Corollary 1 we arrive at the following result, which we state as a stand-alone theorem since it summarizes the effect of the first two-gradient steps on learning good (aka relevant) features. On the Role of Attention in Prompt-tuning Theorem 5 (Summary result of first two GD steps). Suppose Q,W and ρ are such that there exists positive constant α (0,3/16) for which AQ = (3/16 ρ2/8)Q (9/4 ρ + 1/16)W α Q. Fix any ϵ > 0. For sufficiently small step-size η Q 2, large enough n and sufficiently large γ = γ (ϵ) such that the following statements are true about the first two gradient steps: w1 = η Gw(0,0) qγ 1 = Gq(0, w1), for any γ γ . 
There exist constants c 0,c0,c 1,c1,c3,c 3,c such that with probability at least 1 c 0 exp( c0 n ((W/Q)2 ζ2W 2T d )) c 1 exp( c1n1/3(W/Q)4/3ζ4/3/log2/3(n)) c 3dexp( c3(n/d)1/3(W/Q)4/3ζ4/3/log2/3(n)) , (54) it holds for any fresh sample (X,y) that P( X ϕ(Xqγ 1 ) (q + yw ) ϵ) 1 2Te c α2Q2 Moreover, it holds that C.4. Third gradient step With the characterization of the quality of learnt features in Theorem 5, we are now ready to turn our attention to the third gradient step. For this last step, it turns out all we need is that w2 = η Gw( q1, w1) has a strictly positive correlation with w . This is indeed the case and the result is formalized in the lemma below. Lemma 12 (Third step). Suppose that the first and second gradient steps w1, q1 are such that the following hold. First, for absolute constant cη > 0 w1 cηζW/Q2 . Second, for ϵ > 0 and any fresh datapoint (X,y) there exists δ that does not depend on ϵ such that P(X,y) ( X ϕ(X q1) (q + yw ) ϵ) 1 δ. Consider the third gradient step w2 = η Gq( q1, w1). There exists absolute constants c,C such that, for all sufficiently small ϵ and cη, with probability at least 1 nδ 2ecn, w 2w w2 C W 2 Proof. Note that the lemma s conclusion is insensitive to the choice of step-size η for the third gradient step. Thus, without loss of generality, assume below that η = 1. Recall the gradient formula Gw(q,w) = w LD(θ) = 1 n i [n] (yi fθ(Xi))X i ϕ((Xiq)) and denote for convenience ϵi = X i ϕ(Xiq) (q + yiw ), i [n]. On the Role of Attention in Prompt-tuning With this notation, we can conveniently rewrite the third gradient step evaluated at q = qγ = γ Gq(0, w1) for any γ > 0 as follows: w2 = Gw(q, w1) = 1 n i [n] yi X i ϕ(Xiq) 1 n i [n] ( w T 1 X i ϕ(Xiq))X i ϕ(Xiq) = w + q 1 n i [n] yi + 1 n i [n] yi (X i ϕ(Xiq) (q + yiw )) n i [n] ( w T 1 X i ϕ(Xiq))(q + yiw ) 1 n i [n] ( w T 1 X i ϕ(Xiq))(X i ϕ(Xiq) (q + yiw )) = w + q 1 n i [n] yi + 1 n i [n] yiϵi n i [n] w 1ϵiϵi 1 n i [n] w 1(q + yiw )ϵi 1 n i [n] w 1ϵi(q + yiw ) 1 n i [n] w 1(q + yiw )(q + yiw ). (56) In order to control the correlation w 2w / w2 it suffices to control w 2v for arbitrary v Rd. In view of the expression above, it will suffice bounding the two random terms below: Term I = RRRRRRRRRRRR 1 n i [n] yi RRRRRRRRRRRR Term II = ϵi = X i ϕ(Xiq) (q + yw ) For the first term, we have with probability at least 1 2eu2 1/2 that Term I u1/ n. For the second term, we know by assumption that with probability at least 1 nδ, Putting the above together, with probability at least 1 2e u2/2 nδ, we have that w w2 W 2 ρ QW u1 n ϵW 3 2ηζ (ϵ2W 2 + 2ϵW 2 (Q + W) + W 2 (Q + W)2) 2ηζ (ϵ2 + 2ϵ(Q + W) + (Q + W)2)) where we also used the lemma s assumption on w1 (3/2)ηζW. To further lower bound w w2, recall that ρ W/Q, W 1 and that ϵ can be made arbitrarily small constant. Further pick u1 = c1 n and η = cη/Q2 (57) for sufficiently small constants c1 and cη. With these, we guarantee with probability at least 1 2e cn nδ that w 2w W 2 . (58) Next, we use similar arguments to bound w2 . Conditioning on the event where the bounds derived above hold for Term I and Term II, we have from (56) that w2 W + Q u1 n + ϵ + 3 2ηζW (ϵ2 + 2ϵ(Q + W) + (Q + W)2) Q, where in the second inequality, we chose u1,η as in (57) and used again that ϵ is arbitrarily small constant, as well as, Q > W 1. All the above combined, shows that w 2w w2 W 2 This completes the proof. On the Role of Attention in Prompt-tuning C.5. De-biasing step Lemma 13 (Debiasing predictions). 
For some ϵ > 0, suppose q1 is such that a test example (y,X) satisfies P(X,y) ( X ϕ(X q1) (q + yw ) ϵ) 1 δ and that w2 is such that w 2w w2 > 4ϵ. Given a fresh dataset S = (yi,Xi)n i=1, set b = 1 n n i=1 fθ(Xi) = 1 n n i=1 w 2vi where vi = X i ϕ(Xiq1). Set the debiased classifier f θ(X) = fθ(X) b. Suppose n 8log ( 2 δn). Then, with probability 1 2δn over S, the test error of f θ obeys ERR(f θ) δ. Proof. First, let us prove the following intermediate statement: With probability 1 2δn over S, for a new test sample (y,X), with probability 1 δ, yf θ(X) w 2w n w 2w + 2ϵ w2 . (59) To see the above, start with observing that, with probability 1 nδ over the dataset (yi,Xi)n i=1, for each vi, w 2vi w 2(q + yiw ) ϵ w2 . Set b = w 2q abd y = 1 n n i=1 yi . With probability 1 δn, y 2 log(2/δn) n . Combining, with overall probability at least 1 2δn, the classifier bias obeys To finalize, for a new sample (y,X), with probability 1 δ, we have that w 2v w 2(q + yw ) ϵ w2 where v = X ϕ(Xq1). Thus, the prediction f (X) = f(X) b obeys yf θ(X) y(fθ(X) b) b b w 2w n + ϵ w2 . (60) To conclude with (59), note that y(fθ(X) b) w 2w ϵ w2 , and apply triangle inequality with (60). To prove the statement of the lemma, note that, when n 8log(2/δn) and w 2w > 4ϵ w2 , a test sample (with 1 δ probability) obeys yf θ(X) w 2w n w 2w 2ϵ w2 0.5w 2w 2ϵ w2 > 0. (61) Thus, the classifier makes the correct decision with the same probability. C.6. Finishing the finite sample analysis Theorem 6 (Main theorem: Finite-sample). Suppose Q,W and ρ are such that there exists positive constant α (0,3/16) for which (3/16 ρ2/8)Q (9/4 ρ + 1/16)W α Q. (62) On the Role of Attention in Prompt-tuning Fix any ϵ > 0. For sufficiently small step-size η Q 2, sufficiently large step-size γ = γ(ϵ), and large enough n, there exist constants c j,cj,j = 0,1,2,3,4 such that the following statements hold with probability at least 1 c 0 exp( c0 n ((W/Q)2 (ζ2W 2T d 1))) c 1 exp( c1n1/3(W/Q)4/3ζ4/3/log2/3(n)) c 3dexp( c3(n/d)1/3(W/Q)4/3ζ4/3/log2/3(n)) 2n T exp( c4α2Q2) , (63) over the training set: 1. Prompt attends to relevant tokens: For any test sample (X,y), with probability at least 1 2T exp( c4α2Q2), the attention coefficients at = [ϕ(X q1)]t after the second gradient step satisfy: ζT t relevant ϵ (1 ζ)T t irrelevant . (64) 2. Prompt learns relevant features: The prompt attention mechanism outputs relevant tokens with the same probability. Concretely, P(X,y) ( X ϕ(X q1) (q + yw ) ϵ) 1 2T exp( c4α2Q2) . 3. Test error: The test error of the model f θ satisfies ERR(f θ) 2T exp( c4α2Q2) . Proof. The theorem follows by combining Theorem 5, Lemma 12 and Lemma 13. D. Proofs for Population-gradient Analysis in Section 4.2 This section includes the missing proofs of all the results in Section 4.2 regarding population analysis of Algorithm 9. D.1. Proof of Lemma 2 We repeat here the lemma for convenience also stated for general (not necessarily isotropic) noise covariance Σ. Lemma 14. The second population gradient step q1 = γGw(w1,0) satisfies the following for α = ηζ Gq(0,αw ) = ((ζ ζ2)(αW 2 + α2w Σw /T) α2(ζ2 ζ3)(W 4 + (w q ) 2)) q + (((ζ ζ2) 2(ζ2 ζ3)αW 2)α (w q )) w ((1 + 2/T)(ζ ζ2)α (w q ))αΣw (65) Proof. The lemma follows immediately from Eqn. (19) of Lemma 5 by recognizing that for w = αw it holds Rq = αq w and Rw = αW 2. D.2. Corollary 2 Corollary 2. Suppose small enough step-size η obeying η (ζ2 (W 2 + Q2) ζ σ2/T) 1/2. (66a) ηζ (2ζW 2 + (1 + 2/T)σ2) 5/4. (66b) ηζ (σ2/T) 1/2. 
(66c) Then, for C1 [1/2,3/2] and C2 [ 1/4,1], we have that q1 = γηζ(ζ ζ2)W (C1Wq + C2ρQw ) . In particular, q q1 = γηζ(ζ ζ2)W 2Q2 (C1 + C2ρ2) and w q1 = γηζ(ζ ζ2)W 3Qρ(C1 + C2). On the Role of Attention in Prompt-tuning Proof. Set α = ηζ and 3/2 C1 = (1 + ασ2/T) αζ (W 2 + ρ2Q2) 1/2. (67a) 1 C2 = 1 2αζW 2 (1 + 2/T)ασ2 1/4. (67b) The gradient formula follows directly from (10). For the lower/upper bounds on C1,C2 use (66a), (66b) and (66c). Remark 2 (Condition on correlation). To classify correctly the signal tokens, we need yw (q + yw ) > 0 yρQ + W > 0 ρ < W/Q (68) Note that if (68) holds, then C1 1 + ασ2/T 2αζW 2 = C2 + (1 + 3/T)ασ2. (69) D.3. Proof of Theorem 2 We start by showing that for any ϵ > 0, and all sufficiently large γ γ (ϵ), it holds P(X,y) D( X ϕ(Xqγ 1 ) (q + yw ) ϵ) 1 2Te c α2Q2 σ2 = 1 δ. (70) We can get this by applying Lemma 11 provided only that we show the correlation of the normalized gradient-step qγ 1 / qγ 1 (= q1 below) with signal-relevant tokens is at least αQ. Concretely, we have from Corollary 2 that qγ 1 = γq1 = γηζ(ζ ζ2)W (C1Wq + C2ρQw ) with 3/2 C1 1/2, 1 C2 1/4. Define for convenience q1 = ηζ(ζ ζ2)W (C1Wq + C2ρQw ) and consider the normalized gradient step q1 = q1 q1 2 = C1Wq + C2ρQw C2 1W 2Q2 + C2 2ρ2Q2W 2 + 2C1C2ρ2W 2Q2 = C1Wq + C2ρQw QW C2 1 + C2ρ2(C2 + 2C1) . We can lower-bound its correlation with a signal token q + yw as follows: q T 1 (q + yw ) = C1W(Q2 + yρWQ) + C2ρQ(ρWQ + y W 2) C2 1 + C2ρ2(C2 + 2C1) = C1(Q + yρW) + C2ρ(ρQ + y W) C2 1 + C2ρ2(C2 + 2C1) = (C1 + C2ρ2)Q + yρ(C1 + C2)W) C2 1 + C2ρ2(C2 + 2C1) (C1 + C2ρ2)Q + yρ(C1 + C2)W) 1 + 3ρ2 1 + (C2/C1)ρ2 1 + 3ρ2 Q ρ (1 + C2/C1) (1 + (C2/C1)ρ2)Q ( ρ (1 + C2/C1))W (1 ρ2/2)Q 2 ρ W where: (i) the inequality C2 1 + ρ2C2(C2 + 2C1) C1 1 + 3ρ2 used in in the third line follows because C1 > 0 and C2 C1 from (69); (ii) the penultimate inequality uses C2/C1 [ 1/2,1] (for the lower bound recall C2 1/4,C1 1/2); (iii) the last inequality is because of the theorem s assumption in (13). Next, recall that wγ 2 = Gw(0,qγ 1 ) = E(X,y) D [y X ϕ(Xqγ 1 )] = E(X,y) D [yv(X)] , (71) where we set v(X) = X ϕ(Xqγ 1 ) for convenience. Thus, the prediction of the model with parameters θ = (wγ 2,qγ 1 ) for test datapoint ( X, y) is ˆo = yfθ( X) = y wγ 2,v( X) = wγ 2, yq + w ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ˆo1 + y wγ 2,v( X) (q + yw ) ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹¹ ˆo2 On the Role of Attention in Prompt-tuning Let E denote the event for which v( X) (q + yw ) ϵ, which has probability at least 1 δ by (70). Note that ERR(fθ) = P(ˆo < 0) P(ˆo < 0 E) + Pr(E) = P(ˆo < 0 E) + δ . (73) Hence, our goal below is to bound P(ˆo < 0 E). In fact, we will show that P(ˆo < 0 E) 0, so the error rate is δ as stated in the theorem. To do this, condition on E for which ˆo2 ϵ wγ 2 , thus ˆo ˆo1 ϵ wγ 2 . Further note that ˆo1 = wγ 2,w wγ 2,q . Thus, it suffices to show that wγ 2,w wγ 2,q + ϵ wγ 2 . (74) For this, go back to wγ 2 and write continuing from (71) wγ 2 = E(X,y) D [yq + w ] + E(X,y) D [y (v(X) (q + yw ))] = w + E(X,y) D [y (v(X) (q + yw ))] . Denote for convenience e = e(y,X) = v(X) (q + yw ). Thus, wγ 2,w = W 2 + E[y w ,e ] wγ 2,q = ρQW + E[y q ,e ] wγ 2 W + E[ e ] . where in the last line we used triangle and Jensen s inequalities. We now compute (recall E is the event for which e ϵ): E[ w ,e ] E[ w ,e E] + E[ w ,e Ec]P(E) ϵW + W δ E[ e Ec] . (75) Similarly, we can upper bound E[ q ,e ] and E[ e ]. Combining these with the above displays, the desired Eq. 
(74) holds provided: W 2 ρ QW ϵ(W + Q) + δ (W + Q) B + ϵδB + ϵ2 . (76) Above, we have denoted B = E[ e Ec]. wγ 2 W + B . Note that the LHS of 76 is > 0 because of Assumption 3.a that ρ W/Q . Thus, we can guarantee (76) holds once ϵ is small enough (by making γ large enough) and δ is also small enough (by making γ large enough). It only remains to bound B. To do this, note that e 2 v(X) + Q + W max t [T ] xt + Q + W 2(Q + W) + max t Rc zt . By Lemma 15 we further have that E[max t Rc zt 2 Ec]P(Ec) δ Cσ log (2T/δ). Hence, B 2(Q + W) + Cσ log (2T/δ). D.3.1. AUXILIARY LEMMA Lemma 15 (Subgaussian euclidean-norm tail control). Let Let zi Rd,i [N] be K-subgaussian random vectors. Then, for any event E with P(E) = δ, it holds that E[max i [N] zi Ec] 12K log (2N/δ). On the Role of Attention in Prompt-tuning Proof. Set Z = maxi [n] zi and define event B = {Z M} for M = 4K log (2N/δ). By Fact G.4 for all t > 0, P(Z > t) 2Ne t2/(16d K2). Thus, by choice of M, P(B) P(E) = δ. Denote the pdf and cdf complement of Z by f Z,QZ respectively. Observe that, we set QZ(M) δ. Using integration by parts we have, E[Z B]P(B) = M zf Z(z)dz = M zd QZ(z) = M QZ(z)dz [QZ(z)z] M M QZ(z)dz + δM M P(Z t)dt δM + M 2Ne t2/(16d K2)dt 2 log(2N/δ) e u2/2du log(2N/δ) + 2 πK We can conclude the proof by noting: E[Z E]P(E) = E[Z E Bc]P(E Bc) + E[Z E B]P(E B) Mδ + E[Z B]P(B). E. Proofs of results on discrete datasets E.1. Proof of Theorem 1 and Observation 1 Proof for Prompt-attention: Let w = w / w and q = q / q . q be the projection of q to the orthogonal complement of w i.e. q = q w w q . Similarly, let w be the projection of w to the orthogonal complement of q i.e. w = w q q w . Denote correlation coefficient between two vectors by ρ(a,b) = a b a b . To proceed, observe that, q q = q 2 ( w q )2 = q 2(1 ρ(q ,w )2) > 0. The positivity follows from the fact that q ,w are not parallel, thus, the absolute value of their correlation coefficient is strictly bounded away from 1. Similarly w w = w 2(1 ρ(q ,w )2)>0. To proceed, set ρ = 1 ρ(q ,w )2 and observe that the classifier θ = (w ,Γq ) achieves the attention scores ai = ϕ(Xq )i = S 1e q 2Γ ρ if i relevant, S 1e q 2Γδq ρ if i irrelevant, where S = Tζe q 2Γ ρ + T(1 ζ)e q 2Γδq ρ. Using orthogonality of w and q , the final prediction obeys yfθ(X) = w 2 ρS 1 [ζe q 2Γ ρ δw(1 ζ)e q 2Γδq ρ] . The classifier achieves perfect accuracy when ζe q 2Γ ρ > δw (1 ζ)e q 2Γδq ρ. Since we have δq 0 and we have assumed δw is a C-bounded variable (i.e. δw C), thus, the desired inequality can be guaranteed by choosing Γ > 1 q 2 ρ log(C(1 ζ) Proof for Observation 1: To prove this, observe that for any δq = q, δw = w choices, using orthogonality of q ,w , for any (y,X) D, we have yf LIN(w ) = w 2 ρ(ζ (1 ζ)δw). Thus, as long as δw ζ/(1 ζ), sign(yf LIN(w )) is always 1 or always 1, resulting in perfect accuracy for w or w . Proof for Self-attention: The proof is provided under Theorem 7. On the Role of Attention in Prompt-tuning Proof for Linear Prompt-attention: Let W1 = w w ,W2 = w q , Q1 = q q ,Q2 = q w . Since context-irrelevant tokens are of the form δqq yδww , the model decision is given by 1 T w X Xq = ζw (yw + q )(yw + q ) q + (1 ζ)w (yδww + δqq )(yδww + δqq ) q and 1 T f(X) = ζ(y W1 + W2)(Q1 + y Q2) + (1 ζ)(yδw W1 + δq W2)(yδw Q2 + δq Q1) = ζy(W1Q1 + W2Q2) + ζ(W2Q1 + W1Q2)+ (1 ζ)yδqδw(W1Q1 + W2Q2) + (1 ζ)(δq2W2Q1 + δw2W1Q2). T = (ζ + (1 ζ)δqδw)(W1Q1 + W2Q2) + y((ζ + (1 ζ)δq2)W2Q1 + (ζ + (1 ζ)δw2)W1Q2). To proceed, set (δq,δw) to be (0,0) or ( , ) equally-likely for > ζ/(1 ζ). 
For fixed , for any choice of W1,W2,Q1,Q2 observe that, with 1/2 probability the event E = {y((ζ +(1 ζ)δq2)W2Q1 +(ζ +(1 ζ)δw2)W1Q2) 0} happens. On this event (which is over the label y), probability that (ζ + (1 ζ)δqδw)(W1Q1 + W2Q2) > 0 is at most 1/2 because sign(ζ + (1 ζ)δqδw) is Rademacher variable. Combining, we find that P( yf(X) T 0) 25% as advertised whenever > E.2. Failure proof for Self-attention We have the following theorem regarding self-attention. Theorem 7. Fix > 0 to be sufficiently large. In (DATA), choose δ = (δq,δw) to be (0,0) or ( , ) equally-likely, where > 1/(1 ζ)2. For any choice of (U = 1u ,W ), f SATT(1u ,W ) achieves 50% accuracy (i.e. random guess). For any choice of (U,W ), there exists a (DATA) distribution with adversarial relevance set choices such that f SATT(U,W ) achieves 50% accuracy. Here, adversarial relevance set choice means that, the relevance set can be chosen adaptively to the label y, out-of-context term δ, and the self-attention model weights (U,W ) to cause misclassification. Proof. Let w = W w and q = W q . Also let bw = u w and bq = u q . Since W is allowed to be full-rank and arbitrary, w, q are allowed to be arbitrary as well (but fixed given W ). In our analysis, the critical terms are the attention weights given by the correlation between the relevant/irrelevant keys/queries. Setting attention queries as the raw tokens (without losing any generality), relevant queries x R and keys k R become x R = yw + q , k R = y w + q. Thanks to our choice of δ = δw = δq to be equally-likely in {0, }, observe that irrelevant queries and keys are simply x I = δx R, k I = δk R. This will greatly help the proof because it will mean that attention weights are highly structured. Specifically, set ρ = x Rk R. All weights of the attention similarities belong to the set (ρ, δρ,δ2ρ). Consequently, softmax-attention output A = ϕ(XW X )X = is given by ζeρ δ(1 ζ)e δρ ζeρ+(1 ζ)e δρ 1 T x R if i R (relevant) ζe δρ δ(1 ζ)eδ2ρ ζe δρ+(1 ζ)eδ2ρ 1 T x R if i Rc (irrelevant) . (77) On the Role of Attention in Prompt-tuning Set a+ = eρ ζeρ+(1 ζ)e δρ , a = e δρ ζeρ+(1 ζ)e δρ , b = e δρ ζe δρ+(1 ζ)eδ2ρ , b+ = eδ2ρ ζe δρ+(1 ζ)eδ2ρ . With this, we also set R = ζeρ δ(1 ζ)e δρ ζeρ + (1 ζ)e δρ = ζa+ δ(1 ζ)a I = ζe δρ δ(1 ζ)eδ2ρ ζe δρ + (1 ζ)eδ2ρ = ζb δ(1 ζ)b+. Also define i = R if i is relevant and I otherwise. With this, we have ai = ix R based on (77). The following lemma will be helpful for the downstream analysis. The goal of this lemma is showing that, by choosing δ {0, }, we can confuse the model output. Lemma 16. Fix a scalar κ. Set fδ = κ R + (1 κ) I. Recalling ρ = x Rk R, the following statements hold: Set δ = 0. Suppose 1 κ 0 OR κ 1,ρ 0 OR κ 0,ρ 0 . Then fδ > 0. Fix 0 α 1. Suppose δ > 0 = 1 1 ζ max( ζ α(1 ζ), 1 1 α). and κ α,ρ 0 OR κ α,ρ 0 . Then fδ < 0. Proof. Plugging in δ, we write fδ = κ R + (1 κ) I = κζa+ δκ(1 ζ)a + ζ(1 κ)b δ(1 ζ)(1 κ)b+ (78) = ζ(κa+ + (1 κ)b ) δ(1 ζ)(κa + (1 κ)b+). (79) Suppose δ = 0. In this case, we obtain the first statement of the lemma as follows ζeρ + 1 ζ + 1 κ > 0 whenever 1 κ 0 OR κ 1,ρ 0 OR κ 0,ρ 0 (80) Now suppose δ > 0. First, assume ρ 0 and κ α. We use the facts 1/ζ a+ 1, 1 a 0, b+ 1, 1 b 0. Observe that, since b+ a and κ α κa + (1 κ)b+ b+ if κ 0 (1 α)b+ if κ 0 1 α. Additionally, if κ 0, we have that κa + (1 κ)b+ b+ b κa+ + (1 κ)b . If κ 0, we obtain fδ ζb δ(1 ζ)b+. Thus, fδ < 0 whenever δ > 0 ζ/(1 ζ). If κ 0, we use κa+ + (1 κ)b 1/ζ to obtain that whenever δ > 0 1 (1 ζ)(1 α) fδ 1 δ(1 ζ)(1 α) < 0. Now assume ρ 0 and κ α. 
We use the facts 1 a+ 0, a 1, 1 b+ 0, 1 1 ζ b 1. On the Role of Attention in Prompt-tuning Observe that, since b+ a and κ α κa + (1 κ)b+ a if κ 1 αa if κ 1 α. Additionally, if κ 1, we have that κa + (1 κ)b+ a a+ κa+ + (1 κ)b . If κ 1, we obtain fδ ζa+ δ(1 ζ)a . Thus, fδ < 0 whenever δ > 0 ζ/(1 ζ). If κ 1, we use κa+ + (1 κ)b 1 1 ζ to obtain that whenever δ > 0 ζ (1 ζ)2α fδ ζ 1 ζ δ(1 ζ)α < 0. To proceed, we will conclude with the proof as follows. Set νi = yu i x R for i [T] where ui is the ith row of the output layer weights U. Here νi is obviously y-dependent. However, we will show that for any choice of y, the model accuracy is at most 50%. Towards this we fix y and (mostly) omit it from the notation during the following discussion. Let ai be the ith token of the attention output. The linear output layer U aggregates u i ai to obtain yf(U,W ) = T i=1 u i ai = T i=1 νi i. Aggregating v+ = 1 T i R=relevant vi and v = 1 T i Rc=irrelevant vi and recalling from (77) that over relevant/irrelevant sets attention tokens are given by Rx R and Ix I, we find 1 T yf(U,W ) = νR R + νI I. Scenario 1: Rows of U are identical and we have U = 1u . In this scenario, we simply have νi = ν and νR = ζν and νI = (1 ζ)ν. Thus, we find 1 T yf(U,W ) = Tν[ζ R + (1 ζ) I]. Set fδ = ζ R + (1 ζ) I. We claim that sign(fδ) is Rademacher (given arbitrary y choice) which will prove that accuracy is at most 50%. Specifically, let us apply Lemma 16 with κ = ζ and α = ζ. When δ = 0, we have fδ > 0. When δ = , since the conditions κ α and κ α hold, for any choice of ρ, for > 0 = 1 (1 ζ)2 we have that fδ < 0. Thus, we have that Pδ(fδ > 0) = Pδ(fδ < 0) = 0.5 as advertised. This follows from the fact that fδ > 0 for δ = 0 and fδ < 0 for δ = and δ is equally likely over two options. Scenario 2: Suppose rows of U are not identical. In this case, we will leverage the fact that relevant set R is allowed to be chosen adversarially with respect to the self-attention weights. We will show that by selecting R adversarially, on any label y event, accuracy is a coin flip. First consider the scenario νtot = νR + νI 0: We will show that model achieves at least 50% error on label y: Let us denote νR with νR R which makes the relevance set dependence explicit. Given R, fixing δ = 0, the model outputs (following (80)) 1 T yf(U,W ) = νR R eρ ζeρ + 1 ζ + νR I . Suppose there is a relevance set R0 (that depends on y) such that the right hand side is non-positive. Let us select this R0 as our relevance set. Then, the model makes 50% error on label y thanks to the event δ = 0 (which is exactly what we want). If there is no such R0, then, for all R, we have νR R eρ ζeρ + 1 ζ + νR I > 0 On the Role of Attention in Prompt-tuning By taking average of all relevance sets ( T choose ζT many), all vi s will be equally-weighted and we obtain νtot = νR + νI > 0. This contradicts with our initial νtot 0 assumption, thus, R0 has to exist. Now consider the scenario νtot = νR + νI > 0: Let D be the uniform distribution over T choose ζT relevant sets R. Clearly ED[νR R ] = ζνtot > 0. Thus, there is a relevance set R+ such that νR+ R ζνtot and there is a relevance set R such that νR R ζνtot. We will make use of these two sets to finalize the proof. To proceed, set κ = νR /νtot and again set α = ζ and 0 = 1 (1 ζ)2 in Lemma 16. Here, we are investigating the sign of the prediction 1 Tνtot yf(U,W ) = νR R + νI I νtot = κ R + (1 κ ) I. First, assume that the attention weights are so that ρ = ρy 0. 
In this case (and for this particular label y), When δ = 0, we choose the relevance set R+ which ensures κ+ ζ 0 and f0 > 0. When δ = > 0, we choose the relevance set R which ensures κ ζ and f < 0. Secondly, assume that the attention weights are so that ρ = ρy 0. In this case, When δ = 0, we choose the relevance set R which ensures κ ζ 1 and f0 > 0. When δ = > 0, we choose the relevance set R+ which ensures κ+ ζ and f < 0. In either case, by adaptively choosing R {R+,R } as a function of (δ,y) pair, we ensure accuracy is at most 50% because f and f0 have conflicting signs. E.3. Success proof for R-Adaptive Self-Attention Consider the setting of Theorem 1 and Appendix E.2. We have the following lemma which shows that self-attention can succeed in Theorem 1 if U can adapt to the relevance set (rather than R being adversarial to U). Lemma 17. In (DATA), choose (δq,δw) to be (0,0) or ( , ) equally-likely. Consider the self-attention model f SATT(U,W ) where we set U = 1Rw and W = ΓI. This model achieves perfect accuracy whenever w = (I q q )w 0 by choosing Γ > 1 (1 + )( q w )2 + w q (1 ρ(q ,w ) ) log( 1 ζ where ρ( ) is the correlation coefficient.3 Proof. Thanks to the masking 1R, we only need to consider the attention scores along relevant tokens. Let c = yw + q 2. For each relevant token, the attention rows are given by eΓc if i R e Γc if i / R. To proceed, attention tokens corresponding to relevant tokens are given by f = i R ai(w + yq ) i/ R ai(yw + q ) (81) = (ζeΓc (1 ζ)e Γc)(yw + q ). (82) 3Note that the only instance Γ does not exist is when q = cw for c 1. In this scenario, classification is impossible using the linear head w without a bias term because all tokens are in the sign(c) direction regardless of the label y. On the Role of Attention in Prompt-tuning Thus, using w w > 0, sign(yf SATT(U,W )) = sign(yw f) = sign(ζeΓc (1 ζ)e Γc). Thus, we need e(1+ )Γc > 1 ζ ζ which is implied by Γ > 1 (1+ )c log( 1 ζ ζ ). To conclude, note that for both y = 1 c yw + q 2 q 2 + w 2 2 q w ( q w )2 + w q (1 ρ(q ,w ) ) > 0. where we used q w = q w ρ(q ,w ) . F. Proofs of sharp population risk formulas (Theorem 4) Throughout this section, we use slightly different notation from the one stated in the main body for compactness purposes. Specifically, we set Q = q 2,W = T w 2 rather than Q = q ,W = w . Theorem 8. Consider the prompt-attention model f ATT θ . Set Q = q 2,W = T w 2, suppose w q , and let τ, τ > 0 be hyperparameters. Consider the following algorithm which uses the hindsight knowledge of q to estimate w and make prediction: 1. ˆw = (I q q ) Lw(0,τ q ). 2. Set θ = ( ˆw, τ q ). Suppose ζ2W,1 ζ,α = n/d,e Q,eτ each lie between two positive absolute constants. Suppose T is polynomially large in n and these constants and O( ) hides polynomial terms in n. Define inverse-signal-to-noise-ratio: ISNR(α,τ) = ζ2W α . With probability 1 2e t2/2 O(T 1/3) over the training data, the test error obeys ERR(f ATT θ ) = Q d )ISNR(α,τ) Above, , highlights the upper/lower range of the test error (see (86) for exact statement). In the limit T,d , the test error converges in probability to ERR(α,ζ,Q,W,τ, τ) = Q e Q τ τ 2 1 + ISNR(α,τ) In this limit, optimal hyperparameters are τ = τ = Q/2 and leads to optimal ISNR(α) = (1 ζ)e Q/2 ζ2W α and the error ERR(α,ζ,Q,W) = Q e Q/4 1 + ISNR(α) Proof. Without losing generality, assume first ζT tokens are relevant and remaining tokens are irrelevant. Consider XI of size (1 ζ)T d induced by the irrelevant tokens with normal distribution. 
Observe that g = XI w and h = XI q are two independent i.i.d. N(0,I(1 ζ)T ) vectors. Also for standard normal g N(0,1), recall that moment-generating function is given by E[eτg] = eτ 2/2. Step 1: Characterizing the distribution of ˆw. Note that, the attention weights have the form a = ϕ(τ [ Q1ζT h ]). Here, the softmax denominator is T DT where DT = (ζe Qτ + 1 T (1 ζ)T i=1 eτhi). Define eτh to be the numerator corresponding to irrelevant tokens i.e. eτh = [eτh1 ... eτh(1 ζ)T ]. On the Role of Attention in Prompt-tuning Define the matrix Q = I q q , W = I w w . Set the vector v = 1 T Q X I eτh and v = 1 T h eτh q . To proceed, observe that, for a single sample (y,X), the gradient has the form Ly,X w (0,τ q ) = y X a = ζ(w + yq )e Qτ + v + v DT . (83) After projection this onto the q -complement Q , we get rid of the q direction to obtain ˆwy,X = Q Ly,X w (0,τ q ) = 1 DT [ζw e Qτ + Q X I eτh/T]. The projected gradient over the full dataset is given by the empirical average ˆw = Q Lw(0,τ q ) = 1 1 Di,T [ζw e Qτ + Q X i,Ieτhi/T]. Here hi,Xi,I,Di,T denote the random variables induced by the ith sample. Here, a critical observation is the fact that Q Xi,I is independent of hi (thanks to Gaussian orthogonality), thus, Q Xi,Ieτhi is normal conditioned on hi. To proceed, we apply Chebyshev s inequality over number of tokens T. Recall that we assumed eτ C for an absolute constant C 1. This means that ecτ 2 Ccτ Cc log C is polynomial in C and is also upper bounded by a constant. In what follows O( ) only reflects the T dependence and subsumes polynomial dependence on the terms n,C. For all 1 i n, applying Chebyshev s inequality, for T poly(n,eτ 2), with probability 1 T 1/3, we have that Since eτhi 2/T = 1 T T j=1 e2τhij thus eτhi 2/T (1 ζ)e2τ 2 O(T 1/3), Set E[DT ] = D = ζe Qτ + (1 ζ)eτ 2/2. Di,T D O(T 1/3). With these, set bi = eτhi eτhi which is a vector with fixed ℓ2 norm that is perfectly parallel to eτhi. Since bi 2 = E[ eτhi 2/T] = (1 ζ)e2τ 2, from above, observe that, T eτhi O(T 1/3). n i=1 Q X i,Ibi. Since Q X i,I,bi are independent and bi has fixed ℓ2 norm, we have that v N(0,(1 ζ)e2τ 2Q ). Finally, let c = ζe Qτw . Recalling T w = W, combining the perturbations bounds above, we have that 1 Di,T (ζw e Qτ) O(T 1/3). Combining these observe that Tζe Qτw v/ n O(T 1/3). (84) Since v is normally distributed, above also implies that TD ˆw converges to the normal distribution Tζe Qτw , (1 ζ)e2τ2 n Q ) in the limit T . Lemma 18 (Inverse Signal-to-Noise Ratio (ISNR)). Set W = I w w . Define SNR of ˆw to be ISNR( ˆw) = W ˆw 2 On the Role of Attention in Prompt-tuning Recall ISNR(α,τ) = (1 ζ)e2τ(τ ζ2W α . With probability 1 2e t2/2 T 1/3 over the dataset, we have that + ISNR( ˆw) ISNR(α,τ) (1 + τ Proof. Let us recall the standard normal concentration: For g N(0,Id 1), d 1 E[ g ] d 1 d . Thus, with probability 1 2e t2/2, through Lipschitz concentration, This means that, with the same probability d + t v 1 ζeτ 2 ( We first upper bound W ˆw 2. Recalling (84), W ˆw v/ n O(T 1/3). d + t)2 + O(T 1/3) n W ˆw 2 (1 ζ)e2τ 2 ( d 1 t)2 + O(T 1/3). Using w 2T = W, We similarly have that w ˆw 2 Wζ2e2 Qτ O(T 1/3). To conclude, with probability 1 2e t T 1/3, ISNR( ˆw) obeys d + t)2 + O(T 1/3) Wζ2e2 Qτ O(T 1/3) n (1 ζ)e2τ 2 ISNR( ˆw) ( d 1 t)2 + O(T 1/3) Wζ2e2 Qτ + O(T 1/3) Rewriting this bound, we find 2 (1 ζ)e2τ 2 ζ2Wαe2 Qτ ISNR( ˆw) (1 1 + t ζ2Wαe2 Qτ . Recalling the definition of ISNR(α,τ) = (1 ζ)e2τ(τ ζ2W α , we conclude with the bound. Step 2: Characterizing the error rate of θ = ( ˆw, τq ). 
To achieve this goal, we will leverage Theorem 9. Since conditions of this theorem is satisfied (noticing that their γ is our ISNR( ˆw) which is upper bounded by a positive constant), for a new test point (y,X), we have that RRRRRRRRRRR ERR(f ATT θ ) Q e Q τ τ 2 1 + ISNR( ˆw) RRRRRRRRRRR O(T 1/3). Using the Lipschitzness of the Q-function (i.e. Q(x + ϵ) Q(x) = x+ϵ x e t2/2dt ϵ), as we have done in Theorem 9, we pull out the perturbation term O(T 1/3) within ISNR( ˆw) to obtain the advertised bound d)ISNR(α,τ) O(T 1/3) ERR(f ATT θ ) (85) d )+ISNR(α,τ) + O(T 1/3). (86) On the Role of Attention in Prompt-tuning To emphasize, this bound holds with probability 1 2e t2/2 O(T 1/3) over a new test datapoint (y,X). To see the optimal choices for τ,τ, we need to optimize the error bound. This results in τ = arg min τ τ = arg min τ ISNR(α,τ) = 2τ(τ Theorem 9. Consider the prompt-attention model f ATT θ where we set θ = (w + p,τ q ). Set Q = q 2,W = T w 2. Here τ is a tuning parameter and p is a perturbation vector and assume all vectors are perpendicular i.e. p w q . Set γ = p 2/ w 2 and suppose 1 + γ,ζ2W,1 ζ,e Q,eτ each lie between two positive absolute constants. O( ) subsumes polynomial dependencies in these constants. We have that RRRRRRRRRRR ERR(f ATT θ ) Q e Qτ τ 2 1 + γ RRRRRRRRRRR O(T 1/3). Thus, as T , the optimal tuning obeys τ = Q/2 and yields an error of Q(e Q/4 Proof. Let us recap the notation of Theorem 4. Without losing generality, assume first ζT tokens are relevant and remaining tokens are irrelevant. Consider XI of size (1 ζ)T d induced by the irrelevant tokens with normal distribution. Using orthogonality of q ,w ,p, observe that g = XI( w + p w ) N(0,(1 + γ)I(1 ζ)T ) and h = XI q N(0,I(1 ζ)T ) are independent vectors. Also for standard normal g N(0,1), recall that moment-generating function is given by E[eτg] = eτ 2/2. Note that, the attention weights have the form a = ϕ(τ [ Q1ζT h ]). Here, the softmax denominator is T DT where DT = (ζe Qτ + 1 T (1 ζ)T i=1 eτhi). Define eτh to be the numerator corresponding to irrelevant tokens i.e. eτh = [eτh1 ... eτh(1 ζ)T ]. Define the matrix Q = I q q , W = I w w . To proceed, observe that, the prediction with θ = (w + p,τ q ) is given by T w yf ATT θ (X) = T( w + p w ) [ζe q τ(w + yq ) + XIeτh T g eτh. (90) With this, conditioned on eτh observe that g eτh N(0, 1 T eτh 2), thus, Pg(yf ATT θ (X) > 0) = 1 Q( ζe Qτ W 1 + γ eτh / To proceed, similar to Theorem 4, we apply Chebyshev s inequality over number of tokens T to find that with probability 1 O(T 1/3) over h, eτh 2/T e2τ 2 O(T 1/3). In aggregate, this implies that, with probability 1 O(T 1/3) over h, we have that 1 Q((1 + O(T 1/3))ζe Qτ W 1 + γeτ 2 ) Pg(yf ATT θ (X) > 0) 1 Q((1 O(T 1/3))ζe Qτ W 1 + γeτ 2 ), Finally, note that since ζe W 1+γeτ2 is upper/lower bounded by a positive constant, and since Q(x+ϵ) Q(x) = x+ϵ x e t2/2dt ϵ, we can rewrite Pg(yf ATT θ (X) > 0) Q(ζe Qτ W 1 + γeτ 2 ) O(T 1/3). Union bounding with failure probability over h, we conclude with the result. On the Role of Attention in Prompt-tuning G. Useful facts For a random variable Z and α > 0, Z ψα denotes its ψα-norm for Orlicz function ψα(z) = ezα 1 (Ledoux & Talagrand, 1991). Fact G.1. Let X1,...,Xn be independent zero-mean sub-gaussian or sub-exponential random variables with Xi ψm K for all i [n] for either m = 2 or m = 1. Then, XXXXXXXXXXXX 1 n i [n] Xi XXXXXXXXXXXXψm CK n . Fact G.2. (Pollard, 1990) The following identity holds for Orlitz norms α+β c X ψα Y ψβ (91) for a fixed numerical constant c. 
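As a quick numerical illustration of Fact G.1 (not needed for any of the proofs), the following minimal Python sketch checks that the empirical mean of n i.i.d. zero-mean sub-gaussian or sub-exponential variables shrinks at the 1/√n rate; the Gaussian and Laplace samplers, the trial count, and the reference curve are illustrative choices rather than quantities used in the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_mean(sampler, n, trials=500):
    """Largest |empirical mean of n samples| over many independent trials."""
    samples = sampler(size=(trials, n))
    return np.abs(samples.mean(axis=1)).max()

for n in [100, 1_000, 10_000]:
    sub_gauss = max_abs_mean(rng.standard_normal, n)                 # sub-gaussian case (m = 2)
    sub_exp = max_abs_mean(lambda size: rng.laplace(size=size), n)   # sub-exponential case (m = 1)
    # Both columns should shrink roughly like 1/sqrt(n): about 3.2x smaller per 10x more samples.
    print(f"n={n:6d}  sub-gaussian: {sub_gauss:.4f}  sub-exponential: {sub_exp:.4f}  1/sqrt(n)={1/np.sqrt(n):.4f}")
```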
Next we state a lemma from Talagrand, quoted directly from Lemma 22 of (Mohammadi et al., 2019).
Fact G.3 (Ledoux & Talagrand, 1991). For any scalar α ∈ (0,1], there exists a constant C_α such that for any sequence of independent random variables ξ_1, ξ_2, ..., ξ_N we have
$\big\|\sum_i \xi_i - E[\sum_i \xi_i]\big\|_{\psi_\alpha} \le C_\alpha \big(\max_i \|\xi_i\|_{\psi_\alpha}\big)\sqrt{N\log N}.$
Fact G.4. Let z ∈ R^d be a K-subgaussian vector, i.e., z⊤v is K-subgaussian for every fixed unit vector ‖v‖ = 1. Then, for a constant c > 0,
$P\big(\|z\| \ge c\,K(\sqrt{d} + t)\big) \le e^{-t^2}.$
Proof. For completeness, we provide a proof. Repeating Lemma 31 of (Oymak, 2019), we can pick a 1/2-cover C of the unit Euclidean ball in R^d with size log|C| ≤ 2d. For any v ∈ C, subgaussianity implies P(v⊤z ≥ t) ≤ exp(−ct²/K²). Setting K′ = K/√c, t = K′(√(2d) + τ), and union bounding over all v ∈ C, we find
$P\big(\sup_{v\in C} v^\top z \ge K'(\sqrt{2d} + \tau)\big) \le \exp(-\tau^2).$
To proceed, set v(z) ∈ C to be the nearest point to z̄ = z/‖z‖ in C. Since ‖v(z) − z̄‖ ≤ 0.5, note that
$\|z\| = \bar z^\top z = v(z)^\top z + (\bar z - v(z))^\top z \le v(z)^\top z + 0.5\|z\|.$
Thus, ‖z‖ ≤ 2K′(√(2d) + τ) with probability at least 1 − exp(−τ²).
H. Additional experimental results and details
H.1. Additional details for image classification experiments
Dataset. As mentioned in Section 5.2, we construct three datasets by modifying the original images in CIFAR-10:
FULL-TILED. Each example consists of a 64x64 image obtained by arranging a 32x32 image from CIFAR-10 in a tiling pattern with four tiles (cf. Fig. 3a).
PARTIAL-TILED. This dataset is similar to FULL-TILED with the exception that each image has at least T out of 4 tiles replaced by patches of i.i.d. random Gaussian noise with mean zero and variance 0.2. Note that, for each example in the dataset, T ∈ {1,2,3} is random, as is the location of the noisy tiles (cf. Fig. 3b).
EMBED-IN-IMAGENET (Karp et al., 2021). We construct an example by embedding a 32x32 image from CIFAR-10 at a random location in a 64x64 background corresponding to a randomly selected image from ImageNet (Russakovsky et al., 2015). We also add i.i.d. random Gaussian noise with mean zero and variance 0.2 to the background (cf. Fig. 3c).
Figure 5. Performance of fine-tuning vs. prompt-tuning on 10-way classification tasks defined by the PARTIAL-TILED dataset; the three panels plot top-1 accuracy against the number of prompt vectors for (a) the full dataset, (b) Capped 10%, and (c) Capped 2%. The full dataset has 50K training examples; Capped 10% and 2% correspond to sub-sampled train sets where we select exactly 500 and 100 examples per class from the full dataset. A prompt-vector count of 0 corresponds to zero-shot performance.
Table 1. Comparison between prompt-tuning and fine-tuning only first-layer self-attention weights (Full dataset).
Method (trainable parameters)                              EMBED-IN-IMAGENET   PARTIAL-TILED
PROMPT-TUNING-I w/ 100 prompt vectors (19.2K)              31.89               68.46
PROMPT-TUNING-II w/ 100 prompt vectors (19.2K)             38.48               72.04
PROMPT-TUNING-III w/ 100 prompt vectors (230.8K)           47.81               74.40
Fine-tuning only first-layer attention weights (148.2K)    30.35               73.95
By construction, each dataset has 50,000 train and 10,000 test examples, corresponding to the train and test sets of CIFAR-10. We also consider data-limited settings where we keep the test set intact but subsample the train set by selecting a fixed number of images per class.
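To make the three constructions above concrete, here is a minimal numpy sketch of how a single 64x64 example could be generated from one 32x32 CIFAR-10 image; it only mirrors the recipe described above, and the function names, the exact tile-sampling rule, and the use of a random array in place of an actual ImageNet background are illustrative assumptions rather than the code used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def full_tiled(img32):
    """FULL-TILED: tile one 32x32 image into a 2x2 grid, giving a 64x64 example."""
    return np.tile(img32, (2, 2, 1))

def partial_tiled(img32, noise_std=np.sqrt(0.2)):
    """PARTIAL-TILED: start from FULL-TILED, then replace T in {1,2,3} random tiles with Gaussian noise."""
    out = full_tiled(img32).copy()
    t = rng.integers(1, 4)                                   # number of noisy tiles
    for r, c in rng.permutation([(0, 0), (0, 1), (1, 0), (1, 1)])[:t]:
        out[32*r:32*(r+1), 32*c:32*(c+1)] = rng.normal(0.0, noise_std, size=(32, 32, 3))
    return out

def embed_in_background(img32, background64, noise_std=np.sqrt(0.2)):
    """EMBED-IN-IMAGENET: paste the 32x32 image at a random location in a noisy 64x64 background."""
    out = background64 + rng.normal(0.0, noise_std, size=background64.shape)
    r, c = rng.integers(0, 33, size=2)                       # top-left corner of the embedded image
    out[r:r+32, c:c+32] = img32
    return out

img32 = rng.random((32, 32, 3))          # stand-in for a (normalized) CIFAR-10 image
bg64 = rng.random((64, 64, 3))           # stand-in for a resized ImageNet background
examples = [full_tiled(img32), partial_tiled(img32), embed_in_background(img32, bg64)]
print([e.shape for e in examples])       # all (64, 64, 3)
```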
Note that all three datasets define 10-way multiclass classification tasks with CIFAR-10 classes as potential labels.
Model architecture. We utilize a tiny variant of the Vision Transformer model (Dosovitskiy et al., 2021) for our experiments. This variant has 12 transformer layers, with hidden dimension, MLP intermediate dimension, and number of heads per attention layer equal to 192, 768, and 3, respectively. The patch size in our study is set to 4x4. The model itself (without counting the trainable parameters/weights during prompt-tuning) has approximately 5.44M parameters. We rely on the CLS token to obtain the classification logits.
Training. We rely on the Scenic library (Dehghani et al., 2022; https://github.com/google-research/scenic) to conduct our experiments on image classification. Following the default settings in the library along with a coarse grid search, we employ the Adam optimizer (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.999, weight decay = 0.1, and batch size = 128 while training a randomly initialized model. Furthermore, we employ a linear warm-up of the learning rate followed by a cosine learning rate schedule with base learning rate 3e-3. For the fine-tuning and prompt-tuning experiments that (partially) initialize from a pre-trained model, we rely on SGD with momentum parameter 0.9 and batch size = 128 to update the trainable parameters. Again, we utilize a linear warm-up of the learning rate followed by a cosine learning rate schedule. Throughout our experiments, the base learning rates for fine-tuning and prompt-tuning are 1e-3 and 0.1, respectively.
H.2. Additional results on image classification
Figure 5 showcases the performance of fine-tuning and various prompt-tuning strategies on PARTIAL-TILED.
Comparison with fine-tuning the first self-attention layer. In Tables 1, 2, and 3, we explored fine-tuning only the first-layer self-attention parameters of the underlying ViT model. This setting aligns well with the single-layer nature of our theoretical results. Similar to Fig. 4 (corresponding to the EMBED-IN-IMAGENET dataset) and Fig. 5 (corresponding to the PARTIAL-TILED dataset), we considered three settings: 1) Full dataset; 2) Capped 10%; and 3) Capped 2%, which progressively correspond to smaller amounts of (training) data during fine-tuning and prompt-tuning.
Table 2. Comparison between prompt-tuning and fine-tuning only first-layer self-attention weights (Capped 10%).
Method (trainable parameters)                              EMBED-IN-IMAGENET   PARTIAL-TILED
PROMPT-TUNING-I w/ 100 prompt vectors (19.2K)              28.06               67.68
PROMPT-TUNING-II w/ 100 prompt vectors (19.2K)             36.76               70.95
PROMPT-TUNING-III w/ 100 prompt vectors (230.8K)           42.96               73.09
Fine-tuning only first-layer attention weights (148.2K)    18.53               70.44
Table 3. Comparison between prompt-tuning and fine-tuning only first-layer self-attention weights (Capped 2%).
Method (trainable parameters)                              EMBED-IN-IMAGENET   PARTIAL-TILED
PROMPT-TUNING-I w/ 100 prompt vectors (19.2K)              20.52               64.81
PROMPT-TUNING-II w/ 100 prompt vectors (19.2K)             33.62               68.14
PROMPT-TUNING-III w/ 100 prompt vectors (230.8K)           36.13               69.40
Fine-tuning only first-layer attention weights (148.2K)    15.46               65.04
The key takeaways are:
1. When there is a significant distribution-shift from the pre-training data (as in EMBED-IN-IMAGENET), even the simplest prompt-tuning variant, PROMPT-TUNING-I, significantly outperforms fine-tuning the first-layer self-attention parameters.
2. When the distribution-shift is small, prompt-tuning variants realize a better accuracy vs. training-cost trade-off; e.g., PROMPT-TUNING-II outperforms fine-tuning the first-layer self-attention parameters in the Capped 10% and Capped 2% settings (while training only 19.2K rather than 148.2K parameters).
H.3. Illustration of attention weights for prompt vectors
Fig. 6 presents a representative example where we show the evolution of average attention weights from prompt vectors to image tokens/patches across transformer layers when we employ PROMPT-TUNING-III. It is evident from the figure that prompt-attention helps distinguish the relevant tokens/patches from the irrelevant patches.
Figure 6. Illustration of how attention weights progressively change from the first layer (Figure 6b, top) to the last layer (Figure 6b, bottom) of the transformer model for a given input image (Figure 6a, labeled "automobile") when we employ PROMPT-TUNING-III. For attention heads 1-3 at layers 1, 4, 8, and 12, we plot average attention weights from 50 prompt vectors (keys) to 256 image patches (values). The attention weights for each attention head are naturally arranged in a 16x16 grid corresponding to the original locations of the patches in the image. Note that the attention weights in the early layers have a tiling pattern similar to that in FULL-TILED, the dataset utilized by the pre-trained model. However, as we progress deeper into the transformer, attention weights begin to capture the relevant patch locations in the dataset of interest, i.e., EMBED-IN-IMAGENET.
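As a companion to Fig. 6, the following minimal sketch indicates how such maps can be computed from a layer's softmax attention probabilities; the token layout (prompt tokens first, then the CLS token, then 256 patch tokens) and all tensor shapes are illustrative assumptions rather than the exact ViT/Scenic internals.

```python
import numpy as np

def prompt_to_patch_maps(attn, n_prompt=50, n_patch=256, grid=16):
    """Average attention mass between image patches and prompt tokens, on the patch grid.

    attn: softmax attention probabilities for one layer, shape (heads, queries, keys),
          with tokens ordered as [prompt tokens, CLS, patch tokens] (an assumed layout).
    Returns an array of shape (heads, grid, grid): for each head, the attention weight
    linking each image patch to the prompt tokens, averaged over the prompt axis and
    reshaped to the 16x16 grid of original patch locations.
    """
    patch_slice = slice(n_prompt + 1, n_prompt + 1 + n_patch)           # skip prompts and CLS
    maps = attn[:, patch_slice, :n_prompt].mean(axis=-1)                # (heads, n_patch)
    return maps.reshape(attn.shape[0], grid, grid)

# toy example: 3 heads, 50 prompts + 1 CLS + 256 patches = 307 tokens
heads, tokens = 3, 50 + 1 + 256
logits = np.random.default_rng(0).normal(size=(heads, tokens, tokens))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)      # row-wise softmax
print(prompt_to_patch_maps(attn).shape)                                 # (3, 16, 16)
```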