# Universality and Limitations of Prompt Tuning

Yihan Wang (UCLA, wangyihan617@gmail.com), Jatin Chauhan (UCLA, chauhanjatin100@gmail.com), Wei Wang (UCLA, weiwang@cs.ucla.edu), Cho-Jui Hsieh (Google and UCLA, chohsieh@cs.ucla.edu)

Despite the demonstrated empirical efficacy of prompt tuning to adapt a pretrained language model to a new task, the theoretical underpinnings of the difference between "tuning parameters before the input" and "tuning the model weights" are limited. We thus take one of the first steps to understand the role of soft-prompt tuning for transformer-based architectures. By considering a general-purpose architecture, we analyze prompt tuning from the lens of both universal approximation and limitations with finite-depth fixed-weight pretrained transformers for continuous-valued functions. Our universality result guarantees the existence of a strong transformer with a prompt to approximate any sequence-to-sequence function in the set of Lipschitz functions. The limitations of prompt tuning for limited-depth transformers are first proved by constructing a set of datasets that cannot be memorized by a prompt of any length for a given single encoder layer. We also provide a lower bound on the required number of tunable prompt parameters and compare the result with the number of parameters required for a low-rank update (based on LoRA) for a single-layer setting. We finally extend our analysis to multi-layer settings by providing sufficient conditions under which the transformer can at best learn datasets from invertible functions only. Our theoretical claims are also corroborated by empirical results.

1 Introduction

The surge in empirical research on large-scale models has led to the emergence of a new paradigm of prompt tuning. Current large models consist of billions of parameters [Brown et al., 2020, Chowdhery et al., 2022], which greatly exacerbates the cost of tuning the entire model weights via gradient-based optimization. On the other hand, the power of scale in both model size and pretraining dataset size has demonstrated strong capabilities by achieving reasonable performance through a learnable prompt appended before the input [Li and Liang, 2021, Lester et al., 2021]. Despite this, several questions arise around the abilities and limitations of prompt tuning. In this work, we aim to characterize some natural yet essential questions about prompt tuning with transformer architectures. First, are prompts universal approximators, i.e., with a fixed pretrained transformer network, can we find a prompt to approximate any sequence-to-sequence function in a given space? If yes, can we construct the transformer for this universality result? Second, can we identify failure modes of prompt tuning when applied to potentially non-optimal but non-trivial transformers? Moreover, since prompt tuning is usually compared against LoRA [Hu et al., 2021] in the context of parameter-efficient tuning, is prompt tuning more or less parameter-efficient than LoRA? Answering these questions can lead to important insights on when and how to perform prompt tuning to adapt a pretrained transformer network to a given downstream task of interest. In this work, we seek to answer these questions with appropriate theoretical analysis and further validate our claims with empirical results.
We first characterize the universal nature of prompt tuning by constructing a specific transformer network. We show that for a given approximation error and the space of sequence-to-sequence Lipschitz functions, we can construct a transformer network, with a suitable number of layers, that can leverage prompt tuning to approximate any function in this space. Despite this universality of prompt tuning with a carefully constructed pretrained transformer, we then identify some limitations of prompt tuning with weaker but non-trivial transformers. We prove this by constructing sequence-to-sequence datasets with shared input tokens, which are surprisingly simple but cannot be memorized by prompt tuning for a given transformer. We also extend our analysis to more general settings where the shared token is not required. In this setting, we first prove that prompt tuning on a single-layer transformer requires Ω(n) trainable parameters to memorize n training examples, whereas for LoRA, O(n) trainable parameters suffice. We finally extend our analysis to the multi-layer setting and provide sufficient conditions under which prompt tuning exhibits extremely limited capacity, at best memorizing datasets from invertible functions. Our contributions can be summarized as follows:

- We characterize the universal nature of prompt tuning by explicitly constructing a transformer network (Theorem 1).
- We provide a construction-based argument for sequence-to-sequence datasets that cannot be learned by prompt tuning with a given single-layer transformer (Theorem 2).
- We provide a lower bound on the required number of parameters for prompt tuning to memorize sequence-to-sequence functions (Theorem 3).
- We provide a sufficient condition for multi-layer transformers under which datasets with shared output tokens cannot be learned with prompt tuning (Theorem 4).
- We conduct empirical studies, including on real-world datasets, to verify our theoretical claims.

2 Related Work

Theoretical Analysis of Transformers. Various works have characterized the theoretical properties of transformers and their primary self-attention component. Yun et al. [2020] study the universal approximation ability of transformers for continuous permutation-equivariant sequence-to-sequence functions with compact support, and further examine the role of positional encodings in circumventing the permutation-equivariance restriction. Pérez et al. [2021] show that transformers with hard attention are Turing-complete, based on their capacity to perform computations and access the internal dense representations of the data. Wei et al. [2021] further show that transformers can approximate Turing machines with bounded computation time under a new notion of approximation. Dong et al. [2021] provide a negative yet interesting result signifying the limitations of pure self-attention in terms of the rank collapse of the input. Other works, including Kim et al. [2021] and Dasoulas et al. [2021], derive upper bounds on the Lipschitz constant of respective modifications of the attention mechanism. The works of Li et al. [2022] and Zhang et al. [2020] document an optimization perspective on transformer training via SGD.

Fine-tuning and Prompt Tuning. Fine-tuning is the standard way to adapt a pretrained model to downstream tasks. The most standard and popular paradigm is tuning the model weights via a suitable optimization procedure along with a linear head on the output representations [Radford et al., 2018, Devlin et al., 2019].
Subsequent works studied more parameter-efficient ways of fine-tuning by updating either a subset of model parameters [Ben Zaken et al., 2022] or restricting the parameter updates to a low-dimensional subspace [Aghajanyan et al., 2021, Hu et al., 2021, Mahabadi et al., 2021]. The work by Hu et al. [2021] (their framework referred to as LoRA) has garnered particular interest in the community, and Malladi et al. [2023] have provided an interpretation of LoRA via the kernel mechanism. In the particular context of LLMs, prompt tuning has emerged as the de facto approach where only the prompt is updated while keeping the rest of the transformer weights and architecture fixed [Shin et al., 2020, Lester et al., 2021, Li and Liang, 2021].

Analysis of Prompt Tuning. Wei et al. [2022] study the link between prompt tuning and downstream tasks with an underlying latent-variable generative model of text, which is confined to a Hidden Markov Model. However, they focus on the discrete vocabulary setting, in contrast to our results for continuous sequence-to-sequence functions. Some more recent works [Akyürek et al., 2023, Von Oswald et al., 2023] characterize an intriguing property of a specific form of prompting, referred to as in-context learning, where they prove by construction that transformers can implement learning algorithms for linear models based on gradient descent and closed-form ridge regression. These works, however, pursue a different and more specific direction than the prompting results we aim to provide for generic settings.

Memorization Capacity of Neural Networks. A series of works have sought to provide finite-sample universal memorization capacity results for neural networks and an understanding of their expressive power. Huang and Huang [1990], Huang and Babri [1998], Huang [2003], and Yamasaki [1993] analyze the memorization capacity of FNNs with sigmoid and other bounded activation functions. Hardt and Ma [2016], Zhang et al. [2021], and Nguyen and Hein [2018] provide results for modern ReLU networks, including FNNs and CNNs. For transformer architectures, Kim et al. prove that transformers can memorize a dataset with finite parameters. To the best of our knowledge, similar results for prompt tuning have not been studied in continuous settings for transformer architectures.

3 Transformers and Parameter-Efficient Training

3.1 Preliminaries

We use the following notation throughout the paper. A bold lower-case character, e.g. $\mathbf{x}$, denotes a vector. A bold upper-case character, e.g. $\mathbf{W}$, denotes a matrix, while $W_{i,j}$, $W_{i,:}$ and $W_{:,j}$ denote the $(i,j)$-th element, $i$-th row, and $j$-th column, respectively. We use a single superscript or subscript to denote the index of a matrix in a sequence of matrices, e.g. $X^i$, $X_i$ denote the $i$-th matrix in the sequence. We use $\sigma$ and $\bar{\sigma}$ for the softmax and hardmax operators, respectively. We use $\mathrm{ReLU}(v) = \max(v, 0)$ to denote the ReLU activation function, where $\max(\cdot)$ is applied entry-wise to a vector. We use $\mathrm{Cone}(a_1, a_2, \ldots, a_m)$ to denote a cone without its origin point, where $\mathrm{Cone}(a_1, \ldots, a_m) = \{x : x = \sum_{i=1}^m \alpha_i a_i,\ \alpha_i > 0\}$. We also define the minus operation between a set $S$ and a vector $v$ as $S - v = \{x - v : x \in S\}$. In Section 4, we use $[a : b : c]$ to denote the grid $\{a, a+b, a+2b, \ldots, c-b\}$ from $a$ to $c$ with interval $b$. Transformer networks [Vaswani et al., 2017] are a stack of multiple transformer layers, composed sequentially. A transformer layer has two key components: an attention layer and a token-wise MLP layer, with residual connections around both blocks.
We consider the input and output to be sequences of tokens $X \in \mathbb{R}^{d \times m}$ and $Y \in \mathbb{R}^{d \times m}$, where $m$ is the number of tokens in the sequence and $d$ is the token dimension.

Definition 1 (Attention Layer). We define an $h$-head attention layer parameterized with $W_q, W_k, W_v, W_o$ between a single token $x$ and a token sequence $X$ as
$$\mathrm{Att}(x, X) = \sum_{i=1}^{h} W_o^i W_v^i X \,\sigma\big((W_k^i X)^\top W_q^i x\big). \quad (1)$$
The normalizing factor of $1/\sqrt{d_{kq}}$ is subsumed in the weight matrices $W_k^i$ for notational simplicity. We can then define the cross attention between two sequences $X_1 \in \mathbb{R}^{d \times m_1}$ and $X_2 \in \mathbb{R}^{d \times m_2}$ (we write $x_k = (X_1)_{:,k}$ for simplicity):
$$\mathrm{Att}(X_1, X_2) = [\mathrm{Att}(x_1, X_2), \mathrm{Att}(x_2, X_2), \ldots, \mathrm{Att}(x_{m_1}, X_2)].$$

Definition 2 (Standard Transformer Layer). With Definition 1, we define a standard transformer layer $\tau$ as
$$\mathrm{MLP}(X) = [W_2\mathrm{ReLU}(W_1 X_{:,1} + b_1) + b_2 + X_{:,1}, \ldots, W_2\mathrm{ReLU}(W_1 X_{:,n} + b_1) + b_2 + X_{:,n}] \quad (2)$$
$$\tau(X) = \mathrm{MLP}(\mathrm{Att}(X, X) + X). \quad (3)$$
The definition here omits the layer normalization block for simplicity (following [Kim et al., 2021]). We denote the set of transformer networks with $h$ heads of size $s$ and $r$ MLP hidden neurons by $\mathcal{T}^{h,s,r}$. In Section 4, we utilize a modified transformer network with the hardmax operation $\bar{\sigma}$ instead of softmax $\sigma$. We denote this modified version of transformer networks as $\bar{\mathcal{T}}^{h,s,r}$. During fine-tuning, we optimize the matrices $W_q^i, W_k^i, W_v^i$ in the attention layer and $W_1, W_2, b_1, b_2$ in the MLP layer with respect to a loss function $\mathcal{L}$. In prompt tuning, however, the pretrained model weight matrices are fixed and we optimize a tunable sequence prepended to the input.

Prompt Tuning. Given a pretrained transformer network $g \in \mathcal{T}$ and a downstream training dataset $S = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, prompt tuning seeks a prompt $P \in \mathbb{R}^{d \times m_p}$ with $m_p$ tunable tokens under the loss function $\mathcal{L}$:
$$P^* = \arg\min_{P} \sum_{i=1}^{n} \mathcal{L}\big(g([P, X_i])_{:,\, m_p:},\ Y_i\big). \quad (4)$$
The tunable prompt $P$ is shared amongst all the inputs in a task. Note that $P$ in prompt tuning is a continuously trainable parameter, also referred to as a soft prompt, which differs from a hard prompt in that the latter operates on a discrete space of a predefined vocabulary. Since the representation power of soft prompts is strictly greater than that of hard prompts, the limitations studied in this paper also extend to hard prompts. In the subsequent sections, we analyze the universality and limitations of prompt tuning while comparing the latter against fine-tuning and LoRA [Hu et al., 2021], which is a low-rank version of model fine-tuning. In Section 4, we prove that prompt tuning can be a universal approximator for sequence-to-sequence functions, while providing the construction for the same. In Sections 5 and 6, we identify failure modes where prompt tuning cannot learn with a possibly non-optimal but non-trivial pretrained transformer network.

4 Universality of Prompt Tuning

Without loss of generality, we assume that the support and range set of all considered sequence-to-sequence functions $f$ is $[0,1]^{d \times m}$ in this section. We define $\mathcal{F}_L$ as the collection of all continuous sequence-to-sequence $L$-Lipschitz functions under norm $p$ and sequence length $m$. For $f \in \mathcal{F}_L$ and any two inputs $X, X' \in [0,1]^{d \times m}$, we have $\|f(X) - f(X')\|_p \le L\|X - X'\|_p$. Furthermore, given functions $f_1, f_2$, the approximation error under an entry-wise $p$-norm is measured as:
$$d_p(f_1, f_2) = \Big( \int \|f_1(X) - f_2(X)\|_p^p \, dX \Big)^{1/p}. \quad (5)$$
Primarily, we show that there exists a transformer network $g \in \mathcal{T}^{2,1,4}$ such that for any $f \in \mathcal{F}_L$, prompt tuning on $g$ can approximate this function up to some error budget $\epsilon > 0$.
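Before stating the universality result, the following PyTorch-style sketch spells out the single-head version of Definitions 1-2 and the prompt-tuning objective in Eq. (4): only the soft prompt is trained while the layer weights stay frozen. All dimensions, random weights, and the toy training pair are hypothetical and purely illustrative; this is not the transformer constructed in Theorem 1.

```python
# A minimal sketch of Definition 2 (single head) and the prompt-tuning objective, Eq. (4).
import torch

d, m, m_p, r = 8, 4, 3, 16          # token dim, input length, prompt length, MLP width (hypothetical)
Wq, Wk, Wv, Wo = (torch.randn(d, d) for _ in range(4))
W1, b1 = torch.randn(r, d), torch.zeros(r)
W2, b2 = torch.randn(d, r), torch.zeros(d)

def att(X):                          # self-attention, Eq. (1) with h = 1; columns are tokens
    scores = (Wk @ X).T @ (Wq @ X)   # entry [j, i] scores key token j against query token i
    return Wo @ Wv @ X @ torch.softmax(scores, dim=0)

def layer(X):                        # Eq. (2)-(3): attention + residual, token-wise MLP + residual
    Z = att(X) + X
    return W2 @ torch.relu(W1 @ Z + b1.unsqueeze(1)) + b2.unsqueeze(1) + Z

# Prompt tuning: only P is trainable, the layer weights above are frozen.
P = torch.zeros(d, m_p, requires_grad=True)
opt = torch.optim.Adam([P], lr=1e-2)
X, Y = torch.rand(d, m), torch.rand(d, m)            # a toy training pair
for _ in range(100):
    out = layer(torch.cat([P, X], dim=1))[:, m_p:]   # read off the last m output columns
    loss = ((out - Y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```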
Theorem 1. Let $1 \le p < \infty$ and $\epsilon > 0$. There exist a transformer network $g \in \mathcal{T}^{2,1,4}$ and a prompt length $m_p$ such that for any $f \in \mathcal{F}_L$ we can find a prompt $P \in \mathbb{R}^{d \times m_p}$ with $d_p(g([P, \cdot])_{:,\, m_p:}, f) \le \epsilon$.

Here we use the transformer in an encoder mode which generates the $m$ outputs in one step. In Appendix C.4, a similar result is obtained for next-token prediction, which is widely used in many recent language models. The proof is inspired by [Yun et al., 2019a] and follows the typical construction-based mechanism for showing universality. Thereby, we can construct a meta-transformer that approximates any sequence-to-sequence function with prompt tuning. Next we briefly describe the two steps in the construction of this meta-transformer. We start by building a meta-function for $\mathcal{F}_L$.

Building the Meta-Function. We denote the length of all inputs as $m$ and the prompt length as $m_p$. Then we can build a sequence-to-sequence meta-function that accepts inputs of length $m + m_p$.

Lemma 1. For the sequence-to-sequence function space $\mathcal{F}_L$ with functions $f : [0,1]^{d \times m} \to [0,1]^{d \times m}$, we can build a sequence-to-sequence function $\bar{g} : [0,1]^{d \times (m_p + m)} \to [0,1]^{d \times (m_p + m)}$ such that for any $f \in \mathcal{F}_L$ we can find $P \in \mathbb{R}^{d \times m_p}$ with $d_p(\bar{g}([P, \cdot])_{:,\, m_p:}, f) \le \epsilon/2$.

The complete proof is given in Appendix C.1. Succinctly, we first quantize the input and output sequence space $[0,1]^{d \times m}$ into a grid $G_{\delta, m} = \{0, \delta, 2\delta, \ldots, 1-\delta\}^{d \times m}$, thus leading to $C = \big(1/\delta^{dm}\big)^{1/\delta^{dm}}$ possible function mappings from the input to the output in this discrete space. Denoting this quantized function space by $\bar{\mathcal{F}}_L = \{\bar{f}_1, \bar{f}_2, \ldots, \bar{f}_C\}$, we can select $\delta$ such that the approximation error for any function is less than $\epsilon/2$. Then we construct a set of quantized prompts in $G_{\delta, m_p} = \{0, \delta, 2\delta, \ldots, 1-\delta\}^{d \times m_p}$ to index these $C$ functions, and construct a quantized function $\bar{g}$ with $\bar{g}([P_i, X])_{:,\, m_p:} = \bar{f}_i(X)$, $i = 1, 2, \ldots, C$, for all $X \in G_{\delta, m}$, thereby concluding the lemma. Next we utilize some results from [Yun et al., 2019a] to construct a transformer for $\bar{g}$.

Constructing the Meta-Transformer. We first introduce a useful lemma which enables the construction of a transformer for any quantized sequence-to-sequence function.

Lemma 2. For any given quantized function $\bar{f} : [0,1]^{d \times m} \to [0,1]^{d \times m}$ with quantization at interval $\delta$, $\exists\, \bar{h} \in \bar{\mathcal{T}}^{2,1,1}$ such that $\bar{f} = \bar{h}$ with positional embedding
$$E = \begin{bmatrix} 0 & 1 & 2 & \cdots & m-1 \\ 0 & 1 & 2 & \cdots & m-1 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 1 & 2 & \cdots & m-1 \end{bmatrix}.$$

The proof mainly follows the discussion in Section C of [Yun et al., 2019a]. To prove this lemma, the network $\bar{h}$ can be constructed in the following three steps. We first use a series of MLP layers to quantize the input to the grid $[0 : \delta : 1-\delta]^{d \times m}$, then a series of attention layers to obtain a unique contextual mapping for each quantized input. Finally, we use a series of MLP layers to map each unique contextual mapping to the desired output. While a transformer network usually stacks a self-attention and an MLP layer alternately within a single layer, the aforementioned construction can be trivially attained via the use of skip connections. The complete proof of Lemma 2 is deferred to Appendix C.2. Since $\bar{g}$ is a quantized function on the grid $G_{\delta, m+m_p}$, following Lemma 2 we can find a modified transformer $\bar{h} \in \bar{\mathcal{T}}^{2,1,1}$ such that $\bar{g}([P, X]) = \bar{h}([P, X])$. The modified transformer $\bar{h}$ with hardmax operators can then be approximated by a standard transformer $g$ with softmax operators, by Lemma 3.

Lemma 3 (Lemma 9 in [Yun et al., 2019a]). For each $\bar{h} \in \bar{\mathcal{T}}^{2,1,1}$, $\epsilon > 0$ and $1 \le p < \infty$, $\exists\, g \in \mathcal{T}^{2,1,4}$ such that $d_p(\bar{h}, g) \le \epsilon/2$.
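As a concrete, deliberately tiny illustration of the counting behind Lemma 1, the snippet below computes the number of quantized maps $C$ and a prompt length $m_p$ long enough to index them. The values of $d$, $m$, and $\delta$ are hypothetical and chosen only to keep the numbers small.

```python
# Bookkeeping behind Lemma 1 for toy, hypothetical sizes.
from math import ceil, log

d, m, delta = 2, 2, 0.5                  # hypothetical token dim, sequence length, grid interval
points_per_entry = int(1 / delta)        # |{0, delta, ..., 1 - delta}|
inputs = points_per_entry ** (d * m)     # number of quantized input sequences in G_{delta,m}
C = inputs ** inputs                     # number of quantized maps f_1, ..., f_C (same output grid)

# A prompt of length m_p can index at most points_per_entry ** (d * m_p) functions,
# so we need points_per_entry ** (d * m_p) >= C.
m_p = ceil(log(C) / (d * log(points_per_entry)))
print(f"C = {C}, required prompt length m_p = {m_p}")
```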
Since the approximation error can be treated uniformly amongst the $P_i$, we have that $d_p(\bar{h}([P_i, \cdot])_{:,\, m_p:}, g([P_i, \cdot])_{:,\, m_p:}) \le d_p(\bar{h}([P_i, \cdot]), g([P_i, \cdot])) \le \epsilon/2$. Therefore, we can build a transformer $g \in \mathcal{T}^{2,1,4}$ such that for any sequence-to-sequence $f \in \mathcal{F}_L$, we can find a quantized version $\bar{f}_i \in \bar{\mathcal{F}}_L$ and the corresponding prompt $P_i \in G_{\delta, m_p}$ such that
$$d_p(g([P_i, \cdot])_{:,\, m_p:}, f) \le d_p(g([P_i, \cdot])_{:,\, m_p:}, \bar{h}([P_i, \cdot])_{:,\, m_p:}) + d_p(\bar{h}([P_i, \cdot])_{:,\, m_p:}, \bar{f}_i) + d_p(\bar{f}_i, f) \le \epsilon. \quad (6)$$

Theorem 1 provides the construction of a large transformer (discussed more in the appendix) that is sufficient for prompt tuning to exhibit universal approximation over a Lipschitz function space. However, even this strong transformer has limitations under prompt tuning when the target function $f \notin \mathcal{F}_L$. Is this an essential limitation of prompt tuning on any transformer? In the next section, we theoretically analyze the limitations of prompt tuning with transformers and target functions under more general conditions.

5 Limitations of Prompt Tuning: Single-Layer Transformer

To analyze the failure modes, and therefore the limitations, in the setting where a transformer has fixed pretrained weights, we adopt the lens of exact memorization in the subsequent sections.

Definition 3 (Memorization of a Sequence-to-Sequence Dataset). Given a sequence-to-sequence dataset $S = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, where $X_i, Y_i \in \mathbb{R}^{d \times m}$ are the input/output sequences, we say a function $f$ exactly memorizes the dataset $S$ if $f(X_i) = Y_i$. In the proofs of this section, we explicitly focus on the last output token, i.e., $f(X_i)_{:,-1} = (Y_i)_{:,-1}$.

We start from the analysis of a single-layer transformer and extend to multi-layer settings in Section 6.

5.1 Failure Modes of Prompt Tuning

It is straightforward to note that prompt tuning has limited expressive power when the number of trainable parameters is limited. A natural question to then ask is: does increasing the number of trainable prompt tokens suffice? It is known that for MLPs, even with a single hidden layer, increasing the number of hidden neurons suffices to memorize any training data [Yun et al., 2019b]. However, as we will prove next, this is not the case for prompt tuning. This result highlights an essential limitation of prompt tuning compared to model fine-tuning. Before providing the theorem statement, we first outline some straightforward assumptions on the pretrained transformer and datasets, without which prompt tuning trivially loses expressive power. We consider sequence-to-sequence datasets of the form $S = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$ with $n$ distinct examples and a single-layer single-head standard transformer defined in Definition 2. The results can be directly extended to the single-layer multi-head scenario, which we skip here to avoid notational clutter.

Assumption 1 (Non-trivial conditions). We assume that all output tokens $(Y_i)_{:,k}$ are in the range set of MLP, otherwise the expressivity becomes trivially weak. We assume that $W_q, W_k, W_v$ are full-rank matrices and that $\mathrm{Att}(X_i, X_i) + X_i$ are distinct for $i = 1, 2, \ldots, n$.

Assumption 2 (Assumption for the MLP layer). We assume that $d \ge 2 + \dim\big((\mathrm{MLP}^{-1}(y_{10}) - x_0) \cup (\mathrm{MLP}^{-1}(y_{20}) - x_0)\big)$ for the dataset constructed in Theorem 2 and token dimension $d$. Here $\dim(S)$ measures the dimension of the subspace spanned by the vectors in a set $S$, and $\mathrm{MLP}^{-1}(y) = \{x : \mathrm{MLP}(x) = y\}$.

We provide an example for this assumption in Example 1 and a sufficient condition in the following Lemma 4.

Lemma 4.
If $\|W_1\|_2 \|W_2\|_2 < 1$, where $\|\cdot\|_2$ is the matrix spectral norm, then the MLP block in Definition 2 is invertible, i.e., $\mathrm{MLP}^{-1}(y)$ is a singleton set. Therefore, if Lemma 4 holds and $d \ge 4$, Assumption 2 also holds.

The proof of Lemma 4 can be found in Appendix C.5. The experimental evidence in [Dong et al., 2021] shows that for most architectures the norms of the weight matrices indeed admit small values, and thus the requirement $\|W_1\|_2 \|W_2\|_2 < 1$ is a mild condition. With these assumptions, we introduce our first theorem on the unlearnability of prompt tuning.

Theorem 2. For a single-layer transformer $\tau$ defined above with Assumptions 1 and 2, we can build a sequence-to-sequence dataset $S = \{(X_1 = [x_1, x_0], Y_1 = [y_{11}, y_{10}]), (X_2 = [x_2, x_0], Y_2 = [y_{21}, y_{20}])\}$ such that we cannot find a prompt $P \in \mathbb{R}^{d \times m_p}$ with any $m_p > 0$ for which $\tau([P, X_i]) = Y_i$ holds for both $i = 1, 2$. The vectors $x_0, x_1, x_2$ are denoted post positional encodings.

An important feature of this dataset is that the same token $x_0$ is shared between the two examples, and the expressive capability of prompt tuning is limited by the correlation of the outputs corresponding to this token in different examples. We show a concrete example here to illustrate this theorem (note that Lemma 4 is in fact not required in the following construction) and defer the formal proof to Appendix C.6.

Example 1. We consider a single-head transformer layer $\tau$ where $b_1 = b_2 = 0$, $W_1 = \mathbf{1}_{r \times d}$, $W_2 = \mathbf{1}_{d \times r}$. Then the token-wise MLP layer is a concatenation of two linear functions:
$$\mathrm{MLP}(x) = \begin{cases} (W_2 W_1 + I)x, & (W_1 x)_0 > 0 \\ x, & (W_1 x)_0 \le 0 \end{cases} \quad (7)$$
Here $(W_1 x)_0$ denotes the first element of the vector $W_1 x$, and $W_2 W_1 + I$ is a non-singular matrix. Therefore, for any $y$ in the output set of MLP, $\mathrm{MLP}^{-1}(y)$ contains at most two points, $\{y, (W_2 W_1 + I)^{-1} y\}$. We arbitrarily choose $x_0$, $y_{10}$ and $y_{20}$. As long as $d \ge 6$ (from Assumption 2), we can find $c_1, c_2$ such that $c_1, c_2 \perp y_{10} - x_0,\ y_{20} - x_0,\ (W_2 W_1 + I)^{-1} y_{10} - x_0,\ (W_2 W_1 + I)^{-1} y_{20} - x_0$ and $c_1 \perp c_2$. Then we choose $x_1$ and $x_2$ such that $\mathrm{Att}(x_0, X_1) \parallel c_1$ and $\mathrm{Att}(x_0, X_2) \parallel c_2$ (Lemma 7 in the Appendix). Then $\mathrm{Cone}(-\mathrm{Att}(x_0, X_1), a - x_0) \cap \mathrm{Cone}(-\mathrm{Att}(x_0, X_2), b - x_0) = \emptyset$ for any $a \in \{y_{10}, (W_2 W_1 + I)^{-1} y_{10}\}$ and $b \in \{y_{20}, (W_2 W_1 + I)^{-1} y_{20}\}$. Here $\mathrm{Cone}$ stands for a convex cone as defined in Section 3.1. If a $P$ exists such that $\tau([P, X_i]) = Y_i$ holds for both $i = 1, 2$, then we have
$$\mathrm{Att}(x_0, [P, X_1]) = \lambda(X_1, x_0, [P, X_1])\,\mathrm{Att}(x_0, X_1) + \lambda(P, x_0, [P, X_1])\,\mathrm{Att}(x_0, P) \quad (8)$$
$$\mathrm{Att}(x_0, [P, X_2]) = \lambda(X_2, x_0, [P, X_2])\,\mathrm{Att}(x_0, X_2) + \lambda(P, x_0, [P, X_2])\,\mathrm{Att}(x_0, P)$$
where $\lambda(\cdot, \cdot, \cdot)$ is a positive scalar. We also have $\mathrm{Att}(x_0, [P, X_1]) + x_0 \in \mathrm{MLP}^{-1}(y_{10})$ and $\mathrm{Att}(x_0, [P, X_2]) + x_0 \in \mathrm{MLP}^{-1}(y_{20})$, since $\mathrm{MLP}(\mathrm{Att}(x_0, [P, X_i]) + x_0) = y_{i0}$, $i = 1, 2$. Therefore, $\mathrm{Att}(x_0, P)$ must be in both $\mathrm{Cone}(a - x_0, -\mathrm{Att}(x_0, X_1))$ and $\mathrm{Cone}(b - x_0, -\mathrm{Att}(x_0, X_2))$, where $a \in \{y_{10}, (W_2 W_1 + I)^{-1} y_{10}\}$ and $b \in \{y_{20}, (W_2 W_1 + I)^{-1} y_{20}\}$, which contradicts the existence of $P$ since $\mathrm{Cone}(-\mathrm{Att}(x_0, X_1), a - x_0) \cap \mathrm{Cone}(-\mathrm{Att}(x_0, X_2), b - x_0) = \emptyset$.

Therefore, in this example, even though we allow an arbitrary number of trainable parameters in the prompt $P$, we cannot find one that exactly memorizes a training set with only two training examples (a numeric sanity check of the MLP form in (7) is sketched below). This theorem reveals an important difference between prompt tuning and adjusting the model weights directly. For any training dataset $S$ with two training examples $\{(X_1, Y_1), (X_2, Y_2)\}$, as long as $\mathrm{Att}(X_1, X_1) + X_1$ and $\mathrm{Att}(X_2, X_2) + X_2$ are distinct, the MLP can easily map the post-attention features to the expected output tokens with a finite number of hidden neurons. As a result, tuning the MLP parameters of this pretrained transformer can memorize any dataset in the form of Assumption 1.
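The following NumPy snippet is a small sanity check of the piecewise-linear form (7): with all-ones weights and zero biases, the residual MLP block reduces to the two linear branches above, so its pre-image contains at most two candidates. The dimensions and test points are hypothetical and used only for illustration.

```python
# Check Eq. (7): with W1 = ones(r, d), W2 = ones(d, r) and zero biases, the residual MLP
# block equals (W2 W1 + I) x when sum(x) > 0 and the identity otherwise.
import numpy as np

d, r = 6, 8
W1, W2, I = np.ones((r, d)), np.ones((d, r)), np.eye(d)

def mlp(x):                                   # residual MLP block from Definition 2
    return x + W2 @ np.maximum(W1 @ x, 0.0)

rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.uniform(-1.0, 1.0, size=d)
    expected = (W2 @ W1 + I) @ x if (W1 @ x)[0] > 0 else x
    assert np.allclose(mlp(x), expected)

# Hence MLP^{-1}(y) has at most two candidates: y itself (valid when sum(y) <= 0)
# and (W2 W1 + I)^{-1} y (valid when that point lands in the positive branch).
y = mlp(rng.uniform(0.0, 1.0, size=d))
candidates = [y, np.linalg.solve(W2 @ W1 + I, y)]
print([np.allclose(mlp(c), y) for c in candidates])
```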
However, prompt tuning cannot achieve this even if the number of tunable tokens approaches infinity, thereby limiting the expressiveness of prompt tuning compared to model fine-tuning.

5.2 Comparison with a More General Dataset

In Section 5.1, we constructed sequence-to-sequence datasets that cannot be learned by a given transformer layer with prompt tuning, by utilizing a shared token between different training examples. In this section, we compare the expressive power of prompt tuning and fine-tuning under a more general dataset construction in which this requirement is relaxed. Since the primary essence of prompt tuning is parameter efficiency, wherein we seek to adapt a pretrained large model to a new task with fewer tunable parameters, we compare prompt tuning with another parameter-efficient version of model tuning: LoRA [Hu et al., 2021]. Succinctly, we compare the number of parameters required to memorize a given dataset. Again, consider a sequence-to-sequence dataset $S = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$, where $X_i = [x_{i1}, x_{i2}, \ldots, x_{im}]$ and $Y_i = [y_{i1}, y_{i2}, \ldots, y_{im}]$. We again discuss the memorization of the last output token for simplicity, and the results can be directly extended. We first give the number of parameters LoRA requires to memorize the dataset $S$.

Lemma 5 (LoRA). For a standard single-layer transformer $\tau$ defined in Definition 2 with $r \ge n$ MLP hidden neurons, for any sequence-to-sequence dataset $S$ satisfying Assumption 1, we can apply a low-rank update to the MLP weights with $O(nd)$ parameters to memorize the last tokens, i.e., $\tau(X_i)_{:,m} = y_{im}$.

This lemma is derived from the memorization capabilities of 1-hidden-layer MLPs [Yun et al., 2019b]. As the post-attention values for different training inputs are distinct by Assumption 1, we can construct a low-rank update with $O(nd)$ parameters on the MLP layer to memorize $S$. We defer the complete proof to Appendix C.7. For prompt tuning, we derive a result in the next theorem showing that it requires $\Omega(nd)$ tunable parameters to memorize some constructed dataset $S$ with $n$ examples.

Theorem 3 (Lower Bound on Tunable Prompt Parameters). For any single-layer transformer $\tau$ defined in Definition 2, there exists a sequence-to-sequence dataset $\{(X_1 = [x_{10}, x_1], [y_{10}, y_{11}]), (X_2 = [x_{20}, x_2], [y_{20}, y_{21}]), \ldots, (X_n = [x_{n0}, x_n], [y_{n0}, y_{n1}])\}$ that satisfies Assumption 1 with $n < d$ training examples, such that we need at least $n$ prompt tokens in $P$ to memorize the training set, i.e., for $\tau([P, X_i])_{:,-1} = y_{i1}$ to hold for all $i = 1, 2, \ldots, n$.

This dataset can be constructed by including $n$ examples that require $n$ linearly independent prompt tokens. The complete proof is deferred to Appendix C.8. Note that Theorem 3 provides a key lower bound on the number of prompt tokens required for exact memorization, and the actual requirement can very well be higher. This partially (but not necessarily) explains the worse empirical performance of prompt tuning against LoRA under a comparable number of trainable parameters.

6 Extension to the Multi-Layer Setting

In this section, we extend our analysis to the multi-layer setting and provide a sufficient condition under which the expressiveness of prompt tuning is restricted. An immediate consequence of our result is an interesting connection to the spectral norm of soft prompts. This result provides a partial understanding of the phenomenon that soft prompt vectors $P$ typically exhibit larger norms than the actual input $X$ after tuning.
With some further notation adjustments, we denote an $H$-layer pretrained transformer network $g \in \mathcal{T}$ as the composition of layers $\tau^1, \tau^2, \ldots, \tau^H$ applied in that order, the input set as $\mathcal{X}^1$, and the set of possible prompts as $\mathcal{P}^1$. We assume that the following compactness condition is satisfied:
$$\|[P^l, X^l]\|_2 \le D^l \quad (9)$$
$$\text{s.t.}\quad [P^{l+1}, X^{l+1}] = \tau^l([P^l, X^l]), \quad l = 1, \ldots, H.$$
Here $[P^1, X^1]$ is the input to the first layer $\tau^1$ with $P^1 \in \mathcal{P}^1$, $X^1 \in \mathcal{X}^1$, and $\|\cdot\|_2$ is the spectral norm. Similarly, $[P^{H+1}, X^{H+1}]$ denotes the output set. We start by providing an upper bound on the Lipschitz constant of attention pertaining to Eq. (9). This derivation is different from the works of [Dasoulas et al., 2021, Vuckovic et al., 2020] and thus can be of independent interest.

Lemma 6. Under the compactness condition, the Lipschitz constant of the $i$-th attention head in the $l$-th transformer layer, denoted for simplicity as $\mathrm{Att}^{i,l}$, admits the following bound w.r.t. the entire input sequence of length $m$:
$$\mathrm{Lip}(\mathrm{Att}^{i,l}(\cdot, \cdot)) \le \big(1 + 8\sqrt{m}\,(D^l)^2 \|(W_k^{i,l})^\top W_q^{i,l}\|_2\big)\|W_v^{i,l}\|_2, \quad (10)$$
and the Lipschitz constant of the entire attention block in layer $l$, denoted as $\mathrm{Att}^l$, admits the bound:
$$\mathrm{Lip}(\mathrm{Att}^l(\cdot, \cdot)) \le \sqrt{\sum_{i=1}^{h} \big(\|W_o^{i,l}\|_2\,\mathrm{Lip}(\mathrm{Att}^{i,l})\big)^2}. \quad (11)$$

It is noteworthy that this upper bound depends on $D^l$, the spectral norm of the input prepended with the prompt. In conjunction with the following theorem, we obtain a result on the limited expressivity of prompt tuning by showing that the transformer becomes invertible, in consideration of functions from $\mathcal{P}^1 \times \mathcal{X}^1 \to \mathcal{P}^{H+1} \times \mathcal{X}^{H+1}$ (an extension to functions of the form $\mathcal{X}^1 \to \mathcal{X}^{H+1}$ is provided in Appendix C.11).

Theorem 4. A transformer $g \in \mathcal{T}$ is invertible, i.e., $g^{-1}(Y) = \{X : g(X) = Y\}$ is a singleton set for every $Y$ in the range of $g$, if: 1. the Lipschitz constant of the attention block in each layer $\tau^l$ is strictly less than 1; and 2. the Lipschitz constant of the 2-layer ReLU block in each layer $\tau^l$, which is bounded by $\|W_2^l\|_2 \|W_1^l\|_2$, is strictly less than 1.

The proof of Theorem 4 can be found in Appendix C.9. Combining Lemma 6 and Theorem 4, we observe that invertibility is guaranteed if the upper bounds on the Lipschitz constants of the attention block, Eq. (11), and of the MLP layer are both strictly less than 1. In this case, we can construct arbitrarily many datasets in which two different inputs share the same output, and prompt tuning cannot learn (more precisely: memorize) these datasets with a restricted prompt norm.

7 Experiments

7.1 Experimental Settings

In Section 7.2, we use a standard single-layer single-head transformer from Definition 2 to illustrate the limitation that persists even for very large prompt lengths. In Section 7.3, we demonstrate the increasing prompt norm on the pretrained LLaMA 7B model [Touvron et al., 2023]. For prompt tuning and LoRA, we use the Hugging Face PEFT library [Mangrulkar et al., 2022]. On the dataset front, we utilize the RTE subtask of the SuperGLUE dataset [Wang et al., 2019] and WMT14 En-Fr translation [Bojar et al., 2014]. More details and hyperparameter settings can be found in Appendix A.

7.2 Limited Expressivity of Infinite-Length Prompts

We first construct the dataset following the proof of Theorem 2 and then show that prompt tuning cannot memorize this simple dataset $\{(X_1 = [x_1, x_0], Y_1 = [y_{11}, y_{10}]), (X_2 = [x_2, x_0], Y_2 = [y_{21}, y_{20}])\}$ even with very large prompt lengths. We set the token dimension $d = 10$. We follow the default PyTorch weight initialization and then normalize $W_1, W_2$ such that $\|W_2\|_2 \|W_1\|_2 < 1$, following Assumption 2. We randomly sample $x_0, y_{10}, y_{20}$ from a uniform distribution on $[0, 1)^d$ and construct the corresponding vectors $x_1$ and $x_2$ following Theorem 2.
To compute $\mathrm{MLP}^{-1}(y)$, we follow Section 4.1 of [Kim et al., 2021], running 5000 iterations to convergence. We solve for $\mathrm{Att}(x_0, [x_0, x_1]) \parallel c$ in Lemma 7 with gradient descent, terminating when the angle $\angle(\mathrm{Att}(x_0, [x_0, x_1]), c) < 0.0001$. We repeat this setup to obtain 3 different datasets for distinct $x_0, y_{10}, y_{20}$ and denote them by $S_i$, $i = 1, 2, 3$. We perform prompt tuning, MLP fine-tuning, and MLP LoRA training on the constructed datasets for 5 runs and report the mean and standard deviation of the per-element mean squared error (MSE) loss on $y_{10}, y_{20}$ at convergence. We show the comparison between prompt tuning and MLP fine-tuning in Figure 1. As we can observe from the figure, increasing the number of soft prompt tokens beyond a certain threshold does not yield any reduction in MSE. On the contrary, fine-tuning the MLP layer easily memorizes the training set by reducing the training loss to almost zero (all three fine-tuning curves overlap and are thus not differentiated). Note that we plot the standard deviation, but it is negligible at this scale. Similar to fine-tuning on the MLP layer, LoRA with width 2 on the MLP layer also achieves a near-zero training loss (less than $10^{-10}$) on the constructed datasets. We do not plot this comparison in Figure 1 as all six curves would overlap. This result is consistent with Lemma 5 and Theorem 3: LoRA can memorize a dataset with $n$ examples using $O(nd)$ trainable parameters, while prompt tuning may require more.

7.3 Increasing Prompt Spectral Norm during Tuning

As discussed in Section 6, a major constraint on the expressive power of prompt tuning is the spectral norm of the soft prompt. In Figure 2, we plot the spectral norm of the soft prompt as training progresses and the loss decreases on the RTE dataset. The curve for the WMT14 En-Fr dataset can be found in Appendix B. The trend clearly highlights that, in order to counter the limit on capacity, the spectral norm consistently increases until the training loss saturates.

8 Conclusions

In this work, we embark on exploring the capabilities of prompt tuning in the continuous regime, contrasting it with fine-tuning, as an initial endeavor towards a theoretical comprehension. We prove by construction that prompt tuning admits universal approximation within the space of Lipschitz functions. Additionally, we identify inherent limitations of prompt tuning on single-layer transformers by constructing theoretically difficult datasets for prompt tuning. These limitations are then extended to the multi-layer setting under a specific prompt-norm restriction. From the analysis in Theorems 2 and 3, we note that the limitation of prompt tuning primarily arises from the correlation across different inputs. Broadly speaking, prompt tuning transforms different inputs only via additional attention values, which is more restrictive than the transformations that MLP layers apply to input tokens. An interesting potential direction to improve prompt tuning is designing a mechanism that leverages prompting to generate prompt-dependent adapter/LoRA updates. We expect future work to focus on designing novel prompt-tuning strategies along this direction.

Figure 1: MSE losses at convergence for the 3 constructed datasets (following Theorem 2), plotted against soft prompt length (10 to 5000 tokens). We plot bold curves with increasing prompt length for prompt tuning and dashed fixed lines for fine-tuning (all three datasets overlapping).
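Referring back to the Section 7.2 comparison, the sketch below shows a width-2 (rank-2) LoRA-style update on the MLP weights of the single-layer model (cf. Lemma 5). The frozen base weights, post-attention features, and targets are hypothetical placeholders; this is an illustrative reading of the setup, not the exact experimental code.

```python
# Rank-2 LoRA update on the MLP block: W1 -> W1 + B1 @ A1, W2 -> W2 + B2 @ A2,
# with only the low-rank factors trainable. Dimensions and data are hypothetical.
import torch

d, r, rank = 10, 32, 2
W1, b1 = torch.randn(r, d), torch.zeros(r)     # frozen pretrained MLP weights
W2, b2 = torch.randn(d, r), torch.zeros(d)
A1, B1 = torch.zeros(rank, d, requires_grad=True), torch.randn(r, rank, requires_grad=True)
A2, B2 = torch.zeros(rank, r, requires_grad=True), torch.randn(d, rank, requires_grad=True)

def mlp_lora(x):                                # token-wise MLP with residual and low-rank deltas
    h = torch.relu((W1 + B1 @ A1) @ x + b1)
    return (W2 + B2 @ A2) @ h + b2 + x

# Post-attention features and target last tokens for n = 2 toy examples.
feats, targets = [torch.rand(d), torch.rand(d)], [torch.rand(d), torch.rand(d)]
opt = torch.optim.Adam([A1, B1, A2, B2], lr=1e-2)
for _ in range(2000):
    loss = sum(((mlp_lora(f) - t) ** 2).mean() for f, t in zip(feats, targets))
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```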
Figure 2: Prompt spectral norm and validation loss versus training steps during tuning on the SuperGLUE RTE dataset; the spectral norm increases as the validation loss decreases.

Limitations. While our results provide valuable insights, extending the construction in Theorem 2 to multiple layers and deriving tighter bounds for Lemma 6 are critical steps towards a deeper understanding of the limitations of prompt tuning.

Acknowledgments and Disclosure of Funding. We thank the reviewers for their invaluable feedback. The work is supported in part by NSF 2008173, 2048280, 2325121, 2331966, and ONR N00014-23-1-2300:P00001.

References

Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7319-7328, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.568. URL https://aclanthology.org/2021.acl-long.568.

Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? Investigations with linear models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=0g0X4H8yN4I.

Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Joern-Henrik Jacobsen. Invertible residual networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 573-582. PMLR, 09-15 Jun 2019. URL https://proceedings.mlr.press/v97/behrmann19a.html.

Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1-9, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-short.1. URL https://aclanthology.org/2022.acl-short.1.

Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Ales Tamchyna. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12-58, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W/W14/W14-3302.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. PaLM: Scaling language modeling with pathways, 2022.

George Dasoulas, Kevin Scaman, and Aladin Virmaux. Lipschitz normalization for self-attention layers with application to graph neural networks, 2021.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.

Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you need: pure attention loses rank doubly exponentially with depth. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 2793-2803. PMLR, 18-24 Jul 2021. URL https://proceedings.mlr.press/v139/dong21a.html.

Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021.

Guang-Bin Huang. Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Transactions on Neural Networks, 14(2):274-281, 2003.

Guang-Bin Huang and Haroon A. Babri. Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded nonlinear activation functions. IEEE Transactions on Neural Networks, 9(1):224-229, 1998.

S-C Huang and Y-F Huang. Bounds on number of hidden neurons of multilayer perceptrons in classification and recognition. In 1990 IEEE International Symposium on Circuits and Systems (ISCAS), pages 2500-2503. IEEE, 1990.

Hyunjik Kim, George Papamakarios, and Andriy Mnih. The Lipschitz constant of self-attention, 2021.

Junghwan Kim, Michelle Kim, and Barzan Mozafari. Provable memorization capacity of transformers. In The Eleventh International Conference on Learning Representations.
Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045-3059, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.243. URL https://aclanthology.org/2021.emnlp-main.243.

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582-4597, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.353. URL https://aclanthology.org/2021.acl-long.353.

Zhiyuan Li, Srinadh Bhojanapalli, Manzil Zaheer, Sashank Reddi, and Sanjiv Kumar. Robust training of neural networks using scale invariant architectures. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 12656-12684. PMLR, 17-23 Jul 2022. URL https://proceedings.mlr.press/v162/li22b.html.

Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient low-rank hypercomplex adapter layers, 2021.

Sadhika Malladi, Alexander Wettig, Dingli Yu, Danqi Chen, and Sanjeev Arora. A kernel-based view of language model fine-tuning, 2023.

Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Belkada Younes, and Paul Sayak. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.

Quynh Nguyen and Matthias Hein. Optimization landscape and expressivity of deep CNNs. In International Conference on Machine Learning, pages 3730-3739. PMLR, 2018.

Jorge Pérez, Pablo Barceló, and Javier Marinkovic. Attention is Turing-complete. Journal of Machine Learning Research, 22(75):1-35, 2021. URL http://jmlr.org/papers/v22/20-302.html.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts, 2020.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pages 35151-35174. PMLR, 2023.

James Vuckovic, Aristide Baratin, and Remi Tachet des Combes. A mathematical theory of attention, 2020.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 32, 2019.

Colin Wei, Yining Chen, and Tengyu Ma. Statistically meaningful approximation: a case study on approximating Turing machines with transformers. CoRR, abs/2107.13163, 2021. URL https://arxiv.org/abs/2107.13163.
Colin Wei, Sang Michael Xie, and Tengyu Ma. Why do pretrained language models help in downstream tasks? An analysis of head and prompt tuning, 2022.

Masami Yamasaki. The lower bound of the capacity for a neural network with multiple hidden layers. In ICANN 93: Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, The Netherlands, 13-16 September 1993, pages 546-549. Springer, 1993.

Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? arXiv preprint arXiv:1912.10077, 2019a.

Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity. Advances in Neural Information Processing Systems, 32, 2019b.

Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=ByxRM0Ntvr.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107-115, 2021.

Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J. Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models?, 2020.

A Experimental Details

All the experiments are run on an NVIDIA RTX A6000 GPU. For experiments with the LLaMA 7B model, we use batch size 32 and learning rate 0.001. For the experiment on WMT14 En-Fr translation, we only compute the loss on the first 100 examples for computational efficiency. We use the Adam optimizer with the optimal learning rate from a grid search, 0.1 for prompt tuning and 0.001 for fine-tuning, in Section 7.2. In Section 7.3, we use the default loss function in the Hugging Face implementation for causal language models. We use prompt length m = 10 and the prompt tokens are initialized as the first m tokens in the model vocabulary.

B Additional Experiments

As mentioned in Section 7.3, the second real-world dataset used in our experiments is WMT14 En-Fr translation, used to illustrate that the spectral norm of the soft prompt increases during training. We show the curve in Figure 3.

Figure 3: Prompt spectral norm and validation loss versus training steps during tuning on the WMT14 En-Fr translation dataset; the spectral norm increases as the validation loss decreases.

C Proof of Lemmas and Theorems

C.1 Proof of Lemma 1

For the sequence-to-sequence function space $\mathcal{F}_L$ with functions $f : [0,1]^{d \times m} \to [0,1]^{d \times m}$, we can build a sequence-to-sequence function $\bar{g} : [0,1]^{d \times (m_p + m)} \to [0,1]^{d \times (m_p + m)}$ such that for any $f \in \mathcal{F}_L$, we can find $P \in \mathbb{R}^{d \times m_p}$ with $d_p(\bar{g}([P, \cdot])_{:,\, m_p:}, f) \le \epsilon/2$.

Proof. We first quantize the input and output sequence space $[0,1]^{d \times m}$ into a grid space $G_{\delta,m} = \{0, \delta, 2\delta, \ldots, 1-\delta\}^{d \times m}$, which leads to $C = (1/\delta^{dm})^{1/\delta^{dm}}$ functions considering all input-output mappings on this grid. We index these $C$ functions as $\bar{\mathcal{F}}_L = \{\bar{f}_1, \bar{f}_2, \ldots, \bar{f}_C\}$. For $X \notin G_{\delta,m}$, we let $\bar{f}_i(X) = \bar{f}_i(X')$ if $k_{i,j}\delta < X_{i,j}, X'_{i,j} \le (k_{i,j}+1)\delta$ and $X' \in G_{\delta,m}$. Then for any $f \in \mathcal{F}_L$, we can find a function $\bar{f} \in \bar{\mathcal{F}}_L$ such that $d_p(f, \bar{f}) = (\int \|f(X) - \bar{f}(X)\|_p^p\, dX)^{1/p} \le (\int L^p m d\, \delta^p\, dX)^{1/p} = L(md)^{1/p}\delta$. We choose $\delta = \delta_1$ such that $L(md)^{1/p}\delta \le \epsilon/2$. For the prompt part, we choose $m_p$ such that $(1/\delta)^{d \cdot m_p} \ge C$.
Then we can build a set of quantized prompts in $G_{\delta,m_p} = \{0, \delta, 2\delta, \ldots, 1-\delta\}^{d \times m_p}$ to index these $C$ functions. We denote this set of prompts as $\{P_1, P_2, \ldots, P_C\}$. Finally, we create the quantized function $\bar{g}$ and let $\bar{g}([P_i, X])_{:,\, m_p:} = \bar{f}_i(X)$ and $\bar{g}([P_i, X])_{:,\, :m_p} = 0$ for all $X \in [0,1]^{d \times m}$, $P_i \in G_{\delta, m_p}$. For $P \notin G_{\delta,m_p}$, we set $\bar{g}([P, X]) = \bar{g}([P', X])$ if $k_{i,j}\delta < P_{i,j}, P'_{i,j} \le (k_{i,j}+1)\delta$ and $P' \in G_{\delta,m_p}$. Therefore, with a properly chosen $\delta = \delta_1$, for any $f \in \mathcal{F}_L$ we can find $P \in \mathbb{R}^{d \times m_p}$ such that $d_p(f, \bar{g}([P, \cdot])_{:,\, m_p:}) = d_p(f, \bar{f}) \le \epsilon/2$.

C.2 Proof of Lemma 2

For any given quantized function $\bar{f} : [0,1]^{d \times m} \to [0,1]^{d \times m}$ with quantization at interval $\delta$, $\exists\, \bar{h} \in \bar{\mathcal{T}}^{2,1,1}$ such that $\bar{f} = \bar{h}$ with positional embedding
$$E = \begin{bmatrix} 0 & 1 & 2 & \cdots & m-1 \\ 0 & 1 & 2 & \cdots & m-1 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 1 & 2 & \cdots & m-1 \end{bmatrix}.$$

Proof. The proof follows Section C of the appendix of Yun et al. [2019a]. By Section C.1 of Yun et al. [2019a], there exists a function $g_q$ composed of $dm/\delta$ token-wise feed-forward layers with hidden layer size $r = 1$ and ReLU activation that implements the following scalar quantization on each input element:
$$g_q^{\mathrm{ent}}(t) = \begin{cases} k\delta, & k\delta \le t < (k+1)\delta,\ k = 0, 1, \ldots, m/\delta - 1 \\ -\delta^{-md}, & \text{otherwise.} \end{cases}$$
Then, by Section C.2 of Yun et al. [2019a], we can stack $m(1/\delta)^d + 1$ attention layers to map all possible input sequences in the grid $[0 : \delta : 1-\delta]^d \times [1 : \delta : 2-\delta]^d \times \cdots \times [m-1 : \delta : m-\delta]^d$ to distinct numbers which are at least $\delta$ apart from each other. Finally, we only require $O(m(1/\delta)^{dm})$ layers to map these distinct numbers to the expected outputs.

C.3 Proof of Lemma 3

Lemma 3 is almost the same as in [Yun et al., 2019a], except that we use $\epsilon/2$ instead of $\epsilon/3$.

C.4 Extension of Theorem 1 to Next-Token Predictors

As an extension of Theorem 1, we consider approximating a set of sequence-to-sequence functions when we use a transformer as a next-token predictor. We consider a set of sequence-to-sequence functions $\mathcal{F}_L$ with Lipschitz constant $L$ under norm $p$. Each $f \in \mathcal{F}_L$, $f : [0,1]^{d \times m_1} \to [0,1]^{d \times m_2}$, accepts an input of length $m_1$ and outputs a sequence of length $m_2$. For any $X, X' \in [0,1]^{d \times m_1}$, we have $\|f(X) - f(X')\|_p \le L\|X - X'\|_p$. Next we show that we can construct a transformer $\tau$ which can approximate any $f \in \mathcal{F}_L$ with prompt tuning when used as a next-token predictor.

Theorem 5. We can construct a transformer $\tau$ such that for any $f \in \mathcal{F}_L$, $1 \le p < \infty$ and $\epsilon > 0$, we can find a prompt $P \in [0,1]^{d \times m_p}$ such that $d_p(f, h(P)) \le \epsilon$, where $h(P) = [\tau_1([P, \cdot])_{:,-1}, \tau_2([P, \cdot])_{:,-1}, \ldots, \tau_{m_2}([P, \cdot])_{:,-1}]$ and $\tau_i$ is the sequence-to-sequence function implemented by the transformer $\tau$ when accepting sequences of length $m_p + m_1 + i$ (the previously generated output tokens are appended to the input).

Proof. Similar to Theorem 1, we quantize the inputs to the grid $[0 : \delta : 1-\delta]$ with interval $\delta$, where $\delta$ is chosen such that $L(m_1 d)^{1/p}\delta \le m_2^{-1/p}\epsilon$, and we choose $m_p$ such that $\delta^{-d m_p} \ge (\delta^{-d m_2})^{\delta^{-d m_1}}$. We index the $C = (\delta^{-d m_2})^{\delta^{-d m_1}}$ different quantized functions as $\bar{f}^1, \bar{f}^2, \ldots, \bar{f}^C$, and the sub-function of $\bar{f}^j$ that generates the $i$-th output token as $\bar{f}^j_i$. The $C$ sequence-to-sequence functions can then be indexed by $C$ distinct prompts. Similar to Lemma 2, we can construct a transformer which maps all possible input sequences in the grids $[0 : \delta : 1-\delta]^d \times \cdots \times [m'-1 : \delta : m'-\delta]^d$, $0 < m' \le m_1 + m_2 + m_p - 1$, to distinct numbers. A final series of MLP layers then maps these distinct numbers to the desired output vectors, where inputs in the same grid cell are mapped to the same output token at each step. Then for any input $x \in [0,1]^{d \times m_1}$ and any $\bar{f}^j$, we can find a prompt $P_j$ such that
$$\|\tau([P_j, x])_{:,-1} - \bar{f}^j_0(x)\|_p \le m_2^{-1/p}\epsilon,$$
$$\|\tau([P_j, x, \tau([P_j, x])_{:,-1}])_{:,-1} - \bar{f}^j_1([x, \tau([P_j, x])_{:,-1}])\|_p \le m_2^{-1/p}\epsilon,$$
and so on for the remaining output positions. Then we have $d_p(h(P_j), f^j) \le \epsilon$.
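The map $h(P)$ in Theorem 5 is simply autoregressive decoding with a fixed prompt. The sketch below illustrates it with a toy, hypothetical stand-in for the transformer (a fixed random single-head layer), not the constructed network from the proof.

```python
# A minimal sketch of the next-token-prediction map h(P): apply the transformer
# repeatedly, appending the last output token to the input each step.
import torch

d, m1, m2, m_p = 4, 3, 2, 2
torch.manual_seed(0)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def tau(Z):                                     # toy single-head attention "transformer"
    scores = (Wk @ Z).T @ (Wq @ Z)
    return Wv @ Z @ torch.softmax(scores, dim=0) + Z

def h(P, X):                                    # autoregressive decoding of m2 tokens
    seq, outputs = torch.cat([P, X], dim=1), []
    for _ in range(m2):
        y = tau(seq)[:, -1:]                    # read off the last output token
        outputs.append(y)
        seq = torch.cat([seq, y], dim=1)        # append it and predict the next one
    return torch.cat(outputs, dim=1)            # a d x m2 output sequence

P, X = torch.rand(d, m_p), torch.rand(d, m1)
print(h(P, X).shape)                            # torch.Size([4, 2])
```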
C.5 Proof of Lemma 4

If $\|W_1\|_2 \|W_2\|_2 < 1$, where $\|\cdot\|_2$ is the matrix spectral norm, then the MLP block in Definition 2 is invertible, i.e., $\mathrm{MLP}^{-1}(y)$ is a singleton set.

Proof. Based on the sufficient conditions for invertibility of a residual block [Behrmann et al., 2019], if the feed-forward part of a residual block, $W_2\mathrm{ReLU}(W_1 X_{:,1} + b_1) + b_2$, is a contraction with respect to some metric, i.e., its Lipschitz constant is $< 1$, and the metric space on which it is defined is complete, then the MLP in Eq. (2) is invertible. Since we are dealing with Euclidean space, any metric induced by the $p$-norm for $p \in [1, \infty]$ ensures the space is complete. The Lipschitz constant of $W_2\mathrm{ReLU}(W_1 x + b_1) + b_2$ is at most $\|W_1\|_2\|W_2\|_2$. Thus the statement of the lemma follows.

C.6 Proof of Theorem 2

For a single-layer transformer $\tau$ defined above with Assumptions 1 and 2, we can build a sequence-to-sequence dataset $\{(X_1 = [x_1, x_0], Y_1 = [y_{11}, y_{10}]), (X_2 = [x_2, x_0], Y_2 = [y_{21}, y_{20}])\}$ such that we cannot find a prompt $P \in \mathbb{R}^{d \times m_p}$ with any $m_p > 0$ for which $\tau([P, X_i]) = Y_i$ holds for both $i = 1, 2$. The vectors $x_0, x_1, x_2$ are denoted post positional encodings.

Proof. Before proving Theorem 2, we first provide a lemma that is used both in this proof and in Theorem 3.

Lemma 7. Given any $c \in \mathbb{R}^d$, for almost every $x_0$ we can find another vector $x_1 \in \mathbb{R}^d$ such that $\mathrm{Att}(x_0, [x_0, x_1]) \parallel c$ with full-rank attention weights $W_q, W_k, W_v$.

Proof. If $W_v x_0 \parallel c$, we can just set $x_1 = x_0$, which makes $\mathrm{Att}(x_0, [x_0, x_1]) \parallel c$ hold. If $W_v x_0 \nparallel c$, let $v = \alpha c - W_v x_0$ where $\alpha \in \mathbb{R}$. As $W_v$ is full-rank, we can find $x_1 = W_v^{-1} v = \alpha W_v^{-1} c - x_0$. Then we have
$$\mathrm{Att}(x_0, [x_0, x_1]) = \frac{\exp\big((W_q x_0)^\top (W_k x_0)\big) W_v x_0 + \exp\big((W_q x_0)^\top (W_k(\alpha W_v^{-1} c - x_0))\big)(\alpha c - W_v x_0)}{\exp\big((W_q x_0)^\top (W_k x_0)\big) + \exp\big((W_q x_0)^\top (W_k(\alpha W_v^{-1} c - x_0))\big)}.$$
Therefore, as long as $W_q x_0$ is not orthogonal to $W_k(W_v^{-1} c)$, we can vary $\alpha$ such that $\mathrm{Att}(x_0, [x_0, x_1]) = \beta W_v x_0 + (1-\beta)(\alpha c - W_v x_0)$, where
$$\beta = \frac{\exp\big((W_q x_0)^\top (W_k x_0)\big)}{\exp\big((W_q x_0)^\top (W_k x_0)\big) + \exp\big((W_q x_0)^\top (W_k(\alpha W_v^{-1} c - x_0))\big)}.$$
When $\alpha = 0$, $\mathrm{Att}(x_0, [x_0, x_1]) = W_v x_0$; when $\alpha \to \infty$ or $\alpha \to -\infty$, $\mathrm{Att}(x_0, [x_0, x_1]) \to \alpha c - W_v x_0$. As $\mathrm{Att}(x_0, [x_0, x_1])$ is continuous with respect to $\alpha$, there must exist an $\alpha$ such that $\mathrm{Att}(x_0, [x_0, x_1]) \parallel c$.

Passing the two input sequences $X_1, X_2$ through the attention layer $\mathrm{Att}$ with any prompt $P$, we get the last output token as:
$$\mathrm{Att}(x_0, [P, X_1]) = \lambda(X_1, x_0, [P, X_1])\,\mathrm{Att}(x_0, X_1) + \lambda(P, x_0, [P, X_1])\,\mathrm{Att}(x_0, P), \quad (12)$$
$$\mathrm{Att}(x_0, [P, X_2]) = \lambda(X_2, x_0, [P, X_2])\,\mathrm{Att}(x_0, X_2) + \lambda(P, x_0, [P, X_2])\,\mathrm{Att}(x_0, P). \quad (13)$$
Here $\lambda(X_1, x_2, X_3 = [X_1, X_2]) \in (0, 1)$ is a positive scalar defined as
$$\lambda(X_1, x_2, X_3) = \frac{\sum_j \exp\big((W_k x_{1j})^\top (W_q x_2)\big)}{\sum_j \exp\big((W_k x_{3j})^\top (W_q x_2)\big)},$$
where $x_{ij}$ denotes the $j$-th token in $X_i$ for notational simplicity.
1. From Equations (12) and (13), $\mathrm{Att}(x_0, P)$ must lie in $\mathrm{Cone}(-\mathrm{Att}(x_0, X_1), \mathrm{Att}(x_0, [P, X_1]))$ and in $\mathrm{Cone}(-\mathrm{Att}(x_0, X_2), \mathrm{Att}(x_0, [P, X_2]))$.
2. On the other hand, as we want to memorize the two examples, we must have $\mathrm{Att}(x_0, [P, X_1]) + x_0 \in \mathrm{MLP}^{-1}(y_{10})$ and $\mathrm{Att}(x_0, [P, X_2]) + x_0 \in \mathrm{MLP}^{-1}(y_{20})$.
We construct the dataset $S$ with arbitrary $x_0, y_{10}$ and $y_{20}$. Then if $\dim\big((\mathrm{MLP}^{-1}(y_{10}) - x_0) \cup (\mathrm{MLP}^{-1}(y_{20}) - x_0)\big) + 2 \le d$ (Assumption 2), we can find two vectors $c_1, c_2$ such that $c_1, c_2 \perp \{v : v + x_0 \in \mathrm{MLP}^{-1}(y_{10}) \text{ or } v + x_0 \in \mathrm{MLP}^{-1}(y_{20})\}$ and $c_1 \perp c_2$. Then we can choose $x_1, x_2$ such that $\mathrm{Att}(x_0, X_1) \parallel c_1$ and $\mathrm{Att}(x_0, X_2) \parallel c_2$ (Lemma 7). Combining this construction with Assumption 1, we have that $\mathrm{Cone}(-\mathrm{Att}(x_0, X_1), \mathrm{Att}(x_0, [P, X_1]))$ and $\mathrm{Cone}(-\mathrm{Att}(x_0, X_2), \mathrm{Att}(x_0, [P, X_2]))$ have no intersection, which means that we cannot find a $P$ that memorizes this constructed dataset.
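As a numerical illustration of the continuity argument in Lemma 7, the snippet below sets $x_1 = \alpha W_v^{-1}c - x_0$ and scans over $\alpha$ to make $\mathrm{Att}(x_0, [x_0, x_1])$ nearly parallel to a target direction $c$. This is a simple scan rather than the gradient-based procedure used in Section 7.2, and the random weights and target are hypothetical.

```python
# Sweep alpha so that Att(x0, [x0, x1]) with x1 = alpha * Wv^{-1} c - x0 aligns with c.
import numpy as np

rng = np.random.default_rng(0)
d = 6
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
x0, c = rng.normal(size=d), rng.normal(size=d)

def att_last(x0, X):                       # single-head Att(x0, X) from Definition 1 (Wo = I)
    scores = (Wk @ X).T @ (Wq @ x0)
    w = np.exp(scores - scores.max()); w /= w.sum()
    return Wv @ X @ w

def cos_angle(u, v):                       # |cos| of the angle between u and v
    return abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

best = max(
    (cos_angle(att_last(x0, np.stack([x0, a * np.linalg.solve(Wv, c) - x0], axis=1)), c), a)
    for a in np.linspace(-50, 50, 20001)
)
print(best)   # a cosine close to 1 indicates Att(x0, [x0, x1]) is nearly parallel to c
```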
C.7 Proof of Lemma 5

For a standard single-layer transformer $\tau$ defined in Definition 2 with $r \ge n$ MLP hidden neurons, for any sequence-to-sequence dataset $S$ satisfying Assumption 1, we can apply a low-rank update to the MLP weights with $O(nd)$ parameters to memorize $\tau(X_i)_{:,m} = y_{im}$.

Proof. We use $\mathrm{MLP}_j(x)$ to denote the $j$-th output of the MLP layer for an input token $x$, which is
$$\mathrm{MLP}_j(x) = x_j + b_{2,j} + \sum_{k=1}^{r} w_{k,j} \max(\langle a_k, x \rangle + b_{1,k}, 0),$$
where $a_k$ denotes the $k$-th row of $W_1$ and $w_{k,j}$ the corresponding element of $W_2$. According to our assumption, $\mathrm{Att}(x_{im}, X_i)$ are unique vectors for $i = 1, 2, \ldots, n$. Then we only need the MLP layer to map each $x'_i = \mathrm{Att}(x_{im}, X_i) + x_{im}$ to $y_{im}$, which gives a new token-wise dataset $\{(x'_1, y_1), (x'_2, y_2), \ldots, (x'_n, y_n)\}$. We then need to find $w_k$, $a_k$ and $b_k$ such that
$$\mathrm{MLP}_j(x'_i) = x'_{i,j} + b_{2,j} + \sum_{k=1}^{r} w_{k,j} \max(\langle a_k, x'_i \rangle + b_{1,k}, 0) = y_{i,j}, \quad i = 1, \ldots, n,\ j = 1, \ldots, d,$$
which is equivalent to constructing a standard MLP to memorize a dataset:
$$\sum_{k=1}^{n} w_{k,j} \max(\langle a_k, x'_i \rangle + b_{1,k}, 0) = y_{i,j} - x'_{i,j} - \sum_{k=n+1}^{r} w_{k,j} \max(\langle a_k, x'_i \rangle + b_{1,k}, 0) - b_{2,j}. \quad (15)$$
Following Theorem 1 in Yun et al. [2019b], we can construct $a, b_1, \ldots, b_n$ such that for $x'_1, x'_2, \ldots, x'_n$, with $z_i = \langle a, x'_i \rangle$, we have $b_1 < z_1 < b_2 < \cdots < b_n < z_n$. Then we can find $w_1, \ldots, w_n$ that solve Equation (15). For a $d$-dimensional output, we need to find $W \in \mathbb{R}^{n \times d}$, $a \in \mathbb{R}^d$ and $b \in \mathbb{R}^n$. With LoRA, we need a low-rank update of size $mn + nd$ for $W_2$, a low-rank update of size $dn + nm$ for $W_1$, and an update of size $n$ for $b_1$, which is $O(nd)$ in total. Normally we have $m \le d$, so we need an update with parameter size around $(4n+1)d$ to memorize the last token of $n$ training examples.

C.8 Proof of Theorem 3

For any single-layer transformer $\tau$ defined in Definition 2, there exists a sequence-to-sequence dataset $\{(X_1 = [x_{10}, x_1], [y_{10}, y_{11}]), (X_2 = [x_{20}, x_2], [y_{20}, y_{21}]), \ldots, (X_n = [x_{n0}, x_n], [y_{n0}, y_{n1}])\}$ that satisfies Assumption 1 with $n < d$ training examples, such that we need at least $n$ prompt tokens in $P$ to memorize the training set, i.e., for $\tau([P, X_i])_{:,-1} = y_{i1}$ to hold for all $i = 1, 2, \ldots, n$.

Proof. Without loss of generality, we assume $W_2$ has no zero elements; otherwise we can simply ignore the corresponding hidden neuron in the MLP layer. Let $\{t_j : j = 1, 2, \ldots, d\}$ be a basis of $\mathbb{R}^d$. Then $\mathrm{MLP}^{-1}(y_{i1})$ must be bounded on either the positive or the negative side of each of these $d$ directions, i.e., there exists $B \ge 0$ such that $\langle v, t_j \rangle \le B$ (or $\ge -B$) for all $v \in \mathrm{MLP}^{-1}(y_{i1})$, $j = 1, 2, \ldots, d$. Otherwise, for any $B > 0$ and some direction $t_j$, we could find $v \in \mathrm{MLP}^{-1}(y_{i1})$ with $|\langle v, t_j \rangle| / \|t_j\| \ge B$. Meanwhile, $\mathrm{MLP}(v) = v + b_2 + W_2\mathrm{ReLU}(W_1 v + b_1)$. As $v$ can be arbitrarily large, if $W_1 v = 0$ then $\mathrm{MLP}(v) \to \infty$ as $\|v\| \to \infty$; if $W_1 v \ne 0$, $\mathrm{MLP}(v)$ can also become arbitrarily large when the norm of $v$ increases, due to the non-linearity of $\mathrm{ReLU}(W_1 v + b_1)$. This contradicts $\mathrm{MLP}(v) = y_{i1}$. Then we can find a set of $n$ linearly independent vectors $\{c_1, c_2, \ldots, c_n\}$ such that $\{a_i : a_i \parallel c_i,\ a_i \in \mathrm{MLP}^{-1}(y_{i1})\} = \emptyset$ by enlarging the norm of each $c_i$. With the $n$ vectors $c_i$ we construct our dataset: we set $x_i = c_i$, $i = 1, 2, \ldots, n$, and find $x_{i0}$ such that $c_i \parallel \mathrm{Att}(x_i, X_i)$ (Lemma 7) and the $\mathrm{Att}(x_i, X_i)$ are distinct for $i = 1, 2, \ldots, n$ (Assumption 1). This makes $\{a_1 - x_1 - \lambda(X_1, x_1, [P, X_1])\mathrm{Att}(x_1, X_1), \ldots, a_n - x_n - \lambda(X_n, x_n, [P, X_n])\mathrm{Att}(x_n, X_n)\}$ linearly independent for any $a_i \in \mathrm{MLP}^{-1}(y_{i1})$. Here $\lambda(\cdot, \cdot, \cdot)$ is the same as defined in Section C.6. Moreover, we have
$$\mathrm{Att}(x_i, [P, X_i]) = \lambda(P, x_i, [P, X_i])\,\mathrm{Att}(x_i, P) + \lambda(X_i, x_i, [P, X_i])\,\mathrm{Att}(x_i, X_i) \in \mathrm{MLP}^{-1}(y_{i1}) - x_i. \quad (16)$$
Then $\mathrm{Att}(x_i, P)$, $i = 1, 2, \ldots, n$, must be $n$ linearly independent vectors, which requires
$$\mathrm{rank}(W_v P A) = n, \quad (17)$$
where $A \in \mathbb{R}^{m_p \times n}$ is the attention score matrix between the $x_i$ and $P$, $P \in \mathbb{R}^{d \times m_p}$ is the prompt token sequence, and $W_v$ is the attention value weight matrix. Therefore, we must have $m_p \ge n$.
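The rank constraint (17) has a simple numerical illustration: for a fixed prompt $P$, every prompt-attended output $\mathrm{Att}(x, P)$ lies in the column space of $W_v P$, so at most $m_p$ linearly independent directions can be produced no matter how many query tokens are used. The sketch below checks this with random, purely illustrative weights and tokens.

```python
# Att(x, P) = Wv P softmax((Wk P)^T Wq x) lies in the column space of Wv P,
# so its rank over many different queries x is at most m_p.
import numpy as np

rng = np.random.default_rng(1)
d, m_p, n_queries = 12, 3, 50
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
P = rng.normal(size=(d, m_p))              # a fixed soft prompt with m_p tokens

def att_to_prompt(x):
    scores = (Wk @ P).T @ (Wq @ x)
    w = np.exp(scores - scores.max()); w /= w.sum()
    return Wv @ P @ w

outputs = np.stack([att_to_prompt(rng.normal(size=d)) for _ in range(n_queries)], axis=1)
print(np.linalg.matrix_rank(outputs))      # at most m_p (= 3), despite 50 distinct queries
```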
C.9 Proof of Theorem 4

A transformer T is invertible if:
1. the Lipschitz constant of the attention block in each layer $\tau^l$ is strictly less than 1;
2. the Lipschitz constant of the 2-layer ReLU block in each layer $\tau^l$, which is bounded by $\|W_2^l\|_2 \|W_1^l\|_2$, is strictly less than 1.

Proof. This proof builds on the proof of Lemma 4, so we restrict ourselves to a sketch. Based on the sufficient condition for invertibility in Behrmann et al. [2019], condition (1) implies that the attention block (eq. 1) with the residual connection, i.e., $X + \mathrm{Att}(X, X)$, is an invertible function. Similarly, condition (2) implies that the MLP block, which consists of the 2-layer ReLU block with the residual connection (eq. 2), is also invertible. Thus each transformer layer $\tau^l$ (eq. 3) is invertible, being a composition of two invertible functions, and the same property ensures that the entire transformer architecture T is also invertible.

C.10 Proof of Lemma 6

Under the compactness condition, the Lipschitz constant of the $i$-th attention head in the $l$-th transformer layer, denoted for simplicity as $\mathrm{Att}^{i,l}$, admits the following bound w.r.t. the entire input sequence of length $m$:
$$\mathrm{Lip}(\mathrm{Att}^{i,l}(\cdot, \cdot)) \le \big(1 + 8\sqrt{m}\,(D^l)^2\,\|(W_k^{i,l})^\top W_q^{i,l}\|_2\big)\,\|W_v^{i,l}\|_2, \qquad (18)$$
and the Lipschitz constant of the entire attention block in layer $l$, denoted $\mathrm{Att}^{l}$, admits the bound:
$$\mathrm{Lip}(\mathrm{Att}^{l}(\cdot, \cdot)) \le \Big(\sum_i \big(\|W_o^{i,l}\|_2\,\mathrm{Lip}(\mathrm{Att}^{i,l})\big)^2\Big)^{1/2}. \qquad (19)$$

Proof. We drop the superscripts $i, l$ in the proof to avoid notational clutter. Similarly, we denote the concatenation of the prompt matrix $P$ and the original input matrix $X$ simply by $X$.

Derivation of the single-head bound (eq. 18): Consider two matrices $X_1, X_2 \in \mathcal{X} = \{X \in \mathbb{R}^{d \times m} : \|X\|_2 \le D\}$. Denote by $A_1, A_2$ the corresponding attention matrices,
$$A_1 = \sigma\big((W_k X_1)^\top W_q X_1\big), \qquad A_2 = \sigma\big((W_k X_2)^\top W_q X_2\big). \qquad (20)$$
The output of the attention head, denoted $\mathrm{Att}(\cdot)$, admits the following:
$$\|\mathrm{Att}(X_1) - \mathrm{Att}(X_2)\|_2 = \|W_v X_1 A_1 - W_v X_2 A_2\|_2 \qquad (21)$$
$$\overset{(a)}{\le} \|X_1 A_1 - X_2 A_2\|_2\,\|W_v\|_2 \qquad (22)$$
$$= \|X_1 A_1 - X_2 A_1 + X_2 A_1 - X_2 A_2\|_2\,\|W_v\|_2 \qquad (23)$$
$$\le \big(\|A_1\|_2\,\|X_1 - X_2\|_2 + \|X_2\|_2\,\|A_1 - A_2\|_2\big)\,\|W_v\|_2 \qquad (24)$$
$$\overset{(b)}{\le} \big(\|X_1 - X_2\|_2 + \|A_1 - A_2\|_2\,D\big)\,\|W_v\|_2 \qquad (25)$$
where (a) follows from the submultiplicativity of the spectral norm and in (b) we use the bounded input spectral norm assumption. We now focus on the second term $\|A_1 - A_2\|_2$ in eq. 25. From the bound in Lemma 9, we have:
$$\|A_1 - A_2\|_2 \le 2\sqrt{m}\,\|G\|_2 \qquad (26)$$
where $G$ is the diagonal matrix with entries described in Lemma 8. We can now invoke Lemma 10 to obtain:
$$\|A_1 - A_2\|_2 \le 2\sqrt{m} \cdot 4D\,\|W_k^\top W_q\|_2\,\|X_1 - X_2\|_2 = 8\sqrt{m}\,D\,\|W_k^\top W_q\|_2\,\|X_1 - X_2\|_2. \qquad (27)$$
Combining the previous inequality with eq. 25, we have the bound:
$$\|\mathrm{Att}(X_1) - \mathrm{Att}(X_2)\|_2 \le \big(1 + 8\sqrt{m}\,\|W_k^\top W_q\|_2\,D^2\big)\,\|W_v\|_2\,\|X_1 - X_2\|_2. \qquad (28)$$

Derivation of the bound for the entire block (eq. 19): The proof follows from the following property: for a matrix $C = [A, B]$, the spectral norm of $C$ admits the bound $\|C\|_2 \le \sqrt{\|A\|_2^2 + \|B\|_2^2}$. We then combine the definition of the attention block and the Lipschitz constant bound in eq. 18 with this property to obtain the desired bound.

Lemma 8 (Dong et al. [2021], Lemma A.1). Let $A_1$ be the column-stochastic matrix obtained by performing column-wise softmax of some matrix $Z_1$ (in our setting $Z_1 = (W_k X_1)^\top W_q X_1$), and let $A_2$ be the column-stochastic matrix obtained by performing column-wise softmax of a matrix $Z_2 = Z_1 - E$ (for some $E$, which need not correspond to an element of $\mathcal{X}$). Then
$$A_2(I - G) \le A_1 \le A_2(I + 2G), \qquad (29)$$
where the inequality is elementwise and $G$ is a diagonal matrix with entries $G_{ii} = \max_{j, j'} |\delta_i^\top E (\delta_j - \delta_{j'})|$. Here $\delta_k$ is the one-hot vector with entry 1 in the $k$-th dimension.
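Before stating the remaining auxiliary lemmas, here is a quick numerical sanity check of the single-head bound in eq. 18. This is a sketch only (not the paper's experiments); the weight matrices, the clipping helper, and all names are made up for illustration. It samples bounded input pairs, evaluates the head with column-wise softmax as in eq. 20, and compares the worst observed ratio against the stated bound.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, D = 6, 5, 2.0                  # token dim, sequence length, spectral-norm radius

Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

def clip_spec(X, radius):
    # Rescale X so that ||X||_2 <= radius (the compactness condition on the inputs).
    s = np.linalg.norm(X, 2)
    return X if s <= radius else X * (radius / s)

def att(X):
    # Single attention head Wv X sigma((Wk X)^T Wq X) with column-wise softmax.
    Z = (Wk @ X).T @ (Wq @ X)
    A = np.exp(Z - Z.max(axis=0, keepdims=True))
    A /= A.sum(axis=0, keepdims=True)            # column-stochastic attention matrix
    return Wv @ X @ A

bound = (1 + 8 * np.sqrt(m) * D**2 * np.linalg.norm(Wk.T @ Wq, 2)) * np.linalg.norm(Wv, 2)

worst = 0.0
for _ in range(2000):
    X1 = clip_spec(rng.standard_normal((d, m)), D)
    X2 = clip_spec(X1 + 0.05 * rng.standard_normal((d, m)), D)
    diff = np.linalg.norm(X1 - X2, 2)
    if diff > 0:
        worst = max(worst, np.linalg.norm(att(X1) - att(X2), 2) / diff)

print(worst, "<=", bound)            # worst sampled ratio vs. the bound of eq. 18
```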
Lemma 9. Following the notation of Lemma 8, we have the following spectral norm bound:
$$\|A_1 - A_2\|_2 \le 2\sqrt{m}\,\|G\|_2. \qquad (30)$$

Proof. We begin by noting the following entry-wise inequality from eq. 29:
$$-A_2 G \le A_1 - A_2 \le 2 A_2 G, \qquad (31)$$
which ensures that $\|A_1 - A_2\|_F \le 2\|A_2 G\|_F$. We also have, by matrix norm equivalence,
$$\|A_1 - A_2\|_2 \le \|A_1 - A_2\|_F. \qquad (32)$$
Invoking the matrix norm equivalence again, we have
$$2\|A_2 G\|_F \le 2\sqrt{\mathrm{rank}(A_2 G)}\,\|A_2 G\|_2, \qquad (33)$$
where $\mathrm{rank}(\cdot)$ is the matrix rank. Combining the inequalities, we attain the bound
$$\|A_1 - A_2\|_2 \le 2\sqrt{m}\,\|A_2\|_2\,\|G\|_2 \le 2\sqrt{m}\,\|G\|_2, \qquad (34)$$
where the last step uses $\|A_2\|_2 = 1$ for the column-stochastic matrix $A_2$.

Lemma 10. The term $G$ in Lemma 9 admits the following spectral norm bound:
$$\|G\|_2 \le 4D\,\|W_k^\top W_q\|_2\,\|X_1 - X_2\|_2, \qquad (35)$$
where $D$ is the previously stated spectral norm bound on the inputs $X_1, X_2 \in \mathcal{X}$.

Proof. We begin by noting that since $G$ is a square diagonal matrix with non-negative real entries, the singular values of $G$ are its diagonal elements. We thus have $\|G\|_{\max} = \|G\|_2$, where $\|\cdot\|_{\max}$ is the max norm. Since $G$ admits the form described in Lemma 8, it follows that
$$\|G\|_{\max} = \max_{i, j, j'} |E_{i,j} - E_{i,j'}| \qquad (36)$$
$$\le 2\|E\|_{\max} \le 2\|E\|_2, \qquad (37)$$
where the second inequality follows from matrix norm equivalence. Now we can bound the term $\|E\|_2$ by noting that the inputs belong to a bounded set:
$$\|E\|_2 = \|(W_k X_1)^\top W_q X_1 - (W_k X_2)^\top W_q X_2\|_2 \qquad (38)$$
$$= \|(W_k X_1)^\top W_q X_1 - (W_k X_1)^\top W_q X_2 + (W_k X_1)^\top W_q X_2 - (W_k X_2)^\top W_q X_2\|_2 \qquad (39)$$
$$\le \big(\|X_1\|_2\,\|W_k^\top W_q\|_2 + \|X_2\|_2\,\|W_k^\top W_q\|_2\big)\,\|X_1 - X_2\|_2 \qquad (40)$$
$$\le 2D\,\|W_k^\top W_q\|_2\,\|X_1 - X_2\|_2. \qquad (41)$$
Combining eq. 37 with eq. 41 yields eq. 35.

C.11 Extension of Lemma 6

Lemma 6 and Theorem 4 operate over functions from $\mathcal{P}^1 \times \mathcal{X}^1 \to \mathcal{P}^{L+1} \times \mathcal{X}^{L+1}$. We can relax the requirement on the prompt and provide the Lipschitz constant upper bound for functions of the form $\mathcal{X}^1 \to \mathcal{X}^{L+1}$ by using the following assumption:

Assumption 3. Assume for simplicity that $\|P_1^l - P_2^l\|_2 \le \alpha^l \|X_1^l - X_2^l\|_2$ for all $l \ge 1$, with $\alpha^l = 0$ when $l = 1$.

Note: a recursive expression for $\alpha^l$ in the above assumption can be provided, but the expression does not admit a simplified form and we thus omit it here. We will use $D_{\mathcal{X}}^l$, akin to eq. 9, to denote the compactness constant corresponding to the input matrix across the layers. Based on this assumption, we have the following Lipschitz constant upper bound:

Lemma 11. The Lipschitz constant of the single head $\mathrm{Att}^{i,l}$ admits the following bound w.r.t. the input part $X^1$, of length $m_X$, of the input sequence:
$$\mathrm{Lip}(\mathrm{Att}^{i,l}(\cdot, \cdot)) \le \Big(\sqrt{1 + (\alpha^l)^2} + 8\sqrt{m_X}\,(D_{\mathcal{X}}^l)^2\,(1 + (\alpha^l)^2)\,\|(W_k^{i,l})^\top W_q^{i,l}\|_2\Big)\,\|W_v^{i,l}\|_2. \qquad (42)$$
For $l = 1$, $\alpha^l = 0$ in the above bound. The Lipschitz constant of the entire attention block in layer $l$ follows similarly.

Proof. For some first-layer input $X_1^1$ and prompt $P$, let us denote the direct output of the attention head in the $l$-th layer by $\tilde{X}_1^l$. We have the following update rule:
$$\tilde{X}_1^l = W_v^{i,l}\,[P_1^l, X_1^l]\,\sigma\big((W_k^{i,l}[P_1^l, X_1^l])^\top W_q^{i,l} X_1^l\big) = W_v^{i,l}\,[P_1^l, X_1^l]\,A_1^l. \qquad (43)$$
Here $P_1^l$ is the updated prompt matrix w.r.t. the input. For two different inputs $X_1^1$ and $X_2^1$ at the first layer, $P_1^1 = P_2^1 = P$, since the prompt is the same across all inputs. $A_1^l$ is then simply the corresponding column-stochastic matrix. With the context clear, we now drop the superscripts $i, l$, as done previously. For $\|\tilde{X}_1 - \tilde{X}_2\|_2$, we have:
$$\|\tilde{X}_1 - \tilde{X}_2\|_2 \le \|[P_1, X_1] A_1 - [P_2, X_2] A_2\|_2\,\|W_v\|_2 \le \Big(\sqrt{\|P_1 - P_2\|_2^2 + \|X_1 - X_2\|_2^2} + \sqrt{\|P_2\|_2^2 + \|X_2\|_2^2}\,\|A_1 - A_2\|_2\Big)\,\|W_v\|_2, \qquad (44)$$
where the second inequality is attained by splitting the difference as in eqs. 23-24 and using the spectral norm property of concatenated matrices. We now consider the term $\|A_1 - A_2\|_2$.
By invoking Lemmas 9 and 10, we have:
$$\|A_1 - A_2\|_2 \le 2\sqrt{m_X} \cdot 2\|E\|_2, \quad \text{where } E = (W_k[P_1, X_1])^\top W_q X_1 - (W_k[P_2, X_2])^\top W_q X_2. \qquad (45)$$
Invoking Assumption 3, $\|E\|_2$ can further be bounded as:
$$\|E\|_2 \le 2 D_{\mathcal{X}} \sqrt{1 + \alpha^2}\,\|W_k^\top W_q\|_2\,\|X_1 - X_2\|_2. \qquad (46)$$
Finally, by combining the bound on $\|E\|_2$ and Assumption 3 with eq. 44, we obtain:
$$\|\tilde{X}_1 - \tilde{X}_2\|_2 \qquad (47)$$
$$\le \Big(\sqrt{1 + \alpha^2} + D_{\mathcal{X}}\sqrt{1 + \alpha^2} \cdot 8\sqrt{m_X}\,D_{\mathcal{X}}\sqrt{1 + \alpha^2}\,\|W_k^\top W_q\|_2\Big)\,\|W_v\|_2\,\|X_1 - X_2\|_2 \qquad (48)$$
$$= \Big(\sqrt{1 + \alpha^2} + 8\sqrt{m_X}\,D_{\mathcal{X}}^2\,(1 + \alpha^2)\,\|W_k^\top W_q\|_2\Big)\,\|W_v\|_2\,\|X_1 - X_2\|_2,$$
which provides us the desired bound. By setting $\alpha^l = 0$, the case when there is no prompt, we obtain a bound of the same form as Lemma 6.
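As a worked instance of this reduction, substituting $\alpha^l = 0$ into eq. 42 gives
$$\mathrm{Lip}(\mathrm{Att}^{i,l}(\cdot, \cdot)) \le \big(1 + 8\sqrt{m_X}\,(D_{\mathcal{X}}^l)^2\,\|(W_k^{i,l})^\top W_q^{i,l}\|_2\big)\,\|W_v^{i,l}\|_2,$$
which is the single-head bound of eq. 18 with the sequence length $m$ and radius $D^l$ replaced by $m_X$ and $D_{\mathcal{X}}^l$.

The quantities appearing in Lemmas 8-10 can also be inspected numerically. The sketch below (illustrative only; the reconstruction of $G$ follows the row-range form of Lemma 8 given above, and all names are made up) builds the two column-stochastic softmax matrices from perturbed logits and prints both sides of the bounds claimed in Lemmas 9 and 10.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, D = 6, 5, 2.0
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))

def clip_spec(X, radius):
    # Rescale X so that ||X||_2 <= radius.
    s = np.linalg.norm(X, 2)
    return X if s <= radius else X * (radius / s)

def col_softmax(Z):
    A = np.exp(Z - Z.max(axis=0, keepdims=True))
    return A / A.sum(axis=0, keepdims=True)

X1 = clip_spec(rng.standard_normal((d, m)), D)
X2 = clip_spec(X1 + 0.1 * rng.standard_normal((d, m)), D)

Z1 = (Wk @ X1).T @ (Wq @ X1)
Z2 = (Wk @ X2).T @ (Wq @ X2)
E = Z1 - Z2                                    # so that Z2 = Z1 - E, as in Lemma 8
A1, A2 = col_softmax(Z1), col_softmax(Z2)

# Diagonal G with G_ii = max_{j,j'} |E_ij - E_ij'|, i.e. the range of row i of E.
G = np.diag(E.max(axis=1) - E.min(axis=1))

lhs9, rhs9 = np.linalg.norm(A1 - A2, 2), 2 * np.sqrt(m) * np.linalg.norm(G, 2)
lhs10 = np.linalg.norm(G, 2)
mid10 = 2 * np.linalg.norm(E, 2)
rhs10 = 4 * D * np.linalg.norm(Wk.T @ Wq, 2) * np.linalg.norm(X1 - X2, 2)

print("Lemma 9 :", lhs9, "<=", rhs9)
print("Lemma 10:", lhs10, "<=", mid10, "<=", rhs10)
```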