# HyperPrompt: Prompt-based Task-Conditioning of Transformers

Yun He\*1, Huaixiu Steven Zheng\*2, Yi Tay2, Jai Gupta2, Yu Du2, Vamsi Aribandi2, Zhe Zhao2, YaGuang Li2, Zhao Chen3, Donald Metzler2, Heng-Tze Cheng2, Ed H. Chi2

\*Equal contribution. 1Texas A&M University, work done as an intern at Google. 2Google Research. 3Waymo LLC. Correspondence to: Huaixiu Steven Zheng, Yi Tay.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

## Abstract

Prompt-Tuning is a new paradigm for finetuning pre-trained language models in a parameter-efficient way. Here, we explore the use of HyperNetworks to generate hyper-prompts: we propose HyperPrompt, a novel architecture for prompt-based task-conditioning of self-attention in Transformers. The hyper-prompts are end-to-end learnable via generation by a HyperNetwork. HyperPrompt allows the network to learn task-specific feature maps where the hyper-prompts serve as task global memories for the queries to attend to, while at the same time enabling flexible information sharing among tasks. We show that HyperPrompt is competitive against strong multi-task learning baselines with as few as 0.14% additional task-conditioning parameters, achieving great parameter and computational efficiency. Through extensive empirical experiments, we demonstrate that HyperPrompt can achieve superior performance over strong T5 multi-task learning baselines and parameter-efficient adapter variants including Prompt-Tuning and HyperFormer++ on the Natural Language Understanding benchmarks GLUE and SuperGLUE across many model sizes.

## 1. Introduction

Prompt-Tuning (Lester et al., 2021), learning to condition large language models with soft learnable memory tokens, has recently garnered attention owing to its ability for parameter-efficient finetuning. Prompts are lightly tuned, allowing the model to be trained quickly since the main body of the pretrained model is kept frozen. To this end, this paradigm is strongly reminiscent of adapter layers (Houlsby et al., 2019a; Karimi Mahabadi et al., 2021; Zaken et al., 2021; He et al., 2021), which are also efficiently finetuned.

Figure 1. HyperPrompt achieves state-of-the-art performance on SuperGLUE for T5 models up to XXL (x-axis: number of parameters, from 220M to 11B; y-axis: SuperGLUE score). Prompt-Tuning (Lester et al., 2021), which tunes prompt parameters only, achieves competitive performance against the multi-task learning (MTL) baseline for the 11B-parameter model but shows a big performance gap for smaller models. HyperPrompt-Global outperforms the strong parameter-efficient adapter variant HyperFormer++ (Karimi Mahabadi et al., 2021), the MTL baseline, and the full fine-tuning of Prompt-Tuning (our implementation) across model sizes by a large margin [e.g., 91.3 vs 90.2 (MTL) for T5 XXL].

We introduce HyperPrompt, a natural but novel extension of Prompt-Tuning to multi-task learning (MTL) for language. HyperPrompt introduces task-conditioned hyper-prompts that condition the model on task-specific information for constructing these prompts. Hyper-prompts are injected into the keys and values in the self-attention module, reminiscent of memory-augmented Transformers (Sukhbaatar et al., 2019).
This mitigates the cost of having prompts pass through the standard FFN layers in Transformers and provides additional task-specific memory tokens for queries to attend to. We further improve upon this by introducing task-aware and layer-aware HyperNetworks (Ha et al., 2017) that parameterize and generate weights for the prompt generation process. The use of HyperNetworks imbues our model with the necessary flexibility and expressiveness, especially when it comes to incorporating task-specific and layer-specific information into the network. Meanwhile, HyperPrompt remains very parameter- and computation-efficient and friendly to multi-task scaling: the additional parameters scale sub-linearly with, and in practice are almost independent of, the number of tasks. While HyperNetworks have enjoyed some success in learning adapters (Karimi Mahabadi et al., 2021; Tay et al., 2020) and/or continual learning (von Oswald et al., 2019), we note that this is the first exploration of HyperNetworks as a prompt generator.

Contrary to prior work, we additionally propose to finetune the entire network instead of only the hyper-prompts. We make several compelling arguments for this. Firstly, Lester et al. (2021) show that parameter-efficient Prompt-Tuning only shines for large (e.g., 11B) models and substantially pales in comparison to fine-tuning when the model is moderately parameterized (e.g., 220M). Secondly, finetuning only adaptive parameters (e.g., prompts/adapters) simply presents an illusion of efficiency (Dehghani et al., 2021). In reality, the FLOPs incurred on the forward pass are still identical, so no compute is saved during inference. Parameter counts, especially when counting only prompts and adapters, are not the only measure of computational efficiency. Instead, FLOPs and training time should be considered together to provide a holistic view.

**Our Contributions.** Our main contributions include:

- We propose a novel HyperPrompt Transformer architecture with learnable hyper-prompts for multi-task fine-tuning with great parameter and computational efficiency.
- We demonstrate that for difficult tasks, it is crucial to fine-tune the task-specific parameters together with the backbone model to achieve Pareto efficiency on all tasks.
- We explore HyperNetworks as a prompt generator and inject hyper-prompts into the self-attention module as global task memory tokens.
- HyperPrompt outperforms state-of-the-art parameter-efficient T5 models (Raffel et al., 2019) using Prompt-Tuning or adapters on well-established benchmarks such as SuperGLUE and GLUE, across all explored model sizes (see Figure 1).

## 2. Problem Statement

We consider the general setting of multi-task learning for a set of tasks $\{\mathcal{D}_\tau\}_{\tau=1}^{T}$, where $T$ is the total number of tasks and $\mathcal{D}_\tau = \{x_\tau^{(n)}, y_\tau^{(n)}\}_{n=1}^{N_\tau}$ is the training set of the $\tau$-th task with $N_\tau$ samples. We assume that a pre-trained Transformer model $f_\theta(\cdot)$ (e.g., T5) is given, where the model is parameterized by $\theta$. To tackle such a multi-task learning problem with $f_\theta(\cdot)$, we minimize the objective function

$$\mathcal{L}(\theta) = \sum_{\tau=1}^{T}\sum_{n=1}^{N_\tau} C\big(f_\theta(x_\tau^{(n)}),\, y_\tau^{(n)}\big),$$

where $C(\cdot,\cdot)$ is typically the cross-entropy loss and $f_\theta(x_\tau^{(n)})$ is the output for training sample $x_\tau^{(n)}$.
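To make the objective concrete, here is a minimal sketch (ours, not from the paper's codebase) of how $\mathcal{L}(\theta)$ could be computed for a mixture of tasks; the toy linear model and random data are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def multitask_loss(model, task_batches):
    """Sum of per-example cross-entropy losses over all tasks: L(theta)."""
    total = 0.0
    for inputs, targets in task_batches:     # outer sum over tasks tau = 1..T
        logits = model(inputs)               # f_theta(x_tau), shared parameters theta
        # inner sum over the N_tau samples of task tau
        total = total + F.cross_entropy(logits, targets, reduction="sum")
    return total

# Toy usage: one shared linear "model" co-trained on 3 tasks with random data.
model = torch.nn.Linear(16, 4)
task_batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(3)]
loss = multitask_loss(model, task_batches)
loss.backward()   # every task updates the same task-agnostic theta
```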
Transformer-based pre-trained language models such as T5 (Raffel et al., 2019) and BART (Lewis et al., 2020) are unified text-to-text frameworks where all tasks share the same encoder-decoder architecture: $\{\{x_\tau^{(n)}\}_{n=1}^{N_\tau}\}_{\tau=1}^{T}$ are fed into the same encoder and $\{\{\hat{y}_\tau^{(n)}\}_{n=1}^{N_\tau}\}_{\tau=1}^{T}$ are generated by the same decoder. For such universal modules, multi-task learning simply corresponds to mixing task datasets together, and there are no task-specific classification or regression networks for each task as in encoder-only models (Devlin et al., 2019; Liu et al., 2019b).

Previous work (Raffel et al., 2019) shows that co-learning all tasks together on a pre-trained Transformer model is inferior to fine-tuning on each task separately. A possible reason is that $\theta$ is task-agnostic (i.e., all parameters are shared) and hence task-specific information is not well captured, which can be especially true for low-resource tasks. Therefore, a natural way to improve the performance of Transformers on multi-task learning is to introduce a set of task-conditioned parameters $\{\delta_\tau\}_{\tau=1}^{T}$ into $f_\theta(\cdot)$. The objective function becomes

$$\mathcal{L}(\theta, \{\delta_\tau\}_{\tau=1}^{T}) = \sum_{\tau=1}^{T}\sum_{n=1}^{N_\tau} C\big(f_{\theta,\delta_\tau}(x_\tau^{(n)}),\, y_\tau^{(n)}\big),$$

where $\delta_\tau$ is the task-specific parameterization for the $\tau$-th task. During training, both $\theta$ and $\{\delta_\tau\}_{\tau=1}^{T}$ are updated via back-propagation, because we observe a large performance drop on SuperGLUE when the backbone $\theta$ is frozen and only the task-conditioned parameters are tuned, as done in Karimi Mahabadi et al. (2021); this is detailed in Section 4.3.

To this end, our goal is to design task-conditioned parameterizations of Transformer models that achieve greater parameter and computational efficiency as well as Pareto efficiency for multi-task learning. More explicitly, we have two goals: (1) improving the finetuning performance on most tasks in $\{\mathcal{D}_\tau\}_{\tau=1}^{T}$ by introducing task-conditioned parameters $\{\delta_\tau\}_{\tau=1}^{T}$ into $f_\theta(\cdot)$, and (2) doing so under the constraint that $\sum_{\tau} \|\delta_\tau\|_0 \ll \|\theta\|_0$, which means that the model capacity is not significantly increased and that the computational cost does not increase substantially either.

## 3. HyperPrompt

In this section, we introduce HyperPrompt, which has three variants: HyperPrompt-Share, HyperPrompt-Sep and HyperPrompt-Global (Figure 2). We follow two key design principles to formulate HyperPrompt: (1) injecting task-conditioning into the self-attention module for better computational efficiency and more expressive power via token-level interactions, and (2) using HyperNetworks to simultaneously improve parameter efficiency and allow a flexible degree of task sharing for better generalization.
Figure 2. HyperPrompt framework: (a) in each Transformer block, task-specific hyper-prompts $P_{K,V}$ are prepended to the original key $K$ and value $V$ for the query $Q$ to attend to; (b) in HyperPrompt-Share/Sep, global prompts $P$ are used to generate the hyper-prompts $P_{K,V}$ through local HyperNetworks $h_{k,v}$ at each Transformer layer, each consisting of a down-projection matrix $D_{K,V}$, a ReLU layer and an up-projection matrix $U_{K,V}$; (c) in HyperPrompt-Global, all the local HyperNetworks ($D_{K,V}$, $U_{K,V}$) are generated by global HyperNetworks $H_{k,v}$ using layer-aware task embeddings $I$ as task-specific inputs (see Section 3.3 for details).

### 3.1. Prompt-Based Task-Conditioned Transformer

Previous adapter-based methods (Karimi Mahabadi et al., 2021; Tay et al., 2020) for multi-task learning normally add an adapter (i.e., a dense-ReLU-dense network) for each task after the feed-forward layers at every Transformer block. Instead, the key idea of our approach is to prepend $l$ task-conditioned trainable vectors to the keys and values of the multi-head self-attention layer at every Transformer block, so that task-specific attention feature maps are jointly learned with the task-agnostic representation. The idea of prepending learnable prompts to the network has been explored before by Li & Liang (2021); Lester et al. (2021); Liu et al. (2021) for single-task fine-tuning. We first introduce and expand this idea for multi-task learning in this subsection. Specifically, we design a novel method called HyperPrompt, following design principle #1 of injecting hyper-prompts into self-attention and #2 of using HyperNetworks as generators for hyper-prompts.

At a multi-head self-attention layer, the original key, value and query are calculated as

$$K_\tau = X_\tau W_k,\quad V_\tau = X_\tau W_v,\quad Q_\tau = X_\tau W_q,$$

where $X_\tau \in \mathbb{R}^{L \times d}$ is the input sequence of a training sample from the $\tau$-th task, $L$ is the sequence length and $d$ is the model dimension. $W_k \in \mathbb{R}^{d \times h \times d_h}$, $W_v \in \mathbb{R}^{d \times h \times d_h}$ and $W_q \in \mathbb{R}^{d \times h \times d_h}$ project the input into the original key $K_\tau \in \mathbb{R}^{L \times h \times d_h}$, value $V_\tau \in \mathbb{R}^{L \times h \times d_h}$ and query $Q_\tau \in \mathbb{R}^{L \times h \times d_h}$, where $h$ is the number of heads and $d_h$ is the dimension of each head, typically set to $d/h$ to save parameters.

To learn the task-specific information for the $\tau$-th task, we have $l$ trainable $d$-dimensional vectors as the hyper-prompts for the key and the value respectively, denoted as $P_{\tau,k} \in \mathbb{R}^{l \times h \times d_h}$ and $P_{\tau,v} \in \mathbb{R}^{l \times h \times d_h}$, as shown in Figure 2(a). The hyper-prompts are then concatenated with the original key and value:

$$K'_\tau = \mathrm{concat}(P_{\tau,k}, K_\tau), \quad (1)$$
$$V'_\tau = \mathrm{concat}(P_{\tau,v}, V_\tau), \quad (2)$$

where the new key $K'_\tau \in \mathbb{R}^{(l+L) \times h \times d_h}$ and value $V'_\tau \in \mathbb{R}^{(l+L) \times h \times d_h}$ are used to compute the multi-head self-attention:

$$O_\tau = \mathrm{Attention}(Q_\tau, K'_\tau, V'_\tau) = \mathrm{softmax}(Q_\tau {K'_\tau}^{\top})\, V'_\tau,$$

where $O_\tau \in \mathbb{R}^{L \times d}$ is the output of multi-head attention.

The hyper-prompts benefit Transformers for multi-task learning in two ways: (1) the key prompt $P_{\tau,k}$ is prepended to the original key and participates in the calculation of the attention feature map $\mathrm{softmax}(Q_\tau {K'_\tau}^{\top})$; $P_{\tau,k}$ directly interacts (via matrix multiplication) with the original query $Q_\tau$, allowing tokens to acquire task-specific semantics. (2) The value prompt $P_{\tau,v}$ is prepended to the original value and is absorbed into the self-attention output $O_\tau$, where each position in $O_\tau$ is the weighted sum of the vectors in $V'_\tau$ with weights given by the attention scores. This way, $P_{\tau,v}$ can serve as task-specific memories for multi-head attention to retrieve information from.
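As a concrete illustration of Eqs. (1)-(2) and the attention formula above, the following is a minimal single-example, single-layer sketch (ours; the names, shapes and absence of masking are simplifying assumptions) of prepending per-task key/value hyper-prompts inside multi-head self-attention:

```python
import torch

def prompted_attention(x, Wq, Wk, Wv, P_k, P_v, h):
    """x: [L, d] tokens of one example from task tau; Wq/Wk/Wv: [d, d] projections;
    P_k, P_v: [l, d] key/value hyper-prompts for task tau; h: number of heads."""
    L, d = x.shape
    l, d_h = P_k.shape[0], d // h
    # Original query, key and value, reshaped to [h, length, d_h].
    Q = (x @ Wq).view(L, h, d_h).transpose(0, 1)
    K = (x @ Wk).view(L, h, d_h).transpose(0, 1)
    V = (x @ Wv).view(L, h, d_h).transpose(0, 1)
    # Eqs. (1)-(2): prepend hyper-prompts so keys/values have length l + L.
    K2 = torch.cat([P_k.view(l, h, d_h).transpose(0, 1), K], dim=1)
    V2 = torch.cat([P_v.view(l, h, d_h).transpose(0, 1), V], dim=1)
    # Queries attend to both the task prompts and the ordinary tokens.
    scores = torch.softmax(Q @ K2.transpose(-2, -1), dim=-1)   # [h, L, l+L]
    O = scores @ V2                                            # [h, L, d_h]
    return O.transpose(0, 1).reshape(L, d)                     # O_tau: [L, d]

# Toy usage: 10 tokens, model dimension 32, 4 heads, prompt length 3.
d, h, L, l = 32, 4, 10, 3
x = torch.randn(L, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
P_k, P_v = torch.randn(l, d), torch.randn(l, d)
print(prompted_attention(x, Wq, Wk, Wv, P_k, P_v, h).shape)    # torch.Size([10, 32])
```

In the full model this operation is applied per task at every Transformer block; the per-task prompts `P_k` and `P_v` are exactly what the HyperNetworks of Sections 3.2-3.3 generate.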
### 3.2. HyperPrompt

How do we obtain the prompts for the $m$-th Transformer block? A straightforward way is to directly initialize $P^m_{\tau,k}$ and $P^m_{\tau,v}$ for every block. However, this is parameter-inefficient, as the number of prompt parameters scales linearly with both the number of tasks $T$ and the number of layers $M$, i.e., as $O(TM)$. Instead, we initialize a global prompt $P_\tau$ for each task (we term it global because it is independent of the layer index, as opposed to the layer-dependent prompts $P^m_\tau$) and apply local HyperNetworks at every Transformer block to project this prompt into $\{P^m_{\tau,k}\}_{m=1}^{M}$ and $\{P^m_{\tau,v}\}_{m=1}^{M}$.

**Global Prompts.** Specifically, we initialize a set of global prompts $\{P_\tau\}_{\tau=1}^{T}$, where $P_\tau \in \mathbb{R}^{l \times d}$ is a trainable matrix that learns the task-specific information of the $\tau$-th task, $d$ is the model dimension and $l$ is the length of the prompt.

**Local HyperNetworks.** At the $m$-th Transformer block, we apply two local HyperNetworks $h^m_k$ and $h^m_v$ to transform the global prompt $P_\tau$ into layer-specific and task-specific prompts, as shown in Figure 2(b):

$$P^m_{\tau,k} = h^m_k(P_\tau) = U^m_k(\mathrm{ReLU}(D^m_k(P_\tau))), \quad (3)$$
$$P^m_{\tau,v} = h^m_v(P_\tau) = U^m_v(\mathrm{ReLU}(D^m_v(P_\tau))), \quad (4)$$

where $P^m_{\tau,k/v} \in \mathbb{R}^{l \times h \times d_h}$. We call these generated prompts hyper-prompts to distinguish them from the global prompts. In particular, to limit the number of parameters, the local HyperNetworks are designed with a bottleneck architecture: $D^m_{k/v} \in \mathbb{R}^{d \times b}$ and $U^m_{k/v} \in \mathbb{R}^{b \times h \times d_h}$ are down-projection and up-projection matrices, respectively, and $b$ is the bottleneck dimension satisfying $b \ll d$.

**HyperPrompt-Share.** We first have all tasks share the same two local HyperNetworks, defined by the down-projection matrices $D^m_k$ and $D^m_v$ and the up-projection matrices $U^m_k$ and $U^m_v$. We refer to this design choice as HyperPrompt-Share. Despite the saving in parameters, one drawback of HyperPrompt-Share is that task conflicts can arise given the limited model capacity (Wu et al., 2020; Wang et al., 2020) of the shared local HyperNetworks.

**HyperPrompt-Sep.** At the opposite extreme of HyperPrompt-Share, each task can have its own local HyperNetworks $h^m_{\tau,k}(P_\tau)$ and $h^m_{\tau,v}(P_\tau)$:

$$P^m_{\tau,k} = h^m_{\tau,k}(P_\tau) = U^m_{\tau,k}(\mathrm{ReLU}(D^m_{\tau,k}(P_\tau))), \quad (5)$$
$$P^m_{\tau,v} = h^m_{\tau,v}(P_\tau) = U^m_{\tau,v}(\mathrm{ReLU}(D^m_{\tau,v}(P_\tau))), \quad (6)$$

where $D^m_{\tau,k/v}$ and $U^m_{\tau,k/v}$ are down-projection and up-projection matrices for the $\tau$-th task, respectively. In this case, each task's hyper-prompts are trained independently and hence there is no information sharing.
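A minimal sketch of the local HyperNetworks of Eqs. (3)-(4) in the HyperPrompt-Share configuration (one shared pair $h^m_k$, $h^m_v$ per layer); the module and variable names are ours, not the paper's:

```python
import torch
import torch.nn as nn

class LocalHyperNetwork(nn.Module):
    """Bottleneck projection of Eqs. (3)-(4): a global prompt P_tau [l, d] is mapped
    to a key or value hyper-prompt [l, d] via a down-projection D (d -> b), a ReLU
    and an up-projection U (b -> h * d_h = d)."""
    def __init__(self, d, b):
        super().__init__()
        self.down = nn.Linear(d, b, bias=False)   # D^m_{k/v}
        self.up = nn.Linear(b, d, bias=False)     # U^m_{k/v}

    def forward(self, global_prompt):             # [l, d]
        return self.up(torch.relu(self.down(global_prompt)))

# HyperPrompt-Share at one Transformer block: a single (h_k, h_v) pair for all tasks.
d, b, l, T = 512, 32, 6, 8
h_k, h_v = LocalHyperNetwork(d, b), LocalHyperNetwork(d, b)
global_prompts = nn.Parameter(torch.randn(T, l, d))   # one global prompt P_tau per task
tau = 3
P_k = h_k(global_prompts[tau])    # key hyper-prompt P^m_{tau,k}
P_v = h_v(global_prompts[tau])    # value hyper-prompt P^m_{tau,v}
print(P_k.shape, P_v.shape)       # torch.Size([6, 512]) torch.Size([6, 512])
```

HyperPrompt-Sep (Eqs. 5-6) would instead instantiate a separate `(h_k, h_v)` pair for every task, trading parameters for task isolation.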
### 3.3. HyperPrompt-Global

We further propose HyperPrompt-Global, a novel design that flexibly shares information and knowledge among tasks and blocks while maintaining a low parameter cost. As shown in Figure 2(c), the key idea of HyperPrompt-Global is to generate the local HyperNetworks using global HyperNetworks shared by all tasks and all Transformer blocks.

**Layer-Aware Task Embedding.** Following the same recipe as Karimi Mahabadi et al. (2021), we define a layer-aware task embedding for better generalization. Let $k_\tau \in \mathbb{R}^{t'}$ denote the task embedding for the $\tau$-th task, where $t'$ is its dimension. To capture layer-specific information, a layer embedding $z_m \in \mathbb{R}^{t'}$ is introduced. A task projection network $h_t(\cdot,\cdot)$ is then applied to fuse the task embedding and the layer embedding into the final layer-aware task embedding $I^m_\tau = h_t(k_\tau, z_m)$, which is the input to the shared global HyperNetworks as shown in Figure 2(c). $h_t$ is an MLP consisting of two feed-forward layers and a ReLU non-linearity, which takes the concatenation of $k_\tau$ and $z_m$ as input.

**Global HyperNetworks.** A global HyperNetwork $H_k(\cdot)$ generates the weight matrices $(U^m_{\tau,k}, D^m_{\tau,k})$ of the local HyperNetworks for the key hyper-prompts, and another global HyperNetwork $H_v(\cdot)$ generates the weight matrices $(U^m_{\tau,v}, D^m_{\tau,v})$ of the local HyperNetworks for the value hyper-prompts:

$$(U^m_{\tau,k}, D^m_{\tau,k}) = H_k(I^m_\tau) = (W^{U_k}, W^{D_k})\, I^m_\tau, \quad (7)$$
$$(U^m_{\tau,v}, D^m_{\tau,v}) = H_v(I^m_\tau) = (W^{U_v}, W^{D_v})\, I^m_\tau, \quad (8)$$

where $I^m_\tau \in \mathbb{R}^{t}$ is the layer-aware task embedding for the $\tau$-th task at the $m$-th block, and $W^{D_k} \in \mathbb{R}^{(d \cdot b) \times t}$, $W^{D_v} \in \mathbb{R}^{(d \cdot b) \times t}$, $W^{U_k} \in \mathbb{R}^{(b \cdot h \cdot d_h) \times t}$ and $W^{U_v} \in \mathbb{R}^{(b \cdot h \cdot d_h) \times t}$ are the weight matrices of $H_k(\cdot)$ and $H_v(\cdot)$. Given the $U^m_{\tau,k/v}$ and $D^m_{\tau,k/v}$ generated by the global HyperNetworks, we project the global prompts $P_\tau$ into hyper-prompts $P^m_{\tau,k/v}$ following Eqs. (5) and (6). Finally, the hyper-prompts $P^m_{\tau,k/v}$ are prepended to the original key and value at every self-attention layer, as shown in Figure 2(a), to calculate the task-conditioned attention scores.

Using global HyperNetworks to generate the projection networks has two benefits:

1. It enables a more flexible way to share information across tasks and layers: the transformation matrices are decomposed into $H_{k/v}(\cdot)$, which are shared by all tasks and all layers. Therefore, the model can adjust the degree of information sharing across tasks and layers by learning appropriate parameter values in $H_{k/v}(\cdot)$ during end-to-end training.

2. It yields a parameter-efficient task-conditioned parameterization. The number of extra task-conditioned parameters does not depend on the number of layers $M$ and scales sub-linearly with the total number of tasks $T$. In practice, since the task embeddings and task prompts have far fewer parameters than the global HyperNetworks, the number of additional task-conditioned parameters is almost independent of $T$.

### 3.4. Parameter Efficiency of HyperPrompt

As shown in A.1, the total number of additional parameters introduced by HyperPrompt-Global is $dlT + 4bdt + Tt' + Mt' + (2t' + t)e$, where $d$ is the model dimension, $l$ is the length of the prompts, $T$ is the total number of tasks, $b$ is the bottleneck dimension of the weight matrices of the local HyperNetworks, $t'$/$t$ is the dimension of the raw/final layer-aware task embedding, and $e$ is the hidden dimension of $h_t$. Therefore, the space complexity is $O(d(lT + 4bt))$, given that in practice $M \sim T$, $t' \ll dl$, and $e \ll bd$. This leads to a sub-linear scaling with respect to $T$. Furthermore, $T$ is typically $O(10)$ for multi-task learning, and a moderate $l \sim O(10)$ suffices to achieve the optimal performance, as detailed in Section 4.7. On the other hand, typical values are $b \geq 24$ and $t \geq 32$, and therefore $4bt \gg lT$ in most cases. Hence, the space complexity can be further simplified to $O(bdt)$. In conclusion, the space complexity of HyperPrompt-Global mainly comes from the global HyperNetworks and is practically independent of the prompt length $l$, the number of Transformer layers $M$, and the number of tasks $T$.
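To tie Sections 3.2-3.3 together, here is a minimal sketch (ours; all sizes and names are illustrative assumptions) of Eqs. (7)-(8): a global HyperNetwork emits the flattened weights of the local down/up projections for a given (task, layer) pair from its layer-aware task embedding, and those weights then transform the global prompt into a hyper-prompt.

```python
import torch
import torch.nn as nn

d, b, l = 512, 32, 6          # model dim, bottleneck dim, prompt length (illustrative)
t, t_raw, e = 64, 32, 128     # final/raw task-embedding dims, hidden dim of h_t
T, M = 8, 12                  # number of tasks and of Transformer blocks

class GlobalHyperNetwork(nn.Module):
    """H_k or H_v (Eqs. 7-8): maps a layer-aware task embedding I^m_tau [t] to the
    weights (U, D) of one local HyperNetwork."""
    def __init__(self):
        super().__init__()
        self.W_D = nn.Linear(t, d * b, bias=False)   # emits D^m_tau, reshaped to [d, b]
        self.W_U = nn.Linear(t, b * d, bias=False)   # emits U^m_tau, reshaped to [b, d]

    def forward(self, I):
        return self.W_U(I).view(b, d), self.W_D(I).view(d, b)

task_emb = nn.Parameter(torch.randn(T, t_raw))     # k_tau
layer_emb = nn.Parameter(torch.randn(M, t_raw))    # z_m
h_t = nn.Sequential(nn.Linear(2 * t_raw, e), nn.ReLU(), nn.Linear(e, t))  # fusion MLP
H_k, H_v = GlobalHyperNetwork(), GlobalHyperNetwork()
global_prompt = nn.Parameter(torch.randn(l, d))    # P_tau for one task

tau, m = 3, 7                                      # pick a task and a layer
I = h_t(torch.cat([task_emb[tau], layer_emb[m]]))  # layer-aware task embedding I^m_tau
U_k, D_k = H_k(I)
P_k = torch.relu(global_prompt @ D_k) @ U_k        # hyper-prompt P^m_{tau,k}, shape [l, d]
print(P_k.shape)                                   # torch.Size([6, 512])
```

Because `H_k` and `H_v` are shared across all tasks and layers, adding a task only adds one row to `task_emb` and one global prompt, which is where the sub-linear scaling discussed above comes from.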
## 4. Experiments

### 4.1. Experimental Setup

**Datasets.** We evaluate the performance of the models on GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019). Each is a collection of text classification tasks testing general language understanding ability. Specifically, the tasks include: sentence acceptability (CoLA), sentiment analysis (SST-2), paraphrasing/sentence similarity (MRPC, STS-B and QQP), natural language inference (MNLI, QNLI, RTE and CB), coreference resolution (WSC), sentence completion (COPA), word sense disambiguation (WiC) and question answering (MultiRC, ReCoRD and BoolQ).

**Transformers.** Following previous work (Karimi Mahabadi et al., 2021; Tay et al., 2020), our models are built on top of the state-of-the-art Transformer model T5 (Raffel et al., 2019), which uses the encoder-decoder architecture of Vaswani et al. (2017). We use already pre-trained T5 models with sizes from Base (220M parameters) to XXL (11B).

**Evaluation.** We save a checkpoint every 2000 steps for all models and follow the same convention as Raffel et al. (2019) in selecting the best checkpoint for each task. The emphasis of our evaluation is not to find the best single checkpoint for all tasks but to test the model's ability to transfer among the co-trained tasks. We first calculate the average of all metrics for each task and then report the average over all tasks for GLUE and SuperGLUE.

**Baselines.** We compare our proposed HyperPrompt-Share/Sep/Global with vanilla T5 models (Raffel et al., 2019) trained with multi-task learning, referred to as MTL. Another baseline is Vanilla Adapter (Houlsby et al., 2019b), which adds adapter modules for each task after each of the two feed-forward modules in every Transformer block of the T5 model. The state-of-the-art adapter-based method for multi-task learning is HyperFormer++ (Karimi Mahabadi et al., 2021), which uses HyperNetworks to generate adapters for each task and adds them after the feed-forward modules following Houlsby et al. (2019b). In addition, Prompt-Tuning (Lester et al., 2021) was originally designed for parameter-efficient single-task fine-tuning and only prepends prompts to the input word embeddings in the first layer. We slightly modify it by initializing and prepending prompts for each task so that Prompt-Tuning can be applied to multi-task learning. We defer additional details of the experiments to A.2.

### 4.2. Key Results

Figure 1 provides an overall summary of the results of HyperPrompt. Previous prompt-tuning methods (Lester et al., 2021; Li & Liang, 2021) focus on parameter-efficient single-task fine-tuning and hence freeze the backbone and only fine-tune the prompts. Their experiments show that tuning only the prompts can match full model training for a very large 11B model (Figure 1), but substantially pales for moderate model sizes. Our HyperPrompt-Global architecture, when fully fine-tuned, achieves state-of-the-art performance on SuperGLUE across four different model sizes. Competitive adapter-tuning variants including Prompt-Tuning and HyperFormer++ can either match or slightly improve upon the multi-task learning (MTL) baseline on SuperGLUE. In contrast, HyperPrompt-Global outperforms the strong MTL baseline by a large margin on the SuperGLUE score (78.9 vs 77.2 for T5 Base). Interestingly, such a performance gain continues all the way to model sizes as big as XXL (e.g., 91.3 vs 90.2) with only 0.14% additional parameters.

### 4.3. Tuning All vs Task-Conditioned Parameters

Recently, Karimi Mahabadi et al. (2021) showed that tuning only adapters can be competitive against full finetuning. However, their evaluation is conducted only on GLUE with smaller models, including T5 Small and Base.
In the experiments, we first compare tuning the full model vs. only the task-conditioned parameters. Table 1 shows the comparison of the GLUE and SuperGLUE average scores using T5 Large (for per-task performance, please refer to A.4).

| Tunable | Model | GLUE | SuperGLUE |
|---|---|---|---|
| All | MTL | 88.3 | 85.9 |
| All | HyperFormer++ | 88.8 | 86.4 |
| All | HyperPrompt-Global | 89.4 | 87.0 |
| Task | HyperFormer++ | 87.3 | 80.5 |
| Task | HyperPrompt-Global | 87.5 | 81.5 |

Table 1. Comparison of fine-tuning all vs task-specific parameters using T5 Large. The average scores of GLUE and SuperGLUE are reported.

For GLUE, the observation is consistent with Karimi Mahabadi et al. (2021): task-specific-only fine-tuning of HyperFormer++ and HyperPrompt-Global is comparable to the MTL baseline. However, on SuperGLUE we observe a large gap: the average score drops by 5.5 and 5.9 for HyperPrompt-Global and HyperFormer++, respectively. These experiments show that tuning only the task-conditioned parameters is not enough to achieve results competitive with full model training for multi-task learning on high-difficulty tasks such as SuperGLUE. This is consistent with the results of Prompt-Tuning (Lester et al., 2021). Hence, the rest of the experiments are conducted with tuning all model parameters.

### 4.4. Computational Efficiency

Table 2 presents the computational efficiency of the adapter/prompt models. HyperPrompt-Global (together with HyperPrompt-Share) has the lowest number of operations, since hyper-prompts are injected into self-attention and skip the standard FFN layers. In contrast, HyperFormer++ has about 3x the # Ops of the other variants. Regarding training time, HyperPrompt-Share is fastest given that the local HyperNetworks are shared across tasks. Vanilla Adapter and HyperPrompt-Global are comparable, while HyperFormer++ and Prompt-Tuning take significantly longer to do the full fine-tuning. This shows the computational efficiency of HyperPrompt for both training and inference.

| Model | # Ops | Training Time (hours) |
|---|---|---|
| Vanilla Adapter | $1.01 \times 10^{13}$ | 8.4 |
| HyperFormer++ | $3.14 \times 10^{13}$ | 10.3 |
| Prompt-Tuning | $1.16 \times 10^{13}$ | 11.1 |
| HyperPrompt-Sep | $1.01 \times 10^{13}$ | 8.9 |
| HyperPrompt-Share | $9.8 \times 10^{12}$ | 8.0 |
| HyperPrompt-Global | $9.8 \times 10^{12}$ | 8.7 |

Table 2. The number of operations for a single forward pass and training time on T5 Base.

| Model | #Params | GLUE | SuperGLUE |
|---|---|---|---|
| MTL | 1.0x | 85.5 (0.9) | 77.2 (0.2) |
| Vanilla Adapter | 1.06x | 86.7 (0.3) | 77.5 (0.1) |
| HyperFormer++ | 1.04x | 86.5 (0.0) | 78.2 (0.7) |
| Prompt-Tuning | 1.0003x | 84.8 (0.6) | 77.3 (0.2) |
| HyperPrompt-Share | 1.008x | 86.4 (0.6) | 78.2 (0.7) |
| HyperPrompt-Sep | 1.06x | 86.8 (0.1) | 77.5 (0.1) |
| HyperPrompt-Global | 1.04x | 86.8 (0.4) | 78.9 (0.5) |

Table 3. GLUE and SuperGLUE average scores (standard deviations) over 3 runs of HyperPrompt against baselines on T5 Base.

### 4.5. Ablation Study

Table 3 presents the results on T5 Base and Table 4 presents the results on T5 Large (see more detailed results in A.4). HyperPrompt-Global outperforms all baselines in terms of the average score on both GLUE and SuperGLUE.

**HyperPrompt-Global vs. Prompt-Tuning.** The original Prompt-Tuning (Lester et al., 2021) is for single-task fine-tuning. To be parameter-efficient, it only trains the prompts with the backbone frozen. To make a fair comparison, we modify Prompt-Tuning by (1) training both prompts and backbone, and (2) adding a prompt for each task and co-training all tasks together. As shown in Tables 3 and 4, HyperPrompt-Global outperforms Prompt-Tuning by 2.0 (0.6) and 1.6 (1.4) on GLUE and SuperGLUE using T5 Base (Large), respectively.
HyperPrompt-Global improves upon Prompt-Tuning in two ways: (1) Prompt-Tuning only adds prompts to the word embedding layer, while HyperPrompt-Global adds hyper-prompts at every Transformer layer and hence is more expressive; and (2) the prompts of different tasks are trained independently in Prompt-Tuning, while HyperPrompt-Global enables flexible information sharing via HyperNetworks.

**HyperPrompt-Global vs. HyperFormer++.** Our method is superior to the state-of-the-art baseline HyperFormer++ in the average scores of GLUE and SuperGLUE for both the Base and Large T5 models. For example, HyperPrompt-Global on T5 Large achieves 87.0 on SuperGLUE compared to 86.4 for HyperFormer++ (Table 4). Note that the main difference between the two methods is that HyperPrompt-Global inserts the task-conditioned parameters as prompts into the self-attention layers, while HyperFormer++ inserts adapters after each block. We believe task-conditioning in self-attention gives more expressive power than in the feed-forward network as done in adapters. Hyper-prompts that are prepended to the key and value participate in the attention interactions between different token positions, which helps the model better capture task-dependent semantics.

| Model | #Params | GLUE | SuperGLUE |
|---|---|---|---|
| MTL | 1.0x | 88.3 (0.6) | 85.9 (0.3) |
| Vanilla Adapter | 1.06x | 88.8 (0.2) | 86.1 (0.5) |
| HyperFormer++ | 1.02x | 88.8 (0.0) | 86.4 (0.5) |
| Prompt-Tuning | 1.0001x | 88.8 (0.3) | 85.6 (0.1) |
| HyperPrompt-Share | 1.008x | 89.3 (0.1) | 86.8 (0.2) |
| HyperPrompt-Sep | 1.06x | 89.4 (0.2) | 86.1 (0.3) |
| HyperPrompt-Global | 1.02x | 89.4 (0.1) | 87.0 (0.5) |

Table 4. GLUE and SuperGLUE average scores (standard deviations) over 3 runs of HyperPrompt against baselines on T5 Large.

**HyperPrompt-Global vs. MTL.** Next, we observe that HyperPrompt-Global greatly improves upon the vanilla Transformer model (referred to as MTL): a 1.7 (1.1) gain in SuperGLUE score for T5 Base (Large) with 4% (2%) additional parameters. In conclusion, the experiments show that HyperPrompt-Global is a parameter-efficient and effective task-conditioned parameterization of Transformers for multi-task learning.

**HyperPrompt-Global vs. HyperPrompt-Share/Sep.** Interestingly, HyperPrompt-Share is better than HyperPrompt-Sep on SuperGLUE for both Base and Large models, while the opposite is true for GLUE. Notice that all tasks share the same two projection networks in HyperPrompt-Share, while each task has its own projection networks in HyperPrompt-Sep. More importantly, we observe that HyperPrompt-Global, where the projection networks are generated by the global HyperNetworks, always achieves the best performance on both GLUE and SuperGLUE. Hence, the experiments show that HyperPrompt-Global can adjust the degree of information sharing for better multi-task generalization, compared to HyperPrompt-Share/Sep.

### 4.6. Peeking into Hyper-Prompts

To shed light on how hyper-prompts help improve multi-task generalization via task-conditioning, we peek into HyperPrompt-Global models by looking at the distribution of attention scores. We choose the GLUE task MRPC as an example. To avoid biasing towards individual examples, we aggregate over 100 validation examples to compute the quantities of interest (see A.3 for details). First, we compute the attention mass on hyper-prompts for each encoder layer.
Figure 3 (top) shows that the network places lower attention mass on hyper-prompts in the lower layers and gradually increases the attention mass in higher layers. This indicates that higher levels of the Transformer become more task-specialized, while it is beneficial for the lower levels to learn task-agnostic representations (Yosinski et al., 2014) by casting lower attention mass on hyper-prompts. Furthermore, we calculate the entropy of the attention scores over the tokens. For HyperPrompt-Global, we remove the hyper-prompts from the calculation and re-normalize the attention scores over the tokens to make a fair comparison with the MTL baseline. Figure 3 (bottom) shows a shift of the entropy distribution towards higher values for HyperPrompt-Global. This signifies that injecting hyper-prompts encourages a more diverse attention distribution, which seems to be beneficial to model generalization.

Figure 3. Visualization of the attention mass on prompts per layer (top) and of the entropy distribution over tokens (bottom) for MTL and HyperPrompt.

### 4.7. Impact of Hyper-Prompt Length

HyperPrompt prepends $l$ trainable hyper-prompts to the keys and values of the self-attention layer at every Transformer layer. In Figure 4, we present the results of tuning the prompt length $l$ on GLUE using T5 Base as the example for HyperPrompt-Global (similar patterns are observed on T5 Large and SuperGLUE). We first add hyper-prompts on the decoder and search for the best $l$, and then search for the best $l$ on the encoder with the best decoder hyper-prompt length fixed. As shown in Figure 4(a), $l = 6$ is best for the decoder. As shown in Figure 4(b), HyperPrompt-Global achieves its best result of 86.8 with $l = 16$ on the encoder and $l = 6$ fixed for the decoder. The experiments show that hyper-prompts of length $l \sim O(10)$ are good enough to achieve superior performance. Note that the original sequence length is 512 on the encoder and 32 on the decoder. Therefore, HyperPrompt does not substantially increase the time complexity of the self-attention layers in practice.

Figure 4. Impact of hyper-prompt length in HyperPrompt-Global (GLUE score on T5 Base): (a) decoder, (b) encoder.

### 4.8. Encoder vs Decoder

To understand the effect of adding task-conditioned parameters to different parts of the network, we present the results of HyperPrompt-Global and HyperFormer++ with hyper-prompts/adapters added to: (1) the encoder only, (2) the decoder only, and (3) both encoder and decoder. As shown in Table 5, adding task-conditioned parameters to the encoder (encoder-only) performs better than decoder-only on GLUE. However, the opposite is true for SuperGLUE, where encoder-only is substantially worse than decoder-only. This could potentially be a trainability issue when prompts are inserted into encoders, i.e., a different learning rate might be required to learn the prompt parameters from scratch. We leave this investigation as future work. Based on this experiment, we add task-conditioned parameters to the decoder for SuperGLUE in our experiments.

| Model | #Params | GLUE | SuperGLUE |
|---|---|---|---|
| MTL | 1.0x | 85.5 | 77.2 |
| HyperFormer++-Encoder | 1.02x | 85.9 | 74.4 |
| HyperFormer++-Decoder | 1.02x | 85.7 | 78.2 |
| HyperFormer++-Enc-Dec | 1.04x | 86.5 | 74.8 |
| HyperPrompt-Encoder | 1.02x | 86.6 | 76.5 |
| HyperPrompt-Decoder | 1.02x | 86.3 | 78.9 |
| HyperPrompt-Enc-Dec | 1.04x | 86.8 | 78.7 |

Table 5. Ablation of inserting hyper-prompts or adapters into Encoder/Decoder/Enc-Dec (T5 Base).
## 5. Related Work

**Prompt-Tuning.** Prompt tuning is becoming a new paradigm for adapting pre-trained general-purpose language models to downstream tasks, as a lightweight alternative to the popular fine-tuning approach. Here, we use the term Prompt-Tuning to cover a family of methods following the prompting idea of GPT-3 (Brown et al., 2020). To avoid manually designing prompts, recent efforts have focused on searching for discrete prompting words automatically (Shin et al., 2020). On the other hand, soft prompts (Li & Liang, 2021; Hambardzumyan et al., 2021; Lester et al., 2021; Liu et al., 2021) in the form of continuous vectors have been introduced to simplify the process and have shown competitive results (Lester et al., 2021; Liu et al., 2021; Li & Liang, 2021). In particular, Lester et al. (2021) show that soft prompts can become competitive against full fine-tuning for an 11B-parameter model, but with a big performance gap for moderate-size models. In our work, we close this gap in the full fine-tuning setting and demonstrate that HyperPrompt can outperform strong baselines across all model sizes studied.

**Adapter-Tuning.** Adapter tuning (Houlsby et al., 2019a;b; Karimi Mahabadi et al., 2021) is an alternative approach for parameter-efficient lightweight tuning of pre-trained language models. Task-specific adapter layers (Houlsby et al., 2019a) are inserted into the Transformer block for fine-tuning while the rest of the backbone model is frozen. By adding only a few percent of additional parameters, Karimi Mahabadi et al. (2021) show that competitive performance can be obtained on NLU benchmarks such as GLUE (Wang et al., 2018). However, one limitation of the existing work is that NLU is evaluated only on the GLUE dataset, which is known to be no longer suitable for measuring the progress of language understanding (Wang et al., 2019). In our work, we evaluate HyperPrompt on SuperGLUE in addition to GLUE, and show that higher-difficulty tasks such as SuperGLUE indeed require full tuning of the model, beyond adapter tuning, to be competitive against state-of-the-art baselines. We also demonstrate that it is advantageous to inject prompts into self-attention rather than adding adapters.

**Multi-task Natural Language Understanding.** Multi-task learning is an important and challenging research direction in both the full fine-tuning and prompt-tuning paradigms because of the competing needs of training and serving a single model while achieving Pareto efficiency on all tasks. The T5 model (Raffel et al., 2019) renders all NLP tasks as text-to-text problems; however, its best results are obtained by task-specific fine-tuning. MT-DNN (multi-task deep neural network) (Liu et al., 2019a) shares parameters between several NLP tasks and achieves strong performance on the GLUE benchmark. Aghajanyan et al. (2021) use around 50 tasks to boost multi-task learning performance. Aribandi et al. (2021) build an extremely diverse set of 107 NLP tasks for extreme multi-task scaling and demonstrate superior performance on a wide range of benchmarks. Recently, Wei et al. (2021) and Sanh et al. (2021) also illustrated how a multi-task learning stage can greatly improve the zero-shot prompting performance of large language models.

## 6. Conclusion

We propose a novel architecture for prompt-based task-conditioning of self-attention in Transformers. The hyper-prompts are generated by a HyperNetwork to enable flexible information sharing among tasks while remaining efficient in parameters and computation.
HyperPrompt allows the network to learn task-specific feature maps where the hyper-prompts serve as global task memories, encouraging a more diverse distribution of attention. Extensive experiments show that HyperPrompt can achieve superior performance over strong T5 multi-task learning baselines and parameter-efficient models including Prompt-Tuning and HyperFormer++ on the GLUE and SuperGLUE benchmarks.

## References

Aghajanyan, A., Gupta, A., Shrivastava, A., Chen, X., Zettlemoyer, L., and Gupta, S. Muppet: Massive multi-task representations with pre-finetuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5799-5811, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.468. URL https://aclanthology.org/2021.emnlp-main.468.

Aribandi, V., Tay, Y., Schuster, T., Rao, J., Zheng, H. S., Mehta, S. V., Zhuang, H., Tran, V. Q., Bahri, D., Ni, J., Gupta, J., Hui, K., Ruder, S., and Metzler, D. ExT5: Towards extreme multi-task scaling for transfer learning, 2021.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877-1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

Dehghani, M., Arnab, A., Beyer, L., Vaswani, A., and Tay, Y. The efficiency misnomer. arXiv preprint arXiv:2110.12894, 2021.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171-4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.

Ha, D., Dai, A. M., and Le, Q. V. HyperNetworks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=rkpACe1lx.

Hambardzumyan, K., Khachatrian, H., and May, J. WARP: Word-level Adversarial ReProgramming. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4921-4933, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.381. URL https://aclanthology.org/2021.acl-long.381.

He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T., and Neubig, G. Towards a unified view of parameter-efficient transfer learning, 2021.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for NLP. In Chaudhuri, K. and Salakhutdinov, R.
(eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2790 2799. PMLR, 09 15 Jun 2019a. URL https://proceedings.mlr.press/v97/ houlsby19a.html. Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp. 2790 2799. PMLR, 2019b. Karimi Mahabadi, R., Ruder, S., Dehghani, M., and Henderson, J. Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), August 2021. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207 1216, Stanford, CA, 2000. Morgan Kaufmann. Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. ar Xiv preprint ar Xiv:2104.08691, 2021. Hyper Prompt: Prompt-based Task-Conditioning of Transformers Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871 7880, 2020. Li, X. L. and Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. ar Xiv preprint ar Xiv:2101.00190, 2021. Liu, X., He, P., Chen, W., and Gao, J. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4487 4496, Florence, Italy, July 2019a. Association for Computational Linguistics. doi: 10.18653/v1/P19-1441. URL https://aclanthology.org/P19-1441. Liu, X., He, P., Chen, W., and Gao, J. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4487 4496, 2019b. Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z., and Tang, J. Gpt understands, too, 2021. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. ar Xiv preprint ar Xiv:1910.10683, 2019. Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., Dey, M., Bari, M. S., Xu, C., Thakker, U., Sharma, S. S., Szczechla, E., Kim, T., Chhablani, G., Nayak, N., Datta, D., Chang, J., Jiang, M. T.-J., Wang, H., Manica, M., Shen, S., Yong, Z. X., Pandey, H., Bawden, R., Wang, T., Neeraj, T., Rozen, J., Sharma, A., Santilli, A., Fevry, T., Fries, J. A., Teehan, R., Biderman, S., Gao, L., Bers, T., Wolf, T., and Rush, A. M. Multitask prompted training enables zero-shot task generalization, 2021. Shazeer, N., Cheng, Y., Parmar, N., Tran, D., Vaswani, A., Koanantakool, P., Hawkins, P., Lee, H., Hong, M., Young, C., et al. Mesh-tensorflow: Deep learning for supercomputers. ar Xiv preprint ar Xiv:1811.02084, 2018. Shin, T., Razeghi, Y., Logan IV, R. 
L., Wallace, E., and Singh, S. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4222-4235, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.346. URL https://aclanthology.org/2020.emnlp-main.346.

Sukhbaatar, S., Grave, E., Lample, G., Jegou, H., and Joulin, A. Augmenting self-attention with persistent memory. arXiv preprint arXiv:1907.01470, 2019.

Tay, Y., Zhao, Z., Bahri, D., Metzler, D., and Juan, D.-C. HyperGrid: Efficient multi-task transformers with grid-wise decomposable hyper projections. arXiv preprint arXiv:2007.05891, 2020.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.

von Oswald, J., Henning, C., Sacramento, J., and Grewe, B. F. Continual learning with hypernetworks. arXiv preprint arXiv:1906.00695, 2019.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537, 2019.

Wang, Y., Zhao, Z., Dai, B., Fifty, C., Lin, D., Hong, L., and Chi, E. H. Small towers make big differences, 2020.

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners, 2021.

Wu, S., Zhang, H. R., and Ré, C. Understanding and improving information transfer in multi-task learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SylzhkBtDB.

Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pp. 3320-3328, 2014.

Zaken, E. B., Ravfogel, S., and Goldberg, Y. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021.

## A. Appendix

This section covers the parameter count of HyperPrompt, the experimental details, the calculation of attention mass and entropy, and the per-task performance on GLUE and SuperGLUE.

### A.1. Parameter Count of HyperPrompt-Global (Section 3.4)

Since the encoder and the decoder of Transformers have approximately the same capacity, the calculation considers only the decoder side for simplicity. First, we have a global task prompt $P_\tau \in \mathbb{R}^{l \times d}$ for the $\tau$-th task, which amounts to $dlT$ parameters for $T$ tasks. The global HyperNetworks contain four weight matrices $W^{D_k} \in \mathbb{R}^{(d \cdot b) \times t}$, $W^{D_v} \in \mathbb{R}^{(d \cdot b) \times t}$, $W^{U_k} \in \mathbb{R}^{(b \cdot h \cdot d_h) \times t}$ and $W^{U_v} \in \mathbb{R}^{(b \cdot h \cdot d_h) \times t}$, which result in $4bdt$ parameters (letting $d = h \cdot d_h$). To obtain the layer-aware task embedding, HyperPrompt-Global learns a task embedding $k_\tau \in \mathbb{R}^{t'}$ for the $\tau$-th task and a layer embedding $z_m \in \mathbb{R}^{t'}$ for the $m$-th Transformer block, which in total results in $Tt' + Mt'$ parameters. Besides, a task projection network $h_t$ is applied to fuse the task embedding and the layer embedding into the final layer-aware task embedding $I^m_\tau \in \mathbb{R}^{t}$. $h_t$ is a two-layer feed-forward network and contains $(2t' + t)e$ parameters, where $e$ is its hidden dimension.
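To make the bookkeeping concrete, here is a small arithmetic sketch that plugs illustrative values (our assumptions for a roughly T5 Base-sized decoder; not the exact configuration used in the paper) into the count $dlT + 4bdt + Tt' + Mt' + (2t' + t)e$:

```python
# Illustrative parameter count for HyperPrompt-Global (decoder side only).
# All values below are assumptions for a T5 Base-like setup, not the paper's exact config.
d, l, T, M = 768, 6, 8, 12        # model dim, prompt length, #tasks, #decoder blocks
b, t, t_raw, e = 24, 32, 32, 64   # bottleneck, final/raw task-embedding dims, h_t hidden dim

global_prompts   = d * l * T              # P_tau for all tasks
global_hypernets = 4 * b * d * t          # W_Dk, W_Dv, W_Uk, W_Uv
embeddings       = T * t_raw + M * t_raw  # task and layer embeddings
task_projection  = (2 * t_raw + t) * e    # two-layer MLP h_t

total = global_prompts + global_hypernets + embeddings + task_projection
print(total)                     # 2,402,944 extra parameters
print(global_hypernets / total)  # ~0.98: the O(bdt) term dominates, independent of T and M
```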
### A.2. Experimental Details (Section 4.1)

Our models were implemented in Mesh TensorFlow (Shazeer et al., 2018; https://github.com/tensorflow/mesh) with the T5 library (Raffel et al., 2019; https://github.com/google-research/text-to-text-transfer-transformer). Following Raffel et al. (2019), all data are preprocessed into a sequence-to-sequence format. The sequence length is 512 at the encoder and 32 at the decoder. For all experiments, we train models for 300K steps with a batch size of 128, where each batch is a mixture that samples each task proportionately to the number of examples in its dataset. The learning rate is a constant 1e-3 with the Adam optimizer (Kingma & Ba, 2014). For hyper-parameter tuning, the prompt length $l$ is selected from {12, 16, 20, 24} at the encoder and {2, 4, 6, 8, 10, 12, 14, 16} at the decoder. The bottleneck dimension $b$ of the transform matrices is set to $d/r$, where $d$ is the model dimension of the T5 model and $r$ is a reduction factor selected from {16, 32, 64}. The dimension $t$ of the layer-aware task embedding is selected from {32, 64, 128}. For a fair comparison, the hyper-parameters of the baseline methods are set to have approximately the same number of parameters as HyperPrompt-Global, with the exception that Prompt-Tuning and HyperPrompt-Share are extremely parameter-efficient with significantly fewer parameters.

### A.3. Attention Mass and Entropy Calculation (Section 4.6)

To calculate the attention mass over hyper-prompts per layer, we averaged the hyper-prompt attention softmax scores across 100 validation examples and each attention head in a layer, and summed across each query attending to the hyper-prompts. In other words, we aggregated the amount of attention given to hyper-prompts by queries. To calculate the attention entropy over tokens (other than hyper-prompts), we calculated the entropy of the attention distributions (averaged across attention heads) for 100 validation examples. This results in $\sum_{n=1}^{100}\sum_{L=1}^{12} |X_n|$ entropies, which are calculated and visualized in Figure 3 (bottom). For the HyperPrompt model, this involved re-normalizing the softmax distribution after removing hyper-prompts, as we wanted to understand how the original tokens are attended to.
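A minimal sketch (ours, with made-up tensors) of one plausible reading of the two statistics described above: the attention mass that queries place on the $l$ hyper-prompt key positions, and the entropy over the remaining token positions after re-normalization.

```python
import torch

def prompt_attention_mass(attn, l):
    """attn: [heads, L_query, l + L_key] softmax scores for one example, where the
    first l key positions are hyper-prompts. Returns the average attention mass
    placed on the hyper-prompts, averaged over heads and queries."""
    return attn[..., :l].sum(dim=-1).mean()

def token_entropy(attn, l):
    """Entropy of the attention distribution over ordinary tokens only: drop the l
    prompt positions, re-normalize, average over heads; one entropy per query."""
    tokens = attn[..., l:]
    tokens = tokens / tokens.sum(dim=-1, keepdim=True)
    return -(tokens * tokens.clamp_min(1e-9).log()).sum(dim=-1).mean(dim=0)

# Toy usage: 12 heads, 20 queries, 3 prompt + 20 token key positions.
attn = torch.softmax(torch.randn(12, 20, 23), dim=-1)
print(prompt_attention_mass(attn, l=3), token_entropy(attn, l=3).shape)
```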
### A.4. Per-Task Performance of GLUE and SuperGLUE

Tables 6 and 7 below show the comparison of fine-tuning the entire model against task-specific parameters only on the GLUE and SuperGLUE datasets. Tables 8 and 9 show the detailed results of full tuning of HyperPrompt against baselines on T5 Base. Tables 10 and 11 show the detailed results of full tuning of HyperPrompt against baselines on T5 Large.

| Tunable Parameters | Model | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| All | MTL | 59.4 | 96.6 | 93.3/90.7 | 90.6/90.4 | 89.8/92.3 | 90.8/90.8 | 95.2 | 90.8 | 88.3 |
| All | HyperFormer++-T5.1.1 LARGE | 63.3 | 96.6 | 93.2/90.7 | 92.1/91.9 | 89.7/92.3 | 90.5/90.7 | 95.1 | 89.9 | 88.8 |
| All | HyperPrompt-T5.1.1 LARGE | 64.6 | 96.7 | 94.0/91.8 | 91.3/91.4 | 90.0/92.4 | 90.8/91.0 | 95.4 | 91.9 | 89.4 |
| Task-Specific | HyperFormer++-T5.1.1 LARGE | 58.9 | 95.7 | 92.7/90.0 | 91.6/91.5 | 87.7/90.7 | 89.8/90.0 | 94.5 | 87.0 | 87.3 |
| Task-Specific | HyperPrompt-T5.1.1 LARGE | 57.5 | 96.7 | 93.6/91.2 | 91.9/92.0 | 87.0/90.1 | 90.3/90.6 | 95.0 | 87.7 | 87.5 |

Table 6. Comparison of fine-tuning all vs task-specific parameters on GLUE.

| Tunable Parameters | Model | BoolQ | CB | COPA | MultiRC | ReCoRD | RTE | WiC | WSC | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| All | MTL | 88.5 | 95.8/98.2 | 87.0 | 85.5/56.3 | 89.2/88.6 | 91.7 | 74.0 | 89.4 | 85.9 |
| All | HyperFormer++-T5.1.1 LARGE | 88.9 | 98.7/98.2 | 86.7 | 85.4/56.7 | 89.4/88.8 | 92.1 | 74.5 | 90.7 | 86.4 |
| All | HyperPrompt-T5.1.1 LARGE | 88.7 | 99.1/98.8 | 91.0 | 85.0/55.6 | 89.8/89.1 | 91.3 | 74.2 | 92.0 | 87.0 |
| Task-Specific | HyperFormer++-T5.1.1 LARGE | 85.2 | 90.9/94.6 | 76.7 | 81.5/48.8 | 87.2/86.4 | 87.7 | 67.8 | 82.1 | 80.5 |
| Task-Specific | HyperPrompt-T5.1.1 LARGE | 85.2 | 95.2/95.5 | 75.5 | 82.9/52.9 | 89.1/88.3 | 85.7 | 71.1 | 82.2 | 81.5 |

Table 7. Comparison of fine-tuning all vs task-specific parameters on SuperGLUE.

| Model | #Params | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| MTL | 1.0x | 49.8 | 94.6 | 92.5/89.8 | 90.7/90.5 | 89.2/91.9 | 88.8/88.5 | 93.3 | 85.0 | 85.5 |
| Vanilla Adapter | 1.06x | 60.0 | 95.4 | 92.7/89.8 | 90.2/90.2 | 89.3/91.9 | 88.5/88.1 | 93.5 | 84.4 | 86.7 |
| HyperFormer++ | 1.04x | 56.9 | 94.8 | 92.9/90.1 | 91.1/90.9 | 88.9/91.7 | 88.7/88.3 | 93.4 | 85.6 | 86.5 |
| Prompt-Tuning | 1.0003x | 48.0 | 95.0 | 92.2/89.0 | 90.3/90.2 | 89.0/91.7 | 88.8/88.5 | 93.2 | 82.9 | 84.8 |
| HyperPrompt-Share (ours) | 1.008x | 56.2 | 94.7 | 93.0/90.4 | 90.6/90.4 | 89.2/91.9 | 88.7/88.4 | 93.4 | 85.2 | 86.4 |
| HyperPrompt-Sep (ours) | 1.06x | 57.2 | 94.6 | 93.8/91.4 | 91.0/90.8 | 89.2/91.9 | 88.5/88.4 | 93.4 | 86.6 | 86.8 |
| HyperPrompt-Global (ours) | 1.04x | 57.0 | 95.2 | 93.4/90.9 | 90.4/90.2 | 89.2/92.0 | 88.7/88.5 | 93.4 | 87.1 | 86.8 |

Table 8. Comparison of HyperPrompt with baselines on GLUE using T5 Base.

| Model | #Params | BoolQ | CB | COPA | MultiRC | ReCoRD | RTE | WiC | WSC | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| MTL | 1.0x | 82.6 | 93.4/93.5 | 65.7 | 76.7/39.7 | 80.9/80.2 | 85.6 | 70.5 | 81.4 | 77.2 |
| Vanilla Adapter | 1.03x | 83.5 | 93.4/94.6 | 65.3 | 77.6/42.7 | 81.0/80.2 | 88.2 | 71.0 | 76.9 | 77.5 |
| HyperFormer++ | 1.02x | 83.5 | 96.2/97.0 | 66.3 | 77.8/41.9 | 81.2/80.4 | 87.4 | 71.0 | 80.1 | 78.2 |
| Prompt-Tuning | 1.0003x | 82.5 | 94.0/95.8 | 68.0 | 76.9/40.2 | 80.9/80.2 | 84.1 | 69.3 | 80.8 | 77.3 |
| HyperPrompt-Share (ours) | 1.004x | 83.1 | 95.7/95.2 | 67.7 | 77.3/41.3 | 81.9/81.0 | 87.4 | 70.4 | 80.8 | 78.2 |
| HyperPrompt-Sep (ours) | 1.03x | 83.3 | 97.8/97.0 | 61.7 | 77.6/42.3 | 81.5/80.6 | 86.8 | 71.4 | 78.2 | 77.5 |
| HyperPrompt-Global (ours) | 1.02x | 83.3 | 96.6/96.4 | 69.7 | 77.5/41.0 | 81.7/80.9 | 86.8 | 70.5 | 83.7 | 78.9 |

Table 9. Comparison of HyperPrompt with baselines on SuperGLUE using T5 Base.

| Model | #Params | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| MTL | 1.0x | 59.4 | 96.6 | 93.3/90.7 | 90.6/90.4 | 89.8/92.3 | 90.8/90.8 | 95.2 | 90.8 | 88.3 |
| Vanilla Adapter | 1.06x | 63.8 | 96.5 | 93.7/91.3 | 92.0/91.9 | 90.0/92.5 | 90.6/90.5 | 94.9 | 88.7 | 88.8 |
| HyperFormer++ | 1.02x | 63.3 | 96.6 | 93.2/90.7 | 92.1/91.9 | 89.7/92.3 | 90.5/90.7 | 95.1 | 89.9 | 88.8 |
| Prompt-Tuning | 1.0001x | 62.5 | 96.7 | 93.4/91.0 | 91.3/91.0 | 90.0/92.4 | 90.9/91.0 | 95.4 | 89.9 | 88.8 |
| HyperPrompt-Share (ours) | 1.008x | 65.0 | 96.7 | 93.8/91.6 | 91.1/90.8 | 90.0/92.4 | 90.8/91.1 | 95.3 | 91.3 | 89.3 |
| HyperPrompt-Sep (ours) | 1.06x | 63.9 | 96.6 | 94.6/92.6 | 92.0/91.7 | 90.0/92.4 | 90.9/91.0 | 95.2 | 91.6 | 89.4 |
| HyperPrompt-Global (ours) | 1.02x | 64.6 | 96.7 | 94.0/91.8 | 91.3/91.4 | 90.0/92.4 | 90.8/91.0 | 95.4 | 91.9 | 89.4 |

Table 10. Comparison of HyperPrompt with baselines on GLUE using T5 Large.
| Model | #Params | BoolQ | CB | COPA | MultiRC | ReCoRD | RTE | WiC | WSC | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| MTL | 1.0x | 88.5 | 95.8/98.2 | 87.0 | 85.5/56.3 | 89.2/88.6 | 91.7 | 74.0 | 89.4 | 85.9 |
| Vanilla Adapter | 1.03x | 88.8 | 98.3/98.8 | 86.0 | 85.3/56.0 | 89.3/88.7 | 91.2 | 73.6 | 91.3 | 86.1 |
| HyperFormer++ | 1.01x | 88.9 | 98.7/98.2 | 86.7 | 85.4/56.7 | 89.4/88.8 | 92.1 | 74.5 | 90.7 | 86.4 |
| Prompt-Tuning | 1.0001x | 88.5 | 97.6/98.8 | 85.0 | 84.9/55.2 | 89.0/88.4 | 91.5 | 72.8 | 90.1 | 85.6 |
| HyperPrompt-Share (ours) | 1.004x | 88.5 | 98.7/98.2 | 88.0 | 85.2/55.8 | 89.7/89.1 | 91.8 | 74.1 | 93.9 | 86.8 |
| HyperPrompt-Sep (ours) | 1.03x | 88.6 | 97.6/98.8 | 87.7 | 85.2/56.4 | 89.7/89.1 | 91.6 | 73.5 | 89.4 | 86.1 |
| HyperPrompt-Global (ours) | 1.01x | 88.7 | 99.1/98.8 | 91.0 | 85.0/55.6 | 89.8/89.1 | 91.3 | 74.2 | 92.0 | 87.0 |

Table 11. Comparison of HyperPrompt with baselines on SuperGLUE using T5 Large.