# Low-Rank Bottleneck in Multi-head Attention Models

Srinadh Bhojanapalli¹, Chulhee Yun², Ankit Singh Rawat¹, Sashank Reddi¹, Sanjiv Kumar¹

¹Google Research New York, ²Massachusetts Institute of Technology. Correspondence to: Srinadh Bhojanapalli, Chulhee Yun.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

The attention based Transformer architecture has enabled significant advances in the field of natural language processing. In addition to new pre-training techniques, recent improvements crucially rely on working with a relatively larger embedding dimension for tokens. Unfortunately, this leads to models that are prohibitively large to be employed in downstream tasks. In this paper we identify one of the important factors contributing to the large embedding size requirement. In particular, our analysis highlights that the scaling between the number of heads and the size of each head in the current architecture gives rise to a low-rank bottleneck in attention heads, causing this limitation. We further validate this in our experiments. As a solution, we propose to set the head size of an attention unit to the input sequence length, independent of the number of heads, resulting in multi-head attention layers with provably more expressive power. We empirically show that this allows us to train models with a relatively smaller embedding dimension and with better performance scaling.

1. Introduction

Attention based architectures, such as Transformers, have been effective for sequence modelling tasks such as machine translation (Gehring et al., 2017; Vaswani et al., 2017), question answering, sentence classification (Radford et al., 2018; Devlin et al., 2018) and document generation (Liu et al., 2018). These models have emerged as better alternatives to the recurrent models - RNNs (Sutskever et al., 2014), LSTMs (Hochreiter & Schmidhuber, 1997) and GRUs (Cho et al., 2014). This is mainly due to their feed forward structure, which removes the sequential processing bottleneck for sequence data, making them easier to train compared to the recurrent models. Self attention models have also found applications in vision (Wang et al., 2018), adversarial networks (Zhang et al., 2018), reinforcement learning (Zambaldi et al., 2018; Li, 2017) and speech recognition (Chiu et al., 2018).

Recent advances in using self attention models for natural language tasks have been made by first using a language modeling task to pre-train the models and then fine tuning the learned models on specific downstream tasks. Radford et al. (2018) and Devlin et al. (2018) used Transformers to pre-train a language model and showed that the fine tuned model outperforms LSTMs on many natural language understanding and question answering tasks. For example, BERT (Devlin et al., 2018), a 24 layer Transformer model, is shown to achieve state of the art performance on several NLP tasks, including the SQuAD dataset. These advances, in addition to novel pre-training tasks, relied on bigger models with a larger embedding size. The BERT model uses an embedding size of 1024 (Devlin et al., 2018); GPT-2 uses models with an embedding size of up to 1600 (Radford et al., 2019).

A single Transformer block consists of two key components: a multi-head self attention layer followed by a feed forward layer (Vaswani et al., 2017).
A single head in a multi-head attention layer computes self attention between the tokens in the input sequence, which it then uses to compute a weighted average of embeddings for each token. Each head projects the data into a lower dimensional subspace, and computes the self attention in this subspace. This projection size for each head is commonly referred to as the head size. To keep the number of parameters in the attention layer fixed regardless of the number of heads, the prevalent heuristic is to scale the head size with 1/(number of heads). This heuristic was initially proposed in Vaswani et al. (2017) and has become a de facto standard in multi-head attention models (Radford et al., 2018; Devlin et al., 2018). However, increasing the number of heads decreases the head size, decreasing the expressive power of individual heads.

We prove that reducing the head size to a value below the input sequence length harms the representation power of each head (see Theorem 1). This is because a smaller head size introduces a rank constraint on the projection matrices in each head, and limits their representation power. We indeed notice this effect in practice: while the performance improves with an increasing number of heads in the beginning (Devlin et al., 2018), we notice a drop in the performance once the number of heads increases beyond a certain threshold, as seen in Table 1 and Fig. 1 (see also Table 4(A) in Vaswani et al. (2017)).

Table 1: Performance of BERT-Large (Devlin et al., 2018), a 24 layer Transformer with an embedding size of 1024, suffers with the increasing number of heads after 8 heads.

# heads       8              16             32
# params      336M           336M           336M
SQuAD - F1    90.89 ± 0.15   90.61 ± 0.14   90.45 ± 0.08
SQuAD - EM    84.1 ± 0.34    83.75 ± 0.27   83.48 ± 0.13
MNLI          85 ± 0.2       84.5 ± 0.4     84.4 ± 0.2

In order to avoid hurting the performance, the existing models allow for multiple heads by increasing the embedding size, which in turn increases the head size. However, a larger embedding size, in addition to increasing the number of parameters, makes it expensive to use the model and the learned embeddings in downstream tasks, as the downstream model sizes scale with the embedding size of the tokens. For example, the inference time and memory required in retrieval tasks typically increase linearly with the embedding size.

In this paper we propose setting the head size of attention units to the input sequence length. While this is a simple hyperparameter change in the Transformer architecture, we show that it is important to set this value appropriately to avoid the low-rank bottleneck (see Theorem 1), and to improve the representation power (see Theorem 2). This fixed head size is independent of both the number of heads and the embedding size of the model. This allows us to train models with a relatively smaller embedding size (hence fewer parameters) without affecting the head size. Another advantage of the fixed head size is that, unlike the standard setting, which requires the number of heads to be a factor of the embedding size, we are free to set an arbitrary number of heads as required for the task. Interestingly, we note that this simple yet novel approach of fixing the head size in multi-head Transformers results in empirically superior performance. We evaluate Transformers trained with this fixed head size on language modeling (LM1B dataset), natural language inference (MNLI dataset) and question answering (SQuAD dataset) tasks.
We show that fixing the head size allows us to train Transformers with better performance scaling and a smaller embedding size. In particular, with the fixed head size, Transformers trained with an embedding size of 512 can match the performance of BERT-Large (Devlin et al., 2018), a Transformer with an embedding size of 1024 (see Fig. 2). We further present experimental results evaluating the effect of different choices of the head size and the embedding size in Section 4.

Our contributions in this paper lie in identifying and rigorously proving the low-rank bottleneck in multi-head attention models, and showing that fixing the head size to the input sequence length results in a strictly better model, both theoretically and empirically. The contributions of this paper are summarized below.

- We analyze the representation power of the multi-head self attention layer and prove the low-rank bottleneck that the head size places on the attention units (Theorem 1).
- We propose to set the head size to the input sequence length, and show that fixing the head size strictly improves the expressive power of the multi-head attention layers compared to the standard heuristic for setting the head size (Theorem 2). This allows us to both increase the number of heads per layer and decrease the embedding size, without hurting the performance. We develop a novel construction based approach to prove this result, which can potentially be useful in analyzing other variants of the Transformer architecture.
- We experimentally show that, with a fixed head size, Transformers can be trained with better performance scaling and a smaller embedding size on three standard NLP tasks.

1.1. Related Works

Given the significance of self attention models, there has been work trying to both improve the performance and speed up the computation in Transformers. Ott et al. (2018) and You et al. (2019) reduce precision and use large batch training to reduce the training time of attention models. Child et al. (2019) propose sparse self attention models to speed up the computation in the attention layer for long sequence data generation tasks. They show that these sparse attention models can be trained on tasks with sequence length greater than 10k without sacrificing accuracy. Dehghani et al. (2018) propose a depth recurrent Transformer network that reuses the parameters across layers. They show that this modification makes Transformer networks Turing complete even with finite precision weights. Yang et al. (2019) propose a new way to increase the effective sequence length that the Transformer attends to, by reusing the intermediate embeddings across sequences. They show that the modified architecture performs better on tasks that require computing context over longer sequence lengths. We note that most of these modifications rely on multi-head self attention, the same building block of the Transformer. Our work studies this basic multi-head attention layer and suggests a new way to set the head size, which can easily be applied along with any of the above architectural modifications.

Wu et al. (2019) propose to replace the self-attention layer with lightweight dynamic convolutions and show improved performance on machine translation and language modeling. Even though the resulting model has a faster inference time, it still needs to use a large embedding size (1024), as big as the original attention models.
We believe the techniques in this paper can be combined with these results to realize both a smaller embedding size and a faster inference time. Sun et al. (2019) perform neural architecture search using evolutionary methods on sequence to sequence models and find an evolved Transformer architecture, which, in addition to multi-head attention units, has convolution filters and gated linear units. Our proposed modifications stay closer to Transformers in spirit and can be used as seed units for this architecture search.

Voita et al. (2019) and Michel et al. (2019) study the importance of different heads in an attention layer. They observe that, during inference, many of the heads in each layer can be pruned away with little effect on the prediction. However, they still need multiple heads during training. Child et al. (2019) and Correia et al. (2019) impose a sparsity structure on the attention layer during training to improve both interpretability and performance. Fixing the head size will in fact make it easier to learn such sparsity patterns, as a low-rank constraint does not allow a head to express all possible sparsity patterns. Combining these techniques can hence potentially enable training of sparse attention models with a smaller embedding size.

More recently, Brunner et al. (2019) also rely on a rank argument to study the identifiability question in attention models: when are attention weights unique for a given output of the attention layer? They show that the attention weights are not identifiable when the sequence length is longer than the head size, i.e., there exist infinitely many weights that result in the same output of the attention layer. This arises from the smaller projection size (rank) of the value layer. They are mainly concerned with using attention weights as an explanation, given that different attention weights can result in the same output, and do not address the question: when can an attention layer compute an arbitrary attention weight matrix? In contrast, the focus of this work is on characterizing how the expressive power of an attention head is constrained by its head size.

2. Transformer Architecture and Analysis

In this section, we present the Transformer architecture and analyze the representation power of the multi-head self attention, a key component of the Transformer block. The input to a Transformer network is a sequence of n tokens. Typically, each token is converted into a token embedding of dimension d by an embedding layer. We let X ∈ R^{d×n} be the embedding matrix corresponding to the n tokens in the input sequence.

2.1. Single-Head Attention

The Transformer block is a combination of a self attention layer followed by a feed forward layer (Vaswani et al., 2017). Both layers have a skip connection and use Layer Normalization (LN) (Ba et al., 2016). In particular, for token embeddings X, the dot product attention is computed as follows:

Attention(X) = W_v X · Softmax( (W_k X)^T (W_q X) / √d_k ) = W_v X · P.   (1)

Here W_q ∈ R^{d_q×d}, W_k ∈ R^{d_k×d} and W_v ∈ R^{d_v×d} represent the projection matrices associated with the query, key and value, respectively, in an attention unit (Vaswani et al., 2017). For a single-head attention unit, we have d_q = d_k = d_v = d. In the dot-product attention (cf. (1)), P aims to capture the context of the input for a given token based on the remaining tokens in the input sequence. Subsequently, the output of the attention layer takes the following form:

LN( X + W_o · Attention(X) ),   (2)

where LN(·) represents the layer-normalization operation and W_o ∈ R^{d×d}.
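To make the conventions in (1) and (2) concrete, here is a minimal NumPy sketch of a single attention head, assuming the layout above (X is d×n with one column per token, and the Softmax is applied column-wise so that P is column stochastic). The sizes and variable names are illustrative only and are not taken from any released implementation.

```python
import numpy as np

def column_softmax(scores):
    """Column-wise softmax: each column of the output sums to 1 (column stochastic)."""
    scores = scores - scores.max(axis=0, keepdims=True)  # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=0, keepdims=True)

def single_head_attention(X, Wq, Wk, Wv):
    """Dot-product attention of Eq. (1): Attention(X) = Wv X P, with P the n x n context."""
    dk = Wk.shape[0]
    logits = (Wk @ X).T @ (Wq @ X) / np.sqrt(dk)  # n x n query-key scores
    P = column_softmax(logits)                    # context matrix P
    return Wv @ X @ P                             # weighted token averages, d_v x n

# Illustrative sizes only: a single head, so dq = dk = dv = d.
d, n = 8, 4
rng = np.random.default_rng(0)
X = rng.standard_normal((d, n))                   # one column per token
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
print(single_head_attention(X, Wq, Wk, Wv).shape)  # (8, 4)
```

The residual connection and layer normalization of (2) would simply wrap this output as LN(X + W_o · Attention(X)).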
Given the attention module, as defined in (1), it is natural to question its ability to represent arbitrary contexts P for a given input sequence X. In the following result we establish that for a large enough projection size an attention unit can represent any data pair (X, P). We also show that the model cannot represent arbitrary contexts when d is smaller than n, creating a low-rank bottleneck.

Theorem 1 (Representation Theorem). If d_q = d_k = d ≥ n, then given any full column rank matrix X ∈ R^{d×n} and an arbitrary n×n positive column stochastic matrix P, there always exist d×d projection matrices W_q and W_k such that

Softmax( (W_k X)^T (W_q X) / √d_k ) = P.   (3)

If d_q = d_k = d < n, there exist X and P such that (3) does not hold for all W_q and W_k.

This result shows that the projection dimension d_q = d_k = d needs to be larger than the sequence length n for the attention unit to be able to represent any desired context P. Even though this result describes a single example sequence, it highlights a fundamental property of the model architecture: decreasing the projection size below a certain threshold introduces a bottleneck.

Proof of Theorem 1.

Case d ≥ n. To prove the first part of the result, we present an explicit construction of W_k and W_q which allows us to generate P from X using the dot product attention. Since X has full column rank, there exists a left inverse X† = (X^T X)^{-1} X^T ∈ R^{n×d} such that X† X = I_n. Let W_k = W̃_k X† and W_q = W̃_q X† for some W̃_k, W̃_q ∈ R^{d×n}. Then

(W_k X)^T (W_q X) = X^T (X†)^T W̃_k^T W̃_q X† X = W̃_k^T W̃_q =: W̃_kq.   (4)

Now that the above choice of W_q and W_k has handled the dependence on X, we will choose a W̃_kq depending on P and finish the construction. Below we express the Softmax operation on the query-key inner products; note that the Softmax here is a columnwise operator computing the attention scores for each query. Using (4), we obtain

Softmax( (W_k X)^T (W_q X) / √d_k ) = Softmax( W̃_kq / √d_k ) = exp( W̃_kq / √d_k ) · D_{W̃_kq}^{-1},

where D_{W̃_kq} is an n×n diagonal matrix with (D_{W̃_kq})_{ii} = Σ_j exp( (W̃_kq)_{ji} / √d_k ). Hence, we can establish the desired result by showing that there always exists a W̃_kq that satisfies the following fixed point equation:

exp( W̃_kq / √d_k ) = P · D_{W̃_kq}.   (5)

Given P, to construct such a W̃_kq, we pick an arbitrary positive diagonal matrix D_0 and set

W̃_kq = √d_k · log( P D_0 ).   (6)

Since P is a positive matrix, such a W̃_kq always exists. Next, we verify that this construction indeed satisfies the fixed point equation (cf. (5)). Note that

D_{W̃_kq} = Diag( 1_n^T exp( W̃_kq / √d_k ) ) = Diag( 1_n^T P D_0 ) = D_0.   (7)

The last equality follows from the fact that P is a column stochastic matrix. Now, using (6) and (7),

exp( W̃_kq / √d_k ) = P D_0 = P · D_{W̃_kq}.

This completes the first part of the proof.

Case d < n. Consider the case of d = 1 and n = 2. Then X ∈ R^{1×2} and W_q, W_k ∈ R^{1×1}. Let X = [1, 0]. Then

(W_k X)^T (W_q X) / √d_k = [1, 0]^T W_k W_q [1, 0] / √d_k = (1/√d_k) · [ [W_k W_q, 0], [0, 0] ].

Since the second column of this matrix is always zero, it cannot be used to generate a P that has distinct elements in its second column, e.g.,

P = [ [0.5, 0.75], [0.5, 0.25] ].

We now extend the above example to general values of n and d (d < n). Let X = [1_d, ..., 1_d, 0_d] = [1_mat, 0_d] ∈ R^{d×n}, where 0_d (1_d) ∈ R^d denotes the all zeros (ones) vector, and 1_mat denotes the d×(n−1) all ones matrix. Then

(W_k X)^T (W_q X) / √d_k = [1_mat, 0_d]^T W_k^T W_q [1_mat, 0_d] / √d_k = (1/√d_k) · [ [1_mat^T W_k^T W_q 1_mat, 0_{n−1}], [0_{n−1}^T, 0] ].

Again, the above matrix cannot be used to generate an arbitrary context P.
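The construction in the first part of the proof can be checked numerically. The following sketch (illustrative NumPy code, under the stated assumption d ≥ n) draws a random full column rank X and a random positive column stochastic target P, builds W̃_kq = √d_k · log(P D_0) as in (6) with D_0 = I, undoes the dependence on X through the left inverse X†, and verifies that the column-wise Softmax of (W_k X)^T (W_q X)/√d_k recovers P.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 4          # first case of Theorem 1: d >= n
dk = d

# Random full column rank X and a random positive column stochastic target P.
X = rng.standard_normal((d, n))
P = rng.uniform(0.1, 1.0, size=(n, n))
P = P / P.sum(axis=0, keepdims=True)

# Eq. (6): W_kq = sqrt(dk) * log(P D0); here D0 = I for simplicity.
Wkq = np.sqrt(dk) * np.log(P)

# Factor W_kq = Wk_tilde^T @ Wq_tilde with d x n factors, then remove the
# dependence on X through its left inverse X_dag (X_dag @ X = I_n).
X_dag = np.linalg.pinv(X)                                # n x d left inverse
Wk_tilde = np.vstack([np.eye(n), np.zeros((d - n, n))])  # d x n
Wq_tilde = np.vstack([Wkq, np.zeros((d - n, n))])        # d x n
Wk = Wk_tilde @ X_dag                                    # d x d projection
Wq = Wq_tilde @ X_dag                                    # d x d projection

# Column-wise Softmax of the resulting attention logits recovers P.
logits = (Wk @ X).T @ (Wq @ X) / np.sqrt(dk)
P_hat = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
print(np.allclose(P_hat, P))  # True (up to floating point error)
```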
2.2. Multi-Head Attention

As discussed in Section 2.1, an attention unit updates the embedding of an input token based on a weighted average of the embeddings of all the tokens in the sequence, using the context P (cf. (1)). Vaswani et al. (2017) proposed the multi-head attention mechanism, which increases the representation power of an attention layer by letting multiple attention units operate on different low dimensional projections of the input, with each attention unit referred to as a head. This is followed by a concatenation of the outputs from the different heads. In particular, the computation inside a multi-head attention layer with h heads takes the following form:

head(X)_i = W_v^i X · Softmax( (W_k^i X)^T (W_q^i X) / √(d/h) ),
MultiHead(X) = Concat[ head(X)_1, ..., head(X)_h ] ∈ R^{d×n}.

The output of the multi-head attention layer then becomes

Z = LN( X + W_o · MultiHead(X) ),   (8)

where W_o ∈ R^{d×d}. For a model with h heads, the query, key and value projection matrices {W_q^i, W_k^i, W_v^i} are (d/h)×d matrices. Therefore, each head projects the input onto a (d/h)-dimensional subspace to compute the context, which keeps the number of parameters per layer fixed. Using MultiHead has resulted in empirically better performance over the single head attention layer (Vaswani et al., 2017).

2.3. Low-Rank Bottleneck

While increasing the number of heads seemingly gives the model more expressive power, at the same time we are reducing the head size, which can decrease the expressive power. When the number of heads h is larger than d/n, the attention unit inside each head projects onto a dimension smaller than n, creating a low-rank bottleneck and losing its ability to represent arbitrary context vectors (cf. Theorem 1). Interestingly, this is consistent with the empirical observation in Table 1 that increasing h beyond 8 results in performance degradation in BERT-Large (Devlin et al., 2018); note that d = 1024 and n = 128 for most of the pre-training phase of BERT-Large, so d/n = 8.

Since the sequence length is fixed by the data/task at hand, the only way to increase the number of heads without introducing the low-rank bottleneck is to increase the embedding size d. This is a fundamental limitation of the currently dominant head size heuristic: we need to increase the embedding size in order to support more heads. Unfortunately, increasing the embedding size leads to higher computation and memory requirements to train and store the model. Further, since it is common to use learned embeddings from Transformer based models for downstream tasks (Devlin et al., 2018), a larger embedding size increases the model size and computation required for all the downstream tasks as well.

3. Fixed Multi-Head Attention

In this section we propose to fix the head size of the Transformer, which allows us to enjoy the advantage of the higher expressive power of multiple heads without requiring a large embedding size. The key is to decouple the dependency between the projection size in a head and the embedding size of the model. The projection matrices now project onto subspaces of a fixed dimension d_p, irrespective of the number of heads h. This approach, where d_p is independent of d and h, leads to the following attention computation:

fixedhead(X)_i = V_v^i X · Softmax( (V_k^i X)^T (V_q^i X) / √d_p ),
FixedMultiHead(X) = Concat[ fixedhead(X)_1, ..., fixedhead(X)_h ] ∈ R^{d_p h × n}.

Note that the projection matrices used here, {V_q^i, V_k^i, V_v^i}, are d_p×d matrices. With V_o ∈ R^{d × h d_p}, the output of this new multi-head attention layer takes the following form:

Z = LN( X + V_o · FixedMultiHead(X) ).   (9)

This modification makes each attention head more similar to a hidden unit in a feed forward network or a filter in a convolutional network, and allows us to vary the number of heads without worrying about reducing the representation power per head.
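The difference between the standard heuristic in (8) and the fixed head size in (9) shows up directly in the rank of the per-head attention logits (W_k X)^T (W_q X). The sketch below, with made-up sizes, illustrates this: under the heuristic the (d/h)×d projections cap the logit rank at d/h, while a fixed head size d_p ≥ n keeps the n×n logit matrix full rank, as required by Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, h = 64, 16, 16     # embedding size, sequence length, number of heads
dp_standard = d // h     # standard heuristic: head size d/h = 4 < n
dp_fixed = n             # fixed head size: set to the sequence length

X = rng.standard_normal((d, n))

def head_logit_rank(dp):
    """Rank of (Wk X)^T (Wq X) for a single head with dp x d projections."""
    Wq = rng.standard_normal((dp, d))
    Wk = rng.standard_normal((dp, d))
    logits = (Wk @ X).T @ (Wq @ X) / np.sqrt(dp)  # n x n attention logits
    return np.linalg.matrix_rank(logits)

print(head_logit_rank(dp_standard))  # 4  -> rank capped at d/h: low-rank bottleneck
print(head_logit_rank(dp_fixed))     # 16 -> full rank, since dp >= n (cf. Theorem 1)
```

Because a heuristic-sized head can only produce rank-(d/h) logits, the column-wise Softmax can realize only a restricted family of n×n context matrices, which is the limitation that Theorem 1 formalizes.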
The downside is that, unlike the standard MultiHead, the number of parameters per layer increases with the number of heads. However, this modification allows us to train a model with a smaller embedding size without a low-rank bottleneck, ultimately allowing us to reduce the total number of parameters in the model.

3.1. MultiHead vs. FixedMultiHead Attention

Given a MultiHead layer, we can always represent it using a FixedMultiHead layer whenever the head size satisfies d_p ≥ d/h. While this shows that increasing the number of heads h beyond d/d_p makes individual heads of the FixedMultiHead as expressive as the ones in the MultiHead, it is not obvious whether the FixedMultiHead is strictly more expressive. Can the FixedMultiHead layer represent functions that the standard MultiHead layer cannot? In this subsection we show that, indeed, in the multi-head regime the FixedMultiHead layer is strictly better than the standard MultiHead layer in terms of expressive power.

Consider the standard multi-head attention units in (8):

f^{d,h}_W(X) = W_o · MultiHead(X).

We denote the collection of all parameter matrices as W; here d and h represent the embedding dimension and the number of heads in MultiHead (8), respectively. Similarly, consider the function represented by the fixed head size attention units:

g^{d,h,d_p}_V(X) = V_o · FixedMultiHead(X).

Let V be the collection of all these parameter matrices. Here d, h and d_p denote the parameters of FixedMultiHead (9). We define F_{d,h} and G_{d,h,d_p} to be the classes of functions f^{d,h}_W(·) and g^{d,h,d_p}_V(·), respectively. As noted above, if d_p ≥ d/h, we have F_{d,h} ⊆ G_{d,h,d_p}. The following theorem shows that even simple examples in G_{d,h,d_p} cannot be represented by functions in F_{d,h}; this already shows that F_{d,h} is a strict subset of G_{d,h,d_p}.

Figure 1: Performance of Transformers trained with the prevalent head size heuristic (d_p = d/h) (baseline) compared with the fixed head size (d_p = 32) on a language modeling task (LM1B) on the test set. We train baseline models with embedding sizes from 256 to 512. We train the fixed head size models with a fixed embedding size of 256 and a head size of 32, and vary the number of heads from 4 to 70, while matching the number of parameters. The plots clearly indicate that fixing the head size allows us to train Transformers with a smaller embedding size (plot (b)) and with a better scaling of performance (plot (a)). Note that for perplexity lower values are better.

Theorem 2. Let n ≥ 2, d ≥ d_p, and h > d/d_p. Consider a FixedMultiHead attention layer g^{d,h,d_p}_V(·) with parameters that satisfy the following conditions: the combined value-output projection is full rank, and (V_k^i)^T V_q^i = U for all i = 1, ..., h, where U is a rank-d_p matrix. Then, for any f^{d,h}_W ∈ F_{d,h}, there exists X ∈ R^{d×n} such that f^{d,h}_W(X) ≠ g^{d,h,d_p}_V(X).

Because ||f^{d,h}_W(X) − g^{d,h,d_p}_V(X)|| is a continuous function of X, the existence of such an X implies that the integral of the norm of the difference (i.e., the approximation error) is strictly positive. We note that the assumptions on V_q^i and V_k^i in the above theorem are made to provide a simple and constructive proof; in fact, the failure of MultiHead (F_{d,h}) to represent even such simple attention layers suggests that the situation is likely worse for more complex functions in G_{d,h,d_p}.

Theorem 2 shows that the expressive power of the FixedMultiHead attention function class is strictly superior to that of the standard MultiHead attention function class. Hence the heuristic of reducing the head size with the number of heads limits the expressive power of MultiHead, whereas using the fixed head size increases the expressive power of the attention layers.
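For intuition on the parameter trade-off mentioned above, here is a back-of-the-envelope sketch that counts only the query/key/value/output projection parameters of a single attention layer, ignoring biases and the feed forward block; the configurations are illustrative and chosen to mirror the model sizes used in Section 4.

```python
def multihead_attention_params(d, h):
    """Standard heuristic: h heads with (d/h) x d query/key/value projections, d x d output."""
    return 3 * h * (d // h) * d + d * d        # = 4 d^2, independent of h

def fixedhead_attention_params(d, h, dp):
    """Fixed head size: h heads with dp x d query/key/value projections, d x (h dp) output."""
    return 3 * h * dp * d + d * h * dp         # = 4 h dp d, grows linearly in h

print(multihead_attention_params(d=1024, h=16))          # 4,194,304
print(fixedhead_attention_params(d=512, h=16, dp=128))   # 4,194,304 (same attention budget)
print(fixedhead_attention_params(d=512, h=32, dp=128))   # 8,388,608 (capacity grows with h)
```

Under the standard heuristic the attention parameter count is fixed at 4d² no matter how many heads are used, so the only way to add capacity is to grow d; with a fixed head size, the number of heads itself becomes a capacity knob at a fixed embedding size.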
4. Experiments

The goal of this section is to show that setting the head size in a principled way leads to better performance than using the prevalent heuristic. We again note that while this is a simple hyper-parameter change to the Transformer, setting it to the input sequence length, as suggested by our analysis, allows us to train better models with a smaller embedding size. In this section we present our experiments on three standard NLP tasks, language modeling (LM1B), question answering (SQuAD) and sentence entailment (MNLI), to demonstrate that: 1) increasing the number of heads in Transformers beyond a certain point hurts the performance with the prevalent head size heuristic, but always helps with the fixed head size attention layers; 2) decoupling the head size from the embedding size allows us to train models with a smaller embedding size; and 3) setting the head size appropriately in Transformers allows us to train models with better performance scaling. We first describe our experimental setup, followed by our results and ablation studies on the proposed modifications.

Figure 2 ((a) SQuAD F1, (b) SQuAD EM): Comparison of 24 layer Transformer models trained with the prevalent head size heuristic, BERT-Large (baseline), vs. the fixed head size model on the SQuAD and MNLI dev sets. We vary the embedding size of the baseline models from 512 to 1024. We train the fixed head size models with a fixed embedding size of 512 and a head size of 128, with a varying number of heads from 8 to 32, while matching the number of parameters. Fixing the head size allows us to train models with a smaller embedding size of 512 and with better performance.

Figure 3: Ablation studies on LM1B: (a) We fix the embedding size of all the models to 256 and vary the capacity of Transformers trained with the prevalent head size heuristic (baseline) by increasing the size of the feedforward layers. For the fixed head size models we fix the head size to 32, so the 8 head fixed head size model is the same as the 8 head baseline model. We notice that, again, with the standard heuristic increasing the number of heads beyond 16 hurts the performance, whereas with a fixed head size increasing the number of heads monotonically improves the performance. (b) We show the effect of the head size on the performance with different numbers of heads. Both plots clearly show the advantage of having an additional way to tune the capacity of Transformers with a fixed embedding size.

4.1. Setup and Datasets

For the language modeling task we use the one billion word benchmark dataset (LM1B) (Chelba et al., 2013). This dataset has around 30M training examples and around 300k examples in the test set. We use a sub-word tokenizer with a 32k vocabulary and cap the input to a sequence length of 256. We train a 6 layer Transformer model with the ADAM optimizer using the tensor2tensor library (Vaswani et al., 2018). The detailed experimental setting is presented in Section C.

Multi-Genre Natural Language Inference (MNLI) is a sentence level entailment task, designed to test natural language understanding (Williams et al., 2018). Given a premise sentence and a hypothesis sentence, the goal is to predict whether the hypothesis entails, contradicts or is neutral to the premise.
We report the classification accuracy for this task.

Stanford Question Answering Dataset (SQuAD) is a question answering dataset, where given a paragraph and a question, the goal is to predict the sequence of words in the paragraph that constitutes the answer to the question (Rajpurkar et al., 2016). This is a harder, word level task compared to the sentence classification task. We report both Exact Match (EM) and F1 scores for this task. All results in this section are reported on the dev sets, which have not been used in any experimental choices in this paper.

For these latter two tasks, we follow the two stage approach of first pre-training on a language modeling task and then fine-tuning the models on the task data. We follow the same experimental setup for both pre-training and fine-tuning as BERT (Devlin et al., 2018), and use their codebase¹. We first pre-train our models using the masked language model and the next sentence prediction objectives, and then fine tune the pre-trained model for individual tasks (Devlin et al., 2018). For pre-training we use the English Wikipedia and BooksCorpus datasets (Zhu et al., 2015). The input to the models is tokenized using the WordPiece representation with 30000 tokens in the vocabulary. We present the key experimental choices in Section C, and refer the reader to Devlin et al. (2018) for a complete description of the setup.

¹ https://github.com/google-research/bert

Table 2: Ablation studies on SQuAD and MNLI: (A) A 24 layer Transformer with a fixed head size of 128 and a 512 embedding size shows an improvement in accuracy with the increasing number of heads. (B) The fixed head size model with a 512 embedding size and 8 heads shows an improvement in accuracy with the increasing head size. This shows that the head size is indeed an important capacity controlling parameter in the self attention architecture.

(A) Increasing number of heads
# heads       8              12             16             32
# params      168M           193M           218M           319M
SQuAD - F1    89.6 ± 0.17    90.25 ± 0.21   90.43 ± 0.14   90.95 ± 0.14
SQuAD - EM    82.73 ± 0.21   83.18 ± 0.24   83.59 ± 0.06   84.4 ± 0.29
MNLI          83.5 ± 0.2     84.2 ± 0.2     83.9 ± 0.2     84.9 ± 0.2

(B) Increasing head size
head size     32             64             128            256
# params      130M           142M           168M           218M
SQuAD - F1    88.53 ± 0.06   89.51 ± 0.15   89.6 ± 0.17    90.33 ± 0.23
SQuAD - EM    81.19 ± 0.21   82.41 ± 0.32   82.73 ± 0.21   83.36 ± 0.48
MNLI          82.5 ± 0.1     83.4 ± 0.3     83.5 ± 0.2     83.9 ± 0.2

Choice of the head size. Our proposed modification introduces the head size d_p as a new model hyper-parameter. We choose the head size to be 128 for our BERT experiments, as most of the pre-training is done with sequence length 128 data. While our ablation studies (cf. Table 2(B)) show that a bigger head size improves the performance, there is a tradeoff between increasing the head size vs. the number of heads vs. the number of layers. We found that having a sufficiently large head size, e.g., matching the pre-training sequence length, is better than having a larger embedding size.

4.2. Results

For our first set of experiments we want to see if Transformers trained with a fixed head size and a smaller embedding size can match the performance of training with the standard head size heuristic but a larger embedding size. As a baseline for the language modeling task, we train Transformers with the embedding size increasing from 256 to 512 and with different numbers of heads. We train the fixed head size models with a fixed embedding size of 256 and a head size of 32, with an increasing number of heads from 4 to 70.
We notice that Transformers with a fixed head size and an embedding size of 256 have better performance than the baseline models with an embedding size of 512 (see Fig. 1). We repeat a similar experiment on the other two tasks, where for the baseline we train BERT-Large, a 24 layer, 16 head Transformer with the standard head size heuristic, with embedding sizes from 512 to 1024. We compare it with the fixed head size model, with an embedding size of 512 and a head size of 128, with an increasing number of heads from 8 to 32. We again notice that the Transformers trained with a fixed head size and a 512 embedding size have better performance than the baseline, BERT-Large (see Fig. 2). Note that simply trying to increase the head size of the Transformers by decreasing the number of heads does not improve the performance, as decreasing the number of heads reduces the expressive power of the model (see Fig. 4 in the Appendix). Hence, both the head size and the number of heads need to be set high enough for better performance.

4.3. Ablation

Increasing heads. From Table 1 and Fig. 1a we can see that increasing the number of heads hurts the performance of the Transformer after a certain number. We repeat the same experiments with the fixed head size Transformer, and present the results in Table 2(A) and Fig. 3a. The results show that the performance of the modified model improves monotonically as the number of heads increases. This is because the model capacity (a function of the head size) is no longer reduced with the increasing number of heads.

Increasing head size. In Table 2(B) and Fig. 3b, we present comparisons between models with different head sizes. This shows that the gains in the performance of the fixed head size models indeed come from adjusting the head size of the query, key and value layers in the attention unit. The table shows a clear trend of better performance with a larger head size, suggesting that it indeed is an important factor in the performance of the attention models.

5. Conclusion

In this paper we studied the representation power of multi-head self attention models and proved the low-rank bottleneck that results from a small head size in multi-head attention. We showed that the larger embedding size used in the current models is a consequence of this low-rank bottleneck in multi-head attention layers. We propose to instead use fixed head size attention units, with the head size set to the input sequence length, to avoid this bottleneck. We showed that this allows us to increase the number of heads without increasing the embedding size. As a consequence, we are able to train Transformers with a smaller embedding size and fewer parameters, with better performance. In the future, it will be interesting to experiment with varying head sizes within an attention block and across layers. This requires further understanding of the role of each layer in computing the context, which is an interesting direction for future work.

References

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Brunner, G., Liu, Y., Pascual, D., Richter, O., Ciaramita, M., and Wattenhofer, R. On identifiability in transformers. In International Conference on Learning Representations, 2019.

Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., and Koehn, P. One billion word benchmark for measuring progress in statistical language modeling. CoRR, abs/1312.3005, 2013. URL http://arxiv.org/abs/1312.3005.
Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

Chiu, C.-C., Sainath, T. N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., Kannan, A., Weiss, R. J., Rao, K., Gonina, E., et al. State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4774–4778. IEEE, 2018.

Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 103–111, 2014.

Correia, G. M., Niculae, V., and Martins, A. F. Adaptively sparse transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2174–2184, 2019.

Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., and Kaiser, Ł. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1243–1252. JMLR.org, 2017.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Li, Y. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.

Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L., and Shazeer, N. Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198, 2018.

Michel, P., Levy, O., and Neubig, G. Are sixteen heads really better than one? arXiv preprint arXiv:1905.10650, 2019.

Ott, M., Edunov, S., Grangier, D., and Auli, M. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 1–9, 2018.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. Technical report, OpenAI, 2018.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.

Sun, H., Tan, X., Gan, J.-W., Liu, H., Zhao, S., Qin, T., and Liu, T.-Y. Token-level ensemble distillation for grapheme-to-phoneme conversion. arXiv preprint arXiv:1904.03446, 2019.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Vaswani, A., Bengio, S., Brevdo, E., Chollet, F., Gomez, A. N., Gouws, S., Jones, L., Kaiser, L., Kalchbrenner, N., Parmar, N., Sepassi, R., Shazeer, N., and Uszkoreit, J. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416, 2018. URL http://arxiv.org/abs/1803.07416.
Voita, E., Talbot, D., Moiseev, F., Sennrich, R., and Titov, I. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418, 2019.

Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.

Williams, A., Nangia, N., and Bowman, S. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/N18-1101.

Wu, F., Fan, A., Baevski, A., Dauphin, Y. N., and Auli, M. Pay less attention with lightweight and dynamic convolutions. arXiv preprint arXiv:1901.10430, 2019.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., and Le, Q. V. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.

You, Y., Li, J., Hseu, J., Song, X., Demmel, J., and Hsieh, C.-J. Reducing BERT pre-training time from 3 days to 76 minutes. arXiv preprint arXiv:1904.00962, 2019.

Zambaldi, V., Raposo, D., Santoro, A., Bapst, V., Li, Y., Babuschkin, I., Tuyls, K., Reichert, D., Lillicrap, T., Lockhart, E., et al. Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830, 2018.

Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.

Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pp. 19–27, 2015.