# ON THE DYNAMICS OF TRAINING ATTENTION MODELS

Published as a conference paper at ICLR 2021

Haoye Lu, Yongyi Mao & Amiya Nayak
School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, K1N 6N5, Canada
{hlu044, ymao, nayak}@uottawa.ca

The attention mechanism has been widely used in deep neural networks as a model component. By now, it has become a critical building block in many state-of-the-art natural language models. Despite its great success established empirically, the working mechanism of attention has not been investigated at a sufficient theoretical depth to date. In this paper, we set up a simple text classification task and study the dynamics of training a simple attention-based classification model using gradient descent. In this setting, we show that, for each discriminative word that the model should attend to, a persisting identity exists relating its embedding and the inner product of its key and the query. This allows us to prove that training must converge to attending to the discriminative words when the attention output is classified by a linear classifier. Experiments are performed, which validate our theoretical analysis and provide further insights.

1 INTRODUCTION

Attention-based neural networks have been broadly adopted in many natural language models for machine translation (Bahdanau et al., 2014; Luong et al., 2015), sentiment classification (Wang et al., 2016), image caption generation (Xu et al., 2015), unsupervised representation learning (Devlin et al., 2019), etc. In particular, attention is the key ingredient of the powerful transformers (Vaswani et al., 2017). Despite its great successes established empirically, the working mechanism of attention has not been well understood (see Section 2).

This paper sets up a simple text classification task and considers a basic neural network model with the most straightforward attention mechanism. We study the model's training trajectory to understand why attention can attend to the discriminative words (referred to as the topic words). More specifically, in this task, each sentence is treated as a bag of words, and its class label, or topic, is indicated by a topic word. The model we consider involves a basic attention mechanism, which creates weighting factors to combine the word embedding vectors into a "context vector"; the context vector is then passed to a classifier. In this setting, we prove a closed-form relationship between the topic word embedding norm and the inner product of its key and the query, referred to as the "score", during gradient-descent training. It is particularly remarkable that this relationship holds irrespective of the classifier architecture or configuration. This relationship suggests the existence of a synergy in the amplification of the topic word score and its word embedding; that is, the growths of the two quantities promote each other. This, in turn, allows the topic word embedding to stand out rapidly in the context vector during training. Moreover, when the model takes a fixed linear classifier, this relationship allows rigorous proofs of this mutual promotion phenomenon and of the convergence of training to the topic words. Our theoretical results and their implications are corroborated by experiments performed on a synthetic dataset and on real-world datasets. Additional insights are also obtained from these experiments.
For example, low-capacity classifiers tend to give stronger training signals to the attention module. The mutual promotion effect implied by the discovered relationship can also exhibit itself as mutual suppression in the early training phase. Furthermore, on the real-world datasets, where a perfect delimitation of topic and non-topic words does not exist, interesting training dynamics are observed. Due to length constraints, all proofs are presented in the appendix.

2 RELATED WORKS

Since 2019, a series of works have been published to understand the working and behaviour of attention. One focus of these works pertains to understanding whether an attention mechanism can provide meaningful explanations (Michel et al., 2019; Voita et al., 2019; Jain & Wallace, 2019; Wiegreffe & Pinter, 2019; Serrano & Smith, 2020; Vashishth et al., 2020). Most of these works are empirical in nature, for example, analyzing the behaviours of a well-trained attention-based model (Clark et al., 2019), observing the impact of altering the output weights of the attention module or pruning a few heads (Michel et al., 2019; Voita et al., 2019), or a combination of them (Jain & Wallace, 2019; Vashishth et al., 2020). Apart from acquiring insights from experiments, Brunner et al. (2019) and Hahn (2020) show theoretically that self-attention blocks lack identifiability, in that multiple weight configurations may give equally good end predictions. The non-uniqueness of the attention weights therefore makes the architecture lack interpretability. As a fully connected neural network with infinite width can be seen as a Gaussian process (Lee et al., 2018), a few works apply this perspective to understanding attention with an infinite number of heads and infinite width of the network layers (Yang, 2019; Hron et al., 2020). In this paper, we restrict our study to the more realistic non-asymptotic regime.

3 PROBLEM SETUP

Learning Task. To obtain insights into the training dynamics of attention models, we set up a simple topic classification task. Each input sentence contains m non-topic words and one topic word indicating its topic. Note that a topic may have multiple topic words, but a sentence is assumed to include only one of them. Assume that there are J topics that correspond to the mutually exclusive topic word sets $T_1, T_2, \ldots, T_J$. Let $T = \bigcup_{j=1}^{J} T_j$ be the set of all topic words. The non-topic words are drawn from a dictionary $\Theta$, which is assumed not to contain any topic word. The training set $\Psi$ consists of sentence-topic pairs, where each pair $(\chi, y)$ is generated by (1) randomly picking a topic $y \in \{1, 2, \ldots, J\}$; (2) picking a topic word from the set $T_y$ and combining it with m words drawn uniformly at random from $\Theta$ to generate the sentence (or bag of words) $\chi$. In this task, one aims to develop a classifier from the training set that predicts the topic y for a random sentence $\chi$ generated in this way. We will consider the case $|\Theta| \gg |T|$, which implies that a topic word appears much more frequently in the sentences than a non-topic word.

Attention Model. For this task, we consider a simple attention mechanism similar to the one proposed by Wang et al. (2016). Each word w is associated with two parameters: an embedding $\nu_w \in \mathbb{R}^d$ and a key $\kappa_w \in \mathbb{R}^d$. Based on a global query $q \in \mathbb{R}^d$, the context vector of sentence $\chi$ is computed by
$$\bar{\nu}(\chi) = \sum_{w \in \chi} \nu_w \, \frac{\exp(q^T \kappa_w)}{Z(\chi)}, \quad \text{where } Z(\chi) = \sum_{w \in \chi} \exp(q^T \kappa_w).$$
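To make this setup concrete, the following is a minimal PyTorch sketch of such an attention block (our illustration, not the authors' released code; the class and variable names are ours). It implements exactly the computation above: scores $q^T\kappa_w$, softmax weights $\exp(q^T\kappa_w)/Z(\chi)$, and the weighted sum of embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BagOfWordsAttention(nn.Module):
    """Minimal sketch of the attention block described above (not the authors' code).

    Each word w has an embedding nu_w and a key kappa_w; a single global query q
    produces the weights exp(q^T kappa_w) / Z(chi), and the context vector is the
    weighted sum of the embeddings of the words in the sentence chi.
    """
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)   # nu_w
        self.key = nn.Embedding(vocab_size, dim)         # kappa_w
        self.query = nn.Parameter(torch.randn(dim))      # q (fixed or trainable)

    def forward(self, sentence: torch.Tensor) -> torch.Tensor:
        # sentence: LongTensor of word indices, shape (m + 1,)
        nu = self.embedding(sentence)                    # (m + 1, dim)
        kappa = self.key(sentence)                       # (m + 1, dim)
        scores = kappa @ self.query                      # s_w = q^T kappa_w
        weights = F.softmax(scores, dim=0)               # exp(s_w) / Z(chi)
        return weights @ nu                              # context vector nu_bar(chi)
```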
Then $\bar{\nu}(\chi)$ is fed into a classifier that predicts the sentence's topic in terms of a distribution over all topics.¹ Denote the loss function by $\ell(\chi, y)$. Our upcoming analysis implies that this attention model, although simple, may capture plenty of insight for understanding the training of more general attention models.

Problem Statement. Our objective is to investigate the training dynamics, under gradient descent, of this attention model. In particular, we wish to understand whether there is an intrinsic mechanism that allows the attention model to discover the topic word and accelerates training. Moreover, we wish to investigate, beyond this setup, how the model is optimized when there is no clear delimitation between topic and non-topic words, as in real-world data.

¹The condition that the attention layer directly attends to the word embeddings merely serves to simplify the analysis in Section 4, but this condition is not required for most results presented in Sections 4 and 5. More discussion is given in Appendix A in this regard.

4 THEORETICAL ANALYSIS

It is common to fix some parameters when we train a model with limited resources. Moreover, as the following lemma states, fixing the query does not reduce the capacity of the attention block.

Lemma 1. Assume $q \neq 0$ when initialized. Fixing it does not affect the attention block's capacity.

Thus, our upcoming discussion focuses on the case in which the query is fixed. Doing so also allows us to establish a closed-form expression connecting a word's embedding and the inner product of its key and the query. In Appendix B, extra discussions and experimental results reveal that the trainability of the query does not affect the fundamental relationship we are about to present.

For a topic word t, let $\Psi_t$ denote the training samples involving it. Then, by gradient descent,
$$\nu_t \leftarrow \nu_t - \tau \sum_{(\chi,y)\in\Psi_t} \nabla_{\bar{\nu}(\chi)}\ell(\chi, y)\, \frac{\exp(q^T\kappa_t)}{Z(\chi)}, \qquad (1)$$
$$\kappa_t \leftarrow \kappa_t - \tau \sum_{(\chi,y)\in\Psi_t} q\,\big(\nu_t - \bar{\nu}(\chi)\big)^T \nabla_{\bar{\nu}(\chi)}\ell(\chi, y)\, \frac{\exp(q^T\kappa_t)}{Z(\chi)}, \qquad (2)$$
where $\tau$ denotes the learning rate. As it will turn out, an important quantity in this setting is the inner product $q^T\kappa_w$ of the query q and the key $\kappa_w$, which we denote by $s_w$ and refer to as the score of the word w. Denoting $v_w = \|q\|_2\,\nu_w$, $\eta = \tau\|q\|_2$, $\bar{v}(\chi) = \sum_{w\in\chi}\frac{\exp(s_w)}{Z(\chi)}\,v_w$, and $h(\bar{v}(\chi); y) = \nabla_{\bar{\nu}(\chi)}\ell(\chi, y)$, for a topic word t the dynamics simplifies to
$$v_t \leftarrow v_t - \eta \sum_{(\chi,y)\in\Psi_t} h(\bar{v}(\chi); y)\, \frac{\exp(s_t)}{Z(\chi)}, \qquad (3)$$
$$s_t \leftarrow s_t - \eta \sum_{(\chi,y)\in\Psi_t} \big(v_t - \bar{v}(\chi)\big)^T h(\bar{v}(\chi); y)\, \frac{\exp(s_t)}{Z(\chi)}. \qquad (4)$$
In the rest of the paper, whenever we refer to the embedding of word t, we actually mean $v_t$, not $\nu_t$.

Our analysis assumes the word embeddings are sampled i.i.d. from a distribution with mean zero and variance $\frac{\sigma^2}{d}$, where $\sigma^2$ is assumed close to zero. The word keys and the query are also sampled from zero-mean distributions with a possibly different variance. We assume that this variance is so small that the initial word scores are approximately zero. This assumption on the initial configuration corresponds to the attention model starting as a word-averaging model, and allows us to investigate how the model deviates from this initial setting with training. We also assume the derivative $h(\bar{v}(\chi); y)$ of $\ell$ is Lipschitz continuous in $\bar{v}(\chi)$ throughout training. Further, the assumption in Section 3 that the number of non-topic words $|\Theta|$ is much larger than the number of topic words $|T|$ implies that, with a sufficient number of training samples, the occurrence rate of a topic word is significantly higher than that of a non-topic one. This then justifies the following assumption, which we will use throughout our analysis.

Assumption 1. The scores and the embeddings of the non-topic words are nearly unchanged compared to their counterparts for the topic words.
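For concreteness, the updates in Eqs (3)-(4) for a single topic word can be written out directly. The NumPy sketch below (ours, with assumed variable names) performs one such update, taking the classifier's gradient $h(\bar{v}(\chi); y)$ as a black box, which is exactly why the later results do not depend on the classifier's architecture.

```python
import numpy as np

def topic_word_step(v, s, t, samples, grad_h, eta):
    """One gradient-descent update of a topic word's embedding v_t and score s_t,
    written directly from Eqs (3)-(4); an illustrative sketch, not the authors' code.

    v:       (V, d) array of rescaled embeddings v_w
    s:       (V,)   array of scores s_w = q^T kappa_w
    t:       index of the topic word
    samples: list of (chi, y) pairs in Psi_t; chi is an index array containing t once
    grad_h:  maps (v_bar(chi), y) to h(v_bar(chi); y), the gradient passed down
             by whatever classifier sits on top of the attention block
    eta:     learning rate (tau * ||q||_2 in the paper's notation)
    """
    dv = np.zeros_like(v[t])
    ds = 0.0
    for chi, y in samples:
        chi = np.asarray(chi)
        weights = np.exp(s[chi])
        weights /= weights.sum()                # attention weights exp(s_w) / Z(chi)
        v_bar = weights @ v[chi]                # context vector v_bar(chi)
        h = grad_h(v_bar, y)                    # h(v_bar(chi); y)
        a_t = weights[np.argmax(chi == t)]      # topic word weight exp(s_t) / Z(chi)
        dv += h * a_t                           # summand of Eq (3)
        ds += (v[t] - v_bar) @ h * a_t          # summand of Eq (4)
    v[t] -= eta * dv
    s[t] -= eta * ds
    return v, s
```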
Hence, in light of Assumption 1, our upcoming analysis will treat the scores and embeddings of the non-topic words as constants. Assumption 1 will be validated by the experimental results presented in Section 5. By selecting a sufficiently small $\eta$, we can take the gradient-descent updates in Eq (3) and Eq (4) to their continuous-time limit and obtain²
$$\frac{dv_t}{dt} = -\eta \sum_{(\chi,y)\in\Psi_t} h(\bar{v}(\chi); y)\, \frac{\exp(s_t)}{Z(\chi)}, \qquad (5)$$
$$\frac{ds_t}{dt} = -\eta \sum_{(\chi,y)\in\Psi_t} \big(v_t - \bar{v}(\chi)\big)^T h(\bar{v}(\chi); y)\, \frac{\exp(s_t)}{Z(\chi)}. \qquad (6)$$

²Conversely, Eq (3) is a discretized approximation of Eq (5): $v_t(t+1) - v_t(t) = \int_t^{t+1} \frac{dv_t(t')}{dt'}\,dt' \approx \frac{dv_t(t)}{dt}$. The approximation becomes accurate if $v_t(t+1)$ is close to $v_t(t)$, which can be achieved by choosing a sufficiently small $\eta$. Likewise, Eq (4) is a discretized approximation of Eq (6).

We can then characterize the update of the score and the embedding of a topic word as a continuous-time dynamical system, stated in Lemma 2. The same technique has been used to analyze the training of neural networks in other contexts (Saxe et al., 2014; Greydanus et al., 2019).

Lemma 2. For sufficiently small $\eta$ and $\sigma^2$, the score $s_t$ and embedding $v_t$ of topic word t satisfy
$$\frac{dv_t}{dt} = -\eta\,|\Psi_t|\left\langle h(\bar{v}(\chi); y)\, \frac{\exp(s_t)}{Z(\chi)}\right\rangle_{\Psi_t}, \qquad (7)$$
$$\left(1 + \exp(s_t)\left\langle\frac{1}{Z(\chi\setminus t)}\right\rangle_{\Psi_t}\right)\frac{ds_t}{dt} = \Big(v_t - \big\langle\bar{v}(\chi\setminus t)\big\rangle_{\Psi_t}\Big)^T \frac{dv_t}{dt}, \qquad (8)$$
where $Z(\chi\setminus t) = \sum_{w\in\chi\setminus\{t\}}\exp(s_w)$, $\bar{v}(\chi\setminus t) = \sum_{w\in\chi\setminus\{t\}} v_w\,\frac{\exp(s_w)}{Z(\chi\setminus t)}$, and $\langle\cdot\rangle_{\Psi_t}$ denotes taking the sample mean over the set $\Psi_t$.

Eq (7) implies that the speed at which $v_t$ moves along the direction determined by $\langle h(\bar{v}(\chi); y)\rangle_{\Psi_t}$ is controlled by the attention weight $\frac{\exp(s_t)}{Z(\chi)}$. Eq (8) shows that $s_t$ increases if and only if $v_t$ has a greater projection on $\frac{dv_t}{dt}$ than the weighted average $\langle\bar{v}(\chi\setminus t)\rangle_{\Psi_t}$ of the non-topic word embeddings. Consider a simplified case where $\langle h(\bar{v}(\chi); y)\rangle_{\Psi_t}$ is fixed. Since the change of $v_t$ is much faster than that of the non-topic word embeddings, $v_t$ will have a larger projection on $\frac{dv_t}{dt}$ after a few epochs of training. Then $s_t$ increases, as does its attention weight, which in turn speeds up the extension of the embedding $v_t$. This observation reveals a mutual enhancement effect between the score increment and the embedding elongation. In fact, such an effect exists in general, as stated in the theorem below, irrespective of whether $\langle h(\bar{v}(\chi); y)\rangle_{\Psi_t}$ is fixed.

Theorem 1. In the setting of Lemma 2, from epoch $t_0$ to $t_1$, the topic word score $s_t$ and its embedding $v_t$ satisfy
$$\left[\, s_t(t) + \exp(s_t(t))\left\langle\frac{1}{Z(\chi\setminus t)}\right\rangle_{\Psi_t}\right]_{t=t_0}^{t_1} = \left[\,\frac{1}{2}\,\Big\|v_t(t) - \big\langle\bar{v}(\chi\setminus t)\big\rangle_{\Psi_t}\Big\|_2^2\,\right]_{t=t_0}^{t_1}. \qquad (9)$$

Following from Lemma 2, this theorem implies a positive relationship between the topic word score and the distance between $v_t$ and the non-topic word embedding average $\langle\bar{v}(\chi\setminus t)\rangle_{\Psi_t}$. Remarkably, this result makes no reference to $\langle h(\bar{v}(\chi); y)\rangle_{\Psi_t}$ and is hence independent of it. This implies that the identity in Eq (9) holds irrespective of the choice and setting of the classifier. Theorem 1 further implies a score and embedding norm ("SEN" in short) relationship for the topic words:

Corollary 1. In the context of Theorem 1, by setting $t_0 = 0$ and $t_1 = t$, Eq (9) reduces to
$$\|v_t(t)\|_2^2 = 2\left(s_t(t) + \frac{\exp(s_t(t)) - 1}{m}\right). \qquad (10)$$

The corollary indicates that $\|v_t(t)\|_2$ is monotonically increasing in $s_t(t)$. So, $s_t$ increases if and only if the point $v_t$ departs from its initial location. That is, if the norm of the topic word embedding increases, the word will be attended to. This result is independent of the configuration of all other network layers. Thus, if the gradient field $\langle h(\bar{v}(\chi); y)\rangle_{\Psi_t}$ pushes $v_t$ away from its original location, the topic word is expected to be attended to. This statement can be made precise, as in Theorem 2, when the model uses a linear classifier.

Theorem 2.
Assume the model has a fixed classifier of the form $c(\bar{v}(\chi)) = \mathrm{softmax}(U^T \bar{v}(\chi))$, where the columns of U are linearly independent, and the model is trained using gradient descent with the cross-entropy loss. As training proceeds, the model will attend to the topic word in every input sentence, and its training loss will approach zero.

It is notable that the theorem holds broadly for any arbitrary fixed linear classifier (subject to the mild linear-independence constraint on its parameter U). Additionally, we anticipate that this result holds for a much wider family of classifiers, including trainable and even nonlinear ones. But a rigorous proof appears difficult to obtain in such settings, and we will instead corroborate this claim in an experimental study in Section 5.

To sum up, in this section we have shown two main results: (a) there is a closed-form positive relationship, the SEN relationship, between the topic word score and its embedding norm, which is independent of the configuration of the classifier; (b) the model, equipped with the fixed linear classifier stated in Theorem 2, can be trained so that all topic words are attended to.

5 EXPERIMENTAL STUDY

In this section, we first test our model on an artificial dataset generated through the procedure introduced in Section 3. The test corroborates our theoretical results and validates their assumptions. Our test results suggest that the attention mechanism introduces a synergy between the embedding and the score of topic words. Another experiment is performed on the real datasets SST2 and SST5 (Socher et al., 2013). The experimental results suggest that the SEN relationship of topic words holds at least in the initial training stages. As training proceeds, some words appear to deviate from the theoretical trajectories. Further analysis of this behaviour provides additional insights into the attention model's training dynamics on real-world datasets, which often possess a much more complex structure as well as rich noise. We performed all experiments using PyTorch (Paszke et al., 2017).

We performed our experiments on three models, Attn-FC, Attn-TC and Attn-TL, having the same attention block but different classifiers. The first two have a classifier of the form $c(\bar{v}(\chi)) = \mathrm{softmax}(U^T \bar{v}(\chi))$ and the last of the form $c(\bar{v}(\chi)) = \mathrm{softmax}(U_2^T\,\mathrm{ReLU}(U_1^T \bar{v}(\chi) + b_1) + b_2)$. Except for U in Attn-FC, which is fixed, all parameters of the three models are trainable and optimized using the cross-entropy loss.

Since a real-world dataset does not have a topic word serving as the sentence topic indicator, we introduce a word topic purity measurement to facilitate our discussion of the experiments performed on SST2. Let $\delta^+(w)$ and $\delta^-(w)$ respectively denote the proportions of positive and negative sentences among all training samples containing word w. Then the topic purity of w is $\delta(w) = |\delta^+(w) - \delta^-(w)|$. If $\delta(w) = 1$, w is either a purely positive or a purely negative topic word. If $\delta(w) = 0$, then $\delta^+(w) = \delta^-(w) = 0.5$, which implies w has a completely random topic correspondence.

5.1 EXPERIMENTS ON SYNTHETIC DATASETS

The artificial dataset, consisting of 800 training and 200 test samples, is generated through the procedure introduced in Section 3. The dataset has four topics, and each topic has two topic words. There are 20 non-topic words per sentence, and the non-topic word dictionary size is M = 5,000. Our experiments use the same embedding dimension of 15 for all three models.
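As a concrete illustration, a dataset with these statistics could be generated along the following lines (a sketch of the procedure from Section 3; the function name and the word-id layout are our own assumptions, not the authors' released code).

```python
import numpy as np

def make_synthetic_dataset(n_samples, n_topics=4, words_per_topic=2,
                           m=20, dict_size=5000, seed=0):
    """Generate sentence/topic pairs by the procedure of Section 3 (a sketch).

    Word ids 0 .. n_topics*words_per_topic - 1 are the topic words; the remaining
    dict_size ids are non-topic words drawn uniformly at random.
    """
    rng = np.random.default_rng(seed)
    n_topic_words = n_topics * words_per_topic
    data = []
    for _ in range(n_samples):
        y = rng.integers(n_topics)                                # pick a topic
        t = y * words_per_topic + rng.integers(words_per_topic)   # one of its topic words
        non_topic = n_topic_words + rng.integers(dict_size, size=m)
        chi = np.concatenate(([t], non_topic))                    # bag of words (order irrelevant)
        data.append((chi, y))
    return data

train_set = make_synthetic_dataset(800)
test_set = make_synthetic_dataset(200, seed=1)
```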
Regarding the classifiers, Attn-FC and Attn-TC adopt $U \in \mathbb{R}^{15\times 4}$, while Attn-TL takes $U_1 \in \mathbb{R}^{15\times 10}$, $b_1 \in \mathbb{R}^{10}$, $U_2 \in \mathbb{R}^{10\times 4}$ and $b_2 \in \mathbb{R}^{4}$. For the validation of Theorem 2, the columns of U in Attn-FC are set to be orthonormal and thus linearly independent. Unless otherwise stated, the scores are initialized to zero and the embeddings are initialized by a normal distribution with mean zero and variance $\frac{\sigma^2}{d} = 10^{-6}$. We trained the models using gradient descent with learning rate $\eta = 0.1$ for 5K epochs before measuring their prediction accuracy on the test samples. When training is completed, all three models achieve a training loss close to zero and 100.0% test accuracy, which implies the trained models perfectly explain the variations in the training set and generalize well to the test set.

Verification of Assumption 1 and validation of Corollary 1 and Theorem 2. We repeated the experiments for five runs and plotted the empirical score distributions of the non-topic and topic words of the three well-trained models, with their 95% confidence intervals, in the first two graphs of Fig 1. Compared to the topic words, the scores of the non-topic words are nearly unchanged throughout the entire training process. Likewise, the next two plots show that the embedding norms of the non-topic words are nearly constant, too. This implies that Assumption 1 indeed holds.

Figure 1: The first two histograms plot the distributions, with 95% confidence intervals, of the non-topic and topic word scores for the three well-trained models, Attn-FC, Attn-TC and Attn-TL, obtained by repeating the experiments five times. Likewise, the next two plot the distributions of the embedding norms. The data of the models are overlaid in each histogram for easy comparison.

Figure 2: For Attn-FC, Attn-TC and Attn-TL, the graphs respectively plot the SEN relationship of a randomly picked topic word along with the theoretical prediction derived in Corollary 1. For each model, we repeated the experiments five times and selected the same topic word.

Fig 2 plots the empirical and the theoretical SEN curves of a randomly picked topic word for the three models, where the theoretical curve has the expression stated in Eq (10). The coincidence of the empirical and the theoretical curves in all three models validates the SEN relationship stated in Eq (10) and its independence of the later layers.³ Moreover, Fig 1 (left two) shows that the scores of the topic words exceed their non-topic word counterparts by two orders of magnitude, which implies the topic words are attended to in a well-trained model. As reported earlier, Attn-FC reaches a training loss of roughly zero when training is completed, so Theorem 2 is confirmed.
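The check behind Fig 2 amounts to logging $(s_t, \|v_t\|_2)$ for a topic word during training and comparing against Eq (10). A minimal sketch (ours), reusing the BagOfWordsAttention class sketched in Section 3 and the paper's rescaling $v_w = \|q\|_2\,\nu_w$, could look as follows; the helper names are assumptions, and m = 20 matches the synthetic setting above.

```python
import math
import torch

@torch.no_grad()
def sen_point(model, word_id: int):
    """Return (s_w, ||v_w||_2) for one word of the BagOfWordsAttention sketch,
    using the rescaling v_w = ||q||_2 * nu_w and the score s_w = q^T kappa_w."""
    q = model.query
    s_w = float(model.key.weight[word_id] @ q)
    v_norm = float(q.norm() * model.embedding.weight[word_id].norm())
    return s_w, v_norm

def predicted_norm(s_t: float, m: int = 20) -> float:
    """Embedding norm predicted by Eq (10): ||v_t||_2^2 = 2 * (s_t + (exp(s_t) - 1) / m)."""
    return math.sqrt(2.0 * (s_t + (math.exp(s_t) - 1.0) / m))

# Logging sen_point(model, t) for a topic word t at every epoch and comparing its
# second component with predicted_norm(first component) reproduces a curve like Fig 2.
```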
Lower-capacity classifiers result in stronger attention effects. The comparison among the SEN distributions of the topic words in the three models implies that a lower-capacity classifier leads to a greater topic word SEN, which means a more drastic attention decision. This happens because a classifier of larger capacity can explain more variations of the sample distribution and has more freedom to accommodate and absorb the correcting gradient signals. As a result, the attention layer receives a weaker gradient on average, which makes the embeddings of the topic words extend less from the original point. Thus, as implied by Eq (10), the magnitudes of the scores of the topic words are dampened, and therefore a weaker attention effect is to be expected. This observation hints that, if we know that attending to the right words can explain most of the variations in the sample distribution, we should consider a low-capacity classifier (or later layers in general). Alternatively, we may also freeze the classifier in the initial stage of the training process, forcing the attention layer to explain more variations. Remarkably, none of these modifications affects the relationship stated in Eq (10).

Synergy between score growth and embedding elongation in topic words. Eq (10) implies a positive SEN relationship for the topic word. That is, a larger topic word embedding norm results in a larger score (and thus a larger attention weight), which in turn makes the embedding extend faster. To corroborate this claim, we performed an ablation study by considering two variants of Attn-TC. The first has the scores frozen (referred to as Attn-TC-KF) and the second has the embeddings fixed (referred to as Attn-TC-EF). In this experiment, the embeddings of the models are initialized by a normal distribution with mean zero and variance $\frac{\sigma^2}{d} = 0.1$. We trained all three models by gradient descent for 60K epochs with learning rate $\eta = 0.1$, which is sufficient for all three models to fully converge. All three trained models reached 100% test accuracy.

³In Appendix B, experiments with trainable queries are also implemented. The results indicate that the trainability of the query does not affect the positive SEN relationship. Besides, the query-fixed model has very similar training dynamics to one with a trainable query and a large initial norm.

Figure 3: The changes of the SEN and the training loss for Attn-TC, Attn-TC-KF and Attn-TC-EF. The first two graphs demonstrate the evolution of a topic word's SEN when the mutual enhancement happens. The third shows the change of the training loss. The last two show how the topic word's SEN changes in a separate training run in which the mutual diminution occurs.
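One way to realize the two ablation variants above is simply to freeze the corresponding parameter groups of the attention block sketched in Section 3 (the trainable classifier head is left untouched). This is our sketch of the idea, not the authors' exact implementation; in particular, we freeze the keys (and query) as a stand-in for "frozen scores".

```python
import copy

def make_ablation_variants(attn_tc):
    """Create Attn-TC-KF (scores frozen) and Attn-TC-EF (embeddings fixed) variants
    of a model built on the BagOfWordsAttention sketch; an illustrative sketch only."""
    attn_tc_kf = copy.deepcopy(attn_tc)
    attn_tc_kf.key.weight.requires_grad_(False)    # keys no longer trained, so s_w stays put
    attn_tc_kf.query.requires_grad_(False)         # query is already fixed in the paper's setup

    attn_tc_ef = copy.deepcopy(attn_tc)
    attn_tc_ef.embedding.weight.requires_grad_(False)  # embeddings fixed
    return attn_tc_kf, attn_tc_ef
```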
The first three graphs of Fig 3 describe the evolution of the three models in the first 3K training epochs. For a randomly picked topic word, the first plot shows that its score in Attn-TC grows faster than the one in Attn-TC-EF. Note that the score in Attn-TC-EF finally surpasses the one in Attn-TC because Attn-TC has converged at around 1K epochs. Likewise, the word embedding norm in Attn-TC increases more rapidly than the one in Attn-TC-KF before Attn-TC converges. These observations imply that the attention mechanism introduces a mutual enhancement effect between the training of the topic word's score and its embedding, which makes Attn-TC enjoy the fastest training loss drop, as shown in the third plot.

The mutual enhancement can become mutual diminution in the early training stage if the initial embedding of the topic word has a negative projection on the direction along which it will move. This effect can be precisely characterized by Eq (8). Assume the embedding of a topic word is initialized to have a smaller projection on the gradient, passed from the classifier, than the average of the non-topic words. The reversed order of the projections makes the score of the topic word decrease, as its embedding has a negative effect on the training loss compared to the average of the non-topic word embeddings. This will, in turn, impede the elongation of the embedding vector or even make it shrink (see the last two plots of Fig 3). The negative effect cannot last long, because the topic word embedding moves along the gradient much faster than the non-topic words due to its high occurrence rate in the training samples. By Eq (8) again, $\frac{ds_t}{dt}$ will finally become positive. That is, the score of the topic word starts to increase, and its attention weight will surpass that of the word-averaging model (see the second-to-last plot of Fig 3). Then we start to observe the mutual enhancement effect, which is indicated in the last plot: the growth rate of Attn-TC's embedding norm exceeds Attn-TC-KF's from around the 370th epoch.

5.2 EXPERIMENTS ON SST2 AND SST5

The second part of the experiments is performed on the datasets SST2 and SST5, which contain movie comments and ratings (positive or negative in SST2, and one to five stars in SST5). For simplicity, we limit our discussion to Attn-FC and Attn-TC, using the same configurations as in our previous experiments except that the embedding dimension is set to 200. We remark that our goal is not to find a state-of-the-art algorithm but to verify our theoretical results and to further investigate how an attention-based network works.
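Since the SST2 discussion below repeatedly refers to topic purity, here is a minimal sketch (our helper, with an assumed input format) of how $\delta(w)$ from Section 5 can be computed for a binary-labelled training set.

```python
from collections import Counter

def topic_purity(train_pairs):
    """delta(w) = |delta_plus(w) - delta_minus(w)| over the training samples containing w.

    train_pairs: iterable of (tokens, label) with label in {0, 1} (negative / positive).
    A sketch of the measurement defined in Section 5, not the authors' code.
    """
    pos, total = Counter(), Counter()
    for tokens, label in train_pairs:
        for w in set(tokens):              # count each word once per sentence
            total[w] += 1
            pos[w] += int(label == 1)
    # delta_plus = pos/total, delta_minus = 1 - delta_plus
    return {w: abs(2.0 * pos[w] / total[w] - 1.0) for w in total}

# e.g., with an assumed token/label list sst2_train:
#   purity = topic_purity(sst2_train); purity["best"], purity["powerful"]
```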
Figure 5: The juxtaposition of "powerful" and "best" during the training of Attn-FC (first row) and Attn-TC (second row). In each row, the first two graphs are the empirical and theoretical SEN curves of the words "powerful" and "best". The next two plot their SEN as a function of the number of epochs. The last shows the changes of their weights in a sentence in which they co-occur.

For both SST2 and SST5, we trained the two models by gradient descent with learning rate $\eta = 0.1$, combined with the early stopping technique (Prechelt, 2012) with patience 100. As PyTorch requires equal-length sentences in a batch, we pad all sentences to the same length and set the score of the padding symbol to negative infinity. Under this configuration, the trained Attn-FC and Attn-TC reached 76.68% and 79.59% test accuracy on SST2, and 38.49% and 40.53% on SST5.

Validation of Corollary 1. As the true topic words are unknown in a real dataset, we checked the fifty words with the largest scores after training was completed. We observed that most of these words have SEN curves close to our theoretical prediction. We picked two words for each model-dataset combination and plotted their curves with their theoretical counterparts in Fig 4.

Figure 4: The SEN relationship in the training process of Attn-FC (left two columns) and Attn-TC (right two columns) on SST2 (upper row) and SST5 (lower row). The words, picked from among those having the largest five scores, have SEN curves that agree with their theoretical counterparts.

The competition between two topic-word candidates with different occurrence rates and topic purities. To better understand how the attention block works, we investigated the cases in which the empirical and the theoretical SEN curves disagree. We limit our discussion to the model trained on SST2. For both Attn-FC and Attn-TC, we noticed that there are mainly two types of deviation, as shown in the first two columns of Fig 5. In the first, the theoretical curve overestimates the score growth of "powerful" in terms of the embedding norm, while "best", shown in the second column, experiences a score drop combined with a failed theoretical prediction in the final training stage. These two types of disagreement in fact occur in pairs: they are caused by a pair of words, one of which appears frequently in the training samples but has a low topic purity, while the other has the opposite profile.

Regarding the ("powerful", "best") pair, "best" appears in 128 training samples, of which 103 are positive. In comparison, "powerful" appears in 36 training samples, all of which are positive. The large difference in the number of samples makes the embedding norm of "best" extend much faster than that of "powerful" in the initial training process (Fig 5, third column). As the positive SEN relationship implies, a quicker embedding norm increase of "best" results in its faster score growth. Therefore, "best" will be more attended to (Fig 5, last column), which thus further accelerates its SEN increase (this is the synergy between the embedding and the score that has been demonstrated in the ablation study).
This process does not stop until the gradient from the classifier diminishes because of the word's low topic purity (which is similar to the case when label smoothing (Szegedy et al., 2016) is applied to alleviate overfitting). In contrast, "powerful" is less trained initially due to its low occurrence rate. But its high topic purity makes the direction of its gradient stable, along which its embedding steadily elongates. The elongation will, at last, let the embedding have a greater projection on the gradient vector than the average of the other words appearing in the same sentence. Thus, the score of "powerful" starts to increase, as shown by Eq (8) and plotted in the second-to-last column of Fig 5. In contrast, as the gradient magnitude drops, the embedding of "best" extends at a decreasing speed, and its projection on the gradient passed from the classifier will finally be surpassed by words co-occurring with it but having a higher topic purity (like "powerful"). Thus, its score eventually starts to drop.

The dynamics of the topic purity of attended words. The analysis of the inconsistency between the empirical and theoretical SEN curves hints that the disagreement is strongly related to a word's topic purity and its occurrence rate. To better characterize their dynamics, for a word w, let $A_w(t)$ be the list, at epoch t, that records the attended word (i.e., the word with the largest score) in each of the sentences containing w. If multiple words in a sentence have the same score, we randomly pick one of them. Note that $A_w(t)$ may contain repeated words, as a word can be attended to in multiple sentences.

Figure 6: The average topic purity (red) and number of occurrences (blue) of the words attended to in the sentences containing w, as the training of Attn-FC (first row) and Attn-TC (second row) proceeds. The word w is picked from the set of words having the largest five scores in the well-trained models.

Figure 7: The average topic purity (red) and number of occurrences (blue) of the words attended to in all the training sentences, as the training of Attn-FC (left) and Attn-TC (right) proceeds.

We selected w to be the words having the largest five scores in the well-trained Attn-FC and Attn-TC, respectively. At various epochs t, Fig 6 plots how the average topic purity $\delta(w')$ over $w' \in A_w(t)$ evolves, as well as the average number of occurrences of these words in the training samples. Both models initially attend to words that mostly have low topic purity but a high occurrence rate. As training proceeds, the average topic purity of the attended words increases while the average occurrence rate drops. At the end of training, almost all the attended words have a close-to-one topic purity. Fig 7 shows the evolution of the average topic purity and the average occurrence rate of the attended words over the entire training set.
While a similar changing pattern can be observed, the average topic purity is lower than the one presented in Fig 6 when training is completed. We argue that this happens because some sentences do not contain any high-topic-purity words, or their high-topic-purity words occur too rarely to be sufficiently trained.

6 CONCLUSION

This paper investigated the dynamics of training a series of attention-based bag-of-words classifiers on a simple artificial topic classification task. We have shown a persisting closed-form positive SEN relationship for the words to which the model should attend. This result is independent of the configuration of the later layers. Through this result, we have proved that the model must converge to attending to the topic word, with the training loss close to zero, if the output of the attention layer is fed into a fixed linear classifier. A series of experiments has confirmed these results. The experimental results indicate that the attention block tends to make a more drastic decision if its later layers have a lower capacity. This hints that a classifier of limited capacity may help if selecting the right word explains most of the variations in the training samples. An ablation study shows a synergy between the topic word score's growth and its embedding's elongation, leading to a faster training loss drop than in the fixed-score and fixed-embedding variants. Besides, we have shown that this mutual promotion effect can also exhibit itself as mutual suppression in the initial training stage. We investigated the competition between two topic-word candidates with large differences in topic purity and occurrence rate in the training samples. Words with a higher occurrence rate but possibly low topic purity are more likely to be attended to initially. However, as training proceeds, the attended words are gradually replaced by those of higher topic purity.

REFERENCES

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations (ICLR), 2014.

Sergei G. Bobkov and Christian Houdré. Variance of Lipschitz functions and an isoperimetric problem for a class of product measures. Bernoulli, 2(3):249-255, 1996.

Gino Brunner, Yang Liu, Damián Pascual, Oliver Richter, Massimiliano Ciaramita, and Roger Wattenhofer. On Identifiability in Transformers. In International Conference on Learning Representations (ICLR), 2019.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. What Does BERT Look at? An Analysis of BERT's Attention. arXiv preprint arXiv:1906.04341, 2019.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), 2019.

Sam Greydanus, Misko Dzamba, and Jason Yosinski. Hamiltonian Neural Networks. In Neural Information Processing Systems (NIPS), 2019.

Michael Hahn. Theoretical Limitations of Self-Attention in Neural Sequence Models. Transactions of the Association for Computational Linguistics, 8:156-171, 2020.

Jiri Hron, Yasaman Bahri, Jascha Sohl-Dickstein, and Roman Novak. Infinite attention: NNGP and NTK for deep attention networks. In International Conference on Machine Learning (ICML), 2020.

Sarthak Jain and Byron C. Wallace. Attention is not Explanation.
In Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019.

Jaehoon Lee, Jascha Sohl-Dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep neural networks as Gaussian processes. In International Conference on Learning Representations (ICLR), 2018.

Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), 2015.

Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In Neural Information Processing Systems (NIPS), volume 32. Curran Associates, Inc., 2019.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Neural Information Processing Systems (NIPS), 2017.

Lutz Prechelt. Early Stopping - But When?, pp. 53-67. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.

Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations (ICLR), 2014.

Sofia Serrano and Noah A. Smith. Is attention interpretable? In Annual Meeting of the Association for Computational Linguistics (ACL), 2020.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng, and Christopher Potts. Parsing With Compositional Vector Grammars. In Empirical Methods in Natural Language Processing (EMNLP), 2013.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Shikhar Vashishth, Shyam Upadhyay, Gaurav Singh Tomar, and Manaal Faruqui. Attention Interpretability Across NLP Tasks. In International Conference on Learning Representations (ICLR), 2020.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems (NIPS), 2017.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, July 2019.

Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. Attention-based LSTM for aspect-level sentiment classification. In Empirical Methods in Natural Language Processing (EMNLP), pp. 606-615, Austin, Texas, November 2016. Association for Computational Linguistics.

Sarah Wiegreffe and Yuval Pinter. Attention is not not explanation. In 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), 2015.

Greg Yang. Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes. In Neural Information Processing Systems (NIPS), 2019.
A   THE RESULTS OF THE MODEL IN WHICH THE ATTENTION BLOCK ATTENDS TO A CNN LAYER

The main text focuses on a model in which the attention block directly attends to the word embeddings. Such a design simplifies our analysis but should not be considered a precondition for our results to hold. In particular, the positive SEN relationship generally holds regardless of the other components of the network, which can be justified as follows. There are two correlated sources of gradient signals: one backpropagates through the score to update the key and the query; the other backpropagates from the classifier loss to update the word embedding. This correlation governs the positive relationship between the score and the embedding norm. Although the integration of other modules makes the analysis harder, the relationship should persist. Hence, all the results depending on this relationship remain valid.

We empirically verify these claims by implementing an experiment similar to the one discussed in the main text. We modified Attn-TC (the resulting model is named Attn-TC-CNN) by adding two parallel CNN layers that respectively process the word embeddings and the keys before feeding them into the attention block. We then construct a data generation process analogous to the one introduced in Section 3. Finally, we empirically show that Assumption 1 and the positive SEN relationship still hold.

Attn-TC-CNN has the same configuration as Attn-TC (introduced in Section 5.1) except for the two parallel CNN layers added to preprocess the word embeddings and the keys. The CNN layer processing the word embeddings has a kernel of size $d \times 2$ and stride 1. We use d kernels to keep the word embedding dimension unchanged. So, for two consecutive words in a sentence, the CNN mixes their embeddings and produces a new one. Given that a sentence has m + 1 words, the input embedding matrix has shape $d \times (m+1)$ and the output has shape $d \times m$. Likewise, the CNN layer processing the keys has kernel size $d \times 2$ and stride 1, and there are d kernels in total.

Consider a classification problem containing two topics, A and B. The sentences of the two topics are generated by two Markov chains, respectively, which are constructed as follows:

1. Let $L_i$ (i = A, B) be two mutually exclusive sets of ordered word pairs. The word pairs do not contain repeating words.
2. For i = A, B:
   (a) Initialize a Markov chain ($MC_i$) over the words in the dictionary such that, from any word, there is a uniform probability of moving to any word (including itself) in one step.
   (b) Group the word pairs in $L_i$ according to their leading words. For each group, let s denote the shared leading word and $e_i$ (i = 1, 2, ..., n) the second words. Set the probability of moving from s to each $e_i$ in one step to $\frac{1}{n}$ and the probabilities of moving to all other words to zero.
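A sketch of the chain construction in steps 1-2 above is given below (our illustration; the dictionary size and the concrete word-pair choices in the usage example are placeholders, not values taken from the paper).

```python
import numpy as np

def build_topic_chain(vocab_size: int, topic_pairs):
    """Transition matrix of MC_i for one topic, following steps (a)-(b) above (a sketch).

    topic_pairs: list of ordered word pairs (s, e) in L_i; pairs sharing a leading
    word s split its outgoing probability uniformly among their second words.
    """
    P = np.full((vocab_size, vocab_size), 1.0 / vocab_size)  # step (a): uniform moves
    leads = {}
    for s, e in topic_pairs:                                  # step (b): group by leading word
        leads.setdefault(s, []).append(e)
    for s, ends in leads.items():
        P[s] = 0.0
        P[s, ends] = 1.0 / len(ends)                          # probability 1/n to each e_i
    return P

# Example with assumed placeholder parameters: a 200-word dictionary, two pairs per topic.
P_A = build_topic_chain(200, topic_pairs=[(3, 17), (42, 95)])
P_B = build_topic_chain(200, topic_pairs=[(7, 120), (58, 161)])
```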
We call the word pairs in $L_i$ (i = A, B) the topic word pairs. For each pair of words, a new embedding and a new key are generated by feeding their original embeddings and keys into the CNNs. We refer to the new embedding as the topic word-pair embedding and to the new key as the topic word-pair key. Likewise, for any other word pair, the newly generated embedding and key are referred to as a non-topic word-pair embedding and key.

Assume the training dataset contains sentences of length m + 1. We generate it by repeating the following procedure:

1. Uniformly sample a topic $i \in \{A, B\}$.
2. Uniformly sample a word pair from $L_i$ and let its first word be the starting word.
3. Sample another m words by running the Markov process m times.

In our experiments, the word dictionary has 200 words, and so there are 40,000 word pairs. The sentence length is set to 10. For i = A, B, each $L_i$ contains two randomly picked word pairs as the topic word pairs. Note that we have ensured that $L_A$ and $L_B$ are mutually exclusive. We used 4,000 samples for training and 1,000 for testing. The model was trained for 3,000 epochs and achieved 99.89% test accuracy.

We repeated the experiments for ten runs and plotted the distributions, along with the 95% confidence intervals, of the scores and the embedding norms of the topic and the non-topic word pairs in Fig 8. Note that the word-pair embeddings and keys are generated by the CNN layers, so the initial embeddings and scores are not very close to the origin. To facilitate the verification of Assumption 1, we centered them by subtracting their initial values before plotting their distributions.

Figure 8: The first two histograms plot the distributions of the non-topic and topic word-pair scores, along with their 95% confidence intervals, for the well-trained Attn-TC-CNN. Likewise, the next two plot the distributions of the centered embedding norms.

From Fig 8, we observe that the non-topic word-pair scores and embeddings are nearly unchanged in comparison to their topic counterparts. Therefore, Assumption 1 largely holds even when CNN layers are added to process the word embeddings and the keys before feeding them into the attention block.

Randomly picking a topic word pair, we plotted its empirical and theoretical SEN curves over ten runs in Fig 9.

Figure 9: The empirical and theoretical SEN curves of a randomly picked topic word pair for the model Attn-TC-CNN.
Fig 9 shows that the positive SEN relationship holds even if the attention layer attends to the outputs of other layers instead of the word embeddings directly. Therefore, all the results that follow from this relationship remain valid.

B   DISCUSSION AND EXPERIMENTAL RESULTS OF THE MODELS WITH A TRAINABLE QUERY

In Section 4, we assumed a fixed, non-trainable query, which allows the derivation of a clean closed-form SEN relation. But it is worth emphasizing that the positive relationship between the score and the embedding norm in fact exists regardless of the trainability of the query. As mentioned in Appendix A, there are two correlated sources of gradient signals: one backpropagates through the score to update the key and the query; the other backpropagates from the classifier loss to update the word embedding. This correlation governs the positive relationship between the score and the embedding norm. Whether the query is trainable does not alter the existence of this correlation, although a trainable query makes the analysis more difficult. In particular, when the query norm is large, the update of the query is relatively negligible; hence, the training behaviour is similar to that with a fixed query.

To verify these claims, we re-implemented the experiments introduced in Section 5.1 with the same configurations except that the query is trainable. In Fig 10, we plot the empirical and the theoretical SEN curves of a randomly picked topic word obtained by training Attn-FC, Attn-TC and Attn-TL with a trainable query. We initialize the entries of the query vector by a normal distribution $N(0, \sigma^2)$; from left to right, $\sigma^2$ increases, as does the initial query norm.

Figure 10: The empirical and the theoretical SEN curves of a randomly picked topic word for Attn-FC (first row), Attn-TC (second row) and Attn-TL (third row) with a trainable query. The query is initialized by a normal distribution $N(0, \sigma^2)$. From left to right, $\sigma^2$ = 0.1, 0.5, 1 and 2, respectively.

We observe that, regardless of how the query is initialized, the positive SEN relationship is always preserved. Moreover, as the initial norm increases, the empirical curve approaches the theoretical curve given by Eq (10). As we have discussed, this asymptotic behaviour arises because an increasing initial norm of the query makes its change during training negligible compared to its already large norm.

C   PROOFS OF THE RESULTS IN SECTION 4

Proof of Lemma 1. Assume there are in total N words in the dictionary (including both the topic and the non-topic words). Sample the keys of the N words arbitrarily to generate $K \in \mathbb{R}^{d\times N}$ with the key vectors as its columns, and randomly pick $q \in \mathbb{R}^d$ as a query. To prove the lemma, it is sufficient to show that for any non-zero $q' \in \mathbb{R}^d$, there exists $K' \in \mathbb{R}^{d\times N}$ such that $q'^T K' = q^T K$. Since $q' \neq 0$, without loss of generality assume its first entry $q'_1$ is non-zero. Let $S = [s_1, s_2, \ldots, s_N] = q^T K$. For $i = 1, 2, \ldots, N$, let the i-th column of $K'$ be $[\frac{s_i}{q'_1}, 0, \ldots, 0]^T$. Then we can easily check that $q'^T K' = q^T K$.

Proof of Lemma 2. Picking a sufficiently small $\eta$, a continuous-time limit can be taken to obtain the dynamics
$$\frac{dv_t}{dt} = -\eta \sum_{(\chi,y)\in\Psi_t} h(\bar{v}(\chi); y)\,\frac{\exp(s_t)}{Z(\chi)}, \qquad (11)$$
$$\frac{ds_t}{dt} = -\eta \sum_{(\chi,y)\in\Psi_t} \big(v_t - \bar{v}(\chi)\big)^T h(\bar{v}(\chi); y)\,\frac{\exp(s_t)}{Z(\chi)}, \qquad (12)$$
which are equivalent to
$$\frac{dv_t}{dt} = -\eta\,|\Psi_t|\left\langle h(\bar{v}(\chi); y)\,\frac{\exp(s_t)}{Z(\chi)}\right\rangle_{\Psi_t}, \qquad (13)$$
$$\frac{ds_t}{dt} = -\eta\,|\Psi_t|\left\langle\big(v_t - \bar{v}(\chi)\big)^T h(\bar{v}(\chi); y)\,\frac{\exp(s_t)}{Z(\chi)}\right\rangle_{\Psi_t}. \qquad (14)$$
As $h(\bar{v}(\chi); y)$ is assumed Lipschitz continuous, we have⁴
$$\left|\big\langle v_t - \bar{v}(\chi)\big\rangle_{\Psi_t}^T\left\langle h(\bar{v}(\chi); y)\,\frac{\exp(s_t)}{Z(\chi)}\right\rangle_{\Psi_t} - \left\langle\big(v_t - \bar{v}(\chi)\big)^T h(\bar{v}(\chi); y)\,\frac{\exp(s_t)}{Z(\chi)}\right\rangle_{\Psi_t}\right| < L\,d\,\sigma^2\,\frac{m\exp(s_t)}{Z(\chi)}, \qquad (15)$$
where L is the Lipschitz constant.

⁴The derivation is given in Appendix D.
Choosing a small enough $\sigma^2$ so that $L\,d\,\sigma^2$ is close to zero, we have
$$\big\langle v_t - \bar{v}(\chi)\big\rangle_{\Psi_t}^T\left\langle h(\bar{v}(\chi); y)\,\frac{\exp(s_t)}{Z(\chi)}\right\rangle_{\Psi_t} \approx \left\langle\big(v_t - \bar{v}(\chi)\big)^T h(\bar{v}(\chi); y)\,\frac{\exp(s_t)}{Z(\chi)}\right\rangle_{\Psi_t}.$$
Then, combining it with Eq (14) yields
$$\frac{ds_t}{dt} \approx -\eta\,|\Psi_t|\,\big\langle v_t - \bar{v}(\chi)\big\rangle_{\Psi_t}^T\left\langle h(\bar{v}(\chi); y)\,\frac{\exp(s_t)}{Z(\chi)}\right\rangle_{\Psi_t} = \big\langle v_t - \bar{v}(\chi)\big\rangle_{\Psi_t}^T\,\frac{dv_t}{dt}. \qquad (16)$$
Remarkably, the relation stated in Eq (16) is independent of $h(\bar{v}(\chi); y)$, so the relationship does not depend on the architecture of the classifier (or of the later layers in general). Expanding Eq (16),
$$\frac{ds_t}{dt} = \big\langle v_t - \bar{v}(\chi)\big\rangle_{\Psi_t}^T\,\frac{dv_t}{dt}
= \left\langle\sum_{w\in\chi\setminus\{t\}}\frac{\exp(s_w)}{Z(\chi)}\,\big(v_t - v_w\big)\right\rangle_{\Psi_t}^T\frac{dv_t}{dt}
= \left\langle\frac{Z(\chi\setminus t)}{Z(\chi)}\,\big(v_t - \bar{v}(\chi\setminus t)\big)\right\rangle_{\Psi_t}^T\frac{dv_t}{dt}
\approx \left\langle\frac{Z(\chi\setminus t)}{Z(\chi)}\right\rangle_{\Psi_t}\Big(v_t - \big\langle\bar{v}(\chi\setminus t)\big\rangle_{\Psi_t}\Big)^T\frac{dv_t}{dt},$$
where $\bar{v}(\chi\setminus t) = \sum_{w\in\chi\setminus\{t\}} v_w\,\frac{\exp(s_w)}{Z(\chi\setminus t)}$. The approximation in the last step uses the independence between the score and embedding initializations, together with the assumption that the scores of the non-topic words stay the same during the entire training process. Rearranging the equation, and noting that $\frac{Z(\chi)}{Z(\chi\setminus t)} = 1 + \frac{\exp(s_t)}{Z(\chi\setminus t)}$, yields
$$\left(1 + \exp(s_t)\left\langle\frac{1}{Z(\chi\setminus t)}\right\rangle_{\Psi_t}\right)\frac{ds_t}{dt} = \Big(v_t - \big\langle\bar{v}(\chi\setminus t)\big\rangle_{\Psi_t}\Big)^T\frac{dv_t}{dt},$$
which is Eq (8). Together with Eq (13), which is Eq (7), this proves the lemma.

Proof of Theorem 1. By Lemma 2, we have
$$\left(1 + \exp(s_t)\left\langle\frac{1}{Z(\chi\setminus t)}\right\rangle_{\Psi_t}\right)\frac{ds_t}{dt} = \Big(v_t - \big\langle\bar{v}(\chi\setminus t)\big\rangle_{\Psi_t}\Big)^T\frac{dv_t}{dt}.$$
By the fundamental theorem of calculus, integrating both sides from $t = t_0$ to $t_1$ (and noting that $\langle 1/Z(\chi\setminus t)\rangle_{\Psi_t}$ and $\langle\bar{v}(\chi\setminus t)\rangle_{\Psi_t}$ are treated as constants) yields
$$\left[\, s_t(t) + \exp(s_t(t))\left\langle\frac{1}{Z(\chi\setminus t)}\right\rangle_{\Psi_t}\right]_{t=t_0}^{t_1} = \left[\,\frac{1}{2}\,\Big\|v_t(t) - \big\langle\bar{v}(\chi\setminus t)\big\rangle_{\Psi_t}\Big\|_2^2\,\right]_{t=t_0}^{t_1}.$$

Proof of Corollary 1. Since the scores and the embeddings of the non-topic words are considered constant, we have $\langle\frac{1}{Z(\chi\setminus t)}\rangle_{\Psi_t} = \frac{1}{m}$ and $\langle\bar{v}(\chi\setminus t)\rangle_{\Psi_t} \approx 0$. As $v_t$ is initialized with mean zero and a very small variance, $\|v_t(0)\|_2^2 \approx 0$, and by assumption $s_t(0) \approx 0$. Then, Eq (9) can be written as
$$\|v_t(t)\|_2^2 = 2\left(s_t(t) + \frac{\exp(s_t(t)) - 1}{m}\right).$$

Proof of Theorem 2 (sketch). Without loss of generality, pick a topic word t and assume it corresponds to the $\phi$-th topic. We prove the theorem by showing that, as the number of epochs increases, for any sentence $\chi$ in $\Psi_t$, $s_t \to \infty$ and $\mathrm{softmax}(U^T\bar{v}(\chi)) \to e_\phi$, where $e_\phi$ is the one-hot vector whose $\phi$-th entry equals one while the rest are zeros.

Let $x = U^T\bar{v}(\chi)$. Notice that the loss function $\langle -\log(\mathrm{softmax}_\phi(x))\rangle_{\Psi_t}$ is convex in x. As U is fixed, the loss function is also convex in $\bar{v}(\chi)$. This implies that, if the model is optimized by gradient descent, the gradient will lead $\bar{v}(\chi)$ to its optimal solution $\bar{v}^*(\chi)$. In our case, as the columns of U are linearly independent, there exists a vector n that is orthogonal to all the columns of U except $U_\phi$. Without loss of generality, assume $U_\phi^T n > 0$ (otherwise, choose its negative). Then a potential optimal solution is $\bar{v}^*(\chi) = \lambda n$ with $\lambda$ going to infinity, since $\bar{v}^*(\chi)^T U_i = 0$ for $i \neq \phi$ and $\bar{v}^*(\chi)^T U_\phi \to \infty$, which implies $\mathrm{softmax}(U^T\bar{v}^*(\chi)) = e_\phi$. Combined with the fact that there cannot be an optimal solution $\bar{v}^*(\chi)$ of finite norm such that $\mathrm{softmax}(U^T\bar{v}^*(\chi)) = e_\phi$, the gradient must lead $\bar{v}(\chi)$ to an optimal solution that is arbitrarily far away from the origin; this also applies to $v_t$, as it receives the same gradient up to a multiple according to Eq (7). As $\|v_t\|_2^2$ increases unboundedly, by Theorem 1 the score $s_t$ increases unboundedly as well. So we have $\mathrm{softmax}(U^T\bar{v}(\chi)) \approx \mathrm{softmax}(U^T v_t) \to e_\phi$, and thus the cross-entropy loss drops to zero.

D   THE DERIVATION OF EQ (15)

For a vector u, let $u^{(\phi)}$ denote its $\phi$-th entry. As we ignore the changes of the scores and the embeddings of the non-topic words, their distributions remain the same as the initial ones. In particular, the scores of the non-topic words are always zero.
So, for $\phi = 1, 2, \ldots, d$,
$$\mathrm{var}\big(\bar{v}^{(\phi)}(\chi)\big) = \mathrm{var}\left(\frac{\exp(s_t)}{Z(\chi)}\,v_t^{(\phi)} + \sum_{w\in\chi\setminus\{t\}}\frac{\exp(s_w)}{Z(\chi)}\,v_w^{(\phi)}\right) \le \frac{m\sigma^2}{d},$$
so that $\sum_{\phi=1}^{d}\mathrm{var}\big(\bar{v}^{(\phi)}(\chi)\big) \le m\sigma^2$. Since $h(\bar{v}(\chi); y)$ is assumed Lipschitz continuous in $\bar{v}(\chi)$, there exists $L \in \mathbb{R}$ such that for $\phi = 1, 2, \ldots, d$,
$$\big|h(\bar{v}(\chi_1); y)^{(\phi)} - h(\bar{v}(\chi_2); y)^{(\phi)}\big| \le L\,\|\bar{v}(\chi_1) - \bar{v}(\chi_2)\|_1,$$
where $\bar{v}(\chi_1), \bar{v}(\chi_2) \in \mathbb{R}^d$ and $\|\cdot\|_1$ denotes the $\ell_1$-distance obtained by summing the absolute values of the entry-wise differences over all dimensions. So we also have
$$\big|L^{-1}h(\bar{v}(\chi_1); y)^{(\phi)} - L^{-1}h(\bar{v}(\chi_2); y)^{(\phi)}\big| \le \|\bar{v}(\chi_1) - \bar{v}(\chi_2)\|_1.$$
According to the work of Bobkov & Houdré (1996), for $\phi = 1, 2, \ldots, d$,
$$\mathrm{var}\big(L^{-1}h(\bar{v}(\chi); y)^{(\phi)}\big) \le \sum_{\phi=1}^{d}\mathrm{var}\big(\bar{v}^{(\phi)}(\chi)\big) \le m\sigma^2, \quad\text{i.e.,}\quad \mathrm{var}\big(h(\bar{v}(\chi); y)^{(\phi)}\big) \le m\sigma^2 L^2.$$
Then, the Cauchy-Schwarz inequality implies, for $\phi = 1, 2, \ldots, d$,
$$\left|\mathrm{cov}\left(h(\bar{v}(\chi); y)^{(\phi)}\,\frac{\exp(s_t)}{Z(\chi)},\; v_t^{(\phi)} - \bar{v}^{(\phi)}(\chi)\right)\right| = \left|\mathrm{cov}\left(h(\bar{v}(\chi); y)^{(\phi)}\,\frac{\exp(s_t)}{Z(\chi)},\; \bar{v}^{(\phi)}(\chi)\right)\right| \le \left[\mathrm{var}\left(h(\bar{v}(\chi); y)^{(\phi)}\,\frac{\exp(s_t)}{Z(\chi)}\right)\right]^{1/2}\left[\mathrm{var}\big(\bar{v}^{(\phi)}(\chi)\big)\right]^{1/2} < m\sigma^2 L\,\frac{\exp(s_t)}{Z(\chi)}.$$
By the triangle inequality,
$$\left|\big\langle v_t - \bar{v}(\chi)\big\rangle_{\Psi_t}^T\left\langle h(\bar{v}(\chi); y)\,\frac{\exp(s_t)}{Z(\chi)}\right\rangle_{\Psi_t} - \left\langle\big(v_t - \bar{v}(\chi)\big)^T h(\bar{v}(\chi); y)\,\frac{\exp(s_t)}{Z(\chi)}\right\rangle_{\Psi_t}\right| \le \sum_{\phi=1}^{d}\left|\mathrm{cov}\left(h(\bar{v}(\chi); y)^{(\phi)}\,\frac{\exp(s_t)}{Z(\chi)},\; v_t^{(\phi)} - \bar{v}^{(\phi)}(\chi)\right)\right| < L\,d\,\sigma^2\,\frac{m\exp(s_t)}{Z(\chi)},$$
which is Eq (15). Note, in addition, that $Z(\chi) > \sum_{w\in\chi\setminus\{t\}}\exp(s_w) = m \cdot 1 = m$, so this bound is at most $L\,d\,\sigma^2\exp(s_t)$.