# Implicit Kernel Attention

Kyungwoo Song¹, Yohan Jung², Dongjun Kim², Il-Chul Moon*²
¹ Department of AI, University of Seoul
² Department of Industrial and Systems Engineering (ISysE), Korea Advanced Institute of Science and Technology (KAIST)
kyungwoo.song@uos.ac.kr, {becre1776,dongjoun57,icmoon}@kaist.ac.kr

*Corresponding author. Copyright 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Attention computes the dependency between representations, and it encourages the model to focus on the important selective features. Attention-based models, such as Transformer and the graph attention network (GAT), are widely utilized for sequential data and graph-structured data. This paper suggests a new interpretation and generalized structure of the attention in Transformer and GAT. For the attention in Transformer and GAT, we derive that the attention is a product of two parts: 1) the RBF kernel to measure the similarity of two instances and 2) the exponential of the L2 norm to compute the importance of individual instances. From this decomposition, we generalize the attention in three ways. First, we propose implicit kernel attention with an implicit kernel function instead of manual kernel selection. Second, we generalize the L2 norm to the Lp norm. Third, we extend our attention to structured multi-head attention. Our generalized attention shows better performance on classification, translation, and regression tasks.

## Introduction

Attention (Bahdanau, Cho, and Bengio 2014) is widely utilized to improve model performance as well as to explain model mechanisms. The attention in Transformer computes the dot-product between query and key, which are linear projections of hidden features. The attention in GAT also utilizes the dot-product between a weight vector and a concatenation of hidden representations. It is well known that the dot-product of two vectors is a product of two distinct terms: 1) the cosine of the angle between the two vectors, which computes the similarity; and 2) the individual norms of the two vectors, which measure the magnitude of each vector. This opens a question on how to analyze the attention with explicitly separated similarity and magnitude terms, and how to generalize this separation in attention.

We provide a new interpretation of attention, and this interpretation leads to a generalized attention by formalizing the attention weight as a multiplication of two terms: similarity and magnitude. We derive the explicit separation, and the derivation reveals that the attention in Transformer and GAT is a product of 1) the Radial Basis Function (RBF) kernel between instances and 2) the exponential of the L2 norm of each instance.

This paper proposes an implicit kernel function that generalizes the RBF kernel function embedded in the attention. Given that attention uses a kernel implicitly, we note that the kernel function measures the similarity under the inductive bias embedded in the kernel. Traditionally, modelers manually selected kernels with expert knowledge for a specific domain or dataset. We construct an implicit kernel function instead of an explicit kernel to find a data-adaptive kernel, which provides better task performance under the data context. We formulate the implicit kernel function by constructing a spectral density that depends on the dataset, because kernel construction can be interpreted as spectral density estimation (Reed and Simon 1975; Yaglom 1987).
This paper further develops the formulation by introducing multi-head implicit attention with a structured spectral density estimation.

The magnitude term is the second component in the decomposition of attention, and the magnitude separately measures the importance of each instance pair by an exponential of their L2 norms. As the L2 norm in the attention is rigid without considering the dataset property, we extend L2 to the Lp norm with a hyper-parameter p to control the scale of the magnitude terms and the attention weight sparsity. The p controls the growth rate of the magnitude terms, which set the scale of individual importance. By treating p as a hyper-parameter, we can impose an inductive bias to focus on the relative importance between inputs (similarity) on translation tasks, or on the individual importance of each input (magnitude) for classification tasks. Besides, we find that the decrement of p is eventually related to the sparsity of the attention weights, empirically and theoretically.

In summary, we derive the decomposition of attention as a product of similarity and magnitude. Under the decomposition, we generalize the attention with an implicit kernel function and an Lp norm. Further, we extend our work to structured multi-head attention with the implicit kernel functions. Our generalized implicit attention is exchangeable with the attention in Transformer and GAT without increasing the algorithmic complexity in order.

## Preliminary

### Transformer

The attention layers in Transformer compute the importance of each feature to consider the semantics of the given sequence. Transformer calculates the importance weight by the dot-product of $q_i = h_i W_Q$ and $k_j = h_j W_K$ with the hidden feature $h$, the query $q_i$, the key $k_j$, and linear projection matrices $W_Q$, $W_K$ (Vaswani et al. 2017). The attention weight $\alpha_{ij}$ in Eq. 1 utilizes the softmax function with a scaling $1/\sqrt{d_k}$, where $d_k$ is the dimension of $k_j$:

$$\alpha_{ij} = \frac{\exp\left(q_i^T k_j / \sqrt{d_k}\right)}{\sum_l \exp\left(q_i^T k_l / \sqrt{d_k}\right)} \quad (1)$$

### Graph Attention Network

GAT (Veličković et al. 2018) adopts the attention in Eq. 2 to aggregate the relevant neighborhood's features. GAT transforms each hidden feature, $h_i$ and $h_j$, with a linear projection matrix, $W$. GAT aggregates features by concatenating the transformed hidden features. After the concatenation, GAT computes the dot-product between a weight vector, $a$, and the aggregated features. GAT adopts LeakyReLU (Maas, Hannun, and Ng 2013), and it should be noted that LeakyReLU has a different slope for the positive region and the negative region, with slope parameter $c$. GAT utilizes the softmax function over the neighbors of a node $i$, $\mathcal{N}_i$:

$$\alpha_{ij} = \frac{\exp\left(\text{LeakyReLU}\left(a^T [W h_i \| W h_j]\right)\right)}{\sum_{l \in \mathcal{N}_i} \exp\left(\text{LeakyReLU}\left(a^T [W h_i \| W h_l]\right)\right)} \quad (2)$$

### Multi-Head Attention

Both Transformer and GAT extend the attention to Multi-Head Attention (MHA) to capture diverse features by introducing different linear projection matrices. The original MHA assumes independence between the multi-heads; this paper provides a dependence structure by using the spectral densities of the multi-heads.
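As a reference for the decomposition derived in the Methodology section, the sketch below computes the scaled dot-product attention weights of Eq. 1 for a single head in plain NumPy. It is a minimal illustration; the array shapes and names are our own and do not come from any released implementation.

```python
import numpy as np

def scaled_dot_product_attention_weights(Q, K):
    """Attention weights of Eq. 1: row-wise softmax of q_i^T k_j / sqrt(d_k)."""
    d_k = K.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)               # (T, T) scaled dot-products
    logits -= logits.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

# toy example: T = 4 positions, d_k = 8, with random projections W_Q, W_K
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))
W_Q, W_K = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
alpha = scaled_dot_product_attention_weights(H @ W_Q, H @ W_K)
print(alpha.sum(axis=-1))  # each row sums to 1
```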
### Kernel Spectral Density

Due to the dual property given by Bochner's theorem (Reed and Simon 1975), a stationary kernel, $k(x, x')$, is selected uniquely by a corresponding spectral density, $p(w)$, and vice versa, as in Eq. 3. The hypothesis class of the spectral density function determines the adaptiveness of a kernel, and an explicit specification of the density function class could restrict this adaptiveness. Hence, we provide an implicit density function class through an implicit generative model. For instance, the generator in a Generative Adversarial Network (GAN) (Goodfellow et al. 2014) is able to represent a spectral density flexibly (Li et al. 2019). Generally, Eq. 3 with a spectral density, $p(w)$, is intractable, so we approximate Eq. 3 by Monte Carlo (MC) integration with $R$ sampled spectral points, $w_r$, as shown in Eq. 4 (Ton et al. 2018).

$$k(x, x') = \int_{\mathbb{R}^d} \exp\left(i w^T (x - x')\right) p(w) \, dw \quad (3)$$

$$k(x, x') \approx \frac{1}{R} \sum_{r=1}^{R} \exp\left(i w_r^T (x - x')\right) \quad (4)$$

### Kernel in Attention

Tsai et al. (2019) utilize a kernel to formulate the attention weights. They replace the $\exp(q_i^T k_j / \sqrt{d_k})$ in scaled dot-product attention with a manually selected kernel, such as the RBF kernel, without an explicit decomposition. Because they do not derive the decomposition explicitly, they only utilize the RBF kernel as a similarity term without a magnitude term. We name this variant RBF-only for the comparison. Concurrent works (Katharopoulos et al. 2020; Choromanski et al. 2020b,a) focus on efficiency and propose linear-time-complexity attention based on the relationship between kernels and attention. This paper focuses on a new interpretation and generalization of attention to improve the capacity of attention in Transformer and GAT. We emphasize that the improved capacity of attention with linear time complexity can be reached by combining our approach with these concurrent works.

### Copula

The original MHA in Transformer and GAT lacks an explicit model to handle the dependency between multiple attentions. Therefore, we formulate a copula-augmented spectral density estimation to consider the dependency between the spectral densities of the heads. A copula is a cumulative distribution function (CDF) defined on the unit cube with uniform marginals (Nelsen 2007). Formally, a copula, $C$, is defined as $C(u_1, \dots, u_d) = P(U_1 \le u_1, \dots, U_d \le u_d)$, where the marginal distribution of each $U_i$ is uniform on $[0, 1]$. By Sklar's theorem (Sklar 1959), we can represent a joint cumulative distribution of $x_1, \dots, x_d$ with the marginal distribution of each random variable and their copula, $F(x_1, \dots, x_d) = C[F_1(x_1), \dots, F_d(x_d)]$. We impose the dependencies between attentions in MHA by estimating each head's spectral density jointly with a copula-augmented inference.

## Methodology

### Decomposition of Transformer and GAT

This section derives the decomposition of attention in Transformer and GAT as the product of two distinct terms: 1) a similarity term and 2) a magnitude term. The similarity measures the relative importance, and the magnitude computes the individual importance. We derive the decomposition for the scaled dot-product attention in Proposition 1.

**Proposition 1.** Let $\alpha_{ij} = \frac{1}{Z_1(\alpha)} \exp\left(q_i^T k_j / \sqrt{d_k}\right)$ where $Z_1(\alpha)$ is a normalizing constant. Then $\alpha_{ij}$ has the form:

$$\frac{1}{Z_1(\alpha)} \underbrace{\exp\left(-\frac{\|q_i - k_j\|_2^2}{2\sqrt{d_k}}\right)}_{\text{similarity}} \underbrace{\exp\left(\frac{\|q_i\|_{p=2}^2 + \|k_j\|_{p=2}^2}{2\sqrt{d_k}}\right)}_{\text{magnitude}}$$

The scaled dot-product attention has the similarity term, $f_{RBF} = \exp\left(-\frac{\|q_i - k_j\|_2^2}{2\sqrt{d_k}}\right)$, which is the RBF kernel with a fixed length-scale hyper-parameter, $\sqrt[4]{d_k}$. Additionally, the scaled dot-product attention yields the magnitude term, $\exp\left(\frac{\|q_i\|_{p=2}^2 + \|k_j\|_{p=2}^2}{2\sqrt{d_k}}\right)$, and it measures the individual importance of each instance as $\|q_i\|_{p=2}^2$ and $\|k_j\|_{p=2}^2$ with the L2 norm. A large similarity term and a large magnitude term induce a large attention weight for the given pair of query and key.
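Proposition 1 rests on the elementary identity $q_i^T k_j = -\tfrac{1}{2}\|q_i - k_j\|_2^2 + \tfrac{1}{2}\left(\|q_i\|_2^2 + \|k_j\|_2^2\right)$, applied inside the scaled exponential. The following NumPy check of the factorization is our own illustration; the variable names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
d_k = 16
q, k = rng.normal(size=d_k), rng.normal(size=d_k)

# un-normalized scaled dot-product weight, exp(q^T k / sqrt(d_k))
dot_form = np.exp(q @ k / np.sqrt(d_k))

# Proposition 1: RBF similarity (lengthscale d_k^(1/4)) times exponential-of-L2-norm magnitude
similarity = np.exp(-np.sum((q - k) ** 2) / (2 * np.sqrt(d_k)))
magnitude = np.exp((np.sum(q ** 2) + np.sum(k ** 2)) / (2 * np.sqrt(d_k)))

assert np.isclose(dot_form, similarity * magnitude)
```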
For GAT, we formulate the decomposition for the attention in Eq. 2. The decomposition in Proposition 2 is slightly different from the decomposition for Transformer. For an attention weight between node $i$ and node $j$, GAT computes two similarity terms: 1) the similarity between $q_i$ and $ca$, and 2) the similarity between $ca$ and $k_j$. From the decomposition, we can interpret the weight vector, $a$, as the medium that connects $q_i$ and $k_j$.

**Proposition 2.** Define $\alpha_{ij} = \frac{1}{Z_2(\alpha)} \exp\left(\text{LeakyReLU}\left(a^T [W h_i \| W h_j]\right)\right)$, $q_i = [W h_i \| 0]$, and $k_j = [0 \| W h_j]$, where $Z_2(\alpha)$ is a normalizing constant and $c$ is the slope parameter in LeakyReLU. Then $\alpha_{ij}$ has the form:

$$\frac{1}{Z_2(\alpha)} \underbrace{\exp\left(-\frac{\|q_i - ca\|_2^2}{2}\right) \exp\left(-\frac{\|ca - k_j\|_2^2}{2}\right)}_{\text{similarity}} \underbrace{\exp\left(\frac{2\|ca\|_{p=2}^2 + \|q_i\|_{p=2}^2 + \|k_j\|_{p=2}^2}{2}\right)}_{\text{magnitude}}$$

### Variations of Implicit Kernel Attention

From the factorization, we propose a new attention method that formulates an implicit kernel function and utilizes an Lp norm. Attention in Transformer and GAT uses a fixed single Gaussian spectral density, $p(w)$, and its corresponding RBF kernel (Figure 1a). In contrast, we propose implicit kernel attention (IKA) that estimates the spectral density implicitly to find an appropriate kernel depending on the dataset (Figure 1b). Second, we interpret $p$ in the Lp norm as a hyper-parameter, so we define IKA with norm (IKAN). Besides, we propose a deterministic and simple but effective model, IKAN-direct, which optimizes the spectral points $w$ directly (Figure 1c). Lastly, we propose multi-head IKAN (MIKAN), which adopts a copula-augmented inference to estimate a structured spectral density of MHA jointly (Figure 1d). Figure 1e represents the structure overview of our models.

### IKA: Implicit Kernel Attention

RBF measures the similarity by the Euclidean distance. Recent works (Li and Dunson 2019; Chen et al. 2018) conjecture that the Euclidean metric might not be proper in the hidden feature space. Furthermore, RBF includes an exponential term, and the range of the exponential is a positive real value excluding zero. Therefore, sparsity cannot be achieved in the attention weights because of this positive nature. Following (Reed and Simon 1975; Yaglom 1987), an implicit spectral density allows us to construct an implicit kernel, so this paper proposes learning an implicit spectral density from the dataset. To formulate an implicit spectral density, we construct a base distribution, $p(z)$, such as a Gaussian distribution, and we sample $z$ from $p(z)$. Then, we transform the sampled $z$ with a flexible function such as a neural network to construct the spectral density, $p(w)$. It should be noted that the spectral density should be symmetric for a real-valued kernel, which we enforce with the constraint in Eq. 8.

Under the implicit spectral density, the log marginal likelihood, $\log p(y|h)$, is intractable, where $y$ is an output variable and $h$ is a hidden feature. Therefore, we derive the evidence lower bound (ELBO) of the log marginal likelihood with an inference network, $q(z|h)$. We maximize the ELBO in Eq. 5 with the re-parameterization trick to backpropagate the gradients (Kingma and Welling 2014).

$$\mathcal{L} = \mathbb{E}_{q(z|h)}\left[\log p(y, z|h)\right] - \mathbb{E}_{q(z|h)}\left[\log q(z|h)\right] \quad (5)$$

For MHA in Transformer or GAT, we independently sample $z$ from $q(z|h)$ for each head by formulating $q(z|h) = \prod_{m=1}^{M} q(z^{(m)}|h^{(m)})$. We construct each inference network, $q(z^{(m)}|h^{(m)})$, with a neural network parameterized by $\psi_1$ as in Eq. 6. We formulate $z^{(m)}$ as a concatenation of $\tilde{z}^{(m)}$ and $-\tilde{z}^{(m)}$ to preserve the symmetry of the base distribution.

$$\mu^{(m)}, \log \sigma^{(m)} = \text{NN}_{\psi_1}\left(h^{(m)}\right) \quad (6)$$

$$\tilde{z}^{(m)} \sim N\left(\mu^{(m)}, (\sigma^{(m)})^2 I\right), \quad z^{(m)} = \left[\tilde{z}^{(m)} \,\|\, -\tilde{z}^{(m)}\right] \quad (7)$$

A symmetric base distribution enables us to construct an unconstrained neural network $\text{NN}_{\psi_2}$ to transform $z$ into $w$. After sampling $z^{(m)}$, we transform it with $\text{NN}_{\psi_2}$ to construct the spectral density in Eq. 8. To preserve the symmetry of the spectral density, we use the absolute value $|z^{(m)}|$ as the input, and we multiply by the sign of $z^{(m)}$ (Li et al. 2019).

$$w^{(m)} = \text{sign}\left(z^{(m)}\right) \odot \text{NN}_{\psi_2}\left(|z^{(m)}|\right) \quad (8)$$
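The construction in Eq. 6-8 can be sketched with plain NumPy, using fixed random matrices as stand-ins for the neural networks $\text{NN}_{\psi_1}$ and $\text{NN}_{\psi_2}$; all shapes and parameter values here are illustrative assumptions, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(2)
d_h, d_z, R = 32, 8, 64        # hidden dim, base dim, number of spectral samples
d_w = 2 * d_z                  # spectral dim after the symmetric concatenation

# placeholder linear maps standing in for NN_psi1 (-> mu, log_sigma) and NN_psi2 (|z| -> w)
W_mu = rng.normal(size=(d_h, d_z)) * 0.1
W_logsig = rng.normal(size=(d_h, d_z)) * 0.1
W_spec = rng.normal(size=(d_w, d_w)) * 0.1

def sample_spectral_points(h, n_samples=R):
    """Sketch of Eq. 6-8: reparameterized z, symmetric concatenation, sign-preserving map."""
    mu, log_sigma = h @ W_mu, h @ W_logsig                # Eq. 6
    eps = rng.normal(size=(n_samples, d_z))
    z_tilde = mu + np.exp(log_sigma) * eps                # re-parameterization trick
    z = np.concatenate([z_tilde, -z_tilde], axis=-1)      # Eq. 7: symmetric base sample
    w = np.sign(z) * (np.abs(z) @ W_spec)                 # Eq. 8: w = sign(z) * NN(|z|)
    return w                                              # (n_samples, d_w) spectral points

h = rng.normal(size=d_h)       # one hidden feature, for illustration
print(sample_spectral_points(h).shape)
```

Because the transform is odd in $z$ and $z$ is drawn symmetrically, the resulting spectral points have a symmetric density, which keeps the induced kernel real-valued.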
Eq. 9 and Eq. 10 denote the formulation of the un-normalized attention weights, $\tilde{\alpha}^{(m)}_{ij}$, with the implicit kernel function $f$ for Transformer and GAT, respectively. From the un-normalized attention weights, $\tilde{\alpha}^{(m)}_{ij}$, we compute the attention weights, $\alpha^{(m)}_{ij}$, with the normalization in Eq. 11. The implicit kernel function, $f$, is the similarity term of the attention in Propositions 1 and 2. The rest of the structure is the same as the original Transformer and GAT, with a neural network $\text{NN}_{\psi_3}$ in Eq. 11.

$$\tilde{\alpha}^{(m)}_{ij} = f\left(q^{(m)}_i, k^{(m)}_j, w^{(m)}\right) \exp\left(\frac{\|q^{(m)}_i\|_2^2 + \|k^{(m)}_j\|_2^2}{2\sqrt{d_k}}\right) \quad (9)$$

$$\tilde{\alpha}^{(m)}_{ij} = f\left(q^{(m)}_i, ca^{(m)}, w^{(m)}\right) f\left(ca^{(m)}, k^{(m)}_j, w^{(m)}\right) \exp\left(\frac{2\|ca^{(m)}\|_2^2 + \|q^{(m)}_i\|_2^2 + \|k^{(m)}_j\|_2^2}{2}\right) \quad (10)$$

$$\hat{y} = \text{NN}_{\psi_3}\left(\alpha^{(1:M)}, h^{(1:M)}\right), \quad \alpha^{(m)}_{ij} = \frac{\tilde{\alpha}^{(m)}_{ij}}{\sum_l \tilde{\alpha}^{(m)}_{il}} \quad (11)$$

Figure 1 (panels a-e): Visualization of the attention and our models. To capture the spectral density flexibly, we first sample $z$ from the base distribution $p(z)$ and transform it into $w$ with a flexible function such as a neural network. IKAS, IKANS, and IKAN in (b) estimate the base distribution and the spectral density of each head depending on the dataset, unlike the scaled dot-product attention in (a). IKAN-direct in (c) optimizes the spectral points $w$ directly, and MIKAN in (d) estimates the base distribution $p(z^{(1:M)})$ jointly. Panel (e) provides a structure overview of our models.

### IKAS: Implicit Kernel Attention with Stationary Kernel

This section explains the construction of the kernel function $f$; we omit the multi-head notation, $m$, for simplicity. We approximate a continuous stationary kernel with MC integration over $R$ sampled spectral points, $w_r$, and its random Fourier feature map, $\phi_r$, by taking the real part of Eq. 4 (Ton et al. 2018; Rahimi and Recht 2008; Jung, Song, and Park 2020).

$$f(q_i, k_j, w) = \frac{1}{R} \sum_{r=1}^{R} \phi_r(q_i)^T \phi_r(k_j) \quad (12)$$

$$\phi_r(q_i) = \begin{bmatrix} \cos(w_r^T q_i) \\ \sin(w_r^T q_i) \end{bmatrix}, \quad \phi_r(k_j) = \begin{bmatrix} \cos(w_r^T k_j) \\ \sin(w_r^T k_j) \end{bmatrix} \quad (13)$$

We generalize the attention by replacing the RBF in the scaled dot-product attention with an implicit kernel function, $f$, as shown in Eq. 9. We name this attention IKAS. Similarly, we can construct the implicit kernel, $f$, for GAT in Eq. 10 by following Eq. 12 and 13. In spite of the above kernel construction, the output of the kernel function in Eq. 12 might have a negative value, which contradicts the non-negativity of attention weights. As a result, we square Eq. 12, i.e., we use $f^2$, to ensure the non-negativity of IKA. Proposition 3 claims that the squared implicit kernel function still generalizes the RBF kernel that is utilized in the scaled dot-product attention.

**Proposition 3.** Let $k^{(l)}_{RBF}$ be the RBF kernel function with a lengthscale $l$, and let $f$ be its approximation with $R$ random Fourier features by Eq. 12 and 13. If we sample the spectral points, $w$, from a Gaussian distribution $N\left(0, \frac{1}{2l^2} I\right)$, then $\lim_{R \to \infty} f^2 = k^{(l)}_{RBF}$.
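To make Eq. 12-13 and Proposition 3 concrete, the sketch below builds the random Fourier estimate $f$ from Gaussian spectral points and checks numerically that $f^2$ approaches the RBF kernel with lengthscale $l$ as $R$ grows. This is our own illustrative check, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(3)
d, R, l = 8, 200_000, 1.5

def rff_kernel(x, y, w):
    """Eq. 12-13: (1/R) sum_r phi_r(x)^T phi_r(y) with phi_r = [cos(w_r^T .), sin(w_r^T .)]."""
    return np.mean(np.cos(w @ x) * np.cos(w @ y) + np.sin(w @ x) * np.sin(w @ y))

x, y = rng.normal(size=d), rng.normal(size=d)
w = rng.normal(scale=np.sqrt(1.0 / (2 * l ** 2)), size=(R, d))  # N(0, I / (2 l^2)), as in Prop. 3

f = rff_kernel(x, y, w)
rbf = np.exp(-np.sum((x - y) ** 2) / (2 * l ** 2))              # target RBF with lengthscale l
print(f ** 2, rbf)   # f^2 approaches the RBF value as R grows
```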
### IKANS: Implicit Kernel Attention with Non-Stationary Kernel

Similar to the stationary kernel, we can approximate a continuous non-stationary kernel with the random Fourier feature map $\phi_r$ in Eq. 14-16 (Ton et al. 2018; Yaglom 1987). We formulate a new attention, IKANS, induced by a non-stationary kernel function in Eq. 14-16. Different from the stationary kernel, we sample two types of $R$ spectral points, $w_{1,r}$ and $w_{2,r}$, from two distinct implicit spectral density estimators. Similarly, we construct an implicit kernel, $f$, for GAT in Eq. 10 by following Eq. 14-16.

$$f(q_i, k_j, w) = \frac{1}{R} \sum_{r=1}^{R} \phi_r(q_i)^T \phi_r(k_j) \quad (14)$$

$$\phi_r(q_i) = \begin{bmatrix} \cos(w_{1,r}^T q_i) + \cos(w_{2,r}^T q_i) \\ \sin(w_{1,r}^T q_i) + \sin(w_{2,r}^T q_i) \end{bmatrix} \quad (15)$$

$$\phi_r(k_j) = \begin{bmatrix} \cos(w_{1,r}^T k_j) + \cos(w_{2,r}^T k_j) \\ \sin(w_{1,r}^T k_j) + \sin(w_{2,r}^T k_j) \end{bmatrix} \quad (16)$$

The error bound for a non-stationary kernel approximation is not well explored. To validate IKANS, we show in Proposition 4 that sufficiently many samples from the spectral density guarantee the non-stationary kernel approximation.

**Proposition 4.** Let $k(x, x')$ be a non-stationary kernel, $f(x, x')$ be the approximated kernel with $R$ sampled spectral points, $w_1$ and $w_2$, and $\mathcal{M}$ be a compact subset of $\mathbb{R}^d$ with diameter $D$. Let $\sigma_1^2 = \mathbb{E}[w_1^T w_1]$, $\sigma_2^2 = \mathbb{E}[w_2^T w_2]$, and $E = \sup_{x, x' \in \mathcal{M}} |f(x, x') - k(x, x')|$. Then,

i) $P[E \ge \epsilon] \le 2^8 \left(\frac{D \sqrt{\sigma_1^2 + \sigma_2^2}}{\epsilon}\right)^2 \exp\left(-\frac{R \epsilon^2}{2(d+1)}\right)$

ii) $E \le \epsilon$ with any constant probability when $R = \Omega\left(\frac{d}{\epsilon^2} \log \frac{D \sqrt{\sigma_1^2 + \sigma_2^2}}{\epsilon}\right)$

Similar to IKAS, we use $f^2$ instead of $f$ to ensure the non-negativity of the attention weights.

### IKAN: Implicit Kernel Attention with Lp

**IKAN** In addition to the implicit kernel in IKANS, we generalize the L2 norm in the magnitude term of Propositions 1 and 2 to the Lp norm, which determines the importance of individual representations. The optimal $p$ might be different for each dataset and task, so we select $p$ through experiments. There are tasks where the similarity is more important than the magnitude and vice versa. Treating $p$ as a hyper-parameter that determines the scale of the magnitude term imposes an inductive bias on the model. Furthermore, we analyze the relationship between the magnitude and the sparsity of the attention weights. For the analysis, we define attention weights, $\alpha_{ij}(p)$, whose magnitude utilizes $\|x\|_p = (|x_1|^p + \dots + |x_d|^p)^{1/p}$ instead of $\|x\|_2$. It should be noted that $\|x\|_p$ is an absolutely homogeneous function for $p > 0$. We show in Proposition 5 that $\alpha_{ij}(p)$ becomes sparse as $p$ goes to zero.

**Proposition 5.** Let $\mathbf{1}_{A^{(i)}_p}$ be the multi-dimensional indicator function whose $n$-th component is 1 if $n \in A^{(i)}_p$ and 0 otherwise, where $A^{(i)}_p$ is the set of indices of the maximal elements: $A^{(i)}_p = \{l \,|\, l = \text{argmax}_j\, \alpha_{ij}(p)\}$. Then, $\lim_{p \to 0^+} \alpha_{ij}(p) = \mathbf{1}_{A^{(i)}_p} / |A^{(i)}_p|$.
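The sparsity effect of Proposition 5 is easy to observe numerically. The sketch below builds the decomposed attention weights with the magnitude exponent computed from the Lp norm (keeping the squared norm in the exponent is our assumption for illustration) and shows that, as $p$ decreases, each row's mass concentrates on its maximal entries. The toy inputs are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
T, d_k = 6, 8
Q, K = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k))

def attention_with_lp_magnitude(Q, K, p):
    """Decomposed log-weights: RBF similarity plus Lp-norm magnitude, normalized row-wise."""
    scale = 2 * np.sqrt(d_k)
    log_sim = -np.sum((Q[:, None, :] - K[None, :, :]) ** 2, axis=-1) / scale
    lp = lambda X: np.sum(np.abs(X) ** p, axis=-1) ** (1.0 / p)     # ||x||_p
    log_mag = (lp(Q)[:, None] ** 2 + lp(K)[None, :] ** 2) / scale   # magnitude exponent
    logits = log_sim + log_mag
    logits -= logits.max(axis=-1, keepdims=True)                    # stable normalization
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

for p in (2.0, 1.0, 0.5, 0.2):
    alpha = attention_with_lp_magnitude(Q, K, p)
    print(f"p={p}: max weight per row = {alpha.max(axis=-1).round(2)}")
# as p decreases, the rows approach 0/1 indicator weights (Proposition 5)
```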
**IKAN-direct** We propose a simple and effective alternative that treats the spectral points, $w$, as learnable parameters, like the weights of a neural network. IKAN-direct optimizes the spectral points, $w$, directly without sampling. This corresponds to treating the spectral density as a mixture of delta distributions.

### MIKAN: Multi-Head Attention with IKAN

Dependency modeling between the heads in MHA is necessary to diversify the attention weights. Similar to the individual calculation of the attention in Transformer and GAT, the variations of IKA estimate the spectral density of each head independently. Therefore, we introduce $q(z|h) = c_q\left(F_1(z^{(1)}), \dots, F_M(z^{(M)})\right) \prod_{m=1}^{M} q(z^{(m)}|h^{(m)})$, where $c_q$ is a copula density, $F_m$ is the CDF of each $z^{(m)}$, and $M$ is the number of attention heads. This alternative variational distribution of $z$, $q(z|h)$, develops IKA into structured Multi-head IKAN (MIKAN) by introducing a joint structure through $c_q$. MIKAN maximizes the ELBO in Eq. 5 with a Monte Carlo estimation by sampling from $q(z|h)$ (Kingma and Welling 2014). MIKAN estimates the base distribution without the mean-field assumption by introducing a copula-augmented posterior. IKAN is a special case of MIKAN if we fix $c_q$ as a uniform distribution (Tran, Blei, and Airoldi 2015). $c_q$ can be any copula density, and we set $c_q$ as a Gaussian copula with a covariance, $\Sigma$.
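A Gaussian copula couples the heads while leaving each head's marginal untouched. The sketch below draws copula-coupled base samples for $M$ heads, with one scalar base variable per head and random placeholder parameters; it illustrates the sampling route (correlated Gaussian, CDF to uniforms, inverse marginal CDF), not the authors' inference code.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(5)
M = 4                                      # number of attention heads

# per-head Gaussian marginal parameters (placeholders for the outputs of Eq. 6)
mu = rng.normal(size=M)
sigma = np.exp(0.1 * rng.normal(size=M))

# correlation matrix of the Gaussian copula (the learnable Sigma in MIKAN, here random)
A = rng.normal(size=(M, M))
cov = A @ A.T
corr = cov / np.sqrt(np.outer(np.diag(cov), np.diag(cov)))

def sample_structured_heads(n_samples=2000):
    """Gaussian-copula-coupled base samples: dependent across heads, Gaussian marginals."""
    g = multivariate_normal(mean=np.zeros(M), cov=corr).rvs(size=n_samples, random_state=7)
    u = norm.cdf(g)                        # uniform marginals on [0, 1]: the copula sample
    return mu + sigma * norm.ppf(u)        # push through each head's Gaussian marginal

z = sample_structured_heads()              # shape (n_samples, M)
print(np.corrcoef(z.T).round(2))           # cross-head dependence follows `corr`
```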
### Model Complexity

Table 1 denotes the time complexity of computing the attention weights, $\alpha_{ij}$, from a hidden representation, $h$, where $T$ is the length of the sequence or the number of nodes. If we consider a task with a long sequence, i.e., a large $T$, there is no asymptotic complexity increment in order, so all variations of IKA and the baselines fall into the complexity of $O(T^2)$. Also, it should be noted that the $O(T^2 R)$ term is controllable because $R$ is the number of samples determined by the modeler.

| Model | Complexity |
|---|---|
| Transformer | O(T²d + Td²) |
| RBF-only | O(T²d + Td²) |
| Expsin | O(T²d + Td²) |
| Linear | O(T²d + Td²) |
| IKAS | O(T²R + Td² + TdR) |
| IKANS | O(T²R + Td² + TdR) |
| IKAN | O(T²R + Td² + TdR) |
| IKAN-direct | O(T²R + Td² + TdR) |
| MIKAN | O(T²R + Td² + TdR + dM³) |

Table 1: Time complexity for computing attention weights, where T, d, R, and M denote the length of the sequence, the hidden dimension, the number of samples, and the number of attention heads, respectively. The first four models are the baselines.

## Results

### Sentence Classification

We compare our models with the scaled dot-product attention in Transformer (Vaswani et al. 2017) and RBF-only (Tsai et al. 2019) on five popular datasets (Kim 2014; Wang et al. 2020). We perform ten-fold cross-validation by following the experimental settings of (Wang et al. 2020). RBF-only represents the scaled dot-product attention with an RBF kernel and without a magnitude term. To validate the performance of different kernels, we provide two additional models, Expsin and Linear. Expsin and Linear denote the scaled dot-product attention that uses a periodic kernel and a linear kernel, respectively, instead of the RBF kernel. For the magnitude term, Expsin and Linear utilize the L2 norm, the same as the scaled dot-product attention.

Table 2 indicates that the appropriate kernel is different for each dataset. The RBF kernel in Transformer performs better than the other baselines on MR and SUBJ. On the other hand, Expsin shows better performance on CR and SST, and Linear is superior to RBF on TREC. However, IKAS and IKANS show consistently better performance than RBF-only, Expsin, and Linear. This consistent tendency empirically demonstrates the importance of a data-adaptive implicit kernel instead of a manual kernel selection. Compared to IKANS, IKAN shows better performance, and this supports the importance of adapting the magnitude with p as a hyper-parameter. IKAN-direct, which optimizes w directly without sampling, shows lower performance than IKAN, but it still performs better than the scaled dot-product attention in Transformer. MIKAN with copula augmentation performs best except on the SUBJ dataset.

The second column of Table 2 represents the number of parameters of each model on the MR dataset. Expsin and Linear have the same parameters as the scaled dot-product attention in Transformer. For the implicit kernel-based models, we adopt an implicit spectral density estimator, and it requires additional parameters. Because we utilize a shared spectral density estimator for each layer, the implicit kernel-based models require only a marginal increment in the number of parameters.

| Model | # Param. | MR | CR | SST | SUBJ | TREC |
|---|---|---|---|---|---|---|
| Transformer* | - | 73.1% | 76.2% | 76.1% | 86.3% | 83.4% |
| Transformer | 27.9M | 74.45±0.94% | 77.98±3.16% | 76.25±1.60% | 90.32±0.68% | 84.12±0.77% |
| RBF-only | 27.9M | 73.98±0.48% | 78.25±3.02% | 76.00±1.82% | 89.82±1.16% | 84.24±0.90% |
| Expsin | 27.9M | 73.81±0.98% | 78.78±3.26% | 76.99±1.21% | 89.10±0.28% | 84.76±0.86% |
| Linear | 27.9M | 73.70±0.49% | 78.57±2.86% | 76.67±1.12% | 89.22±0.85% | 84.76±1.41% |
| IKAS | 28.1M | 76.06±0.48% | 78.83±3.26% | 77.37±1.68% | 90.66±0.87% | 84.88±1.29% |
| IKANS | 28.2M | 76.42±0.82% | 79.05±2.26% | 77.49±1.84% | 90.86±0.60% | 85.32±0.94% |
| IKAN | 28.2M | 76.63±0.89% | 79.21±3.05% | 77.49±1.84% | 91.48±0.61% | 85.84±0.43% |
| IKAN-direct | 28.3M | 75.97±0.54% | 79.26±3.05% | 77.10±1.68% | 90.98±0.53% | 85.16±0.38% |
| MIKAN | 28.2M | 76.70±1.03% | 80.63±3.24% | 77.69±1.85% | 90.86±0.61% | 85.92±0.44% |

Table 2: Accuracy for the text classification task. * denotes the reported performance in (Wang et al. 2020). # Param. denotes the number of parameters for the MR dataset.

Figures 2a and 2b show the attention weights for a sentence in MR. In Figure 2a, our implicit kernel-based attention captures the important words "kiss" and "waste". Besides, MIKAN focuses on the additional important word, "just". Figure 2b represents the attention weights of the first four heads among the eight heads in Transformer. Each row and column of the square matrix corresponds to an individual word in the given sentence. All of the first four heads in IKAN focus on the same last word, "waste". The first, second, and fourth heads in IKAN also focus on the word "kiss". On the other hand, MIKAN, which adopts the structured MHA, shows relatively diverse attention weights across the multi-heads.

Figure 2: (a) Attention weights for the given sentence on MR, whose task is the sentiment classification of a movie review. The sentiment of the given sentence, "just a kiss is a just a waste", is negative. To identify the sentiment, the model should focus on "just", "kiss", "just", and "waste", and MIKAN captures all important words. (b) Attention weight matrices of the first four heads among eight heads for IKAN and MIKAN. MIKAN shows relatively diverse attention weights across the multi-heads because of the structured model with the copula.

### Translation

We compare our models and the baselines on IWSLT14 De-En. We include MIKAN and IKAN-direct because MIKAN is a generalized model of IKA and IKAN. Table 3 shows the results of three runs for Transformer, IKAN-direct, and MIKAN. Our models perform better than the other models: Beam Search Optimization (Wiseman and Rush 2016), Actor-Critic (Bahdanau et al. 2017), Neural PBMT + LM (Jang, Gu, and Poole 2017), Minimum Risk Training (Edunov et al. 2018), Variational Attention (Deng et al. 2018), and Transformer.

| Model | BLEU |
|---|---|
| Beam Search Optimization* | 26.36 |
| Actor-Critic* | 28.53 |
| Neural PBMT + LM* | 30.08 |
| Minimum Risk Training* | 32.84 |
| Variational Attention* | 33.69 |
| Transformer | 34.44±0.07 |
| IKAN-direct | 34.59±0.09 |
| MIKAN | 34.70±0.09 |

Table 3: BLEU for IWSLT14 De-En. * denotes the reported performance in (Deng et al. 2018).

We analyze the changes of the similarity and magnitude terms in Figure 3. Figures 3a and 3b show the changes in the encoder self-attention layer (self.), and Figures 3c and 3d represent the changes in the decoder-encoder cross-attention layer (cross.). We visualize the changes of the maximum similarity and the average magnitude from the first epoch (start) to the last epoch (end). Figures 3b and 3d show that Transformer depends heavily on the magnitude term instead of the similarity. We see the same phenomenon in the cross-attention layer, where we hypothesize that the similarity is important for the alignment given the pair-wise dependency. This phenomenon occurs at the start of training, and it gets worse during the training. On the other hand, IKAN and MIKAN exhibit stable scales of similarity and magnitude because of the implicit kernel and the Lp norm.

Figure 3 (panels a-d): Similarity and magnitude of the encoder self-attention layer (self.) and the decoder-encoder attention layer (cross.): (a) similarity in self., (b) magnitude in self., (c) similarity in cross., (d) magnitude in cross.
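The curves in Figure 3 track two scalar summaries of the decomposition per attention layer. A sketch of how such summaries can be computed from projected queries and keys follows; it is our own illustrative diagnostic, not the paper's logging code, and the inputs are random placeholders.

```python
import numpy as np

def similarity_magnitude_summary(Q, K):
    """Maximum RBF similarity and average magnitude term from the Proposition 1 decomposition."""
    d_k = K.shape[-1]
    scale = 2 * np.sqrt(d_k)
    sq_dist = np.sum((Q[:, None, :] - K[None, :, :]) ** 2, axis=-1)
    similarity = np.exp(-sq_dist / scale)
    magnitude = np.exp((np.sum(Q ** 2, -1)[:, None] + np.sum(K ** 2, -1)[None, :]) / scale)
    return similarity.max(), magnitude.mean()

rng = np.random.default_rng(6)
Q, K = rng.normal(size=(10, 16)), rng.normal(size=(10, 16))
print(similarity_magnitude_summary(Q, K))  # Figure 3 compares such summaries at the first and last epochs
```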
### Regression

We apply Transformer to the UCI regression datasets and replace the scaled dot-product attention with our models. We perform ten-fold cross-validation, and Table 4 reports the performance following (Tompkins et al. 2019). Our models perform better than the baselines.

| Model | CO2 | Passenger | Housing | Concrete | Parkinsons |
|---|---|---|---|---|---|
| GP (RBF) | 0.060±0.003 | 0.174±0.052 | 0.234±0.045 | 0.178±0.012 | 0.075±0.006 |
| GP (SM) | 0.057±0.002 | 0.171±0.048 | 0.191±0.044 | 0.170±0.021 | 0.072±0.004 |
| Transformer | 0.057±0.002 | 0.153±0.043 | 0.126±0.038 | 0.120±0.017 | 0.040±0.005 |
| IKAN-direct | 0.055±0.002 | 0.110±0.039 | 0.118±0.038 | 0.111±0.018 | 0.018±0.004 |
| MIKAN | 0.055±0.002 | 0.142±0.044 | 0.108±0.029 | 0.108±0.015 | 0.037±0.004 |

Table 4: RMSE on the UCI regression datasets.

Afterward, we perform a qualitative analysis to clarify the relationship between p and the sparsity of the attention weights. Figures 4a and 4b visualize histograms of the MIKAN attention weights for different values of p on the Housing dataset. As p goes to zero, most attention weights become either zero or one. Figure 4c shows the sensitivity analysis with respect to p. The RMSE varies depending on p, but all results show better performance than the baselines.

Figure 4: (a), (b) Histograms of the attention weights in MIKAN for p = 2.0 and p = 0.1. (c) The RMSE of MIKAN varies depending on p.

### Node Classification

We compare GAT and our generalized attention from Proposition 2. We perform ten-fold cross-validation, and Table 5 reports the average accuracy for DeepWalk (Perozzi, Al-Rfou, and Skiena 2014), ICA (Getoor and Lu 2003), Chebyshev (Defferrard, Bresson, and Vandergheynst 2016), GCN (Kipf and Welling 2017), and GAT (Veličković et al. 2018).

| Model | Cora | Citeseer | Pubmed |
|---|---|---|---|
| DeepWalk | 67.2% | 43.2% | 65.3% |
| ICA | 75.1% | 69.1% | 73.9% |
| Chebyshev | 81.2% | 69.8% | 74.4% |
| GCN | 81.5% | 70.3% | 79.0% |
| GAT | 83.0±0.7% | 72.5±0.7% | 79.0±0.3% |
| IKAN-direct | 83.3±0.7% | 72.8±1.0% | 79.2±0.4% |
| MIKAN | 83.4±0.6% | 72.9±0.8% | 79.1±0.6% |

Table 5: Accuracy for the node classification task. The baseline results are the reported performance in (Veličković et al. 2018).

Figures 5a and 5b show the spectral density of GAT and MIKAN, respectively, visualized with t-SNE (Maaten and Hinton 2008). The attention in GAT depends on the RBF kernel, and its spectral density is fixed as a Gaussian distribution. However, MIKAN estimates an appropriate spectral density and its corresponding kernel depending on the dataset. Figure 5c shows the covariance, Σ, of q_c in MIKAN, and Figure 5d represents the standard deviation of the attention weights across the heads. A large standard deviation denotes that each head has different attention weights. The results support that MIKAN has relatively diverse attention weights across the heads by imposing the dependency between heads with the copula.

Figure 5: (a), (b) Sampled spectral points for GAT (w) and MIKAN (w₁). (c) Learned full covariance Σ of q_c in MIKAN. (d) Standard deviation of the attention weights α^(m)_ij across the heads.

## Conclusion

This work provides a new interpretation of the attention as a product of a similarity with the RBF kernel and a magnitude with the exponential of the L2 norm.
We analyze the properties of the kernel and the norm in the attention, theoretically and empirically. From the derivation and analysis, we generalize the attention with an implicit kernel function and an Lp norm. Furthermore, we propose a copula-augmented spectral density estimation for the dependence modeling in MHA to capture diverse contexts. Our generalized attention can substitute the original attention in Transformer and GAT with the same algorithmic time complexity, and our experiments show better performance from our generalized attention.

## Ethical Impact

We generalize the attention in Transformer and GAT, and we believe that our model is useful for capturing the underlying pattern or context of a given dataset. We expect that our proposed models contribute to both machine learning and social science. For the machine learning community, attention is widely used in many domains, such as natural language processing and vision. We propose a new direction to improve and interpret the attention in Transformer and GAT. For the social science aspect, we can apply our model to recommendation, psychotherapy, and political ideal point estimation of legislators by capturing the context or patterns of user log history. We honor the AAAI Publications Ethics and Malpractice Statement, as well as the AAAI Code of Professional Conduct.

## Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF2018R1C1B600865213).

## References

Bahdanau, D.; Brakel, P.; Xu, K.; Goyal, A.; Lowe, R.; Pineau, J.; Courville, A. C.; and Bengio, Y. 2017. An Actor-Critic Algorithm for Sequence Prediction. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Chen, N.; Klushyn, A.; Kurle, R.; Jiang, X.; Bayer, J.; and van der Smagt, P. 2018. Metrics for deep generative models. In International Conference on Artificial Intelligence and Statistics, 1540-1550.

Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Davis, J.; Sarlos, T.; Belanger, D.; Colwell, L.; and Weller, A. 2020a. Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers. arXiv preprint arXiv:2006.03555.

Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. 2020b. Rethinking attention with performers. arXiv preprint arXiv:2009.14794.

Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, 3844-3852.

Deng, Y.; Kim, Y.; Chiu, J.; Guo, D.; and Rush, A. 2018. Latent alignment and variational attention. In Advances in Neural Information Processing Systems, 9712-9724.

Edunov, S.; Ott, M.; Auli, M.; Grangier, D.; and Ranzato, M. 2018. Classical Structured Prediction Losses for Sequence to Sequence Learning. In Proceedings of NAACL-HLT, 355-364.

Getoor, L.; and Lu, Q. 2003. Link-based classification. In Twelfth International Conference on Machine Learning (ICML 2003), Washington DC.

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672-2680.
Jang, E.; Gu, S.; and Poole, B. 2017. Categorical Reparameterization with Gumbel-Softmax. In International Conference on Learning Representations (ICLR 2017). OpenReview.net.

Jung, Y.; Song, K.; and Park, J. 2020. Approximate Inference for Spectral Mixture Kernel. arXiv preprint arXiv:2006.07036.

Katharopoulos, A.; Vyas, A.; Pappas, N.; and Fleuret, F. 2020. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, 5156-5165. PMLR.

Kim, Y. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746-1751.

Kingma, D. P.; and Welling, M. 2014. Auto-Encoding Variational Bayes. In Bengio, Y.; and LeCun, Y., eds., 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.

Kipf, T. N.; and Welling, M. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Li, C.-L.; Chang, W.-C.; Mroueh, Y.; Yang, Y.; and Poczos, B. 2019. Implicit Kernel Learning. In The 22nd International Conference on Artificial Intelligence and Statistics, 2007-2016.

Li, D.; and Dunson, D. B. 2019. Geodesic distance estimation with spherelets. arXiv preprint arXiv:1907.00296.

Maas, A. L.; Hannun, A. Y.; and Ng, A. Y. 2013. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing.

Maaten, L. v. d.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov): 2579-2605.

Nelsen, R. B. 2007. An Introduction to Copulas. Springer Science & Business Media.

Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 701-710.

Rahimi, A.; and Recht, B. 2008. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 1177-1184.

Reed, M.; and Simon, B. 1975. Methods of Modern Mathematical Physics II: Fourier Analysis, Self-Adjointness, volume 2. Elsevier.

Sklar, M. 1959. Distribution functions in n dimensions and their margins. Publ. Inst. Statist. Univ. Paris 8: 229-231.

Tompkins, A.; Senanayake, R.; Morere, P.; and Ramos, F. 2019. Black Box Quantiles for Kernel Learning. In The 22nd International Conference on Artificial Intelligence and Statistics, 1427-1437.

Ton, J.-F.; Flaxman, S.; Sejdinovic, D.; and Bhatt, S. 2018. Spatial mapping with Gaussian processes and nonstationary Fourier features. Spatial Statistics 28: 59-78.

Tran, D.; Blei, D.; and Airoldi, E. M. 2015. Copula variational inference. In Advances in Neural Information Processing Systems, 3564-3572.

Tsai, Y.-H. H.; Bai, S.; Yamada, M.; Morency, L.-P.; and Salakhutdinov, R. 2019. Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4335-4344.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008.
Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2018. Graph Attention Networks. In International Conference on Learning Representations.

Wang, B.; Zhao, D.; Lioma, C.; Li, Q.; Zhang, P.; and Simonsen, J. G. 2020. Encoding word order in complex embeddings. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.

Wiseman, S.; and Rush, A. M. 2016. Sequence-to-Sequence Learning as Beam-Search Optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 1296-1306.

Yaglom, A. M. 1987. Correlation Theory of Stationary and Related Random Functions. Volume I: Basic Results.