# Chefs' Random Tables: Non-Trigonometric Random Features

Valerii Likhosherstov* (University of Cambridge, vl304@cam.ac.uk), Krzysztof Choromanski* (Google Research & Columbia University, kchoro@google.com), Avinava Dubey* (Google Research), Frederick Liu* (Google Research), Tamas Sarlos (Google Research), Adrian Weller (University of Cambridge & The Alan Turing Institute)

*Equal contribution. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).

We introduce chefs' random tables (CRTs), a new class of non-trigonometric random features (RFs) to approximate Gaussian and softmax kernels. CRTs are an alternative to standard random kitchen sink (RKS) methods, which inherently rely on trigonometric maps [46]. We present variants of CRTs where RFs are positive, a key requirement for applications in recent low-rank Transformers [15]. Further variance reduction is possible by leveraging statistics which are simple to compute. One instantiation of CRTs, the optimal positive random features (OPRFs), is to our knowledge the first RF method for unbiased softmax-kernel estimation with positive and bounded RFs, resulting in exponentially small tails and much lower variance than its counterparts. As we show, orthogonal random features applied with OPRFs provide additional variance reduction for any dimensionality $d$ (not only asymptotically for sufficiently large $d$, as for RKSs). We test CRTs on many tasks ranging from non-parametric classification to training Transformers for text, speech and image data, obtaining new state-of-the-art results for low-rank text Transformers, while providing linear space and time complexity of the attention.

1 Introduction & related work

The idea that nonlinear mappings of random-weight linear combinations of data features can be used to linearize various nonlinear similarity functions transformed kernel methods. This led to the development of Random Kitchen Sinks (RKSs) techniques, and the new field of scalable kernel algorithms, introduced in the paper trilogy [44, 45, 46], was born. RKSs were subsequently used in many applications, ranging from kernel and function-to-function regression [1, 33, 41] and SVM algorithms [50] to operator-valued and semigroup kernels [36, 62], neural networks [25, 61, 11, 27] and even differentially-private ML algorithms [9], as well as (very recently) nonparametric adaptive control [4]. Random features (RFs) are a subject of much theoretical analysis [34, 63, 51, 48].

To approximate shift-invariant (e.g. Gaussian, Cauchy or Laplace) and softmax kernels, RKSs rely on the trigonometric nonlinear mappings provided directly by Bochner's Theorem [36]. Trigonometric RFs provide strong concentration results (e.g. uniform convergence, see Claim 1 in [45]), but suffer from a weakness that was noted recently: they are not guaranteed to be positive. This makes them unsuitable for approximating softmax attention in scalable Transformers relying on implicit attention via random features [15]. As noted in [15], trigonometric features lead to unstable training, as they yield poor approximations of the partition functions applied to renormalize attention and involving several small softmax kernel values.

Figure 1: (left) A map of RF methods for Gaussian kernel approximation (Trig RF, Pos RF, OPRF and the corresponding attention mechanisms FAVOR, FAVOR+, FAVOR++, with orthogonality, in Transformers): existing RFs (Section 2.2) and RFs proposed in this paper.
(right) The utility function $\mu$ (defined as the logarithm of the ratio of the variances of the OPRF and Pos RF mechanisms for Gaussian and softmax kernel estimation) as a function of the squared length of the sum of the kernels' inputs, $\|x + y\|^2$ (smaller values imply larger gains coming from OPRF). Different curves correspond to different dimensionalities. Based on the plots, OPRFs have $> e^{60}$ times smaller variance when $d = 64$, $\|x + y\|^2 = 100$ (a configuration taken from the standard Transformer application).

To address this, [15] proposed a new method for unbiased softmax kernel estimation with positive RFs, the so-called FAVOR+ mechanism (Fast Attention Via Orthogonal Random Positive Features), as opposed to FAVOR using trigonometric RFs (as in [13]). Positive random features guarantee that the denominator in self-attention is a sum of positive numbers, hence it cannot be negative or too small. Unfortunately, FAVOR+ features are not bounded and, worse, the moment generating function of the corresponding softmax kernel estimator is not finite. Consequently, no concentration results beyond those involving second moment methods (variance bounds) have been provided for FAVOR+.

Despite active research on constructing new RFs for implicit attention in Transformers [12, 17], the following questions of great practical importance remained open: Does there exist an unbiased estimator of the softmax/Gaussian kernel relying on positive and simultaneously bounded random features? Can it be efficiently constructed?

We answer both questions affirmatively in this paper, introducing a new mechanism called optimal positive random features (OPRFs). We propose other RF methods that, like OPRFs, do not apply trigonometric functions and provide positivity. We call this new set of RF mechanisms chefs' random tables (CRTs, see Figure 1-left). The new OPRF-based method for fast self-attention approximation, applying in addition block-orthogonal random projections, is referred to as FAVOR++. Our contributions are as follows:

- We compute the variance of OPRF-based estimators (see Theorems 3.1 & 3.2) and show that they can provide an $e^{60}\times$ variance reduction for Gaussian/softmax kernel estimation (see Figure 1).
- We give the first exponentially small upper bounds for tails of Gaussian/softmax kernel estimators relying on positive RFs, leveraging the boundedness of OPRFs (see Theorem 4.2). Consequently, using OPRFs we give the first uniform convergence results for softmax attention approximation with positive RFs (Theorem 4.3).
- We show that orthogonal random projections combined with OPRFs (leading to FAVOR++) provably reduce the variance of OPRFs for any dimensionality $d$ (see Theorem 4.1), as opposed to only asymptotically for $d$ large enough, which is the case for RKSs.
- Finally, we provide extensive empirical evaluation in Section 5 for Transformers (text, image and speech domains), establishing new state-of-the-art results for low-rank attention Transformers for text.

2 Prerequisites

2.1 The definition of random features

Let $x, y \in \mathbb{R}^d$ be real vectors and $K(x, y) = \exp(-\frac{1}{2}\|x - y\|^2)$ be the Gaussian kernel, where $\|\cdot\|$ denotes the $L_2$-norm. By random features (RFs) for the Gaussian kernel we denote two functions $f^{(1)}(\omega, x), f^{(2)}(\omega, y): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{C}$, where $\omega$ is a random vector from some distribution $p(\omega)$ on $\mathbb{R}^d$. The functions $f^{(\cdot)}(\omega, \cdot)$ satisfy the following:

$$K(x, y) = \mathbb{E}_{p(\omega)}\,\mathrm{Re}\left[f^{(1)}(\omega, x) f^{(2)}(\omega, y)\right] \quad (1)$$

for all $x, y \in \mathbb{R}^d$, where $\mathrm{Re}(\cdot)$ denotes the real part of a complex number ($\mathrm{Im}(\cdot)$ for the imaginary part).
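To make the decomposition (1) concrete, the following is a minimal sketch (NumPy; the helper name `gaussian_kernel` and all constants are ours, not from the paper) that checks it empirically with the classical trigonometric map recalled in Section 2.2 below: averaging $\mathrm{Re}[f^{(1)} f^{(2)}]$ over many samples of $\omega$ recovers $K(x, y)$.

```python
import numpy as np

def gaussian_kernel(x, y):
    # K(x, y) = exp(-||x - y||^2 / 2)
    return np.exp(-0.5 * np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
d, M = 16, 200_000
x, y = rng.normal(size=d) / 4.0, rng.normal(size=d) / 4.0

omega = rng.normal(size=(M, d))         # i.i.d. samples from p(omega) = N(0_d, I_d)
f1 = np.exp(1j * (omega @ x))           # f^(1)(omega, x) = exp(i omega^T x)
f2 = np.exp(-1j * (omega @ y))          # f^(2)(omega, y) = exp(-i omega^T y)
estimate = np.mean(np.real(f1 * f2))    # Monte Carlo estimate of E[Re f^(1) f^(2)]

print(estimate, gaussian_kernel(x, y))  # the two values should agree closely
```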
The decomposition (1) can be used for an unbiased approximation of the linear operator $\mathbf{K} = (K(x^{(i)}, y^{(j)}))_{i,j=1}^{L,L} \in \mathbb{R}^{L \times L}$, where $x^{(i)}, y^{(j)} \in \mathbb{R}^d$. Such linear operators emerge in various applications, e.g. kernel SVM [45], kernel regression [38, 59] or Transformers [15] (see Section 2.3). For any $c \in \mathbb{R}^L$, evaluating $\mathbf{K}c$ naively would result in $O(dL^2)$ time complexity, which is prohibitively expensive for large $L$. Instead, we can use the Monte Carlo approximation: draw i.i.d. samples $\omega_1, \dots, \omega_M \sim p(\omega)$, where $M \ll L$, and compute for $1 \le i \le L$:

$$\sum_{j=1}^{L} K(x^{(i)}, y^{(j)}) c_j \approx \frac{1}{M}\sum_{m=1}^{M} \mathrm{Re}\left[ f^{(1)}(\omega_m, x^{(i)}) \sum_{j=1}^{L} f^{(2)}(\omega_m, y^{(j)}) c_j \right]. \quad (2)$$

Therefore, $\mathbf{K}c$ can be approximated by first precomputing $\{\sum_{j=1}^{L} f^{(2)}(\omega_m, y^{(j)}) c_j\}_{m=1}^{M}$ and then evaluating (2) in $O(dML)$ total time. The precision of this approximation can be theoretically bounded [45, 15]. In this manuscript, we will use the variance $\mathrm{Var}_{p(\omega)}\,\mathrm{Re}[f^{(1)}(\omega, x) f^{(2)}(\omega, y)]$ of (1) to judge the precision of the Monte Carlo approximation. Since the samples $\omega_1, \dots, \omega_M$ are i.i.d., the number $M$ of RFs controls the tradeoff between the total variance $\mathrm{Var}_{\omega_1,\dots,\omega_M}(\dots) = \frac{1}{M}\mathrm{Var}_{p(\omega)}(\dots)$ (inversely proportional to $M$) and the number of computations (directly proportional to $M$).

The softmax kernel is defined as $K_{\mathrm{sfm}}(x, y) = \exp(x^\top y)$ and can be easily derived from the Gaussian kernel $K$ as follows: $K_{\mathrm{sfm}}(x, y) = \exp(\|x\|^2/2)\, K(x, y)\, \exp(\|y\|^2/2)$. Thus, in particular, any RF mechanism for the Gaussian kernel immediately transfers to the corresponding one for the softmax kernel and vice versa. Hence from now on, unless explicitly stated otherwise, the estimators we consider approximate the Gaussian kernel.

2.2 Existing trigonometric and positive random feature methods

Here we summarize existing RFs for Gaussian kernel estimation. Trigonometric RFs (Trig RFs), the core of RKSs [45] and FAVOR [13], are defined as follows:

$$f^{(1)}_{\mathrm{trig}}(\omega, x) = \exp(i\omega^\top x), \quad f^{(2)}_{\mathrm{trig}}(\omega, y) = \exp(-i\omega^\top y), \quad p_{\mathrm{trig}}(\omega) = \mathcal{N}(0_d, \mathbf{I}_d),$$

where $i$ denotes the imaginary unit (as opposed to the index notation $i$), $0_d \in \mathbb{R}^d$ is a vector of zeros and $\mathbf{I}_d \in \mathbb{R}^{d \times d}$ is an identity matrix. The variance of these RFs has the following form [15]: $\mathrm{Var}_{p_{\mathrm{trig}}(\omega)}\,\mathrm{Re}\big[f^{(1)}_{\mathrm{trig}}(\omega, x) f^{(2)}_{\mathrm{trig}}(\omega, y)\big] = \frac{1}{2}\big(1 - K(x, y)^2\big)^2$.

Positive RFs (Pos RFs) [15], the key ingredient of the FAVOR+ mechanism, are defined as follows: $f^{(1)}_{\mathrm{pos}}(\omega, x) = f^{(2)}_{\mathrm{pos}}(\omega, x) = \exp(\omega^\top x - \|x\|^2)$, $p_{\mathrm{pos}}(\omega) = \mathcal{N}(0_d, \mathbf{I}_d)$. Their name is due to the fact that $f^{(1)}_{\mathrm{pos}}(\omega, x), f^{(2)}_{\mathrm{pos}}(\omega, y)$ are always positive real numbers. The Pos RF variance has the form [15]: $\mathrm{Var}_{p_{\mathrm{pos}}(\omega)}\big(f^{(1)}_{\mathrm{pos}}(\omega, x) f^{(2)}_{\mathrm{pos}}(\omega, y)\big) = \exp(4x^\top y) - K(x, y)^2$.

2.3 Random features for scalable Transformers

One recent application of RFs is in the area of scalable Transformers for processing long sequences [15]. Let $L$ be the length of the sequence. Interactions between elements in Transformers are implemented via the self-attention mechanism. Given three matrices $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{L \times d}$, the self-attention mechanism returns the following result:

$$\mathbf{Y} = \mathrm{softmax}(d^{-1/2}\mathbf{Q}\mathbf{K}^\top)\mathbf{V} = \mathrm{diag}(\mathbf{K}_{\mathrm{sfm}}\mathbf{1}_L)^{-1}\mathbf{K}_{\mathrm{sfm}}\mathbf{V}, \quad \mathbf{K}_{\mathrm{sfm}} = (K_{\mathrm{sfm}}(x^{(i)}, y^{(j)}))_{i,j=1}^{L,L}, \quad (3)$$

where $\mathbf{1}_L \in \mathbb{R}^L$ is a vector of all ones, $x^{(i)} = d^{-1/4}\mathbf{Q}_{i,:}$ ($i$-th row of $\mathbf{Q}$) and $y^{(j)} = d^{-1/4}\mathbf{K}_{j,:}$, $1 \le i, j \le L$. We deduce that computing (3) reduces to applying the linear operator $\mathbf{K}_{\mathrm{sfm}}$ to $d + 1$ vectors: $\mathbf{1}_L, \mathbf{V}_{:,1}, \dots, \mathbf{V}_{:,d}$. Hence, when $L$ is large, an RF approximation similar to (2), but for the $K_{\mathrm{sfm}}(\cdot, \cdot)$ kernel, can reduce the computational complexity from $O(dL^2)$ to $O(dML)$.
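As a concrete illustration of how the factorization (2) linearizes the attention formula (3), here is a short sketch (NumPy; the function names `sfm_features` and `rf_attention` are ours, and the snippet is a simplified, non-causal variant of this idea using the Pos RFs above), which never forms the $L \times L$ matrix $\mathbf{K}_{\mathrm{sfm}}$.

```python
import numpy as np

def sfm_features(Z, omega):
    # Positive RFs for the softmax kernel: phi(z)_m = exp(w_m^T z - ||z||^2 / 2), so that
    # Ksfm(x, y) = E[phi(x)_m phi(y)_m] (the exp(||.||^2 / 2) factors of Section 2.1 are folded in).
    return np.exp(Z @ omega.T - 0.5 * np.sum(Z ** 2, axis=-1, keepdims=True))

def rf_attention(Q, K, V, M=256, seed=0):
    L, d = Q.shape
    omega = np.random.default_rng(seed).normal(size=(M, d))  # i.i.d. Gaussian projections
    phi_q = sfm_features(Q / d ** 0.25, omega)                # rows x^(i) = d^(-1/4) Q_{i,:}
    phi_k = sfm_features(K / d ** 0.25, omega)                # rows y^(j) = d^(-1/4) K_{j,:}
    num = phi_q @ (phi_k.T @ V)       # approximates Ksfm V, O(LMd) instead of O(L^2 d)
    den = phi_q @ phi_k.sum(axis=0)   # approximates Ksfm 1_L (positive, so safe to invert)
    return num / den[:, None]

# Usage: Y_approx = rf_attention(Q, K, V) for Q, K, V of shape (L, d).
```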
Importantly, when the approximation (2) with the replacement $K \to K_{\mathrm{sfm}}$ can take negative and/or near-zero values, training is unstable, since this approximation emerges in the (inverted) denominator term $\mathrm{diag}(\mathbf{K}_{\mathrm{sfm}}\mathbf{1}_L)^{-1}$ in (3). One way to address this is to restrict $f^{(1)}(\omega, x)$ and $f^{(2)}(\omega, y)$ to always map into strictly positive numbers $\mathbb{R}_+$. This is where the Pos RFs introduced in Section 2.2 are particularly relevant.

3 Chefs' Random Tables

We are ready to present our mechanism of chefs' random tables. All proofs are in the Appendix.

3.1 Generalized exponential RFs (GERFs) & optimal positive RFs (OPRFs)

Our first goal will be to generalize both trigonometric and positive RFs. Then we will focus on one special case of this generalization, which will directly lead to the FAVOR++ mechanism. We will be looking for RFs of the following generalized exponential form for $p_{\mathrm{GE}}(\omega) = \mathcal{N}(0_d, \mathbf{I}_d)$:

$$f^{(1)}_{\mathrm{GE}}(\omega, x) = D\exp\big(A\|\omega\|^2 + B\omega^\top x + C\|x\|^2\big), \quad f^{(2)}_{\mathrm{GE}}(\omega, y) = D\exp\big(A\|\omega\|^2 + sB\omega^\top y + C\|y\|^2\big), \quad (4)$$

where $A, B, C, D \in \mathbb{C}$ and $s \in \{-1, +1\}$. It can be seen that $A = 0$, $B = i$, $C = 0$, $D = 1$, $s = -1$ corresponds to trigonometric RFs and $A = 0$, $B = 1$, $C = -1$, $D = 1$, $s = +1$ corresponds to positive RFs. The next theorem describes the conditions under which $f^{(\cdot)}_{\mathrm{GE}}$ can be used to approximate the Gaussian kernel.

Theorem 3.1. $p_{\mathrm{GE}}(\omega)$ and $f^{(\cdot)}_{\mathrm{GE}}$, defined in (4), satisfy (1) if

$$\mathrm{Re}(1 - 4A) > 0, \quad B = \sqrt{s(1 - 4A)}, \quad C = -(s + 1)/2, \quad D = \big(\sqrt[4]{1 - 4A}\big)^d, \quad (5)$$

where $\sqrt{\cdot}$ and $\sqrt[n]{\cdot}$ denote principal roots if the argument is complex.

Hence, $A$ and $s$ can be treated as free parameters and $B$, $C$, $D$ as dependent ones. The variance of these RFs can be expressed through $A$ and $s$ as follows:

Theorem 3.2. Let $\mathrm{Re}(1 - 8A) > 0$. The variance of (2) using $p_{\mathrm{GE}}(\omega)$, $f^{(\cdot)}_{\mathrm{GE}}$ is given as

$$\mathrm{Var}_{p_{\mathrm{GE}}(\omega)}\,\mathrm{Re}\big[f^{(1)}_{\mathrm{GE}}(\omega, x) f^{(2)}_{\mathrm{GE}}(\omega, y)\big] = \frac{1}{2}\exp\big(-(s + 1)(\|x\|^2 + \|y\|^2)\big)\Big[\mathrm{Re}\big(\alpha_1 \exp\big(\alpha_2\|x + sy\|^2\big)\big) + \alpha_3 \exp\big(\alpha_4\|x + sy\|^2\big)\Big] - K(x, y)^2, \quad (6)$$

where $\alpha_1 = \Big(\frac{(1 - 4A)^2}{1 - 8A}\Big)^{d/2}$, $\alpha_2 = s + \frac{s}{1 - 8A}$, $\alpha_3 = \Big(1 + \frac{16|A|^2}{1 - 8\mathrm{Re}(A)}\Big)^{d/2}$, $\alpha_4 = \frac{s}{2} + \frac{s + 2|1 - 4A|}{2(1 - 8\mathrm{Re}(A))}$.

While it is unclear how to find a global minimum of the objective (6) with respect to $A \in \mathbb{C}$, $\mathrm{Re}(1 - 8A) > 0$ and $s \in \{-1, +1\}$, we observe that it is possible to find an optimum when we restrict $A$ to be a real number and fix $s = +1$.

Theorem 3.3 (Minimum variance). When $s = +1$, $A$ is restricted to be a real number and $\|x + y\|^2 > 0$, the variance (6) is minimized when $A^* = (1 - 1/\rho^*)/8$, where $0 < \rho^* < 1$,

$$\rho^* = \left(\sqrt{\big(2\|x + y\|^2 + d\big)^2 + 8d\|x + y\|^2} - 2\|x + y\|^2 - d\right) \Big/ \big(4\|x + y\|^2\big). \quad (7)$$

Note: One can show that $A^* < 0$ for $x \neq -y$, thus the corresponding estimator is bounded, since the term $A\|\omega\|^2$ prevails over the linear terms $B\omega^\top x$ and $sB\omega^\top y$ in (4). When $\|x + y\| \to 0$ then $\rho^* \to 1$ and thus $A^* \to 0$. Therefore, for $x = -y$ the mechanism reduces to the Pos RFs described in Section 2.2, as expected, since for $x = -y$ Pos RFs provide perfect estimation (variance equal to zero). Larger values of $\|x + y\|$ lead to larger gains coming from the new mechanism.

From (5) it can be inferred that $B$, $C$, $D$ are real when $A$ is real and $s = +1$. Hence, $f^{(1)}(\omega, x), f^{(2)}(\omega, y)$ are positive real numbers in this case. Furthermore, $s = +1$, $A = 0$ corresponds to positive RFs. Therefore, we refer to RFs with $A$ defined according to (7) as optimal positive RFs (OPRFs). Figure 1-right illustrates the analytical variance reduction achieved via OPRFs.

In practice, we are given sets $\{x^{(i)}\}$, $\{y^{(j)}\}$ instead of a single pair $x, y$. For this reason, in (6, 7), we can use the averages of $\|x^{(i)}\|^2$, $\|y^{(j)}\|^2$, $\|x^{(i)} + sy^{(j)}\|^2$ instead of $\|x\|^2$, $\|y\|^2$, $\|x + sy\|^2$. This heuristic is based on the assumption that all $\{x^{(i)}\}$ and $\{y^{(j)}\}$ are homogeneous and $\|x^{(i)}\|^2$, $\|y^{(j)}\|^2$, $\|x^{(i)} + sy^{(j)}\|^2$ are tightly concentrated around their mean.
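The following is a minimal sketch of the OPRF construction under the assumptions above (real $A$, $s = +1$): $A$ is set to the closed-form optimum (7) and $B$, $C$, $D$ follow from (5). The helper names `oprf_params` and `oprf_features`, as well as the constants, are ours; the last lines check unbiasedness on a single pair $(x, y)$.

```python
import numpy as np

def oprf_params(x_plus_y_sq, d):
    # x_plus_y_sq plays the role of ||x + y||^2 (or its average (8) over the data).
    z = x_plus_y_sq
    rho = (np.sqrt((2 * z + d) ** 2 + 8 * d * z) - 2 * z - d) / (4 * z)  # eq. (7)
    A = (1.0 - 1.0 / rho) / 8.0                                          # optimal A < 0
    B = np.sqrt(1.0 - 4.0 * A)                                           # eq. (5) with s = +1
    C = -1.0
    D = (1.0 - 4.0 * A) ** (d / 4.0)
    return A, B, C, D

def oprf_features(Z, omega, A, B, C, D):
    # f_GE(w, z) = D * exp(A ||w||^2 + B w^T z + C ||z||^2), eq. (4) with s = +1.
    sq_omega = np.sum(omega ** 2, axis=-1)              # (M,)
    sq_z = np.sum(Z ** 2, axis=-1, keepdims=True)       # (L, 1)
    return D * np.exp(A * sq_omega[None, :] + B * (Z @ omega.T) + C * sq_z)

rng = np.random.default_rng(0)
d, M = 16, 200_000
x, y = rng.normal(size=(1, d)) / 4.0, rng.normal(size=(1, d)) / 4.0
A, B, C, D = oprf_params(np.sum((x + y) ** 2), d)
omega = rng.normal(size=(M, d))
est = np.mean(oprf_features(x, omega, A, B, C, D) * oprf_features(y, omega, A, B, C, D))
print(est, np.exp(-0.5 * np.sum((x - y) ** 2)))         # should agree closely
```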
Computing the averages of $\|x^{(i)}\|^2$, $\|y^{(j)}\|^2$ takes $O(Ld)$ time. Using the formula below, the average of $\|x^{(i)} + sy^{(j)}\|^2$ can be computed with the same complexity:

$$\frac{1}{L^2}\sum_{i=1}^{L}\sum_{j=1}^{L}\|x^{(i)} + sy^{(j)}\|^2 = \frac{1}{L}\sum_{i=1}^{L}\|x^{(i)}\|^2 + \frac{2s}{L^2}\left(\sum_{i=1}^{L}x^{(i)}\right)^{\!\top}\left(\sum_{i=1}^{L}y^{(i)}\right) + \frac{1}{L}\sum_{i=1}^{L}\|y^{(i)}\|^2. \quad (8)$$

The closed-form solution for real $A$ and $s = +1$ allows $O(1)$-time optimization of (6) after precomputing these statistics. In the general case we can rely on numerical optimization of (6) with respect to $A \in \mathbb{C}$ and $s \in \{-1, +1\}$. Using precomputed statistics, each evaluation of (6) takes $O(1)$ time. As long as the total number of these evaluations is $O(LM(d + n))$, where $n$ is the number of $\mathbf{K}$ or $\mathbf{K}_{\mathrm{sfm}}$ evaluations ($n = d + 1$ in Section 2.3), it does not affect the total complexity.

The next class of mechanisms, if implemented straightforwardly, does not give positive-valued RFs but, as we explain in Section 3.2.3, can be easily transformed to variants providing positivity.

3.2 Discretely-induced random features (DIRFs)

Take a discrete probability distribution $p(\omega)$ where $\omega_1, \dots, \omega_d$ are i.i.d. with $\mathbb{P}(\omega_l = k) = p_k$, $\sum_{k=0}^{\infty} p_k = 1$ and $p_k > 0$ for $k \in \{0\} \cup \mathbb{N}$. Note that, by the Taylor series expansion of $\exp(\cdot)$,

$$K(x, y)\exp\Big(\tfrac{\|x\|^2}{2}\Big)\exp\Big(\tfrac{\|y\|^2}{2}\Big) = \exp(x^\top y) = \prod_{l=1}^{d}\sum_{k=0}^{\infty} p_k \frac{x_l^k y_l^k}{p_k k!} = \mathbb{E}\left[\prod_{l=1}^{d} X_l Y_l\right], \quad (9)$$

where $X_l = x_l^{\omega_l}(\omega_l!)^{-\frac{1}{2}} p_{\omega_l}^{-\frac{1}{2}}$, $Y_l = y_l^{\omega_l}(\omega_l!)^{-\frac{1}{2}} p_{\omega_l}^{-\frac{1}{2}}$. Thus we can define discretely-induced random features providing Gaussian kernel estimation as follows:

$$f^{(1)}_{\mathrm{DI}}(\omega, x) = f^{(2)}_{\mathrm{DI}}(\omega, x) = f_{\mathrm{DI}}(\omega, x) = \exp\Big(-\tfrac{\|x\|^2}{2}\Big)\prod_{l=1}^{d} x_l^{\omega_l}(\omega_l!)^{-\frac{1}{2}} p_{\omega_l}^{-\frac{1}{2}}. \quad (10)$$

Different instantiations of the above mechanism are given by different probability distributions $\{p_k\}$. We will consider two prominent special cases: (a) Poisson, and (b) geometric distributions.

3.2.1 Poisson random features (Pois RFs)

If $\{p_k\}$ is a Poisson distribution, i.e. $p_k = e^{-\lambda}\lambda^k/k!$, $k \in \{0\} \cup \mathbb{N}$, then the corresponding RFs are defined as: $f^{(1)}_{\mathrm{pois}}(\omega, x) = f^{(2)}_{\mathrm{pois}}(\omega, x) = f_{\mathrm{pois}}(\omega, x) = e^{\lambda d/2 - \|x\|^2/2}\prod_{l=1}^{d} x_l^{\omega_l}\lambda^{-\omega_l/2}$.

Theorem 3.4. The variance of (2) with $p_{\mathrm{pois}}$, $f_{\mathrm{pois}}$ is given by

$$\mathrm{Var}_{p_{\mathrm{pois}}(\omega)}\big(f_{\mathrm{pois}}(\omega, x) f_{\mathrm{pois}}(\omega, y)\big) = \exp\left(\lambda d + \frac{1}{\lambda}\sum_{l=1}^{d} x_l^2 y_l^2 - \|x\|^2 - \|y\|^2\right) - K(x, y)^2. \quad (11)$$

The $\exp$ argument in (11) is convex as a function of $\lambda > 0$. By setting its derivative to zero, we find that $\lambda^* = d^{-1/2}\big(\sum_{l=1}^{d} x_l^2 y_l^2\big)^{1/2}$ gives the minimum of (11). When, instead of a single pair $x, y$, sets $\{x^{(i)}\}$, $\{y^{(j)}\}$ are provided, we can use the same homogeneity assumption as in Section 3.1 and substitute the average of $\sum_{l=1}^{d}(x^{(i)}_l)^2(y^{(j)}_l)^2$ over $1 \le i, j \le L$ instead of $\sum_{l=1}^{d} x_l^2 y_l^2$. This average can be computed efficiently in $O(Ld)$ time as follows:

$$\frac{1}{L^2}\sum_{i=1}^{L}\sum_{j=1}^{L}\sum_{l=1}^{d}(x^{(i)}_l)^2(y^{(j)}_l)^2 = L^{-2}\sum_{l=1}^{d}\left(\sum_{i=1}^{L}(x^{(i)}_l)^2\right)\left(\sum_{i=1}^{L}(y^{(i)}_l)^2\right). \quad (12)$$

After computing this statistic, we can calculate $\lambda^*$ in $O(1)$ time using the analytic formula.
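For illustration, here is a sketch of the Poisson instantiation of the discretely-induced construction (10), with $\lambda$ set to its optimal value minimizing (11); the helper name `pois_features` and the constants are ours, and the example uses a single pair $(x, y)$ rather than the averaged statistic (12).

```python
import numpy as np

def pois_features(Z, omega, lam):
    # f_pois(w, z) = exp(lam * d / 2 - ||z||^2 / 2) * prod_l (z_l / sqrt(lam))^{w_l}
    L, d = Z.shape
    prod = np.prod((Z[:, None, :] / np.sqrt(lam)) ** omega[None, :, :], axis=-1)  # (L, M)
    return np.exp(lam * d / 2 - 0.5 * np.sum(Z ** 2, axis=-1, keepdims=True)) * prod

rng = np.random.default_rng(0)
d, M = 8, 200_000
x, y = np.abs(rng.normal(size=(1, d))) / 4.0, np.abs(rng.normal(size=(1, d))) / 4.0
lam = np.sqrt(np.sum(x ** 2 * y ** 2) / d)       # optimal lambda minimizing (11)
omega = rng.poisson(lam, size=(M, d))            # i.i.d. Poisson-distributed integer coordinates
est = np.mean(pois_features(x, omega, lam) * pois_features(y, omega, lam))
print(est, np.exp(-0.5 * np.sum((x - y) ** 2)))  # should agree closely
```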
3.2.2 Geometric random features (Geom RFs)

If $\{p_k\}$ is a geometric distribution, i.e. $p_k = p(1 - p)^k$, $k \in \{0\} \cup \mathbb{N}$, for a parameter $0 < p < 1$, then the corresponding RFs are defined as: $f^{(1)}_{\mathrm{geom}}(\omega, x) = f^{(2)}_{\mathrm{geom}}(\omega, x) = f_{\mathrm{geom}}(\omega, x) = p^{-d/2} e^{-\|x\|^2/2}\prod_{l=1}^{d} x_l^{\omega_l}(1 - p)^{-\omega_l/2}(\omega_l!)^{-1/2}$.

Theorem 3.5. The variance of (2) with $p_{\mathrm{geom}}$, $f_{\mathrm{geom}}$ is given as

$$\mathrm{Var}_{p_{\mathrm{geom}}(\omega)}\big(f_{\mathrm{geom}}(\omega, x) f_{\mathrm{geom}}(\omega, y)\big) = p^{-d} e^{-\|x\|^2 - \|y\|^2}\prod_{l=1}^{d} I_0\big(2(1 - p)^{-\frac{1}{2}}|x_l y_l|\big) - K(x, y)^2, \quad (13)$$

where $I_0(\cdot)$ is the modified Bessel function of the first kind of order 0. Again, as for the previously described mechanisms, when sets $\{x^{(i)}\}$, $\{y^{(j)}\}$ are given, we can use the averages of $|x^{(i)}_l y^{(j)}_l|$, $1 \le l \le d$, instead of $|x_l y_l|$ in (13), assuming homogeneity of the $x^{(i)}$'s and $y^{(j)}$'s. Each of the $d$ averages can be computed in $O(L)$ time as follows:

$$\frac{1}{L^2}\sum_{i=1}^{L}\sum_{j=1}^{L}|x^{(i)}_l y^{(j)}_l| = L^{-2}\left(\sum_{i=1}^{L}|x^{(i)}_l|\right)\left(\sum_{i=1}^{L}|y^{(i)}_l|\right). \quad (14)$$

After precomputation of these statistics, evaluation of (13) takes $O(d)$ time. A numerical optimization can be used to minimize (13) with respect to $p$. As long as the number of variance evaluations is $O(LM(1 + n/d))$, the total complexity estimate is not affected.

3.2.3 Making discretely-induced RFs positive

As can be inferred from Eq. (10), DIRFs are positive when all elements of $x$ and $y$ are positive. If this is not the case, and positive-valued RFs are needed, e.g. in applications involving scalable Transformers, one way to make them positive is to take some vector $c \in \mathbb{R}^d$ such that $c_l < x^{(i)}_l, y^{(i)}_l$ for all $i$. An example of such a vector is given by $c_l = \min_i \min(x^{(i)}_l, y^{(i)}_l) - \epsilon$, where $\epsilon > 0$ is a small constant. Next, define $\widehat{x}^{(i)} = x^{(i)} - c$, $\widehat{y}^{(j)} = y^{(j)} - c$. Then, clearly, $\widehat{x}^{(i)} - \widehat{y}^{(j)} = x^{(i)} - y^{(j)}$, $K(\widehat{x}^{(i)}, \widehat{y}^{(j)}) = K(x^{(i)}, y^{(j)})$, and the RFs can be used on $\widehat{x}^{(i)}, \widehat{y}^{(j)}$, which have positive entries. We refer to these variants of Pois RFs and Geom RFs as Pois RF+ and Geom RF+ respectively.

4 Additional theoretical results & FAVOR++

Interestingly, as in the case of the Pos RF mechanism from [15], OPRFs also benefit from applying block-orthogonal ensembles of projections $\omega$ (see Appendix 9.1 and [15] for the exact definition). We show below that orthogonal RFs reduce the variance of GERFs with $A \in \mathbb{R}$, $s = 1$ (which is the case for OPRFs) for any $d > 0$:

Theorem 4.1 (Orthogonal OPRFs). If $\mathrm{Var}(\widehat{K}^{\mathrm{ort}}_M(x, y))$ denotes the variance of the orthogonal GERF estimator $\widehat{K}^{\mathrm{ort}}_M(x, y)$ of the Gaussian kernel with $A \in \mathbb{R}$, $s = 1$ in (4) at $x, y \in \mathbb{R}^d$ using $M$ RFs, and $\mathrm{Var}(\widehat{K}^{\mathrm{iid}}_M(x, y))$ is the analogous expression with i.i.d. samples, then for some $C(\|x + y\|) \ge 0$:

$$\mathrm{Var}\big(\widehat{K}^{\mathrm{ort}}_M(x, y)\big) \le \mathrm{Var}\big(\widehat{K}^{\mathrm{iid}}_M(x, y)\big) - \left(1 - \frac{1}{M}\right)\frac{2}{d + 2}\,C(\|x + y\|). \quad (15)$$

Note: The analogous inequality can be obtained for Trig RFs only in the asymptotic sense (for $d$ large enough, see Theorem 3.8 in [14]). One of the key properties used in the proof of Theorem 4.1 is the positivity of the RFs. We conclude that positive-valued RFs are particularly well suited for quasi-Monte Carlo methods based on orthogonal ensembles. Analogously to FAVOR+ [15], we refer to the self-attention approximation mechanism based on orthogonal OPRFs as FAVOR++.

We now provide strong concentration results for the OPRF-based estimators, critically relying on the boundedness of OPRFs. Boundedness of OPRFs is due to $A^* = (1 - 1/\rho^*)/8 < 0$, even when $\|x + y\|^2$ is substituted by the average (8) in (7). To the best of our knowledge, these are the first such results for positive-valued RFs. Denote by $\mathcal{L}$ the Legendre transform of the random variable $Z = f^{(1)}_{\mathrm{GE}}(\omega, x) f^{(2)}_{\mathrm{GE}}(\omega, y)$ for $f^{(\cdot)}_{\mathrm{GE}}$ as in (4) with $A, B, C, D \in \mathbb{R}$ defining GERFs.

Figure 2: Log-variance of different RF mechanisms, mean and standard deviation. For each sampling method, we plot the results for non-positive and positive RFs on separate plots for $0.1 \le \sigma \le 1$.

Theorem 4.2. Suppose $A < 0$. The following is true for any $\epsilon > 0$:

$$\mathbb{P}\left[\big|\widehat{K}^{\mathrm{iid}}_M(x, y) - K(x, y)\big| \ge \epsilon\right] \le 2\exp\left(-\frac{M\epsilon^2}{2\exp\big(-\frac{\|x\|^2 + \|y\|^2}{2A}\big)(1 - 4A)^{\frac{d}{2}}}\right).$$

Furthermore, for the orthogonal variant we have: $\mathbb{P}\big[\widehat{K}^{\mathrm{ort}}_M(x, y) - K(x, y) \ge \epsilon\big] \le \exp\big(-M\mathcal{L}(K(x, y) + \epsilon)\big)$ and $\mathcal{L}(K(x, y) + \epsilon) > 0$.
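The block-orthogonal projection ensembles behind FAVOR++ (Theorem 4.1) can be generated, for example, with a QR-based construction as sketched below (NumPy; the name `block_orthogonal_gaussian` is ours, and this is one standard way to build such ensembles rather than a description of the exact implementation used in the experiments). Rows are exactly orthogonal within each $d \times d$ block, while each row keeps the marginal distribution of a Gaussian vector because its norm is re-sampled from a chi distribution with $d$ degrees of freedom.

```python
import numpy as np

def block_orthogonal_gaussian(M, d, rng):
    blocks, remaining = [], M
    while remaining > 0:
        b = min(remaining, d)
        G = rng.normal(size=(d, d))
        Q, _ = np.linalg.qr(G)                    # orthonormal rows of a Gaussian block
        blocks.append(Q[:b])                      # take b orthonormal rows
        remaining -= b
    W = np.concatenate(blocks, axis=0)            # (M, d), rows orthogonal within each block
    norms = np.sqrt(rng.chisquare(df=d, size=M))  # ||w|| of a Gaussian vector w ~ N(0_d, I_d)
    return W * norms[:, None]

# Usage: omega = block_orthogonal_gaussian(M, d, np.random.default_rng(0)) can replace the
# i.i.d. Gaussian samples in the OPRF feature maps sketched earlier.
```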
Finally, below we provide the first result regarding uniform convergence for attention approximation in efficient low-rank Transformers (Section 2.3).

Theorem 4.3 (Uniform convergence for attention approximation). Assume that the rows of $\mathbf{Q}$ and $\mathbf{K}$ from (3) come from the $L_2$-ball of radius $R > 0$. Denote by $\widehat{\mathbf{K}}_{\mathrm{sfm}}$ the approximation of $\mathbf{K}_{\mathrm{sfm}}$ from (3) via the GERF mechanism with $A < 0$ using $M$ independent random projections. Then $\|\mathbf{K}_{\mathrm{sfm}} - \widehat{\mathbf{K}}_{\mathrm{sfm}}\|_\infty \le \epsilon$ with any constant probability when $M = \Omega\big((1 - 4A)^{\frac{d}{2}}\,\Gamma\,\frac{d}{\epsilon^2}\log\frac{\gamma}{\epsilon}\big)$, where $\Gamma = \exp\big(\frac{3R^2}{4}\big)$, $\gamma = (1 - 4A)^{\frac{d}{2}}\sqrt{d + d^2}$ (for $A$ as in the OPRF definition).

5 Experiments

We present an extensive empirical evaluation of CRTs. Additional details and results for each experiment can be found in Appendix 9.10.

5.1 Comparing variance of CRTs for Gaussian kernel estimation

In this initial experiment, we sample synthetic pairs of vectors $x, y$ and evaluate the variance of CRTs based on the analytic formulas (6, 11, 13). Our goal is to check whether there are scenarios where the newly introduced RF mechanisms have smaller variance than the existing Trig RF and Pos RF methods. We set $d = 64$, which is standard in e.g. Transformer applications (Section 2.3). We use four different regimes for drawing $x, y$: normal corresponds to $x, y$ sampled from $\mathcal{N}(0_d, \sigma^2\mathbf{I}_d)$; sphere corresponds to $x, y$ sampled uniformly on a sphere $\sigma S^{d-1}$; heterogen corresponds to $x$ and $y$ sampled from two heterogeneous distributions, $\mathcal{N}(0_d, \sigma^2\mathbf{I}_d)$ and $\mathcal{N}(\sigma 1_d, \sigma^2\mathbf{I}_d)$; and mnist corresponds to $x, y$ being random images from the MNIST dataset [19] resized to 8x8, scaled by $\sigma > 0$ and flattened.

In many scenarios (see Figure 2), CRTs outperform the Trig RF and Pos RF baselines. Among other improvements, GERF gives more than $e^{80}$, $e^{125}$ and $e^{10}$ times variance reduction compared to Trig RF in normal, heterogen and mnist when $\sigma = 1$. OPRF and Geom RF+ give more than $e^{75}$, $e^{125}$ and $e^{7}$ times variance reduction compared to Pos RF in normal, heterogen and mnist when $\sigma = 1$.

5.2 Comparing CRTs in non-parametric classification

Our next experiment is non-parametric classification where probabilities are predicted by kernel regression [38, 59] with the Gaussian kernel. Training data consists of objects $o^{(1)}, \dots, o^{(L)} \in \mathbb{R}^d$ with corresponding one-hot encoded labels $r^{(1)}, \dots, r^{(L)} \in \mathbb{R}^n$. The predicted label distribution for a new object $o^*$ is defined as $r^* = \sum_{i=1}^{L} K(\sigma o^*, \sigma o^{(i)}) r^{(i)} / \sum_{i=1}^{L} K(\sigma o^*, \sigma o^{(i)})$, where $\sigma > 0$ is a hyperparameter tuned on the validation set. Using the RF approximation for the kernel as in (2), we can, with $O(nLM)$ preprocessing, approximate $r^*$ in $O(nM)$ time per example instead of $O(nL)$ for the exact computation. Since the predicted class is $\mathrm{argmax}_{1 \le l \le n}\, r^*_l$, we can ignore the denominator term and, therefore, use non-positive RFs. We evaluate on classification benchmarks from the UCI Repository [24] (Table 1).

Table 1: Non-parametric classification, test accuracy (%). $M = 128$. Best and second-best results were highlighted in the original.

| Dataset | Trig RF | Pos RF | GERF | Pois RF | Geom RF | OPRF | Pois RF+ | Geom RF+ | L |
|---|---|---|---|---|---|---|---|---|---|
| abalone [39] | 12.0 | 16.0 | 17.0 | 18.0 | 18.3 | 17.1 | 14.0 | 15.1 | 3758 |
| banknote [22] | 66.2 | 83.4 | 92.4 | 84.4 | 94.5 | 92.6 | 80.1 | 85.6 | 1233 |
| car [5] | 66.3 | 69.2 | 70.9 | 66.3 | 66.3 | 69.5 | 66.3 | 67.2 | 1554 |
| yeast [30] | 29.7 | 34.4 | 42.9 | 36.9 | 35.9 | 44.4 | 29.7 | 31.0 | 1334 |
| cmc [35] | 46.6 | 45.1 | 47.8 | 46.6 | 47.3 | 46.3 | 35.5 | 43.5 | 1324 |
| nursery [40] | 31.3 | 77.4 | 63.8 | 77.1 | 77.1 | 78.9 | 77.3 | 71.0 | 11664 |
| wifi [47] | 15.2 | 88.8 | 93.3 | 95.3 | 95.8 | 93.3 | 77.2 | 82.9 | 1799 |
| chess [23] | 16.5 | 20.2 | 20.4 | 19.1 | 19.5 | 20.2 | 19.2 | 22.5 | 25249 |
| Average | 35.5 | 54.3 | 56.1 | 55.5 | 56.8 | 57.8 | 49.9 | 52.3 | N/A |

The best results are achieved by the new RF mechanisms, with Geom RF and OPRF performing particularly well. OPRF shows the best average performance, therefore our recommendation for practitioners is to opt for this method. For the same reason, we focus on the FAVOR++ variant (OPRF with orthogonal random projections for attention approximation) in our Transformer experiments below.
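The following is a sketch of the RF-based non-parametric classifier described above (NumPy; `rf_kernel_classifier` is a hypothetical helper name, and any feature map from the previous sections can be plugged in): after an $O(nLM)$ precomputation, each prediction costs $O(nM)$, and the kernel-sum denominator is dropped because it does not change the argmax.

```python
import numpy as np

def rf_kernel_classifier(train_x, train_labels, feature_fn, omega, sigma, n_classes):
    # train_x: (L, d) float array; train_labels: (L,) integer class indices.
    Phi = feature_fn(sigma * train_x, omega)        # (L, M) features of the training set
    R = np.eye(n_classes)[train_labels]             # (L, n) one-hot labels
    S = Phi.T @ R                                   # (M, n), precomputed once: O(nLM)

    def predict(test_x):
        phi = feature_fn(sigma * test_x, omega)     # (B, M)
        scores = phi @ S                            # approximates sum_i K(.,.) r^(i): O(nM) each
        return np.argmax(scores, axis=-1)           # denominator and 1/M factor do not affect argmax

    return predict

# Usage with the OPRF features sketched earlier (A, B, C, D and omega as defined there):
# predict = rf_kernel_classifier(X_train, y_train,
#                                lambda Z, w: oprf_features(Z, w, A, B, C, D),
#                                omega, sigma, n_classes)
```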
5.3 FAVOR++ in scalable Transformers

5.3.1 Natural language processing

In this setting, we test different low-rank attention Transformers on the General Language Understanding Evaluation (GLUE) benchmark [57], consisting of 8 different natural language understanding tasks with sequence lengths ranging from 32 to 128. We used the same training parameters as mentioned in [20] (see Appendix 9.10.3 for details). We compared the FAVOR+ [15], ELU [31] and ReLU [15] variants of Performers [15] against a FAVOR++ variant and report the results in Table 2. We find that FAVOR++ outperforms all these low-rank Transformers in most GLUE tasks. In particular, FAVOR++ outperforms FAVOR+ on all GLUE tasks, demonstrating the downstream effectiveness of the variance reduction of the softmax kernel estimation. Furthermore, warm-starting with a pre-trained BERT-base model checkpoint [20] (Uptrain FAVOR++ in Table 2) further improves performance, demonstrating the backward compatibility of FAVOR++ with the exact softmax kernel.

Table 2: GLUE Dev results on base-sized models. The number of training examples is reported next to each task. MCC score is reported for CoLA, F1 score is reported for MRPC, Spearman correlation is reported for STS-B, and accuracy scores are reported for the other tasks. Best and second-best results were highlighted in the original.

| System | MNLI (392k) | QQP (363k) | QNLI (108k) | SST-2 (67k) | CoLA (8.5k) | STS-B (5.7k) | MRPC (3.5k) | RTE (2.5k) |
|---|---|---|---|---|---|---|---|---|
| FAVOR+ [15] | 80.26 | 89.53 | 87.13 | 90.58 | 53.17 | 85.07 | 83.82 | 67.59 |
| ELU [31] | 80.72 | 90.05 | 89.09 | 91.51 | 48.43 | 86.68 | 85.05 | 68.59 |
| ReLU [15] | 81.39 | 90.11 | 88.85 | 91.97 | 52.08 | 87.64 | 84.56 | 67.51 |
| FAVOR++ | 81.25 | 90.15 | 89.58 | 92.00 | 54.95 | 85.62 | 85.78 | 67.87 |
| Uptrain FAVOR++ | 82.29 | 90.43 | 89.73 | 92.20 | 58.85 | 85.90 | 88.73 | 67.63 |

5.3.2 Speech modelling

We compare FAVOR++ with FAVOR+ on speech models with the LibriSpeech ASR corpus [42]. We apply both to approximate attention blocks in the 17-layer Conformer-Transducer encoder [26] with only 4 attention heads and use the word error rate (WER) metric, a standard way to evaluate speech models. In both cases ($M = 8$ and $M = 16$ RFs), FAVOR++ outperforms FAVOR+, as shown in Figure 3. The WER improvement for FAVOR++ is substantial: 2.49% for $M = 8$ and 3.05% for $M = 16$, with a negligible $O(Ld) \ll O(LMd)$ overhead for computing (8) compared to FAVOR+.

Figure 3: Comparison of the Conformer-Transducer encoder with FAVOR++ and FAVOR+ attention on the LibriSpeech [42] corpus for $M = 16$ and $M = 8$ RFs. We used the common word error rate (WER) metric.

5.3.3 Vision Transformers

To further showcase the need for a more accurate softmax kernel approximation, we compare the performance of FAVOR+ and FAVOR++ on ImageNet [18]. We inject both mechanisms into the attention modules of Vision Transformers (ViT [21]). In Figure 4, we show the results of training from scratch and uptraining from the MAE checkpoint [29]. We see that, as opposed to FAVOR+, FAVOR++ is more stable and is able to improve performance, especially for uptraining, demonstrating backward compatibility with the exact softmax kernel. Finally, we compare the computational complexity of FAVOR+ and FAVOR++. In Figure 4, the right plot shows the number of steps per second as a function of the sequence length $L$ on the same hardware.

Figure 4: Image-Transformer experiments. Left: Accuracy of training FAVOR+ and FAVOR++ on ImageNet from scratch and fine-tuning from softmax MAE pre-trained weights (Uptraining). Right: Comparing sequence length vs. the number of steps per second for FAVOR+, FAVOR++ and the regular Transformer variant (Softmax).
We see that attention modules using FAVOR+ and FAVOR++ have very similar computation time (both provide linear attention). Moreover, for sequence lengths above 1000, training a regular ViT model became increasingly difficult due to out-of-memory errors.

6 Limitations of this work & broader impact

Several of the mechanisms proposed in this paper can be further extended, potentially leading to even more accurate algorithms. For instance, it remains an open question how to choose theoretically optimal parameters for the GERF and Geom RF mechanisms. Furthermore, DIRFs can benefit from optimizing the discrete distributions defining them (to minimize the variance of the estimation) rather than choosing them a priori. Our methods should be used responsibly, given rising concerns regarding the carbon footprint of training massive Transformer models and other societal issues [49, 8, 3, 60].

7 Conclusion

We presented a new class of RF mechanisms called chefs' random tables (CRTs), including methods providing positivity and boundedness of random features, two key properties for new applications of RFs in Transformer training. We provided comprehensive theoretical results and extensive empirical evaluation, resulting in particular in new state-of-the-art low-rank attention Transformers for text.

8 Acknowledgements

V. L. acknowledges support from the Cambridge Trust and DeepMind. V. L. was part-time employed by Google while a PhD student. A. W. acknowledges support from a Turing AI Fellowship under EPSRC grant EP/V025279/1, The Alan Turing Institute, and the Leverhulme Trust via CFI.

References

[1] Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 253-262. PMLR, 2017.

[2] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer, 2020.

[3] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 610-623, New York, NY, USA, 2021. Association for Computing Machinery.

[4] Nicholas M. Boffi, Stephen Tu, and Jean-Jacques E. Slotine. Nonparametric adaptive control and prediction: Theory and randomized algorithms. In 60th IEEE Conference on Decision and Control, CDC 2021, Austin, TX, USA, December 14-17, 2021, pages 2935-2942. IEEE, 2021.

[5] Marko Bohanec and Vladislav Rajkovič. Knowledge acquisition and explanation for multi-attribute decision making. In 8th International Workshop on Expert Systems and Their Applications, 1988.

[6] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.

[7] Richard P. Brent. An algorithm with guaranteed convergence for finding a zero of a function. Comput. J., 14:422-425, 1971.

[8] Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel.
Extracting training data from large language models. CoRR, abs/2012.07805, 2020.

[9] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. J. Mach. Learn. Res., 12:1069-1109, 2011.

[10] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers, 2019.

[11] Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Yoshua Bengio, Dale Schuurmans, John D. Lafferty, Christopher K. I. Williams, and Aron Culotta, editors, Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada, pages 342-350. Curran Associates, Inc., 2009.

[12] Krzysztof Choromanski, Haoxian Chen, Han Lin, Yuanzhe Ma, Arijit Sehanobish, Deepali Jain, Michael S. Ryoo, Jake Varley, Andy Zeng, Valerii Likhosherstov, Dmitry Kalashnikov, Vikas Sindhwani, and Adrian Weller. Hybrid random features. In International Conference on Learning Representations (ICLR), 2022.

[13] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Jared Davis, Tamás Sarlós, David Belanger, Lucy J. Colwell, and Adrian Weller. Masked language modeling for proteins via linearly scalable long-context transformers. CoRR, abs/2006.03555, 2020.

[14] Krzysztof Choromanski, Mark Rowland, Tamás Sarlós, Vikas Sindhwani, Richard E. Turner, and Adrian Weller. The geometry of random features. In Amos J. Storkey and Fernando Pérez-Cruz, editors, International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, volume 84 of Proceedings of Machine Learning Research, pages 1-9. PMLR, 2018.

[15] Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking attention with performers. In International Conference on Learning Representations, 2021.

[16] Krzysztof Marcin Choromanski, Mark Rowland, and Adrian Weller. The unreasonable effectiveness of structured random orthogonal embeddings. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 219-228, 2017.

[17] Sankalan Pal Chowdhury, Adamos Solomou, Avinava Dubey, and Mrinmaya Sachan. On learning the transformer kernel. CoRR, abs/2110.08323, 2021.

[18] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pages 248-255. IEEE Computer Society, 2009.

[19] Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141-142, 2012.

[20] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[21] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[22] Dheeru Dua and Casey Graff. Banknote authentication data set, UCI machine learning repository, 2017.

[23] Dheeru Dua and Casey Graff. Chess (king-rook vs. king) data set, UCI machine learning repository, 2017.

[24] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.

[25] Lukas Gonon. Random feature neural networks learn Black-Scholes type PDEs without curse of dimensionality. CoRR, abs/2106.08900, 2021.

[26] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. Conformer: Convolution-augmented transformer for speech recognition. In Helen Meng, Bo Xu, and Thomas Fang Zheng, editors, Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, pages 5036-5040. ISCA, 2020.

[27] Insu Han, Haim Avron, Neta Shoham, Chaewon Kim, and Jinwoo Shin. Random features for the neural tangent kernel. CoRR, abs/2104.01351, 2021.

[28] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357-362, September 2020.

[29] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.

[30] Paul Horton and Kenta Nakai. A probabilistic classification system for predicting the cellular localization sites of proteins. In Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, pages 109-115. AAAI Press, 1996.

[31] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, 2020.

[32] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020.

[33] Valero Laparra, Diego Marcos Gonzalez, Devis Tuia, and Gustau Camps-Valls. Large-scale random features for kernel regression. In 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 17-20, 2015.

[34] Zhu Li, Jean-Francois Ton, Dino Oglic, and Dino Sejdinovic. Towards a unified analysis of random Fourier features. J. Mach. Learn. Res., 22:108:1-108:51, 2021.

[35] T. S. Lim, Wei-Yin Loh, and Yu-Shan Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40:203-228, 09 2000.

[36] Ha Quang Minh. Operator-valued Bochner theorem, Fourier feature maps for operator-valued kernels, and vector-valued learning. CoRR, abs/1608.05639, 2016.

[37] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.

[38] E. A.
Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141-142, 1964.

[39] Warwick J. Nash and Tasmania. The Population biology of abalone (Haliotis species) in Tasmania. 1, Blacklip abalone (H. rubra) from the north coast and the islands of Bass Strait / Warwick J. Nash ... [et al.]. Sea Fisheries Division, Dept. of Primary Industry and Fisheries, Tasmania, Hobart, 1994.

[40] Manuel Olave, Vladislav Rajkovic, and Marko Bohanec. An application for admission in public school systems. Expert Systems in Public Administration, 1:145-160, 1989.

[41] Junier B. Oliva, Willie Neiswanger, Barnabás Póczos, Eric P. Xing, Hy Trac, Shirley Ho, and Jeff G. Schneider. Fast function to function regression. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12, 2015, volume 38 of JMLR Workshop and Conference Proceedings. JMLR.org, 2015.

[42] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, pages 5206-5210. IEEE, 2015.

[43] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4055-4064, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR.

[44] A. Rahimi and B. Recht. Uniform approximation of functions with random bases. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing, Los Alamitos, CA, USA, September 2008. IEEE Computer Society.

[45] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.

[46] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, editors, Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pages 1313-1320. Curran Associates, Inc., 2008.

[47] Jayant Rohra, Boominathan Perumal, Swathi J.N., Priya Thakur, and Rajen Bhatt. User Localization in an Indoor Environment Using Fuzzy Hybrid of Particle Swarm Optimization and Gravitational Search Algorithm with Neural Networks, pages 286-295. 02 2017.

[48] Bharath K. Sriperumbudur and Zoltán Szabó. Optimal rates for random Fourier features. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1144-1152, 2015.

[49] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. CoRR, abs/1906.02243, 2019.

[50] Yitong Sun, Anna C. Gilbert, and Ambuj Tewari. But how does it work in theory? Linear SVM with random features. In Samy Bengio, Hanna M.
Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 3383-3392, 2018.

[51] Danica J. Sutherland and Jeff G. Schneider. On the error of random Fourier features. In Marina Meila and Tom Heskes, editors, Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI 2015, July 12-16, 2015, Amsterdam, The Netherlands, pages 862-871. AUAI Press, 2015.

[52] Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer: Rethinking self-attention in transformer models, 2021.

[53] Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse sinkhorn attention. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 9438-9447. PMLR, 13-18 Jul 2020.

[54] Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient transformers. In International Conference on Learning Representations, 2021.

[55] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 5998-6008. Curran Associates, Inc., 2017.

[56] Guido Walz. Lexikon der Mathematik: Band 2: Eig bis Inn. Springer-Verlag, 2016.

[57] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

[58] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity, 2020.

[59] Geoffrey S. Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 26(4):359-372, 1964.

[60] Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021.

[61] Jiaxuan Xie, Fanghui Liu, Kaijie Wang, and Xiaolin Huang. Deep kernel learning via random Fourier features. CoRR, abs/1910.02660, 2019.

[62] Jiyan Yang, Vikas Sindhwani, Quanfu Fan, Haim Avron, and Michael W. Mahoney. Random Laplace feature maps for semigroup kernels on histograms. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 971-978. IEEE Computer Society, 2014.

[63] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, pages 485-493, 2012.
[64] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big Bird: Transformers for longer sequences. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17283-17297. Curran Associates, Inc., 2020.

[65] Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw., 23(4):550-560, 1997.

[66] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In IEEE International Conference on Computer Vision, pages 19-27, 2015.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Section 6.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Section 6.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] See the statements of Theorems 3.1, 3.2, 3.3, 3.4, 3.5, 4.1, 4.3.
   (b) Did you include complete proofs of all theoretical results? [Yes] See Appendices 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8, 9.9.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We include the part of the code that is not confidential, the core CRT variant: the FAVOR++ mechanism.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 5 and Appendix 9.10.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Figure 2 and Table 3 in the Appendix.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] We mention that in Appendix 9.10. We don't report the total amount of compute, because that would violate internal policies of our organisation.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [N/A] We use standard publicly available datasets and frameworks.
   (c) Did you include any new assets either in the supplemental material or as a URL? [No]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A] We only use standard publicly available datasets.
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] We only use standard publicly available datasets.
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable?
   [N/A] We didn't use crowdsourcing.
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]