# HYBRID RANDOM FEATURES

Published as a conference paper at ICLR 2022

Krzysztof Choromanski1,2, Haoxian Chen2, Han Lin2, Yuanzhe Ma2, Arijit Sehanobish†, Deepali Jain1, Michael S. Ryoo1, Jake Varley1, Andy Zeng1, Valerii Likhosherstov1,3, Dmitry Kalashnikov1, Vikas Sindhwani1, Adrian Weller3,4

1Google Brain Robotics, 2Columbia University, 3University of Cambridge, 4The Alan Turing Institute

Equal contribution; correspondence to kchoro@google.com. Authorship in alphabetical order. †Independent researcher; work done during a postdoc at Yale University.

We propose a new class of random feature methods for linearizing softmax and Gaussian kernels called hybrid random features (HRFs) that automatically adapt the quality of kernel estimation to provide the most accurate approximation in defined regions of interest. Special instantiations of HRFs lead to well-known methods such as trigonometric random features (Rahimi & Recht, 2007) or the positive random features recently introduced in the context of linear-attention Transformers (Choromanski et al., 2021b). By generalizing Bochner's Theorem for softmax/Gaussian kernels and leveraging random features for compositional kernels, the HRF mechanism provides strong theoretical guarantees: unbiased approximation and strictly smaller worst-case relative errors than its counterparts. We conduct an exhaustive empirical evaluation of HRFs, ranging from pointwise kernel estimation experiments, through tests on data admitting clustering structure, to benchmarking implicit-attention Transformers (also for downstream Robotics applications), demonstrating their quality in a wide spectrum of machine learning problems.

1 INTRODUCTION & RELATED WORK

Consider the softmax and Gaussian kernel functions $\mathrm{K}: \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}$ defined as follows:

$$\mathrm{SM}(x, y) \overset{\mathrm{def}}{=} \exp(x^{\top}y), \qquad \mathrm{K}_{\mathrm{gauss}}(x, y) \overset{\mathrm{def}}{=} \exp\Big(-\frac{\|x - y\|_{2}^{2}}{2}\Big). \tag{1}$$

These two are prominent examples of functions used in so-called kernel methods (Gretton et al., 2005; Zhang et al., 2018) and beyond, e.g. in softmax sampling (Blanc & Rendle, 2018). Random features (RFs, Rahimi & Recht, 2007; Liu et al., 2020; Peng et al., 2021) yield a powerful mechanism for linearizing, and consequently scaling up, kernel methods: they provide dot-product decompositions disentangling $x$ from $y$ in the formula for the kernel value, $\mathrm{K}(x, y) \approx \phi(x)^{\top}\phi(y)$, via data-agnostic probabilistic (random feature) maps $\phi: \mathbb{R}^{d} \rightarrow \mathbb{R}^{m}$. The tight relationship between the softmax and Gaussian kernels, given by the transformation $\mathrm{K}_{\mathrm{gauss}}(x, y) = \exp(-\frac{\|x\|_{2}^{2}}{2})\mathrm{SM}(x, y)\exp(-\frac{\|y\|_{2}^{2}}{2})$, maps any random feature vector $\phi_{\mathrm{SM}}(u)$ for the softmax kernel to the corresponding one $\phi_{\mathrm{gauss}}(u) = \exp(-\frac{\|u\|_{2}^{2}}{2})\phi_{\mathrm{SM}}(u)$ for the Gaussian kernel; thus we focus on the former. The classic random feature map mechanism $\phi^{\mathrm{trig}}_{m}$ for the softmax kernel, obtained from Bochner's Theorem applied to the Gaussian kernel (Rahimi & Recht, 2007), is of the form:

$$\phi^{\mathrm{trig}}_{m}(u) = \frac{1}{\sqrt{m}}\exp\Big(\frac{\|u\|^{2}}{2}\Big)\big(\sin(\omega_{1}^{\top}u), ..., \sin(\omega_{m}^{\top}u), \cos(\omega_{1}^{\top}u), ..., \cos(\omega_{m}^{\top}u)\big)^{\top}, \tag{2}$$

where $m$ stands for the number of random features and $\omega_{1}, ..., \omega_{m} \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, \mathrm{I}_{d})$. This method for linearizing softmax/Gaussian kernels, common in most downstream use-cases, was recently shown to fail in some of the new impactful applications of scalable kernel methods, such as the implicit-attention Transformer architectures called Performers (Choromanski et al., 2021b). We denote by MSE the mean squared error of an estimator (i.e. its variance, since all estimators considered in this paper are unbiased). The above mechanism struggles to accurately approximate close-to-zero kernel values, as characterized by a particularly large MSE in that region.
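To make Equation 2 concrete, below is a minimal NumPy sketch of the trigonometric random feature map together with a toy unbiasedness check (the function name phi_trig and the test values are ours, not part of the paper's released code):

```python
import numpy as np

def phi_trig(U, omega):
    """Trigonometric random feature map (Eq. 2), applied row-wise to inputs U.

    U: (batch, d) inputs; omega: (m, d) Gaussian projections ~ N(0, I_d).
    Returns a (batch, 2m) matrix such that phi_trig(x) @ phi_trig(y) is an
    unbiased estimate of SM(x, y) = exp(x^T y).
    """
    m = omega.shape[0]
    proj = U @ omega.T                                        # omega_i^T u
    scale = np.exp(np.sum(U**2, axis=1, keepdims=True) / 2) / np.sqrt(m)
    return scale * np.concatenate([np.sin(proj), np.cos(proj)], axis=1)

rng = np.random.default_rng(0)
d, m = 16, 4096
x, y = rng.normal(size=d) / np.sqrt(d), rng.normal(size=d) / np.sqrt(d)
omega = rng.normal(size=(m, d))
approx = phi_trig(x[None], omega) @ phi_trig(y[None], omega).T
print(float(approx), np.exp(x @ y))                           # the two numbers should be close
```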
This inaccuracy on near-zero kernel values is a crucial problem, since most of the entries of the attention matrices in Transformer models are very small and the approximators need to be particularly accurate there. Otherwise the renormalizers computed to make attention matrices row-stochastic (the standard attention normalization procedure in Transformers) would be estimated imprecisely, potentially even by negative values. The solution to this problem was presented in the FAVOR+ mechanism (Choromanski et al., 2021b), where a new positive random feature map for unbiased softmax kernel estimation was applied:

$$\phi^{++}_{m}(u) = \frac{1}{\sqrt{2m}}\exp\Big(-\frac{\|u\|^{2}}{2}\Big)\big(\exp(\omega_{1}^{\top}u), ..., \exp(\omega_{m}^{\top}u), \exp(-\omega_{1}^{\top}u), ..., \exp(-\omega_{m}^{\top}u)\big)^{\top}. \tag{3}$$

Even though it is very accurate in estimating small softmax kernel values (which turns out to be crucial for making RFs work in Transformer training), this mechanism is characterized by a larger MSE for large kernel values.
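For comparison with Equation 2, here is the analogous sketch for the positive random feature map of Equation 3 (again, the function name and toy check are ours):

```python
import numpy as np

def phi_positive(U, omega):
    """Positive random feature map (Eq. 3), applied row-wise to inputs U.

    U: (batch, d) inputs; omega: (m, d) Gaussian projections ~ N(0, I_d).
    All features are positive, so estimated softmax values are never negative.
    """
    m = omega.shape[0]
    proj = U @ omega.T
    scale = np.exp(-np.sum(U**2, axis=1, keepdims=True) / 2) / np.sqrt(2 * m)
    return scale * np.concatenate([np.exp(proj), np.exp(-proj)], axis=1)

rng = np.random.default_rng(1)
d, m = 16, 4096
x, y = rng.normal(size=d) / np.sqrt(d), rng.normal(size=d) / np.sqrt(d)
omega = rng.normal(size=(m, d))
approx = phi_positive(x[None], omega) @ phi_positive(y[None], omega).T
print(float(approx), np.exp(x @ y))                           # unbiased estimate vs. exact value
```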
In several applications of softmax kernels (in particular Transformers, where attention matrices typically admit a sparse combinatorial structure with relatively few but critical large entries and many close-to-zero ones, or softmax sampling), the algorithm needs to process very small and large kernel values simultaneously. The following natural questions arise:

Is it possible to get the best of both mechanisms and obtain RF-based estimators that are particularly accurate for both very small and large softmax kernel values? Furthermore, can those estimators be designed to have low variance in more general, pre-defined regions?

We give affirmative answers to both of the above questions by constructing a new class of random feature map techniques called hybrid random features or HRFs. The theoretical methods we use to develop HRFs: (a) provide a unifying perspective, in which trigonometric random features from Bochner's Theorem and the novel mechanisms proposed in (Choromanski et al., 2021b) are just special corollaries of a more general result from complex analysis, (b) integrate in an original way several other powerful probabilistic techniques, such as the Goemans-Williamson method (Goemans & Williamson, 2004) and random features for compositional kernels (Daniely et al., 2017). We provide a detailed theoretical analysis of HRFs, showing in particular that they provide strictly more accurate worst-case softmax kernel estimation than previous algorithms and lead to computational gains. We also conduct a thorough empirical evaluation on tasks ranging from pointwise kernel estimation to downstream problems involving training Transformer models or even end-to-end robotic controller stacks including attention-based architectures.

Related Work: The literature on different random feature map mechanisms for Gaussian (and thus also softmax) kernel estimation is voluminous. Most focus has been put on reducing the variance of trigonometric random features from (Rahimi & Recht, 2007) via various Quasi Monte Carlo (QMC) methods, where the directions and/or lengths of the Gaussian vectors used to produce features are correlated, often through geometric conditions such as orthogonality (Choromanski et al., 2017; Rowland et al., 2018; Yu et al., 2016; Choromanski et al., 2019; Choromanski & Sindhwani, 2016). Our HRFs do not compete with those techniques (and can in fact easily be combined with them): rather than improving the sampling mechanism of a given approximation algorithm, they provide a completely new algorithm. The new application of random features for the softmax kernel in Transformers proposed in (Choromanski et al., 2020; 2021b) led to fruitful research on the extensions and limitations of these methods. Schlag et al. (2021) replaced random features with sparse deterministic constructions (no longer approximating the softmax kernel). Luo et al. (2021) observed that combining L2-normalization of queries and keys for variance reduction of softmax kernel estimation with FFT-based implementations of relative position encoding and the FAVOR+ mechanism from Performers helps in training. Trigonometric random features were applied to softmax sampling in (Rawat et al., 2019). Several other techniques, such as the Nyström method (Yang et al., 2012; Williams & Seeger, 2000; Rudi et al., 2015), were proposed to construct data-dependent feature representations. Even though, as we show in Sec. 2.4, certain instantiations of the HRF mechanism clearly benefit from some data analysis, our central goal remains unbiased estimation of the softmax/Gaussian kernels, which is no longer the case for those other techniques.

2 HYBRID RANDOM FEATURES

2.1 PRELIMINARIES

Whenever we do not explicitly say otherwise, all presented lemmas and theorems are new. We start with the following basic definitions and results.

Definition 2.1 (Kernel with a Random Feature Map Representation). We say that a kernel function $\mathrm{K}: \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}$ admits a random feature (RF) map representation if it can be written as

$$\mathrm{K}(x, y) = \mathbb{E}_{\omega \sim \Omega}\Big[\sum_{i=1}^{l} \xi_{i}(x, \omega)\xi_{i}(y, \omega)\Big] \tag{4}$$

for some $\xi_{i}: \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}$, where $\omega$ is sampled from some probabilistic distribution $\Omega \in \mathcal{P}(\mathbb{R}^{d})$. The corresponding random feature map, for a given $m \in \mathbb{N}$, is defined as

$$\phi_{m}(u) = \frac{1}{\sqrt{m}}\big(\phi^{1}_{m}(u) \oplus ... \oplus \phi^{l}_{m}(u)\big) \in \mathbb{R}^{ml}, \tag{5}$$

where $\phi^{i}_{m}(u) = (\xi_{i}(u, \omega_{1}), ..., \xi_{i}(u, \omega_{m}))^{\top}$, $\oplus$ stands for vertical concatenation, and $\omega_{1}, ..., \omega_{m} \overset{\mathrm{iid}}{\sim} \Omega$.

Random feature maps can be used to unbiasedly approximate the corresponding kernels, as follows:

$$\widehat{\mathrm{K}}(x, y) = \phi_{m}(x)^{\top}\phi_{m}(y). \tag{6}$$

Using the above notation, the trigonometric random feature map $\phi^{\mathrm{trig}}_{m}$ from Equation 2 can be encoded by taking $l = 2$, $\xi_{1}(u, \omega) = \exp(\frac{\|u\|^{2}}{2})\sin(\omega^{\top}u)$, $\xi_{2}(u, \omega) = \exp(\frac{\|u\|^{2}}{2})\cos(\omega^{\top}u)$. Similarly, positive random features can be encoded by taking $l = 2$ and $\xi_{1}(u, \omega) = \frac{1}{\sqrt{2}}\exp(-\frac{\|u\|^{2}}{2})\exp(\omega^{\top}u)$, $\xi_{2}(u, \omega) = \frac{1}{\sqrt{2}}\exp(-\frac{\|u\|^{2}}{2})\exp(-\omega^{\top}u)$.

The following result from (Choromanski et al., 2021b) shows that the mean squared error (MSE) of the trigonometric estimator is small for large softmax kernel values and large for small softmax kernel values, whereas an estimator applying positive random features behaves in the opposite way. Denote by $\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}(x, y)$ the estimator of $\mathrm{SM}(x, y)$ for $x, y \in \mathbb{R}^{d}$ using trigonometric RFs and $\omega_{1}, ..., \omega_{m} \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, \mathrm{I}_{d})$. Denote by $\widehat{\mathrm{SM}}^{++}_{m}(x, y)$ its analogue using positive RFs. We have:

Lemma 2.2 (positive versus trigonometric RFs). Take $\Delta = x - y$, $z = x + y$, $f_{1}(u) = (2m)^{-1}\exp(u^{2})\mathrm{SM}^{-2}(x, y)$, $f_{2}(u) = (2m)^{-1}\exp(u^{2})\mathrm{SM}^{2}(x, y)$ and $f_{3}(u) = (1 - \exp(-u^{2}))^{2}$. The MSEs of these estimators are:

$$\mathrm{MSE}(\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}(x, y)) = f_{1}(\|z\|_{2})f_{3}(\|\Delta\|_{2}), \qquad \mathrm{MSE}(\widehat{\mathrm{SM}}^{++}_{m}(x, y)) = f_{2}(\|z\|_{2})f_{3}(\|z\|_{2}). \tag{7}$$
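As a quick numerical sanity check of Lemma 2.2, the sketch below (ours) compares Monte Carlo MSEs of both estimators against the closed-form expressions; it uses the algebraically equivalent scalar forms of the feature-map dot products (a mean of cosines for trigonometric RFs, a mean of hyperbolic cosines for positive RFs):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, trials = 8, 16, 20000
x, y = rng.normal(size=d) * 0.3, rng.normal(size=d) * 0.3
sm, z, delta = np.exp(x @ y), x + y, x - y

# Closed-form MSEs from Lemma 2.2.
f1 = lambda u: np.exp(u**2) * sm**(-2) / (2 * m)
f2 = lambda u: np.exp(u**2) * sm**2 / (2 * m)
f3 = lambda u: (1 - np.exp(-u**2))**2
mse_trig_cf = f1(np.linalg.norm(z)) * f3(np.linalg.norm(delta))
mse_pos_cf = f2(np.linalg.norm(z)) * f3(np.linalg.norm(z))

# Monte Carlo MSEs: phi_trig(x) . phi_trig(y) = exp((|x|^2+|y|^2)/2) * mean(cos(w^T(x-y))),
# phi_positive(x) . phi_positive(y) = exp(-(|x|^2+|y|^2)/2) * mean(cosh(w^T(x+y))).
omegas = rng.normal(size=(trials, m, d))
est_trig = np.exp((x @ x + y @ y) / 2) * np.mean(np.cos(omegas @ delta), axis=1)
est_pos = np.exp(-(x @ x + y @ y) / 2) * np.mean(np.cosh(omegas @ z), axis=1)
print(np.mean((est_trig - sm)**2), mse_trig_cf)   # each printed pair of numbers
print(np.mean((est_pos - sm)**2), mse_pos_cf)     # should approximately agree
```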
2.2 THE ALGORITHM

We are ready to present the mechanism of Hybrid Random Features (HRFs). Denote by $\mathcal{E} = (\widehat{\mathrm{SM}}_{k}(x, y))_{k=1}^{p+1}$ a list of estimators of $\mathrm{SM}(x, y)$ (the so-called base estimators) and by $\Lambda = (\widehat{\lambda}_{k}(x, y))_{k=1}^{p}$ a list of estimators of $\{\lambda_{k}(x, y)\}_{k=1}^{p}$ for some functions $\lambda_{k}: \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow [0, 1]$, constructed independently from $\mathcal{E}$. Take the following estimator of $\mathrm{SM}(x, y)$:

$$\widehat{\mathrm{SM}}_{\mathcal{E},\Lambda}(x, y) = \sum_{k=1}^{p} \widehat{\lambda}_{k}(x, y)\widehat{\mathrm{SM}}_{k}(x, y) + \Big(1 - \sum_{k=1}^{p} \widehat{\lambda}_{k}(x, y)\Big)\widehat{\mathrm{SM}}_{p+1}(x, y). \tag{8}$$

In the next section, we explain in detail how the base estimators are chosen. We call $\widehat{\mathrm{SM}}_{\mathcal{E},\Lambda}(x, y)$ a hybrid random feature (HRF) estimator of $\mathrm{SM}(x, y)$ parameterized by $\mathcal{E}, \Lambda$. The role of the $\lambda$-coefficients is to dynamically (based on the input $(x, y)$) prioritize or deprioritize certain estimators, promoting those characterized by lower variance for the given input. Note that if the elements of $\mathcal{E}$ are unbiased estimators of $\mathrm{SM}(x, y)$, then trivially $\widehat{\mathrm{SM}}_{\mathcal{E},\Lambda}(x, y)$ is also an unbiased estimator of $\mathrm{SM}(x, y)$.

Assume that each $\widehat{\mathrm{SM}}_{k}(x, y)$ is of the form $\widehat{\mathrm{SM}}_{k}(x, y) = (\phi^{k}_{1,m}(x))^{\top}\phi^{k}_{2,m}(y)$ for $\phi^{k}_{j,m}(u) = \frac{1}{\sqrt{m}}\big(\phi^{1,k}_{j,m}(u) \oplus ... \oplus \phi^{t_{k},k}_{j,m}(u)\big)$, $t_{k} > 0$ and $\phi^{1,k}_{j,m}, ..., \phi^{t_{k},k}_{j,m}: \mathbb{R}^{d} \rightarrow \mathbb{R}^{m}$, where $j \in \{1, 2\}$. Assume also that $\lambda_{k}(x, y)$ can be written as:

$$\lambda_{k}(x, y) = a_{k} + \mathbb{E}_{\tau \sim \Omega}\Big[\sum_{i=1}^{l_{k}} f^{i}_{1,k}(x, \tau)f^{i}_{2,k}(y, \tau)\Big] \tag{9}$$

for some scalars $a_{k} \in \mathbb{R}$, a distribution $\Omega \in \mathcal{P}(\mathbb{R}^{d})$ (where $\mathcal{P}(\mathbb{R}^{d})$ stands for the set of probabilistic distributions on $\mathbb{R}^{d}$) and mappings $f^{i}_{j,k}: \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}$, and that the corresponding estimator $\widehat{\lambda}_{k}(x, y) = \widehat{\lambda}^{k}_{n}(x, y)$ of $\lambda_{k}(x, y)$ is of the form:

$$\widehat{\lambda}^{k}_{n}(x, y) = a_{k} + (\rho^{k}_{1,n}(x))^{\top}\rho^{k}_{2,n}(y) \tag{10}$$

for $\rho^{k}_{j,n}(u) = \frac{1}{\sqrt{n}}\big(\rho^{1,k}_{j,n}(u) \oplus ... \oplus \rho^{l_{k},k}_{j,n}(u)\big)$, $\rho^{i,k}_{j,n}(u) = (f^{i}_{j,k}(u, \tau_{1}), ..., f^{i}_{j,k}(u, \tau_{n}))^{\top}$, $\tau_{1}, ..., \tau_{n} \overset{\mathrm{iid}}{\sim} \Omega$, and where $j \in \{1, 2\}$. Linearization of the $\lambda$-coefficients, as given by Equation 9, is crucial to obtain linearization of the hybrid estimators and consequently a random feature map decomposition.

We denote a hybrid estimator using $m$ random features for its base estimators and $n$ random features to approximate the $\lambda$-coefficients by $\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m,n}$. Furthermore, for two vectors $u, v$ we denote their vectorized outer-product by $u \otimes v$. Finally, we denote by $\bigoplus$ vectors-concatenation. The estimator $\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m,n}(x, y)$ can be rewritten as a dot-product of two (hybrid) random feature vectors, as the next lemma shows.

Lemma 2.3. The HRF estimator $\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m,n}(x, y)$ satisfies $\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m,n}(x, y) = \Psi_{1}(x)^{\top}\Psi_{2}(y)$, where $\Psi_{j}$ for $j \in \{1, 2\}$ is given as $\Psi_{j}(z) = \Psi^{1}_{j}(z) \oplus \Psi^{2}_{j}(z) \oplus \Psi^{3}_{j}(z) \oplus \Psi^{4}_{j}(z)$ and:

$$\begin{aligned}
\Psi^{1}_{j}(z) &= \bigoplus_{k=1}^{p} \sqrt{\tfrac{a_{k}}{m}}\,\big(\phi^{1,k}_{j,m}(z) \oplus ... \oplus \phi^{t_{k},k}_{j,m}(z)\big), &
\Psi^{2}_{j}(z) &= \bigoplus_{k=1}^{p}\ \bigoplus_{(i,l) \in \{1,...,l_{k}\} \times \{1,...,t_{k}\}} \tfrac{1}{\sqrt{mn}}\,\rho^{i,k}_{j,n}(z) \otimes \phi^{l,k}_{j,m}(z), \\
\Psi^{3}_{j}(z) &= \sqrt{\tfrac{1 - \sum_{k=1}^{p} a_{k}}{m}}\,\big(\phi^{1,p+1}_{j,m}(z) \oplus ... \oplus \phi^{t_{p+1},p+1}_{j,m}(z)\big), &
\Psi^{4}_{j}(z) &= \bigoplus_{k=1}^{p}\ \bigoplus_{(i,l) \in \{1,...,l_{k}\} \times \{1,...,t_{p+1}\}} \tfrac{\mathrm{i}}{\sqrt{mn}}\,\rho^{i,k}_{j,n}(z) \otimes \phi^{l,p+1}_{j,m}(z),
\end{aligned} \tag{11}$$

where $\mathrm{i}$ is the imaginary unit ($\mathrm{i}^{2} = -1$), so that the dot product of the $\Psi^{4}$-blocks contributes the subtracted correction term.

Bipolar estimators: A prominent special case of the general hybrid estimator defined above is the one where $\mathcal{E} = (\widehat{\mathrm{SM}}^{++}(x, y), \widehat{\mathrm{SM}}^{\mathrm{trig}}(x, y))$. Thus consider the following estimator $\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m,n}$:

$$\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m,n}(x, y) = \widehat{\lambda}_{n}(x, y)\widehat{\mathrm{SM}}^{++}_{m}(x, y) + \big(1 - \widehat{\lambda}_{n}(x, y)\big)\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}(x, y). \tag{12}$$

The question arises whether $\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m,n}$ defined in such a way can outperform both $\widehat{\mathrm{SM}}^{++}_{m}$ and $\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}$. That of course also depends on the choice of $\lambda: \mathbb{R}^{d} \times \mathbb{R}^{d} \rightarrow \mathbb{R}$. If we consider the (common) normalized setting where all input vectors have the same norm, we can rewrite $\lambda(x, y)$ as $\lambda(\theta_{x,y}, r)$, where $\theta_{x,y}$ is the angle between $x$ and $y$ and $\|x\| = \|y\| = r$. By our previous analysis we know that $\widehat{\mathrm{SM}}^{++}_{m}$ becomes perfect for $\theta = \pi$ and $\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}$ becomes perfect for $\theta = 0$. That suggests a particularly simple linear dependence of $\lambda$ on $\theta$ to guarantee vanishing variance for both critical values: $\theta = 0$ and $\theta = \pi$. It remains to show that such a $\lambda$-coefficient can be linearized.
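To make Equation 12 concrete before the $\lambda$-coefficients are specified, here is a minimal sketch of the bipolar hybrid estimator (function names are ours). The $\lambda$-estimate is passed in as a plain number and must be computed from randomness independent of $\omega$, which is what the unbiasedness argument requires:

```python
import numpy as np

def softmax_trig(x, y, omega):
    """Trigonometric estimator of SM(x, y); equals the dot product of the Eq. 2 feature maps."""
    return np.exp((x @ x + y @ y) / 2) * np.mean(np.cos(omega @ (x - y)))

def softmax_pos(x, y, omega):
    """Positive estimator of SM(x, y); equals the dot product of the Eq. 3 feature maps."""
    return np.exp(-(x @ x + y @ y) / 2) * np.mean(np.cosh(omega @ (x + y)))

def softmax_bipolar_hybrid(x, y, omega, lam_hat):
    """Bipolar HRF estimator (Eq. 12). If lam_hat is an unbiased estimate of
    lambda(x, y) built from randomness independent of omega, the result is an
    unbiased estimate of SM(x, y)."""
    return lam_hat * softmax_pos(x, y, omega) + (1.0 - lam_hat) * softmax_trig(x, y, omega)
```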
It turns out that such a linearization can be achieved with a particularly simple random feature map mechanism, as we show later, leading to the so-called angular hybrid variant (see: Section 2.4).

2.3 CHOOSING BASE ESTIMATORS FOR HRFS: COMPLEX EXPONENTIAL ESTIMATORS

In this section, we explain how we choose the base estimators in Equation 8. We denote by $\mathrm{i}$ the imaginary unit, satisfying $\mathrm{i}^{2} = -1$. The following lemma is a gateway to constructing unbiased base estimators.

Lemma 2.4. Let $\omega \sim \mathcal{N}(0, \mathrm{I}_{d})$. Then for every $z = (z_{1}, ..., z_{d})^{\top} \in \mathbb{C}^{d}$ the following holds:

$$\mathbb{E}\big[\exp(\omega^{\top}z)\big] = \exp\Big(\frac{\sum_{i=1}^{d} z_{i}^{2}}{2}\Big). \tag{13}$$

For a complex vector $z = (z_{1}, ..., z_{d})^{\top} \in \mathbb{C}^{d}$, we denote $z^{2} = \sum_{i=1}^{d} z_{i}^{2}$. Consider $z = Ax + (A^{\top})^{-1}y$ for an invertible (in $\mathbb{C}^{d \times d}$) matrix $A \in \mathbb{C}^{d \times d}$. Using Lemma 2.4, we get:

$$\exp\Big(\frac{(Ax)^{2}}{2}\Big)\exp\Big(\frac{((A^{\top})^{-1}y)^{2}}{2}\Big)\mathrm{SM}(x, y) = \mathbb{E}\big[\exp\big(\omega^{\top}(Ax + (A^{\top})^{-1}y)\big)\big]. \tag{14}$$

Thus for $\Psi^{M}_{m}(u) \overset{\mathrm{def}}{=} \frac{1}{\sqrt{m}}\exp\big(-\frac{(Mu)^{2}}{2}\big)\big(\exp(\omega_{1}^{\top}Mu), ..., \exp(\omega_{m}^{\top}Mu)\big)^{\top}$ and $\omega_{i} \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, \mathrm{I}_{d})$:

$$\mathrm{SM}(x, y) = \mathbb{E}\big[\Psi^{A}_{m}(x)^{\top}\Psi^{(A^{\top})^{-1}}_{m}(y)\big]. \tag{15}$$

We thus obtain a new random feature map mechanism providing unbiased estimation of the softmax kernel. In general it is asymmetric, since it applies different parameterizations for $x$ and $y$: one using $A$, the other one $(A^{\top})^{-1}$. Asymmetric random features are a simple generalization of the model using the same $\phi$ for both $x$ and $y$, presented earlier. The resulting estimators $\widehat{\mathrm{SM}}^{\mathrm{cexp}}_{m}(x, y)$, which we call complex exponential (CE), will serve as base estimators and are defined as:

$$\widehat{\mathrm{SM}}^{\mathrm{cexp}}_{m}(x, y) \overset{\mathrm{def}}{=} \Psi^{A}_{m}(x)^{\top}\Psi^{(A^{\top})^{-1}}_{m}(y). \tag{16}$$

For $A = \mathrm{i}\,\mathrm{I}_{d}$ the CE-estimator becomes trigonometric, and for $A = \mathrm{I}_{d}$ it becomes $\widehat{\mathrm{SM}}^{++}_{m}(x, y)$. Note also that an estimator based on this mechanism has variance equal to zero if

$$x = -A^{-1}(A^{\top})^{-1}y. \tag{17}$$

In fact we can relax that condition. It is easy to check that for the variance to be zero, we only need:

$$\mathrm{Re}(A)x + \mathrm{Re}((A^{\top})^{-1})y = 0 \quad \mathrm{and} \quad \mathrm{Im}(A)x + \mathrm{Im}((A^{\top})^{-1})y = 0. \tag{18}$$

2.4 ADAPTING HRF ESTIMATORS: HOW TO INSTANTIATE HYBRID RANDOM FEATURES

We will use the complex exponential estimators from Section 2.3 as base estimators in different instantiations of HRF estimators (which by definition admit a random feature map decomposition). The key to the linearization of the HRF estimators is then Equation 9, which implies linearization of the shifted versions of the $\lambda$-coefficients (Equation 10) and consequently the desired random feature map decomposition via the mechanism of compositional kernels (details in the Appendix). The question remains how to construct the $\lambda$-coefficients so as to reduce the variance of the estimation in the desired regions of interest.

Accurate approximation of small and large softmax kernel values simultaneously: As explained earlier, in several applications of softmax estimators a desired property is accurate estimation in two antipodal regions: small and large softmax kernel values. This can be achieved in particular by choosing $p = 1$, $\widehat{\mathrm{SM}}_{1} = \widehat{\mathrm{SM}}^{++}$, $\widehat{\mathrm{SM}}_{2} = \widehat{\mathrm{SM}}^{\mathrm{trig}}$ and $\lambda(x, y) = \frac{\theta_{x,y}}{\pi}$, where $\theta_{x,y}$ stands for the angle between $x$ and $y$. The key observation is that $\lambda(x, y)$ can be unbiasedly approximated via the mechanism of sgn random features (following from the Goemans-Williamson algorithm (Goemans & Williamson, 2004)) as follows (details in the Appendix, Sec. D):

$$\widehat{\lambda}(x, y) = \frac{1}{2} + \rho_{n}(x)^{\top}\rho_{n}(y), \tag{19}$$

where $\rho_{1,n}(z) = \rho_{2,n}(z) = \rho_{n}(z) \overset{\mathrm{def}}{=} \frac{\mathrm{i}}{\sqrt{2n}}\big(\mathrm{sgn}(\tau_{1}^{\top}z), ..., \mathrm{sgn}(\tau_{n}^{\top}z)\big)^{\top}$ for $\tau_{1}, ..., \tau_{n} \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, \mathrm{I}_{d})$ and $\mathrm{i}^{2} = -1$. We call the resulting estimator of the softmax kernel angular hybrid. As we show in Sec. 3, this estimator is indeed particularly accurate in approximating both small and large softmax kernel values.
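A compact way to see Lemma 2.3 and Equation 19 working together is to materialize the angular hybrid estimator as a single feature vector. The sketch below is our own construction under the formulas above (not the authors' released code): a plain, non-conjugated dot product of two such vectors reproduces Equation 12 with the $\lambda$-estimator of Equation 19, and in this symmetric case $\Psi_{1} = \Psi_{2}$.

```python
import numpy as np

def psi_angular_hybrid(u, omega, tau):
    """Hybrid feature vector of Lemma 2.3, specialized to the bipolar angular case:
    p = 1, a_1 = 1/2, base estimators = positive and trigonometric RFs,
    lambda estimated with sign random features (Eq. 19)."""
    m, n = omega.shape[0], tau.shape[0]
    proj = omega @ u
    # Base-estimator blocks (the 1/sqrt(m) factors are added in psi1..psi4 below).
    pos_block = np.exp(-(u @ u) / 2) * np.concatenate([np.exp(proj), np.exp(-proj)]) / np.sqrt(2)
    trig_block = np.exp((u @ u) / 2) * np.concatenate([np.sin(proj), np.cos(proj)])
    # rho(u) = (i / sqrt(2n)) * sgn(tau^T u); the a = 1/2 shift is folded into psi1 and psi3.
    rho = (1j / np.sqrt(2 * n)) * np.sign(tau @ u)
    a = 0.5
    psi1 = np.sqrt(a / m) * pos_block                    # Psi^1: sqrt(a/m) * positive block
    psi2 = np.kron(rho, pos_block) / np.sqrt(m)          # Psi^2: rho (outer) positive block
    psi3 = np.sqrt((1 - a) / m) * trig_block             # Psi^3: sqrt((1-a)/m) * trig block
    psi4 = 1j * np.kron(rho, trig_block) / np.sqrt(m)    # Psi^4: i * rho (outer) trig block
    return np.concatenate([psi1, psi2, psi3, psi4])

rng = np.random.default_rng(4)
d, m, n = 16, 256, 16
x, y = rng.normal(size=d) / 4, rng.normal(size=d) / 4
omega, tau = rng.normal(size=(m, d)), rng.normal(size=(n, d))
approx = psi_angular_hybrid(x, omega, tau) @ psi_angular_hybrid(y, omega, tau)  # plain dot product
print(np.real(approx), np.exp(x @ y))
```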
For instance, for the prominent case when input vectors are of fixed length, the variance of the angular hybrid estimator is zero for both the smallest ($\theta_{x,y} = \pi$) and the largest ($\theta_{x,y} = 0$) softmax kernel values (see: Fig. 1).

Gaussian Lambda Coefficients: Another tempting option is to instantiate the $\lambda$-coefficients with approximate Gaussian kernel values (estimated either via positive or trigonometric random features). In that setting, the functions $\lambda_{k}(x, y)$ are defined as $\lambda_{k}(x, y) = \exp(-\frac{\|x + M_{k}y\|_{2}^{2}}{2c_{k}^{2}})$ for some $M_{1}, ..., M_{p} \in \mathbb{R}^{d \times d}$ and $c_{1}, ..., c_{p} \in \mathbb{R}$. Since this time the $\lambda$-coefficients are values of the Gaussian kernel between the vectors $\frac{x}{c_{k}}$ and $-\frac{M_{k}y}{c_{k}}$, we can take $a_{k} = 0$ and define $\rho^{k}_{1}, \rho^{k}_{2}$ as:

$$\rho^{k}_{1}(x) = \exp\Big(-\frac{\|x\|_{2}^{2}}{2c_{k}^{2}}\Big)\phi_{\mathrm{SM}}\Big(\frac{x}{c_{k}}\Big) \quad \mathrm{and} \quad \rho^{k}_{2}(y) = \exp\Big(-\frac{\|M_{k}y\|_{2}^{2}}{2c_{k}^{2}}\Big)\phi_{\mathrm{SM}}\Big(-\frac{M_{k}y}{c_{k}}\Big), \tag{20}$$

where $\phi_{\mathrm{SM}}$ is the random feature map corresponding to a particular estimator of the softmax kernel. Note that the resulting hybrid estimator, which we call the Gaussian hybrid, has nice pseudo-combinatorial properties. The coefficients $\lambda_{k} \in (0, 1]$ fire if $x \approx -M_{k}y$ (the firing window can be accurately controlled via the hyperparameters $c_{k}$). Thus the matrices $M_{k}$ can be coupled with base estimators that are particularly accurate in the regions where $x \approx -M_{k}y$. The mechanism can thus be trivially applied to the setting of accurate approximation of both small and large softmax kernel values for inputs of fixed length, yet it turns out to be characterized by a larger variance than its angular hybrid counterpart and cannot provide the variance-vanishing property for both $\theta_{x,y} = 0$ and $\theta_{x,y} = \pi$.

Figure 1: MSEs for three softmax kernel estimators (from left to right): $\widehat{\mathrm{SM}}^{\mathrm{trig}}$, $\widehat{\mathrm{SM}}^{++}$ and angular hybrid, for $m = 10$, $n = 1$ and inputs of lengths $r \leq 5$. The MSEs are given as functions of the angle $\theta_{x,y} \in [0, \pi]$ between $x$ and $y$ and of $r$ (symmetrized along 0 for the length axis). For each plot, we mark in grey its slice for a fixed $r$ to illustrate that only for the angular hybrid estimator does the MSE go to zero for both $\theta_{x,y} \rightarrow 0$ and $\theta_{x,y} \rightarrow \pi$.

Adaptation to data admitting clustering structure: Assume now that the inputs $x$ come from a distribution $\mathcal{P}(\mathcal{X})$ admitting a certain clustering structure with clusters $\mathcal{X}_{1}, ..., \mathcal{X}_{u} \subseteq \mathbb{R}^{d}$ of centers $x_{1}, ..., x_{u} \in \mathbb{R}^{d}$, and that the analogous fact is true for $y$ with the corresponding clusters $\mathcal{Y}_{1}, ..., \mathcal{Y}_{w} \subseteq \mathbb{R}^{d}$ and centers $y_{1}, ..., y_{w} \in \mathbb{R}^{d}$. For a fixed pair of cluster centers $(x_{i}, y_{j})$, one can associate a complex exponential estimator $\widehat{\mathrm{SM}}^{\mathrm{cexp}}_{(x_{i},y_{j})}$ with the corresponding matrix $A \in \mathbb{C}^{d \times d}$ (see: Sec. 2.3) that minimizes the following loss, where $|z|$ for $z \in \mathbb{C}^{d}$ is defined as $|z| = \sqrt{\sum_{i=1}^{d} |z_{i}|^{2}}$:

$$l(A) = |Ax_{i} + (A^{\top})^{-1}y_{j}|^{2}. \tag{21}$$

Note that, by Equation 17, if one can find $A$ such that $l(A) = 0$, then $\widehat{\mathrm{SM}}^{\mathrm{cexp}}_{(x_{i},y_{j})}$ makes no error in approximating the softmax kernel value on the centers $x_{i}, y_{j}$. Furthermore, if sampled points are close to the centers and $l(A)$ is small, the corresponding variance of the estimator will also be small. Thus one can think of $\widehat{\mathrm{SM}}^{\mathrm{cexp}}_{(x_{i},y_{j})}$ as an estimator adapted to the pair of clusters $(x_{i}, y_{j})$. We can add additional constraints on $A$, e.g. $A \in \mathbb{R}^{d \times d}$ or $A$ being diagonal (or both), to obtain closed-form formulae for the optimal $A$, yet not necessarily zeroing out $l(A)$ (details in the Appendix, Sec. H.3). Complex exponential estimators defined in this way can be combined together, becoming base estimators in the HRF estimator.
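As an illustration of Equations 15, 16 and 21 (ours, with hypothetical function names; the closed-form choices of $A$ from Appendix H.3 are not reproduced here), the sketch below implements the asymmetric complex-exponential features and the cluster-pair loss $l(A)$. Choosing $A = \mathrm{I}_{d}$ or $A = \mathrm{i}\,\mathrm{I}_{d}$ recovers the positive and (the complex form of) the trigonometric special cases:

```python
import numpy as np

def psi_M(U, omega, M):
    """Complex-exponential feature map Psi^M_m (Eq. 15): rows are
    exp(-(Mu)^2 / 2) * exp(omega_i^T M u) / sqrt(m), where (Mu)^2 is the
    (complex) sum of squared coordinates, not the squared modulus."""
    m = omega.shape[0]
    MU = U @ M.T                                        # (batch, d), possibly complex
    prefac = np.exp(-np.sum(MU * MU, axis=1, keepdims=True) / 2) / np.sqrt(m)
    return prefac * np.exp(MU @ omega.T)                # (batch, m)

def softmax_cexp(x, y, omega, A):
    """CE estimator of SM(x, y) (Eq. 16): A parameterizes x, (A^T)^{-1} parameterizes y."""
    B = np.linalg.inv(A.T)
    return (psi_M(x[None], omega, A) @ psi_M(y[None], omega, B).T)[0, 0]

def cluster_loss(A, x_center, y_center):
    """l(A) from Eq. 21; small loss means low estimator variance near these centers."""
    r = A @ x_center + np.linalg.inv(A.T) @ y_center
    return float(np.sum(np.abs(r) ** 2))

rng = np.random.default_rng(5)
d, m = 8, 2048
x, y = rng.normal(size=d) * 0.4, rng.normal(size=d) * 0.4
omega = rng.normal(size=(m, d))
A = np.eye(d) * (1.0 + 0.0j)            # A = I_d: positive RFs; A = 1j * I_d: trigonometric RFs
print(np.real(softmax_cexp(x, y, omega, A)), np.exp(x @ y), cluster_loss(A, x, y))
```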
The λ-coefficients corresponding to such cluster-adapted base estimators can be chosen to be Gaussian, as described in the previous paragraph (with optimized matrices $M_{k}$ so that a coefficient fires for the corresponding pair of cluster centers), but there is another, much simpler choice of λs. We index the different base estimators by pairs of cluster centers $(x_{i}, y_{j})$. One can simply take $\rho^{(x_{i},y_{j})}_{1}(x) = 1$ if $x$ belongs to the cluster of center $x_{i}$ and $\rho^{(x_{i},y_{j})}_{1}(x) = 0$ otherwise. The vectors $\rho^{(x_{i},y_{j})}_{2}(y)$ (which also become scalars in this scenario) are defined analogously (the $n$ subscript vanishes since $\rho$ is deterministic). Thus the hybrid estimator effectively activates precisely the random feature mechanism that is optimal for the particular pair of clusters. This scheme is feasible if the cluster-membership calculation is computationally acceptable; otherwise the aforementioned Gaussian coefficients can be applied. We conclude with the observation that, in principle (to reduce the number of base estimators if needed), one can also construct HRF estimators that take base estimators defined in the above way only for the most massive pairs of clusters and add an additional default base estimator (for instance $\widehat{\mathrm{SM}}^{++}$).

3 THEORETICAL GUARANTEES

We provide here theoretical guarantees regarding HRF estimators. We focus on the bipolar setting with two base estimators: $\widehat{\mathrm{SM}}^{\mathrm{trig}}$ and $\widehat{\mathrm{SM}}^{++}$. The following is true:

Theorem 3.1 (MSE of the bipolar hybrid estimator). Take the bipolar hybrid estimator $\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m,n}(x, y)$, where $\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}(x, y)$ and $\widehat{\mathrm{SM}}^{++}_{m}(x, y)$ are chosen independently, i.e. their random projections are chosen independently (note that we always assume that $\widehat{\lambda}_{n}(x, y)$ is constructed independently from $\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}(x, y)$ and $\widehat{\mathrm{SM}}^{++}_{m}(x, y)$). Then the following holds:

$$\mathrm{MSE}(\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m,n}(x, y)) = \mathbb{E}[\widehat{\lambda}^{2}_{n}(x, y)]\mathrm{MSE}(\widehat{\mathrm{SM}}^{++}_{m}(x, y)) + \mathbb{E}[(1 - \widehat{\lambda}_{n}(x, y))^{2}]\mathrm{MSE}(\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}(x, y)). \tag{22}$$

Furthermore, if $\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}(x, y)$ and $\widehat{\mathrm{SM}}^{++}_{m}(x, y)$ apply the exact same sets of random projections, the mean squared error of the hybrid estimator is further reduced, namely we have:

$$\begin{aligned}
\mathrm{MSE}(\widehat{\mathrm{SM}}^{\mathrm{hyb}}_{m,n}(x, y)) = {} & \mathbb{E}[\widehat{\lambda}^{2}_{n}(x, y)]\mathrm{MSE}(\widehat{\mathrm{SM}}^{++}_{m}(x, y)) + \mathbb{E}[(1 - \widehat{\lambda}_{n}(x, y))^{2}]\mathrm{MSE}(\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}(x, y)) \\
& - \frac{2}{m}\mathrm{SM}^{2}(x, y)\big(1 - \cos(\|x\|_{2}^{2} - \|y\|_{2}^{2})\big)\mathbb{E}[\widehat{\lambda}_{n}(x, y)(1 - \widehat{\lambda}_{n}(x, y))].
\end{aligned} \tag{23}$$

The exact formulae for $\mathrm{MSE}(\widehat{\mathrm{SM}}^{++}_{m}(x, y))$ and $\mathrm{MSE}(\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}(x, y))$ are given in Lemma 2.2. We can apply this result directly to the angular hybrid estimator, since:

Lemma 3.2. For the angular hybrid estimator the following holds:

$$\mathbb{E}[\widehat{\lambda}^{2}_{n}(x, y)] = \frac{\theta_{x,y}^{2}}{\pi^{2}} + \frac{\theta_{x,y}(\pi - \theta_{x,y})}{n\pi^{2}}, \qquad \mathbb{E}[\widehat{\lambda}_{n}(x, y)] = \frac{\theta_{x,y}}{\pi}. \tag{24}$$

We see that, as mentioned before, the variance of the angular hybrid estimator is zero for both $\theta_{x,y} = 0$ and $\theta_{x,y} = \pi$ if the inputs $x, y$ have the same length. We need one more definition.

Definition 3.3. Assume that the inputs to the estimators are taken from some given bounded set $\mathcal{C} \subseteq \mathbb{R}^{d}$. For a given estimator $\widehat{\mathrm{SM}}$ on feature vectors $x, y \in \mathcal{C}$, we define its max-relative-error with respect to $\mathcal{C}$ as $\epsilon_{\mathcal{C}}(\widehat{\mathrm{SM}}) = \max_{x,y \in \mathcal{C}} \epsilon_{x,y}(\widehat{\mathrm{SM}})$, where $\epsilon_{x,y}(\widehat{\mathrm{SM}}) = \frac{\sqrt{\mathrm{MSE}(\widehat{\mathrm{SM}}(x,y))}}{\mathrm{SM}(x,y)}$. Denote by $S(r)$ the sphere centered at 0 and of radius $r$. Define $\epsilon_{\theta,r}(\widehat{\mathrm{SM}}) \overset{\mathrm{def}}{=} \epsilon_{x,y}(\widehat{\mathrm{SM}})$ for $x, y \in S(r)$ such that $\theta = \theta_{x,y}$ (note that the mean squared errors of the considered estimators depend only on the angle $\theta_{x,y}$ for $x, y$ chosen from a fixed sphere).

This definition captures the critical observation that in several applications of softmax kernel estimation, e.g.
efficient softmax sampling or linear-attention Transformers (Choromanski et al., 2021b), small relative errors are a much more meaningful measure of the quality of the method than small absolute errors. It also enables us to find hidden symmetries between different estimators:

Lemma 3.4. The following holds:

$$\epsilon_{\theta,r}(\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}) = \epsilon_{\pi-\theta,r}(\widehat{\mathrm{SM}}^{++}_{m}) = \frac{1}{\sqrt{2m}}\exp\Big(2r^{2}\sin^{2}\Big(\frac{\theta}{2}\Big)\Big)\Big(1 - \exp\Big(-4r^{2}\sin^{2}\Big(\frac{\theta}{2}\Big)\Big)\Big), \tag{25}$$

and consequently, for $W(r) = \exp(2r^{2})(1 - \exp(-4r^{2}))$:

$$\epsilon_{S(r)}(\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}) = \epsilon_{S(r)}(\widehat{\mathrm{SM}}^{++}_{m}) = \lim_{\theta \rightarrow \pi}\epsilon_{\theta,r}(\widehat{\mathrm{SM}}^{\mathrm{trig}}_{m}) = \lim_{\theta \rightarrow 0}\epsilon_{\theta,r}(\widehat{\mathrm{SM}}^{++}_{m}) = \frac{1}{\sqrt{2m}}W(r). \tag{26}$$

Our main result shows that HRF estimators can be applied to reduce the max-relative-error of the previously applied estimators of the softmax kernel. In the next theorem we show that the max-relative-error of the angular hybrid estimator scales as $\frac{1}{r}\exp(2r^{2})$ in the length $r$ of its inputs, as opposed to $\exp(2r^{2})$, as is the case for $\widehat{\mathrm{SM}}^{\mathrm{trig}}$ and $\widehat{\mathrm{SM}}^{++}$ (see: Lemma 3.4). Furthermore, the max-relative-error scales as $\sqrt{\theta}$ and $\sqrt{\pi - \theta}$ as $\theta \rightarrow 0$ and $\theta \rightarrow \pi$ respectively; in particular it goes to 0 in both critical cases. This is true neither for $\widehat{\mathrm{SM}}^{\mathrm{trig}}$ nor for $\widehat{\mathrm{SM}}^{++}$.

Theorem 3.5. The max-relative-error of the angular hybrid estimator for inputs $x, y$ on the sphere $S(r)$ of radius $r \geq 1$ satisfies, for $W(r) = \exp(2r^{2})(1 - \exp(-4r^{2}))$:

$$\epsilon_{S(r)}(\widehat{\mathrm{SM}}^{\mathrm{anghyb}}_{m,n}) \leq \frac{W(r)}{\sqrt{2m}\,r}\Bigg(\sqrt{\frac{r}{\sqrt{\pi}\,n}} + \sqrt{\frac{1 - \frac{1}{n}}{\pi}}\Bigg). \tag{27}$$

Furthermore, $\lim_{\theta \rightarrow 0}\frac{\epsilon_{\theta,r}(\widehat{\mathrm{SM}}^{\mathrm{anghyb}}_{m,n})}{\sqrt{\theta}} = \lim_{\theta \rightarrow \pi}\frac{\epsilon_{\theta,r}(\widehat{\mathrm{SM}}^{\mathrm{anghyb}}_{m,n})}{\sqrt{\pi - \theta}} = \sqrt{\frac{1}{2\pi mn}}\,W(r)$.

Additional implications of Theorem 3.5, using the fact that $\Theta(mn)$-dimensional HRFs can be constructed in time $O(nd + md + mn)$ (regular RFs need $O(mnd)$), are given in the Appendix (Sec. D.4).

4 EXPERIMENTS

In this section we conduct an exhaustive evaluation of the HRF mechanism on several tasks.

4.1 POINTWISE SOFTMAX KERNEL ESTIMATION

We start with ablation studies regarding the empirical relative errors of different softmax kernel estimators across different input lengths $r$ and angles $\theta_{x,y}$. The results are presented in Fig. 2. Ablations over more lengths are presented in the Appendix (Sec. G). The angular hybrid estimator approximates the softmax kernel most accurately and has a smaller max-relative-error than the other estimators.

Figure 2: Pointwise estimation of $\mathrm{SM}(x, y)$ for same-length 64-dimensional inputs ($r = 1.0$ and $r = 1.5$) and various angles $\theta_{x,y}$. Red-dotted lines mark the zero level. We used $s = 10000$ estimated softmax values in each subplot. The true value and the 5th and 95th quantiles of the estimated values are shown on the left y-axis, and the empirical relative errors are shown on the right y-axis. The trigonometric estimator and FAVOR+ applied 128 random features. To make a fair comparison, for the hybrid variant a configuration leading to a similar number of FLOPS per random feature map creation was applied. Gains similar to those of the angular variant are obtained by the Gaussian hybrid variant.

Comparison with QMC methods: Even though, as we explained in Section 1, our algorithm does not compete with the various methods for variance reduction based on different sampling techniques (since those methods, being orthogonal to ours, can easily be incorporated into our framework), we decided to also compare HRFs with them. We tested the angular hybrid HRF variant as well as well-established QMC baselines: orthogonal random features (ORF) (Yu et al., 2016) (the regular variant ORF-reg as well as one applying Hadamard matrices, ORF-Had) and QMCs based on random Halton sequences (Halton-R) (Avron et al., 2016).
For HRFs, we tested two sub-variants: with orthogonal (HRF-ort) and regular iid random features (HRF-iid). We computed empirical MSEs by averaging over 100 randomly sampled pairs of vectors from two UCI datasets: wine and Boston. HRFs outperform all other methods, as we see in Table 1.

Table 1: Comparison of different estimators of the softmax kernel on datapoints from two UCI datasets (wine and Boston) in terms of MSE (measured in 10^-3 units). The non-HRF estimators apply 512 random features and the HRF ones are set up to match the non-HRF ones in terms of the number of FLOPS (for a fair comparison). We also report standard deviations.

Dataset | HRF-ort     | HRF-iid     | ORF-reg     | ORF-Had     | Halton-R
Wine    | 0.70 ± 0.08 | 0.85 ± 0.05 | 1.00 ± 0.04 | 1.10 ± 0.06 | 4.02 ± 0.1
Boston  | 0.72 ± 0.06 | 0.79 ± 0.08 | 1.05 ± 0.03 | 1.14 ± 0.02 | 2.53 ± 0.08

4.2 LANGUAGE MODELING

In this section, we apply HRFs to the language modeling task, training a 2-layer LSTM with hidden size h = 200 and applying RFs for softmax sampling in training, as described in (Rawat et al., 2019). Experimental results are averaged over 10 runs. Experimental details and additional validation results are in the Appendix (Sec. I, Table 4).

Figure 3: Statistical metrics measuring softmax matrix approximation quality on the Penn Treebank. For the standard estimators, the numbers of random features are 64, 128, 256 and 512. To make a fair comparison, for the hybrid variants configurations leading to a similar number of FLOPS per random feature map creation were applied. Negative fractions are not reported for FAVOR+ since by definition they are equal to zero.

Statistical Metrics. We compute the 1d-Wasserstein distance and the Kolmogorov-Smirnov (KS) metric (Ramdas et al., 2017) between the true softmax distribution induced by a query on all the classes and the approximate one given by different RF-mechanisms. We also compute the fraction of all probability estimates with undesired negative values. The results on the Penn Treebank (Marcus et al., 1993) are presented in Fig. 3. In the Appendix (Sec. I) we present additional results for the WikiText-2 dataset. We see that HRFs lead to the most accurate softmax distribution estimation.

4.3 TRAINING SPEECH MODELS WITH HRF-CONFORMERS-PERFORMERS

We also tested HRFs on speech models with the LibriSpeech ASR corpus (Panayotov et al., 2015). We put implicit Performer attention into a 17-layer Conformer-Transducer encoder (Gulati et al., 2020) and compared softmax kernel estimators in terms of the word error rate (WER) metric, commonly used to evaluate speech models. We tested HRFs using the angular hybrid variant as well as clustering-based ones. For the latter, the clusters were created according to the K-means algorithm after the first 1000 steps of training (and were frozen afterwards), and the corresponding matrices A were constructed to minimize the loss given in Equation 21. The results are presented in Table 2. HRFs produce the models with the smallest WER, and the clustering-based variants, which fully exercise the most general formulation of HRFs (with general complex estimators), turn out to be the best. Additional details regarding the speech experiments as well as additional experiments with clustering-based HRFs are presented in the Appendix (Sec. J and Sec. H.1, respectively).

Table 2: Comparison of WERs of the Conformer-Transducer applying different RF-mechanisms for the implicit attention. For methods other than clustering-based HRFs (HRF-C), numbers next to method names define the values of m or (m, n). Method HRF-A stands for the angular hybrid variant. Numbers next to HRF-C correspond to the numbers of clusters constructed in the query and key space respectively. HRF-C uses 64 random features. We also report standard deviations averaged over 10 different training runs.

Method | HRF-C(3,3)   | HRF-C(2,3)   | HRF-C(3,2)   | HRF-C(2,2)   | HRF-A(16,8)
WER    | 1.72 ± 0.02% | 1.75 ± 0.03% | 1.83 ± 0.03% | 1.85 ± 0.04% | 2.03 ± 0.08%
Method | HRF-A(8,8)   | FAVOR+ 432   | FAVOR+ 256   | Trig 432     | Trig 256
WER    | 2.05 ± 0.05% | 2.65 ± 0.06% | 2.77 ± 0.04% | 3.12 ± 0.05% | 3.3 ± 0.06%

4.4 DOWNSTREAM ROBOTICS EXPERIMENTS

In Robotics, we leverage the accuracy of HRFs to obtain inference improvements critical for on-robot deployment. To abstract from particular hardware characteristics, we measure inference cost in the number of FLOPS. We conduct two experiments targeting: (a) quadruped locomotion and (b) robotic-arm manipulation. In the former, HRFs are applied as a replacement of the default mechanism using positive random features in the class of implicit-attention architectures for vision processing called Implicit Attention Policies (or IAPs) (Choromanski et al., 2021a). In the latter, HRFs become a part of the regular Vision-Performer stack used to process high-resolution (500 x 500 x 3) input. The results are presented in Fig. 4, where HRFs need 3x fewer FLOPS to obtain policies of the same quality as the baselines. All details regarding both experimental setups are given in the Appendix (Sec. K). Videos of the HRF-trained robotic policies are included in the supplementary material.

Figure 4: Left: Step-stone locomotion task. Comparison of the training curves for the IAP using the angular hybrid estimator with m = n = 8 and the IAP applying the regular FAVOR+ mechanism from (Choromanski et al., 2021b) with m = 256. Both final policies are of similar quality, yet the HRF method requires 3x+ fewer FLOPS to run its trained policy. A visualization of the HRF policy in action and its attention (with filtered-out pixels marked in red) is in the bottom right corner. Right: Similar setting (and conclusions) but for the robotic-arm manipulation task. The additional five regular RF-configurations did not train, producing NaN loss due to the large variance of the underlying softmax kernel estimators.

5 CONCLUSION

We presented a new class of random feature techniques, called hybrid random features, for softmax/Gaussian kernel estimation, which approximate these kernels more accurately than previous algorithms. We also demonstrated their robustness in a wide range of applications, from softmax sampling to Transformer/attention training, including downstream Robotics tasks.

Reproducibility Statement: Section 4 and the Appendix include all experimental details (e.g. hyperparameter setup) needed to reproduce all the results presented in the paper. The part of the code that we could make publicly available can be found in the following github: https://github.com/HL-hanlin/HRF_ICLR2022.

Ethics Statement: HRFs can in principle be used to train massive Transformer models with a large number of parameters and proportional compute resources. Thus they should be used responsibly, in particular given the CO2 emission challenges that the scientific community is currently trying to address.

ACKNOWLEDGEMENTS

AW acknowledges support from a Turing AI Fellowship under grant EP/V025379/1, The Alan Turing Institute, and the Leverhulme Trust via CFI. AS acknowledges AWS compute resources from Onur Kara.

REFERENCES

Unitree Robotics. URL http://www.unitree.cc/.

Haim Avron, Vikas Sindhwani, Jiyan Yang, and Michael W. Mahoney.
Quasi-monte carlo feature maps for shift-invariant kernels. J. Mach. Learn. Res., 17:120:1 120:38, 2016. URL http: //jmlr.org/papers/v17/14-538.html. Guy Blanc and Steffen Rendle. Adaptive sampled softmax with kernel based sampling. In Jennifer G. Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsm assan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 589 598. PMLR, 2018. URL http://proceedings.mlr.press/v80/blanc18a.html. Y. Cho and L. K. Saul. Analysis and extension of arc-cosine kernels for large margin classification. Technical Report CS2012-0972, Department of Computer Science and Engineering, University of California, San Diego., 2012. Krzysztof Choromanski and Vikas Sindhwani. Recycling randomness with structure for sublinear time kernel expansions. In Maria-Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pp. 2502 2510. JMLR.org, 2016. URL http://proceedings.mlr.press/v48/choromanski16. html. Krzysztof Choromanski, Mark Rowland, Wenyu Chen, and Adrian Weller. Unifying orthogonal Monte Carlo methods. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 1203 1212. PMLR, 2019. URL http://proceedings.mlr.press/v97/ choromanski19a.html. Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, David Belanger, Lucy Colwell, and Adrian Weller. Masked language modeling for proteins via linearly scalable long-context transformers. ar Xiv preprint ar Xiv:2006.03555, 2020. Krzysztof Choromanski, Deepali Jain, Jack Parker-Holder, Xingyou Song, Valerii Likhosherstov, Anirban Santara, Aldo Pacchiano, Yunhao Tang, and Adrian Weller. Unlocking pixels for reinforcement learning via implicit attention. Co RR, abs/2102.04353, 2021a. URL https: //arxiv.org/abs/2102.04353. Krzysztof Marcin Choromanski, Mark Rowland, and Adrian Weller. The unreasonable effectiveness of structured random orthogonal embeddings. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 219 228, 2017. Published as a conference paper at ICLR 2022 Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tam as Sarl os, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J. Colwell, and Adrian Weller. Rethinking attention with performers. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021b. URL https://openreview.net/ forum?id=Ua6zuk0WRH. Amit Daniely, Roy Frostig, Vineet Gupta, and Yoram Singer. Random features for compositional kernels. Co RR, abs/1703.07872, 2017. URL http://arxiv.org/abs/1703.07872. Michel X. Goemans and David P. Williamson. Approximation algorithms for MAX-3-CUT and other problems via complex semidefinite programming. J. Comput. Syst. Sci., 68(2):442 470, 2004. 
doi: 10.1016/j.jcss.2003.07.012. URL https://doi.org/10.1016/j.jcss. 2003.07.012. Arthur Gretton, Ralf Herbrich, Alexander J. Smola, Olivier Bousquet, and Bernhard Sch olkopf. Kernel methods for measuring independence. J. Mach. Learn. Res., 6:2075 2129, 2005. URL http://jmlr.org/papers/v6/gretton05a.html. Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. Conformer: Convolution-augmented transformer for speech recognition. In Helen Meng, Bo Xu, and Thomas Fang Zheng (eds.), Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, pp. 5036 5040. ISCA, 2020. doi: 10.21437/Interspeech.2020-3015. URL https://doi.org/10.21437/Interspeech. 2020-3015. Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. In Proceedings of the 5th International Conference on Learning Representations, 2017. Fanghui Liu, Xiaolin Huang, Yudong Chen, and Johan A. K. Suykens. Random features for kernel approximation: A survey in algorithms, theory, and beyond. Co RR, abs/2004.11154, 2020. URL https://arxiv.org/abs/2004.11154. Shengjie Luo, Shanda Li, Tianle Cai, Di He, Dinglan Peng, Shuxin Zheng, Guolin Ke, Liwei Wang, and Tie-Yan Liu. Stable, fast and accurate: Kernelized attention with relative positional encoding. Advances in Neural Information Processing Systems, 34, 2021. Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313 330, 1993. URL https://aclanthology.org/J93-2004. Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer Sentinel Mixture Models. 5th International Conference on Learning Representations, abs/1609.07843, 2017. Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 1924, 2015, pp. 5206 5210. IEEE, 2015. doi: 10.1109/ICASSP.2015.7178964. URL https: //doi.org/10.1109/ICASSP.2015.7178964. Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, and Lingpeng Kong. Random feature attention. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021. URL https://openreview.net/forum?id=Qt TKTd Vr FBB. Ofir Press and Lior Wolf. Using the Output Embedding to Improve Language Models, 2017. Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis (eds.), Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, Published as a conference paper at ICLR 2022 pp. 1177 1184. Curran Associates, Inc., 2007. URL http://papers.nips.cc/paper/ 3182-random-features-for-large-scale-kernel-machines. Aaditya Ramdas, Nicol as Garc ıa Trillos, and Marco Cuturi. On Wasserstein two-sample testing and related families of nonparametric tests. Entropy, 19(2):47, 2017. doi: 10.3390/e19020047. URL https://doi.org/10.3390/e19020047. Ankit Singh Rawat, Jiecao Chen, Felix X. 
Yu, Ananda Theertha Suresh, and Sanjiv Kumar. Sampled softmax with random Fourier features. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alch e-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 13834 13844, 2019. URL https://proceedings.neurips.cc/paper/2019/ hash/e43739bba7cdb577e9e3e4e42447f5a5-Abstract.html. Mark Rowland, Krzysztof Choromanski, Franc ois Chalus, Aldo Pacchiano, Tam as Sarl os, Richard E. Turner, and Adrian Weller. Geometrically coupled Monte Carlo sampling. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicol o Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS 2018, December 3-8, 2018, Montr eal, Canada, pp. 195 205, 2018. URL https://proceedings.neurips.cc/ paper/2018/hash/b3e3e393c77e35a4a3f3cbd1e429b5dc-Abstract.html. Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nystr om computational regularization. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 1657 1665, 2015. URL https://proceedings.neurips.cc/paper/ 2015/hash/03e0704b5690a2dee1861dc3ad3316c9-Abstract.html. Imanol Schlag, Kazuki Irie, and J urgen Schmidhuber. Linear transformers are secretly fast weight programmers. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 9355 9366. PMLR, 2021. URL http://proceedings.mlr.press/v139/schlag21a.html. A aron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. Co RR, abs/1807.03748, 2018. URL http://arxiv.org/abs/1807.03748. Christopher K. I. Williams and Matthias W. Seeger. Using the Nystr om method to speed up kernel machines. In Todd K. Leen, Thomas G. Dietterich, and Volker Tresp (eds.), Advances in Neural Information Processing Systems 13, Papers from Neural Information Processing Systems (NIPS) 2000, Denver, CO, USA, pp. 682 688. MIT Press, 2000. URL https://proceedings.neurips.cc/paper/2000/hash/ 19de10adbaa1b2ee13f77f679fa1483a-Abstract.html. Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nystr om method vs random Fourier features: A theoretical and empirical comparison. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, L eon Bottou, and Kilian Q. Weinberger (eds.), Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, pp. 485 493, 2012. URL https://proceedings.neurips.cc/ paper/2012/hash/621bf66ddb7c962aa0d22ac97d69b793-Abstract.html. Felix X. Yu, Ananda Theertha Suresh, Krzysztof Marcin Choromanski, Daniel N. Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. In Daniel D. 
Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 1975 1983, 2016. URL http://papers.nips.cc/paper/ 6246-orthogonal-random-features. Published as a conference paper at ICLR 2022 Qinyi Zhang, Sarah Filippi, Arthur Gretton, and Dino Sejdinovic. Large-scale kernel methods for independence testing. Stat. Comput., 28(1):113 130, 2018. doi: 10.1007/s11222-016-9721-7. URL https://doi.org/10.1007/s11222-016-9721-7. Published as a conference paper at ICLR 2022 APPENDIX: HYBRID RANDOM FEATURES A PROOF OF LEMMA 2.4 Proof. From the independence of ω1, ..., ωd, we get: E[exp(ω z)] = i=1 E[exp(ωizi)] (28) Thus it suffices to show that for g N(0, 1) and any z C the following holds: E[exp(gz)] = exp(z2 Define f(z) = E[exp(gz)]. Note that f(z) = exp( z2 2 ) for z = ix, where i2 = 1 and x R which follows from the formula of the characteristic function of the Gaussian distribution. Similarly, f(z) = exp( z2 2 ) for z R which follows from the formula of the moment generating function of the Gaussian distribution. We now use the fact from complex analysis that if two analytic functions C C are identical on uncountably many points then they are equal on C to complete the proof (we leave to the reader checking that both f(z) and g(z) def = exp( z2 2 ) are analytic). B PROOF OF LEMMA 2.3 Proof. The following is true: d SM hyb m,n(x, y) = m (ϕ1,k 1,m(x) ... ϕtk,k 1,m(x)) (ϕ1,k 2,m(y) ... ϕtk,k 2,m(y))+ j=1 (ρi,k 1,n(x) ϕj,k 1,m(x)) (ρi,k 2,n(y) ϕj,k 2,m(y))+ 1 Pp k=1 ak m (ϕ1,p+1 1,m (x) ... ϕtp+1,p+1 1,m (x)) (ϕ1,p+1 2,m (y) ... ϕtp+1,p+1 2,m (y)) j=1 (ρi,k 1,n(x) ϕj,p+1 1,m (x)) (ρi,k 2,n(y) ϕj,p+1 2,m (y)). The statement of the lemma follows directly from that. Note that the key trick is an observation that a product of two functions of x, y that can be linearized on expectation (i.e. rewritten with x and y disentangled) is itself a function that can be linearized on expectation. In the setting where the former are approximated by random features, the latter is approximated by their cartesian product (the random feature mechanism for the compositional kernels from (Daniely et al., 2017)). C BIPOLAR HRF ESTIMATORS: PROOF OF THEOREM 3.1 Proof. The following holds: Var(d SM hyb m,n(x, y)) = Var bλn(θ)d SM ++ m (x, y) + Var (1 bλn(θ))d SM trig m (x, y) + 2Cov bλn(θ)d SM ++ m (x, y), (1 bλn(θ))d SM trig m (x, y) (31) We will focus now on the covariance term. To simplify the notation, we will drop index n in bλn. Assume first easier to analyze case where d SM trig m (x, y) and d SM ++ m (x, y) are chosen independently. Then the following is true: Published as a conference paper at ICLR 2022 Cov bλ(θ)d SM ++ m (x, y), (1 bλ(θ))d SM trig m (x, y) = E[bλ(θ)(1 bλ(θ))d SM ++ m (x, y)d SM trig m (x, y)] E[bλ(θ)d SM ++ m (x, y)]E[(1 bλ(θ))d SM trig m (x, y)] = E[bλ(θ)(1 bλ(θ))] E[bλ(θ)]E[(1 bλ(θ))] E[d SM ++ m (x, y)]E[d SM trig m (x, y)] = (SM(x, y))2Var(bλ(θ)) (32) Now assume that d SM trig m (x, y) and d SM ++ m (x, y) use the exact same random projections. 
Then, using similar analysis as before, we get: Cov bλ(θ)d SM ++ m (x, y), (1 bλ(θ))d SM trig m (x, y) = E[d SM ++ m (x, y)d SM trig m (x, y)]E[bλ(θ)(1 bλ(θ))] (SM(x, y))2E[bλ(θ)]E[(1 bλ(θ))] (33) This time however it is no longer the case that E[d SM ++ m (x, y)d SM trig m (x, y)]E[bλ(θ)(1 bλ(θ))] = E[d SM ++ m (x, y)]E[d SM trig m (x, y)] = (SM(x, y))2 since d SM ++ m (x, y) and d SM trig m (x, y) are no longer independent. In order to compute E[d SM ++ m (x, y)d SM trig m (x, y)]E[bλ(θ)(1 bλ(θ))], we will first introduce useful denotation. Denote by ω1, ..., ωm iid N(0, Id) the random projections sampled to construct both d SM ++ m (x, y) and d SM trig m (x, y). Denote: Yi = cosh(ω i (x + y)) def = exp(ω i (x+y))+exp( ω i (x+y)) 2 . We have: d SM ++ m (x, y) = exp x 2 + y 2 Y1 + ... + Ym If we denote: Zi = cos(ω i (x y)), then we can write d SM trig m (x, y) as: d SM trig m (x, y) = exp x 2 + y 2 Z1 + ... + Zm We can then rewrite E[d SM ++ m (x, y)d SM trig m (x, y)] as: E[d SM ++ m (x, y)d SM trig m (x, y)] = 1 m2 i =j E[Yi Zj] + i=1 E[Yi Zi] (SM(x, y))2 + m E[cosh(ω (x + y)) cos(ω (x y))] , where ω N(0, Id). The equality follows from the unbiasedness of d SM trig m (x, y) and d SM ++ m (x, y) and the fact that different ωi are chosen independently. Thus it remains to compute ρ = E[cosh(ω (x + y)) cos(ω (x y))]. Note first that ρ = E[exp(ω (x + y)) cos(ω (x y))] since ω N(0, Id) and cos is an even function. Denote z = x + y + i(x y). We have: E[exp(ω (x + y)) cos(ω (x y))] = Re E[exp(ω z)] = Re[ i=1 exp(z2 i 2 )] = Pd j=1(xj + yj)2 + 2i(x2 j y2 j ) (xj yj)2 = (SM(x, y))2 cos( x 2 2 y 2 2) Thus we conclude that: E[d SM ++ m (x, y)d SM trig m (x, y)] = (1 1 m)(SM(x, y))2 + 1 m(SM(x, y))2 cos( x 2 2 y 2 2) (38) Published as a conference paper at ICLR 2022 Therefore we get the formulae for the covariance term in both: the setting where random projections of d SM trig m (x, y) and d SM ++ m (x, y) are shared (variant II) and when they are not (variant I). The following is true for Z = 1 m(1 cos( x 2 2 y 2 2)E[bλ(θ)(1 bλ(θ))]: Cov(bλ(θ)d SM ++ m (x, y), (1 bλ(θ))d SM trig m (x, y)) = (SM(x, y))2Var(bλ(θ)) for variant I (SM(x, y))2(Var(bλ(θ)) + Z) for variant II (39) We also have the following: Var(bλ(θ)d SM ++ m (x, y)) = E[(bλ(θ))2(d SM ++ m (x, y))2] (E[bλ(θ)d SM ++ m (x, y)])2 = E[(bλ(θ))2] MSE(d SM ++ m (x, y)) + (SM(x, y))2 (E[bλ(θ)])2(SM(x, y))2 = (SM(x, y))2Var(bλ(θ)) + E[(bλ(θ))2]MSE(d SM ++ m (x, y)) and furthermore (by the analogous analysis): Var((1 bλ(θ))d SM trig m (x, y)) = (SM(x, y))2Var(bλ(θ))+E[(1 bλ(θ))2]MSE(d SM trig m (x, y)) (41) By putting the derived formula for the above variance terms as well as covariance terms back in the Equation 31, we complete the proof of the theorem (note that the mean squared error of the hybrid estimator is its variance since it is unbiased). D ANGULAR HYBRID ESTIMATORS D.1 PROOF OF LEMMA 3.2 Proof. The formula for the expectation is directly implied by the formula (1) from Sec. 2.1 of Cho & Saul (2012) for the zeroth-order arc-cosine kernel k0(x, y) def = 1 θx,y π . It also follows directly from Goemans-Williamson algorithm (Goemans & Williamson, 2004). We will now derive the formula for c = E[bλ2(x, y)]. Since λ depends only on the angle θ = θx,y, we will refer to it as: λ(θ) and to its estimator as bλ(θ). Denote: ϕang n (z) = 1 n(sgn(τ 1 z), ..., sgn(τ n z)) (42) Denote: Xi = (ϕang n (x))[i]ϕang n (y)[i]. We have: Note first that by the construction of bλ(θ), we have: E[bλ(θ)] = θ π and thus: E[Pn i=1 Xi] = 1 2θ π . 
Therefore we conclude that: i=1 E[X2 i ] + X i =j E[Xi]E[Xj] n2 + n(n 1) 1 Published as a conference paper at ICLR 2022 D.2 EXPLICIT FORMULA FOR THE MSE OF THE ANGULAR HYBRID ESTIMATOR Theorem 3.1 and Lemma 3.2 immediately imply the following result. Theorem D.1 (MSE of the angular hybrid estimator). Take the angular hybrid estimator d SM anghyb m,n (x, y), where d SM trig m (x, y) and d SM ++ m (x, y) are chosen independently i.e. their random projections are chosen independently (note that we always assume that bλ(x, y) is chosen indepen- dently from d SM trig m (x, y) and d SM ++ m (x, y)). Then the following holds: MSE(d SM anghyb m,n (x, y)) = θ 2m exp( z 2)SM2(x, y)(1 exp( z 2))2+ 2m exp( z 2)SM 2(x, y)(1 exp( 2))2 for = x y and z = x + y. Furthermore, if d SM trig m (x, y) and d SM ++ m (x, y) apply the exact same sets of random projections, the mean squared error of the hybrid estimator is further reduced, namely we have: MSE(d SM anghyb m,n (x, y)) = θ 2m exp( z 2)SM2(x, y)(1 exp( z 2))2+ 1 θ 2m exp( z 2)SM 2(x, y)(1 exp( 2))2 m SM2(x, y)(1 cos( x 2 2 y 2 2)) θ Thus if x 2 = y 2 = r then regardless of whether the same sets of random projections are applied or not, we get: MSE(d SM anghyb m,n (x, y)) = θ 2m exp(8r2 cos2(θ (1 exp( 4r2 cos2(θ 2m exp(2r2)(1 exp( 4r2 sin2(θ D.3 PROOF OF THEOREM 3.5 Proof. Note first that from the derived above formula of the MSE of the lambda-angular bipolar hybrid estimator and the definition of the max-relative-error, we obtain: ϵS(r)(d SM anghyb m,n ) = exp(r2) max θ [0,π] hr(θ), (48) hr(θ) = ar(θ) + ar(π θ) (49) and ar(θ) is defined as: n) exp(2r2 cos(θ)) 1 exp( 4r2 cos2(θ 2)) 2 . (50) Therefore we have: ϵS(r)(d SM anghyb m,n ) exp(r2) m max θ [0,π] ar(θ) (51) Notice that: ar(θ) br(θ)(1 exp( 4r2))2 (52) Published as a conference paper at ICLR 2022 where: br(θ) = b1 r(θ) + b2 r(θ) (53) b1 r(θ) = (1 1 π2 exp(2r2 cos(θ)), (54) b2 r(θ) = 1 π exp(2r2 cos(θ)). (55) Therefore: max θ [0,π] br(θ) max θ [0,π] b1 r(θ) + max θ [0,π] b2 r(θ) (56) Denote: b1 = maxθ [0,π] b1 r(θ) and b2 = maxθ [0,π] b2 r(θ). Note that: db1 r(θ) dθ = exp(2r2 cos(θ))(1 1 π2 (1 r2θ sin(θ)) (57) and db2 r(θ) dθ = exp(2r2 cos(θ)) 1 nπ (1 2r2θ sin(θ)) (58) Thus, from the properties of function: θ θ sin(θ) and the fact that r 1, we conclude that both derivatives are first non-negative, then non-positive and then non-negative and that the unique local maximum on the interval [0, π] is achieved for θ π 2 . Note also that b1 r(θ), b1 r(θ) 0 and b1 r(0) = b2 r(0) = 0, b1 r(π) = (1 1 n) exp( 2r2), b2 r(π) = 1 n exp( 2r2). We conclude that global maximum for bi r on the interval [0, π] for i = 1, 2 is achieved either in its unique local maximum on that interval or for θ = π. Let us consider first: b1 r. In its local maximum on [0, π] we have: θ sin(θ ) = 1 Since θ sin(θ) π 2 ], we get: 2 1 r2 , (60) Therefore: b1 r(θ ) (1 1 n) 1 2πr2 exp(2r2) b1 r(π) (62) We thus conclude that: max θ [0,π] b1 r(θ) (1 1 n) 1 2πr2 exp(2r2) (63) By the completely analogous analysis applied to b2 r, we obtain: max θ [0,π] b1 r(θ) 1 2n πr2 exp(2r2) (64) Now, using Equation 51, Equation 52, and Equation 56, we obtain: ϵS(r)(d SM)anghyb m,n exp(2r2) 2mr (1 exp( 4r2)) nπ + 1 n π (65) and that completes the first part of the proof (proof of Inequality 27). The equations on the limits are directly implied by the fact that: ϵS(r)(d SM anghyb m,n ) = exp(2r2) hr(θ). 
(66) Published as a conference paper at ICLR 2022 D.4 DISCUSSION OF THEOREM 3.5 Theorem 3.5 leads to yet another important conclusion not discussed in the main body of the paper. Note that asymptotically as θ 0 or θ π, the relative error decreases (as a function of m and n) as 1 mn = Θ( 1 dest ), where dest stands for the number of random features of the resulting estimator. Note also that for the regular estimator the rate of decrease is also Θ( 1 dest ) (here we treat all other arguments of the formula for the relative error as constants; we already know that the dependence on them of the relative error for the HRF estimators is superior to the one for regular estimators). The key difference from the computational point of view is that for the HRF estimator, the random feature map can be constructed in time O(nd + md + mn), whereas for the regular estimator it requires time O(mnd) (see: Sec. L). Thus for the regular estimator random feature map computation is much more expensive. An important application, where random feature map computations takes substantial time of the overall compute time are implicit-attention Transformers, such as Performers (Choromanski et al., 2021b). This is also the setting, where an overwhelming fraction of the approximate softmax kernel values will be extreme (very small or large) since an overwhelming fraction of the attention matrix entries after some training will have extreme values (very small or large). In the setting, where the lengths of queries and keys do not vary much (for instance a prominent case where they all have the same fixed length) that corresponds to angle values θ π or θ 0. This is exactly the scenario from the second part of Theorem 3.5. E PROOF OF LEMMA 3.4 Proof. Equation 25 follows directly from Lemma 2.2. Notice that the relative error ϵθ,r(d SM trig m ) is an increasing function of sin2( θ 2) and thus is largest for θ = π. Plugging in this value into the formula of the relative error gives us the expression from the statement of the lemma. Completely analogous analysis holds for d SM ++ m . F GAUSSIAN HYBRID ESTIMATORS In this section we will provide more results regarding Gaussian hybrid estimators. We will focus on the bipolar scenario and within this scenario on the so-called normalized variant described below. If x and y have fixed L2-norm ( x 2 = y 2 = r), we can propose a normalized bipolar Gaussian hybrid estimator such that λ(x, y) = 0 if θx,y = 0 and λ(x, y) = 1 if θx,y = π. Furthermore the variance of that estimator, that we call shortly d SM gausshyb m,n , will be zeroed out for θx,y = 0 or θx,y = π, but not for both. The coefficient-function λ : Rd Rd R is defined as: λ(x, y) = 1 exp( σ2 2 x y 2) ρ (67) where ρ is given as: ρ = 1 exp( 2σ2r2) (68) The hyperparameter σ controls the smoothness of the Gaussian kernel. The exponential in λ can be estimated either by trigonometric or positive random features. It is easy to see that for the fixed input lengths, in the former case the variance of the estimator is zero for θx,y = 0 and in the latter it is zero for θx,y = π, but the variance never zeroes out for both θx,y = 0 and θx,y = π. In principle, one can derive entire hierarchy of HRF estimators, where the coefficients in an HRF estimator from one level of hierarchy are estimated with the use of HRF estimators from the higher level. 
In that context, the angular hybrid estimator can be applied to approximate the coefficients of the normalized bipolar Gaussian hybrid estimator, leading to variance zeroing out for both θ_{x,y} = 0 and θ_{x,y} = π. However, such nested constructions are not the topic of this paper. From now on we will assume that trigonometric random features are applied to estimate the λ-coefficient. Therefore we have:

λ̂(x, y) = (1/ρ) (1 − (1/n) Σ_{i=1}^{n} cos(σ ω_i^T (x − y)))   (69)

and that leads to the following choice of a, ρ_{1,n}, ρ_{2,n} from Equation 10:

a = 1/ρ,   ρ_{1,n}(z) = ρ_{2,n}(z) = (i/sqrt(nρ)) (sin(σ τ_1^T z), cos(σ τ_1^T z), ..., sin(σ τ_n^T z), cos(σ τ_n^T z)),   (70)

where τ_1, ..., τ_n ∼ N(0, I_d) and i^2 = −1. Using Theorem 3.1, we conclude that:

Theorem F.1 (MSE of the normalized bipolar Gaussian hybrid estimator). Take the normalized bipolar Gaussian hybrid estimator ŜM^{gausshyb}_{m,n}(x, y), where ŜM^{trig}_{m}(x, y) and ŜM^{++}_{m}(x, y) are chosen independently, i.e. their random projections are chosen independently (note that we always assume that λ̂(x, y) is chosen independently from ŜM^{trig}_{m}(x, y) and ŜM^{++}_{m}(x, y)). Denote Δ = x − y. Then the following holds:

MSE(ŜM^{gausshyb}_{m,n}(x, y)) = E[λ̂^2(x, y)] MSE(ŜM^{++}_{m}(x, y)) + E[(1 − λ̂(x, y))^2] MSE(ŜM^{trig}_{m}(x, y)),   (71)

where

E[λ̂^2(x, y)] = (1/ρ^2)(1 − exp(−(σ^2/2)‖Δ‖^2))^2 + (1/(2ρ^2 n))(1 − exp(−σ^2‖Δ‖^2))^2,   (72)

E[(1 − λ̂(x, y))^2] = (1/ρ^2)((1 − ρ) − exp(−(σ^2/2)‖Δ‖^2))^2 + (1/(2ρ^2 n))(1 − exp(−σ^2‖Δ‖^2))^2.   (73)

Furthermore, if ŜM^{trig}_{m}(x, y) and ŜM^{++}_{m}(x, y) apply the exact same sets of random projections, the mean squared error of the hybrid estimator remains the same.

Proof. We can calculate E[λ̂^2(x, y)] and E[(1 − λ̂(x, y))^2] as follows:

E[(1 − λ̂(x, y))^2] = (1/ρ^2) E[(ρ − 1 + (1/n) Σ_{i=1}^{n} cos(σ ω_i^T (x − y)))^2]
= ((ρ − 1)^2)/ρ^2 + (2(ρ − 1)/ρ^2) E[cos(σ ω^T (x − y))] + (1/ρ^2) E[((1/n) Σ_{i=1}^{n} cos(σ ω_i^T (x − y)))^2]
= ((ρ − 1)^2)/ρ^2 + (2(ρ − 1)/ρ^2) exp(−(σ^2/2)‖Δ‖^2) + (1/ρ^2) (exp(−σ^2‖Δ‖^2) + (1/(2n))(1 − exp(−σ^2‖Δ‖^2))^2)
= (1/ρ^2)((1 − ρ) − exp(−(σ^2/2)‖Δ‖^2))^2 + (1/(2ρ^2 n))(1 − exp(−σ^2‖Δ‖^2))^2

and

E[λ̂^2(x, y)] = 1 − 2E[1 − λ̂(x, y)] + E[(1 − λ̂(x, y))^2]
= 1 − 2(1 − (1/ρ)(1 − exp(−(σ^2/2)‖Δ‖^2))) + E[(1 − λ̂(x, y))^2]
= (1/ρ^2)(1 − exp(−(σ^2/2)‖Δ‖^2))^2 + (1/(2ρ^2 n))(1 − exp(−σ^2‖Δ‖^2))^2.

By plugging these two formulae into the expression for the MSE, we obtain the first half of the theorem. Note that if ŜM^{trig}_{m} and ŜM^{++}_{m} apply the exact same sets of random projections, then the term (1 − cos(‖x‖_2^2 − ‖y‖_2^2)) appearing in MSE(ŜM^{gausshyb}_{m,n}(x, y)) becomes zero, and therefore the MSE remains the same. We thus obtain the second part of the theorem, which completes the proof.
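As a quick sanity check of the construction above, the short sketch below (assuming the forms written out in Equations 67-69; all names are ours) draws trigonometric features for the λ-coefficient and verifies empirically that the estimator is unbiased for the normalized bipolar Gaussian coefficient.

```python
import numpy as np

def lam_exact(x, y, sigma, rho):
    # Normalized bipolar Gaussian lambda-coefficient (Eq. 67).
    return (1.0 - np.exp(-0.5 * sigma**2 * np.sum((x - y) ** 2))) / rho

def lam_trig_estimate(x, y, sigma, rho, n, rng):
    # Unbiased estimate of lambda via trigonometric random features (Eq. 69):
    # E[cos(sigma * w^T (x - y))] = exp(-sigma^2 ||x - y||^2 / 2) for w ~ N(0, I).
    d = x.shape[0]
    w = rng.standard_normal((n, d))
    return (1.0 - np.mean(np.cos(sigma * w @ (x - y)))) / rho

rng = np.random.default_rng(0)
d, r, sigma = 16, 1.0, 1.0
rho = 1.0 - np.exp(-2.0 * sigma**2 * r**2)        # Eq. 68, for ||x|| = ||y|| = r
x = rng.standard_normal(d); x *= r / np.linalg.norm(x)
y = rng.standard_normal(d); y *= r / np.linalg.norm(y)

est = np.mean([lam_trig_estimate(x, y, sigma, rho, n=8, rng=rng) for _ in range(20000)])
print(lam_exact(x, y, sigma, rho), est)           # the two numbers should be close
```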
Now let us assume the setting of same-length inputs. We denote the common length by r and the angle between the inputs by θ. We can rewrite Eq. 67 and provide an equivalent definition of λ by replacing x and y with their angle θ and norm r:

λ(θ, r) = (1 − exp(−2σ^2 r^2 sin^2(θ/2))) / ρ.   (76)

We obtain the following version of the above theorem:

Theorem F.2 (MSE of the normalized bipolar Gaussian hybrid estimator for same-length inputs). Assume that ‖x‖_2 = ‖y‖_2 = r and denote θ = θ_{x,y}. Take the normalized bipolar Gaussian hybrid estimator ŜM^{gausshyb}_{m,n}(θ, r), where ŜM^{trig}_{m}(θ, r) and ŜM^{++}_{m}(θ, r) are chosen independently, i.e. their random projections are chosen independently (note that we always assume that λ̂(θ, r) is chosen independently from ŜM^{trig}_{m}(θ, r) and ŜM^{++}_{m}(θ, r)). Then the following holds:

MSE(ŜM^{gausshyb}_{m,n}(θ, r)) = E[λ̂^2(θ, r)] MSE(ŜM^{++}_{m}(θ, r)) + E[(1 − λ̂(θ, r))^2] MSE(ŜM^{trig}_{m}(θ, r)),   (77)

where

E[λ̂^2(θ, r)] = (1/ρ^2)(1 − exp(−2σ^2 r^2 sin^2(θ/2)))^2 + (1/(2ρ^2 n))(1 − exp(−4σ^2 r^2 sin^2(θ/2)))^2,   (78)

E[(1 − λ̂(θ, r))^2] = (1/ρ^2)((1 − ρ) − exp(−2σ^2 r^2 sin^2(θ/2)))^2 + (1/(2ρ^2 n))(1 − exp(−4σ^2 r^2 sin^2(θ/2)))^2.   (79)

Furthermore, if ŜM^{trig}_{m}(θ, r) and ŜM^{++}_{m}(θ, r) apply the exact same sets of random projections, the mean squared error of the hybrid estimator remains the same.

Proof. For ‖x‖_2 = ‖y‖_2 = r, the following holds for Δ = x − y:

‖Δ‖^2 = 4r^2 sin^2(θ/2).   (80)

The theorem follows by substituting this expression for ‖Δ‖^2 in Theorem F.1.

G POINTWISE SOFTMAX KERNEL ESTIMATION EXPERIMENTS

In Fig. 5 we present complete ablation studies regarding the empirical relative errors of different RF-based estimators of the softmax kernel over different lengths of inputs r and angles θ_{x,y}.

H COMPLEX EXPONENTIAL ESTIMATORS FOR CLUSTERED DATA

H.1 SYNTHETIC EXPERIMENTS ON DATA ADMITTING CLUSTERING STRUCTURE

Here we test HRF estimators customized to data with clustering structure, as described in Sec. 2.4. The inputs x are taken from two 50-dimensional 1000-element Gaussian clusters and the inputs y from two other 50-dimensional 1000-element Gaussian clusters. We use the empirical MSE metric and constrain the matrix A to be real diagonal to obtain compact closed-form formulae (see: Sec. H.3). We created four different cluster configurations corresponding to different values of s = l(A) (see: Eq. 21) for all pairs of clusters and with different concentrations around the centers, controlled by the standard deviation σ (small/large values of s and σ). As shown in Fig. 6, for all variants the HRF estimators adapted to clustered data consistently provide notable accuracy gains, even for larger values of s and less concentrated clusters. Additional experimental details are given in Sec. H.3.

Figure 5: Pointwise estimation of SM(x, y) for same-length 64-dimensional inputs (r = 1.0, r = 1.25 and r = 1.5) and various angles θ_{x,y}. We used s = 10000 estimated softmax values in each subplot. The true softmax value and the 5th and 95th quantiles of the estimated values are shown on the left y-axis, and the empirical relative errors are shown on the right y-axis. The trigonometric estimator and FAVOR+ applied 128 random features. To make the comparison fair, the hybrid variant used the configuration leading to a similar number of FLOPS per random feature map creation.

Figure 6: Comparison of different estimators for data with clustering structure. The empirical MSE is obtained by averaging over 20 trials. For the HRFs, we apply deterministic λ-coefficients. The fact that the empirical error for FAVOR+ is not perfectly monotonic in m was first observed in (Luo et al., 2021) (see: Fig. 1: (b)).

H.2 INSTANTIATING COMPLEX EXPONENTIAL ESTIMATORS FOR CLUSTERED DATA

Assume that the inputs x (queries) can be modeled by n_q clusters and the inputs y (keys) by n_k clusters (e.g. via the k-means clustering algorithm). Denote the centers of the query clusters by r_i ∈ R^d (i = 1, ..., n_q) and the centers of the key clusters by r_j ∈ R^d (j = 1, ..., n_k). Then there exist n_q n_k pairs (r_i, r_j), so we can construct n_q n_k softmax kernel estimators to estimate cross-group softmax kernel values.
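The clustering step itself is not prescribed by the method; k-means, as mentioned above, is one natural choice. A minimal sketch of this preprocessing stage (our own code and naming) is given below; the centers it returns are the inputs to the construction of the matrices A_{i,j} described next.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_centers(Q, K, n_q=2, n_k=2, seed=0):
    """Cluster queries Q (N_q x d) and keys K (N_k x d); return cluster centers
    plus per-point cluster assignments."""
    km_q = KMeans(n_clusters=n_q, n_init=10, random_state=seed).fit(Q)
    km_k = KMeans(n_clusters=n_k, n_init=10, random_state=seed).fit(K)
    return km_q.cluster_centers_, km_k.cluster_centers_, km_q.labels_, km_k.labels_

rng = np.random.default_rng(0)
Q = rng.standard_normal((1000, 50))
K = rng.standard_normal((1000, 50))
centers_q, centers_k, labels_q, labels_k = cluster_centers(Q, K)
# Each pair of centers (r_i, r_j) then gets its own base softmax-kernel estimator.
```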
Consider z = A r_i + (A^T)^{-1} r_j for an invertible matrix A ∈ C^{d×d}. An estimator based on this mechanism has variance equal to 0 if:

z = A r_i + (A^T)^{-1} r_j = 0.   (81)

From now on we constrain A to be diagonal, so A^T = A. We can rewrite the above equation as:

z = A r_i + A^{-1} r_j = 0.   (82)

Since position (k, k) of matrix A is a complex number of the form α_k + β_k i, we need to satisfy the following equation for each k = 1, ..., d:

(α_k + β_k i) r_{i,k} + (1/(α_k + β_k i)) r_{j,k} = 0,   (83)

(α_k + β_k i) r_{i,k} + ((α_k − β_k i)/(α_k^2 + β_k^2)) r_{j,k} = 0,   (84)

where r_{i,k} is the k-th entry of vector r_i. We can simplify this equation and separate it into its real and imaginary parts:

Re: (α_k^2 + β_k^2) α_k r_{i,k} + α_k r_{j,k} = 0,
Im: (α_k^2 + β_k^2) β_k r_{i,k} − β_k r_{j,k} = 0.   (85)

Our goal now is to choose values of α_k and β_k given r_i and r_j. If r_{i,k} r_{j,k} > 0, we can set:

α_k = 0,   β_k = sqrt(r_{j,k}/r_{i,k}).   (86)

And if r_{i,k} r_{j,k} < 0, we can set:

α_k = sqrt(−r_{j,k}/r_{i,k}),   β_k = 0.   (87)

When r_{i,k} = r_{j,k} = 0, we can take α_k = β_k = 1. If r_{i,k} = 0, r_{j,k} ≠ 0 or the opposite, then we cannot satisfy the above equation perfectly. We can take α_k to be some large positive value and set β_k = 0 when r_{i,k} = 0, r_{j,k} ≠ 0, and set α_k = β_k to some small positive value close to zero when r_{i,k} ≠ 0, r_{j,k} = 0.

When restricting A = diag(a_1, ..., a_d) to R^{d×d}, it is easily seen that to minimize ‖A r_i + (A^T)^{-1} r_j‖^2 we can take

a_k = sqrt(|r_{j,k}/r_{i,k}|)   (88)

if r_{i,k} ≠ 0. If r_{i,k} = 0 and r_{j,k} ≠ 0, a_k can be set to a large positive number, and we can set a_k = 1 if r_{i,k} = r_{j,k} = 0. In the analysis below we denote by A_{i,j} the matrix A calculated from Eq. 85 (complex case) or Eq. 88 (real case) for the given r_i and r_j. From the analysis in the main body of the paper we know that for arbitrary vectors x, y ∈ R^d:

exp(‖A_{i,j} x‖^2 / 2) exp(‖(A_{i,j})^{-1} y‖^2 / 2) SM(x, y) = E[exp(ω^T (A_{i,j} x + (A_{i,j})^{-1} y))].   (89)

Therefore, the softmax kernel can be estimated as:

SM(x, y) ≈ Ψ^m_{A_{i,j}}(x)^T Ψ^m_{(A_{i,j})^{-1}}(y),   (90)

with Ψ^m_{A_{i,j}}(x) and Ψ^m_{(A_{i,j})^{-1}}(y) given by:

Ψ^m_{A_{i,j}}(x) := (1/sqrt(m)) exp(−‖A_{i,j} x‖^2 / 2) (exp(ω_1^T A_{i,j} x), ..., exp(ω_m^T A_{i,j} x)),   (91)

Ψ^m_{(A_{i,j})^{-1}}(y) := (1/sqrt(m)) exp(−‖(A_{i,j})^{-1} y‖^2 / 2) (exp(ω_1^T (A_{i,j})^{-1} y), ..., exp(ω_m^T (A_{i,j})^{-1} y)),   (92)

for ω_1, ..., ω_m ∼ N(0, I_d). Such a softmax kernel estimator has variance equal to zero for x = r_i and y = r_j if we are using the complex matrix, and small variance if we are using the real matrix. For other pairs of vectors (x, y) close to r_i and r_j, the variance is close to zero or relatively small. We can also take as λ_{i,j}(x, y) the Gaussian λ-coefficient that attains its maximum value for x = r_i, y = r_j. Such a coefficient can be constructed with A_{i,j} in a similar way. It can be easily linearized since:

λ_{i,j}(x, y) = exp(−‖A_{i,j} x + (A_{i,j})^{-1} y‖^2 / (2τ^2)) = exp(−‖A_{i,j} x‖^2 / (2τ^2)) exp(−‖(A_{i,j})^{-1} y‖^2 / (2τ^2)) SM(x/τ, −y/τ),   (93)

where SM(x/τ, −y/τ) can be estimated with random feature maps similar to those in Eq. 90. Therefore, the Gaussian λ-coefficients can be estimated as:

λ_{i,j}(x, y) ≈ Ψ^{τ,m}_{A_{i,j}}(x)^T Ψ^{τ,m}_{(A_{i,j})^{-1}}(−y),   (94)

with Ψ^{τ,m}_{A_{i,j}}(x) and Ψ^{τ,m}_{(A_{i,j})^{-1}}(−y) given by:

Ψ^{τ,m}_{A_{i,j}}(x) = (1/sqrt(m)) exp(−‖A_{i,j} x‖^2 / τ^2) (exp(ω_1^T A_{i,j} x / τ), ..., exp(ω_m^T A_{i,j} x / τ)),   (95)

Ψ^{τ,m}_{(A_{i,j})^{-1}}(−y) = (1/sqrt(m)) exp(−‖(A_{i,j})^{-1} y‖^2 / τ^2) (exp(−ω_1^T (A_{i,j})^{-1} y / τ), ..., exp(−ω_m^T (A_{i,j})^{-1} y / τ)),   (96)

for ω_1, ..., ω_m ∼ N(0, I_d) chosen independently of those applied in the base estimators.
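The following sketch (our own code; the handling of degenerate coordinates follows the conventions described above) constructs the real diagonal matrix A_{i,j} of Eq. 88 for a pair of centers and builds the corresponding shifted positive feature maps of Eq. 91-92, which remain unbiased for SM(x, y) for arbitrary inputs while concentrating accuracy around the centers.

```python
import numpy as np

def diag_A_real(r_i, r_j, big=1e3):
    # Real diagonal A per Eq. 88: a_k = sqrt(|r_{j,k} / r_{i,k}|), with degenerate
    # coordinates handled as described above (assumed conventions).
    a = np.ones_like(r_i)
    nz = np.abs(r_i) > 1e-12
    a[nz] = np.sqrt(np.abs(r_j[nz] / r_i[nz]))
    a[~nz & (np.abs(r_j) > 1e-12)] = big
    return a

def psi_shifted(u, a, omega):
    # Positive random features shifted by the diagonal matrix A = diag(a) (Eq. 91);
    # for keys, pass a = 1/a to realize (A)^{-1} (Eq. 92).
    Au = a * u
    m = omega.shape[0]
    return np.exp(omega @ Au - 0.5 * np.dot(Au, Au)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 8, 50000
r_i, r_j = rng.standard_normal(d), rng.standard_normal(d)
a = diag_A_real(r_i, r_j)
omega = rng.standard_normal((m, d))
x, y = r_i, r_j                      # at the cluster centers the variance is smallest
approx = psi_shifted(x, a, omega) @ psi_shifted(y, 1.0 / a, omega)
print(approx, np.exp(x @ y))         # unbiased estimate of SM(x, y) = exp(x^T y)
```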
We summarize this section with the following two constructions of hybrid estimators for clustered data that are implied by the above analysis.

Definition H.1 (Hybrid Gaussian-Mixtures estimators for clustered data). Assume that the inputs x (queries) and y (keys) can be modeled by n_q and n_k clusters respectively, with centers r_i ∈ R^d (i = 1, ..., n_q) and r_j ∈ R^d (j = 1, ..., n_k). We denote by A_{i,j} the complex matrix satisfying Eq. (85) for the pair of centers (r_i, r_j). Furthermore, we denote by E = (ŜM_{i,j}(x, y))_{i=1,...,n_q; j=1,...,n_k} a list of estimators of the softmax kernel and by Λ = (λ̂_{i,j}(x, y))_{i=1,...,n_q; j=1,...,n_k} a list of estimators of the coefficients λ_{i,j}(x, y), constructed independently from E. We also use one additional base estimator (e.g. ŜM^{trig}(x, y) or ŜM^{++}(x, y)), denoted ŜM_0(x, y), with the corresponding λ-coefficient estimator λ̂_0(x, y). Then our hybrid estimator takes the following form:

ŜM_{E,Λ}(x, y) = Σ_{i=1}^{n_q} Σ_{j=1}^{n_k} λ̂_{i,j}(x, y) ŜM_{i,j}(x, y) + λ̂_0(x, y) ŜM_0(x, y),   (97)

with the constraint:

Σ_{i=1}^{n_q} Σ_{j=1}^{n_k} λ̂_{i,j}(x, y) + λ̂_0(x, y) = 1,   (98)

and where the estimators of the λ-coefficients and the base estimators are given as:

λ̂_{i,j}(x, y) = Ψ^{τ,m}_{A_{i,j}}(x)^T Ψ^{τ,m}_{(A_{i,j})^{-1}}(−y),   (99)

ŜM_{i,j}(x, y) = Ψ^m_{A_{i,j}}(x)^T Ψ^m_{(A_{i,j})^{-1}}(y),   (100)

for Ψ^m_{A_{i,j}}(x), Ψ^m_{(A_{i,j})^{-1}}(y), Ψ^{τ,m}_{A_{i,j}}(x), Ψ^{τ,m}_{(A_{i,j})^{-1}}(−y) given by Eq. (91, 92, 95, 96).

Definition H.2 (Hybrid Zero-One-Mixtures estimators for clustered data). Assume that the inputs x (queries) and y (keys) can be modeled by n_q and n_k clusters respectively, with centers r_i ∈ R^d (i = 1, ..., n_q) and r_j ∈ R^d (j = 1, ..., n_k). We denote by A_{i,j} the complex matrix satisfying Eq. (85) for the pair of centers (r_i, r_j). Furthermore, we denote by E = (ŜM_{i,j}(x, y))_{i=1,...,n_q; j=1,...,n_k} a list of estimators of the softmax kernel and by Λ = (λ̂_{i,j}(x, y))_{i=1,...,n_q; j=1,...,n_k} a list of estimators of the coefficients λ_{i,j}(x, y), constructed independently from E. Then our hybrid estimator takes the following form:

ŜM_{E,Λ}(x, y) = Σ_{i=1}^{n_q} Σ_{j=1}^{n_k} λ̂_{i,j}(x, y) ŜM_{i,j}(x, y),   (101)

with the constraint:

Σ_{i=1}^{n_q} Σ_{j=1}^{n_k} λ̂_{i,j}(x, y) = 1,   (102)

and where the estimators of the λ-coefficients and the base estimators are given as:

λ̂_{i,j}(x, y) = Ψ_i(x) Ψ_j(y),   (103)

ŜM_{i,j}(x, y) = Ψ^m_{A_{i,j}}(x)^T Ψ^m_{(A_{i,j})^{-1}}(y),   (104)

for Ψ^m_{A_{i,j}}(x), Ψ^m_{(A_{i,j})^{-1}}(y) given by Eq. (91, 92), and with Ψ_i(x) being a binary scalar indicating whether x belongs to the i-th of the n_q query clusters and, similarly, Ψ_j(y) a binary scalar indicating whether y belongs to the j-th of the n_k key clusters.

In our experiments we used the formulation from the second definition, with A constrained to be real diagonal; in other words, we use Eq. 88 to construct A.
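A minimal sketch of this Zero-One-Mixtures variant is given below; it reuses the helper functions `diag_A_real` and `psi_shifted` from the sketch above and assumes that cluster labels for x and y are available from the clustering step (all names are ours).

```python
import numpy as np

def hybrid_zero_one_estimate(x, y, centers_q, centers_k, label_x, label_y, m, rng):
    """Definition H.2 with real diagonal A (Eq. 88): the indicator lambda-coefficients
    select the single base estimator built for the pair of clusters that (x, y) fall into."""
    i, j = label_x, label_y                      # Psi_i(x) * Psi_j(y) = 1 only for this pair
    a = diag_A_real(centers_q[i], centers_k[j])  # helper from the previous sketch
    omega = rng.standard_normal((m, x.shape[0]))
    return psi_shifted(x, a, omega) @ psi_shifted(y, 1.0 / a, omega)
```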
H.3 ADDITIONAL EXPERIMENTAL DETAILS

To create the synthetic data, we first generate two random vectors X_0, Y_0 ∈ R^50. Then we generate four random orthogonal matrices O_{i,j} ∈ R^{50×50}, where i = 1, 2 and j = 1, 2, and use X_i = O_{i,1} X_0 and Y_i = O_{i,2} Y_0 for i = 1, 2 as the mean vectors of our Gaussian clusters. For each pair of clusters, the minimal value s of Eq. 21 may be non-zero if sign(x_{i,k}) = sign(y_{j,k}) for some k, due to the use of real diagonal matrices. To control the value of s, we manually adjust the sign of each element of X_i and Y_i. The data for input x is the combination of two clusters, each consisting of 1000 data points following a Gaussian distribution with mean vector X_1 or X_2 and common covariance matrix σ^2 I; a similar construction is used for input y. We create four synthetic data sets with different values of s and different magnitudes of the standard deviation σ. After this step, we normalize the data points by controlling their ℓ2-norms, so that the true softmax values lie within a reasonable range. After the data is created, we first use the k-means clustering algorithm with k = 2 to cluster x and y, and we obtain centers (x̄_i, ȳ_j), i, j ∈ {1, 2}. Then we use the set of real diagonal matrices minimizing Eq. 21 for all pairs of centers of x and y to generate four complex exponential base estimators. To choose the λ-coefficients, we use the more efficient approach described above, i.e. the indicator vectors, which are already computed in the clustering process. For each data set, we compare the performance of the estimators by calculating their empirical mean squared error, and we do this multiple times with different numbers of random features. All results are obtained by averaging over 20 repetitions. In addition to the average performance, shown in Fig. 6, we also record the maximal and minimal empirical MSE over the 20 repetitions for all estimators in Table 3, where s_1 = [0.08, 0.05, 0.04, 0.03], s_2 = [0.11, 0.09, 0.08, 0.07], s_3 = [0.05, 0.03, 0.03, 0.02], s_4 = [0.07, 0.06, 0.05, 0.04] are the lists of s-values for all pairs of clusters in the different synthetic data sets.

Data set         #RF   Hybrid-Cluster      FAVOR+              Trigonometric
s = s_1, σ = 1    20   [0.0016, 0.0183]    [0.0071, 0.1455]    [0.0039, 0.0808]
                  50   [0.0011, 0.0081]    [0.0026, 0.2541]    [0.0031, 0.0510]
                  80   [0.0005, 0.0052]    [0.0032, 0.0282]    [0.0010, 0.0281]
                 100   [0.0006, 0.0036]    [0.0022, 0.0528]    [0.0013, 0.0170]
                 120   [0.0004, 0.0025]    [0.0028, 0.0534]    [0.0012, 0.0135]
                 150   [0.0003, 0.0020]    [0.0014, 0.0582]    [0.0007, 0.0134]
                 200   [0.0002, 0.0010]    [0.0013, 0.0205]    [0.0004, 0.0087]
s = s_2, σ = 1    20   [0.0032, 0.0265]    [0.0095, 0.2071]    [0.0036, 0.0989]
                  50   [0.0012, 0.0079]    [0.0036, 0.2123]    [0.0019, 0.0565]
                  80   [0.0007, 0.0059]    [0.0037, 0.0570]    [0.0008, 0.0208]
                 100   [0.0007, 0.0040]    [0.0016, 0.0280]    [0.0010, 0.0164]
                 120   [0.0005, 0.0025]    [0.0013, 0.0262]    [0.0007, 0.0246]
                 150   [0.0004, 0.0024]    [0.0010, 0.0407]    [0.0004, 0.0190]
                 200   [0.0003, 0.0018]    [0.0018, 0.0236]    [0.0004, 0.0134]
s = s_3           20   [0.0083, 0.0300]    [0.0362, 0.1218]    [0.0174, 0.0490]
                  50   [0.0046, 0.0183]    [0.0125, 0.1262]    [0.0081, 0.0249]
                  80   [0.0029, 0.0105]    [0.0108, 0.0281]    [0.0044, 0.0143]
                 100   [0.0024, 0.0056]    [0.0077, 0.0421]    [0.0041, 0.0095]
                 120   [0.0020, 0.0046]    [0.0077, 0.0402]    [0.0034, 0.0077]
                 150   [0.0015, 0.0041]    [0.0049, 0.0395]    [0.0025, 0.0073]
                 200   [0.0011, 0.0031]    [0.0037, 0.0150]    [0.0018, 0.0050]
s = s_4           20   [0.0097, 0.0294]    [0.0332, 0.1553]    [0.0179, 0.0485]
                  50   [0.0049, 0.0172]    [0.0176, 0.1345]    [0.0058, 0.0260]
                  80   [0.0031, 0.0113]    [0.0113, 0.0415]    [0.0038, 0.0110]
                 100   [0.0028, 0.0061]    [0.0080, 0.0292]    [0.0031, 0.0100]
                 120   [0.0022, 0.0045]    [0.0065, 0.0185]    [0.0028, 0.0128]
                 150   [0.0016, 0.0044]    [0.0045, 0.0235]    [0.0021, 0.0094]
                 200   [0.0012, 0.0031]    [0.0046, 0.0166]    [0.0015, 0.0067]

Table 3: Minimal and maximal empirical MSE (reported as [min, max]) over 20 repetitions.

I LANGUAGE MODELING TRAINING DETAILS AND ADDITIONAL RESULTS

For the language modeling tasks, we trained 2-layer LSTMs with hidden sizes 200 and 650 on the Penn Tree Bank (Marcus et al., 1993) and WikiText-2 (Merity et al., 2017) datasets respectively. We tied the weights of the word embedding layer and the decoder layer (Inan et al., 2017; Press & Wolf, 2017). Thus we could treat the language modeling problem as minimizing the cross-entropy loss between the dot products of the model outputs (queries) and the class embeddings (keys) obtained from the embedding layer, and the target words. We now present detailed statistics of HRFs on the Penn Tree Bank dataset. The means and standard deviations of the statistical metrics reported in Fig. 3, Sec. 4.2, are shown in Table 4.
The distributions of the lengths of the keys and the queries are shown in Figure 7. For our experiments with the Angular Hybrid and the Gaussian Hybrid, the numbers of random features are 8, 16, 32, 64 for the base estimators and 8 for the coefficient estimators.

Table 4: Results are computed over 10 runs on the Penn Tree Bank dataset. Boldface denotes the best result and underline the second best. The negative fraction for FAVOR+ is not reported since it produces positive random features.

1d-Wasserstein
RF type    d = 64             d = 128            d = 256             d = 512
FAVOR+     3171.35 ± 69.04    3142.65 ± 85.55    3050.58 ± 108.25    2985.18 ± 73.15
Trig.      1577.44 ± 13.59    1582.50 ± 13.42    1574.22 ± 7.01      1576.50 ± 5.83
Gaussian   1520.32 ± 11.65    1521.10 ± 8.79     1516.50 ± 6.66      1515.41 ± 5.96
Angular    1395.03 ± 79.91    1445.24 ± 58.27    1394.97 ± 62.73     1425.88 ± 65.82
KS statistics
FAVOR+     0.573 ± 0.0073     0.569 ± 0.0098     0.558 ± 0.013       0.551 ± 0.0092
Trig.      0.424 ± 0.0040     0.427 ± 0.0041     0.425 ± 0.0024      0.424 ± 0.0023
Gaussian   0.411 ± 0.0029     0.410 ± 0.0019     0.410 ± 0.0013      0.410 ± 0.0014
Angular    0.404 ± 0.017      0.419 ± 0.010      0.418 ± 0.0089      0.414 ± 0.0093
Negative fraction
Trig.      0.257 ± 0.0057     0.258 ± 0.0039     0.258 ± 0.0021      0.257 ± 0.0017
Gaussian   0.155 ± 0.0039     0.159 ± 0.0019     0.157 ± 0.0011      0.151 ± 0.0009
Angular    0.152 ± 0.0041     0.158 ± 0.0027     0.153 ± 0.0018      0.157 ± 0.0013

Figure 7: Distribution of the lengths of the keys and queries in the Penn Tree Bank dataset.

Since the HRF mechanisms can be sensitive to the lengths of the keys and the queries, for the language modeling task on the WikiText-2 dataset we added a regularizer constraining the lengths of the keys and the queries to be close to 1. However, constraining the lengths to be close to 1 hurts model performance, so we chose to additionally use temperature scaling as in (Rawat et al., 2019): before passing the computed dot products to the softmax layer, we scaled them by τ > 1. Our final loss function for the language modeling task was:

L(θ) = L_CE + λ_1 E(‖q‖_2 − 1)^2 + λ_2 E(‖k‖_2 − 1)^2,   (105)

where ‖·‖_2 is the row-wise L2-norm of the appropriate matrices and L_CE is the cross-entropy loss. The distributions of the lengths of the keys and the queries are shown in Figure 9.

Finally, we present some additional results on the WikiText-2 dataset. For these experiments we chose λ_1 = λ_2 = 2 and τ = 6. The model trained with these hyperparameters achieves a perplexity score of 105.35 on the test set after 50 epochs. We compute the 1-dimensional Wasserstein distance and the Kolmogorov-Smirnov (KS) metric (Ramdas et al., 2017) between the approximated softmax distribution and the true softmax distribution. The means and standard deviations of the statistical metrics reported in Fig. 8 are shown in Table 5. For our experiments with the Angular Hybrid and the Gaussian Hybrid, the numbers of random features are 8, 16, 32, 64 for the base estimators and 8 for the coefficient estimators.

Figure 8: Statistical metrics measuring softmax matrix approximation quality on the WikiText-2 dataset. For standard estimators, the numbers of random features are 64, 128, 256, 512. To make the comparison fair, the hybrid variants used configurations leading to a similar number of FLOPS per random feature map creation. Results are reported over 10 runs.
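The exact evaluation protocol for these metrics is not spelled out above; one natural reading, sketched below with SciPy's implementations (our own code and naming), treats a row of true softmax probabilities and its RF-based approximation as two empirical samples. The clipping of negative approximated values before normalization is our choice, made only so that the approximation forms a valid distribution.

```python
import numpy as np
from scipy.stats import wasserstein_distance, ks_2samp

def softmax_approximation_metrics(true_scores, approx_numerators):
    """1d-Wasserstein distance and KS statistic between the true softmax distribution
    of a row of attention logits and its (possibly noisy) RF-based approximation."""
    p_true = np.exp(true_scores - true_scores.max())
    p_true /= p_true.sum()
    p_approx = np.clip(approx_numerators, 1e-12, None)  # approximations may be negative
    p_approx /= p_approx.sum()
    return wasserstein_distance(p_true, p_approx), ks_2samp(p_true, p_approx).statistic

rng = np.random.default_rng(0)
logits = rng.standard_normal(128)                       # stand-in attention logits
approx = np.exp(logits - logits.max()) + 0.01 * rng.standard_normal(128)
print(softmax_approximation_metrics(logits, approx))
```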
Table 5: Results are computed over 10 runs on the WikiText-2 dataset. Boldface denotes the best result and underline the second best.

1d-Wasserstein
RF type    d = 64               d = 128              d = 256             d = 512
FAVOR+     12970.38 ± 119.49    12971.67 ± 108.79    12965.82 ± 89.62    12954.17 ± 83.95
Trig.      12913.66 ± 59.22     12912.65 ± 57.42     12910.91 ± 37.23    12913.25 ± 35.12
Gaussian   12809.03 ± 97.2      12805.71 ± 78.91     12808.02 ± 84.57    12809.40 ± 65.88
Angular    12833.56 ± 41.87     12765.44 ± 38.91     12561.26 ± 26.48    12534.48 ± 25.88
KS statistics
FAVOR+     0.687 ± 0.0043       0.687 ± 0.0051       0.687 ± 0.0019      0.686 ± 0.0067
Trig.      0.685 ± 0.0071       0.685 ± 0.0053       0.685 ± 0.0029      0.685 ± 0.0014
Gaussian   0.683 ± 0.0092       0.683 ± 0.0073       0.683 ± 0.0049      0.682 ± 0.0089
Angular    0.682 ± 0.0079       0.682 ± 0.0066       0.682 ± 0.0037      0.678 ± 0.0064

The above tables (Table 4 and Table 5) show that our hybrid variants consistently outperform FAVOR+ and the trigonometric random features.

I.1 LANGUAGE MODELING USING HRF

In this subsection we describe how one can use HRFs to train an LSTM on the language modeling task on the Penn Tree Bank dataset, similar to the experiments carried out in (Rawat et al., 2019).

Figure 9: Distribution of the lengths of the keys and queries in the WikiText-2 dataset.

We trained a 2-layer LSTM model with hidden and output sizes of 200, and used the output as the input embedding for sampled softmax. We sampled 40 negative (i.e. other than the true) classes out of 10000 classes to approximate the expected loss in each training epoch. As observed in Fig. 2, the relative error of all three estimators grows exponentially fast with the embedding norm r. Therefore, if we keep un-normalized embeddings during sampling, then even though we could get an unbiased estimate of the loss, its variance could be high. Such a bias-variance trade-off is also mentioned in (Rawat et al., 2019). To address this issue, we used normalized input and class embeddings during sampling to obtain a biased sampling distribution with lower variance, while keeping un-normalized embeddings to calculate the loss. We trained our model for 80 epochs, with batch size equal to 20 and dropout ratio in the LSTM equal to 0.5. Implementation details can be found in our GitHub repository.

Figure 10: Language modeling with softmax sampling: training results on the Penn Tree Bank dataset (the 50 to 80 epoch window is zoomed in on the right subfigure). The solid line for each method is estimated over 25 independent runs. The shaded areas represent perplexity within 1 standard deviation of the average. The FAVOR+/trigonometric mechanisms used 1024 random features. To make the comparison fair, for HRFs the configurations leading to a similar number of FLOPS per random feature map creation were applied; 128 and 8 random features are used to estimate the softmax estimators and the λ-coefficient in HRFs, respectively. Even though the statistical metrics for HRFs are better in Fig. 3, the difference between the different random feature estimators is not very significant.

We compared our hybrid estimator with the trigonometric estimator, FAVOR+, the uniform method, which samples 40 negative classes with equal probability, as well as the sampled softmax method, which uses the unbiased true probability calculated from the full softmax as the sampling probability. The trigonometric and FAVOR+ mechanisms used 1024 random features. To make the comparison fair, the hybrid variants used (m, n)-configurations characterized by a similar number of FLOPS needed to construct their random features as for the regular RF-estimators. We report the comparison results, averaged over 25 independent runs, in Fig. 10. The perplexity score in Fig. 10 is defined as 2^{cross-entropy loss}.
We conclude that even though the statistical metrics for HRFs are better in Fig. 3, the difference between HRFs and the other random feature estimators in this specific softmax-sampling downstream task is not very significant.

J SPEECH EXPERIMENTS: ADDITIONAL DETAILS

The tested Conformer-Performer models consisted of l = 17 conformer layers. Each attention layer used H = 8 heads. The embedding dimensionality was p = 512 and, since the dimensions were split equally among the different heads, the query/key dimensionality was set to d_QK = 64. Each input sequence was of length L ≤ 500. A padding mechanism was applied for all tested variants.

K DOWNSTREAM ROBOTICS EXPERIMENTS

In both robotics experiments we found the query/key L2-normalization technique (queries and keys of L2-norm equal to one) particularly effective, thus we applied it in all the experiments.

K.1 VISUAL LOCOMOTION WITH QUADRUPED ROBOTS

We evaluate HRF-based attention RL policies on the robotic task of learning locomotion on uneven terrains from vision input. We use the Laikago quadruped from Unitree (lai). It has 12 actuated joints, 3 per leg. The task is to walk forward on a randomized uneven terrain, which requires careful foot-placement planning based on visual feedback. The ground is made of a series of stepstones with gaps in between. The stepstone widths are fixed at 50 cm, the lengths are between 50 and 80 cm, and the gap size between adjacent stones is between 10 and 20 cm. The robot perceives the ground through 2 depth cameras attached to its body, one on the front and the other on the belly facing downwards. We use the Implicit Attention Policy (IAP) architecture (masking variant) described in Choromanski et al. (2021a), which uses a Performer-based attention mechanism to process 32 × 24 depth images from the 2 cameras. We apply a hierarchical setup to solve this task. The high level uses IAP-rank with masking to process the camera images and outputs the desired foot-placement position. The low level employs a position-based swing-leg controller and a model-predictive-control (MPC) based stance-leg controller to achieve the foot placement decided by the high level. The policies are trained with evolutionary strategies (ES). In Fig. 4 (left) we compare FAVOR+ with the angular hybrid approximation in IAP. HRF with 8 × 8 random projections (m = n = 8) performs as well as the softmax kernel with 256 random projections. A series of image frames along an episode of a learned locomotion HRF-based IAP policy is shown in the bottom right of Fig. 4. In the top-left corner of the images, the input camera images are attached. The red part of the camera image is the area masked out by self-attention. The policy learns to pay attention to the gaps in the scene in order to avoid stepping into them. The training curves in Fig. 4 (left) are obtained by averaging over the 5 top runs. The shaded regions show the standard deviation over multiple runs.

K.2 ROBOTIC MANIPULATION: BI-MANUAL SWEEPING

Here we consider the bi-manual sweep robotic-arm manipulation task. The two-robotic-arm system needs to solve the task of placing red balls into green bowls (all distributions are equally rewarded, so in particular the robot can just use one bowl). In this bi-manual sweeping task, the reward per step is between 0 and 1: the fraction of the blocks within the green bowls.
The loss for this task is the negative log-likelihood between the normalized softmax probability distribution over sampled actions (one positive example and all other uniformly sampled negative counter-examples) and the ground-truth one-hot probability distribution for a given observation from the training dataset. For more details see (van den Oord et al., 2018). This is a 12-DoF Cartesian control problem; the state observation consists of the current Cartesian poses of the arms and a 500 × 500 × 3 RGB image. An example of the camera image is given in Fig. 11. The training curves in Fig. 4 (right) are obtained by averaging over 3 learning rates: 10^{-3}, 10^{-4}, 10^{-5}.

Figure 11: An example of the camera image for the bi-manual sweep robotic-arm manipulation task.

L COMPUTATIONAL GAINS OF HYBRID ESTIMATORS

Consider the general hybrid estimator of SM(x, y) from Equation 8, using p + 1 base estimators. Assume that the k-th base estimator (k = 1, ..., p + 1) applies m random features in each of the corresponding t_k random maps, and that n random features are used in each of the corresponding l_{λ_k} random maps approximating the k-th λ-coefficient (k = 1, ..., p). From Equation 11, we conclude that the time complexity of constructing Ψ(z) is of the order (t_1 + ... + t_{p+1}) md + (l_{λ_1} + ... + l_{λ_p}) nd + mn. Thus the resulting Θ(mn)-dimensional random feature map Ψ(z) can be constructed in time O(nd + md + mn), as opposed to O(mnd) as is the case for the regular estimator, providing substantial computational gains. This is true as long as the Θ(mn)-dimensional HRFs provide quality similar to that of their mn-dimensional regular counterparts. It turns out that this is often the case, see: Section D.4.
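To make the computational comparison concrete, the sketch below (our own simplification with p = 1 and t_1 = t_2 = l_{λ_1} = 1, and with cos(·) as a stand-in for the actual feature nonlinearities of Equations 10-11) contrasts the two ways of producing an mn-dimensional feature vector: combining an n-dimensional λ-map with an m-dimensional base map via an outer product, at cost O(nd + md + mn) per input, versus drawing mn independent Gaussian projections, at cost O(mnd) per input.

```python
import numpy as np

def hybrid_features(z, W_base, W_lam):
    """Hybrid-style construction: one m-dim base map and one n-dim lambda map,
    combined by an outer product into an mn-dim vector.
    Cost per input: O(md + nd + mn)."""
    base = np.cos(W_base @ z)           # stand-in for an m-dimensional base feature map
    lam = np.cos(W_lam @ z)             # stand-in for an n-dimensional lambda feature map
    return np.outer(lam, base).ravel()  # mn-dimensional feature vector

def regular_features(z, W_big):
    """Regular construction: mn independent random projections.
    Cost per input: O(mnd)."""
    return np.cos(W_big @ z)

rng = np.random.default_rng(0)
d, m, n = 64, 128, 8
z = rng.standard_normal(d)
phi_hybrid = hybrid_features(z, rng.standard_normal((m, d)), rng.standard_normal((n, d)))
phi_regular = regular_features(z, rng.standard_normal((m * n, d)))
assert phi_hybrid.shape == phi_regular.shape == (m * n,)
```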