Simplex Random Features

Isaac Reid¹, Krzysztof Choromanski*²³, Valerii Likhosherstov¹, Adrian Weller*¹⁴

*Equal senior co-leads. ¹University of Cambridge, ²Google, ³Columbia University, ⁴Alan Turing Institute. Correspondence to: Isaac Reid, Krzysztof Choromanski.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

¹Code is available at https://github.com/isaac-reid/simplex_random_features.

We present Simplex Random Features (SimRFs), a new random feature (RF) mechanism for unbiased approximation of the softmax and Gaussian kernels by geometrical correlation of random projection vectors. We prove that SimRFs provide the smallest possible mean square error (MSE) on unbiased estimates of these kernels among the class of weight-independent geometrically-coupled positive random feature (PRF) mechanisms, substantially outperforming the previously most accurate Orthogonal Random Features (ORFs, Yu et al., 2016) at no observable extra cost. We present a more computationally expensive SimRFs+ variant, which we prove is asymptotically optimal in the broader family of weight-dependent geometrical coupling schemes (which permit correlations between random vector directions and norms). In extensive empirical studies, we show consistent gains provided by SimRFs in settings including pointwise kernel estimation, nonparametric classification and scalable Transformers (Choromanski et al., 2020).¹

1. Introduction

Embedding methods, which project feature vectors into a new space, are ubiquitous in machine learning. The canonical example is the Johnson-Lindenstrauss Transform (JLT) (Johnson, 1984; Dasgupta et al., 2010; Kane & Nelson, 2014; Kar & Karnick, 2012), where a collection of high-dimensional points is embedded in a much lower-dimensional space whilst (approximately) preserving their metric relationships, e.g. distances and dot-products. Another application is found in kernel approximation (Liu et al., 2022; Yang et al., 2014; Pennington et al., 2015; Li et al., 2010), where the nonlinear similarity measure (kernel) in the original space is translated to a linear kernel in the latent space. For example, a kernel $K(\cdot,\cdot): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ can be approximated using so-called random features (RFs): randomised nonlinear transformations $\phi(\cdot): \mathbb{R}^d \to \mathbb{R}^l$ constructed such that $K(x, y) = \mathbb{E}[\widehat{K}(x, y)]$, where

$$\widehat{K}(x, y) \stackrel{\mathrm{def}}{=} \phi(x)^\top \phi(y). \quad (1)$$

Provided $K$ is stationary, meaning $K(x, y) = K(x - y)$, we can use Bochner's theorem to write

$$K(x, y) = \int_{\mathbb{R}^d} p(w)\, e^{i w^\top (x - y)}\, \mathrm{d}^d w, \quad (2)$$

where $p(w)$ is the Fourier transform of $K$. If $K$ is positive semidefinite, $p(w)$ is non-negative so we can treat it as a probability density. This invites Monte Carlo (MC) sampling, yielding Random Fourier Features (RFFs) of the following form, where vectors $w_i$ are sampled from $p(w)$, $m$ is their number and $\oplus$ denotes concatenation (Rahimi & Recht, 2007; 2008):

$$\phi_{\mathrm{RFF}}(z) \stackrel{\mathrm{def}}{=} \sqrt{\tfrac{1}{m}} \left( \oplus_{i=1}^{m} \left[ \sin(w_i^\top z), \cos(w_i^\top z) \right] \right)^\top. \quad (3)$$

Furthermore, if $K$ is a Gaussian kernel, defined by

$$K_{\mathrm{gauss}}(x, y) \stackrel{\mathrm{def}}{=} \exp\left( -\frac{\|x - y\|_2^2}{2} \right), \quad (4)$$

random vectors $w_i$ are sampled from the multivariate Gaussian distribution $\mathcal{N}(0, I_d)$. Another kernel, of key interest in Transformer architectures (Vaswani et al., 2017; Choromanski et al., 2020), is the so-called softmax kernel:

$$K_{\mathrm{smax}}(x, y) \stackrel{\mathrm{def}}{=} \exp(x^\top y). \quad (5)$$
Since $K_{\mathrm{gauss}}(x, y) = K_{\mathrm{smax}}(x, y) \exp\left(-\frac{\|x\|_2^2 + \|y\|_2^2}{2}\right)$, RF mechanisms for the Gaussian kernel can be readily converted into the corresponding mechanism for softmax and vice versa (Likhosherstov et al., 2022). Our results will hence apply to both settings. For brevity, we will mostly refer to $K_{\mathrm{gauss}}$.

However, as noted in (Choromanski et al., 2020), RFFs lead to unstable training of implicit linear-attention Transformers. The authors address this by proposing Positive Random Features (PRFs), defined by

$$K_{\mathrm{gauss}}(x, y) = \mathbb{E}\left[\phi_{\mathrm{PRF}}(x)^\top \phi_{\mathrm{PRF}}(y)\right], \quad (6)$$

where for $w_1, \dots, w_m \sim \mathcal{N}(0, I_d)$,

$$\phi_{\mathrm{PRF}}(z) \stackrel{\mathrm{def}}{=} \sqrt{\tfrac{1}{m}}\, \exp\left(-\|z\|_2^2\right) \left( \oplus_{i=1}^{m} \left[ \exp(w_i^\top z) \right] \right)^\top. \quad (7)$$

The straightforward implementation of PRFs (and RFFs) draws $w_i$ independently, a strategy we refer to as IIDRFs. However, the isotropy of the Gaussian distribution permits us to entangle different $w_i$ to be exactly orthogonal² whilst preserving the Gaussian marginal distributions $w_i \sim \mathcal{N}(0, I_d)$ (Yu et al., 2016). This mechanism is referred to as Orthogonal Random Features (ORFs), and is an example of a weight-independent geometrically-coupled RF mechanism.

Definition 1.1. Consider the random vectors $\{w_i \,|\, i \le m\} \subset \mathbb{R}^d$, which can be described by norms $w_i = \|w_i\|_2$ and directions $\widehat{w}_i = \frac{w_i}{\|w_i\|_2}$. An RF mechanism is described as geometrically-coupled if the norms of random vectors $\{w_i\}$ are independent, but the directions $\{\widehat{w}_i\}$ are permitted to be correlated with one another and with the norms $\{w_i\}$. Such a coupling is weight-independent under the further restriction that directions $\{\widehat{w}_i\}$ are independent of the norms $\{w_i\}$. Unless otherwise stated, all coupling mechanisms considered in this work will be geometrical.

ORFs provide a lower mean squared error (MSE) on Gaussian kernel approximation than IIDRFs (Yu et al., 2016; Choromanski et al., 2020), though for RFFs only at asymptotically large $d$. ORFs are used in a broad range of applications including kernel ridge regression and Transformers. In the latter case, they offer linear (cf. quadratic) space- and time-complexity of the attention module, enabling efficient long-range attention modelling as part of the so-called Performer architecture (Choromanski et al., 2020). Sec. 2 details further applications beyond Gaussian and softmax kernel estimation. Recently, Likhosherstov et al. (2022) showed that further MSE reduction (for fixed $m$ and preserving unbiasedness) can be achieved by collecting light data statistics. RFs can also be applied with more computationally expensive preprocessing to improve accuracy in downstream tasks (Trokicic & Todorovic, 2019), but they no longer approximate the Gaussian kernel. However, the following question remains open: do ORFs provide the lowest possible MSE on unbiased estimates of the Gaussian kernel among the class of weight-independent geometrically-coupled PRF mechanisms?

²All $w_i$ can be orthogonal if $m \le d$. If $m > d$ we construct ensembles of independent orthogonal blocks.

Figure 1. Schematic of the performance of the RF mechanisms described in this manuscript, from IIDRFs to ORFs to SimRFs (weight-independent geometrical coupling) to SimRFs+ (weight-dependent geometrical coupling). SimRFs and SimRFs+ are novel.

Here, we comprehensively answer this question, finding that ORFs are not optimal. We derive the optimal mechanism, coined Simplex Random Features (SimRFs), and show that it substantially outperforms ORFs at close to no extra computational cost.
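Before detailing these mechanisms, it is convenient to fix the basic PRF estimator in code. The following minimal numpy sketch of Eqs. 6-7 is our own illustration (names and structure are ours, not those of the released codebase); the coupling schemes studied in this paper only change how the rows of $W$ are drawn, not the feature map itself.

```python
import numpy as np

def phi_prf(X, W):
    """Positive random features for the Gaussian kernel (Eq. 7).

    X: (n, d) inputs; W: (m, d) projection vectors, marginally N(0, I_d).
    Returns (n, m) features with E[phi(x) . phi(y)] = exp(-||x - y||^2 / 2)."""
    sq_norms = np.sum(X ** 2, axis=1, keepdims=True)      # ||z||_2^2 for each row
    return np.exp(X @ W.T - sq_norms) / np.sqrt(W.shape[0])

rng = np.random.default_rng(0)
d, m, n_draws = 16, 16, 20_000
x, y = rng.normal(size=(1, d)) / 4, rng.normal(size=(1, d)) / 4

estimates = []
for _ in range(n_draws):
    W = rng.normal(size=(m, d))                           # IIDRFs: independent Gaussian rows
    estimates.append((phi_prf(x, W) @ phi_prf(y, W).T).item())

print(np.mean(estimates))                                 # approaches the exact kernel value
print(np.exp(-0.5 * np.sum((x - y) ** 2)))                # exp(-||x - y||^2 / 2)
```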
We also consider the broader family of weight-dependent geometrically-coupled PRFs, where random vector directions $\{\widehat{w}_i\}$ can be correlated with norms $\{w_i\}$, and present a SimRFs+ variant which we prove is asymptotically optimal in this more general class. Our empirical studies demonstrate the consistent gains provided by SimRFs in diverse settings, including pointwise kernel estimation, nonparametric classification and scalable Transformers (Choromanski et al., 2020).

In more detail, our principal contributions are as follows:

1. In Sec. 3, we introduce SimRFs and prove that they provide the lowest kernel estimator MSE of any weight-independent geometrically-coupled PRF mechanism, outperforming the previously most accurate ORFs. We demonstrate that a fast, simple scheme applying minor alterations to SimRFs yields SimRFs+: a marginally better weight-dependent mechanism. See Fig. 1.

2. In Sec. 4, we provide novel theoretical results to add insight to the discussion in Sec. 3. They may be of independent interest. We derive the first non-asymptotic closed-form formulae for the MSE for PRFs in the IIDRF, ORF and SimRF settings, and show how it is straightforward to generalise some of these forms to RFFs. This allows us to precisely quantify how much the kernel estimator MSE can be suppressed by geometrical coupling. We also compare the time- and space-complexities of the different PRF mechanisms and describe a faster, approximate implementation.

3. In Sec. 5, we support our theoretical results with comprehensive experiments, demonstrating the superiority of SimRFs over ORFs and IIDRFs. We empirically confirm that they offer lower kernel estimator MSE, and find that this translates to better downstream performance in nonparametric classification tasks (Sec. 5.3) and scalable Transformers (Sec. 5.4).

Proofs not provided in the main body are in Appendix A.

2. Related Work

The literature on structured RFs, where random vectors are conditionally dependent, is extensive (Ailon & Chazelle, 2009; Liberty et al., 2011; Ailon & Liberty, 2013; Le et al., 2013; Yu et al., 2017). ORFs were first proposed for nonlinear kernel estimation in (Yu et al., 2016), where the authors derived strict asymptotic gains from ORFs compared to IIDRFs when using RFFs for Gaussian kernel approximation. We refer to this phenomenon (the suppression of kernel estimator MSE when random features are conditioned to be orthogonal) as the orthogonality gap. Further progress towards an understanding of the orthogonality gap was provided in (Choromanski et al., 2018), where the authors introduced and studied the so-called charm property of stationary kernels. However, a rigorous mathematical analysis in the non-asymptotic setting remained out of reach. In (Choromanski et al., 2017), the authors showed the superiority of ORFs over IIDRFs for angular kernel estimation in any $d$ (not just asymptotic) and conducted an extensive analysis of the linear (dot-product) kernel, but they did not address stationary kernels. The authors of (Lin et al., 2020) used the lens of determinantal point processes and the negative dependence property (Kulesza & Taskar, 2012) to explore the efficacy of ORFs.
ORFs are used with PRFs in Performers (Choromanski et al., 2020; Schlag et al., 2021; Luo et al., 2021; Likhosherstov et al., 2021; Chowdhury et al., 2021; Xiao et al., 2022): a recently-proposed class of efficient Transformer (Kitaev et al., 2020; Roy et al., 2021) that can be applied to ultra-long sequences or to expedite inference on regular-size sequences.

3. Simplex Random Features (SimRFs)

In this section, we describe our core contributions. We begin by presenting Simplex Random Features (SimRFs). In analogy to the square orthogonal block, we define the so-called simplex block, consisting of $d$ $d$-dimensional random vectors $\{w_i \,|\, i \le d\}$. In practical applications where $m > d$ random features are needed, multiple simplex blocks are constructed independently. Instead of being orthogonal, the rows of the simplex block point towards the vertices of a $(d-1)$-dimensional simplex embedded in $d$-dimensional space, subtending angles $\theta = \arccos(-\frac{1}{d-1})$. The entire simplex (or, equivalently, the vector it operates on) is randomly rotated to preserve isotropy, and the rows are independently renormalised by weights $w_i \sim \chi_d$ such that they are marginally Gaussian. Explicitly, we define the simplex block $W_{\mathrm{simp}} \in \mathbb{R}^{d \times d}$ by

$$W_{\mathrm{simp}} = \mathbf{D}\mathbf{S}\mathbf{R}, \quad (8)$$

where $\mathbf{D} = \mathrm{diag}(w_i) \in \mathbb{R}^{d \times d}$ with $w_i$ sampled from a $\chi_d$-distribution. $\mathbf{R} \in \mathbb{R}^{d \times d}$ is a random orthogonal matrix drawn from Haar measure on $O(d)$, the group of orthogonal matrices in $\mathbb{R}^{d \times d}$, constructed e.g. by Gram-Schmidt orthogonalisation of an unstructured Gaussian matrix (Yu et al., 2016). The rows $s_i$ of the simplex projection matrix $\mathbf{S} \in \mathbb{R}^{d \times d}$ are given by the unit vectors

$$s_i = \begin{cases} \sqrt{\frac{d}{d-1}}\, e_i - \frac{\sqrt{d}+1}{(d-1)^{3/2}}\, (1, \dots, 1, 0)^\top & \text{for } 1 \le i < d, \\ \frac{1}{\sqrt{d-1}}\, (1, 1, \dots, 1, 0)^\top & \text{for } i = d, \end{cases} \quad (9)$$

which are manifestly normalised and subtend obtuse angles. Fig. 2 visualises the different geometrical couplings of IIDRFs, ORFs and SimRFs in low data dimensionality $d$.

Figure 2. Schematic of different geometrical couplings (IIDRFs, ORFs, SimRFs) for small $d$. Dotted lines have a component into the plane of the paper, thick lines have a component out, and ⊙ is purely out (i.e. perpendicular to the paper's plane). With IIDRFs, the respective orientations of vectors are chosen independently. With ORFs, we condition the vectors to be perpendicular. With SimRFs, they subtend angles $\theta = \arccos(-\frac{1}{d-1})$. Intuitively, conditioning the vectors to subtend fixed, obtuse angles means they explore $\mathbb{R}^d$ better, suppressing the kernel estimator MSE. All norms are drawn independently from a $\chi_d$-distribution.

3.1. RF-Conformity and SimRFs vs ORFs

Recalling again that the Gaussian and softmax kernels are readily interchanged, we focus on $K_{\mathrm{gauss}}$ without loss of generality. We begin by defining the RF-conformity.

Definition 3.1. The RF-conformity, $\rho(x, y)$, is given by

$$\rho(x, y) \stackrel{\mathrm{def}}{=} \frac{\Gamma(\frac{d}{2})}{m(m-1)} \sum_{i} \sum_{j \ne i} \mathbb{E}_{w_{ij}}\! \left[ \sum_{k=0}^{\infty} \frac{v^{2k}\, w_{ij}^{2k}}{2^{2k}\, k!\, \Gamma(k + \frac{d}{2})} \right], \quad (10)$$

with $w_{ij} = \|w_i + w_j\|_2$ and $v = \|x + y\|_2$ for $x, y \in \mathbb{R}^d$, $\Gamma$ the Gamma-function and $m$ the number of random vectors $w_i$.

$\rho(x, y)$ depends on correlations induced between random vector directions. It is bigger when random vectors point in similar directions, exploring $\mathbb{R}^d$ less effectively. In Appendix A.1, we prove the following important result.

Theorem 3.2 (MSE depends on RF-conformity). For PRFs, the MSE of the unbiased estimator $\widehat{K}(x, y)$ is given by

$$\mathrm{MSE}(\widehat{K}) = \frac{e^{-2\|x\|^2 - 2\|y\|^2}}{m} \left( e^{2v^2} - e^{v^2} + (m-1)\left(\rho(x, y) - e^{v^2}\right) \right). \quad (11)$$

That is, the MSE is an increasing function of the RF-conformity. For any $w_i, w_j$, SimRFs give strictly smaller values of $w_{ij}$ than ORFs because the random vectors subtend a bigger angle.
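Eqs. 8-9 translate directly into a few lines of code. The sketch below is our own minimal implementation (not the released codebase); it draws the Haar-random rotation via a sign-corrected QR decomposition, which is equivalent to the Gram-Schmidt construction mentioned above, and verifies that all pairs of rows subtend the angle $\arccos(-\frac{1}{d-1})$.

```python
import numpy as np

def simplex_block(d, rng):
    """One simplex block W_simp = D S R (Eq. 8), of shape (d, d).

    Rows are marginally Gaussian N(0, I_d), but their directions are coupled
    so that every pair subtends the angle arccos(-1/(d-1))."""
    # Simplex projection matrix S (Eq. 9): unit rows with pairwise dot product -1/(d-1).
    S = np.zeros((d, d))
    for i in range(d - 1):
        S[i, :d - 1] = -(np.sqrt(d) + 1) / (d - 1) ** 1.5
        S[i, i] += np.sqrt(d / (d - 1))
    S[d - 1, :d - 1] = 1 / np.sqrt(d - 1)
    # Haar-random rotation R: QR of a Gaussian matrix, with column signs fixed.
    Q, r = np.linalg.qr(rng.normal(size=(d, d)))
    R = Q * np.sign(np.diag(r))
    # Independent norms from the chi_d distribution.
    norms = np.sqrt(rng.chisquare(df=d, size=d))
    return norms[:, None] * (S @ R)

rng = np.random.default_rng(0)
W = simplex_block(8, rng)
unit = W / np.linalg.norm(W, axis=1, keepdims=True)
print(np.round(unit @ unit.T, 3))   # off-diagonal entries equal -1/(d-1) = -0.143 for d = 8
```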
Explicitly, $w_{ij} = (w_i^2 + w_j^2 + 2 w_i w_j \cos\theta)^{1/2}$ is smaller when $\cos\theta = -\frac{1}{d-1}$ (SimRFs) compared to when $\cos\theta = 0$ (ORFs). This leads to smaller values of $\rho(x, y)$, which immediately implies the following important result.

Corollary 3.3 (SimRFs outperform ORFs). For PRFs, the kernel estimator MSE obtained with SimRFs is strictly lower than with ORFs for arbitrary data dimensionality $d$.

In fact, we are able to make the following substantially stronger statement, proved in Appendix A.2.

Theorem 3.4 (SimRFs optimal for weight-independent geometrical coupling). Supposing that the $d$ random vector norms $\{w_i \,|\, i \le d\}$ are i.i.d., SimRFs constitute the best possible weight-independent geometrical coupling mechanism, giving the lowest possible PRF kernel estimator MSE.

3.2. SimRFs+

Now we consider the broader family of weight-dependent geometrical coupling mechanisms, where random vector directions $\{\widehat{w}_i\}$ are permitted to be correlated with norms $\{w_i\}$. In particular, given $d$ vectors $\{w_i\}$ of known norms (from $d$ draws of $\chi_d$), we would like to arrange them in $d$-dimensional space in order to minimise the sum³

$$\rho(x, y) = \frac{\Gamma(\frac{d}{2})}{m(m-1)} \sum_{i} \sum_{j \ne i} \sum_{k=0}^{\infty} \frac{v^{2k}\, w_{ij}^{2k}}{2^{2k}\, k!\, \Gamma(k + \frac{d}{2})}. \quad (12)$$

One brute-force approach is to parameterise each of the $d$ random vector directions in hyperspherical coordinates and use an off-the-shelf numerical optimiser (e.g. scipy.optimize). This is prohibitively slow, and moreover the solution has data-dependence via $v = \|x + y\|_2$ which frustrates the method's scalability: the optimisation needs to be carried out pairwise for every $(x, y)$, which undermines our ability to quickly evaluate $\widehat{K}(x, y) = \phi(x)^\top \phi(y)$ for any given pair of input vectors. However, the numerical approach does benchmark the lowest possible RF-conformity that can be achieved with weight-dependent geometrical coupling.

The generic analytic minimisation of Eq. 12 is challenging, and solutions will suffer the same $v$-dependence described above, so we instead consider a tractable approximation.

³We remove the expectation value because, given a fixed set of norms, assigning any probability mass to suboptimal configurations will increase the RF-conformity in expectation; that is, the best geometrical coupling between vectors of known magnitudes $\{w_i\}$ is deterministic.

Dropping constant prefactors for clarity, the first few terms from Eq. 10 are given by:

$$\rho(x, y) = \frac{\Gamma(\frac{d}{2})}{m(m-1)} \sum_{i} \sum_{j \ne i} \mathbb{E}_{w_{ij}}\! \left[ \frac{1}{\Gamma(\frac{d}{2})} + \frac{v^2 w_{ij}^2}{4\,\Gamma(\frac{d}{2} + 1)} + \frac{v^4 w_{ij}^4}{32\,\Gamma(\frac{d}{2} + 2)} + \dots \right] = \frac{1}{m(m-1)} \sum_{i} \sum_{j \ne i} \left[ 1 + \tau \left( 1 + \frac{v^2}{4(d+2)}\, \frac{\mathbb{E}(w_{ij}^4)}{\mathbb{E}(w_{ij}^2)} + \dots \right) \right], \quad (13)$$

with $\tau = \frac{\Gamma(\frac{d}{2})\, v^2\, \mathbb{E}(w_{ij}^2)}{4\,\Gamma(\frac{d}{2} + 1)}$. The precise value of $\frac{\mathbb{E}(w_{ij}^4)}{\mathbb{E}(w_{ij}^2)}$ will depend on the geometrical coupling scheme employed, but for the types we have considered we generally expect it to scale as $d$, with some constant prefactor⁴. Therefore the sum in Eq. 10 can be approximated by:

$$\rho(x, y) \simeq \frac{1}{m(m-1)} \sum_{i} \sum_{j \ne i} \left[ 1 + \frac{\Gamma(\frac{d}{2})\, v^2\, \mathbb{E}(w_{ij}^2)}{4\,\Gamma(\frac{d}{2} + 1)} \left( 1 + \mathcal{O}(v^2) \right) + \dots \right]. \quad (14)$$

In the limit of small $v$, this invites us to truncate the sum at $k = 1$, dropping the $\mathcal{O}(v^2)$ terms. Omitting additive constants, we are left with the approximate objective

$$\rho(x, y) = \frac{\Gamma(d/2)\, v^2}{4\, m(m-1)\, \Gamma(1 + d/2)} \sum_{i} \sum_{j \ne i} w_{ij}^2, \quad (15)$$

the physical analogue of which is the Heisenberg Hamiltonian with different coupling constants between different spin pairs. This is exactly minimised by

$$\widehat{w}_i = -\frac{\sum_{j \ne i} w_j}{\|\sum_{j \ne i} w_j\|_2} \quad \forall\; i = 1, \dots, d, \quad (16)$$

where each random vector points away from the resultant of all the others (see Appendix A.3 for details). Fig. 3 captures this essential difference between SimRFs and SimRFs+: in the latter case, vectors with larger norms subtend bigger angles. Empirically, we find that the iterative update scheme

$$w_i \leftarrow -\frac{\sum_{j \ne i} w_j}{\|\sum_{j \ne i} w_j\|_2}\, w_i \quad (17)$$
converges to Eq. 16 quickly (after a small number of passes through the set of $d$ vectors), especially if we initialise in the near-optimal simplex geometry. Conveniently, the solution has no $v$-dependence and is therefore scalable: the optimisation needs to be carried out for every draw of weights $\{w_i\}$ but not every pair of data points $(x, y)$. We refer to this mechanism of weight-dependent geometrical coupling as SimRFs+, and emphasise that it is asymptotically optimal (in the sense of minimising $\rho(x, y)$) in the $v \ll 1$ limit.

⁴For example, with orthogonal coupling $\frac{\mathbb{E}(w_{ij}^4)}{\mathbb{E}(w_{ij}^2)} = \frac{\mathbb{E}(w_i^4 + w_j^4 + 2 w_i^2 w_j^2)}{\mathbb{E}(w_i^2 + w_j^2)} = \frac{\mathbb{E}(w_i^4)}{\mathbb{E}(w_i^2)} + \mathbb{E}(w_i^2) = \frac{2\,\Gamma(\frac{d}{2}+2)}{\Gamma(\frac{d}{2}+1)} + \frac{2\,\Gamma(\frac{d}{2}+1)}{\Gamma(\frac{d}{2})} = 2d + 2$, where we took moments of the $\chi_d$ distribution. We can perform similar analyses in the i.i.d. and simplex cases.

Figure 3. With SimRFs, random vectors are geometrically correlated such that all pairs subtend an equal angle $\theta = \arccos(-\frac{1}{d-1})$. With SimRFs+, random vectors with bigger norms subtend bigger angles, guaranteeing smaller kernel estimator MSE when $v$ is sufficiently small.

Fig. 4 compares the RF-conformity of the mechanisms we have considered, as well as the outcome of the inefficient numerical optimisation. The additional benefits of weight-dependent coupling are marginal: SimRFs+ access only slightly lower conformity than SimRFs at the expense of an extra optimisation step of time-complexity $\mathcal{O}(d^3)$. This gives context to the excellent performance of SimRFs; they can compete with members of a much broader class at a fraction of the computational cost. We also note that the minimisation of the truncated objective (SimRFs+) is a good approximation to the minimisation of the true objective ('numerically optimised'), accessing comparably small values of $\rho$. Informally, SimRFs+ are close to optimal among the class of weight-dependent geometrically-coupled PRF mechanisms.

Figure 4. Comparison of the RF-conformity defined in Eq. 10 (lower is better, plotted against optimisation iterations) for a single random draw of norms $\{w_i\}$, $v = \|x + y\|_2 = 1$ and $d = 6$. IIDRFs, ORFs, SimRFs and SimRFs+ are implemented as described in the main text. 'Numerically optimised' uses an off-the-shelf numerical optimiser to arrange vectors to minimise the RF-conformity: a scheme which is too computationally inefficient to be practical but benchmarks the lowest possible value. Any improvements above SimRFs using weight-dependent geometrical coupling are marginal. The IIDRF value is averaged over 100 random couplings of fixed weights, and the shaded region gives 1 standard deviation.

4. From ORFs to SimRFs: the Theory

This section provides more detailed theoretical analysis to add insight to the results of Sec. 3. It can safely be omitted on a quick reading. We derive analytic expressions for the RF-conformity $\rho(x, y)$, and therefore the kernel estimator MSE, for IIDRFs, ORFs and SimRFs. This allows us to quantitatively compare the performance of different coupling mechanisms. As before, we specialise to $K_{\mathrm{gauss}}$. Detailed proofs are provided in Appendix A.

We have seen that the RF-conformity depends on an expectation value over $w_{ij} = \|w_i + w_j\|_2$. This motivates us to begin with the following auxiliary lemma.

Lemma 4.1 (IIDRF conformity). When random vectors $w_i, w_j \in \mathbb{R}^d$ are i.i.d. (IIDRFs), the probability distribution $p(w_{ij})$ with $w_{ij} = \|w_i + w_j\|_2$ is given by

$$p_{\mathrm{i.i.d.}}(w_{ij}) = \frac{w_{ij}^{d-1}\, e^{-w_{ij}^2/4}}{2^{d-1}\, \Gamma(\frac{d}{2})}, \quad (18)$$

which induces an RF-conformity

$$\rho_{\mathrm{IIDRF}}(x, y) = e^{v^2}, \quad (19)$$

where $x, y \in \mathbb{R}^d$ and $v = \|x + y\|_2$.
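Lemma 4.1 is easy to check numerically. The hypothetical snippet below estimates the IIDRF conformity, which (per Appendix A.1) equals $\mathbb{E}[e^{(w_i + w_j)^\top (x+y)}]$ for an i.i.d. pair, by Monte Carlo and compares it with $e^{v^2}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 6, 500_000
v_vec = np.zeros(d)
v_vec[0] = 1.0                            # any direction works; only v = ||x + y||_2 matters
v = np.linalg.norm(v_vec)

w_i = rng.normal(size=(n_samples, d))     # i.i.d. Gaussian random vectors
w_j = rng.normal(size=(n_samples, d))
rho_mc = np.mean(np.exp((w_i + w_j) @ v_vec))

print(rho_mc, np.exp(v ** 2))             # both are ~e = 2.718... for v = 1
```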
Now we make the following important observation.

Lemma 4.2 (PDF for vectors subtending θ). Supposing random vectors $w_i, w_j$ are marginally Gaussian but are conditioned to subtend a fixed angle $\theta$, the probability distribution $p_\theta(w_{ij})$ is given by

$$p_\theta(w_{ij}) = \frac{w_{ij}^{2d-1}}{2^{d-2}\, \Gamma(\frac{d}{2})^2} \int_{\phi=0}^{\pi/2} \mathrm{d}\phi\, \frac{(\sin\phi \cos\phi)^{d-1}\, e^{-\frac{w_{ij}^2}{2(1 + \sin 2\phi \cos\theta)}}}{(1 + \sin 2\phi \cos\theta)^{d}}. \quad (20)$$

ORFs and SimRFs correspond to special instances of this with $\cos\theta = 0$ and $\cos\theta = -\frac{1}{d-1}$, respectively. It is instructive to observe that, in the orthogonal case, the distribution reduces to the $\chi_{2d}$-distribution. The probability distribution $p_\theta(w_{ij})$ induces an RF-conformity

$$\rho_\theta(x, y) = \frac{1}{2^{d-1}\, \Gamma(\frac{d}{2})} \int_{0}^{\pi} \mathrm{d}\phi\, (\sin\phi)^{d-1} \sum_{k=0}^{\infty} \frac{v^{2k}\, (1 + \sin\phi \cos\theta)^{k}}{2^{k}\, k!\, \Gamma(k + \frac{d}{2})}\, \Gamma(k + d). \quad (21)$$

Inspecting the form closely, we see that every term in the sum over $k$ is proportional to the integral

$$\int_{0}^{\pi} \mathrm{d}\phi\, (\sin\phi)^{d-1} (1 + \sin\phi \cos\theta)^{k}, \quad (22)$$

which is strictly smaller for $\cos\theta < 0$ compared to $\cos\theta = 0$ (since $\sin\phi$ is nonnegative everywhere in the domain). Since every term in the sum is positive, we immediately conclude that for PRFs the conformity of SimRFs is strictly smaller than ORFs, and hence the MSE is smaller. We already derived this in Sec. 3, but are now also able to provide the following closed forms.

Theorem 4.3 (ORF and SimRF conformity closed forms). For PRFs with $x, y \in \mathbb{R}^d$, the RF-conformity of ORFs is

$$\rho_{\mathrm{ORF}}(x, y) = \frac{\Gamma(\frac{d}{2})}{\Gamma(d)} \sum_{k=0}^{\infty} \frac{v^{2k}}{2^{k}\, k!}\, \frac{\Gamma(k + d)}{\Gamma(k + \frac{d}{2})}, \quad (23)$$

whereas the RF-conformity of SimRFs is

$$\rho_{\mathrm{SimRF}}(x, y) = \frac{\sqrt{\pi}}{2^{d-1}\, \Gamma(\frac{d}{2})} \sum_{k=0}^{\infty} \frac{v^{2k}}{2^{k}}\, \frac{\Gamma(k + d)}{\Gamma(k + \frac{d}{2})} \sum_{p=0}^{k} \frac{(-1)^{p}}{(d-1)^{p}\, (k-p)!\, p!}\, \frac{\Gamma(\frac{d+p}{2})}{\Gamma(\frac{d+p+1}{2})}. \quad (24)$$

These results are novel. They permit the first analytic characterisation of the difference in kernel estimator MSE between IIDRFs, ORFs and SimRFs. We make one further observation.

Corollary 4.4 (ORFs always outperform IIDRFs). In the PRF setting, the orthogonality gap (difference in kernel estimator MSE between IIDRFs and ORFs) is given by

$$\Delta\mathrm{MSE}(\widehat{K}(x, y)) = \frac{(m-1)\, e^{-2\|x\|^2 - 2\|y\|^2}}{m} \left( e^{v^2} - \frac{\Gamma(\frac{d}{2})}{\Gamma(d)} \sum_{k=0}^{\infty} \frac{v^{2k}}{2^{k}\, k!}\, \frac{\Gamma(k + d)}{\Gamma(k + \frac{d}{2})} \right), \quad (25)$$

where $x, y \in \mathbb{R}^d$, $v = \|x + y\|_2$ and $m \le d$ is the number of random vectors. This is positive everywhere.

The sign of this orthogonality gap was first reported in (Choromanski et al., 2020) but without an accompanying closed form. Plotting each of the derived probability distributions $p(w_{ij})$ (Eq. 18 and Eq. 20, taking $\cos\theta = 0$ and $\cos\theta = -\frac{1}{d-1}$) and noting from Eq. 10 that the RF-conformity depends on the expectation value of the monotonically increasing function $f(w_{ij}, v) = \Gamma(\frac{d}{2}) \sum_{k=0}^{\infty} \frac{v^{2k} w_{ij}^{2k}}{2^{2k} k!\, \Gamma(k + \frac{d}{2})}$, the intuitive reason for the relative efficacy of SimRFs, ORFs and IIDRFs becomes clear: conformity is penalised by tails at large $w_{ij}$, which we suppress with geometrical coupling (Fig. 5).

Figure 5. Probability distributions over the random variable $w_{ij} = \|w_i + w_j\|_2$ for IIDRFs, ORFs and SimRFs (shown for $d = 6$, together with the monotonically increasing function $f(w_{ij}, v)$). The RF-conformity depends on the expectation of $f(w_{ij})$. With PRFs, geometrical coupling decreases this by reducing the probability mass at large $w_{ij}$.

4.1. Extension to RFFs

We briefly note that, with minimal work, the preceding results for PRFs can be modified to consider RFFs. For example, the following is true.

Theorem 4.5 (RFF orthogonality gap). In the RFF setting, the orthogonality gap (difference in kernel estimator MSE between IIDRFs and ORFs) is given by

$$\Delta\mathrm{MSE}(\widehat{K}(x, y)) = \frac{m-1}{m} \left( e^{-z^2} - \frac{\Gamma(\frac{d}{2})}{\Gamma(d)} \sum_{k=0}^{\infty} \frac{(-z^2)^{k}}{2^{k}\, k!}\, \frac{\Gamma(k + d)}{\Gamma(k + \frac{d}{2})} \right), \quad (26)$$

where $x, y \in \mathbb{R}^d$, $z = \|x - y\|_2$ and $m \le d$ is the number of random vectors. To the best of our knowledge, this result is also novel.
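Eq. 26 is simple to probe numerically by truncating the series. The sketch below is our own illustration (using log-Gamma functions for stability and handling the alternating signs explicitly); it evaluates the RFF orthogonality gap for a few values of $z$, which is exactly the kind of cursory numerical analysis referred to next.

```python
import numpy as np
from scipy.special import gammaln

def rff_orthogonality_gap(z, d, m, k_max=120):
    """RFF orthogonality gap of Eq. 26, MSE_IIDRF - MSE_ORF, via series truncation."""
    k = np.arange(k_max)
    log_mag = (gammaln(d / 2) - gammaln(d) + 2 * k * np.log(z) - k * np.log(2)
               - gammaln(k + 1) + gammaln(k + d) - gammaln(k + d / 2))
    series = np.sum((-1.0) ** k * np.exp(log_mag))   # successive terms oscillate in sign
    return (m - 1) / m * (np.exp(-z ** 2) - series)

d, m = 6, 6
for z in [0.25, 0.5, 1.0, 2.0, 4.0]:
    print(z, rff_orthogonality_gap(z, d, m))         # > 0 wherever ORFs beat IIDRFs
```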
The RFF expression (26) does not admit the same simple analysis as the PRF form (25) because successive terms in the sum oscillate in sign, but a cursory numerical analysis reveals that the MSE of ORFs is smaller than IIDRFs up to some threshold $z_{\mathrm{crit}}(d)$, the value of which diverges as $d \to \infty$. Taylor expanding our exact result in $\frac{1}{d}$ reproduces the following.

Corollary 4.6 (RFF asymptotic MSE ratio, Yu et al. (2016)). The ratio of ORF to IIDRF kernel estimator MSE is given by

$$\frac{\mathrm{MSE}(\widehat{K}_{\mathrm{ORF}})}{\mathrm{MSE}(\widehat{K}_{\mathrm{IIDRF}})} = 1 - (m-1)\, \frac{e^{-z^2}\, z^4}{d\, (1 - e^{-z^2})^2} + \mathcal{O}\!\left(\frac{1}{d^2}\right), \quad (27)$$

where $x, y \in \mathbb{R}^d$, $z = \|x - y\|_2$ and $m \le d$ is the number of random features. The negative subleading term shows that the RFF orthogonality gap is positive everywhere when $d \to \infty$.

4.2. Implementation, Complexity and Fast SimRFs

The replacement of ORFs with SimRFs is straightforward: instead of calculating random projections $Wx$ using the orthogonal block $W_{\mathrm{ort}} = \mathbf{D}\mathbf{R}$, we use the simplex block $W_{\mathrm{simp}} = \mathbf{D}\mathbf{S}\mathbf{R}$, with the matrices $\mathbf{D}, \mathbf{S}, \mathbf{R} \in \mathbb{R}^{d \times d}$ and the object $x \in \mathbb{R}^d$ defined at the beginning of Sec. 3. By choosing the order of computation $\mathbf{D}(\mathbf{S}(\mathbf{R}x))$, we can avoid the $\mathcal{O}(d^3)$ time complexity of computing matrix-matrix products. Both $\mathbf{D}$ and $\mathbf{S}$ support matrix-vector multiplication of time complexity $\mathcal{O}(d)$ (see Appendix B.2.1). Generically, the time complexity to sample the random orthogonal matrix $\mathbf{R}$ is $\mathcal{O}(d^3)$ and the matrix-vector multiplication $\mathbf{R}x$ is $\mathcal{O}(d^2)$. However, following exactly the same tricks as with ORFs, it is possible to replace $\mathbf{R}$ with a proxy $\widetilde{\mathbf{R}}$ which is approximately sampled from the orthogonal group according to Haar measure and which supports fast matrix-vector multiplication: for example, HD-product matrices (Choromanski et al., 2017) or products of Givens random rotations (Dao et al., 2019). Then the time-complexity will be limited by the computation $\widetilde{\mathbf{R}}x$, which is subquadratic by construction (e.g. $\mathcal{O}(d \log d)$ for the examples above). We refer to this mechanism as fast SimRFs, and show its excellent experimental performance in Appendix B.2.2.

SimRFs+ are implemented by $W_{\mathrm{simp+}} = \mathbf{D}\mathbf{S}'\mathbf{R}$, where $\mathbf{S}'$ is obtained from $\mathbf{S}$ according to the $\mathcal{O}(d^3)$ iterative optimisation scheme defined in Eq. 17. This will dominate the scaling of time-complexity if we apply fast SimRFs+.

Table 1. Time complexities of RF-mechanisms and their fast variants.

  Time-complexity    ORFs          SimRFs        SimRFs+
  Regular            O(d^3)        O(d^3)        O(d^3)
  Fast               O(d log d)    O(d log d)    O(d^3)

For all regular schemes, the space complexity to store $\mathbf{R}$ is $\mathcal{O}(d^2)$. For fast ORFs and fast SimRFs, the space complexity becomes $\mathcal{O}(d)$ because we no longer need to explicitly store $\widetilde{\mathbf{R}}$, just the $d$ weights $\{w_i\}$ from $\chi_d$. But the space complexity of fast SimRFs+ is still $\mathcal{O}(d^2)$ since all vectors must be stored during the optimisation step. It is clear that SimRFs are essentially equal in computational cost to ORFs, and in Sec. 5 we will see that they often perform substantially better in downstream tasks. Meanwhile, SimRFs+ are mostly of academic interest.
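The $\mathcal{O}(d)$ matrix-vector product with $\mathbf{S}$ noted above follows directly from the structure of Eq. 9: every output coordinate needs only the sum of the first $d-1$ input coordinates. A minimal sketch (ours, not the code of Appendix B.2.1):

```python
import numpy as np

def simplex_matvec(x):
    """Compute S @ x in O(d) time, with S the simplex projection matrix of Eq. 9."""
    d = x.shape[0]
    partial = np.sum(x[:d - 1])               # only the first d-1 coordinates enter
    out = np.empty(d)
    out[:d - 1] = (np.sqrt(d / (d - 1)) * x[:d - 1]
                   - (np.sqrt(d) + 1) / (d - 1) ** 1.5 * partial)
    out[d - 1] = partial / np.sqrt(d - 1)
    return out
```

Applying the diagonal matrix $\mathbf{D}$ is a further elementwise product, so $\mathbf{D}(\mathbf{S}(\mathbf{R}x))$ costs $\mathcal{O}(d)$ beyond the rotation itself.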
5. Experiments

Here we report the outcomes of an extensive empirical evaluation of SimRFs for PRFs, demonstrating their superiority over IIDRFs and ORFs in a variety of settings. Technical details are reported in Appendix B. The section is organised as follows: (a) in Sec. 5.1 we plot the derived MSE expressions for IIDRFs, ORFs and SimRFs; (b) in Sec. 5.2 we verify that SimRFs permit higher-quality kernel matrix approximation by considering the Frobenius norm of the difference between the true and approximated Gram matrices; (c) in Sec. 5.3 we compare the performance of the different RF mechanisms on nonparametric classification tasks using kernel regression; (d) in Sec. 5.4 we compare the RF mechanisms for approximation of the attention module in vision Performer-Transformers.

5.1. Comparison of MSE Between RF Mechanisms

We begin by plotting the MSE of the PRF estimator $\widehat{K}$ with IIDRFs, ORFs and SimRFs, given by Eq. 11 with the RF-conformities of Eqs. 19, 23 and 24. We note that the ratio of the MSE of any pair of RF mechanisms only depends on the data $x, y$ via $v = \|x + y\|_2$, so it is natural to plot $\mathrm{MSE}_{\mathrm{ORF}}/\mathrm{MSE}_{\mathrm{IIDRF}}$ and $\mathrm{MSE}_{\mathrm{SimRF}}/\mathrm{MSE}_{\mathrm{IIDRF}}$ as a function of $v$; see Fig. 6. We take $d = 64$, which is standard in Transformer applications. SimRFs always outperform ORFs and IIDRFs, but the size of the improvement depends sensitively on the data. SimRFs are particularly effective compared to ORFs and IIDRFs when estimating kernel evaluations at small $v$. This can be understood from their respective Taylor expansions. For both IIDRFs and ORFs, the MSE goes as $\mathrm{MSE}_{\mathrm{IIDRF,ORF}} = v^2 + \mathcal{O}(v^4)$ (up to the common prefactor $e^{-2\|x\|^2 - 2\|y\|^2}/m$). Meanwhile, for SimRFs, $\mathrm{MSE}_{\mathrm{SimRF}} = v^2\left(1 - (m-1)\frac{2\,\Gamma(\frac{d+1}{2})^2}{d(d-1)\,\Gamma(\frac{d}{2})^2}\right) + \mathcal{O}(v^4)$. For $m = d = 64$, the SimRF $v^2$ prefactor evaluates to 0.0078, which is manifestly substantially smaller than 1.

Figure 6. Analytic form of the MSE ratio of the PRF kernel estimator $\widehat{K}$ for different couplings, plotted as a function of $v = \|x + y\|_2$ (shown over $v \in [10^{-3}, 10^{2}]$). Smaller values indicate lower MSE and are hence better. SimRFs always perform the best, followed by ORFs then IIDRFs. The size of the improvement depends on the data; it is bigger at smaller $v$.

5.2. Quality of Gram Matrix Approximation

Another straightforward task is to directly compare the quality of approximation of the Gram matrix $\widehat{K}$ with the different RF mechanisms. We can quantify this using the Frobenius norm between the exact and approximated matrices, $\sqrt{\sum_{i=1}^{N} \sum_{j=1}^{N} (K_{ij} - \widehat{K}_{ij})^2}$, where $K_{ij} \stackrel{\mathrm{def}}{=} K_{\mathrm{gauss}}(x_i, x_j)$ and $\widehat{K}_{ij}$ is the corresponding low-rank decomposition. For demonstration purposes, we randomly generate $N = 64$ data points of dimensionality $d = 64$ according to the distribution $x_i \sim \mathcal{N}(0, \sigma^2 I_d)$. We take $\sigma = 0.1$. Fig. 7 shows the results; the quality of Gram matrix approximation improves with the number of features, and is better with SimRFs than ORFs and IIDRFs.

Figure 7. Gram matrix Frobenius error: Frobenius norm between the true and approximated Gram matrices (normalised by $d^2$; lower is better) using different RF mechanisms and a different number of random features (/d). More features give a better approximation, and SimRFs consistently outperform ORFs and IIDRFs. The data is of dimensionality $d = 64$ and we take $N = 64$ points, generated normally with $\sigma = 0.1$. The shading gives one standard deviation on estimates of the mean.

5.3. Nonparametric Classification Using Kernel Regression

Here we demonstrate how reduced kernel estimator MSE translates to better performance in downstream classification tasks. We use 8 different datasets retrieved from the UCI Machine Learning Repository (Dua & Graff, 2017a), each consisting of $L$ training data $\{(x, y)\}$ and test data $\{(x', y')\}$. The objects are $d$-dimensional vectors $x, x' \in \mathbb{R}^d$ and their labels are one-hot encoded $y, y' \in \mathbb{R}^n$. We predict the label distribution of a test object using kernel regression with the Gaussian kernel,

$$y'_{\mathrm{pred}} = \frac{\sum_{i=1}^{L} K(\sigma x', \sigma x^{(i)})\, y^{(i)}}{\sum_{i=1}^{L} K(\sigma x', \sigma x^{(i)})}.$$

We then predict a class by taking the greatest argument of $y'_{\mathrm{pred}}$.
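The prediction rule just described is compact to express with random features. The sketch below uses our own names (it is not the experimental code); phi could be instantiated, for example, with the PRF map and simplex blocks from the earlier sketches, with the $\sigma$-scaling applied to the inputs beforehand.

```python
import numpy as np

def kernel_regression_classify(X_train, Y_train, X_test, phi):
    """Nonparametric classification via kernel regression (Sec. 5.3).

    X_train: (L, d), X_test: (n_test, d), Y_train: (L, n_classes) one-hot labels.
    phi maps an (n, d) array to (n, m) random features, so that
    phi(X_test) @ phi(X_train).T approximates the Gaussian kernel matrix."""
    K_hat = phi(X_test) @ phi(X_train).T                    # (n_test, L) approximate kernel
    y_pred = K_hat @ Y_train / K_hat.sum(axis=1, keepdims=True)
    return np.argmax(y_pred, axis=1)                        # index of the largest predicted score
```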
We measure accuracy by the proportion of correct label predictions across the test-set. The $\sigma > 0$ hyperparameter is tuned for good PRF performance on a validation dataset; see Appendix B.1 for detailed discussion. Fig. 8 presents the results, plotting classification accuracy against the number of random features used. The size of the benefit accrued from using SimRFs depends on the data (as we noted in Sec. 5.1) and in the limit of large $m$ performance tends towards the exact kernel result. SimRFs consistently perform best.

Figure 8. Nonparametric classification using kernel regression for a variety of datasets (Dua & Graff, 2017a; Nash et al., 1994; Dua & Graff, 2017b; Bohanec & Rajkovic, 1988; Horton & Nakai, 1996; Lim et al., 2000; Olave et al., 1989; Dua & Graff, 2017c), where the Gaussian kernel is approximated with different RFs (IIDRFs, ORFs, SimRFs). Plots show mean classification accuracy vs the number of random features used to approximate the kernel (/d, the dimensionality of the objects x). Shading gives the standard deviation on the estimates of the mean. SimRFs consistently perform best.

5.3.1. SimRFs+ for Nonparametric Classification

Table 2 compares the classification accuracies achieved with SimRFs and SimRFs+ on the task detailed above, using $m = d$ random features. As suggested in Sec. 3 (see in particular Fig. 4), SimRFs are already close to optimal and any gain provided by using SimRFs+ is marginal. Moreover, improvements tend to occur where $v$ is small, so truncating the objective series expansion at $k = 1$ is reasonable.

Table 2. Classification accuracies from kernel regression with SimRFs and SimRFs+, using random features of length $m = d$. $\bar{v}$ records the mean ($\sigma$-scaled) value of $v$ in each dataset. Note that both variants substantially outperform ORFs on every dataset.

  Dataset     v̄     SimRFs             SimRFs+
  abalone     1.7   0.1421 ± 0.0002    0.1419 ± 0.0002
  banknote    2.6   0.7229 ± 0.0012    0.7132 ± 0.0012
  car         5.0   0.6754 ± 0.0004    0.6751 ± 0.0004
  yeast       3.1   0.3202 ± 0.0004    0.3208 ± 0.0004
  cmc         2.0   0.4047 ± 0.0005    0.4065 ± 0.0005
  nursery     1.4   0.6874 ± 0.0005    0.6917 ± 0.0004
  wifi        0.8   0.6314 ± 0.0018    0.6473 ± 0.0018
  chess       2.3   0.2000 ± 0.0001    0.2000 ± 0.0001

5.4. SimRFs-Performers: Scalable Attention for Transformers

Figure 9. Accuracy comparison (higher is better) of the SimRFs-Performer and the regular ORFs-Performer (SimRFs vs ORFs in each panel; y-axis is accuracy, x-axis is training epochs). Tests are on four image classification tasks: (a) ImageNet2012, (b) Fashion-MNIST, (c) I-Naturalist2021, (d) Places365.

PRFs were first introduced in (Choromanski et al., 2020) in order to accurately approximate the softmax attention module of Transformers, yielding an architecture coined the Performer. This technique for kernelising the attention mechanism, which identifies complex dependencies between the elements of an input sequence, permits linear (cf. quadratic) space- and time-complexity without assuming restrictive priors such as sparsity and low-rankness. Performers offer competitive results across a range of tasks (Tay et al., 2021), including vision modeling (Yuan et al., 2021; Horn et al., 2021) and speech (Liutkus et al., 2021).
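For context, the (non-causal) kernelised attention used by Performers can be sketched in a few lines. This is our own illustrative code rather than the Performer implementation, and it omits practical details such as the query/key rescaling by $d^{1/4}$, numerical stabilisers and causal masking.

```python
import numpy as np

def rf_attention(Q, K, V, phi):
    """Linear-time attention via random features (Choromanski et al., 2020).

    Q, K: (L, d) queries and keys; V: (L, d_v) values; phi maps (L, d) to (L, m)
    positive random features approximating the softmax kernel exp(q . k).
    Cost is O(L m d_v), versus O(L^2 d_v) for exact attention."""
    Q_prime, K_prime = phi(Q), phi(K)                 # (L, m) each
    numerator = Q_prime @ (K_prime.T @ V)             # never forms the L x L attention matrix
    denominator = Q_prime @ K_prime.sum(axis=0)       # row-normalisation of attention
    return numerator / denominator[:, None]
```

Swapping the coupling of the projection rows inside phi from orthogonal to simplex is the only change needed to move between the variants compared below.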
Since Performers apply the ORF variant of PRFs, it is natural to expect that the Sim RFs mechanism, which gives provably lower kernel estimator MSE, will be more effective. We refer to this architecture as the Sim RFs-Performer, and show that it outperforms the regular ORFs-Performer. We focus on the performised versions of Vision Transformers (Vi Ts) (Dosovitskiy et al., 2021) and consider four datasets: (a) Image Net2012 (Deng et al., 2009) (1K classes, 1.2M training images, 100K test set); (b) Fashion-MNIST (Xiao et al., 2017) (10 classes, 60K training images, 10K test set); (c) I naturalist2021 (Horn et al., 2018) (10K classes, 2.7M training images, 500K test set) and (d) Places365 (Zhou et al., 2018) (365 classes, 1.8M training images, 328K test set). These are often used to benchmark Vi Ts. In all four experiments, we use a Vi T with 12 layers, 12 heads, mlp dim equal to 3072, a dropout rate of 0.1 and no attention dropout. We use the adam optimiser with weight decay equal to 0.1 and batch size bs = 4096, trained for 300 epochs on the TPU architecture. We apply 130 random vectors to approximate the softmax attention kernel with PRFs, testing both the ORF and Sim RF coupling mechanisms. The results, comparing ORFs and Sim RFs for approximating attention, are presented in Fig. 9. The Sim RFs Performer often achieves gains over the regular ORFs Performer and is certainly never worse for no observable extra cost. The exact difference depends on the data distribution (see. Sec. 5.1) and the importance of MSE reduction for that particular task; if some other factor is bottlenecking Performer accuracy, then improving the approximation of the attention matrix cannot provide gains. Nonetheless, for some of the tested datasets the difference is substantial: for instance, on Image Net2012, which is frequently used to benchmark new Transformer variants, the Sim RFs Performer saturates at an accuracy which is greater than the regular ORFs-Performer by 0.5%. It is remarkable that such a large gain can be accrued with a single drop-in matrix multiplication at no observable computational cost, without any architectural or Vi T-specific changes. 6. Conclusion We have introduced Simplex Random Features (Sim RFs), a new mechanism for unbiased approximation of the Gaussian and softmax kernels. By correlating the directions of random vectors in the ensemble, we access lower kernel estimator MSE than the previously predominant Orthogonal Random Features (ORFs): a fact we have verified both theoretically and empirically via extensive experiments. We have shown that the suppressed MSE of Sim RFs compared to ORFs often permits better performance in downstream applications, including in nonparametric classification and scalable Transformer training. However, the size of the gain depends on the data distribution and whether the quality of kernel approximation is currently bottlenecking model performance. We have proved that Sim RFs constitute the best weight-independent geometrically-coupled PRF mechanism, with further marginal improvements available in some regimes from a weight-dependent Sim RFs+ variant. Finally, through our detailed quantitative analysis of the different RF mechanisms, we have derived novel closed-form results for ORFs, precisely formalising qualitative and asymptotic findings previously reported in the literature. 7. 
Relative Contributions and Acknowledgements IR developed the Sim RF and Sim RF+ mechanisms, proved all theoretical results, and ran the pointwise kernel evaluation, Frobenius norm and nonparametric classification experiments. KC designed and ran the Performer experiments, and was crucially involved in all aspects of the work throughout. AW and VL provided helpful discussion and feedback on drafts. IR acknowledges support from a Trinity College External Studentship. VL acknowledges support from the Cambridge Trust and Deep Mind. AW acknowledges support from a Turing AI Fellowship under grant EP/V025279/1 and the Leverhulme Trust via CFI. Simplex Random Features Ailon, N. and Chazelle, B. The fast johnson lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302 322, 2009. doi: 10.1137/060673096. URL https://doi.org/10.1137/060673096. Ailon, N. and Liberty, E. An almost optimal unrestricted fast johnson-lindenstrauss transform. ACM Trans. Algorithms, 9(3):21:1 21:12, 2013. doi: 10. 1145/2483699.2483701. URL https://doi.org/ 10.1145/2483699.2483701. Bohanec, M. and Rajkovic, V. Knowledge acquisition and explanation for multi-attribute decision making. In 8th intl workshop on expert systems and their applications, pp. 59 78. Avignon France, 1988. URL https://kt. ijs.si/Marko Bohanec/pub/Avignon88.pdf. Bojarski, M., Choromanska, A., Choromanski, K., Fagan, F., Gouy-Pailler, C., Morvan, A., Sakr, N., Sarlos, T., and Atif, J. Structured adaptive and random spinners for fast machine learning computations. In Artificial intelligence and statistics, pp. 1020 1029. PMLR, 2017. URL http://proceedings.mlr.press/ v54/bojarski17a/bojarski17a.pdf. Choromanski, K., Rowland, M., Sarl os, T., Sindhwani, V., Turner, R. E., and Weller, A. The geometry of random features. In Storkey, A. J. and P erez-Cruz, F. (eds.), International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, volume 84 of Proceedings of Machine Learning Research, pp. 1 9. PMLR, 2018. URL http://proceedings.mlr.press/ v84/choromanski18a.html. Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., et al. Rethinking attention with performers. ar Xiv preprint ar Xiv:2009.14794, 2020. URL https: //openreview.net/pdf?id=Ua6zuk0WRH. Choromanski, K. M., Rowland, M., and Weller, A. The unreasonable effectiveness of structured random orthogonal embeddings. Advances in neural information processing systems, 30, 2017. URL https://arxiv.org/ abs/1703.00864. Chowdhury, S. P., Solomou, A., Dubey, A., and Sachan, M. On learning the transformer kernel. Co RR, abs/2110.08323, 2021. URL https://arxiv.org/ abs/2110.08323. Dao, T., Gu, A., Eichhorn, M., Rudra, A., and R e, C. Learning fast algorithms for linear transforms using butterfly factorizations. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 1517 1527. PMLR, 2019. URL http://proceedings.mlr.press/ v97/dao19a.html. Dasgupta, A., Kumar, R., and Sarl os, T. A sparse johnson: Lindenstrauss transform. In Schulman, L. J. (ed.), Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC 2010, Cambridge, Massachusetts, USA, 5-8 June 2010, pp. 341 350. ACM, 2010. doi: 10.1145/1806689.1806737. URL https: //doi.org/10.1145/1806689.1806737. 
Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pp. 248 255. IEEE Computer Society, 2009. doi: 10.1109/CVPR.2009.5206848. URL https://doi.org/10.1109/CVPR.2009. 5206848. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021. URL https://openreview.net/forum? id=Yicb Fd NTTy. Dua, D. and Graff, C. UCI machine learning repository, 2017a. URL http://archive.ics.uci.edu/ ml. Dua, D. and Graff, C. Banknote authentication dataset, UCI machine learning repository, 2017b. URL http: //archive.ics.uci.edu/ml. Dua, D. and Graff, C. Chess (king-rook vs. king) dataset, UCI machine learning repository, 2017c. URL http: //archive.ics.uci.edu/ml. Faris, W. Radial functions and the fourier transform, 2008. URL http://www.math.arizona.edu/ faris/methodsweb/hankel.pdf. Accessed: 20-10-2022. Horn, G. V., Aodha, O. M., Song, Y., Cui, Y., Sun, C., Shepard, A., Adam, H., Perona, P., and Belongie, S. J. The inaturalist species classification and detection dataset. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 8769 8778. Simplex Random Features Computer Vision Foundation / IEEE Computer Society, 2018. doi: 10.1109/CVPR.2018.00914. URL http: //openaccess.thecvf.com/content_cvpr_ 2018/html/Van_Horn_The_INaturalist_ Species_CVPR_2018_paper.html. Horn, M., Shridhar, K., Groenewald, E., and Baumann, P. F. M. Translational equivariance in kernelizable attention. Co RR, abs/2102.07680, 2021. URL https://arxiv. org/abs/2102.07680. Horton, P. and Nakai, K. A probabilistic classification system for predicting the cellular localization sites of proteins. In Ismb, volume 4, pp. 109 115, 1996. URL https://pubmed.ncbi.nlm.nih. gov/8877510/. Johnson, W. B. Extensions of lipschitz mappings into a hilbert space. Contemp. Math., 26:189 206, 1984. URL https://doi.org/10.1090/conm/ 026/737400. Kane, D. M. and Nelson, J. Sparser johnson-lindenstrauss transforms. J. ACM, 61(1):4:1 4:23, 2014. doi: 10.1145/2559902. URL https://doi.org/10. 1145/2559902. Kar, P. and Karnick, H. Random feature maps for dot product kernels. In Lawrence, N. D. and Girolami, M. A. (eds.), Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2012, La Palma, Canary Islands, Spain, April 21-23, 2012, volume 22 of JMLR Proceedings, pp. 583 591. JMLR.org, 2012. URL http://proceedings.mlr. press/v22/kar12.html. Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The efficient transformer. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. URL https://openreview.net/forum? id=rkg NKk Htv B. Kulesza, A. and Taskar, B. Determinantal point processes for machine learning. Found. Trends Mach. Learn., 5 (2-3):123 286, 2012. doi: 10.1561/2200000044. URL https://doi.org/10.1561/2200000044. Le, Q. V., Sarl os, T., and Smola, A. J. Fastfood - computing hilbert space expansions in loglinear time. 
In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Workshop and Conference Proceedings, pp. 244 252. JMLR.org, 2013. URL http: //proceedings.mlr.press/v28/le13.html. Li, F., Ionescu, C., and Sminchisescu, C. Random fourier approximations for skewed multiplicative histogram kernels. In Goesele, M., Roth, S., Kuijper, A., Schiele, B., and Schindler, K. (eds.), Pattern Recognition - 32nd DAGM Symposium, Darmstadt, Germany, September 22-24, 2010. Proceedings, volume 6376 of Lecture Notes in Computer Science, pp. 262 271. Springer, 2010. doi: 10.1007/978-3-642-15986-2\ 27. URL https:// doi.org/10.1007/978-3-642-15986-2_27. Liberty, E., Ailon, N., and Singer, A. Dense fast random projections and lean walsh transforms. Discret. Comput. Geom., 45(1):34 44, 2011. doi: 10.1007/ s00454-010-9309-5. URL https://doi.org/10. 1007/s00454-010-9309-5. Likhosherstov, V., Choromanski, K. M., Davis, J. Q., Song, X., and Weller, A. Sub-linear memory: How to make performers slim. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pp. 6707 6719, 2021. URL https://doi.org/10.48550/ ar Xiv.2012.11346. Likhosherstov, V., Choromanski, K., Dubey, A., Liu, F., Sarlos, T., and Weller, A. Chefs random tables: Non-trigonometric random features. In Neur IPS, 2022. URL https://doi.org/10. 48550/ar Xiv.2205.15317. Lim, T.-S., Loh, W.-Y., and Shih, Y.-S. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine learning, 40(3):203 228, 2000. URL https: //doi.org/10.1023/A:1007608224229. Lin, H., Chen, H., Choromanski, K. M., Zhang, T., and Laroche, C. Demystifying orthogonal monte carlo and beyond. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. URL https: //doi.org/10.48550/ar Xiv.2005.13590. Liu, F., Huang, X., Chen, Y., and Suykens, J. A. K. Random features for kernel approximation: A survey on algorithms, theory, and beyond. IEEE Trans. Pattern Anal. Mach. Intell., 44(10):7128 7148, 2022. doi: 10.1109/TPAMI.2021.3097011. URL https://doi. org/10.1109/TPAMI.2021.3097011. Liutkus, A., C ıfka, O., Wu, S., Simsekli, U., Yang, Y., and Richard, G. Relative positional encoding for transformers with linear complexity. In Meila, M. and Zhang, Simplex Random Features T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 7067 7079. PMLR, 2021. URL http://proceedings.mlr.press/v139/ liutkus21a.html. Luo, S., Li, S., Cai, T., He, D., Peng, D., Zheng, S., Ke, G., Wang, L., and Liu, T. Stable, fast and accurate: Kernelized attention with relative positional encoding. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pp. 22795 22807, 2021. URL https://doi.org/10.48550/ ar Xiv.2106.12566. Nash, W. J., Sellers, T. L., Talbot, S. R., Cawthorn, A. J., and Ford, W. B. 
The population biology of abalone (haliotis species) in tasmania. i. blacklip abalone (h. rubra) from the north coast and islands of bass strait. Sea Fisheries Division, Technical Report, 48:p411, 1994. URL https://www.researchgate.net/ publication/287546509_7he_Population_ Biology_of_Abalone_Haliotis_species_ in_Tasmania_I_Blacklip_Abalone_ H_rubra_from_the_North_Coast_and_ Islands_of_Bass_Strait. Olave, M., Rajkovic, V., and Bohanec, M. An application for admission in public school systems. Expert Systems in Public Administration, 1:145 160, 1989. URL https://www.academia.edu/16670755/An_ application_for_admission_in_public_ school_systems. Pennington, J., Yu, F. X., and Kumar, S. Spherical random features for polynomial kernels. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 1846 1854, 2015. URL https://proceedings. neurips.cc/paper/2015/file/ f7f580e11d00a75814d2ded41fe8e8fe-Paper. pdf. Rahimi, A. and Recht, B. Random features for large-scale kernel machines. Advances in neural information processing systems, 20, 2007. URL https://people.eecs.berkeley.edu/ brecht/papers/07.rah.rec.nips.pdf. Rahimi, A. and Recht, B. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pp. 1313 1320. Curran Associates, Inc., 2008. URL https://people.eecs.berkeley.edu/ brecht/papers/08.rah.rec.nips.pdf. Roy, A., Saffar, M., Vaswani, A., and Grangier, D. Efficient content-based sparse attention with routing transformers. Trans. Assoc. Comput. Linguistics, 9:53 68, 2021. doi: 10.1162/tacl\ a\ 00353. URL https://doi.org/ 10.1162/tacl_a_00353. Schlag, I., Irie, K., and Schmidhuber, J. Linear transformers are secretly fast weight programmers. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 9355 9366. PMLR, 2021. URL http://proceedings.mlr.press/ v139/schlag21a.html. Tay, Y., Dehghani, M., Abnar, S., Shen, Y., Bahri, D., Pham, P., Rao, J., Yang, L., Ruder, S., and Metzler, D. Long range arena : A benchmark for efficient transformers. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021. URL https: //openreview.net/forum?id=q Vye W-gr C2k. Trokicic, A. and Todorovic, B. Randomized nystr om features for fast regression: An error analysis. In Ciric, M., Droste, M., and Pin, J. (eds.), Algebraic Informatics - 8th International Conference, CAI 2019, Niˇs, Serbia, June 30 - July 4, 2019, Proceedings, volume 11545 of Lecture Notes in Computer Science, pp. 249 257. Springer, 2019. doi: 10.1007/978-3-030-21363-3\ 21. URL https:// doi.org/10.1007/978-3-030-21363-3_21. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. 
(eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998 6008, 2017. Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. Co RR, abs/1708.07747, 2017. URL http: //arxiv.org/abs/1708.07747. Simplex Random Features Xiao, X., Zhang, T., Choromanski, K., Lee, T. E., Francis, A. G., Varley, J., Tu, S., Singh, S., Xu, P., Xia, F., Persson, S. M., Kalashnikov, D., Takayama, L., Frostig, R., Tan, J., Parada, C., and Sindhwani, V. Learning model predictive controllers with real-time attention for real-world navigation. Co RL 2022, abs/2209.10780, 2022. doi: 10.48550/ar Xiv.2209.10780. URL https: //doi.org/10.48550/ar Xiv.2209.10780. Yang, J., Sindhwani, V., Avron, H., and Mahoney, M. W. Quasi-monte carlo feature maps for shift-invariant kernels. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, pp. 485 493. JMLR.org, 2014. URL http://proceedings.mlr.press/ v32/yangb14.html. Yu, F. X., Bhaskara, A., Kumar, S., Gong, Y., and Chang, S. On binary embedding using circulant matrices. J. Mach. Learn. Res., 18:150:1 150:30, 2017. URL http: //jmlr.org/papers/v18/15-619.html. Yu, F. X. X., Suresh, A. T., Choromanski, K. M., Holtmann Rice, D. N., and Kumar, S. Orthogonal random features. Advances in neural information processing systems, 29, 2016. URL https://doi.org/10.48550/ ar Xiv.1610.09072. Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z., Tay, F. E. H., Feng, J., and Yan, S. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pp. 538 547. IEEE, 2021. doi: 10.1109/ICCV48922.2021.00060. URL https://doi. org/10.1109/ICCV48922.2021.00060. Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., and Torralba, A. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 40(6):1452 1464, 2018. doi: 10.1109/ TPAMI.2017.2723009. URL https://doi.org/10. 1109/TPAMI.2017.2723009. Simplex Random Features A. Supplementary Proofs and Discussion In this appendix we provide further discussion and proofs of results stated in the main text. A.1. Proof of Theorem 3.2 (MSE Depends on RF-conformity) We begin by deriving a form for the kernel estimator MSE in the PRF setting, showing how it depends upon the so-called RF-conformity defined in Eq. 10. From the definitions in Eq. 6 and Eq. 7, it follows that b K = ϕ(x) ϕ(y) = e x2 y2 i=1 ew i (x+y) = e x2 y2 i=1 bi (28) where m is the number of random features and x, y Rd. We introduced bi, where bi def = ew i v, (29) with v = x + y Rd. Here, i = 1, ..., m enumerates the random features. It is straightforward to show that this is an unbiased estimator of the Gaussian kernel K(x, y) = exp( x y 2 2 2 ) when wi are sampled from N(0, Id); in particular, we find that E(bi) = e v2 2 . After some algebra, we can also show that MSE( b K) = e 2x2 2y2 (e2v2 ev2) + (m 1)( 1 m(m 1) i =j E[bibj] ev2) Now consider the correlation term 1 m(m 1) P i P i =j E[bibj] = 1 m(m 1) P i P i =j E[e(wi+wj) v] more carefully. Evidently, we care about the probability distribution over the random variable wi + wj, denoted compactly by wij. 
For all couplings we consider, the random vectors wi and wj are marginally isotropic (a necessary condition to be marginally Gaussian) and their resultant wij will also be marginally isotropic. This permits us to rewrite the expectation value using the Hankel transform (Faris, 2008): E[ew ijv] = Z Rd ddwijp(wij)ew ijv = Γ(d/2)2 d 2 1 Z 0 dwijp(wij)(iwijv)1 d 2 1(iwijv) (31) 2 1 is a Bessel function of the first kind5. Importantly, we are integrating over a single variable: the norm of the resultant random vector, wij = wi + wj 2. The probability distribution p(wij) will depend on whether the random vectors are i.i.d. or exhibit geometrical coupling, even though the marginal distributions are identical (Gaussian) in every case. Recalling the Taylor expansion k!Γ(k + α + 1) 2k+α , (32) we can rewrite the correlation term as E[bibj] = Γ d v2kw2k ij 22kk!Γ(k + d Inserting this into Eq. 74, this immediately yields the important result: MSE( b K) = e 2x2 2y2 (e2v2 ev2) + (m 1)(ρ(x, y) ev2) , (34) 5In fact, given the purely imaginary argument, the function Iα(x) = i αJα(ix) is referred to as the modified Bessel function. Simplex Random Features where we defined the RF-conformity ρ(x, y) def = Γ( d v2kw2k ij 22kk!Γ(k + d as in Eq. 10 of the main text. Summations run from i = 1 to m, the number of random features. The MSE is manifestly an increasing function of ρ(x, y), which itself depends sensitively on the any correlations induced between random vectors via p(wij). It is clear that any coupling mechanisms that reduce values of wij = wi + wj 2, e.g. by conditioning that random vectors point away from one another, will suppress ρ(x, y). The RF-conformity will form a core consideration in the discussion that follows. A.2. Proof of Theorem 3.4 (Sim RFs Optimal for Weight-Independent Geometrical Coupling) Here, we prove the central result that, supposing the weights wi = wi 2, i = 1, ..., d are i.i.d. (in our case from χd), Sim RFs constitute the best possible weight-independent geometrical coupling scheme. Recall again that by weightindependent we mean that vector directions { ˆ wi} are independent of norms {wi}, though directions can still be correlated among themselves. Our choice of geometrical coupling will not depend on each particular draw of norms; we just use the fact that all wi are identically distributed. We begin by proving the following simpler auxiliary lemma. Lemma A.1 (Sim RFs optimal for equal norms). Suppose that, instead of being sampled from a χd distribution, we condition that wi Rd for i = 1, ..., d all have equal lengths w. Then ρ(x, y) is minimised when the ensemble exhibits simplex geometrical coupling. Proof: given the set of vector norms wi = w with i = 1, ..., d, we would like to know how to choose the angles θij subtended between each pair wi and wj to minimise the RF-conformity ρ(x, y). It is immediately obvious that we should choose θij deterministically rather than probabilistically, because assigning probability mass to suboptimal configurations will always increase the expectation value (that is, p(wij|wi, wj = w) = δ(wij p 2w2(1 + cos θij)), with δ the delta function). So the task is to choose {θij} to minimise ρ(x, y) = Γ( d 22kk!Γ(k + d 2) b wi + b wj 2k 2 = X j =i f( b wi + b wj 2 2) (36) where we defined the increasing convex function f( b wi + b wj 2 2) def = Γ( d 22kk!Γ(k + d 2) b wi + b wj 2k 2 . 
(37) It follows from Jensen s inequality that j =i f( b wi + b wj 2 2) m(m 1)f j =i b wi + b wj 2 2 m(m 1) with equality when b wi + b wj 2 is identical for every i, j, i.e. all random vectors subtend equal angles. Since f is increasing, j =i b wi + b wj 2 2 = X j =i 2 + 2 b w i b wj = 2m(m 1) + 2 X = 2m(m 2) + 2( X i b wi) ( X = 2m(m 2) + 2 X i b wi 2 2 2m(m 2) Simplex Random Features with equality achieved when P i b wi = 0. Therefore, ρ(x, y) = X j =i f( b wi + b wj 2 2) m(m 1)f 2(m 2) This shows that the conformity is minimised when we have that i) all vectors b wi subtend equal angles, and ii) P i b wi = 0. This is nothing other than the geometry of a d 1-dimensional simplex embedded in d-dimensional space, as described by the basis vectors defined in Eq. 9. Armed with the result of Lemma A.1, we now consider the more general setting where {wi} are i.i.d. random variables but draws are not generically identical. We begin with the observation that, if random variables w1, ..., wd are i.i.d., the joint distribution p(w1, w2, ..., wd) = p(w1) p(w2) ... p(wd) is invariant under permutation of wi. This is because the joint distribution factorises into d identical functions, though more general joint distributions with this property exist. Intuitively, for every given draw of weights {w1, ..., wd}, there are d! 1 other draws of equal probability given by the permutations {w P1, ..., w Pd} where P Sd, the symmetric group on d letters. Therefore, the RF conformity can be expressed as ρ(x, y) = Γ( d Z dw1dw2...dwdp(w1, ..., wd) X v2kw2k ij 22kk!Γ(k + d Z dw1dw2...dwd p(w1, ..., wd) 22kk!Γ(k + d w2 Pi ˆw i ˆwi + w2 Pj ˆw j ˆwj + 2w Piw Pj ˆw i ˆwj k where we wrote p(w1, ..., wd) = 1 P Sn p(w P1, w P2, ..., w Pd) then relabelled the integration variables. Here, we have permuted the random vector norms wi but not the directions ˆwi. We would like to obtain the geometry { ˆwi} that minimises ρ(x, y), subject to the condition that the normalisations of the unit vectors ˆw i ˆwi = 1 are fixed. Since the integrand is nonnegative everywhere, we minimise the sum in the final line of Eq. 41, namely X j =i f w2 Pi + w2 Pj + 2w Piw Pj ˆw i ˆwj 22kk!Γ(k + d w2 Pi + w2 Pj + 2w Piw Pj ˆw i ˆwj k (42) where f is once again convex and positive definite. Relabelling summation variables then using Jensen s inequality, we can write this as j =i f w2 i + w2 j + 2wiwj ˆw Pi ˆw Pj d! X P Sd w2 i + w2 j + 2wiwj ˆw Pi ˆw Pj d! with equality when ˆw Pi ˆw Pj is identical for every permutation that is, when all the random vectors subtend identical angles. With this in mind, we write the Lagrangian as 22kk!Γ(k + d w2 Pi ˆw i ˆwi + w2 Pj ˆw j ˆwj + 2w Piw Pj ˆw i ˆwj k X i λi( ˆw i ˆwi 1). (44) Differentiating wrt ˆwi, v2kkw2k 2 Pi Pj 22kk!Γ(k + d w2 Pi ˆwi + w Piw Pj ˆwj λiwi = 0 i = 1, ..., d. (45) where we used that ˆw Pi ˆw Pj = ˆw i ˆwj to take w2 Pi ˆw i ˆwi + w2 Pj ˆw j ˆwj + 2w Piw Pj ˆw i ˆwj = w Pi Pj = w2 Pi ˆw Pi ˆw Pi + w2 Pj ˆw Pj ˆw Pj + 2w Piw Pj ˆw Pi ˆw Pj. (46) Simplex Random Features Eq. 45 implies that v2kk 22kk!Γ(k + d P Sd w2k 2 Pi Pj w Piw Pj ˆwj i = 1, ..., d (47) with the proportionality constant fixed by the normalisation of ˆwi. Crucially, since we are summing over all permutations Sd of the d labels, the term in parentheses P P Sd w2k 2 Pi Pj w Piw Pj is identical for every i, j. This immediately implies that j =i ˆwj i = 1, ..., d. (48) Subject to the further constraint that all ˆwi subtend equal angles, this is uniquely given by the simplex geometry described by the basis vectors in Eq. 9. 
That is, supposing the vector norms wi are i.i.d. and that the geometrical coupling is weight-independent, Sim RFs give the lowest possible MSE in the PRF setting. An intuitive explanation of this result is as follows. In Lemma A.1, we observed that Sim RFs are optimal if all vector norms wi are equal. Supposing norms are not equal but are identically distributed, any geometrical coupling scheme that is better for some particular draw of norms {wi} will be worse for some of the (equally probable) label permutations {w Pi}, P Sd. The effect of summing over all the permutations is the same as collapsing all the distributions over wi to a single, identical value. A.3. Derivation of Eq. 16 (Sim RFs+ Geometry Minimises the Truncated RF-Conformity Objective) Here, we show that the Sim RFs+ geometrical coupling mechanism (Eq. 16) minimises the truncated approximation to the RF-conformity ρ(x, y) (Eq. 15). Writing a Lagrangian using the truncated sum and differentiating, it is straightforward to find that X 2)(wi + wj) λiwi = 0 i = 1, ..., d (49) with the Lagrange multipliers λi fixed by the (known) normalisations of wi. Should such a geometry exist, this will be solved by wi X j =i wj i = 1, ..., d. (50) Note that, on account of the truncation of the objective, we do not need to make any assumptions about the vector norms or angles subtended being equal to reach this conclusion. It is straightforward to convince oneself that such a geometry always exists for any set of norms: if one norm wi exceeds the sum of all the others, Eq. 50 is trivially satisfied by arranging the vector of maximum norm to be antialigned with all the rest; if this is not the case, it is always possible to arrange the vectors such that they sum to 0, i.e. form a closed loop. Then wi = P j =i wj, which satisfies Eq. 50. We conclude that the Sim RFs+ geometry j =i wj 2 wi i = 1, ..., d (51) minimises ρ(x, y). We briefly note that Eq. 51 does not actually define one unique geometrical coupling, but empirically the iterative update scheme in Eq. 17 always finds a good solution when initialised in the simplex geometry. A.4. Proof of Lemma 4.1 (IIDRF Conformity) In this appendix, we derive the probability distribution p(wij) over wij = wi + wj 2 in the case that all wi follow independent Gaussian distributions N(0, Id), and use it to evaluate the corresponding IIDRF conformity ρ(x, y). In the i.i.d. case, each component of the vector wi + wj is the sum of two standard normal distributions, N(0, 1). This gives another normal distribution with twice the variance, N(0, 2), which leads simply to the generalised χd distribution p(wij) = wd 1 ij e w2 ij/4 Simplex Random Features Considering the definition of ρ(x, y) in Eq. 10, it is straightforward to calculate ρ (x, y) = Γ(d 0 dwwd 1e w2 22kk!Γ(k + d k! = ev2, (53) as reported in the main text. We used the fact that all wij follow the same distribution and suppressed the ij subscripts for notational clarity. To perform the integral over w, we used the identity R w=0 dww2z 1e w2 2 = 2z 1Γ(z) . This result is obtained more quickly by noting that, following the notation in Sec. A.1, ρ(x, y) = 1 m(m 1) P i =j E[e(wi+wj) v] = E[ew 1 v]E[ew 2 v]. We used the fact that wi and wj are independent and that all wij follow the same distribution (then choosing i = 1 and j = 2 wlg). We have already seen that E[ew 1 v] = e v2 2 (in fact the condition for unbiased estimation of b K), which immediately yields ρ(x, y) = ev2. 
But the approach using p(wij) is a good warmup for the theorems that follow and will permit a more unified account. A.5. Proof of Lemma 4.2 (PDF for Vectors Subtending θ) In this appendix, we derive the form of Eq. 20, the probability distribution of wij = wi + wj 2 if wi, wj Rd are marginally Gaussian vectors conditioned to subtend a fixed angle θ. Later, the special cases of θ = π 2 (orthogonal) and θ = arccos( 1 d 1) (simplex) will be of particular interest. Clearly w2 = w2 i + w2 j + 2wiwj cos θ, with weight magnitudes wi,j χd (we have suppressed the ij subscript, replacing wij by w, to minimise notational clutter). Diagonalising the quadratic form, we see that a constant w surface will trace out an ellipse in (wi, wj) space with semi-major (-minor) axis lengths w 1 cos(θ). Now p(w < w ) = Z A pχ(wi)pχ(wj)dwidwj (54) where pχ denotes the χd distribution obeyed by wi,j and A denotes the area in the positive quadrant bounded by an ellipse of constant w = w (recall that wi,j 0 since these are vector magnitudes). Expressing this in polar coordinates, p(w < w ) = Z π/2 1+sin(2ϕ) cos(θ) r=0 drrpχ(r cos ϕ)pχ(r sin ϕ). (55) Differentiating wrt w to get the pdf, p(w) = Z π/2 1 + sin(2ϕ) cos(θ) pχ( w cos ϕ p 1 + sin(2ϕ) cos(θ) )pχ( w sin ϕ 1 + sin(2ϕ) cos(θ)) ϕ=0 dϕ(sin ϕ cos ϕ)d 1 e w2 2(1+sin 2ϕ cos θ) (1 + sin 2ϕ cos θ)d , as reported in Eq. 20 of the main text. As an aside, it is instructive to set θ = π/2 and inspect the form of p(w). Doing so, we arrive at the integral p(w) = w2d 1 ϕ=0 dϕ(sin ϕ cos ϕ)d 1e w2 ϕ=0 dϕ(sin ϕ)d 1e w2 Recalling the Legendre duplication formula, πΓ(2z) = 22z 1Γ(z)Γ(z + 1 2), this reduces to p(w) = w2d 1e w2/2 2d 1Γ(d) . (58) This is nothing other than the χ-distribution with 2d degrees of freedom. This makes intuitive sense because, since wi and wj are orthogonal, it follows that w2 = w2 i + w2 j. Now wi,j follow χd distributions (square root of sum of squares of d standard normal variates), so w must be a square root of sum of squares of 2d standard normal variates that is, a χ2d distribution. Simplex Random Features A.6. Proof of Theorem 4.3 (ORF and Sim RF Conformity Closed Forms) Here, we derive the RF-conformities ρ(x, y) of the ORF and Sim RF variants. Recall the form of ρ(x, y), defined in Eq. 10 and reproduced here for convenience: ρ(x, y) = Γ( d v2kw2k ij 22kk!Γ(k + d Use the probability distribution for two marginally Gaussian weights conditioned to subtend an angle θ, p(wij) = w2d 1 ij 2d 2Γ( d ϕ=0 dϕ(sin ϕ cos ϕ)d 1 e w2 ij 2(1+sin 2ϕ cos θ) (1 + sin 2ϕ cos θ)d , (60) where wij = w2 i + w2 j 2 with i = j (see Lemma 4.2 and the accompanying proof in Sec. A.5). Since all wij follow the same distribution, the sums give a multiplicative factor of m(m 1) that cancels with the denominator. Now we have 22kk!2d 2Γ( d w=0 dw Z π 2 ϕ=0 dϕw2k+2d 1(sin ϕ cos ϕ)d 1 e w2 2(1+sin 2ϕ cos θ) (1 + sin 2ϕ cos θ)d . (61) Changing variables w w 1 + sin 2ϕ cos θ and doing the integral over w, v2kΓ(k + d) 2kk!2d 2Γ( d ϕ=0 dϕ(sin 2ϕ)d 1(1 + sin 2ϕ cos θ)k. (62) Finally, changing variables ϕ ϕ 2 and rearranging, we arrive at ρθ(x, y) = 1 2d 1Γ( d 0 dϕ(sin ϕ)d 1 v2k(1 + sin ϕ cos θ)k 2kk!Γ(k + d 2) Γ(k + d) (63) as reported in Eq. 21 of the main text. Now we substitute in the values of θ corresponding to the particular cases of ORFs and Sim RFs. 1) ORFs: cos θ = 0 Note that 1 Γ( d ϕ=0 dϕ(sin ϕ)d 1 = π Γ( d 2) = 2d 1 Γ( d 2) Γ(d) (64) where we used the identity R π 0 dx sind x = 2 +1) and the Legendre duplication formula. It follows immediately that ρORF(x, y) = Γ( d 2kk! 
Γ(k + d) Γ(k + d We could have obtained this more directly using the χ2d distribution (see discussion at the end of Sec. A.5), but leaving θ unspecified for as long as possible permits a more direct comparison with Sim RFs. 2) Sim RFs: cos θ = 1 d 1 Carrying out the binomial expansion, 0 dϕ(sin ϕ)d 1 1 sin ϕ k! (k p)!p! 0 dϕ(sin ϕ)d+p 1 k! (k p)!p! Simplex Random Features Substituting this in, we immediately arrive at ρSim RF(x, y) = π Γ( d Γ(k + d) Γ(k + d 2 ) 1 (k p)!p! (67) which we have seen is smaller than ρORF(x, y). A.7. Proof of Corollary 4.4 (ORFs Always Outperform IIDRFs) Here we derive an analytic expression for the orthogonality gap (difference in kernel estimator MSE between the IIDRF and ORF mechanisms) and show that it is positive everywhere. From Eq. 11, we immediately have that MSE( b K(x, y)) = e 2(x2+y2) m 1 m (ρIIDRF(x, y) ρORF(x, y)) . (68) Inserting the respective RF-conformities from Eqs. 19 and 23, MSE( b K(x, y)) = e 2(x2+y2) m 1 2kk! Γ(k + d) Γ(k + d = e 2x2 2y2 m 1 1 (k + d 1)! (d 1)! (d 2)!! (2k + d 2)!! where !! denotes the double factorial. We can write the term in parentheses as d + 2 ... d + k 2 d + 2(k 2) d + k 1 d + 2(k 1) > 0 (70) so the series expansion is positive. It follows that the kernel estimator MSE with ORFs is upper bounded but that of IIDRFs. A.8. Proof of Theorem 4.5 (RFF Orthogonality Gap) Here, we demonstrate how, with minor modifications, many of the stated results for PRFs can be translated to RFFs. Recall the RFF definition, stated in Eq. 3 of the main text and reproduced here for convenience. ϕRFF(z) def = 1 m( m i=1[sin(w i z), cos(w i z)]) . (71) Now we have that b K = ϕ(x) ϕ(y) = 1 i=1 cos w i (x y) = 1 i=1 ai (72) where we defined ai def = cos(w i z) (73) and let z = x y. It is straightforward to show that b K is an unbiased estimator of the Gaussian kernel e x y]|2 2 2 when we sample wi N(0, Id); that is, E[ai] = e z2 2 . After some work, we also have that MSE( b K) = 1 2 + (m 1)( 1 m(m 1) i =j E[aiaj] e z2) The object of interest (which is precisely the analogue of ρ(x, y) but for RFFs) is P i =j E[aiaj] = P i =j E[cos w i z cos w j z]. It will vary depending on any geometrical coupling scheme employed. Simplex Random Features From elementary trigonometry, cos w i z cos w j z = 1 2(cos (wi + wj) z + cos (wi wj) z . It is also simple to convince oneself that, when the random vectors wi and wj are (a) i.i.d. or (b) conditioned to be orthogonal, the distributions of the two random variables wi + wj and wi wj are identical. As such, we can just consider the single random variable wij = wi + wj wlg. Then we have that E[cos w ijz] = Z Rd p(wij)e iw ijzddwij = Γ(d/2)2 d 2 1 Z 0 dwijp(wij)(wijz)1 d 2 1(wijz) (75) where we have used that the probability distribution p(wij) is real regardless of whether the random vectors are i.i.d. or orthogonal, and written the expression as a Hankel transform (Faris, 2008). Note that we do not consider the simplex coupling case, where the random variables wi + wj and wi wj will follow different distributions. Carefully comparing with Eq. 31 in Sec. A.1, we observe that the expression is identical to the PRF case, but instead taking v iz. This means that we can obtain all the previously stated IIDRF and ORF results in the RFF setting with minimal extra work. For instance, inspecting Eq. 25, we can immediately state the RFF orthogonality gap (difference in kernel estimator MSE between IIDRFs and ORFs) reported in Theorem 4.5: MSE( b K(x, y)) = m 1 e z2 Γ(d/2) 2kk! 
Γ(k + d) Γ(k + d/2) (Note that we also dropped the exponential prefactor e 2x2 2y2, originating from the definition of PRFs (7) where it is needed to keep kernel estimation unbiased). A.9. Proof of Corollary 4.6 (RFF Asymptotic MSE Ratio) Here we derive Eq. 27, the ratio of ORF to IIDRF kernel estimator MSEs in the d limit. This was first reported in (Yu et al., 2016), and is included here to show consistency with our more general (finite d) closed forms. Considering the discussion in Sec. A.8, it is straightforward to reason that the ratio of MSEs is given by MSEORF MSEIIDRF = 1 + 2(m 1) (1 + e z2)2 (Eort(a1a2) e z2) (77) where ai = e iw i z and the expectation is being taken over the random variable w12 = w1 +w2 2, with w1,2 conditioned to be orthogonal (see e.g. Eq. 58 for the appropriate probability distribution). From the discussion in Sec. A.8, the term in parentheses on the right can be written as the series expansion 1 2k Γ(k + d)Γ( d Recalling Stirling s formula for the asymptotic form of the Gamma function, lim x Γ(x + 1) = 2πxe xxx, (79) we can rewrite term lim d 1 2k Γ(k + d)Γ( d (k + d 1)( d (k + d 1)m+d 1( d 2 1(d 1)d 1 (k + d 1)( d 2 1)(d 1) 1 k + d 1 k + d 1 + k d 1 d 1 1 + k d 2 1 We can Taylor expand each of the constituent components, s (k + d 1)( d 2 1)(d 1) 1 k Simplex Random Features k + d 1 k + d 1 + k d 1 d 1 1 + k d 2 1 which combine to yield 1 2k Γ(k + d)Γ( d 2)Γ(d) 1 k(k 1) It follows that 1 2k Γ(k + d)Γ( d 2de z2. (85) Putting this into Eq. 77, MSE( b KORF) MSE( b KIIDRF) = 1 (m 1) d(1 e z2)2 + O 1 as reported in Eq. 27 and (Yu et al., 2016). Importantly, the negative sign of the subleading term means that, in the RFF setting, ORFs will always outperform IIDRFs when d . B. Experimental Details In this appendix, we provide further experimental details to supplement the discussion in Sec. 5. B.1. Choosing σ Here we elaborate on Sec 5.3, where we report tuning the hyperparameter σ with a validation dataset. In particular, given some fixed dataset (x, y) and a Gaussian kernel K(x, x ) = e x x 2 2 2 , we apply the scalar transformation (x, y) (σx, y) with σ R+ to optimise the IIDRF performance. There are two (potentially competing) factors to consider: 1. σ implicitly controls the smoothness of the the kernel K that we are approximating. Multiplying the data by σ is equivalent to rescaling the kernel characteristic lengthscale by 1 σ, i.e. taking K(x, x ) = e x x 2 2 2 e x x 2 2 2/σ2 . This will change the classifier accuracy even when using the exact kernel. 2. The kernel estimator variance has some dependence on the data (x, x ) consider e.g. any of the results in Sec. 4, or Fig. 6. Roughly speaking, PRFs tend to perform worse at large σ (equivalent to a sharply varying kernel). In order to navigate a possible tradeoff between these factors and pick a suitable value for σ, we tune by running a coarse search optimising classifier accuracy on a validation set. This is sensible because we are specifically interested in comparing the performance of Sim RFs, ORFs and IIDRFs in settings where kernel approximation with random features is already effective. Fig. 10 shows the results; we choose the value of σ at which the i.i.d. PRF classification accuracy (orange solid line) peaks. As we have suggested, this does not generically coincide with where the exact kernel performance (blue dotted line) peaks. 
Simplex Random Features classification accuracy exact IIDRF classification accuracy classification accuracy classification accuracy classification accuracy classification accuracy classification accuracy classification accuracy Figure 10. Plots showing classification accuracy vs the data rescaling factor σ for each of the nonparametric classification validation datasets, including both the exact (dotted blue) and IIDRF (solid orange) kernels. They are used for tuning the σ hyperparameter. The RFs are of dimensionality m = 10d, with d the data dimensionality, and the shaded region gives one standard deviation on the estimate of the mean classification accuracy over N = 10 samples. During this coarse σ search phase, the large nursery and chess datasets are restricted to 1000 training examples and 100 test examples for speed. There is also a broader question about the relationship between the kernel estimator MSE and the performance in downstream tasks. The fact that Sim RFs consistently outperform ORFs and IIDRFs in a variety of nonparametric classification tasks and when used to approximate the attention module in Performers confirms that lower variance on kernel estimates often helps performance in applications. But making rigorous mathematical statements about the effect of estimator variance especially, why its importance differs between tasks is complicated and is left as an open question. B.2. Fast Sim RFs: Further Discussion and Experimental Results In this appendix, we provide more detailed discussion of fast Sim RFs (Sec. 4.2) and demonstrate a simple implementation on the nonparametric classification task. Experiments will be closely related to those described in Sec. 5.3 so readers are advised to review this section first. Recall the definition of the simplex block, Wsimp = DSR, (87) where D Rd d = diag(wi) with wi sampled from a χd-distribution. R Rd d is a random orthogonal matrix drawn from Haar measure on O(d), the group of orthogonal matrices in Rd d. The rows si of the simplex projection matrix S Rd d are given by the simplex unit vectors, defined in Eq. 9 and reproduced below for convenience. d+1 (d 1)3/2 (1, ..., 1, 0) for 1 i < d 1 d 1(1, 1, ..., 1, 0) for i = d. (88) Recall further that, with fast Sim RFs, we replace the matrix R by an orthogonal proxy e R that is only approximately sampled from Haar measure, but which supports fast matrix-vector multiplication. B.2.1. S SUPPORTS FAST MATRIX-VECTOR MULTIPLICATION We begin by showing that the matrix S supports fast matrix-vector multiplication, as stated in the main body. The following simple algorithm of time-complexity O(d) calculates Sx, with x Rd. Simplex Random Features Algorithm 1 Fast matrix-vector multiplication with S Input: object vector x Rd with components xi, i = 1, ..., d Output: simplex projection vectors y = Sx Rd Main: yd = 1 d 1 Pd 1 i=1 xi for i = 1 to d 1 do We note that, in downstream applications such as the Sim RFs-Performer, the time taken by this extra simplex projection (whether using the fast implementation or not) is typically dwarfed by other computational requirements. It is rarely observable. That said, the extra time cost compared to ORFs is technically nonzero and constitutes the only real weakness of Sim RFs. B.2.2. IMPLEMENTATION USING HD-PRODUCT MATRICES To demonstrate one possible implementation, we use the so-called HD-product matrices, formed by multiplication of k N HD blocks, i=1 HD(R) i . 
(89) Here, H is the normalised Hadamard matrix, defined by the recursive relation H1 = (1), Hi = 1 Hi 1 Hi 1 Hi 1 Hi 1 for i > 1, (90) and D(R) i = diag(di) with di Unif({ 1}), i.i.d. Rademacher random variables. HD-blocks have previously received attention for dimensionality reduction (Ailon & Chazelle, 2009), locally-sensitive hashing methods (Bojarski et al., 2017) and kernel approximation (Choromanski et al., 2017), where they exhibit good computational and statistical properties. Importantly, given some vector x Rd, the matrix-vector product Hx can be computed in time O(d log d) via the fast Walsh-Hadamard transform. We report the results of the nonparametric classification tasks described in Sec. 5.3, now inserting HD-product matrices with k = 3 in place of R. With m = d random features, we observe that using fast Sim RFs and fast ORFs does not substantially change the accuracy of nonparametric classification. Results are frequently identical to the regular case. Table 3. Classification accuracies from kernel regression with IIDRFs, ORFs and Sim RFs, where we include both regular and fast implementations in the latter two cases. Replacing the random orthogonal matrix R (sampled from Haar measure) with a structured HD-product does not change the accuracy of nonparametric classification. Classification accuracy Dataset IIDRFs ORFs Sim RFs Regular Fast Regular Fast abalone 0.1432 0.0003 0.1445 0.0003 0.1447 0.0003 0.1455 0.0003 0.1462 0.0003 banknote 0.6441 0.0024 0.6612 0.0025 0.6596 0.0024 0.7196 0.0019 0.7296 0.0017 car 0.6768 0.0006 0.6788 0.0006 0.6784 0.0006 0.6797 0.0006 0.6800 0.0006 yeast 0.3187 0.0006 0.3193 0.0006 0.3171 0.0006 0.3187 0.0006 0.3195 0.0006 cmc 0.4088 0.0009 0.4149 0.0009 0.4159 0.0009 0.4206 0.0008 0.4222 0.0008 nursery 0.5870 0.0013 0.6213 0.0019 0.6193 0.0019 0.7030 0.0008 0.7037 0.0008 wifi 0.4914 0.0026 0.5224 0.0025 0.5310 0.0024 0.6509 0.0027 0.6533 0.0027 chess 0.2011 0.0002 0.2017 0.0002 0.2016 0.0002 0.2021 0.0002 0.2021 0.0002 Note that Hadamard matrices are defined such that H Rd d with d = 2i, i N a non-negative integer. Given data of some arbitrary dimensionality, we have to pad each object vector x with 0s such that its length is a power of 2. This Simplex Random Features accounts for the small discrepancies between the results reported in Table 3 and accuracies in Fig. 8, where vectors are not padded so m = d is smaller. Results in each column of the table are for the same effective d so the regular and fast mechanisms can be safely compared.