# Unifying Orthogonal Monte Carlo Methods

Krzysztof Choromanski* (Google Brain), Mark Rowland* (University of Cambridge), Wenyu Chen (Massachusetts Institute of Technology), Adrian Weller (University of Cambridge, Alan Turing Institute). *Equal contribution. Correspondence to: Krzysztof Choromanski.

Many machine learning methods making use of Monte Carlo sampling in vector spaces have been shown to be improved by conditioning samples to be mutually orthogonal. Exact orthogonal coupling of samples is computationally intensive, hence approximate methods have been of great interest. In this paper, we present a unifying perspective of many approximate methods by considering Givens transformations, propose new approximate methods based on this framework, and demonstrate the first statistical guarantees for families of approximate methods in kernel approximation. We provide extensive empirical evaluations with guidance for practitioners.

1. Introduction

Monte Carlo methods are used to approximate integrals in many applications across statistics and machine learning. Going back at least as far as Metropolis & Ulam (1949), the study of variance reduction and other ways to improve statistical efficiency has been a key area of research. Popular approaches include control variates, antithetic sampling, and randomized quasi-Monte Carlo (Dick & Pillichshammer, 2010).

When sampling from a multi-dimensional probability distribution, a variety of recent theoretical and empirical results have shown that coupling samples to be orthogonal to one another, rather than i.i.d., can significantly improve statistical efficiency. We highlight applications in linear dimensionality reduction (Choromanski et al., 2017), locality-sensitive hashing (Andoni et al., 2015), random feature approximations to kernel methods such as Gaussian processes (Choromanski et al., 2018a) and support vector machines (Yu et al., 2016), and black-box optimization (Choromanski et al., 2018b). We refer to the class of methods using such orthogonal couplings as orthogonal Monte Carlo (OMC).

The improved statistical efficiency of OMC methods bears the cost of additional computational overhead. To reduce this cost significantly, several popular Markov chain Monte Carlo (MCMC) schemes sample from an approximate distribution. We refer to such schemes as approximate orthogonal Monte Carlo (AOMC). Much remains to be understood about AOMC methods, including which methods are best to use in practical settings.

In this paper, we present a unifying account of AOMC methods and their associated statistical and computational considerations. In doing so, we propose several new families of AOMC methods, and provide theoretical and empirical analysis of their performance. Our approaches are orthogonal to, and we believe could be combined with, methods in recent papers which focus on control variates (rather than couplings) for variance reduction of gradients of deep models with discrete variables (Tucker et al., 2017; Grathwohl et al., 2018).

We highlight the following novel contributions:

1. We draw together earlier approaches to scalable orthogonal Monte Carlo, and cast them in a unifying framework using the language of random Givens transformations; see Sections 2 and 3.
2. Using this framework, we introduce several new variants of approximate orthogonal Monte Carlo, which empirically have advantages over existing approaches; see Sections 3 and 4.

3. We provide a theoretical analysis of Kac's random walk, a particular AOMC method. We show that several previous theoretical guarantees for the performance of exact OMC can be extended to approximate OMC via Kac's random walk; see Section 5. In particular, to our knowledge we give the first theoretical guarantees showing that some classes of AOMCs provide gains not only in computational and space complexity, but also in accuracy, in non-linear domains (RBF kernel approximation).

4. We evaluate AOMC approaches empirically, noting relative strengths and weaknesses; see Section 6. We include an extensive analysis of the efficiency of AOMC methods in reinforcement learning evolutionary strategies, showing they can successfully replace exact OMC.

2. Orthogonal Monte Carlo

Consider an expectation of the form $\mathbb{E}_{X \sim \mu}[f(X)]$, with $\mu \in \mathcal{P}(\mathbb{R}^d)$ an isotropic probability distribution and $f : \mathbb{R}^d \to \mathbb{R}$ a measurable, $\mu$-integrable function. A standard Monte Carlo estimator is given by

$$\frac{1}{N} \sum_{i=1}^{N} f(X_i), \qquad \text{where } (X_i)_{i=1}^{N} \overset{\mathrm{i.i.d.}}{\sim} \mu.$$

Suppose for now that $N \leq d$. In contrast to the i.i.d. estimator above, orthogonal Monte Carlo (OMC) alters the joint distribution of the samples $(X_i)_{i=1}^{N}$ so that they are mutually orthogonal ($\langle X_i, X_j \rangle = 0$ for all $i \neq j$) almost surely, whilst maintaining the marginal distributions $X_i \sim \mu$ for all $i \in [N]$. As mentioned in Section 1, there are many scenarios where estimation based on OMC yields great statistical benefits over i.i.d. Monte Carlo. When $N > d$, OMC methods are extended by taking independent collections of $d$ samples which are mutually orthogonal. We note that for an isotropic measure $\mu \in \mathcal{P}(\mathbb{R}^d)$, in general there exist many different joint distributions for $(X_i)_{i=1}^{N}$ that induce an orthogonal coupling.

Example 2.1. Let $\mu \in \mathcal{P}(\mathbb{R}^d)$ be an isotropic distribution, and let $\rho_\mu$ be the corresponding distribution of the norm of a vector with distribution $\mu$. Let $v_1, \ldots, v_d$ be the rows of a random orthogonal matrix drawn from Haar measure on $O(d)$ (the group of orthogonal matrices in $\mathbb{R}^{d \times d}$), and let $R_1, \ldots, R_d \overset{\mathrm{i.i.d.}}{\sim} \rho_\mu$. Then both $(R_i v_i)_{i=1}^{d}$ and $(R_1 v_i)_{i=1}^{d}$ form OMC sequences for $\mu$. More advanced schemes may incorporate non-trivial couplings between the $(R_i)_{i=1}^{d}$.

Example 2.1 illustrates that although a variety of OMC couplings exist for any given target distribution, all such algorithms have in common the task of sampling an exchangeable collection of mutually orthogonal vectors $v_1, \ldots, v_d$ such that each vector marginally has uniform distribution over the sphere $S^{d-1}$. We state this in an equivalent form below.

Problem 2.2. Sample a matrix $\mathbf{M}$ from Haar measure on $O(d)$, the group of orthogonal matrices in $\mathbb{R}^{d \times d}$.

Several methods are known for solving this problem exactly (Genz, 1998; Mezzadri, 2007), involving Gram-Schmidt orthogonalisation, QR decompositions, and products of Householder and Givens rotations; a sketch of one such approach is given below.
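For concreteness, the following NumPy sketch (our illustration, not code released with the paper; the function name is ours) shows the standard Gaussian-QR recipe for Problem 2.2, including the sign correction described by Mezzadri (2007), without which the output is not Haar-distributed.

```python
import numpy as np

def haar_orthogonal(d, rng=None):
    """Sample M from Haar measure on O(d) via QR of a Gaussian matrix; O(d^3) time."""
    rng = np.random.default_rng(rng)
    G = rng.standard_normal((d, d))      # i.i.d. N(0, 1) entries
    Q, R = np.linalg.qr(G)
    # Sign-correct the columns of Q by the signs of diag(R) (Mezzadri, 2007);
    # without this step Q is not distributed according to Haar measure.
    return Q * np.sign(np.diag(R))

# Rescaling the rows of haar_orthogonal(d) by i.i.d. norms R_i ~ rho_mu gives
# an OMC sequence for an isotropic measure mu, as in Example 2.1.
```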
Computationally, these exact methods incur high costs:

(i) Computational cost of sampling. All exact OMC methods require $O(d^3)$ time to sample a matrix, vs. $O(d^2)$ for i.i.d. Monte Carlo.

(ii) Computational cost of computing matrix-vector products. If the matrix $\mathbf{M}$ is only required in order to compute matrix-vector products, then the Givens and Householder methods yield such products in $O(d^2)$ time, without needing to construct the full matrix $\mathbf{M}$. The Gram-Schmidt method does not offer this advantage.

(iii) Space requirements. All methods require the storage of $O(d^2)$ floating-point numbers.

Approximate OMC methods are motivated by the desire to reduce the computational overheads of exact OMC, whilst still maintaining the statistical advantages that arise from orthogonality. Additionally, it turns out in many cases that it is simultaneously possible to improve on (ii) and (iii) in the list above, via the use of structured matrices. Indeed, we will see that good quality AOMC methods can achieve $O(d^2 \log d)$ sampling complexity, $O(d \log d)$ matrix-vector product complexity, and $O(d)$ space requirements.

3. Approximate Orthogonal Monte Carlo

In Section 2, we saw that the sampling problem in OMC is reducible to sampling random matrices from $O(d)$ according to Haar measure, and that the best known complexity for performing this task exactly is $O(d^3)$. For background details on approximating Haar measure on $O(d)$, see reviews by Genz (1998) and Mezzadri (2007). Here, we review several approximate methods for this task, including Hadamard-Rademacher random matrices, which have proven popular recently, and cast them in a unifying framework.

We begin by recalling the notion of a Givens rotation (Givens, 1958).

Definition 3.1. A $d$-dimensional Givens rotation is an orthogonal matrix specified by two distinct indices $i, j \in [d]$ and an angle $\theta \in [0, 2\pi)$. The Givens rotation is then given by the matrix $G[i, j, \theta]$ satisfying

$$G[i, j, \theta]_{k,l} = \begin{cases} \cos(\theta) & \text{if } k = l \in \{i, j\}, \\ \sin(\theta) & \text{if } k = i,\ l = j, \\ -\sin(\theta) & \text{if } k = j,\ l = i, \\ 1 & \text{if } k = l \notin \{i, j\}, \\ 0 & \text{otherwise.} \end{cases}$$

Thus, the Givens rotation $G[i, j, \theta]$ fixes all coordinates of $\mathbb{R}^d$ except $i$ and $j$, and in the two-dimensional subspace spanned by the corresponding basis vectors, it performs a rotation of angle $\theta$. A Givens rotation $G[i, j, \theta]$ composed on the right with a reflection in the $j$ coordinate will be termed a Givens reflection and written $\widetilde{G}[i, j, \theta]$. Givens rotations and reflections will be generically referred to as Givens transformations.

We now review several popular methods for AOMC, and show that they may be understood in terms of Givens transformations. (We briefly note that some methods always return matrices with determinant 1, i.e. taking values in the special orthogonal group $SO(d)$; such methods are easily adjusted to yield matrices across the full orthogonal group $O(d)$ by composing with a diagonal matrix with $\mathrm{Unif}(\{\pm 1\})$ entries. We will not mention this in the sequel.)

3.1. Kac's Random Walk

Kac's random walk composes together a series of random Givens rotations to obtain a random orthogonal matrix. It may thus be interpreted as a random walk over the special orthogonal group $SO(d)$. Formally, it is defined as follows.

Definition 3.2 (Kac's random walk). Kac's random walk on $SO(d)$ is defined to be the Markov chain $(K_T)_{T=1}^{\infty}$ given by

$$K_T = \prod_{t=1}^{T} G[I_t, J_t, \theta_t],$$

where for each $t \in \mathbb{N}$, the random variables $(I_t, J_t) \sim \mathrm{Unif}([d]^{(2)})$ (a uniformly random pair of distinct indices) and $\theta_t \sim \mathrm{Unif}([0, 2\pi))$ are independent. Here and in the sequel, the product notation $\prod_{t=1}^{T} M_t$ always denotes the product $M_T \cdots M_1$, with the highest-index matrix appearing on the left.

It is well known that Kac's random walk is ergodic, and has Haar measure on $SO(d)$, the special orthogonal group, as its unique invariant measure. More recently, finite-time analysis of Kac's random walk has established its mixing time as $O(d^2 \log d)$ (Oliveira, 2009). Further, considering a fixed vector $v \in S^{d-1}$, the sequence of random variables $(K_t v)_{t=1}^{\infty}$ can be interpreted as a Markov chain on $S^{d-1}$, and it is known to converge to the uniform distribution on the sphere, with mixing time $O(d \log d)$ (Pillai & Smith, 2017).

Thus, an approximation to Haar measure on $O(d)$ may be achieved by simulating Kac's random walk for a certain number of steps; the mixing times described above give a guide as to the number of steps required for a close approximation.
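As a concrete illustration (a minimal sketch of ours, not the authors' implementation), the action of $K_k$ on a fixed vector can be simulated directly from Definition 3.2, at a cost of a constant number of arithmetic operations per Givens rotation:

```python
import numpy as np

def kac_walk_apply(x, k, rng=None):
    """Return K_k x for a freshly sampled Kac's random walk with k Givens rotations.

    Each step rotates a uniformly random pair of coordinates by a uniform angle,
    so applying k = O(d log d) steps to a vector costs O(d log d) operations in total.
    """
    rng = np.random.default_rng(rng)
    y = np.array(x, dtype=float, copy=True)
    d = y.shape[0]
    for _ in range(k):
        i, j = rng.choice(d, size=2, replace=False)     # (I_t, J_t): distinct indices
        theta = rng.uniform(0.0, 2.0 * np.pi)           # theta_t ~ Unif([0, 2*pi))
        c, s = np.cos(theta), np.sin(theta)
        yi, yj = y[i], y[j]
        y[i], y[j] = c * yi + s * yj, -s * yi + c * yj  # apply G[I_t, J_t, theta_t]
    return y
```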
3.2. Hadamard-Rademacher Matrices

Another popular mechanism for approximating Haar measure is given by Hadamard-Rademacher random matrices. These involve taking products of random diagonal matrices with certain structured deterministic Hadamard matrices.

Definition 3.3 (Hadamard-Rademacher chain). The Hadamard-Rademacher chain on $O(2^L)$ is defined to be the Markov chain $(X_T)_{T=1}^{\infty}$ given by

$$X_T = \prod_{t=1}^{T} \mathbf{H} D_t, \qquad (1)$$

where $(D_t)_{t=1}^{\infty}$ are independent random diagonal matrices, with each diagonal element a Rademacher ($\mathrm{Unif}(\{\pm 1\})$) random variable, and $\mathbf{H}$ is the normalised Hadamard matrix, defined as the following Kronecker product:

$$\mathbf{H} = \mathbf{H}_L = \underbrace{\mathbf{H}_1 \otimes \cdots \otimes \mathbf{H}_1}_{L \text{ times}}, \qquad \mathbf{H}_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}.$$

These matrices $X_T$ (typically with $T \in \{1, 2, 3\}$) have been used recently in the context of dimensionality reduction (Choromanski et al., 2017) (see also Ailon & Chazelle, 2009), kernel approximation (Yu et al., 2016), and locality-sensitive hashing (Andoni et al., 2015). Ailon & Chazelle (2009) give an interpretation of such matrices as randomised discrete Fourier transforms; here, we show that they can be thought of as products of random Givens rotations with more structure than in Kac's random walk, giving a unifying perspective on the two methods. To do this, we first require some notation.

It is a classical result that the Hadamard matrix $\mathbf{H} \in \mathbb{R}^{2^L \times 2^L}$ can be understood as the discrete Fourier transform over the additive Abelian group $\mathbb{F}_2^L$, by identifying $\{1, \ldots, 2^L\}$ with $\mathbb{F}_2^L$ in the following manner. We associate the element $\lambda = (\lambda_1, \ldots, \lambda_L) \in \mathbb{F}_2^L$ with the element $x \in \{1, \ldots, 2^L\}$ with the property that $x - 1$ expressed in binary is $\lambda_L \cdots \lambda_1$. With this correspondence understood, we will write expressions such as $\widetilde{G}[\lambda, \lambda', \theta]$ for $\lambda, \lambda' \in \mathbb{F}_2^L$ without further comment. Denoting the canonical basis of $\mathbb{F}_2^L$ by $e_1, \ldots, e_L$, define, for $j \in \{1, \ldots, L\}$,

$$\widetilde{F}_{j,L} = \prod_{\substack{\lambda \in \mathbb{F}_2^L \\ \lambda_j = 0}} \widetilde{G}[\lambda, \lambda + e_j, \pi/4] \in O(2^L). \qquad (2)$$

Then the normalised Hadamard matrix $\mathbf{H}_L \in O(2^L)$ can be written

$$\mathbf{H}_L = \prod_{i=1}^{L} \widetilde{F}_{i,L}. \qquad (3)$$

Thus, $\mathbf{H}_L$ is naturally described as a product of Givens reflections as above, and indeed it is this decomposition which exactly describes the operations constituting the fast Hadamard transform. These relationships are illustrated in Figure 1, with further illustration in Appendix Section D.

Thus, we may give a new interpretation of the Hadamard-Rademacher random matrix $\mathbf{H} D_t$ appearing in Expression (1), by writing

$$\mathbf{H} D_t = \left( \prod_{i=1}^{L-1} \widetilde{F}_{i,L} \right) \left( \widetilde{F}_{L,L} D_t \right).$$

In this expression, we may interpret $\widetilde{F}_{L,L} D_t$ as a product of random Givens transformations with a deterministic, structured choice of rotation axes, with rotation angle chosen uniformly from $\{\pi/4, 3\pi/4\}$, and chosen uniformly at random to be a rotation or reflection. This perspective will allow us to generalise this popular class of AOMC methods in Section 4.

Figure 1. Top: the matrix $\widetilde{F}_{2,3}$ expressed as a commuting product of Givens reflections, as in Expression (2). Bottom: the normalised Hadamard matrix $\mathbf{H}_3$ written as a product of $\widetilde{F}_{1,3}$, $\widetilde{F}_{2,3}$ and $\widetilde{F}_{3,3}$. Matrix elements are coloured white/black to represent 0/1 entries, and grey/blue to represent entries in $(0, 1)$ and $(-1, 0)$.
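The decomposition in Expression (3) is exactly the fast Walsh-Hadamard transform, so products $X_T v$ can be computed in $O(T d \log d)$ time and $O(d)$ space. The sketch below is ours (with the power-of-two dimension assumption made explicit), not code from the paper; it applies $T$ Hadamard-Rademacher blocks to a vector.

```python
import numpy as np

def fwht(y):
    """Unnormalised fast Walsh-Hadamard transform of a length-2**L array; O(d log d) time."""
    y = np.array(y, dtype=float, copy=True)
    d = y.shape[0]
    h = 1
    while h < d:
        for start in range(0, d, 2 * h):
            a = y[start:start + h].copy()
            b = y[start + h:start + 2 * h].copy()
            y[start:start + h] = a + b
            y[start + h:start + 2 * h] = a - b
        h *= 2
    return y

def hadamard_rademacher_apply(x, T=3, rng=None):
    """Apply X_T = (H D_T) ... (H D_1) of Expression (1) to a vector x."""
    rng = np.random.default_rng(rng)
    y = np.array(x, dtype=float, copy=True)
    d = y.shape[0]
    assert d & (d - 1) == 0, "dimension must be a power of two (zero-pad otherwise)"
    for _ in range(T):
        y *= rng.choice([-1.0, 1.0], size=d)   # random Rademacher diagonal D_t
        y = fwht(y) / np.sqrt(d)               # normalised Hadamard matrix H
    return y
```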
3.3. Butterfly Matrices

Butterfly matrices generalise Hadamard-Rademacher random matrices and are a well known means of approximately sampling from Haar measure. They have found recent application in random feature sampling for kernel approximation (Munkhoeva et al., 2018). A butterfly matrix is given by defining transform matrices of the form

$$F_{j,L}\big[(\theta_{j,\mu})_{\mu \in \mathbb{F}_2^{L-j}}\big] = \prod_{\substack{\lambda \in \mathbb{F}_2^L \\ \lambda_j = 0}} G[\lambda, \lambda + e_j, \theta_{j, \lambda_{j+1:L}}] \in O(2^L).$$

Then the butterfly matrix $B_L$ is the random matrix taking values in the special orthogonal group $SO(2^L)$ given by

$$B_L = \prod_{i=1}^{L} F_{i,L}\big[(\theta_{i,\mu})_{\mu \in \mathbb{F}_2^{L-i}}\big], \qquad (4)$$

where the angles $\big((\theta_{i,\mu})_{\mu \in \mathbb{F}_2^{L-i}}\big)_{i=1}^{L}$ are i.i.d. $\mathrm{Unif}([0, 2\pi))$. Thus butterfly matrices and Hadamard-Rademacher matrices may both be viewed as versions of Kac's random walk that introduce statistical dependence between various random variables.

4. New AOMC Methods

Having developed a unifying perspective of existing AOMC methods in terms of Givens rotations, we now introduce two new families of AOMC methods that extend this framework.

4.1. Structured Givens Products

We highlight the work of Mathieu & LeCun (2014), who propose to (approximately) parametrise $O(2^L)$ as a structured product of Givens rotations, for the purposes of learning approximate factorised Hessian matrices. This construction is straightforward to randomise, and yields a new method for AOMC, generalising both Hadamard-Rademacher random matrices and butterfly random matrices, defined precisely as

$$\prod_{i=1}^{L} \prod_{\substack{\lambda \in \mathbb{F}_2^L \\ \lambda_i = 0}} G[\lambda, \lambda + e_i, \theta_{i,\lambda}],$$

where $(\theta_{i,\lambda})_{\lambda \in \mathbb{F}_2^L,\ i \in [L]}$ are i.i.d. $\mathrm{Unif}([0, 2\pi))$. This can be understood as generalising random butterfly matrices by giving each constituent Givens rotation an independent rotation angle, whereas in Expression (4), some Givens rotations share the same random rotation angles.

4.2. Hadamard-Multi Rademacher Matrices

Given the representation of Hadamard-Rademacher matrices in Expression (1), a natural generalisation of these matrices is given by the notion of a Hadamard-Multi Rademacher random matrix, defined below.

Definition 4.1. The Hadamard-Multi Rademacher random matrix on $O(2^L)$ is defined by the product

$$\prod_{i=1}^{L} \widetilde{F}_{i,L} D_i, \qquad (5)$$

where $(\widetilde{F}_{i,L})_{i=1}^{L}$ are the structured products of deterministic Givens reflections of Expression (2), and $(D_i)_{i=1}^{L}$ are independent random diagonal matrices, with each diagonal element having independent Rademacher distribution.
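To make the difference from the Hadamard-Rademacher chain concrete, the sketch below (ours; it uses the Walsh-Hadamard butterfly stage as a stand-in for $\widetilde{F}_{i,L}$, which matches it up to per-pair sign conventions) interleaves a fresh Rademacher diagonal before each of the $L$ stages, so the product still costs $O(d \log d)$ FLOPs to apply (cf. the FLOPs comparison in Section 6.1).

```python
import numpy as np

def hadamard_multi_rademacher_apply(x, rng=None):
    """Apply a Hadamard-Multi Rademacher matrix (Definition 4.1) to a vector in O(d log d) time."""
    rng = np.random.default_rng(rng)
    y = np.array(x, dtype=float, copy=True)
    d = y.shape[0]                                   # d = 2**L
    assert d & (d - 1) == 0, "dimension must be a power of two"
    h = 1
    while h < d:                                     # L = log2(d) stages, one per bit
        y *= rng.choice([-1.0, 1.0], size=d)         # fresh Rademacher diagonal D_i per stage
        for start in range(0, d, 2 * h):             # commuting 2x2 Givens blocks of this stage
            a = y[start:start + h].copy()
            b = y[start + h:start + 2 * h].copy()
            y[start:start + h] = (a + b) / np.sqrt(2.0)
            y[start + h:start + 2 * h] = (a - b) / np.sqrt(2.0)
        h *= 2
    return y
```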
In the dimensionality Unifying Orthogonal Monte Carlo Methods reduction setting the goal is to find a mapping Ψ : Rd Rm such that m < d and Ψ(xi), Ψ(xj) K(xi, xj) for all i, j [N], for some dataset {xi}N i=1 Rd. The random projections approach to this problem defines a random linear map Ψm(x) = d m Mx (for all x Rd), with M a random matrix taking values in Rm d. A commonly used random projection is given by taking M to have i.i.d. N(0, 1/d) entries. This yields the unstructured Johnson Lindenstrauss transform (Johnson & Lindenstrauss, 1984, JLT), with corresponding dot-product estimator given by b Kbase m (x, y) = d m(Mx) (My). Several improvements on the JLT have been proposed, yielding computational benefits (Ailon & Chazelle, 2009; Dasgupta et al., 2010). In the context of AOMC methods, Choromanski et al. (2017) demonstrated that by replacing the Gaussian matrix in the Johnson-Lindenstrauss transform with a general Hadamard Rademacher matrix composed with a random coordinate projection matrix P uniformly selecting m coordinates without replacement, it is possible to simultaneously improve on the standard JLT in terms of: (i) estimator MSE, (ii) cost of computing embeddings, (iii) storage space for the random projection, and (iv) cost of sampling the random projection. We show new results that similar improvements are available for random projections based on Hadamard Multi Rademacher random matrices and Kac s random walk specifically, projections of the form ΨHMD m , ΨKAC k,m : x 7 d m PMx , x Rd , (6) where M is either a Hadamard-Multi Rademacher random matrix (Definition 4.1), or a Kac s random walk matrix with k Givens rotations (Definition 3.2). We denote the corresponding dot-product estimators by b KHMD m (x, y) and b KKAC k,m (x, y), respectively. Theorem 5.1. The Hadamard-Multi Rademacher dotproduct estimator has MSE given by: MSE( b KHMD m (x, y)) = x 2 2 y 2 2 + x, y 2 2 X Comparing with the known formula for MSE( b Kbase m (x, y)) in (Choromanski et al., 2017), the MSE associated with the Hadamard-Multi Rademacher embedding is strictly lower. Theorem 5.2. The dot-product estimator based on Kac s random walk with k steps has MSE given by MSE( b KKAC k,m (x, y)) = d where χ = Θk Pd i=1 x2 i y2 i + 1 Θk 2(1 Θ)d(d 1)(2 x, y 2 + x 2 2 y 2 2) and Θ = (d 2)(2d+1) 2d(d 1) . In particular, there ex- ists a universal constant C > 0 such that for k = Cd log(d) the following holds: MSE( b KKAC k,m (x, y)) < MSE( b Kbase m (x, y)). As we see, estimators using only O(d log d) Givens random rotations are more accurate than unstructured baselines and they also provide computational gains. 5.2. Non-linear Kernel Approximation Analysis Kernel methods such as Gaussian processes and support vector machines are widely used in machine learning. Given a stationary isotropic continuous kernel K : Rd Rd R, with K(x, y) = φ( x y ) for some positive definite function φ : R R, the celebrated Bochner s theorem states that there exists a probability measure µφ P(Rd) such that: Kφ(x, y) = Re Z Rd exp(iw (x y))µφ(dw) . (7) Rahimi & Recht (2007) proposed to use a Monte Carlo approximation, yielding a random feature map Ψm,d : Rd R2m given by Ψm,d(x) = 1 m cos(w i x), 1 m sin(w i x) m with (wi)m i=1 i.i.d. µφ. Inner products of these features: b Kφ,m base(x, y) = Ψm,d(x), Ψm,d(y) (8) are then standard Monte Carlo estimators of Expression (7), allowing computationally fast linear methods to be used in approximation non-linear kernel methods. Yu et al. 
Yu et al. (2016) proposed to couple the directions of the $(w_i)_{i=1}^{m}$ to be orthogonal almost surely, whilst keeping their lengths independent. Empirically, this leads to substantial variance reduction, but in order for the method to be practical, an AOMC method is required to simulate the orthogonal directions; Yu et al. (2016) used Hadamard-Rademacher random matrices. However, theoretical improvements were only proven for exact OMC methods (Yu et al., 2016; Choromanski et al., 2018a); thus, the empirical success of AOMC methods in this domain was unaccounted for. Here, we close this gap, showing that AOMC simulation of the directions of $(w_i)_{i=1}^{m}$ using Kac's random walk leads to provably lower-variance estimates of the kernel values in Expression (7) than the i.i.d. approach. Before stating this result formally, we introduce some notation.

Definition 5.3. We denote by $\mathrm{GRR}^k_d$ the distribution over the orthogonal group $O(d)$ corresponding to Kac's random walk with $k$ Givens rotations.

Definition 5.4. For a one-dimensional distribution $\Phi$, we denote by $\mathrm{GRR}^{\Phi,k}_d$ the distribution over matrices in $\mathbb{R}^{d \times d}$ given by the distribution of the product $D A$, where $A \sim \mathrm{GRR}^k_d$ and, independently, $D$ is a diagonal matrix with diagonal entries sampled independently from $\Phi$.

We denote the kernel estimator using random vectors $(w_i)_{i=1}^{m}$ drawn from $\mathrm{GRR}^{\Phi,k}_d$ (rather than i.i.d. samples from $\mu_\phi$) by $\widehat{K}^{\phi,m,k}_{\mathrm{kac}}(x, y)$. We also denote by $S(\epsilon)$ the ball of radius $\epsilon$ centred at $0$. We now state our main result.

Theorem 5.5 (Kac's random walk estimators of RBF kernels). Let $K_d : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be the Gaussian kernel and let $\epsilon > 0$. Let $\mathcal{B}$ be a set satisfying $\mathrm{diam}(\mathcal{B}) \leq B$ for some universal constant $B$ that does not depend on $d$ ($\mathcal{B}$ might be, for instance, a unit sphere). Then there exists a constant $C = C(B, \epsilon) > 0$ such that for every $x, y \in \mathcal{B} \setminus S(\epsilon)$ and $d$ large enough we have:

$$\mathrm{MSE}(\widehat{K}^{\phi,m,k}_{\mathrm{kac}}(x, y)) < \mathrm{MSE}(\widehat{K}^{\phi,m}_{\mathrm{base}}(x, y)),$$

where $k = C d \log d$ and $m = l d$ for some $l \in \mathbb{N}$.

Let us comment first on the condition $x, y \in \mathcal{B} \setminus S(\epsilon)$. This is needed to avoid degenerate cases, such as $x = y = 0$, where both MSEs are trivially the same. Separation from zero and boundedness are mild conditions and hold in most practical applications. Whilst the result is stated in terms of the Gaussian kernel, it holds more generally; results are given in the Appendix. We emphasise that, to our knowledge, this is the first result showing that AOMC methods can be applied in non-linear estimation tasks and achieve improved statistical performance relative to i.i.d. methods, whilst simultaneously incurring a lower computational cost, due to requiring only $O(d \log d)$ Givens rotations.

We emphasise that we did not aim to obtain optimal constants in the above theorems. In the experimental section we show that in practice we can choose small values for them. In particular, for all experiments using Kac's random walk matrices we use $C = 2$.
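As an illustration of the construction behind Theorem 5.5 (a sketch of ours with $m = d$, not the authors' code), the directions of the $w_i$ can be taken to be the rows of a matrix $A \sim \mathrm{GRR}^k_d$ and their lengths drawn i.i.d. from the $\chi_d$ distribution, matching the marginals of $\mu_\phi = \mathcal{N}(0, I_d)$ for the Gaussian kernel.

```python
import numpy as np

def kac_matrix(d, k, rng):
    """Materialise a sample A ~ GRR^k_d by applying k random Givens rotations to the identity."""
    A = np.eye(d)
    for _ in range(k):
        i, j = rng.choice(d, size=2, replace=False)
        theta = rng.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(theta), np.sin(theta)
        ri, rj = A[i].copy(), A[j].copy()
        A[i], A[j] = c * ri + s * rj, -s * ri + c * rj   # left-multiply by G[i, j, theta]
    return A

def kac_rff_kernel_estimate(x, y, C=2.0, rng=None):
    """Estimator of the Gaussian kernel with frequencies W = D A ~ GRR^{Phi,k}_d (Definition 5.4)."""
    rng = np.random.default_rng(rng)
    d = x.shape[0]
    k = int(np.ceil(C * d * np.log(d)))                  # k = C d log d Givens rotations
    A = kac_matrix(d, k, rng)
    lengths = np.sqrt(rng.chisquare(df=d, size=d))       # ||w_i|| ~ chi_d, i.i.d. (diagonal of D)
    W = lengths[:, None] * A                             # rows of W = D A are the w_i
    feat = lambda p: np.concatenate([np.cos(p), np.sin(p)]) / np.sqrt(d)
    return feat(W @ x) @ feat(W @ y)
```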
6. Experiments

We illustrate the theory of Section 5 with a variety of experiments, and provide additional comparisons between the AOMC methods described in Sections 3 and 4. In all experiments, we used $C d \log(d)$ rotations with $C = 2$ for the KAC mechanism. We note that there is a line of work on learning some of these structured constructions (Jing et al., 2017), but in this paper we focus on randomized transformations.

6.1. MMD Comparisons

We directly compare the distribution of $\mathbf{M}$ obtained from AOMC algorithms with Haar measure on $O(d)$ via maximum mean discrepancy (MMD) (Gretton et al., 2012). Given a set $\mathcal{X}$, MMD is a distance on $\mathcal{P}(\mathcal{X})$, specified by choosing a kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which encodes similarities between pairs of points in $\mathcal{X}$. The squared MMD between two distributions $\eta, \mu \in \mathcal{P}(\mathcal{X})$ is then defined by

$$\mathrm{MMD}(\eta, \mu)^2 = \mathbb{E}_{X,X'}[K(X, X')] - 2\,\mathbb{E}_{X,Y}[K(X, Y)] + \mathbb{E}_{Y,Y'}[K(Y, Y')], \qquad (9)$$

where $X, X' \overset{\mathrm{i.i.d.}}{\sim} \eta$, and independently, $Y, Y' \overset{\mathrm{i.i.d.}}{\sim} \mu$. Many metrics can be used to compare probability distributions. MMD is a natural candidate for these experiments for several reasons: (i) it is straightforward to compute unbiased estimators of the MMD given samples from the distributions concerned, unlike e.g. Wasserstein distance; (ii) MMD takes into account geometric information about the space $\mathcal{X}$, unlike e.g. total variation; and (iii) in some cases, it is possible to deal with uniform distributions analytically, rather than requiring approximation through samples.

The comparison we make is the following. For a fixed vector $v \in S^{d-1}$, we compare the distribution of $\mathbf{M}v$ against uniform measure on the sphere $S^{d-1}$, for cases where $\mathbf{M}$ is drawn from an AOMC method. In order to facilitate comparison of various AOMC methods, we compare the number of floating-point operations (FLOPs) required to evaluate matrix-vector products vs. the squared MMD between the two distributions on the sphere described above; we use FLOPs to facilitate straightforward comparison between methods without needing to consider specific implementation details and hardware optimisation, but observe that in practice, such considerations may also warrant attention.

To use the MMD metric defined in Equation (9), we require a kernel $K : S^{d-1} \times S^{d-1} \to \mathbb{R}$. We propose the exponentiated-angular kernel, defined by $K_\lambda(x, y) = \exp(-\lambda\, \theta(x, y))$ for $\lambda > 0$, where $\theta(x, y)$ is the angle between $x$ and $y$. With this kernel, we can analytically integrate out the terms in Equation (9) concerning the uniform distribution on the sphere (see Appendix for details); a generic sample-based estimator is sketched at the end of this subsection.

Results comparing FLOPs against MMD are displayed in Figure 2.

Figure 2. MMD squared vs. floating-point operations required for matrix-vector products, dimensionality 16.

Several interesting observations can be made. First, whilst a single Hadamard-Rademacher matrix incurs a low number of FLOPs relative to other methods (by virtue of the restriction on the angles appearing in its Givens rotation factorisation; see Section 3), this comes at a cost of significantly higher squared MMD relative to competing methods. Pleasingly, the Hadamard-Multi Rademacher random matrix achieves a much more competitive squared MMD without incurring any additional FLOPs, making this newly-proposed method a strong contender as judged by an MMD vs. FLOPs trade-off. Secondly, butterfly and structured Givens product matrices incur higher numbers of FLOPs due to the lack of restrictions placed on the random angles in their Givens factorisations, but achieve extremely small squared MMD. Finally, we observe the dramatic savings in FLOPs that can be made, even in modest dimensions, by passing from exact OMC methods to AOMC methods.
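For completeness, here is a generic unbiased sample-based estimator of (9) with the exponentiated-angular kernel (our sketch; in the experiments above, the terms involving the uniform distribution on the sphere are instead integrated analytically, as noted earlier).

```python
import numpy as np

def angular_kernel(X, Y, lam=1.0):
    """Exponentiated-angular kernel K_lam(x, y) = exp(-lam * angle(x, y)); rows are unit vectors."""
    cos = np.clip(X @ Y.T, -1.0, 1.0)
    return np.exp(-lam * np.arccos(cos))

def mmd2_unbiased(X, Y, lam=1.0):
    """Unbiased estimator of MMD(eta, mu)^2 in Equation (9) from samples X ~ eta and Y ~ mu."""
    Kxx, Kyy, Kxy = angular_kernel(X, X, lam), angular_kernel(Y, Y, lam), angular_kernel(X, Y, lam)
    n, m = X.shape[0], Y.shape[0]
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))   # drop diagonal terms for unbiasedness
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()
```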
6.2. Kernel Approximation

We present experiments on four datasets: boston, cpu, wine, parkinson (more datasets are studied in the Appendix).

Pointwise kernel approximation: We computed the empirical mean squared error (MSE) for several estimators of the Gaussian kernel and the dot-product kernel considered in this paper, for several datasets (see Appendix). We tested the following estimators: the baseline using Gaussian unstructured matrices (IID), exact OMC using Gaussian orthogonal matrices and producing orthogonal random features (ORF), and AOMC methods using Hadamard-Rademacher matrices with three HD blocks (HD), Hadamard-Multi Rademacher matrices (HMD), Kac's random walk matrices (KAC), structured Givens products (SGP), and butterfly matrices (BFLY). Results for the Gaussian kernel are presented in Figures 3 and 4.

Approximating kernel matrices: We test the relative error of kernel matrix estimation for the above estimators for the Gaussian kernel (following the setting of Choromanski & Sindhwani, 2016). Results are presented in Figure 5.

Figure 3. Empirical MSE (mean squared error) for the pointwise evaluation of the Gaussian kernel for different MC estimators.

Figure 4. Number of FLOPs required to reach particular empirical MSE levels for the pointwise evaluation of the Gaussian kernel for different MC estimators.

Figure 5. Normalized Frobenius norm error for the Gaussian kernel matrix approximation on (a) boston, (b) cpu, (c) wine, and (d) parkinson. We compare the same estimators as for the pointwise kernel approximation experiments.

6.3. Policy Search

We consider here applying the proposed classes of structured matrices to construct AOMCs for the gradients of Gaussian smoothings of blackbox functions, which can be used for blackbox optimization. The Gaussian smoothing (Nesterov & Spokoiny, 2017) of a blackbox function $F$ is given as:

$$F_\sigma(\theta) = \mathbb{E}_{g \sim \mathcal{N}(0, I_d)}[F(\theta + \sigma g)] \qquad (10)$$

for a smoothing parameter $\sigma > 0$. The gradient of the Gaussian smoothing of $F$ is given by the formula:

$$\nabla F_\sigma(\theta) = \frac{1}{\sigma}\, \mathbb{E}_{g \sim \mathcal{N}(0, I_d)}[F(\theta + \sigma g)\, g]. \qquad (11)$$

The above formula leads to several MC estimators of $\nabla F_\sigma(\theta)$, using as vectors $g$ the rows of matrices sampled from certain distributions (Conn et al., 2009; Salimans et al., 2017). In particular, it was recently shown that exact OMCs provide, in that setting, more accurate estimators of $\nabla F_\sigma(\theta)$, which in turn lead to more efficient blackbox optimization algorithms applying gradient-based methods, with the estimated gradients used to find maxima/minima of blackbox functions.

In the reinforcement learning (RL) setting, the blackbox function $F$ takes as input the parameters $\theta$ of a policy $\pi_\theta : \mathcal{S} \to \mathcal{A}$ (mapping states to actions that should be applied in those states), usually encoded by feedforward neural networks, and outputs the total reward obtained by an agent applying the policy $\pi_\theta$ in the given environment.
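A minimal sketch (ours) of the resulting Monte Carlo gradient estimator is given below; the matrix of perturbation directions can be built from i.i.d. Gaussian rows (the vanilla estimator), from exact OMC (GORT), or from any of the AOMC constructions of Sections 3 and 4 with row lengths rescaled to be i.i.d. $\chi_d$, as in Definition 5.4.

```python
import numpy as np

def es_gradient(F, theta, G, sigma=0.1):
    """Monte Carlo estimate of the smoothed gradient in Expression (11).

    F     : blackbox function, e.g. total reward of the policy pi_theta over one rollout
    theta : current policy parameters (d-dimensional vector)
    G     : (m x d) matrix whose rows g_i are the perturbation directions
    """
    rewards = np.array([F(theta + sigma * g) for g in G])   # one blackbox evaluation per row
    return (rewards[:, None] * G).mean(axis=0) / sigma      # (1/(sigma m)) sum_i F(theta + sigma g_i) g_i

# Example with vanilla i.i.d. Gaussian directions:
# rng = np.random.default_rng(0)
# G = rng.standard_normal((m, d))
# grad_estimate = es_gradient(F, theta, G, sigma=0.1)
```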
We conduct two sets of RL experiments.

OpenAI Gym tasks: We compare different MC estimators on the task of learning an RL policy for the Swimmer task from OpenAI Gym. The policy is encoded by a neural network with two hidden layers of size 41 each, using Toeplitz matrices. The gradient vector is 253-dimensional and we use k = 253 samples for each experiment. We compare different MC estimators, including our new constructions. The results are presented in Figure 6, where GORT stands for exact OMC (using Gaussian orthogonal directions).

Figure 6. Comparing learning curves for RL policy training for algorithms using different MC estimators to approximate the gradient of the blackbox function, on the example of the Swimmer task.

Quadruped locomotion with Minitaur platform: We apply Kac's random walk matrices to learn RL walking policies on the simulator of the Minitaur robot. We learn linear policies of 96 parameters. We demonstrate that AOMCs based on Kac's random walk matrices can easily learn good quality walking behaviours (see Appendix for details and full results). We attach a video library showing how these learned walking policies work in practice.

Comments on results: Across Figures 3-5, all OMC/AOMC methods beat IID significantly, confirming earlier observations. Our new HMD approach does particularly well on the Frobenius norm metric, which suggests it may be more effective for downstream tasks. We aim to study this phenomenon in future work. The KAC method performs very well, indeed best on 3 of the 4 datasets in Figure 3. This is encouraging given our theoretical guarantees in Theorem 5.5, and shows that KAC works well in practice for small values of the constant C. Another advantage of KAC is that one can use any dimensionality without zero-padding, drastically reducing the number of rollouts required in policy search tasks. In the Swimmer RL task shown in Figure 6, both HMD and KAC provide excellent performance, rapidly reaching high reward.

7. Conclusion

We have given a unifying account of several approaches for approximately uniform orthogonal matrix generation. Through this unifying perspective, we introduced a new random matrix distribution, the Hadamard-Multi Rademacher distribution. We also gave the first guarantees that approximate methods for OMC can yield statistical improvements relative to baselines, by harnessing recent developments in the theory of Kac's random walk, and conducted an extensive empirical evaluation.

Acknowledgements

We thank the anonymous reviewers for helpful comments. MR acknowledges support by EPSRC grant EP/L016516/1 for the Cambridge Centre for Analysis. AW acknowledges support from the David MacKay Newton research fellowship at Darwin College, The Alan Turing Institute under EPSRC grant EP/N510129/1 & TU/B/000074, and the Leverhulme Trust via the CFI.

References

Ailon, N. and Chazelle, B. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput., 39(1):302–322, 2009.

Andoni, A., Indyk, P., Laarhoven, T., Razenshteyn, I., and Schmidt, L. Practical and optimal LSH for angular distance. In Neural Information Processing Systems (NIPS), 2015.

Choromanski, K. and Sindhwani, V. Recycling randomness with structure for sublinear time kernel expansions. In International Conference on Machine Learning (ICML), 2016.

Choromanski, K., Rowland, M., and Weller, A. The unreasonable effectiveness of structured random orthogonal embeddings. In Neural Information Processing Systems (NIPS), 2017.

Choromanski, K., Rowland, M., Sarlos, T., Sindhwani, V., Turner, R. E., and Weller, A. The geometry of random features. In Artificial Intelligence and Statistics (AISTATS), 2018a.

Choromanski, K., Rowland, M., Sindhwani, V., Turner, R. E., and Weller, A. Structured evolution with compact architectures for scalable policy optimization. In International Conference on Machine Learning (ICML), 2018b.

Conn, A. R., Scheinberg, K., and Vicente, L. N. Introduction to Derivative-Free Optimization. SIAM, 2009.

Dasgupta, A., Kumar, R., and Sarlós, T. A sparse Johnson-Lindenstrauss transform. In Symposium on Theory of Computing (STOC), pp. 341–350. ACM, 2010.

Dick, J. and Pillichshammer, F. Digital Nets and Sequences: Discrepancy Theory and Quasi-Monte Carlo Integration. Cambridge University Press, 2010.

Genz, A. Methods for generating random orthogonal matrices. In Monte Carlo and Quasi-Monte Carlo Methods (MCQMC), 1998.

Givens, W. Computation of plane unitary rotations transforming a general matrix to triangular form. Journal of the Society for Industrial and Applied Mathematics, 6(1):26–50, 1958.
Grathwohl, W., Choi, D., Wu, Y., Roeder, G., and Duvenaud, D. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. In International Conference on Learning Representations (ICLR), 2018.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. J. Mach. Learn. Res., 13(1):723–773, 2012.

Jing, L., Shen, Y., Dubcek, T., Peurifoy, J., Skirlo, S. A., LeCun, Y., Tegmark, M., and Soljacic, M. Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In International Conference on Machine Learning (ICML), 2017.

Johnson, W. and Lindenstrauss, J. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability, volume 26, pp. 189–206, 1984.

Mathieu, M. and LeCun, Y. Fast approximation of rotations and Hessians matrices. arXiv, 2014.

Metropolis, N. and Ulam, S. The Monte Carlo method. Journal of the American Statistical Association, 44(247):335–341, 1949.

Mezzadri, F. How to generate random matrices from the classical compact groups. Notices of the American Mathematical Society, 54(5):592–604, 2007.

Munkhoeva, M., Kapushev, Y., Burnaev, E., and Oseledets, I. Quadrature-based features for kernel approximation. In Neural Information Processing Systems (NeurIPS), 2018.

Nesterov, Y. and Spokoiny, V. Random gradient-free minimization of convex functions. Found. Comput. Math., 17(2):527–566, 2017.

Oliveira, R. I. On the convergence to equilibrium of Kac's random walk on matrices. Ann. Appl. Probab., 19(3):1200–1231, 2009.

Pillai, N. S. and Smith, A. Kac's walk on n-sphere mixes in n log n steps. Ann. Appl. Probab., 27(1):631–650, 2017.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Neural Information Processing Systems (NIPS), 2007.

Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. Evolution strategies as a scalable alternative to reinforcement learning. arXiv, 2017.

Tucker, G., Mnih, A., Maddison, C. J., Lawson, J., and Sohl-Dickstein, J. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Neural Information Processing Systems (NIPS), 2017.

Yu, F., Suresh, A., Choromanski, K., Holtmann-Rice, D., and Kumar, S. Orthogonal random features. In Neural Information Processing Systems (NIPS), 2016.