# TRF: Learning Kernels with Tuned Random Features

Alistair Shilton, Sunil Gupta, Santu Rana, Arun Kumar Venkatesh, Svetha Venkatesh
Applied Artificial Intelligence Institute (A2I2), Deakin University, Geelong, Australia
{alistair.shilton, sunil.gupta, santu.rana, aanjanapuravenk, svetha.venkatesh}@deakin.edu.au

Copyright 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Random Fourier features (RFF) are a popular set of tools for constructing low-dimensional approximations of translation-invariant kernels, allowing kernel methods to be scaled to big data. Apart from their computational advantages, by working in the spectral domain random Fourier features expose the translation-invariant kernel as a density function that may, in principle, be manipulated directly to tune the kernel. In this paper we propose selecting the density function from a reproducing kernel Hilbert space, allowing us to search the space of all translation-invariant kernels. Our approach, which we call tuned random features (TRF), achieves this by approximating the density function as the RKHS-norm regularised least-squares best fit to an unknown true optimal density function, resulting in an RFF formulation where kernel selection is reduced to regularised risk minimisation with a novel regulariser. We derive bounds on the Rademacher complexity of our method, showing that our random-features approximation converges to optimal kernel selection in the large $N, D$ limit. Finally, we present experimental results for a variety of real-world learning problems, demonstrating the performance of our approach compared to comparable methods.

1 Introduction

Kernel-based learning is an elegant and powerful family of techniques in machine learning (Cristianini and Shawe-Taylor 2005; Hastie, Tibshirani, and Friedman 2001; Herbrich 2002; Schölkopf and Smola 2001; Shawe-Taylor and Cristianini 2004; Steinwart and Christmann 2008; Vapnik 1995; Suykens et al. 2002). Rather than constructing a complex parametric model and then learning its parameters, kernel-based methods encode this complexity into a kernel and then learn a linear (dual) representation using representor theory. A significant theoretical framework has been developed demonstrating the advantages of this approach, backed by substantial experimental evidence. However the computational complexity typically scales as $O(N^3)$, where $N$ is the training set size, so kernel methods may not scale well for large datasets. Further, kernel selection is often ad hoc, relying heavily on user knowledge and guesswork (which kernels to consider, etc.), and can be slow if global optimisation methods such as Bayesian optimisation are used to tune hyper-parameters like weights, length-scales, etc.

Random Fourier features (RFF) (Rahimi and Recht 2006; Liu et al. 2020) were originally developed to tackle the problem of computational complexity. RFF-based methods work by approximating the feature map underlying a kernel using a finite ($D$-)dimensional map obtained by sampling from a density function - typically, but not necessarily (Chang et al. 2017; Liu et al. 2020; Bullins, Zhang, and Zhang 2017), the spectral density corresponding to the kernel by Bochner's theorem. Using this map, the problem is re-cast in approximate feature space and solved in primal form, reducing the typical complexity to $O(ND^3)$.
Building on this, methods have been developed that take advantage of the fact that the feature map is exposed in order to, in effect, tune it. For example, random features may instead be drawn to maximise some criterion, typically kernel alignment (Li et al. 2019a; Yu et al. 2015; Bullins, Zhang, and Zhang 2017; Sinha and Duchi 2016). Alternatively, Fourier kernel learning (FKL) (Băzăvan, Li, and Sminchisescu 2012) directly learns feature weights during training, in effect making the density itself an optimisation parameter. The elegance and directness of FKL make it particularly attractive from a practical standpoint; however, regularising the feature weights using a Euclidean norm effectively casts the density function in a reproducing kernel Hilbert space with a delta (diagonal) kernel, which is somewhat restrictive and does not afford the user an opportunity to incorporate their expectations regarding the spectral structure of the optimal kernel.

In this paper we propose an algorithm, tuned random features (TRF), that performs simultaneous regularised risk minimisation and kernel learning in spectral space. Our algorithm incorporates kernel regularisation in the spectral domain to prevent kernel over-fitting, and uses a meta-kernel to allow users to specify the characteristics we expect the optimal kernel to have, or, more precisely, that we expect the optimal kernel's spectral density to have. To achieve this, we let the density function itself be the parameter (function) to be selected from a reproducing kernel Hilbert space (RKHS), where the meta-kernel defining this RKHS captures the characteristics we expect of the kernel's spectral density. We incorporate kernel selection directly into the regularised empirical risk minimisation problem formulation, with regularisation towards a default (reference) kernel. Using representor and RFF theory, we obtain a convex optimisation problem combining kernel learning - that is, learning a function in a reproducing kernel Hilbert space $\mathcal{H}_K$ - and learning the kernel $K$ itself. We show that TRF is convex and readily trained using gradient-based methods, and demonstrate uniform convergence as $N, D \to \infty$ using Rademacher complexity analysis, given an appropriate schedule of hyper-parameters. Experimentally, we have tested TRF on a variety of small and large real-world problems and show that TRF outperforms comparable methods on most occasions.

1.1 Notation

Column vectors are written in lower-case bold $a, b, \ldots$ with elements $a_i$, so $a = [a_i]_i$; matrices in upper-case bold $A, B, \ldots$ with elements $A_{i,j}$, so $A = [A_{i,j}]_{i,j}$; and $a^* = \bar{a}^\top$, $A^* = \bar{A}^\top$ is the conjugate transpose. The Hadamard (elementwise) product is denoted $a \odot b = [a_i b_i]_i$, the Hadamard power $a^{\odot b} = [a_i^b]_i$, and the elementwise norm $|a| = [|a_i|]_i$. $\mathbb{N}_n = \{0, 1, \ldots, n-1\}$ is the integers modulo $n$. The weighted inner product is $\langle \zeta, \gamma \rangle_\rho = \int \zeta^*(\omega)\gamma(\omega)\rho(\omega)\,d\omega$ for $\rho : \mathcal{X} \to \mathbb{R}_+$, and the weighted norm is $\|\zeta\|_\rho^2 = \langle \zeta, \zeta \rangle_\rho$. The reproducing kernel Hilbert space (RKHS) norm is $\|\cdot\|_{\mathcal{H}_K}$ for kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. We denote $\mathcal{H}_\kappa^+ = \{\sigma \in \mathcal{H}_\kappa : \sigma(\omega) = \sigma(-\omega) \geq 0\ \forall \omega\}$. The training set is $\mathbb{D} = \{(x_i, y_i) \in \mathcal{X} \times \mathcal{Y} \mid i \in \mathbb{N}_N\}$, where training pairs $(x_i, y_i) \sim \mathcal{S}$ are drawn i.i.d. from distribution $\mathcal{S}$. We use $N$ for the size of the training set, $d$ for its input dimension ($x \in \mathcal{X} \subseteq \mathbb{R}^d$), and $D$ for the dimension of the random feature map $z : \mathbb{R}^d \to \mathbb{C}^D$.
2 Background: Random Fourier Features

As first introduced in (Rahimi and Recht 2006), random Fourier features (RFF) (Liu et al. 2020) allow kernel methods to be scaled to large data by approximating the kernel as $K(x, x') \approx z^*(x) z(x')$ for some finite-dimensional $z : \mathbb{R}^d \to \mathbb{C}^D$. So, for example, rather than learning a function $f \in \mathcal{H}_K \oplus \mathbb{C}$ for positive-definite kernel $K$:

$$f(x) = \textstyle\sum_i \alpha_i K(x, x_i) + b \quad (1)$$

which has a typical complexity of $O(N^3)$ to find $\alpha$ and requires $O(N^2)$ memory, we may instead learn:

$$f(x) \approx \tilde{f}(x) = w^* z(x) + b$$

which has a typical complexity of $O(ND^2)$ to find $w$ and requires $O(ND)$ memory. By making the complexity linear in $N$ it becomes feasible to scale SVMs (and many other kernel-based methods such as Gaussian processes) to large datasets.

For translation-invariant kernels (Genton 2001) the basis of random Fourier features is Bochner's theorem (Bochner 1932): for a continuous, translation-invariant, positive-definite kernel $K(x, x') = k(x - x')$, there exists an even spectral density function $\rho : \mathbb{R}^d \to \mathbb{R}_+$ so that:

$$\begin{aligned}
K(x, x') &= \textstyle\int_{\mathbb{R}^d} e^{i\omega^\top (x - x')} \rho(\omega)\, d\omega && \text{(Fourier transform)} \\
&= \textstyle\int_{\mathbb{R}^d} \xi^*(\omega; x)\, \xi(\omega; x')\, \rho(\omega)\, d\omega && \text{(split $x$, $x'$ terms)} \\
&= \langle \xi(\cdot; x), \xi(\cdot; x') \rangle_\rho && \text{(inner product in feature space)} \\
&= \mathbb{E}_{\omega \sim \rho}[\xi^*(\omega; x)\, \xi(\omega; x')] && \text{(as an expectation)}
\end{aligned} \quad (2)$$

where $\xi(\omega; x) = e^{i\omega^\top x}$, and $\xi(\cdot; x) = [\xi(\omega; x)]_{\omega \in \mathbb{R}^d}$ is the feature map for kernel $K$. Thus, re-writing in feature-space form, (1) becomes (Bach 2017, Appendix A):

$$f(x) = \langle \tau(\cdot), \xi(\cdot; x) \rangle_\rho + b = \mathbb{E}_{\omega \sim \rho}[\tau^*(\omega)\, \xi(\omega; x)] + b \quad (3)$$

where $\tau : \mathbb{R}^d \to \mathbb{C}$ is the weight function and $b \in \mathbb{C}$ is the bias. Thus if $\omega \sim \rho$ then $\xi^*(\omega; x)\xi(\omega; x')$ and $\tau^*(\omega)\xi(\omega; x)$ are unbiased estimates of $K(x, x')$ and $f(x) - b$, respectively. By sampling $\omega_0, \omega_1, \ldots, \omega_{D-1} \sim \rho$ we obtain the Monte Carlo (MC) approximations of $K$ and $f$:

$$K(x, x') \approx \tilde{K}(x, x') = z^*(x)\, z(x'), \qquad f(x) \approx \tilde{f}(x) = \tau^* z(x) + b \quad (4)$$

where $z(x) = [e^{i\omega_i^\top x}/\sqrt{D}]_{i \in \mathbb{N}_D}$ is the random feature map and $\tau = [\tau(\omega_i)/\sqrt{D}]_i$ is the weight vector. This approximate feature map has dimension $D$. By working in the approximate feature space we reduce computational complexity to $O(ND^2)$ rather than $O(N^3)$, which is scalable to large $N$.

It is not necessary to sample from the distribution $\rho$ defined by $K$ (Yang et al. 2014; Avron et al. 2016; Chang et al. 2017; Bullins, Zhang, and Zhang 2017; Liu et al. 2020; Li et al. 2019b). Given a strictly positive, even reference density $\hat{\rho} : \mathbb{R}^d \to \mathbb{R}_+$ (which is associated with a reference kernel $\hat{K}(x, x') = \hat{k}(x - x')$ via Bochner's theorem), and defining $\mu(\hat{\omega}) = \rho(\hat{\omega})/\hat{\rho}(\hat{\omega})$, (2) and (3) become:

$$\begin{aligned}
K(x, x') &= \langle \mu(\cdot)\, \xi(\cdot; x), \xi(\cdot; x') \rangle_{\hat{\rho}} = \mathbb{E}_{\hat{\omega} \sim \hat{\rho}}[\mu(\hat{\omega})\, \xi^*(\hat{\omega}; x)\, \xi(\hat{\omega}; x')] \\
f(x) &= \langle \mu(\cdot)\, \hat{\tau}(\cdot), \xi(\cdot; x) \rangle_{\hat{\rho}} + b = \mathbb{E}_{\hat{\omega} \sim \hat{\rho}}[\mu(\hat{\omega})\, \hat{\tau}^*(\hat{\omega})\, \xi(\hat{\omega}; x)] + b
\end{aligned}$$

Here $\mu(\hat{\omega})\xi^*(\hat{\omega}; x)\xi(\hat{\omega}; x')$ and $\mu(\hat{\omega})\hat{\tau}^*(\hat{\omega})\xi(\hat{\omega}; x)$ are unbiased estimates of $K(x, x')$ and $f(x) - b$, respectively, when $\hat{\omega} \sim \hat{\rho}$. Hence, by sampling $\hat{\omega}_0, \hat{\omega}_1, \ldots, \hat{\omega}_{D-1} \sim \hat{\rho}$:

$$\begin{aligned}
K(x, x') &\approx \tilde{K}(x, x') = (u \odot \hat{z}(x))^*\, \hat{z}(x') \\
f(x) &\approx \tilde{f}(x) = (u \odot \hat{\tau})^*\, \hat{z}(x) + b \\
\hat{K}(x, x') &\approx \tilde{\hat{K}}(x, x') = \hat{z}^*(x)\, \hat{z}(x')
\end{aligned} \quad (5)$$

where $\hat{z}(x) = [e^{i\hat{\omega}_i^\top x}/\sqrt{D}]_{i \in \mathbb{N}_D}$ is the random feature map, $\hat{\tau} = [\hat{\tau}(\hat{\omega}_i)/\sqrt{D}]_i$ is the weight vector, and we call $u = [\mu(\hat{\omega}_i)]_i$ the density (ratio) vector. As demonstrated in, for example, (Li et al. 2019b; Avron et al. 2017), this approach may give a better approximation of $K$ from fewer samples (for example the features $\hat{\omega}_i$ may be placed according to the data-dependent empirical ridge leverage score distribution). Table 1 summarises the definitions and notation for RFF for both the standard MC and weighted-MC approaches.
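To make (4) and (5) concrete, the following minimal NumPy sketch builds the random feature map for an RBF reference kernel with unit length-scale, whose spectral density $\hat\rho$ is the standard normal $\mathcal{N}(0, I)$. The density ratio `mu` used at the end is a toy, hypothetical choice included only to show the mechanics of the weighted form; it is not the learned density ratio of later sections, and the dimensions and seed are arbitrary.

```python
# Sketch of (weighted) random Fourier features, following Section 2.
# Assumes the reference kernel is the unit-length-scale RBF kernel, whose
# spectral density rho_hat is the standard normal N(0, I).
import numpy as np

rng = np.random.default_rng(0)
d, D = 3, 2000                            # input dimension, number of random features

omega_hat = rng.standard_normal((D, d))   # samples omega_hat_i ~ rho_hat = N(0, I)

def z_hat(x):
    """Random feature map z_hat(x) = [exp(i omega_hat_i^T x) / sqrt(D)]_i."""
    return np.exp(1j * omega_hat @ x) / np.sqrt(D)

x, xp = rng.standard_normal(d), rng.standard_normal(d)

# Reference kernel approximation:  K_hat(x, x') ~= z_hat(x)^* z_hat(x')
K_hat_approx = np.vdot(z_hat(x), z_hat(xp)).real
K_hat_exact = np.exp(-0.5 * np.linalg.norm(x - xp) ** 2)   # RBF with length-scale 1
print(K_hat_approx, K_hat_exact)          # agree up to Monte-Carlo error

# Weighted form: reweighting each feature by u_i = mu(omega_hat_i) approximates
# the kernel with spectral density rho = mu * rho_hat.  This mu is a toy choice
# for illustration only, not the learned density ratio of the paper.
mu = lambda w: np.exp(-0.5 * np.sum(w ** 2, axis=-1))
u = mu(omega_hat)
K_approx = np.vdot(u * z_hat(x), z_hat(xp)).real           # (u . z_hat(x))^* z_hat(x')
```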
2.1 (Spectral) Kernel Selection

Having exposed the feature map in spectral form, it is natural to ask whether kernel selection may be done by tuning the (spectrally sampled) feature map $z : \mathbb{R}^d \to \mathbb{C}^D$. For example, rather than drawing features $\omega_i$ from $\rho$ or $\hat\rho$ we may select them to maximise kernel alignment (Li et al. 2019a; Yu et al. 2015; Bullins, Zhang, and Zhang 2017; Cristianini et al. 2002; Cortes, Mohri, and Rostamizadeh 2012); or we may give $\rho$ a parametric form such as a mixture of Gaussians (Wilson and Adams 2013) or something more general (Yang et al. 2015) to obtain a kernel mixture that may be tuned. Alternatively, Fourier kernel learning (FKL) (Băzăvan, Li, and Sminchisescu 2012) proposes directly selecting $u \in \mathbb{R}^D_+$ during training with a regularisation term $\|u\|_2^2$.

Table 1: Summary of RFF and related notation. In this table $f \in \mathcal{H}_K \oplus \mathbb{C}$ is the function we wish to learn, where $K$ is a translation-invariant kernel with corresponding density $\rho$ as per (2) (Bochner's theorem). The upper table shows the spectral form of $f$ (left) and the modified (weighted) spectral form using reference kernel $\hat K$ (right). The lower table shows the Monte-Carlo approximation with $D$ samples (left) and the weighted Monte-Carlo approximation (right), where we use $\tilde f$ and $\tilde K$ to indicate RFF approximations of $f$ and $K$, respectively.

| | Standard form | Modified (weighted) form |
|---|---|---|
| Kernel | $K(x,x') = \langle \xi(\cdot;x), \xi(\cdot;x')\rangle_\rho$ | $K(x,x') = \langle \mu(\cdot)\xi(\cdot;x), \xi(\cdot;x')\rangle_{\hat\rho}$ |
| Reference kernel | | $\hat K(x,x') = \langle \xi(\cdot;x), \xi(\cdot;x')\rangle_{\hat\rho}$ |
| Function form | $f(x) = \langle \tau(\cdot), \xi(\cdot;x)\rangle_\rho + b$ | $f(x) = \langle \mu(\cdot)\hat\tau(\cdot), \xi(\cdot;x)\rangle_{\hat\rho} + b$ |
| Feature map | $\xi(\cdot;x) = [e^{i\omega^\top x}]_{\omega\in\mathbb{R}^d}$ | $\xi(\cdot;x) = [e^{i\hat\omega^\top x}]_{\hat\omega\in\mathbb{R}^d}$ |
| Weight function | $\tau:\mathbb{R}^d\to\mathbb{C}$, $\tau\in L_{2,\rho}$ | $\hat\tau:\mathbb{R}^d\to\mathbb{C}$, $\mu(\cdot)\hat\tau(\cdot)\in L_{2,\hat\rho}$ |
| Feature weights | | $\mu(\hat\omega) = \rho(\hat\omega)/\hat\rho(\hat\omega)$ |

| | Monte-Carlo approximation | Weighted Monte-Carlo approximation |
|---|---|---|
| Kernel | $K(x,x') \approx \tilde K(x,x') = z^*(x)z(x')$ | $K(x,x') \approx \tilde K(x,x') = (u\odot\hat z(x))^*\hat z(x')$ |
| Reference kernel | | $\hat K(x,x') \approx \tilde{\hat K}(x,x') = \hat z^*(x)\hat z(x')$ |
| Function form | $f(x) \approx \tilde f(x) = \tau^* z(x) + b$ | $f(x) \approx \tilde f(x) = (u\odot\hat\tau)^*\hat z(x)+b$ |
| Feature map | $z(x) = [\tfrac{1}{\sqrt D}e^{i\omega_i^\top x}]_{i\in\mathbb{N}_D}$ | $\hat z(x) = [\tfrac{1}{\sqrt D}e^{i\hat\omega_i^\top x}]_{i\in\mathbb{N}_D}$ |
| Weight vector | $\tau = [\tfrac{1}{\sqrt D}\tau(\omega_i)]_{i\in\mathbb{N}_D}$ | $\hat\tau = [\tfrac{1}{\sqrt D}\hat\tau(\hat\omega_i)]_{i\in\mathbb{N}_D}$ |
| Density vector | | $u = [\mu(\hat\omega_i)]_{i\in\mathbb{N}_D}$ |
| Feature-space inner product | $\langle\zeta,\gamma\rangle_\rho = \int_{\mathbb{R}^d}\zeta^*(\omega)\gamma(\omega)\rho(\omega)\,d\omega$ | $\langle\zeta,\gamma\rangle_{\hat\rho} = \int_{\mathbb{R}^d}\zeta^*(\hat\omega)\gamma(\hat\omega)\hat\rho(\hat\omega)\,d\hat\omega$ |
| Samples | $\omega_0,\omega_1,\ldots,\omega_{D-1}\sim\rho$ | $\hat\omega_0,\hat\omega_1,\ldots,\hat\omega_{D-1}\sim\hat\rho$ |

In the present paper we propose selecting (learning) a weight function $\mu \in \mathcal{H}_\kappa^+$ in spectral space, where $\kappa$ is a kernel (we call $\kappa$ a meta-kernel) that defines the characteristics we expect of the spectral density $\rho(\cdot) = \mu(\cdot)\hat\rho(\cdot)$ without restricting its exact form. Interestingly, we note that FKL is a variant (special case) of our method in which we use the meta-kernel $\kappa(\omega, \omega') = 1_{\omega=\omega'}$ and substitute $\hat\mu(\cdot) = 0$ (so $\hat v = \tilde{\hat v} = 0$ - see Sections 3.1, 3.2).

For completeness we note standard approaches to kernel selection and tuning, which typically involve selecting a finite set of test kernels and using grid search, Bayesian optimisation or similar methods to tune parameters (e.g. length-scale); or multi-kernel learning (Gönen and Alpaydin 2011). While powerful, these ad-hoc approaches are often computationally expensive and restrict the search space to the span of a small, pre-defined set of kernels. Alternatively, hyper-kernel methods (Ong, Williamson, and Smola 2003; Ong, Smola, and Williamson 2005) select $K$ from a hyper-RKHS defined by a hyper-kernel $\mathcal{K}$. However, hyper-kernel methods tend to be computationally complex, scaling as $O(N^3)$ or worse.

3 Tuned Random Features

In this section we introduce our method, tuned random features (TRF), for combining kernel selection and regularised risk minimisation using random Fourier features.
We begin by constructing a regularised empirical risk minimisation formulation in Fourier feature space. We then apply modified random Fourier features techniques to make the formulation practically attainable.

3.1 Learning the Kernel in the Spectral Domain

Given a training set $\mathbb{D} = \{(x_i, y_i) \in \mathbb{R}^d \times \mathcal{Y} \mid i \in \mathbb{N}_N\}$ of $N$ samples $(x_i, y_i) \sim \mathcal{S}$ drawn i.i.d. from a distribution $\mathcal{S}$, our goal is to select both $f \in \mathcal{H}_K \oplus \mathbb{C}$, where $\mathcal{H}_K$ is the reproducing kernel Hilbert space defined by translation-invariant kernel $K$, and the kernel $K$ defining the hypothesis space $\mathcal{H}_K$ itself, to minimise the regularised empirical risk:

$$\tfrac{1}{N} \textstyle\sum_i \ell(f(x_i), y_i) + \tfrac{\lambda}{2}\|f - b\|_{\mathcal{H}_K}^2$$

where $\ell$ is a loss function whose form depends on the problem being solved. Equivalently, in modified Fourier feature space (Table 1, upper-right quadrant), we may minimise:

$$R_N(\hat\tau, b) = \tfrac{1}{N} \textstyle\sum_i \ell\big(\langle \mu(\cdot)\hat\tau(\cdot), \xi(\cdot; x_i)\rangle_{\hat\rho} + b,\ y_i\big) + \tfrac{\lambda}{2}\langle \mu(\cdot)\hat\tau(\cdot), \hat\tau(\cdot)\rangle_{\hat\rho} \quad (6)$$

To incorporate kernel selection into this formulation we propose letting $\mu \in \mathcal{H}_\kappa^+ = \{\sigma \in \mathcal{H}_\kappa : \sigma(\hat\omega) = \sigma(-\hat\omega) \geq 0\ \forall \hat\omega \in \mathbb{R}^d\}$ be an even, strictly positive function for some meta-kernel $\kappa$ defining the spectral characteristics we expect of $\rho(\cdot) = \mu(\cdot)\hat\rho(\cdot)$ and hence, indirectly, by Bochner's theorem, our expectations of the kernel $K$. In modified Fourier feature space (Table 1, top-right), we aim to find:

$$f(x) = \langle \mu(\cdot)\hat\tau(\cdot), \xi(\cdot; x)\rangle_{\hat\rho} + b \quad (7)$$

where $\hat\tau : \mathbb{R}^d \to \mathbb{C}$, $b \in \mathbb{C}$, $\mu \in \mathcal{H}_\kappa^+$ minimise the tuned regularised empirical risk:

$$Q_N(\hat\tau, b, \mu) = \tfrac{1}{N} \textstyle\sum_i \ell\big(\langle \mu(\cdot)\hat\tau(\cdot), \xi(\cdot; x_i)\rangle_{\hat\rho} + b,\ y_i\big) + \tfrac{\lambda}{2}\langle \mu(\cdot)\hat\tau(\cdot), \hat\tau(\cdot)\rangle_{\hat\rho} + \tfrac{\Lambda}{2}\|\mu - \hat\mu\|_{\mathcal{H}_\kappa}^2 \quad (8)$$

and:

$$\hat\mu(\cdot) = \operatorname*{argmin}_{\hat\mu \in \mathcal{H}_\kappa^+} \textstyle\int_{\mathbb{R}^d} (\hat\mu(\hat\omega) - 1)^2\, \hat\rho(\hat\omega)\, d\hat\omega$$

is the function in $\mathcal{H}_\kappa$ that most closely approximates $\hat\mu(\cdot) = 1$ in the least-squares sense.¹ In this formulation $\hat\rho$ is defined by the (fixed) reference kernel $\hat K$, and $\mu \in \mathcal{H}_\kappa^+$ is characterised by the positive-definite meta-kernel $\kappa$. As per our definitions in the previous section and Table 1, the kernel $K$ is defined by the density function $\rho(\cdot) = \mu(\cdot)\hat\rho(\cdot)$, so minimising $Q_N$ for $\mu$ optimises $\rho$ and hence tunes $K$ to minimise the tuned regularised empirical risk. Note that:

1. The first two terms in the tuned regularised empirical risk (8) are the standard empirical risk and RKHS-norm regularisation terms, as per (6).

2. The final term is a regularisation term designed to prevent over-fitting of $\rho$ (and hence $K$). It is designed to regularise toward $\mu = \hat\mu \approx 1$ in the limit $\Lambda \to \infty$, which corresponds to $\rho \to \hat\rho$ and hence $K \to \hat K$. It follows that $\hat K$ acts as a default (fallback) kernel in the strong-regularisation limit. Note that if we regularised using $\tfrac{\Lambda}{2}\|\mu\|_{\mathcal{H}_\kappa}^2$ then $\mu \to 0$ in the limit, which corresponds to $K = 0$ and is therefore unhelpful.

To finish, we further simplify the tuned regularised empirical risk minimisation problem by defining $\eta(\cdot) = \hat\tau(\cdot)\mu(\cdot)$. In terms of $\eta, b$, in this paper we aim to find:

$$f(x) = \langle \eta(\cdot), \xi(\cdot; x)\rangle_{\hat\rho} + b \quad (9)$$

where $\eta : \mathbb{R}^d \to \mathbb{C}$ and $b \in \mathbb{C}$ minimise the tuned regularised empirical risk (8), which we re-write in terms of $\eta$ as follows:

$$Q_N(\eta, b) = \tfrac{1}{N} \textstyle\sum_i \ell\big(\langle \eta(\cdot), \xi(\cdot; x_i)\rangle_{\hat\rho} + b,\ y_i\big) + \tfrac{\lambda}{2}\, r(\eta) \quad (10)$$

In this expression $r$ is a regulariser of the form:

$$r(\eta) = \min_{\mu \in \mathcal{H}_\kappa^+} \|\eta(\cdot)\|^2_{\hat\rho(\cdot)/\mu(\cdot)} + \tfrac{\Lambda}{\lambda}\|\mu - \hat\mu\|_{\mathcal{H}_\kappa}^2 \quad (11)$$

where $\|\cdot\|_\rho^2 = \langle \cdot, \cdot \rangle_\rho$. Thus we see that the tuned empirical risk minimisation formulation can be written as a novel form of regularised risk minimisation.

¹ We cannot guarantee $\hat\mu = 1$ in general. If the feature map $\phi_\kappa : \mathbb{R}^d \to \mathbb{R}^m$ associated with $\kappa(\hat\omega, \hat\omega') = \phi_\kappa^\top(\hat\omega)\phi_\kappa(\hat\omega')$ by Mercer's theorem includes a constant term then, as $\hat\mu(\cdot) = \hat v^\top \phi_\kappa(\cdot)$ by definition, we can always select $\hat v$ so that $\hat\mu(\cdot) = \hat v^\top \phi_\kappa(\cdot) = 1 \in \mathcal{H}_\kappa^+$. For example, if $\kappa(\hat\omega, \hat\omega') = (1 + \hat\omega^\top \hat\omega')^2$ and $d = 2$ then $\phi_\kappa(\omega) = [1; \sqrt{2}\omega_0; \sqrt{2}\omega_1; \omega_0^2; \omega_1^2; \sqrt{2}\omega_0\omega_1]$ and $\hat\mu(\cdot) = \hat v^\top \phi_\kappa(\cdot) = 1 \in \mathcal{H}_\kappa^+$ when $\hat v = [1; \mathbf{0}]$. However if $\kappa$ is an RBF kernel then $1 \notin \mathcal{H}_\kappa$, though it may be approximated to arbitrary accuracy w.r.t. $\hat\rho$, as the RBF kernel is universal.
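To see why the RKHS-style penalty in (8) becomes the weighted-norm term in (11) after the substitution $\eta(\cdot) = \mu(\cdot)\hat\tau(\cdot)$, the following short derivation may help; it uses only the definitions in Table 1 and assumes $\mu(\hat\omega) > 0$ everywhere, as required for the reweighting to be well defined:

```latex
\frac{\lambda}{2}\,\big\langle \mu(\cdot)\,\hat\tau(\cdot),\, \hat\tau(\cdot)\big\rangle_{\hat\rho}
  = \frac{\lambda}{2}\int_{\mathbb{R}^d} \mu(\hat\omega)\,\big|\hat\tau(\hat\omega)\big|^2\,\hat\rho(\hat\omega)\,\mathrm{d}\hat\omega
  = \frac{\lambda}{2}\int_{\mathbb{R}^d} \big|\eta(\hat\omega)\big|^2\,\frac{\hat\rho(\hat\omega)}{\mu(\hat\omega)}\,\mathrm{d}\hat\omega
  = \frac{\lambda}{2}\,\big\|\eta\big\|^2_{\hat\rho/\mu}.
```

Minimising over $\mu \in \mathcal{H}_\kappa^+$ for fixed $\eta$ then yields exactly the regulariser $r(\eta)$ in (11), so (8) and (10) describe the same optimisation problem.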
3.2 Learning the Kernel via Tuned Random Features

We now apply modified random Fourier features to (8)/(10). Selecting a reference kernel $\hat K$ and using Table 1, we see that, in terms of the weight vector $\hat\tau$, density vector $u$ and random feature map $\hat z : \mathbb{R}^d \to \mathbb{C}^D$, the trained machine may be approximated using:

$$f(x) \approx \tilde f(x) = (u \odot \hat\tau)^* \hat z(x) + b$$

where $u \geq 0$ (this is a relaxation of $\mu \in \mathcal{H}_\kappa^+$, as we do not require that $\mu(\hat\omega) \geq 0$ for $\hat\omega \notin \{\hat\omega_0, \hat\omega_1, \ldots, \hat\omega_{D-1}\}$), and $\hat\tau$, $b$ and $\mu \in \{\sigma \in \mathcal{H}_\kappa : \sigma(\hat\omega_i) = u_i \geq 0\ \forall i\}$ minimise:

$$\tilde Q_N(\hat\tau, b, u, \mu) = \tfrac{1}{N} \textstyle\sum_i \ell\big((u \odot \hat\tau)^* \hat z(x_i) + b,\ y_i\big) + \tfrac{\lambda}{2}(u \odot \hat\tau)^* \hat\tau + \tfrac{\Lambda}{2}\|\mu - \hat\mu\|^2_{\mathcal{H}_\kappa} \quad (12)$$

To be useful in practice the final term must be approximated in terms of the random Fourier features representation. To this end we note that we can write $\mu, \hat\mu \in \mathcal{H}_\kappa^+$ in feature-space form $\mu(\cdot) = v^\top \phi_\kappa(\cdot)$ and $\hat\mu(\cdot) = \hat v^\top \phi_\kappa(\cdot)$, where $v, \hat v \in \mathbb{R}^m$ and $\phi_\kappa : \mathbb{R}^d \to \mathbb{R}^m$ is the feature map associated with $\kappa$ through Mercer's theorem. Define:

$$\begin{aligned}
v &= \operatorname*{argmin}_{v \in \mathbb{R}^m} \tfrac{1}{2} \textstyle\int_{\mathbb{R}^d} \big(v^\top \phi_\kappa(\hat\omega) - \mu(\hat\omega)\big)^2 \hat\rho(\hat\omega)\, d\hat\omega \\
\hat v &= \operatorname*{argmin}_{\hat v \in \mathbb{R}^m} \tfrac{1}{2} \textstyle\int_{\mathbb{R}^d} \big(\hat v^\top \phi_\kappa(\hat\omega) - 1\big)^2 \hat\rho(\hat\omega)\, d\hat\omega
\end{aligned} \quad (13)$$

Note that the definition of $v$ is tautological, while the definition of $\hat v$ defines the sense in which $\hat\mu \approx 1$. We approximate these as $\tilde\mu(\hat\omega) = \tilde v^\top \phi_\kappa(\hat\omega)$ and $\tilde{\hat\mu}(\hat\omega) = \tilde{\hat v}^\top \phi_\kappa(\hat\omega)$, respectively, where, recalling that $u_i = \mu(\hat\omega_i)$:

$$\begin{aligned}
\tilde v &= \operatorname*{argmin}_{\tilde v \in \mathbb{R}^m} \tfrac{1}{2D} \textstyle\sum_i \big(\tilde v^\top \phi_\kappa(\hat\omega_i) - u_i\big)^2 + \tfrac{\gamma}{2}\|\tilde v\|_2^2 \\
\tilde{\hat v} &= \operatorname*{argmin}_{\tilde{\hat v} \in \mathbb{R}^m} \tfrac{1}{2D} \textstyle\sum_i \big(\tilde{\hat v}^\top \phi_\kappa(\hat\omega_i) - 1\big)^2 + \tfrac{\gamma}{2}\|\tilde{\hat v}\|_2^2
\end{aligned} \quad (14)$$

That is, we replace $\mu$ and $\hat\mu$ with the regularised least-squares approximations obtained from the training sets $\{(\hat\omega_i, u_i) : i \in \mathbb{N}_D\}$ and $\{(\hat\omega_i, 1) : i \in \mathbb{N}_D\}$, respectively, where the regularisation terms are included to ensure uniform convergence $\tilde\mu \to \mu$, $\tilde{\hat\mu} \to \hat\mu$ in the limit $D \to \infty$. Hence:

$$\tilde v = \big(\Phi^\top \Phi + \gamma D I\big)^{-1} \Phi^\top u, \qquad \tilde{\hat v} = \big(\Phi^\top \Phi + \gamma D I\big)^{-1} \Phi^\top \mathbf{1}$$

where $\Phi = [\phi_{\kappa,j}(\hat\omega_i)]_{i,j}$. Subsequently, using the Woodbury matrix identity, $(\tilde v - \tilde{\hat v})^\top(\tilde v - \tilde{\hat v}) = (u - \mathbf{1})^\top H_\gamma (u - \mathbf{1})$, where $H_\gamma = (\Gamma + \gamma D I)^{-1}\Gamma(\Gamma + \gamma D I)^{-1}$ and $\Gamma = \Phi\Phi^\top = [\kappa(\hat\omega_i, \hat\omega_j)]_{i,j}$. Recalling that the RKHS norm $\|\cdot\|_{\mathcal{H}_\kappa}$ corresponds to the Euclidean norm in feature space, we see that $\|\mu - \hat\mu\|^2_{\mathcal{H}_\kappa} = \|v - \hat v\|_2^2 \approx \|\tilde v - \tilde{\hat v}\|_2^2$, and so we approximate:

$$\|\mu - \hat\mu\|^2_{\mathcal{H}_\kappa} \approx (u - \mathbf{1})^\top H_\gamma (u - \mathbf{1}) \quad (15)$$

where:

$$H_\gamma = (\Gamma + \gamma D I)^{-1}\, \Gamma\, (\Gamma + \gamma D I)^{-1}, \qquad \Gamma = [\kappa(\hat\omega_i, \hat\omega_j)]_{i,j \in \mathbb{N}_D}$$

With this approximation we may re-write the approximated modified regularised empirical risk (12) entirely in terms of (finite-dimensional) modified random Fourier features. To further simplify, let $w = u \odot \hat\tau = [\eta(\hat\omega_i)/\sqrt{D}]_{i \in \mathbb{N}_D}$, so:

$$f(x) \approx \tilde f(x) = w^* \hat z(x) + b \quad (16)$$

where $w \in \mathbb{C}^D$ and $b \in \mathbb{C}$ minimise the tuned random features (TRF) objective, obtained by substituting (15) into (12):

$$\tilde Q_N(w, b) = \tfrac{1}{N} \textstyle\sum_i \ell\big(w^* \hat z(x_i) + b,\ y_i\big) + \tfrac{\lambda}{2}\, r(w) \quad (17)$$

where $r$ is a regulariser of the form:

$$r(w) = \min_{u \in \mathbb{R}^D_+} w^* \operatorname{diag}(u)^{-1} w + \tfrac{\Lambda}{\lambda}(u - \mathbf{1})^\top H_\gamma (u - \mathbf{1}) \quad (18)$$

Note that (16)-(18) are the random Fourier features approximation of (9)-(11). In the next section we will show that $r$ is in fact convex, and moreover that its gradient is elementwise positive. Thus, when training with gradient descent (or similar), the effect of $r$ is to apply adaptive regularisation to each compound weight component $w$.
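The following minimal Python sketch shows one way the matrix $H_\gamma$ and the regulariser $r(w)$ of (17)-(18) can be formed from the sampled frequencies. It assumes an RBF meta-kernel and toy hyper-parameter values (neither is taken from the paper's experiments), and it solves the inner minimisation over $u$ with an off-the-shelf bounded solver rather than optimising $u$ jointly with $w$ and $b$ as the paper does in Section 5.

```python
# Sketch of the finite-dimensional TRF regulariser r(w) of (17)-(18), assuming
# an RBF meta-kernel kappa and illustrative (toy) hyper-parameter values.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
d, D = 3, 50
lam, Lam, gamma = 0.1, 1.0, 1e-2          # lambda, Lambda, gamma (toy values)

omega_hat = rng.standard_normal((D, d))   # frequencies sampled from rho_hat

# Meta-kernel Gram matrix Gamma = [kappa(omega_i, omega_j)]_{ij} (RBF meta-kernel)
sq = np.sum((omega_hat[:, None, :] - omega_hat[None, :, :]) ** 2, axis=-1)
Gamma = np.exp(-0.5 * sq)

# H_gamma = (Gamma + gamma*D*I)^{-1} Gamma (Gamma + gamma*D*I)^{-1}
A = np.linalg.inv(Gamma + gamma * D * np.eye(D))
H = A @ Gamma @ A

def rho_obj(u, w):
    """rho(u; w) = w^* diag(u)^{-1} w + (Lambda/lambda) (u-1)^T H_gamma (u-1)."""
    return float(np.sum(np.abs(w) ** 2 / u)
                 + (Lam / lam) * (u - 1) @ H @ (u - 1))

def r(w):
    """r(w) = min_{u >= 0} rho(u; w), solved numerically (convex in u for u > 0)."""
    res = minimize(rho_obj, x0=np.ones(D), args=(w,), method="L-BFGS-B",
                   bounds=[(1e-12, None)] * D)
    return res.fun, res.x

w = (rng.standard_normal(D) + 1j * rng.standard_normal(D)) / np.sqrt(D)
r_val, u_star = r(w)
# Per Section 4.1, the optimal density vector satisfies u* >= 1 (up to solver tolerance).
print(r_val, u_star.min())
```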
4 Theoretical Analysis

In this section we analyse the properties of the TRF formulation from a theoretical standpoint. We first analyse the properties of the TRF regulariser $r$ and demonstrate that it is a convex regularisation function with a straightforward gradient. Subsequently we analyse the Rademacher complexity of the formulation and give bounds demonstrating uniform convergence both in terms of $N$ (the training set size) and $D$ (the number of random features).

4.1 Properties of the TRF Regulariser r

In the previous section we demonstrated how tuned regularised empirical risk minimisation using random Fourier features can be reduced to a regularised empirical risk minimisation problem (17) in which the density vector $u$ only appears in the regulariser $r$ defined by (18). For convenience, we refactor the regulariser $r$ as:

$$r(w) = \rho(u^\star(w); w) \quad (19)$$

where $u^\star(w) = \operatorname*{argmin}_{u \in \mathbb{R}^D_+} \rho(u; w)$, and:

$$\rho(u; w) = w^* \operatorname{diag}(u)^{-1} w + \tfrac{\Lambda}{\lambda}(u - \mathbf{1})^\top H_\gamma (u - \mathbf{1}) \quad (20)$$

The first term, which will dominate when $\Lambda \ll \lambda$, will tend to push the density vector $u \to \infty$, so $r(w) \to 0$; while the second term, which will dominate when $\Lambda \gg \lambda$, will tend to pull toward $u = \mathbf{1}$, so $\rho \to \hat\rho$ and hence $K \to \hat K$. Applying first-order optimality conditions we see that these opposing influences cancel out at the optimum $u = u^\star(w)$:

$$\nabla_u \rho(u^\star; w) = -|w|^{\odot 2} \odot (u^\star)^{\odot -2} + 2\tfrac{\Lambda}{\lambda} H_\gamma (u^\star - \mathbf{1}) = \mathbf{0} \quad (21)$$

As shown in the supplementary material:

- The regularisation function $r$ is convex and has gradient $\nabla_w r(w) = 2\, w \odot (u^\star(w))^{\odot -1}$.
- The optimal density vector satisfies $u^\star(w) \geq \mathbf{1}$.

The convexity of the regulariser $r$ means that, if the loss function $\ell$ is convex, then so too is the tuned regularised empirical risk $\tilde Q_N$, which is helpful for training.

Table 2: Summary of convergence characteristics for meta-kernels with polynomially and exponentially decaying eigenspectra. In this table $\hat\sigma_0 \geq \hat\sigma_1 \geq \ldots \geq \hat\sigma_{D-1}$ are the eigenvalues of the random-feature Gram matrix $\Gamma$ for meta-kernel $\kappa$; $0 < \epsilon \leq 1$; $\nu \geq 1$ and $c \geq 1$ are constants characterising $\kappa$; $\varphi = \tfrac{1}{10}$ if $\ell$ is Lipschitz and $\varphi = \tfrac{1}{6}$ if $\ell$ is quadratic; and the hyper-parameter recommendations are designed to ensure uniform convergence as $N, D \to \infty$.

| | Polynomial $\Gamma$-spectrum: $\hat\sigma_i = O(D i^{-\nu})$ | Exponential $\Gamma$-spectrum: $\hat\sigma_i = O(D e^{-ci})$ |
|---|---|---|
| Parameter $\lambda$ | $\lambda = \Omega(N^{\epsilon - \varphi})$, $\lambda = O(N^{-\epsilon})$, $\lambda = O(D^{-\frac{1}{2}(\nu+1)})$ | $\lambda = \Omega(N^{\epsilon - \varphi})$, $\lambda = O(N^{-\epsilon})$, $\lambda = O(e^{-\frac{c}{2}D})$ |
| Parameter $\Lambda$ | $\Lambda = \Omega(N^{\epsilon - \varphi})$, $\Lambda = O(N^{-\epsilon})$ | $\Lambda = \Omega(N^{\epsilon - \varphi})$, $\Lambda = O(N^{-\epsilon})$ |
| Parameter $\gamma$ | $\gamma = \Omega(D^{\epsilon - 1})$, $\gamma = O(D^{-\epsilon})$ | $\gamma = \Omega(D^{\epsilon - 1})$, $\gamma = O(D^{-\epsilon})$ |

4.2 Convergence and Complexity

While we formulate TRF as a combination of regularised empirical risk minimisation and kernel learning for a given dataset $\mathbb{D}$, the underlying goal remains to minimise the actual risk on the distribution $\mathcal{S}$, where $\mathbb{D} \sim \mathcal{S}^N$. Precisely, defining (using the notation of Figure 3):

$$\tilde f^\star = \operatorname*{argmin}_{\tilde f \in \mathbb{W}} \tilde Q_N(\tilde f), \qquad f^\star = \operatorname*{argmin}_{f \in \mathcal{H}_K \oplus \mathbb{C}} Q_S(f)$$

where:

$$\mathbb{W} = \big\{ w^* \hat z(\cdot) + b \,\big|\, w \in \mathbb{C}^D,\ b \in \mathbb{C} \big\}$$

we aim to show that $\tilde f^\star \to f^\star$ with high probability as $N, D \to \infty$ (i.e. uniform convergence (Menon and Williamson 2018; Bartlett and Mendelson 2002)). To this end, as illustrated in Figure 3, we may split the analysis of TRF along two axes, specifically the asymptotic behaviour as $N \to \infty$ and the asymptotic behaviour as $D \to \infty$. Then:

1. We show that $\tilde Q_N(\tilde f) \to \tilde Q_S(\tilde f)$ with high probability in the limit $N \to \infty$ for all $D$, $\tilde f \in \mathbb{W}$.
2. We characterise the schedule of hyper-parameters $\lambda, \Lambda, \gamma$ with respect to $N, D$ so that this convergence is guaranteed and $\lambda \to 0$ as $N \to \infty$ for all $D$, $\tilde f \in \mathbb{W}$.
3. We show that $\mathbb{W} \to \mathcal{H}_{\hat K} \oplus \mathbb{C}$ as $D \to \infty$.

Combining these results, we find that $\tilde f^\star \to f^\star$ with high probability as $N, D \to \infty$: first, as $N \to \infty$, TRF converges to an unregularised RFF approximation of actual risk minimisation; then, as $D \to \infty$, $\mathbb{W} \to \mathcal{H}_{\hat K} \oplus \mathbb{C}$ and, noting that $\tilde Q_S = Q_S$ (each is independent of the range of $\tilde f$ or $f$), we obtain the desired result.

4.3 Convergence as N → ∞

In this section we derive a bound on the Rademacher complexity $\mathcal{R}_N(\mathbb{W})$ of (the hypothesis space $\mathbb{W}$ of) the TRF formulation and derive schedules for $\lambda$ and $\Lambda$ such that $\lambda \to 0$ and $\mathcal{R}_N(\mathbb{W}) \to 0$ as $N \to \infty$.
Given this, it is then trivial to use standard Rademacher-complexity-based uniform convergence bounds (Menon and Williamson 2018; Bartlett and Mendelson 2002) to guarantee that $\tilde Q_N(\tilde f) \to \tilde Q_S(\tilde f)$ with high probability for a variety of loss-function families.

Figure 3: Connection between TRF (top left), tuned regularised risk minimisation (top right), TRF in the limit $N \to \infty$ (bottom left) and actual risk minimisation (bottom right). Top left (TRF): $\tilde f \in \mathbb{W} = \{w^* \hat z(\cdot) + b \mid w \in \mathbb{C}^D, b \in \mathbb{C}\}$ minimising $\tilde Q_N(\tilde f) = \tilde L_N(\tilde f) + \frac{\lambda}{2} r(\tilde f)$, where $\tilde L_N(\tilde f) = \frac{1}{N}\sum_i \ell(\tilde f(x_i), y_i)$ and $r(\tilde f) = \min_{u \in \mathbb{R}^D_+} w^*(u^{\odot -1} \odot w) + \frac{\Lambda}{\lambda}(u - \mathbf{1})^\top H_\gamma (u - \mathbf{1})$. Top right (regularised ERM, the $D \to \infty$ limit): $f \in \mathcal{H}_K \oplus \mathbb{C}$ minimising $Q_N(f) = L_N(f) + \frac{\lambda}{2} r(\eta)$, where $L_N(f) = \frac{1}{N}\sum_i \ell(f(x_i), y_i)$ and $r(\eta) = \min_{\mu \in \mathcal{H}_\kappa^+} \|\eta\|^2_{\hat\rho/\mu} + \frac{\Lambda}{\lambda}\|\mu - \hat\mu\|^2_{\mathcal{H}_\kappa}$. Bottom left (TRF$_S$, the $N \to \infty$ limit): $\tilde f \in \mathbb{W}$ minimising $\tilde Q_S(\tilde f) = \tilde L_S(\tilde f) = \mathbb{E}_{(x,y) \sim \mathcal{S}}\, \ell(\tilde f(x), y)$. Bottom right (actual risk minimisation, the $N, D \to \infty$ limit): $f \in \mathcal{H}_K \oplus \mathbb{C}$ minimising $Q_S(f) = L_S(f) = \mathbb{E}_{(x,y) \sim \mathcal{S}}\, \ell(f(x), y)$.

We require the following assumptions:

1. Either $\ell$ is $L$-Lipschitz and sub-differentiable, or $\ell(\tilde y, y) = \frac{1}{2}(\tilde y - y)^2$ is quadratic and $\mathcal{Y} = [-m, m] \subset \mathbb{R}$.
2. $\lambda \Lambda\, \hat\sigma_{\gamma,\min}^{-1/2} \leq C < \infty$ as $D \to \infty$, where $\hat\sigma_{\gamma,\min}$ is the minimum eigenvalue of $H_\gamma$.

A detailed derivation of our complexity bound is given in the supplementary material and summarised here. As a first step, we show that $\mathbb{W} \subseteq \{w^* \hat z(\cdot) + b : r(w) \leq R\}$, where (see Theorems 11 and 12 in the supplementary material):

$$R < a_D \lambda^{-p_a} + b_D \lambda^{-p_b} + c_D \lambda^{-p_c}$$

for positive, finite-valued sequences $a_D, b_D, c_D$ and positive exponents $p_a < p_b < p_c \leq 5$ in the case where $\ell$ is Lipschitz, and $p_a < p_b < p_c \leq 3$ if $\ell$ is quadratic; and moreover the Rademacher complexity of $\mathbb{W}$ is bounded as (supplementary material, Theorem 7):

$$\mathcal{R}_N(\mathbb{W}) \leq \sqrt{R/N} + e_D R/\sqrt{N}$$

where $e_D$ is a positive, finite-valued sequence. Hence $\mathcal{R}_N(\mathbb{W}) \to 0$ as $N \to \infty$ if $\lambda, \Lambda = \Omega(N^{\epsilon - \varphi})$ (we require $\Lambda/\lambda \leq E < \infty$ as $N \to \infty$) for $0 < \epsilon \leq 1$, with $\varphi = \frac{1}{10}$ if $\ell$ is Lipschitz and $\varphi = \frac{1}{6}$ if $\ell$ is quadratic.

The proviso $\lambda \Lambda\, \hat\sigma_{\gamma,\min}^{-1/2} \leq C < \infty$ as $D \to \infty$ leads us to analyse the eigenspectrum of $H_\gamma$ (supplementary material, Lemma 5). Denoting the eigenvalues of the random-feature Gram matrix $\Gamma$ by $\hat\sigma_0 \geq \hat\sigma_1 \geq \ldots \geq \hat\sigma_{D-1}$, we show that:

$$\hat\sigma_{\gamma,\min}^{-1} \leq D \begin{cases} 4\hat\sigma_0/D & \text{if } \gamma \leq \gamma^\star \\ \hat\sigma_{D-1}/D + 2\gamma + \gamma^2/(\hat\sigma_{D-1}/D) & \text{if } \gamma \geq \gamma^\star \end{cases}$$

where $\gamma^\star = \frac{1}{D}(\hat\sigma_0 \hat\sigma_{D-1})^{1/2}$. Then, for example, for meta-kernels with polynomially decaying eigenvalues we have $\hat\sigma_i = O(D i^{-\nu})$ for some $\nu \geq 1$,² so, in this case, $\lambda \Lambda\, \hat\sigma_{\gamma,\min}^{-1/2} \leq C < \infty$ as $D \to \infty$ if $\Lambda = O(1)$ (w.r.t. $D$) and $\lambda = O(1/\sqrt{D})$ if $\gamma \leq \gamma^\star$, $\lambda = O(1/\sqrt{D^{\nu+1}})$ otherwise. Finally, we note that $\gamma^\star = O(D^{-\frac{1}{2}(\nu+1)})$. We will show in our analysis of the $D \to \infty$ limit that $\gamma = \Omega(1/D)$ is necessary to ensure convergence $\tilde Q_N \to Q_N$, so in most cases $\gamma > \gamma^\star$ for large $D$, and hence $\lambda = O(1/\sqrt{D^{\nu+1}})$ is required to ensure uniform convergence. See Table 2 for a summary of the scheduling requirements for the hyper-parameters $\lambda, \Lambda, \gamma$.

² For example, all $\nu$-times continuously differentiable meta-kernels have polynomially decaying eigenvalues (Wathen and Zhu 2015, Theorem 1), so this includes the RBF kernel, the Matérn kernel of order $\frac{3}{2}$, etc. See (Braun 2005; Williamson, Smola, and Schölkopf 2001; Wathen and Zhu 2015) for further discussion. In the supplementary material we also give results for meta-kernels with exponentially decaying eigenspectra.

4.4 Convergence in µ and µ̂

The role of $\gamma$ in the TRF formulation is to regularise the approximation of $\mu$ and $\hat\mu$, defined by (13), via the regularised least-squares approximations $\tilde\mu$ and $\tilde{\hat\mu}$, defined by (14). Thus the scheduling of $\gamma$ as $D \to \infty$ is pivotal in determining whether $\tilde\mu \to \mu$ and $\tilde{\hat\mu} \to \hat\mu$ uniformly as $D \to \infty$. For regularised least-squares it is well known that $\gamma = \Omega(D^{\epsilon - 1})$, $\gamma = O(D^{-\epsilon})$, where $0 < \epsilon \leq 1$, is sufficient to ensure $\tilde\mu \to \mu$ and $\tilde{\hat\mu} \to \hat\mu$ with high probability as $D \to \infty$. As noted previously, this implies that $\gamma > \gamma^\star$ as $D \to \infty$.
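The behaviour described in Section 4.4 can be pictured with a small numerical sketch of the regularised least-squares fit (14), written in the equivalent dual (kernel ridge) form so that the explicit Mercer feature map $\phi_\kappa$ is not needed. The "true" density ratio `mu` below is a known toy function chosen so the fit quality can be measured directly (in TRF itself $\mu$ is only observed through $u$), the meta-kernel is assumed to be an RBF, and the exponent in the $\gamma$ schedule is an arbitrary admissible choice.

```python
# Sketch of the regularised least-squares approximation (14) of mu, in dual
# (kernel ridge) form: mu_tilde(omega) = sum_i alpha_i kappa(omega, omega_hat_i)
# with alpha = (Gamma + gamma*D*I)^{-1} u, and gamma scheduled as in Section 4.4.
import numpy as np

rng = np.random.default_rng(2)
d = 2
mu = lambda w: 1.0 + np.exp(-np.sum(w ** 2, axis=-1))     # toy "true" density ratio

def fit_error(D, eps=0.5):
    gamma = D ** (eps - 1.0)                  # gamma = Omega(D^(eps-1)) and O(D^-eps)
    omega_hat = rng.standard_normal((D, d))   # omega_hat_i ~ rho_hat = N(0, I)
    u = mu(omega_hat)                         # noiseless targets u_i = mu(omega_hat_i)
    sq = np.sum((omega_hat[:, None] - omega_hat[None, :]) ** 2, axis=-1)
    Gamma = np.exp(-0.5 * sq)                 # RBF meta-kernel Gram matrix
    alpha = np.linalg.solve(Gamma + gamma * D * np.eye(D), u)
    # Measure the rho_hat-weighted squared error of mu_tilde on fresh test frequencies.
    omega_test = rng.standard_normal((500, d))
    k_test = np.exp(-0.5 * np.sum((omega_test[:, None] - omega_hat[None, :]) ** 2, axis=-1))
    return np.mean((k_test @ alpha - mu(omega_test)) ** 2)

for D in (50, 200, 800):
    print(D, fit_error(D))    # the error of mu_tilde should shrink as D grows
```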
Thus, ignoring the inner-product (which is different for HK and H ˆ K), HK and H ˆ K actually represent the same set of functions: HK = n η ( ) , ξ ( , x) ρ o = D ρ( ) ˆρ( )η ( ) , ξ ( , x) E = n ˆη ( ) , ξ ( , x) ˆρ o = H ˆ K kernels have polynomially decaying eigenvalues (Wathen and Zhu 2015, Theorem 1), so this includes the RBF kernel, Mat ern kernel of order 3 2 etc. See results in (Braun 2005; Williamson, Smola, and Sch olkopf 2001; Wathen and Zhu 2015) for further discussion. In the supplementary we also give results for meta-kernels with exponentially decaying eigenspectra. so it suffices to show that W H ˆ K C. Moreover, recalling that HK ˆ = {w ˆz( )|w RD} with feature map ˆz: W = w ˆz( ) + b = HK ˆ C so it suffices to show that K ˆ ˆK as D . The convergence properties of RFF kernel approximations has been widely studied (Li et al. 2019b; Rahimi and Recht 2006; Liu et al. 2020; Yang, Sindhwanim, and Mahoney 2014). In particular, (Sutherland and Schneider 2015) present a range of convergence bounds showing that K ˆ ˆK 0 with high probability as D , any one of which suffices to show that W HK C with high probability as D , and hence, when combined with our previous analysis, that f f with high probability as N, D so long as λ, Λ and γ are scheduled appropriately (see Table 2). 5 Training Considerations In this section we consider the question of training. Recall that the regulariser r may be rewritten as r(w) = ρ(u (w); w) as per (19)-(20). In section 4.1 we showed that u (w) 1 for all w, which allows us to re-write our optimisation problem - that is, minimising QN as defined by (17)- (18) - as a constrained minimisation problem including u: argmin w,u,b:u 1 TN = 1 N P i ℓ w ˆz (xi) + b, yi + . . . 2 w diag (u) 1 w + Λ 2 (u 1)T Hγ (u 1) (22) the gradients (assuming ℓis differentiable - more generally we may use subgradient methods) of which are: N P i ℓ w ˆz(xi) + b, yi ˆz(xi)+λu 1 w b TN = 1 N P i ℓ w ˆz(xi) + b, yi u TN =ΛHγ (u 1) λ 2 |w| 2 u 2 where we denote by ℓ the gradient of ℓwith respect to its first argument. As u 1 we see that all three gradients are welldefined and well-behaved, making gradient-based approaches well-suited to the task of minimising TN. We have chosen to use Adam (Kingma and Ba 2015) in our experiments, with an additional clipping step after each iteration to enforce the constraint u 1. An alternative approach is to minimise (17) with a loss-specific algorithm (e.g. Pegasos (Shalev Shwartz, Singer, and Srebro 2007; Menon and Williamson 2018; Jumutc and Suykens 2013) if ℓis a hinge loss) and minimise (18) at each step as an inner loop using a gradientbased approach (or use bi-quadratic optimisation). However our initial experiments showed that a simple, single-layer gradient-based approach was significantly faster and had a more predictable running time, so we focus on it exclusively. 6 Experimental Results In this section we present experimental results for TRF applied to classification and regression problems for small and medium sized datasets. For classification we use hinge loss ℓ( y, y) = (1 yy)+, and for regression ℓ( y, y) = 1 2( y y)2. Experiments were run on a Ubuntu 4.15 server with 72 x86 64 cores and 754 GB of memory running SVMHeavy 8. Dataset SVM-RBF SVM-MKL FKL TRF Biodeg 13.9(2.8) 12.3(2.1) 12.0(2.3) 11.2(3.5) Car 1.15(0.4) 0.81(0.2) 0.75(0.1) 0.73(0.1) Contra. 40.1(4.8) 31.6(4.8) 35.0(3.7) 29.5(3.1) Fertility 9(4.2) 9(4.2) 9(4.2) 9(4.2) Ionosph. 
6 Experimental Results

In this section we present experimental results for TRF applied to classification and regression problems on small and medium-sized datasets. For classification we use the hinge loss $\ell(\tilde y, y) = (1 - \tilde y y)_+$, and for regression the squared loss $\ell(\tilde y, y) = \frac{1}{2}(\tilde y - y)^2$. Experiments were run on an Ubuntu server (Linux 4.15) with 72 x86_64 cores and 754 GB of memory running SVMHeavy 8.

For our TRF method we use an RBF meta-kernel $\kappa$ with length-scale $l$, and an RBF reference kernel $\hat K$ with fixed length-scale 1. Hyper-parameters were selected to minimise 10-fold cross-validation error on the training set, with $\lambda, \Lambda, \gamma, l \in [0.01, 100]$, using Bayesian optimisation with the GP-UCB acquisition function (Srinivas et al. 2012) and a budget of 105 evaluations (5 in the initial random set). This was repeated for $D = 50, 100, 200, 400, 800, 1600$ random features to find the $D$ that minimises 10-fold cross-validation error.

Our baselines were the SVM ($\epsilon$-SV regression and $C$-SV classification) with RBF and MKL kernels, where $K_{\mathrm{MKL}} = m_0 K_{\mathrm{RBF}} + m_1 K_{\mathrm{MAT}} + m_2 K_{\mathrm{poly}}$ ($K_{\mathrm{RBF}}$, $K_{\mathrm{MAT}}$ and $K_{\mathrm{poly}}$ are RBF, $\frac{3}{2}$-Matérn and 2nd-order polynomial kernels, respectively); and FKL (Băzăvan, Li, and Sminchisescu 2012). All hyper-parameters, including $C$, $\epsilon$ (for regression), $m_0, m_1, m_2$, and the length-scales for $K_{\mathrm{RBF}}$ and $K_{\mathrm{MAT}}$, were chosen using Bayesian optimisation with the GP-UCB acquisition function to minimise 10-fold cross-validation error on the training set.

All datasets are taken from the UCI repository (Dua and Graff 2017), normalised so that $x_i$ lies in the unit hypersphere, and split randomly into 80% training and 20% testing. All experiments were repeated 5 times to generate error bars. Results for regression and classification are shown in Table 4. Note that our method outperforms the baselines in most cases.

Table 4: Classification (top, misclassification error %) and regression (bottom, RMSE) results on a 20% test set with 5 repeats. TRF is our method, SVM-RBF is the SVM with RBF kernel, and SVM-MKL is the SVM with MKL kernel.

| Dataset | SVM-RBF | SVM-MKL | FKL | TRF |
|---|---|---|---|---|
| Biodeg | 13.9 (2.8) | 12.3 (2.1) | 12.0 (2.3) | 11.2 (3.5) |
| Car | 1.15 (0.4) | 0.81 (0.2) | 0.75 (0.1) | 0.73 (0.1) |
| Contra. | 40.1 (4.8) | 31.6 (4.8) | 35.0 (3.7) | 29.5 (3.1) |
| Fertility | 9 (4.2) | 9 (4.2) | 9 (4.2) | 9 (4.2) |
| Ionosph. | 5.92 (2.5) | 4.22 (1.4) | 3.99 (1.6) | 14.7 (7.4) |
| Sonar | 27.1 (16) | 20.5 (2.7) | 21.1 (4.5) | 15.5 (3.2) |
| a8a | 15.7 (1.0) | 13.2 (2.4) | 12.9 (2.7) | 12.6 (2.7) |

| Dataset | SVM-RBF | SVM-MKL | FKL | TRF |
|---|---|---|---|---|
| Airfoil | 0.48 (0.30) | - | 0.46 (0.02) | 0.43 (0.02) |
| Auto | 0.18 (0.04) | 0.17 (0.04) | 0.17 (0.03) | 0.16 (0.03) |
| Boston | 0.43 (0.13) | 0.38 (0.07) | 0.38 (0.13) | 0.38 (0.14) |
| Slump | 0.028 (0.02) | 0.020 (0.01) | 0.20 (0.01) | 0.018 (0.01) |
| Yacht | 0.17 (0.07) | 0.05 (0.03) | 0.07 (0.06) | 0.16 (0.12) |

7 Conclusion

We have introduced the tuned random features (TRF) algorithm, which combines kernel learning and regularised risk minimisation in the spectral domain, allowing the kernel to be selected automatically from the set of all translation-invariant kernels. We have shown that TRF training may be done via a simple gradient-based approach on a convex objective. We have also analysed the convergence properties of TRF as $N, D \to \infty$ and, using Rademacher complexity analysis, proved that TRF converges uniformly in the limit. Finally, we have demonstrated the effectiveness of TRF on a range of real regression and classification datasets.

Acknowledgments

This research was partially funded by the Australian Government through the Australian Research Council (ARC). Prof. Venkatesh is the recipient of an ARC Australian Laureate Fellowship (FL170100006).

References

Avron, H.; Kapralov, M.; Musco, C.; Musco, C.; Velingker, A.; and Zandieh, A. 2017. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In International Conference on Machine Learning, 253-262.

Avron, H.; Sindhwani, V.; Yang, J.; and Mahoney, M. W. 2016. Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels. Journal of Machine Learning Research, 17(120): 1-38.

Bach, F. 2017. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19): 1-53.

Bartlett, P. L.; and Mendelson, S. 2002. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3: 463-482.
Băzăvan, E. G.; Li, F.; and Sminchisescu, C. 2012. Fourier kernel learning. In European Conference on Computer Vision, 459-473. Springer.

Bochner, S. 1932. Vorlesungen über Fouriersche Integrale. Leipzig.

Braun, M. L. 2005. Spectral properties of the kernel matrix and their relation to kernel methods in machine learning. Ph.D. thesis, Universitäts- und Landesbibliothek Bonn.

Bullins, B.; Zhang, C.; and Zhang, Y. 2017. Not-so-random features. arXiv preprint arXiv:1710.10230.

Chang, W.-C.; Li, C.-L.; Yang, Y.; and Poczos, B. 2017. Data-driven random Fourier features using Stein effect. arXiv preprint arXiv:1705.08525.

Cortes, C.; Mohri, M.; and Rostamizadeh, A. 2012. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13: 795-828.

Cristianini, N.; and Shawe-Taylor, J. 2005. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, UK: Cambridge University Press.

Cristianini, N.; Shawe-Taylor, J.; Elisseeff, A.; and Kandola, J. 2002. On kernel-target alignment. In Advances in Neural Information Processing Systems, 367-373.

Dua, D.; and Graff, C. 2017. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, http://archive.ics.uci.edu/ml.

Genton, M. G. 2001. Classes of Kernels for Machine Learning: A Statistics Perspective. Journal of Machine Learning Research, 2: 299-312.

Geršgorin, S. 1931. Über die Abgrenzung der Eigenwerte einer Matrix. Bulletin de l'Académie des Sciences de l'URSS, Classe des sciences mathématiques et naturelles, 749-754.

Gönen, M.; and Alpaydin, E. 2011. Multiple Kernel Learning Algorithms. Journal of Machine Learning Research, 12: 2211-2268.

Hastie, T. J.; Tibshirani, R. J.; and Friedman, J. H. 2001. The Elements of Statistical Learning: Data Mining, Inference and Prediction. New York: Springer.

Herbrich, R. 2002. Learning Kernel Classifiers: Theory and Algorithms. MIT Press.

Horn, R. A.; and Johnson, C. R. 2013. Matrix Analysis. Cambridge University Press, 2nd edition.

Jumutc, V.; and Suykens, J. A. K. 2013. Weighted coordinate-wise Pegasos. In Proc. of the 5th International Conference on Pattern Recognition and Machine Intelligence, volume 8251, 262-269.

Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In Bengio, Y.; and LeCun, Y., eds., 3rd International Conference on Learning Representations, ICLR 2015.

Li, C.-L.; Chang, W.-C.; Mroueh, Y.; Yang, Y.; and Poczos, B. 2019a. Implicit Kernel Learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2007-2016.

Li, Z.; Ton, J.-F.; Oglic, D.; and Sejdinovic, D. 2019b. Towards a unified analysis of random Fourier features. In Proceedings of the 36th International Conference on Machine Learning, 3905-3914.

Liu, F.; Huang, X.; Chen, Y.; and Suykens, J. A. K. 2020. Random features for kernel approximation: A survey in algorithms, theory, and beyond. arXiv preprint arXiv:2004.11154.

Menon, A. K.; and Williamson, R. C. 2018. The cost of fairness in binary classification. In Conference on Fairness, Accountability and Transparency, 107-118. PMLR.

Ong, C. S.; Smola, A. J.; and Williamson, R. C. 2005. Learning the Kernel with Hyperkernels. Journal of Machine Learning Research, 6: 1043-1071.

Ong, C. S.; Williamson, R. C.; and Smola, A. J. 2003. Hyperkernels. In Advances in Neural Information Processing Systems, 495-502.
Rahimi, A.; and Recht, B. 2006. Random Features for Large-Scale Kernel Machines. In Platt, J. C.; Koller, D.; Singer, Y.; and Roweis, S. T., eds., Advances in Neural Information Processing Systems 20, 1177-1184.

Rasmussen, C. E.; and Williams, C. K. I. 2006. Gaussian Processes for Machine Learning. MIT Press.

Schölkopf, B.; and Smola, A. J. 2001. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, Massachusetts: MIT Press. ISBN 0262194759.

Shalev-Shwartz, S.; Singer, Y.; and Srebro, N. 2007. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), 807-814.

Shawe-Taylor, J.; and Cristianini, N. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.

Sinha, A.; and Duchi, J. C. 2016. Learning kernels with random features. In Proceedings of NIPS.

Srinivas, N.; Krause, A.; Kakade, S. M.; and Seeger, M. W. 2012. Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting. IEEE Transactions on Information Theory, 58(5): 3250-3265.

Steinwart, I.; and Christmann, A. 2008. Support Vector Machines. Springer.

Sutherland, D. J.; and Schneider, J. 2015. On the error of random Fourier features. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, 862-871.

Suykens, J. A. K.; Van Gestel, T.; De Brabanter, J.; De Moor, B.; and Vandewalle, J. 2002. Least Squares Support Vector Machines. New Jersey: World Scientific Publishing.

Vapnik, V. 1995. Statistical Learning Theory. New York: Springer-Verlag.

Wathen, A. J.; and Zhu, S. 2015. On spectral distribution of kernel matrices related to radial basis functions. Numerical Algorithms, 70: 709-726.

Williams, C.; and Shawe-Taylor, J. 2003. The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum. In Becker, S.; Thrun, S.; and Obermayer, K., eds., Advances in Neural Information Processing Systems 15 (NIPS 2002), 383-390. MIT Press.

Williamson, R. C.; Smola, A. J.; and Schölkopf, B. 2001. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. IEEE Transactions on Information Theory, 47(6): 2516-2532.

Wilson, A. G.; and Adams, R. P. 2013. Gaussian process kernels for pattern discovery and extrapolation. In Proceedings of the International Conference on Machine Learning, 1067-1075.

Yang, J.; Sindhwani, V.; Avron, H.; and Mahoney, M. 2014. Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels. In Proceedings of the 31st International Conference on Machine Learning, volume 32, 485-493.

Yang, Z.; Wilson, A.; Smola, A. J.; and Song, L. 2015. A la carte - learning fast kernels. In Artificial Intelligence and Statistics, 1098-1106.

Yu, F. X.; Kumar, S.; Rowley, H.; and Chang, S. F. 2015. Compact nonlinear maps and circulant extensions. arXiv preprint arXiv:1503.03893.

Zhu, H.; Williams, C. K. I.; Rohwer, R. J.; and Morciniec, M. 1998. Gaussian regression and optimal finite dimensional linear models. Neural Networks and Machine Learning.