Published as a conference paper at ICLR 2018

NOT-SO-RANDOM FEATURES

Brian Bullins, Cyril Zhang, Yi Zhang
Department of Computer Science
Princeton University
Princeton, NJ 08544, USA
{bbullins, cyril.zhang, y.zhang}@cs.princeton.edu

ABSTRACT

We propose a principled method for kernel learning, which relies on a Fourier-analytic characterization of translation-invariant or rotation-invariant kernels. Our method produces a sequence of feature maps, iteratively refining the SVM margin. We provide rigorous guarantees for optimality and generalization, interpreting our algorithm as online equilibrium-finding dynamics in a certain two-player min-max game. Evaluations on synthetic and real-world datasets demonstrate scalability and consistent improvements over related random features-based methods.

1 INTRODUCTION

Choosing the right kernel is a classic question that has riddled machine learning practitioners and theorists alike. Conventional wisdom instructs the user to select a kernel which captures the structure and geometric invariances in the data. Efforts to formulate this principle have inspired vibrant areas of study, going by names from feature selection to multiple kernel learning (MKL).

We present a new, principled approach for selecting a translation-invariant or rotation-invariant kernel to maximize the SVM classification margin. We first describe a kernel-alignment subroutine, which finds a peak in the Fourier transform of an adversarially chosen, data-dependent measure. Then, we define an iterative procedure that produces a sequence of feature maps, progressively improving the margin. The resulting algorithm is strikingly simple and scalable.

Intriguingly, our analysis interprets the main algorithm as no-regret learning dynamics in a zero-sum min-max game, whose value is the classification margin. Thus, we are able to quantify convergence guarantees towards the largest margin realizable by a kernel with the assumed invariance. Finally, we exhibit experiments on synthetic and benchmark datasets, demonstrating consistent improvements over related random features-based kernel methods.

1.1 RELATED WORK

There is a vast literature on MKL, from which we use the key concept of kernel alignment (Cristianini et al., 2002). Otherwise, our work bears few similarities to traditional MKL; this and much related work (e.g. Cortes et al. (2012); Gönen & Alpaydın (2011); Lanckriet et al. (2004)) are concerned with selecting a kernel by combining a collection of base kernels, chosen beforehand. Our method allows for greater expressivity and even better generalization guarantees.

Instead, we take inspiration from the method of random features (Rahimi & Recht, 2007). In this pioneering work, originally motivated by scalability, feature maps are sampled according to the Fourier transform of a chosen kernel. The idea of optimizing a kernel in random feature space was studied by Sinha & Duchi (2016). In this work, which is most similar to ours, kernel alignment is optimized via importance sampling on a fixed, finitely supported proposal measure. However, the proposal can fail to contain informative features, especially in high dimension; indeed, they highlight efficiency rather than showing performance improvements over RBF features. Learning a kernel in the Fourier domain (without the primal feature maps) has also been considered previously: Oliva et al. (2016) and Yang et al.
(2015b) model the Fourier spectrum parametrically, which limits expressivity; the former also require complicated posterior inference procedures. Bâzâvan et al. (2012) study learning a kernel in the Fourier domain jointly with regression parameters. They show experimentally that this locates informative frequencies in the data, without theoretical guarantees. Our visualizations suggest this approach can get stuck in poor local minima, even in 2 dimensions. Crammer et al. (2003) also use boosting to build a kernel sequentially; however, they only consider a basis of linear feature maps, and require costly generalized eigenvector computations.

From a statistical view, Fukumizu et al. (2009) bound the SVM margin in terms of maximum mean discrepancy, which is equivalent to (unweighted) kernel alignment. Notably, their bound can be loose if the number of support vectors is small; in such situations, our theory provides a tighter characterization. Moreover, our attention to the margin goes beyond the usual objective of kernel alignment.

1.2 OUR CONTRIBUTIONS

We present an algorithm that outputs a sequence of Fourier features, converging to the maximum realizable SVM classification margin on a labeled dataset. At each iteration, a pair of features is produced, which maximizes kernel alignment with a changing, adversarially chosen measure. As this measure changes slowly, the algorithm builds a diverse and informative feature representation. Our main theorem can be seen as a case of von Neumann's min-max theorem for a zero-sum concave-linear game; indeed, our method bears a deep connection to boosting (Freund & Schapire, 1996; Schapire, 1999). In particular, both the theory and empirical evidence suggest that the generalization error of our method decreases as the number of random features increases. In traditional MKL methods, generalization bounds worsen as base kernels are added.

Other methods in the framework of Fourier random features take the approach of approximating a kernel by sampling feature maps from a continuous distribution. In contrast, our method constructs a measure with small finite support, and realizes the kernel exactly by enumerating the associated finite-dimensional feature map; there is no randomness in the features.[1]

[1] We note a superficial resemblance to quasi-Monte Carlo methods (Avron et al., 2016); however, these are concerned with accelerating convergence rates rather than learning a kernel.

2 PRELIMINARIES

2.1 THE FOURIER DUAL MEASURE OF A KERNEL

We focus on two natural families of kernels k(x, x′): translation-invariant kernels on X = R^d, which depend only on x − x′, and rotation-invariant kernels on the hypersphere X = S^{d−1}, which depend only on ⟨x, x′⟩. These invariance assumptions subsume most widely-used classes of kernels; notably, the Gaussian (RBF) kernel satisfies both. For the former invariance, Bochner's theorem provides a Fourier-analytic characterization:

Theorem 2.1 (e.g. Eq. (1), Sec. 1.4.3 in Rudin (2011)). A translation-invariant continuous function k : R^d × R^d → R is positive semidefinite if and only if k(x − x′) is the Fourier transform of a symmetric non-negative measure (where the Fourier domain is Ω = R^d). That is,

k(x - x') = \int_\Omega \lambda(\omega)\, e^{i\langle\omega,\, x - x'\rangle}\, d\omega = \int_\Omega \lambda(\omega)\, e^{i\langle\omega, x\rangle}\, \overline{e^{i\langle\omega, x'\rangle}}\, d\omega,    (1)

for some λ ∈ L¹(Ω), satisfying λ(ω) ≥ 0 and λ(ω) = λ(−ω) for all ω ∈ Ω.

A similar characterization is available for rotation-invariant kernels, where the Fourier basis functions are the spherical harmonics, a countably infinite family of complex polynomials which form an orthonormal basis for square-integrable functions X → C.
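Before turning to the rotation-invariant case in detail, the correspondence in Theorem 2.1 can be illustrated with a short numerical check (separate from the method developed in this paper). For the Gaussian (RBF) kernel the dual measure is itself Gaussian, so sampling frequencies from it and averaging the induced features should recover the kernel value; the bandwidth sigma, the sample sizes, and the variable names below are arbitrary choices for this demo.

```python
import numpy as np

# Bochner's theorem for k(x - x') = exp(-||x - x'||^2 / (2 sigma^2)):
# its dual measure is Gaussian, so omega ~ N(0, I / sigma^2) and
# E[cos(<omega, x - x'>)] equals the kernel value.
rng = np.random.default_rng(0)
d, sigma, n_features = 5, 1.0, 20000

x, x_prime = rng.normal(size=d), rng.normal(size=d)
omegas = rng.normal(scale=1.0 / sigma, size=(n_features, d))  # draws from the dual measure

# Real feature pair {cos<omega, x>, sin<omega, x>} for each sampled frequency.
feats_x = np.concatenate([np.cos(omegas @ x), np.sin(omegas @ x)]) / np.sqrt(n_features)
feats_xp = np.concatenate([np.cos(omegas @ x_prime), np.sin(omegas @ x_prime)]) / np.sqrt(n_features)

approx = feats_x @ feats_xp                                   # Monte Carlo estimate of k(x, x')
exact = np.exp(-np.linalg.norm(x - x_prime) ** 2 / (2 * sigma ** 2))
print(approx, exact)                                          # the two values agree to ~1e-2
```

This is exactly the construction exploited by random Fourier features (Rahimi & Recht, 2007); the method developed below instead chooses the frequencies deliberately rather than sampling them.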
To unify notation, let Ω ⊆ N × Z be the set of valid index pairs ω = (ℓ, m), and let ω ↦ −ω denote a certain involution on Ω; we supply details and references in Appendix B.1.

Theorem 2.2. A rotation-invariant continuous function k : S^{d−1} × S^{d−1} → R is positive semidefinite if and only if it has a symmetric non-negative expansion into spherical harmonics Y^d_{ℓ,m}, i.e.

k(\langle x, x'\rangle) = \sum_{(\ell, m) \in \Omega} \lambda(\ell, m)\, Y^d_{\ell,m}(x)\, \overline{Y^d_{\ell,m}(x')},    (2)

for some λ ∈ L¹(Ω), with λ(ω) ≥ 0 and λ(ω) = λ(−ω) for all valid index pairs ω = (ℓ, m) ∈ Ω.

In each of these cases, we call this Fourier transform λ_k(ω) the dual measure of k. This measure decomposes k into a non-negative combination of Fourier basis kernels. Furthermore, this decomposition gives us a feature map φ : X → L²(Ω, λ_k) whose image realizes the kernel under the codomain's inner product;[2] that is, for all x, x′ ∈ X,

k(x, x') = \int_\Omega \lambda_k(\omega)\, \phi_x(\omega)\, \overline{\phi_{x'}(\omega)}\, d\omega.    (3)

Respectively, these feature maps are φ_x(ω) = e^{i⟨ω, x⟩} and φ_x(ℓ, m) = Y^d_{ℓ,m}(x). Although they are complex-valued, symmetry of λ_k allows us to apply the transformation {φ_x(ω), φ_x(−ω)} ↦ {Re φ_x(ω), Im φ_x(ω)} to yield real features, preserving the inner product. The analogous result holds for spherical harmonics.

[2] In fact, such a pair (φ, λ) exists for any psd kernel k; see Dai et al. (2014) or Devinatz (1953).

2.2 KERNEL ALIGNMENT

In a binary classification task with n training samples (x_i ∈ X, y_i ∈ {±1}), a widely-used quantity for measuring the quality of a kernel k : X × X → R is its alignment (Cristianini et al., 2002; Cortes et al., 2012),[3] defined by

\gamma_k(P, Q) \;\overset{\text{def}}{=}\; \sum_{i, j \in [n]} k(x_i, x_j)\, y_i y_j = y^\top G_k\, y,

where y is the vector of labels and G_k is the Gram matrix. Here, we let P = \sum_{i : y_i = 1} \delta_{x_i} and Q = \sum_{i : y_i = -1} \delta_{x_i} denote the (unnormalized) empirical measures of each class, where δ_x is the Dirac measure at x. When P, Q are arbitrary measures on X, this definition generalizes to

\gamma_k(P, Q) \;\overset{\text{def}}{=}\; \iint_{X^2} k(x, x')\, dP^2(x, x') + \iint_{X^2} k(x, x')\, dQ^2(x, x') - 2\iint_{X^2} k(x, x')\, dP(x)\, dQ(x').

[3] Definitions vary up to constants and normalization, depending on the use case.

In terms of the dual measure λ_k(ω), kernel alignment takes a useful alternate form, noted by Sriperumbudur et al. (2010). Let μ denote the signed measure P − Q. Then, when k is translation-invariant, we have

\gamma_k(P, Q) = \int_\Omega \lambda_k(\omega) \left| \int_X e^{i\langle\omega, x\rangle}\, d\mu(x) \right|^2 d\omega \;\overset{\text{def}}{=}\; \int_\Omega \lambda_k(\omega)\, v(\omega)\, d\omega.    (4)

Analogously, when k is rotation-invariant, we have

\gamma_k(P, Q) = \sum_{(\ell, m) \in \Omega} \lambda_k(\ell, m) \left| \int_X Y^d_{\ell,m}(x)\, d\mu(x) \right|^2 \;\overset{\text{def}}{=}\; \sum_{(\ell, m) \in \Omega} \lambda_k(\ell, m)\, v(\ell, m).    (5)

It can also be verified that ‖λ_k‖_{L¹(Ω)} = k(x, x), which is of course the same for all x ∈ X. In each case, the alignment is linear in λ_k. We call v(ω) the Fourier potential, which is the squared magnitude of the Fourier transform of the signed measure μ = P − Q. This function is clearly bounded pointwise by (P(X) + Q(X))².

3 ALGORITHMS

3.1 MAXIMIZING KERNEL ALIGNMENT IN THE FOURIER DOMAIN

First, we consider the problem of finding a kernel k (subject to either invariance) that maximizes the alignment γ_k(P, Q); we optimize the dual measure λ_k(ω). Aside from the non-negativity and symmetry constraints from Theorems 2.1 and 2.2, we constrain ‖λ_k‖₁ = 1, as this quantity appears as a normalization constant in our generalization bounds (see Theorem 4.3). Maximizing γ_k(P, Q) in
this constraint set, which we call L ⊆ L¹(Ω), takes the form of a linear program on an infinite-dimensional simplex. Noting that v(ω) = v(−ω) ≥ 0, γ_k is maximized by placing a Dirac mass at any pair of opposite modes ±ω*, where ω* ∈ argmax_ω v(ω).

At first, P and Q will be the empirical distributions of the classes, specified in Section 2.2. However, as Algorithm 2 proceeds, it will reweight each data point x_i in the measures by α(i). Explicitly, the reweighted Fourier potential takes the form[4]

v_\alpha(\omega) \;\overset{\text{def}}{=}\; \Big| \sum_{i=1}^n y_i \alpha_i\, e^{\iota\langle\omega, x_i\rangle} \Big|^2 \quad\text{or}\quad v_\alpha(\ell, m) \;\overset{\text{def}}{=}\; \Big| \sum_{i=1}^n y_i \alpha_i\, Y^d_{\ell,m}(x_i) \Big|^2.

[4] Whenever i is an index, we will denote the imaginary unit by ι.

Due to its non-convexity, maximizing v_α(ω), which can be interpreted as finding a global Fourier peak in the data, is theoretically challenging. However, we find that it is easy to find such peaks in our experiments, even in hundreds of dimensions. This arises from the empirical phenomenon that realistic data tend to be band-limited, a cornerstone hypothesis in data compression. An ℓ2 constraint (or, equivalently, ℓ2 regularization; see Kakade et al. (2009)) can be explicitly enforced to promote band-limitedness; we find that this is not necessary in practice.

When a gradient is available (in the translation-invariant case), we use Langevin dynamics (Algorithm 1) to find the peaks of v(ω), which enjoys mild theoretical hitting-time guarantees (see Theorem 4.5). See Appendix A.3 for a discussion of the (discrete) rotation-invariant case.

Algorithm 1 Langevin dynamics for kernel alignment
1: Input: training samples S = {(x_i, y_i)}_{i=1}^n, weights α ∈ R^n.
2: Parameters: time horizon τ, diffusion rate ζ, temperature ξ.
3: Initialize ω_0 arbitrarily.
4: for t = 0, . . . , τ − 1 do
5:   Update ω_{t+1} ← ω_t + ζ ∇v_α(ω_t) + √(2ζξ) · z, where z ∼ N(0, I).
6: end for
7: return ω* := argmax_t v_α(ω_t)

It is useful in practice to use parallel initialization, running m concurrent copies of the diffusion process and returning the best single ω encountered. This admits a very efficient GPU implementation: the multi-point evaluations v_α(ω_{1..m}) and ∇v_α(ω_{1..m}) can be computed from an (m, d)-by-(d, n) matrix product and pointwise trigonometric functions. We find that Algorithm 1 typically finds a reasonable peak within 100 steps.
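To make the best-response step concrete, the following minimal numpy sketch mirrors Algorithm 1 in the translation-invariant case, vectorized over m parallel chains as described above. The function names, the default parameter values, and the exact noise scaling are illustrative assumptions for this sketch, not the authors' reference implementation.

```python
import numpy as np

def potential_and_grad(W, X, w):
    """Evaluate v_alpha and its gradient at m frequencies at once.

    W: (m, d) candidate frequencies; X: (n, d) data; w = y * alpha: (n,) signed weights.
    v_alpha(omega) = |sum_i w_i e^{i<omega, x_i>}|^2 = C(omega)^2 + S(omega)^2.
    """
    proj = W @ X.T                              # (m, n) inner products <omega_j, x_i>
    C, S = np.cos(proj) @ w, np.sin(proj) @ w   # real and imaginary parts, per chain
    v = C ** 2 + S ** 2
    grad = 2.0 * ((S[:, None] * np.cos(proj) - C[:, None] * np.sin(proj)) * w[None, :]) @ X
    return v, grad                              # shapes (m,), (m, d)

def langevin_peak(X, y, alpha, m=500, steps=100, zeta=1e-2, xi=1e-2, seed=0):
    """Parallel Langevin ascent on v_alpha; returns the best frequency encountered."""
    rng = np.random.default_rng(seed)
    w = y * alpha
    W = rng.normal(size=(m, X.shape[1]))        # Gaussian parallel initialization
    best_w, best_v = None, -np.inf
    for _ in range(steps):
        v, grad = potential_and_grad(W, X, w)
        j = int(np.argmax(v))
        if v[j] > best_v:
            best_v, best_w = float(v[j]), W[j].copy()
        W = W + zeta * grad + np.sqrt(2 * zeta * xi) * rng.normal(size=W.shape)
    return best_w, best_v
```

The inner loop is exactly the (m, d)-by-(d, n) matrix product with pointwise trigonometric functions mentioned above, which is why the procedure maps so well onto a GPU.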
3.2 LEARNING THE MARGIN-MAXIMIZING KERNEL FOR SVM

Support vector machines (SVMs) are perhaps the most ubiquitous use case of kernels in practice. With this in mind, we propose a method that boosts Algorithm 1, building a kernel that maximizes the classification margin. Let k be a kernel with dual measure λ_k, and let Y := diag(y). Write the dual ℓ1-SVM objective,[5] parameterizing the kernel by λ_k:

F(\alpha, \lambda_k) \;\overset{\text{def}}{=}\; \mathbf{1}^\top \alpha - \tfrac{1}{2}\gamma_k(\alpha P, \alpha Q) = \mathbf{1}^\top \alpha - \tfrac{1}{2}\int_\Omega \lambda_k(\omega)\, v_\alpha(\omega)\, d\omega.

[5] Of course, our method applies to ℓ2-SVMs, and has even stronger theoretical guarantees; see Section 4.1.

Thus, for a fixed α, F is equivalent to kernel alignment, and can be minimized by Algorithm 1. However, the support vector weights α are of course not fixed; given a kernel k, α is chosen to maximize F, giving the (reciprocal) SVM margin. In all, to find a kernel k which maximizes the margin under an adversarial choice of α, one must consider a two-player zero-sum game:

\min_{\lambda \in \mathcal{L}}\; \max_{\alpha \in \mathcal{K}}\; F(\alpha, \lambda),    (6)

where L is the same constraint set as in Section 3.1, and K is the usual dual feasible set {0 ≤ α ≤ C, yᵀα = 0} with box parameter C.

In this view, we make some key observations. First, Algorithm 1 allows the min-player to play a pure-strategy best response to the max-player. Furthermore, a mixed strategy λ for the min-player is simply a translation- or rotation-invariant kernel, realized by the feature map corresponding to its support. Finally, since the objective is linear in λ and concave in α, there exists a Nash equilibrium (λ*, α*) for this game, from which λ* gives the margin-maximizing kernel.

We can use no-regret learning dynamics to approximate this equilibrium. Algorithm 2 runs Algorithm 1 for the min-player, and online gradient ascent (Zinkevich, 2003) for the max-player. Intuitively (and as is visualized in our synthetic experiments), this process slowly morphs the landscape of v_α to emphasize the margin, causing Algorithm 1 to find progressively more informative features. At the end, we simply concatenate these features; contingent on the success of the kernel alignment steps, we have approximated the Nash equilibrium.

Algorithm 2 No-regret learning dynamics for SVM margin maximization
1: Input: training samples S = {(x_i, y_i)}_{i=1}^n.
2: Parameters: box constraint C, number of steps T, step sizes {η_t}; parameters for Algorithm 1.
3: Set φ_x(ω) = e^{i⟨ω, x⟩} (translation-invariant), or φ_x(ℓ, m) = Y^d_{ℓ,m}(x) (rotation-invariant).
4: Initialize α_1 ← Proj_K[(C/2) · 1].
5: for t = 1, . . . , T do
6:   Use Algorithm 1 (or another v_α maximizer) on S with weights α_t, returning ω_t.
7:   Append the two features {Re φ_x(ω_t), Im φ_x(ω_t)} to each x_i's representation Φ(x_i).
8:   Compute the gradient g_t := ∇_α F(α_t, λ_t), where λ_t = δ_{ω_t} + δ_{−ω_t}.
9:   Update α_{t+1} ← Proj_K[α_t + η_t g_t].
10: end for
11: return features {Φ(x_i) ∈ R^{2T}}_{i=1}^n, or the dual measure λ̄ := (1/(2T)) ∑_{t=1}^T (δ_{ω_t} + δ_{−ω_t}).

We provide a theoretical analysis in Section 4.1, and a detailed discussion of heuristics, hyperparameters, and implementation details in Appendix A. One important note is that the online gradient g_t = 1 − 2Y Re(⟨Φ_t, Yα_t⟩ Φ_t) is computed very efficiently, where Φ_t ∈ C^n, with Φ_t(i) = φ_{x_i}(ω_t), is the vector of the most recently appended features. Langevin dynamics, easily implemented on a GPU, comprise the primary time bottleneck.

4.1 CONVERGENCE OF NO-REGRET LEARNING DYNAMICS

We first state the main theoretical result, which quantifies the convergence properties of Algorithm 2.

Theorem 4.1 (Main). Assume that at each step t, Algorithm 1 returns an ε_t-approximate global maximizer ω_t (i.e., v_{α_t}(ω_t) ≥ sup_{ω∈Ω} v_{α_t}(ω) − ε_t). Then, with a certain choice of step sizes η_t, Algorithm 2 produces a dual measure λ̄ ∈ L which satisfies

\max_{\alpha \in \mathcal{K}} F(\alpha, \bar\lambda) \;\le\; \min_{\lambda \in \mathcal{L}} \max_{\alpha \in \mathcal{K}} F(\alpha, \lambda) + \frac{\sum_{t=1}^T \varepsilon_t}{2T} + \frac{3(C + 2C^2)n}{\sqrt{T}}.

(Alternate form.) Suppose instead that v_{α_t}(ω_t) ≥ ρ at each time t. Then, λ̄ satisfies

\max_{\alpha \in \mathcal{K}} F(\alpha, \bar\lambda) \;\le\; \rho + \frac{3(C + 2C^2)n}{\sqrt{T}}.

We prove Theorem 4.1 in Appendix C. For convenience, we state the first version as an explicit margin bound (in terms of the competitive ratio M/M*):

Corollary 4.2. Let M be the margin obtained by training an ℓ1 linear SVM with the same C as in Algorithm 2, on the transformed samples {(Φ(x_i), y_i)}. Then, M is (1 − δ)-competitive with M*, the maximum margin achievable by a kernel with the assumed invariance, with

\delta \;\le\; \frac{\sum_{t} \varepsilon_t}{T} + \frac{3(C + 2C^2)n}{\sqrt{T}}.

This bound arises from the regret analysis of online gradient ascent (Zinkevich, 2003); our analysis is similar to the approach of Freund & Schapire (1996), where they present a boosting perspective. When using an ℓ2-SVM, the final term can be improved to O(log T / T) (Hazan et al., 2007). For a general overview of results in the field, refer to Hazan (2016).
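Before moving on to generalization, the outer loop of Algorithm 2 can likewise be summarized in a short numpy sketch for the translation-invariant case. It reuses the hypothetical langevin_peak routine sketched in Section 3.1; the helper names, the fixed step size eta, and the simple alternating projection are our own simplifications rather than the reference implementation.

```python
import numpy as np

def project_K(alpha, y, C, sweeps=10):
    # Alternating projection onto K = {0 <= alpha <= C, y^T alpha = 0}; see Appendix A.2.
    for _ in range(sweeps):
        alpha = np.clip(alpha, 0.0, C)
        alpha = alpha - y * (y @ alpha) / len(y)
    return alpha

def learn_features(X, y, T=100, C=1.0, eta=0.1, **langevin_kwargs):
    """No-regret dynamics: best-response features vs. online gradient ascent on alpha."""
    n = X.shape[0]
    alpha = project_K(np.full(n, C / 2.0), y, C)
    omegas = []
    for _ in range(T):
        omega, _ = langevin_peak(X, y, alpha, **langevin_kwargs)  # min-player's best response
        omegas.append(omega)
        phi = np.exp(1j * (X @ omega))                            # Phi_t(i) = phi_{x_i}(omega_t)
        s = (y * alpha) @ phi                                     # sum_i y_i alpha_i phi_{x_i}(omega_t)
        grad = 1.0 - 2.0 * y * np.real(np.conj(s) * phi)          # grad_alpha F(alpha_t, lambda_t)
        alpha = project_K(alpha + eta * grad, y, C)               # max-player's ascent step
    W = np.array(omegas)                                          # (T, d) learned frequencies
    Phi = np.hstack([np.cos(X @ W.T), np.sin(X @ W.T)])           # (n, 2T) real feature map
    return W, Phi, alpha
```

The returned map (cos⟨ω_t, ·⟩, sin⟨ω_t, ·⟩) for t ≤ T, applied to both training and test points, can then be handed to any off-the-shelf linear SVM solver with the same box constraint C, as in Corollary 4.2.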
4.2 GENERALIZATION BOUNDS

Finally, we state two (rather distinct) generalization guarantees. Both depend mildly on a bandwidth assumption ‖ω‖₂ ≤ R_ω and on the norm of the data, R_x := max_i ‖x_i‖.

First, we state a margin-dependent SVM generalization bound, due to Koltchinskii & Panchenko (2002). Notice the appearance of ‖λ_k‖₁ = R_λ, justifying our choice of normalization constant for Algorithm 1. Intriguingly, the end-to-end generalization error of our method decreases with an increasing number of random features, since the margin bound is being refined during Algorithm 2.

Theorem 4.3 (Generalization via margin). For any SVM decision function f : X → R with a kernel k_λ constrained by ‖λ‖₁ ≤ R_λ, trained on samples S drawn i.i.d. from a distribution D, the generalization error is bounded by

\Pr_{(x,y)\sim\mathcal{D}}[\, yf(x) \le 0 \,] \;\le\; \min_{\theta}\; \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{y_i f(x_i) \le \theta\}} \;+\; 6 R_\omega R_x \,(\cdots)

The proof can be found in Appendix D. Note that this improves on the generic result for MKL from Theorem 2 in Cortes et al. (2010), which has a log T dependence on the number of base kernels T. This improvement stems from the rank-one property of each component kernel.

Next, we address another concern entirely: the sample size required for v(ω) to approximate the ideal Fourier potential v_ideal(ω), the squared magnitude of the Fourier transform of the signed measure P − Q arising from the true distribution. For the shift-invariant case:

Theorem 4.4 (Generalization of the potential). Let S be a set of n i.i.d. training samples from D, with n ≥ O(d log(1/δ) / ε²). Then, with probability at least 1 − δ, for all ω with ‖ω‖ ≤ R_ω,

\left| \tfrac{1}{n^2}\, v(\omega) - v_{\mathrm{ideal}}(\omega) \right| \;\le\; \varepsilon.

The O(·) suppresses factors polynomial in R_x and R_ω. The full statement and proof are standard, and deferred to Appendix E. In particular, this result allows for a mild guarantee of polynomial hitting time on the locus of approximate local maxima of v_ideal (as opposed to the empirical v). Adapting the main result from Zhang et al. (2017):

Theorem 4.5 (Langevin hitting time). Let ω_τ be the output of Algorithm 1 on v_α(ω) after τ steps. Algorithm 1 finds an approximate local maximum of v_ideal in polynomial time. That is, with U being the set of ε-approximate local maxima of v_ideal(ω), some iterate ω_t satisfies

v(\omega_t) \;\ge\; \inf_{\omega:\, d(U, \omega) \le \cdots}\, v(\omega)

for some τ ≤ O(poly(R_x, R_ω, d, ζ, ξ, ε^{-1}, ⋯, log(1/δ))), with probability at least 1 − δ.

Of course, one should not expect polynomial hitting time on approximate global maxima; Roberts & Tweedie (1996) give asymptotic mixing guarantees.

5 EXPERIMENTS

In this section, we highlight the most important and illustrative parts of our experimental results. For further details, we provide an extended addendum to the experimental section in Appendix A. The code can be found at github.com/yz-ignescent/Not-So-Random-Features.

[Figure 1: Visualizations and experiments on synthetic data. (a) Evolution of Algorithm 2; columns: RBF baseline (92.1% test), T = 100 (98.6% test), T = 300 (99.1% test), T = 1000 (99.3% test); rows: the data {(x_i, y_i)} with dual weights α_t, the features {±ω_i}_{i=1}^T over the potential v_α(ω), and the SVM decision function. Top: the windmill dataset, weighted by the dual variables α_t. Middle: 2t random features (magenta; last added ω_t in cyan), overlaying the Fourier potential v_{α_t}(ω) in the background. Bottom: the decision boundary of a hinge-loss classifier trained on the kernel, showing refinement at the margin; contours indicate the value of the margin. (b) Detailed evolution of the Fourier potential v_α(ω). Left: the initial potential v(ω), with uniform weights. Right: the reweighted v_α(ω) at t = 300; note the larger support of peaks. (c) Toy dataset on the sphere, with the top-3 selected features: spherical harmonics Y_{0,11}, Y_{4,9}, Y_{8,27}.]
First, we exhibit two simple binary classification tasks, one in R² and the other on S², to demonstrate the power of our kernel selection method. As depicted in Figure 1, we create datasets with sharp boundaries, which are difficult for the standard RBF and arccosine (Cho & Saul, 2009) kernels. In both cases, n_train = 2000 and n_test = 50000.[6] Used with an ℓ1-SVM classifier, these baseline kernels saturate at 92.1% and 95.1% test accuracy, respectively; they are not expressive enough.

On the R² windmill task, Algorithm 2 chooses random features that progressively refine the decision boundary at the margin. By T = 1000, it exhibits almost perfect classification (99.7% training, 99.3% test). Similarly, on the S² checkerboard task, Algorithm 2 (with some adaptations described in Appendix A.3) reaches almost perfect classification (99.7% training, 99.1% test) at T = 100, supported on only 29 spherical harmonics as features.

[6] n_test is chosen so large in order to measure the true generalization error.

We provide some illuminating visualizations. Figures 1a and 1b show the evolution of the dual weights, random features, and classifier. As the theory suggests, the objective evolves to assign higher weight to points near the margin, and successive features improve the classifier's decisiveness in challenging regions (Figure 1a, bottom). Figure 1c visualizes some features from the S² experiment.

Next, we evaluate our kernel on standard benchmark binary classification tasks. Challenging label pairs are chosen from the MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky, 2009) datasets; each task consists of 10000 training and 2000 test examples, which is considered large-scale for kernel methods. Following the standard protocol from Yu et al. (2016), 512-dimensional HOG features (Dalal & Triggs, 2005) are used for the CIFAR-10 tasks instead of raw images. Of course, our intent is not to show state-of-the-art results on these tasks, on which deep neural networks easily dominate. Instead, the aim is to demonstrate viability as a scalable, principled kernel method.

We compare our results to baseline random features-based kernel machines: standard RBF random features (RBF-RF for short), and the method of Sinha & Duchi (2016) (LKRF),[7] using the same ℓ1-SVM throughout. As shown in Table 1, our method reliably outperforms these baselines, most significantly in the regime of few features. Intuitively, this lines up with the expectation that isotropic random sampling becomes exponentially unlikely to hit a good peak in high dimension; our method searches for these peaks.

[7] Traditional MKL methods are not tested here, as they are noticeably (>100 times) slower.
[Plots (left of Table 1): training accuracy and test accuracy vs. number of features m (0–5000) for RBF-RF, LKRF, and ours, on the CIFAR-10 automobile vs. truck task.]

Dataset                  Method    Number of features m
                                   100      500      1000     2000     5000
MNIST (1-7)              RBF-RF    97.42    99.02    99.10    99.42    99.31
                         LKRF      97.60    98.95    99.05    99.21    99.37
                         ours      99.05    99.28    99.45    99.63    99.65
MNIST (4-9)              RBF-RF    87.72    94.11    96.95    97.36    98.08
                         LKRF      88.42    93.99    95.88    97.18    97.76
                         ours      93.02    96.31    97.68    98.72    99.01
MNIST (5-6)              RBF-RF    91.96    97.63    97.85    98.51    98.82
                         LKRF      92.78    97.60    97.79    98.36    98.54
                         ours      96.62    98.67    99.18    99.32    99.64
CIFAR-10 (plane-bird)    RBF-RF    71.75    78.30    79.85    80.15    81.25
                         LKRF      73.30    78.80    79.85    80.80    81.55
                         ours      81.80    82.15    83.00    84.25    85.15
CIFAR-10 (auto-truck)    RBF-RF    69.70    75.75    82.55    83.35    85.10
                         LKRF      71.89    77.33    83.62    84.60    84.95
                         ours      84.80    85.90    86.00    88.25    88.80
CIFAR-10 (dog-frog)      RBF-RF    70.80    79.95    81.85    82.50    82.95
                         LKRF      70.75    78.95    81.25    81.35    83.00
                         ours      81.95    84.05    85.40    85.95    86.25
CIFAR-10 (auto-ship)     RBF-RF    70.85    78.00    79.15    81.80    82.55
                         LKRF      70.55    77.90    79.55    80.95    82.85
                         ours      81.00    83.70    84.40    86.70    87.35
CIFAR-10 (auto-deer)     RBF-RF    84.65    90.00    91.85    92.45    92.65
                         LKRF      84.85    89.75    91.05    92.65    92.90
                         ours      92.40    93.60    93.55    94.05    94.90

Table 1: Comparison on binary classification tasks. Our method is compared against standard RBF random features (RBF-RF), as well as the method from Sinha & Duchi (2016) (LKRF). Left: Performance is measured on the CIFAR-10 automobile vs. truck task, varying the number of features m. Standard deviations over 10 trials are shown, demonstrating high stability.

Furthermore, our theory predicts that the margin keeps improving, regardless of the dimensionality of the feature maps. Indeed, as our classifier saturates on the training data, test accuracy continues increasing, without overfitting. This decoupling of generalization from model complexity is characteristic of boosting methods.

In practice, our method is robust with respect to hyperparameter settings. Moreover, our method needs only 100 features to outperform both RBF-RF and LKRF with 5000 features; our GPU implementation reaches this point in 30 seconds. See Appendix A.1 for more tuning guidelines.

6 CONCLUSION

We have presented an efficient kernel learning method that uses tools from Fourier analysis and online learning to optimize over two natural infinite families of kernels. With this method, we show meaningful improvements on benchmark tasks, compared to related random features-based methods. Many theoretical questions remain, such as accelerating the search for Fourier peaks (e.g. Hassanieh et al. (2012); Kapralov (2016)). These, in addition to applying our learned kernels to state-of-the-art methods (e.g. convolutional kernel networks (Mairal et al., 2014; Yang et al., 2015a; Mairal, 2016)), prove to be exciting directions for future work.

ACKNOWLEDGMENTS

We are grateful to Sanjeev Arora, Roi Livni, Pravesh Kothari, Holden Lee, Karan Singh, and Naman Agarwal for helpful discussions. This research was supported by the National Science Foundation (NSF), the Office of Naval Research (ONR), and the Simons Foundation. The first two authors are supported in part by Elad Hazan's NSF grant IIS-1523815.

REFERENCES

Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199. ACM, 2017.
Haim Avron, Vikas Sindhwani, Jiyan Yang, and Michael W Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. Journal of Machine Learning Research, 17(120):1 38, 2016. Eduard Gabriel B az avan, Fuxin Li, and Cristian Sminchisescu. Fourier kernel learning. In Computer Vision ECCV 2012, pp. 459 473. Springer, 2012. Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pp. 342 350, 2009. Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Generalization bounds for learning kernels. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 247 254. Citeseer, 2010. Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13(Mar):795 828, 2012. Koby Crammer, Joseph Keshet, and Yoram Singer. Kernel design using boosting. In Advances in Neural Information Processing Systems, pp. 553 560, 2003. Nello Cristianini, John Shawe-Taylor, Andre Elisseeff, and Jaz Kandola. On kernel-target alignment. In Advances in Neural Information Processing Systems, pp. 367 373, 2002. Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, pp. 3041 3049, 2014. Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pp. 886 893. IEEE, 2005. Constantinos Daskalakis, Alan Deckelbaum, and Anthony Kim. Near-optimal no-regret algorithms for zero-sum games. Games and Economic Behavior, 92:327 348, 2015. Published as a conference paper at ICLR 2018 A Devinatz. Integral representations of positive definite functions. Transactions of the American Mathematical Society, 74(1):56 77, 1953. Yoav Freund and Robert E Schapire. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pp. 325 332. ACM, 1996. Kenji Fukumizu, Arthur Gretton, Gert R Lanckriet, Bernhard Sch olkopf, and Bharath K Sriperumbudur. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems, pp. 1750 1758, 2009. Jean Gallier. Notes on spherical harmonics and linear representations of lie groups. 2009. Mehmet G onen and Ethem Alpaydın. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(Jul):2211 2268, 2011. Haitham Hassanieh, Piotr Indyk, Dina Katabi, and Eric Price. Simple and practical algorithm for sparse Fourier transform. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1183 1194. Society for Industrial and Applied Mathematics, 2012. Elad Hazan. Introduction to online convex optimization. Foundations and Trends R in Optimization, 2(3-4):157 325, 2016. Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169 192, 2007. Atsushi Higuchi. Symmetric tensor spherical harmonics on the N-sphere and their application to the de Sitter group SO (N, 1). Journal of mathematical physics, 28(7):1553 1566, 1987. Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pp. 
793 800, 2009. Michael Kapralov. Sparse Fourier transform in any constant dimension with nearly-optimal sample complexity in sublinear time. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, pp. 264 277. ACM, 2016. Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, pp. 1 50, 2002. Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009. Gert RG Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5(Jan):27 72, 2004. Yann Le Cun, L eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Julien Mairal. End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems, pp. 1399 1407, 2016. Julien Mairal, Piotr Koniusz, Zaid Harchaoui, and Cordelia Schmid. Convolutional kernel networks. In Advances in Neural Information Processing Systems, pp. 2627 2635, 2014. Colin Mc Diarmid. On the method of bounded differences. Surveys in combinatorics, 141(1):148 188, 1989. Claus M uller. Analysis of spherical symmetries in Euclidean spaces, volume 129. Springer Science & Business Media, 2012. Yurii Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235 249, 2005. Published as a conference paper at ICLR 2018 Junier Oliva, Avinava Dubey, Barnabas Poczos, Jeff Schneider, and Eric P Xing. Bayesian nonparametric kernel-learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016. Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177 1184, 2007. Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pp. 341 363, 1996. Walter Rudin. Fourier analysis on groups. John Wiley & Sons, 2011. Robert E Schapire. A brief introduction to boosting. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, pp. 1401 1406, 1999. IJ Schoenberg. Positive definite functions on spheres. Duke Mathematical Journal, 9(1):96 108, 1942. Aman Sinha and John C Duchi. Learning kernels with random features. In Advances in Neural Information Processing Systems, pp. 1298 1306, 2016. Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Sch olkopf, and Gert RG Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517 1561, 2010. Elias M Stein and Guido Weiss. Introduction to Fourier analysis on Euclidean spaces (PMS-32), volume 32. Princeton University Press, 2016. John von Neumann. On rings of operators. reduction theory. Annals of Mathematics, pp. 401 485, 1949. Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, and Ziyu Wang. Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1476 1483, 2015a. Zichao Yang, Andrew Wilson, Alex Smola, and Le Song. A la carte learning fast kernels. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pp. 1098 1106, 2015b. 
Felix X Yu, Ananda Theertha Suresh, Krzysztof M Choromanski, Daniel N Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems, pp. 1975–1983, 2016.

Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1702.05575, 2017.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936, 2003.

A APPENDIX FOR EXPERIMENTS

Algorithm 2 gives a high-level outline of the essential components of our method. However, it conceals several hyperparameter choices and algorithmic heuristics, which are pertinent when applying our method in practice. We discuss a few more details in this section.

A.1 HYPERPARAMETER TUNING GUIDE

Throughout all experiments presented, we use hinge-loss SVM classifiers with C = 1. Note that the convergence of Algorithm 2 depends quadratically on C.

With regard to Langevin diffusion (Algorithm 1), we observe that the best samples arise from using high temperatures and Gaussian parallel initialization. For the latter, a rule of thumb is to initialize 500 parallel copies of Langevin dynamics, drawing the initial positions {ω_0} from a centered isotropic Gaussian with 1.5× the variance of the optimal RBF random features. (In turn, a common rule of thumb for this bandwidth is the median of pairwise Euclidean distances between data points.) The step size in Algorithm 1 is tuned based on the magnitude of the gradient of v(ω), which can be significantly smaller than the upper bound derived in Appendix E. As is standard practice in Langevin Monte Carlo methods, the temperature is chosen so that the perturbation is roughly of the same magnitude as the gradient step. Empirically, running Langevin dynamics for 100 steps suffices to locate a reasonably good peak. To further improve efficiency, one can modify Algorithm 1 to pick the top k ≈ 10 samples, a k-fold speedup which does not degrade the features much.

The step size of online gradient ascent is set to balance between being conservative and promoting diverse samples; these steps should not saturate (thereby solving the dual SVM problem), in order to have the strongest regret bound. In our experiments, we find that the step size achieving the standard regret guarantee (scaling as 1/√T) tends to be a little too conservative. On the other hand, it never hurts (and seems important in practice) to saturate the peak-finding routine (Algorithm 1), since this contributes an additive improvement to the margin bound. Noting that the objective is very smooth (the k-th derivative scales as R_x^k), it may be beneficial to refine the samples using a few steps of gradient descent with a very small learning rate, or an accelerated algorithm for finding approximate local minima of smooth non-convex functions; see, e.g., Agarwal et al. (2017).

A.2 PROJECTION ONTO THE SVM DUAL FEASIBLE SET

A quick note on projection onto the feasible set K = {0 ≤ α ≤ C, yᵀα = 0} of the SVM dual convex program: it typically suffices in practice to use alternating projection. This feasible set is the intersection of a hyperplane and a hypercube, both of which admit a simple projection step. The alternating projection onto the intersection of two non-empty convex sets was originally proposed by von Neumann (1949); the convergence rate can be shown to be linear.
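A minimal runnable version of this procedure (mirroring Algorithm 3 below) might look as follows; the function name, tolerance, and iteration cap are our own choices for the sketch.

```python
import numpy as np

def project_dual(alpha, y, C, tol=1e-12, max_sweeps=100):
    """Alternating projection onto K = {0 <= alpha <= C, y^T alpha = 0}."""
    for _ in range(max_sweeps):
        alpha = np.clip(alpha, 0.0, C)              # project onto the box [0, C]^n
        alpha = alpha - y * (y @ alpha) / len(y)    # project onto the hyperplane y^T alpha = 0
        # stop once the box constraint also holds to within tol (the hyperplane step is exact)
        if alpha.min() > -tol and alpha.max() < C + tol:
            break
    return alpha

# Feasibility check on random inputs: both constraints hold to hardware precision.
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=1000)
alpha = project_dual(rng.uniform(-1.0, 2.0, size=1000), y, C=1.0)
print(abs(y @ alpha), alpha.min(), alpha.max())
```

Note that the hyperplane step uses ‖y‖² = n, since y ∈ {±1}^n, matching Algorithm 3.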
To obtain the dual variables in our experiments, we use 10 such alternating projections. This results in a dual feasible solution up to hardware precision, and is a negligible component of the total running time (for which the parallel gradient computations are the bottleneck). Algorithm 3 Alternating Projection for SVM Dual Constraints Input: α Rn Parameters: box constraint C, label vector y. repeat Project onto the box: α = clip(α, 0, C) Project onto the hyperplane: α α y T α n y until convergence return α Published as a conference paper at ICLR 2018 A.3 SAMPLING SPHERICAL HARMONICS As we note in Section 3.1, it is unclear how to define gradient Langevin dynamics on v(l, m) in the inner-product case, since no topology is available on the indices (l, m) of the spherical harmonics. One option is to emulate Langevin dynamics, by constructing a discrete Markov chain which mixes to λ(l, m) eβv(l,m). However, we find in our experiments that it suffices to compute λ(l, m) by examining all values of v(l, m) with j no more than some threshold J. One should view this as approximating the kerneltarget alignment objective function via Fourier truncation. This is highly parallelizable: it involves approximately N(m, d) degree-J polynomial evaluations on the same sample data, which can be expressed using matrix multiplication. In our experiments, it sufficed to examine the first 1000 coefficients; we remark that it is unnatural in any real-world datasets to expect that v(ω) only has large values outside the threshold J. Under Fourier truncation, the domain of λ becomes a finite-dimensional simplex. In the gametheoretic view of 4.1, an approximate Nash equilibrium becomes concretely achievable via Nesterov s excessive gap technique (Nesterov, 2005; Daskalakis et al., 2015), given that the kernel player s actions are restricted to a mixed strategy over a finite set of basis kernels. Finally, we note a significant advantage to this setting, where we have a discrete set of Fourier coefficients: the same feature might be found multiple times. When a duplicate feature is found, it need not be concatenated to the representation; instead, the existing feature is scaled appropriately. This accounts for the drastically smaller support of features required to achieve near-perfect classification accuracy. B MORE ON SPHERICAL HARMONICS In this section, we go into more detail about the spherical harmonics in d dimensions. Although all of this material is standard in harmonic analysis, we provide this section for convenience, isolating only the relevant facts. B.1 SPHERICAL HARMONIC EXPANSION OF A ROTATION-INVARIANT KERNEL First, we provide a proof sketch for Theorem 2.2. We rely on the following theorem, an analogue of Bochner s theorem on the sphere, which characterizes rotation-invariant kernels: Theorem B.1 (Schoenberg (1942)). A continuous rotation-invariant function k(x, x ) = k( x, x ) on Sd 1 Sd 1 is positive semi-definite if and only if its expansion into Gegenbauer polynomials P d i has only non-negative coefficients, i.e. k( x, x ) = m=0 λm P d m( x, x ), (7) with λm 0, m N+. The Gegenbauer polynomials P d m : [ 1, 1] R are a generalization of the Legendre and Chebyshev polynomials; they form an orthogonal basis on [ 1, 1] under a certain weighting function (see Stein & Weiss (2016); Gallier (2009); M uller (2012) for details). Note that we adopt a different notation from Schoenberg (1942), which denotes our P d m by P (d 1)/2 m . 
Unlike Theorem 2.1, Theorem B.1 alone does not provide a clear path for constructing a feature map for inner-product kernel, because the inputs x, x of the kernel are still coupled after the expansion into P d m, which acts on the inner product x, x instead of on x and x separately. Fortunately, (see, e.g., Proposition 1.18 in Gallier (2009)), the Gegenbauer polynomials evaluated on an inner product admits a decomposition into spherical harmonics Y d ℓ,m: P d m( x, x ) = |Sd 1| l=1 Y d ℓ,m(x)Y d ℓ,m(x ) x, x Sd 1, (8) Published as a conference paper at ICLR 2018 where |Sd 1| denotes the surface area of Sd 1, and N(d, ℓ) = d 1+ℓ ℓ d 1+ℓ ℓ 2 . Here, we use the normalization convention that R Sd 1 |Y d ℓ,m(x)|2 dx = 1. From the existence of this expansion follows the claim from Theorem 2.2. For a detailed reference, see M uller (2012). B.2 PAIRING THE SPHERICAL HARMONICS We now specify the involution ω 7 ω on indices of spherical harmonics, which gives a pairing that takes the role of opposite Fourier coefficients in the X = Rd case. In particular, the Fourier transform λ of a real-valued function k : Rn R satisfies λ(ω) = λ( ω), so that the dual measure can be constrained to be symmetric. Now, consider the X = Sd 1 case, where the Fourier coefficients are on the set Ωof valid indices of spherical harmonics. We would like a permutation on the indices σ so that σ(σ(ω)) = ω, and λ(ω) = λ(σ(ω)) whenever λ is the spherical harmonic expansion of a real kernel. For a fixed dimension d and degree ℓ, the spherical harmonics form an N := N(d, ℓ)-dimensional orthogonal basis for a vector space V of complex polynomials from Sd 1 to C, namely the ℓhomogeneous polynomials (with domain extended to Rn) which are harmonic (eigenfunctions of the Laplace-Beltrami operator Sd 1). This basis is not unique; an arbitrary choice of basis may not contain pairs of conjugate functions. Fortunately, such a basis exists constructively, by separating the Laplace-Beltrami operator over spherical coordinates (θ1, . . . , θd 1). Concretely, for a fixed d and ℓ, the d-dimensional spherical harmonic, indexed by |ℓ1| ℓ2 ℓ3 . . . ℓd 1, is defined as: Yℓ1,...,ℓN def = 1 2π eiℓ1θ1 d 1 Y k P ℓd 1 ℓk (θk), where the functions in the product come from a certain family of associated Legendre functions in sin θ. For a detailed treatment, we adopt the construction and conventions from Higuchi (1987). In the scope of this paper, the relevant fact is that the desired involution is well-defined in all dimensions. Namely, consider the permutation σ that sends a spherical harmonic indexed by ω = (d, ℓ, ℓ1, ℓ2, . . . , ℓd 1) to σ(ω) = (d, ℓ, ℓ1, ℓ2, . . . , ℓd 1). Then, Y ω(x) def = Yσ(ω)(x) = Yω(x) for all ω, x. The symmetry condition in Theorem 2.2 follows straightforwardly. By orthonormality, we know that every square-integrable function f : Rn C has a unique decomposition into spherical harmonics, with coefficients λω = f, Yω , so that λ ω = f, Yω = λω. When f is real-valued, we conclude that λ ω = λω, as claimed. C PROOF OF THE MAIN THEOREM In this section, we prove the main theorem, which quantifies convergence of Algorithm 2 to the Nash equilibrium. We restate it here: Theorem 4.1. Assume that during each timestep t, the call to Algorithm 1 returns an εt-approximate global maximizer ωt (i.e. ˆvαt(ωt) maxω K ˆvαt(ω) εt). 
Then, Algorithm 2 returns a dual measure λ, which satisfies max α K F(α, λ) min λ L max α K F(α, λ) + PT t=1 εt 2T + O 1 Alternatively with the assumption that at each timestep t, ˆvαt(ωt) ρ, λ, satisfies max α K F(α, λ) ρ + O 1 If Algorithm 2 is used on a l2-SVM, the regret bound can be improved to be O( log T Proof. We will make use of the regret bound of online gradient ascent (see, e.g., (Hazan, 2016)). Here we only prove the theorem in the case for l1-SVM with box constraint C, under the assumption of εt-approximate optimality of Algorithm 1. Extending the proof to other cases is straightfoward. Published as a conference paper at ICLR 2018 Lemma C.1 (Regret bound for online gradient ascent). Let D be the diameter of the constraint set K, and G a Lipschitz constant for an arbitrary sequence of concave functions ft(α) def = F(α, λt) on K. Online gradient ascent on ft(α), with step size schedule ηt = G D t, guarantees the following for all T 1: t=1 ft(αt) max α K 1 T t=1 ft(α) 3GD Here, D C n by the box constraint, and we have G sup α K, ω Rd 1 2Y Re( Φt, Yαt Φt) 2 (1 + 2C) n. Thus, our regret bound is 3n(C+2C2) T ; that is, for all T 1, t=1 F(αt, δωt + δ ωt) max α K 1 T t=1 F(α, δωt + δ ωt)) 3n(C + 2C2) Since at each timestep t, ˆvαt(ωt) maxω K ˆvαt(ω) εt, we have by assumption F(αt, δωt + δ ωt) min ω K F(αt, δω + δ ω) + εt, and by concavity of F in α, we know that, for any fixed sequence α1, α2, . . . , αT , min λ L max α K F(α, λ) min λ L 1 T t=1 F(αt, λ), it follows that min λ L max α K F(α, λ) min λ L 1 T t=1 F(αt, λ) t=1 min ω K F αt, δω + δ ω t=1 F(αt, δωt + δ ωt) PT t=1 εt max α K 1 2T t=1 F(α, δωt + δ ωt) 3n(C + 2C2) T PT t=1 εt 2T = max α K 1 2T t=1 F(α, δωt + δ ωt) 3n(C + 2C2) T PT t=1 εt 2T . To complete the proof, note that for a given α K, 1 2T PT t=1 F(α, δωt + δ ωt) = F(α, λ) by linearity of F in the dual measure; here, λ = 1 2T PT t=1 δωt +δ ωt is the approximate Nash equilibrium found by the no-regret learning dynamics. D PROOF OF THEOREM 4.3 In this section, we compute the Rademacher complexity of the composition of the learned kernel and the classifier, proving Theorem 4.3. Published as a conference paper at ICLR 2018 Theorem 4.3. For any SVM decision function f : X R with a kernel kλ constrained by λ 1 Rλ trained on samples S drawn i.i.d. from distribution D, the generalization error is bounded by Pr (x,y) D[yf(x) 0] min θ 1 n i=1 1yif(xi) θ + 6RωRx Proof. For a fixed sample S = {xi}n i=1, we compute the empirical Rademacher complexity of the composite hypothesis class. Below, σ Rn is a vector of i.i.d. Rademacher random variables. sup λ 1 Rλ α Gkλα 1 j=1 σiαjkλ(xi, xj) sup λ 1 Rλ sup α Gkλα 1 σ Gkλα sup λ 1 Rλ σ Gkλσ v u u t sup λ 1 Rλ i=1 σieι ω,xi v u u t Rλ sup ω Ω i=1 σieι ω,xi i=1 σieι ω,xi i=1 σi cos( ω, xi ) i=1 σi sin( ω, xi ) i=1 σi cos( ω, xi ) i=1 σi sin( ω, xi ) i=1 σi cos( ω, xi ) i=1 σi sin( ω, xi ) Note that each term in the formula is the empirical Rademacher complexity of the composition of a 1-Lipschitz function (cosine or sine) with a linear function with ℓ2 norm bounded by Rω. By the composition property of Rademacher complexity, we have the following: i=1 σi cos( ω, xi ) i=1 σi cos( ω, xi ) i=1 σi cos( ω, xi ) Published as a conference paper at ICLR 2018 and because sin( ) is an odd function, i=1 σi sin( ω, xi ) Putting everything together, we finally obtain that from which we apply the result from Koltchinskii & Panchenko (2002) to conclude the theorem. 
E SAMPLE COMPLEXITY FOR THE FOURIER POTENTIAL In this section, we prove the following theorem about concentration of the Fourier potential. It suffices to disregard the reweighting vector α; to recover a guarantee in this case, simply replace ε with ε/C2. Note that we only argue this in the translation-invariant case. Theorem 4.4. Let S = {(xi, yi)}n i=1 be a set of i.i.d. training samples. Then, if n 8R2 x R4 ω(d log 2Rω we have that with probability at least 1 δ, 1 n2 v(n)(ω) videal(ω) ε for all ω Ωsuch that ω Rω. Let P(n), Q(n) denote the empirical measures from a sample S of size n, arising from i.i.d. samples from the true distribution, whose classes have measures P, Q, adopting the convention P(X) + Q(X) = 1. Then, in expectation over the sampling, and adopting the same normalization conventions as in the paper, we have E[P(n)/n] = P and E[Q(n)/n] = Q for every n. Let v(n)(ω) denote the empirical Fourier potential, computed from P(n), Q(n), so that we have E[v(n)(ω)/n2] = videal(ω). The result follows from a concentration bound on v(n)(ω)/n2. We first show that it is Lipschitz: Lemma E.1 (Lipschitzness of v(n) in ω). The function i yieι ω,xi is 2n2Rx-Lipschitz with respect to ω. Proof. We have i,j [n] yiyjeι ω,xi xj i,j [n] yiyji(xi xj)eι ω,xi xj n2 max i,j ||xi xj|| 2n2Rx. Thus, the Lipschitz constant of v(n)(ω)/n2 scales linearly with the norm of the data, a safe assumption. Next, we show Lipschitzness with respect to a single data point: Lemma E.2 (Lipschitzness of v(n) in xi). v(n)(ω) is 2n Rω-Lipschitz with respect to any xi. Published as a conference paper at ICLR 2018 Proof. Similarly as above: xiv(n)(ω) = i,j [n] yiyjeι ω,xi xj j [n] yiyj iωeι ω,xi xj Now, we complete the proof. By Lemma E.2, replacing one xi with another x i (whose norm is also bounded by Rx) changes v(n)(ω)/n2 by at most 4Rx Rω/n. Then, by Mc Diarmid s inequality (Mc Diarmid, 1989), we have the following fact: Lemma E.3 (Pointwise concentration of v(n)). For all ω Ωand ε > 0, Pr 1 n2 v(n)(ω) videal(ω) ε exp nε2 Now, let W be an (ε/Rω)-net of the set {ω Ω: ω Rω}. Applying the union bound on Lemma E.3 with ω ranging over each element of W, we have Pr 1 n2 v(n)(ω) videal(ω) ε, ω Rω Pr 1 n2 v(n)(ω) videal(ω) ε Rω , ω W |W| exp nε2 To make the LHS smaller than δ, it suffices to choose n 8R2 x R4 ω(d log 2Rω which is the claimed sample complexity.
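As a quick numerical illustration of this bound (not part of the analysis above), one can draw increasingly many i.i.d. samples from a fixed synthetic distribution and watch the normalized empirical potential stabilize at a fixed frequency; the two-Gaussian class-conditional distribution and the particular ω below are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
omega = rng.normal(size=d)                     # fixed frequency at which to evaluate the potential

def normalized_potential(n):
    # Balanced classes: class +1 ~ N(+1, I), class -1 ~ N(-1, I); mu = P - Q.
    y = rng.choice([-1.0, 1.0], size=n)
    X = rng.normal(size=(n, d)) + y[:, None]
    z = np.sum(y * np.exp(1j * (X @ omega)))   # integral of e^{i<omega, x>} d mu(x) over the sample
    return np.abs(z) ** 2 / n ** 2             # v^{(n)}(omega) / n^2

for n in [100, 1000, 10000, 100000]:
    print(n, normalized_potential(n))          # values settle as n grows, as the bound predicts
```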