Published as a conference paper at ICLR 2018

NOT-SO-RANDOM FEATURES

Brian Bullins, Cyril Zhang, Yi Zhang
Department of Computer Science
Princeton University
Princeton, NJ 08544, USA
{bbullins, cyril.zhang, y.zhang}@cs.princeton.edu

ABSTRACT

We propose a principled method for kernel learning, which relies on a Fourier-analytic characterization of translation-invariant or rotation-invariant kernels. Our method produces a sequence of feature maps, iteratively refining the SVM margin. We provide rigorous guarantees for optimality and generalization, interpreting our algorithm as online equilibrium-finding dynamics in a certain two-player min-max game. Evaluations on synthetic and real-world datasets demonstrate scalability and consistent improvements over related random features-based methods.

1 INTRODUCTION

Choosing the right kernel is a classic question that has riddled machine learning practitioners and theorists alike. Conventional wisdom instructs the user to select a kernel which captures the structure and geometric invariances in the data. Efforts to formulate this principle have inspired vibrant areas of study, going by names from feature selection to multiple kernel learning (MKL).

We present a new, principled approach for selecting a translation-invariant or rotation-invariant kernel to maximize the SVM classification margin. We first describe a kernel-alignment subroutine, which finds a peak in the Fourier transform of an adversarially chosen, data-dependent measure. Then, we define an iterative procedure that produces a sequence of feature maps, progressively improving the margin. The resulting algorithm is strikingly simple and scalable.

Intriguingly, our analysis interprets the main algorithm as no-regret learning dynamics in a zero-sum min-max game, whose value is the classification margin. Thus, we are able to quantify convergence guarantees towards the largest margin realizable by a kernel with the assumed invariance. Finally, we exhibit experiments on synthetic and benchmark datasets, demonstrating consistent improvements over related random features-based kernel methods.

1.1 RELATED WORK

There is a vast literature on MKL, from which we use the key concept of kernel alignment (Cristianini et al., 2002). Otherwise, our work bears few similarities to traditional MKL; this and much related work (e.g. Cortes et al. (2012); Gönen & Alpaydın (2011); Lanckriet et al. (2004)) are concerned with selecting a kernel by combining a collection of base kernels, chosen beforehand. Our method allows for greater expressivity and even better generalization guarantees.

Instead, we take inspiration from the method of random features (Rahimi & Recht, 2007). In this pioneering work, originally motivated by scalability, feature maps are sampled according to the Fourier transform of a chosen kernel. The idea of optimizing a kernel in random feature space was studied by Sinha & Duchi (2016). In this work, which is most similar to ours, kernel alignment is optimized via importance sampling on a fixed, finitely supported proposal measure. However, the proposal can fail to contain informative features, especially in high dimension; indeed, they highlight efficiency rather than showing performance improvements over RBF features. Learning a kernel in the Fourier domain (without the primal feature maps) has also been considered previously: Oliva et al. (2016) and Yang et al.
(2015b) model the Fourier spectrum parametrically, which limits expressivity; the former also require complicated posterior inference procedures. Bâzâvan et al. (2012) study learning a kernel in the Fourier domain jointly with regression parameters. They show experimentally that this locates informative frequencies in the data, without theoretical guarantees. Our visualizations suggest this approach can get stuck in poor local minima, even in 2 dimensions. Crammer et al. (2003) also use boosting to build a kernel sequentially; however, they only consider a basis of linear feature maps, and require costly generalized eigenvector computations.

From a statistical view, Fukumizu et al. (2009) bound the SVM margin in terms of maximum mean discrepancy, which is equivalent to (unweighted) kernel alignment. Notably, their bound can be loose if the number of support vectors is small; in such situations, our theory provides a tighter characterization. Moreover, our attention to the margin goes beyond the usual objective of kernel alignment.

1.2 OUR CONTRIBUTIONS

We present an algorithm that outputs a sequence of Fourier features, converging to the maximum realizable SVM classification margin on a labeled dataset. At each iteration, a pair of features is produced, which maximizes kernel alignment with a changing, adversarially chosen measure. As this measure changes slowly, the algorithm builds a diverse and informative feature representation. Our main theorem can be seen as a case of von Neumann's min-max theorem for a zero-sum concave-linear game; indeed, our method bears a deep connection to boosting (Freund & Schapire, 1996; Schapire, 1999). In particular, both the theory and empirical evidence suggest that the generalization error of our method decreases as the number of random features increases. In traditional MKL methods, generalization bounds worsen as base kernels are added.

Other methods in the framework of Fourier random features take the approach of approximating a kernel by sampling feature maps from a continuous distribution. In contrast, our method constructs a measure with small finite support, and realizes the kernel exactly by enumerating the associated finite-dimensional feature map; there is no randomness in the features.[1]

[1] We note a superficial resemblance to quasi-Monte Carlo methods (Avron et al., 2016); however, these are concerned with accelerating convergence rates rather than learning a kernel.

2 PRELIMINARIES

2.1 THE FOURIER DUAL MEASURE OF A KERNEL

We focus on two natural families of kernels k(x, x′): translation-invariant kernels on X = R^d, which depend only on x − x′, and rotation-invariant kernels on the hypersphere X = S^{d−1}, which depend only on ⟨x, x′⟩. These invariance assumptions subsume most widely-used classes of kernels; notably, the Gaussian (RBF) kernel satisfies both. For the former invariance, Bochner's theorem provides a Fourier-analytic characterization:

Theorem 2.1 (e.g. Eq. (1), Sec. 1.4.3 in Rudin (2011)). A translation-invariant continuous function k : R^d × R^d → R is positive semidefinite if and only if k(x − x′) is the Fourier transform of a symmetric non-negative measure (where the Fourier domain is Ω = R^d). That is,

k(x - x') = \int_\Omega \lambda(\omega)\, e^{i\langle\omega,\, x - x'\rangle}\, d\omega = \int_\Omega \lambda(\omega)\, e^{i\langle\omega, x\rangle}\, \overline{e^{i\langle\omega, x'\rangle}}\, d\omega,    (1)

for some λ ∈ L¹(Ω), satisfying λ(ω) ≥ 0 and λ(ω) = λ(−ω) for all ω ∈ Ω.

A similar characterization is available for rotation-invariant kernels, where the Fourier basis functions are the spherical harmonics, a countably infinite family of complex polynomials which form an orthonormal basis for square-integrable functions X → C.
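Before turning to the rotation-invariant case in detail, the correspondence in Theorem 2.1 can be illustrated with a short numerical check (separate from the method developed in this paper). For the Gaussian (RBF) kernel the dual measure is itself Gaussian, so sampling frequencies from it and averaging the induced features should recover the kernel value; the bandwidth sigma, the sample sizes, and the variable names below are arbitrary choices for this demo.

```python
import numpy as np

# Bochner's theorem for k(x - x') = exp(-||x - x'||^2 / (2 sigma^2)):
# its dual measure is Gaussian, so omega ~ N(0, I / sigma^2) and
# E[cos(<omega, x - x'>)] equals the kernel value.
rng = np.random.default_rng(0)
d, sigma, n_features = 5, 1.0, 20000

x, x_prime = rng.normal(size=d), rng.normal(size=d)
omegas = rng.normal(scale=1.0 / sigma, size=(n_features, d))  # draws from the dual measure

# Real feature pair {cos<omega, x>, sin<omega, x>} for each sampled frequency.
feats_x = np.concatenate([np.cos(omegas @ x), np.sin(omegas @ x)]) / np.sqrt(n_features)
feats_xp = np.concatenate([np.cos(omegas @ x_prime), np.sin(omegas @ x_prime)]) / np.sqrt(n_features)

approx = feats_x @ feats_xp                                   # Monte Carlo estimate of k(x, x')
exact = np.exp(-np.linalg.norm(x - x_prime) ** 2 / (2 * sigma ** 2))
print(approx, exact)                                          # the two values agree to ~1e-2
```

This is exactly the construction exploited by random Fourier features (Rahimi & Recht, 2007); the method developed below instead chooses the frequencies deliberately rather than sampling them.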
To unify notation, let Ω ⊆ N × Z be the set of valid index pairs ω = (ℓ, m), and let ω ↦ −ω denote a certain involution on Ω; we supply details and references in Appendix B.1.

Theorem 2.2. A rotation-invariant continuous function k : S^{d−1} × S^{d−1} → R is positive semidefinite if and only if it has a symmetric non-negative expansion into spherical harmonics Y^d_{ℓ,m}, i.e.

k(\langle x, x'\rangle) = \sum_{(\ell, m) \in \Omega} \lambda(\ell, m)\, Y^d_{\ell,m}(x)\, \overline{Y^d_{\ell,m}(x')},    (2)

for some λ ∈ L¹(Ω), with λ(ω) ≥ 0 and λ(ω) = λ(−ω) for all valid index pairs ω = (ℓ, m) ∈ Ω.

In each of these cases, we call this Fourier transform λ_k(ω) the dual measure of k. This measure decomposes k into a non-negative combination of Fourier basis kernels. Furthermore, this decomposition gives us a feature map φ : X → L²(Ω, λ_k) whose image realizes the kernel under the codomain's inner product;[2] that is, for all x, x′ ∈ X,

k(x, x') = \int_\Omega \lambda_k(\omega)\, \phi_x(\omega)\, \overline{\phi_{x'}(\omega)}\, d\omega.    (3)

Respectively, these feature maps are φ_x(ω) = e^{i⟨ω, x⟩} and φ_x(ℓ, m) = Y^d_{ℓ,m}(x). Although they are complex-valued, symmetry of λ_k allows us to apply the transformation {φ_x(ω), φ_x(−ω)} ↦ {Re φ_x(ω), Im φ_x(ω)} to yield real features, preserving the inner product. The analogous result holds for spherical harmonics.

[2] In fact, such a pair (φ, λ) exists for any psd kernel k; see Dai et al. (2014) or Devinatz (1953).

2.2 KERNEL ALIGNMENT

In a binary classification task with n training samples (x_i ∈ X, y_i ∈ {±1}), a widely-used quantity for measuring the quality of a kernel k : X × X → R is its alignment (Cristianini et al., 2002; Cortes et al., 2012),[3] defined by

\gamma_k(P, Q) \;\overset{\text{def}}{=}\; \sum_{i, j \in [n]} k(x_i, x_j)\, y_i y_j = y^\top G_k\, y,

where y is the vector of labels and G_k is the Gram matrix. Here, we let P = \sum_{i : y_i = 1} \delta_{x_i} and Q = \sum_{i : y_i = -1} \delta_{x_i} denote the (unnormalized) empirical measures of each class, where δ_x is the Dirac measure at x. When P, Q are arbitrary measures on X, this definition generalizes to

\gamma_k(P, Q) \;\overset{\text{def}}{=}\; \iint_{X^2} k(x, x')\, dP^2(x, x') + \iint_{X^2} k(x, x')\, dQ^2(x, x') - 2\iint_{X^2} k(x, x')\, dP(x)\, dQ(x').

[3] Definitions vary up to constants and normalization, depending on the use case.

In terms of the dual measure λ_k(ω), kernel alignment takes a useful alternate form, noted by Sriperumbudur et al. (2010). Let μ denote the signed measure P − Q. Then, when k is translation-invariant, we have

\gamma_k(P, Q) = \int_\Omega \lambda_k(\omega) \left| \int_X e^{i\langle\omega, x\rangle}\, d\mu(x) \right|^2 d\omega \;\overset{\text{def}}{=}\; \int_\Omega \lambda_k(\omega)\, v(\omega)\, d\omega.    (4)

Analogously, when k is rotation-invariant, we have

\gamma_k(P, Q) = \sum_{(\ell, m) \in \Omega} \lambda_k(\ell, m) \left| \int_X Y^d_{\ell,m}(x)\, d\mu(x) \right|^2 \;\overset{\text{def}}{=}\; \sum_{(\ell, m) \in \Omega} \lambda_k(\ell, m)\, v(\ell, m).    (5)

It can also be verified that ‖λ_k‖_{L¹(Ω)} = k(x, x), which is of course the same for all x ∈ X. In each case, the alignment is linear in λ_k. We call v(ω) the Fourier potential, which is the squared magnitude of the Fourier transform of the signed measure μ = P − Q. This function is clearly bounded pointwise by (P(X) + Q(X))².

3 ALGORITHMS

3.1 MAXIMIZING KERNEL ALIGNMENT IN THE FOURIER DOMAIN

First, we consider the problem of finding a kernel k (subject to either invariance) that maximizes the alignment γ_k(P, Q); we optimize the dual measure λ_k(ω). Aside from the non-negativity and symmetry constraints from Theorems 2.1 and 2.2, we constrain ‖λ_k‖₁ = 1, as this quantity appears as a normalization constant in our generalization bounds (see Theorem 4.3). Maximizing γ_k(P, Q) in
this constraint set, which we call L ⊆ L¹(Ω), takes the form of a linear program on an infinite-dimensional simplex. Noting that v(ω) = v(−ω) ≥ 0, γ_k is maximized by placing a Dirac mass at any pair of opposite modes ±ω*, where ω* ∈ argmax_ω v(ω).

At first, P and Q will be the empirical distributions of the classes, specified in Section 2.2. However, as Algorithm 2 proceeds, it will reweight each data point x_i in the measures by α(i). Explicitly, the reweighted Fourier potential takes the form[4]

v_\alpha(\omega) \;\overset{\text{def}}{=}\; \Big| \sum_{i=1}^n y_i \alpha_i\, e^{\iota\langle\omega, x_i\rangle} \Big|^2 \quad\text{or}\quad v_\alpha(\ell, m) \;\overset{\text{def}}{=}\; \Big| \sum_{i=1}^n y_i \alpha_i\, Y^d_{\ell,m}(x_i) \Big|^2.

[4] Whenever i is an index, we will denote the imaginary unit by ι.

Due to its non-convexity, maximizing v_α(ω), which can be interpreted as finding a global Fourier peak in the data, is theoretically challenging. However, we find that it is easy to find such peaks in our experiments, even in hundreds of dimensions. This arises from the empirical phenomenon that realistic data tend to be band-limited, a cornerstone hypothesis in data compression. An ℓ2 constraint (or, equivalently, ℓ2 regularization; see Kakade et al. (2009)) can be explicitly enforced to promote band-limitedness; we find that this is not necessary in practice.

When a gradient is available (in the translation-invariant case), we use Langevin dynamics (Algorithm 1) to find the peaks of v(ω), which enjoys mild theoretical hitting-time guarantees (see Theorem 4.5). See Appendix A.3 for a discussion of the (discrete) rotation-invariant case.

Algorithm 1 Langevin dynamics for kernel alignment
1: Input: training samples S = {(x_i, y_i)}_{i=1}^n, weights α ∈ R^n.
2: Parameters: time horizon τ, diffusion rate ζ, temperature ξ.
3: Initialize ω_0 arbitrarily.
4: for t = 0, . . . , τ − 1 do
5:   Update ω_{t+1} ← ω_t + ζ ∇v_α(ω_t) + √(2ζξ) · z, where z ∼ N(0, I).
6: end for
7: return ω* := argmax_t v_α(ω_t)

It is useful in practice to use parallel initialization, running m concurrent copies of the diffusion process and returning the best single ω encountered. This admits a very efficient GPU implementation: the multi-point evaluations v_α(ω_{1..m}) and ∇v_α(ω_{1..m}) can be computed from an (m, d)-by-(d, n) matrix product and pointwise trigonometric functions. We find that Algorithm 1 typically finds a reasonable peak within 100 steps.
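To make the best-response step concrete, the following minimal numpy sketch mirrors Algorithm 1 in the translation-invariant case, vectorized over m parallel chains as described above. The function names, the default parameter values, and the exact noise scaling are illustrative assumptions for this sketch, not the authors' reference implementation.

```python
import numpy as np

def potential_and_grad(W, X, w):
    """Evaluate v_alpha and its gradient at m frequencies at once.

    W: (m, d) candidate frequencies; X: (n, d) data; w = y * alpha: (n,) signed weights.
    v_alpha(omega) = |sum_i w_i e^{i<omega, x_i>}|^2 = C(omega)^2 + S(omega)^2.
    """
    proj = W @ X.T                              # (m, n) inner products <omega_j, x_i>
    C, S = np.cos(proj) @ w, np.sin(proj) @ w   # real and imaginary parts, per chain
    v = C ** 2 + S ** 2
    grad = 2.0 * ((S[:, None] * np.cos(proj) - C[:, None] * np.sin(proj)) * w[None, :]) @ X
    return v, grad                              # shapes (m,), (m, d)

def langevin_peak(X, y, alpha, m=500, steps=100, zeta=1e-2, xi=1e-2, seed=0):
    """Parallel Langevin ascent on v_alpha; returns the best frequency encountered."""
    rng = np.random.default_rng(seed)
    w = y * alpha
    W = rng.normal(size=(m, X.shape[1]))        # Gaussian parallel initialization
    best_w, best_v = None, -np.inf
    for _ in range(steps):
        v, grad = potential_and_grad(W, X, w)
        j = int(np.argmax(v))
        if v[j] > best_v:
            best_v, best_w = float(v[j]), W[j].copy()
        W = W + zeta * grad + np.sqrt(2 * zeta * xi) * rng.normal(size=W.shape)
    return best_w, best_v
```

The inner loop is exactly the (m, d)-by-(d, n) matrix product with pointwise trigonometric functions mentioned above, which is why the procedure maps so well onto a GPU.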
3.2 LEARNING THE MARGIN-MAXIMIZING KERNEL FOR SVM

Support vector machines (SVMs) are perhaps the most ubiquitous use case of kernels in practice. With this in mind, we propose a method that boosts Algorithm 1, building a kernel that maximizes the classification margin. Let k be a kernel with dual measure λ_k, and let Y := diag(y). Write the dual ℓ1-SVM objective,[5] parameterizing the kernel by λ_k:

F(\alpha, \lambda_k) \;\overset{\text{def}}{=}\; \mathbf{1}^\top \alpha - \tfrac{1}{2}\gamma_k(\alpha P, \alpha Q) = \mathbf{1}^\top \alpha - \tfrac{1}{2}\int_\Omega \lambda_k(\omega)\, v_\alpha(\omega)\, d\omega.

[5] Of course, our method applies to ℓ2-SVMs, and has even stronger theoretical guarantees; see Section 4.1.

Thus, for a fixed α, F is equivalent to kernel alignment, and can be minimized by Algorithm 1. However, the support vector weights α are of course not fixed; given a kernel k, α is chosen to maximize F, giving the (reciprocal) SVM margin. In all, to find a kernel k which maximizes the margin under an adversarial choice of α, one must consider a two-player zero-sum game:

\min_{\lambda \in \mathcal{L}}\; \max_{\alpha \in \mathcal{K}}\; F(\alpha, \lambda),    (6)

where L is the same constraint set as in Section 3.1, and K is the usual dual feasible set {0 ≤ α ≤ C, yᵀα = 0} with box parameter C.

In this view, we make some key observations. First, Algorithm 1 allows the min-player to play a pure-strategy best response to the max-player. Furthermore, a mixed strategy λ for the min-player is simply a translation- or rotation-invariant kernel, realized by the feature map corresponding to its support. Finally, since the objective is linear in λ and concave in α, there exists a Nash equilibrium (λ*, α*) for this game, from which λ* gives the margin-maximizing kernel.

We can use no-regret learning dynamics to approximate this equilibrium. Algorithm 2 runs Algorithm 1 for the min-player, and online gradient ascent (Zinkevich, 2003) for the max-player. Intuitively (and as is visualized in our synthetic experiments), this process slowly morphs the landscape of v_α to emphasize the margin, causing Algorithm 1 to find progressively more informative features. At the end, we simply concatenate these features; contingent on the success of the kernel alignment steps, we have approximated the Nash equilibrium.

Algorithm 2 No-regret learning dynamics for SVM margin maximization
1: Input: training samples S = {(x_i, y_i)}_{i=1}^n.
2: Parameters: box constraint C, number of steps T, step sizes {η_t}; parameters for Algorithm 1.
3: Set φ_x(ω) = e^{i⟨ω, x⟩} (translation-invariant), or φ_x(ℓ, m) = Y^d_{ℓ,m}(x) (rotation-invariant).
4: Initialize α_1 ← Proj_K[(C/2) · 1].
5: for t = 1, . . . , T do
6:   Use Algorithm 1 (or another v_α maximizer) on S with weights α_t, returning ω_t.
7:   Append the two features {Re φ_x(ω_t), Im φ_x(ω_t)} to each x_i's representation Φ(x_i).
8:   Compute the gradient g_t := ∇_α F(α_t, λ_t), where λ_t = δ_{ω_t} + δ_{−ω_t}.
9:   Update α_{t+1} ← Proj_K[α_t + η_t g_t].
10: end for
11: return features {Φ(x_i) ∈ R^{2T}}_{i=1}^n, or the dual measure λ̄ := (1/(2T)) ∑_{t=1}^T (δ_{ω_t} + δ_{−ω_t}).

We provide a theoretical analysis in Section 4.1, and a detailed discussion of heuristics, hyperparameters, and implementation details in Appendix A. One important note is that the online gradient g_t = 1 − 2Y Re(⟨Φ_t, Yα_t⟩ Φ_t) is computed very efficiently, where Φ_t ∈ C^n, with Φ_t(i) = φ_{x_i}(ω_t), is the vector of the most recently appended features. Langevin dynamics, easily implemented on a GPU, comprise the primary time bottleneck.

4.1 CONVERGENCE OF NO-REGRET LEARNING DYNAMICS

We first state the main theoretical result, which quantifies the convergence properties of Algorithm 2.

Theorem 4.1 (Main). Assume that at each step t, Algorithm 1 returns an ε_t-approximate global maximizer ω_t (i.e., v_{α_t}(ω_t) ≥ sup_{ω∈Ω} v_{α_t}(ω) − ε_t). Then, with a certain choice of step sizes η_t, Algorithm 2 produces a dual measure λ̄ ∈ L which satisfies

\max_{\alpha \in \mathcal{K}} F(\alpha, \bar\lambda) \;\le\; \min_{\lambda \in \mathcal{L}} \max_{\alpha \in \mathcal{K}} F(\alpha, \lambda) + \frac{\sum_{t=1}^T \varepsilon_t}{2T} + \frac{3(C + 2C^2)n}{\sqrt{T}}.

(Alternate form.) Suppose instead that v_{α_t}(ω_t) ≥ ρ at each time t. Then, λ̄ satisfies

\max_{\alpha \in \mathcal{K}} F(\alpha, \bar\lambda) \;\le\; \rho + \frac{3(C + 2C^2)n}{\sqrt{T}}.

We prove Theorem 4.1 in Appendix C. For convenience, we state the first version as an explicit margin bound (in terms of the competitive ratio M/M*):

Corollary 4.2. Let M be the margin obtained by training an ℓ1 linear SVM with the same C as in Algorithm 2, on the transformed samples {(Φ(x_i), y_i)}. Then, M is (1 − δ)-competitive with M*, the maximum margin achievable by a kernel with the assumed invariance, with

\delta \;\le\; \frac{\sum_{t} \varepsilon_t}{T} + \frac{3(C + 2C^2)n}{\sqrt{T}}.

This bound arises from the regret analysis of online gradient ascent (Zinkevich, 2003); our analysis is similar to the approach of Freund & Schapire (1996), where they present a boosting perspective. When using an ℓ2-SVM, the final term can be improved to O(log T / T) (Hazan et al., 2007). For a general overview of results in the field, refer to Hazan (2016).
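Before moving on to generalization, the outer loop of Algorithm 2 can likewise be summarized in a short numpy sketch for the translation-invariant case. It reuses the hypothetical langevin_peak routine sketched in Section 3.1; the helper names, the fixed step size eta, and the simple alternating projection are our own simplifications rather than the reference implementation.

```python
import numpy as np

def project_K(alpha, y, C, sweeps=10):
    # Alternating projection onto K = {0 <= alpha <= C, y^T alpha = 0}; see Appendix A.2.
    for _ in range(sweeps):
        alpha = np.clip(alpha, 0.0, C)
        alpha = alpha - y * (y @ alpha) / len(y)
    return alpha

def learn_features(X, y, T=100, C=1.0, eta=0.1, **langevin_kwargs):
    """No-regret dynamics: best-response features vs. online gradient ascent on alpha."""
    n = X.shape[0]
    alpha = project_K(np.full(n, C / 2.0), y, C)
    omegas = []
    for _ in range(T):
        omega, _ = langevin_peak(X, y, alpha, **langevin_kwargs)  # min-player's best response
        omegas.append(omega)
        phi = np.exp(1j * (X @ omega))                            # Phi_t(i) = phi_{x_i}(omega_t)
        s = (y * alpha) @ phi                                     # sum_i y_i alpha_i phi_{x_i}(omega_t)
        grad = 1.0 - 2.0 * y * np.real(np.conj(s) * phi)          # grad_alpha F(alpha_t, lambda_t)
        alpha = project_K(alpha + eta * grad, y, C)               # max-player's ascent step
    W = np.array(omegas)                                          # (T, d) learned frequencies
    Phi = np.hstack([np.cos(X @ W.T), np.sin(X @ W.T)])           # (n, 2T) real feature map
    return W, Phi, alpha
```

The returned map (cos⟨ω_t, ·⟩, sin⟨ω_t, ·⟩) for t ≤ T, applied to both training and test points, can then be handed to any off-the-shelf linear SVM solver with the same box constraint C, as in Corollary 4.2.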
4.2 GENERALIZATION BOUNDS

Finally, we state two (rather distinct) generalization guarantees. Both depend mildly on a bandwidth assumption ‖ω‖₂ ≤ R_ω and on the norm of the data, R_x := max_i ‖x_i‖.

First, we state a margin-dependent SVM generalization bound, due to Koltchinskii & Panchenko (2002). Notice the appearance of ‖λ_k‖₁ = R_λ, justifying our choice of normalization constant for Algorithm 1. Intriguingly, the end-to-end generalization error of our method decreases with an increasing number of random features, since the margin bound is being refined during Algorithm 2.

Theorem 4.3 (Generalization via margin). For any SVM decision function f : X → R with a kernel k_λ constrained by ‖λ‖₁ ≤ R_λ, trained on samples S drawn i.i.d. from a distribution D, the generalization error is bounded by

\Pr_{(x,y)\sim\mathcal{D}}[\, yf(x) \le 0 \,] \;\le\; \min_{\theta}\; \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{y_i f(x_i) \le \theta\}} \;+\; 6 R_\omega R_x \,(\cdots)

The proof can be found in Appendix D. Note that this improves on the generic result for MKL from Theorem 2 in Cortes et al. (2010), which has a log T dependence on the number of base kernels T. This improvement stems from the rank-one property of each component kernel.

Next, we address another concern entirely: the sample size required for v(ω) to approximate the ideal Fourier potential v_ideal(ω), the squared magnitude of the Fourier transform of the signed measure P − Q arising from the true distribution. For the shift-invariant case:

Theorem 4.4 (Generalization of the potential). Let S be a set of n i.i.d. training samples from D, with n ≥ O(d log(1/δ) / ε²). Then, with probability at least 1 − δ, for all ω with ‖ω‖ ≤ R_ω,

\left| \tfrac{1}{n^2}\, v(\omega) - v_{\mathrm{ideal}}(\omega) \right| \;\le\; \varepsilon.

The O(·) suppresses factors polynomial in R_x and R_ω. The full statement and proof are standard, and deferred to Appendix E. In particular, this result allows for a mild guarantee of polynomial hitting time on the locus of approximate local maxima of v_ideal (as opposed to the empirical v). Adapting the main result from Zhang et al. (2017):

Theorem 4.5 (Langevin hitting time). Let ω_τ be the output of Algorithm 1 on v_α(ω) after τ steps. Algorithm 1 finds an approximate local maximum of v_ideal in polynomial time. That is, with U being the set of ε-approximate local maxima of v_ideal(ω), some iterate ω_t satisfies

v(\omega_t) \;\ge\; \inf_{\omega:\, d(U, \omega) \le \cdots}\, v(\omega)

for some τ ≤ O(poly(R_x, R_ω, d, ζ, ξ, ε^{-1}, ⋯, log(1/δ))), with probability at least 1 − δ.

Of course, one should not expect polynomial hitting time on approximate global maxima; Roberts & Tweedie (1996) give asymptotic mixing guarantees.

5 EXPERIMENTS

In this section, we highlight the most important and illustrative parts of our experimental results. For further details, we provide an extended addendum to the experimental section in Appendix A. The code can be found at github.com/yz-ignescent/Not-So-Random-Features.

[Figure 1: Visualizations and experiments on synthetic data. (a) Evolution of Algorithm 2; columns: RBF baseline (92.1% test), T = 100 (98.6% test), T = 300 (99.1% test), T = 1000 (99.3% test); rows: the data {(x_i, y_i)} with dual weights α_t, the features {±ω_i}_{i=1}^T over the potential v_α(ω), and the SVM decision function. Top: the windmill dataset, weighted by the dual variables α_t. Middle: 2t random features (magenta; last added ω_t in cyan), overlaying the Fourier potential v_{α_t}(ω) in the background. Bottom: the decision boundary of a hinge-loss classifier trained on the kernel, showing refinement at the margin; contours indicate the value of the margin. (b) Detailed evolution of the Fourier potential v_α(ω). Left: the initial potential v(ω), with uniform weights. Right: the reweighted v_α(ω) at t = 300; note the larger support of peaks. (c) Toy dataset on the sphere, with the top-3 selected features: spherical harmonics Y_{0,11}, Y_{4,9}, Y_{8,27}.]
First, we exhibit two simple binary classification tasks, one in R² and the other on S², to demonstrate the power of our kernel selection method. As depicted in Figure 1, we create datasets with sharp boundaries, which are difficult for the standard RBF and arccosine (Cho & Saul, 2009) kernels. In both cases, n_train = 2000 and n_test = 50000.[6] Used with an ℓ1-SVM classifier, these baseline kernels saturate at 92.1% and 95.1% test accuracy, respectively; they are not expressive enough.

On the R² windmill task, Algorithm 2 chooses random features that progressively refine the decision boundary at the margin. By T = 1000, it exhibits almost perfect classification (99.7% training, 99.3% test). Similarly, on the S² checkerboard task, Algorithm 2 (with some adaptations described in Appendix A.3) reaches almost perfect classification (99.7% training, 99.1% test) at T = 100, supported on only 29 spherical harmonics as features.

[6] n_test is chosen so large in order to measure the true generalization error.

We provide some illuminating visualizations. Figures 1a and 1b show the evolution of the dual weights, random features, and classifier. As the theory suggests, the objective evolves to assign higher weight to points near the margin, and successive features improve the classifier's decisiveness in challenging regions (Figure 1a, bottom). Figure 1c visualizes some features from the S² experiment.

Next, we evaluate our kernel on standard benchmark binary classification tasks. Challenging label pairs are chosen from the MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky, 2009) datasets; each task consists of 10000 training and 2000 test examples, which is considered large-scale for kernel methods. Following the standard protocol from Yu et al. (2016), 512-dimensional HOG features (Dalal & Triggs, 2005) are used for the CIFAR-10 tasks instead of raw images. Of course, our intent is not to show state-of-the-art results on these tasks, on which deep neural networks easily dominate. Instead, the aim is to demonstrate viability as a scalable, principled kernel method.

We compare our results to baseline random features-based kernel machines: standard RBF random features (RBF-RF for short), and the method of Sinha & Duchi (2016) (LKRF),[7] using the same ℓ1-SVM throughout. As shown in Table 1, our method reliably outperforms these baselines, most significantly in the regime of few features. Intuitively, this lines up with the expectation that isotropic random sampling becomes exponentially unlikely to hit a good peak in high dimension; our method searches for these peaks.

[7] Traditional MKL methods are not tested here, as they are noticeably (>100 times) slower.
[Plots (left of Table 1): training accuracy and test accuracy vs. number of features m (0–5000) for RBF-RF, LKRF, and ours, on the CIFAR-10 automobile vs. truck task.]

Dataset                  Method    Number of features m
                                   100      500      1000     2000     5000
MNIST (1-7)              RBF-RF    97.42    99.02    99.10    99.42    99.31
                         LKRF      97.60    98.95    99.05    99.21    99.37
                         ours      99.05    99.28    99.45    99.63    99.65
MNIST (4-9)              RBF-RF    87.72    94.11    96.95    97.36    98.08
                         LKRF      88.42    93.99    95.88    97.18    97.76
                         ours      93.02    96.31    97.68    98.72    99.01
MNIST (5-6)              RBF-RF    91.96    97.63    97.85    98.51    98.82
                         LKRF      92.78    97.60    97.79    98.36    98.54
                         ours      96.62    98.67    99.18    99.32    99.64
CIFAR-10 (plane-bird)    RBF-RF    71.75    78.30    79.85    80.15    81.25
                         LKRF      73.30    78.80    79.85    80.80    81.55
                         ours      81.80    82.15    83.00    84.25    85.15
CIFAR-10 (auto-truck)    RBF-RF    69.70    75.75    82.55    83.35    85.10
                         LKRF      71.89    77.33    83.62    84.60    84.95
                         ours      84.80    85.90    86.00    88.25    88.80
CIFAR-10 (dog-frog)      RBF-RF    70.80    79.95    81.85    82.50    82.95
                         LKRF      70.75    78.95    81.25    81.35    83.00
                         ours      81.95    84.05    85.40    85.95    86.25
CIFAR-10 (auto-ship)     RBF-RF    70.85    78.00    79.15    81.80    82.55
                         LKRF      70.55    77.90    79.55    80.95    82.85
                         ours      81.00    83.70    84.40    86.70    87.35
CIFAR-10 (auto-deer)     RBF-RF    84.65    90.00    91.85    92.45    92.65
                         LKRF      84.85    89.75    91.05    92.65    92.90
                         ours      92.40    93.60    93.55    94.05    94.90

Table 1: Comparison on binary classification tasks. Our method is compared against standard RBF random features (RBF-RF), as well as the method from Sinha & Duchi (2016) (LKRF). Left: Performance is measured on the CIFAR-10 automobile vs. truck task, varying the number of features m. Standard deviations over 10 trials are shown, demonstrating high stability.

Furthermore, our theory predicts that the margin keeps improving, regardless of the dimensionality of the feature maps. Indeed, as our classifier saturates on the training data, test accuracy continues increasing, without overfitting. This decoupling of generalization from model complexity is characteristic of boosting methods.

In practice, our method is robust with respect to hyperparameter settings. Moreover, our method needs only 100 features to outperform both RBF-RF and LKRF with 5000 features; our GPU implementation reaches this point in 30 seconds. See Appendix A.1 for more tuning guidelines.

6 CONCLUSION

We have presented an efficient kernel learning method that uses tools from Fourier analysis and online learning to optimize over two natural infinite families of kernels. With this method, we show meaningful improvements on benchmark tasks, compared to related random features-based methods. Many theoretical questions remain, such as accelerating the search for Fourier peaks (e.g. Hassanieh et al. (2012); Kapralov (2016)). These, in addition to applying our learned kernels to state-of-the-art methods (e.g. convolutional kernel networks (Mairal et al., 2014; Yang et al., 2015a; Mairal, 2016)), prove to be exciting directions for future work.

ACKNOWLEDGMENTS

We are grateful to Sanjeev Arora, Roi Livni, Pravesh Kothari, Holden Lee, Karan Singh, and Naman Agarwal for helpful discussions. This research was supported by the National Science Foundation (NSF), the Office of Naval Research (ONR), and the Simons Foundation. The first two authors are supported in part by Elad Hazan's NSF grant IIS-1523815.

REFERENCES

Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199. ACM, 2017.
Haim Avron, Vikas Sindhwani, Jiyan Yang, and Michael W Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. Journal of Machine Learning Research, 17(120):1 38, 2016. Eduard Gabriel B az avan, Fuxin Li, and Cristian Sminchisescu. Fourier kernel learning. In Computer Vision ECCV 2012, pp. 459 473. Springer, 2012. Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pp. 342 350, 2009. Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Generalization bounds for learning kernels. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 247 254. Citeseer, 2010. Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13(Mar):795 828, 2012. Koby Crammer, Joseph Keshet, and Yoram Singer. Kernel design using boosting. In Advances in Neural Information Processing Systems, pp. 553 560, 2003. Nello Cristianini, John Shawe-Taylor, Andre Elisseeff, and Jaz Kandola. On kernel-target alignment. In Advances in Neural Information Processing Systems, pp. 367 373, 2002. Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, pp. 3041 3049, 2014. Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pp. 886 893. IEEE, 2005. Constantinos Daskalakis, Alan Deckelbaum, and Anthony Kim. Near-optimal no-regret algorithms for zero-sum games. Games and Economic Behavior, 92:327 348, 2015. Published as a conference paper at ICLR 2018 A Devinatz. Integral representations of positive definite functions. Transactions of the American Mathematical Society, 74(1):56 77, 1953. Yoav Freund and Robert E Schapire. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pp. 325 332. ACM, 1996. Kenji Fukumizu, Arthur Gretton, Gert R Lanckriet, Bernhard Sch olkopf, and Bharath K Sriperumbudur. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems, pp. 1750 1758, 2009. Jean Gallier. Notes on spherical harmonics and linear representations of lie groups. 2009. Mehmet G onen and Ethem Alpaydın. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(Jul):2211 2268, 2011. Haitham Hassanieh, Piotr Indyk, Dina Katabi, and Eric Price. Simple and practical algorithm for sparse Fourier transform. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1183 1194. Society for Industrial and Applied Mathematics, 2012. Elad Hazan. Introduction to online convex optimization. Foundations and Trends R in Optimization, 2(3-4):157 325, 2016. Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169 192, 2007. Atsushi Higuchi. Symmetric tensor spherical harmonics on the N-sphere and their application to the de Sitter group SO (N, 1). Journal of mathematical physics, 28(7):1553 1566, 1987. Sham M Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pp. 
793 800, 2009. Michael Kapralov. Sparse Fourier transform in any constant dimension with nearly-optimal sample complexity in sublinear time. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, pp. 264 277. ACM, 2016. Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, pp. 1 50, 2002. Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009. Gert RG Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5(Jan):27 72, 2004. Yann Le Cun, L eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Julien Mairal. End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems, pp. 1399 1407, 2016. Julien Mairal, Piotr Koniusz, Zaid Harchaoui, and Cordelia Schmid. Convolutional kernel networks. In Advances in Neural Information Processing Systems, pp. 2627 2635, 2014. Colin Mc Diarmid. On the method of bounded differences. Surveys in combinatorics, 141(1):148 188, 1989. Claus M uller. Analysis of spherical symmetries in Euclidean spaces, volume 129. Springer Science & Business Media, 2012. Yurii Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235 249, 2005. Published as a conference paper at ICLR 2018 Junier Oliva, Avinava Dubey, Barnabas Poczos, Jeff Schneider, and Eric P Xing. Bayesian nonparametric kernel-learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016. Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177 1184, 2007. Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pp. 341 363, 1996. Walter Rudin. Fourier analysis on groups. John Wiley & Sons, 2011. Robert E Schapire. A brief introduction to boosting. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, pp. 1401 1406, 1999. IJ Schoenberg. Positive definite functions on spheres. Duke Mathematical Journal, 9(1):96 108, 1942. Aman Sinha and John C Duchi. Learning kernels with random features. In Advances in Neural Information Processing Systems, pp. 1298 1306, 2016. Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Sch olkopf, and Gert RG Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517 1561, 2010. Elias M Stein and Guido Weiss. Introduction to Fourier analysis on Euclidean spaces (PMS-32), volume 32. Princeton University Press, 2016. John von Neumann. On rings of operators. reduction theory. Annals of Mathematics, pp. 401 485, 1949. Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, and Ziyu Wang. Deep fried convnets. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1476 1483, 2015a. Zichao Yang, Andrew Wilson, Alex Smola, and Le Song. A la carte learning fast kernels. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pp. 1098 1106, 2015b. 
Felix X Yu, Ananda Theertha Suresh, Krzysztof M Choromanski, Daniel N Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems, pp. 1975–1983, 2016.

Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1702.05575, 2017.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936, 2003.

A APPENDIX FOR EXPERIMENTS

Algorithm 2 gives a high-level outline of the essential components of our method. However, it conceals several hyperparameter choices and algorithmic heuristics, which are pertinent when applying our method in practice. We discuss a few more details in this section.

A.1 HYPERPARAMETER TUNING GUIDE

Throughout all experiments presented, we use hinge-loss SVM classifiers with C = 1. Note that the convergence of Algorithm 2 depends quadratically on C.

With regard to Langevin diffusion (Algorithm 1), we observe that the best samples arise from using high temperatures and Gaussian parallel initialization. For the latter, a rule of thumb is to initialize 500 parallel copies of Langevin dynamics, drawing the initial positions {ω_0} from a centered isotropic Gaussian with 1.5× the variance of the optimal RBF random features. (In turn, a common rule of thumb for this bandwidth is the median of pairwise Euclidean distances between data points.) The step size in Algorithm 1 is tuned based on the magnitude of the gradient of v(ω), which can be significantly smaller than the upper bound derived in Appendix E. As is standard practice in Langevin Monte Carlo methods, the temperature is chosen so that the perturbation is roughly of the same magnitude as the gradient step. Empirically, running Langevin dynamics for 100 steps suffices to locate a reasonably good peak. To further improve efficiency, one can modify Algorithm 1 to pick the top k ≈ 10 samples, a k-fold speedup which does not degrade the features much.

The step size of online gradient ascent is set to balance between being conservative and promoting diverse samples; these steps should not saturate (thereby solving the dual SVM problem), in order to have the strongest regret bound. In our experiments, we find that the step size achieving the standard regret guarantee (scaling as 1/√T) tends to be a little too conservative. On the other hand, it never hurts (and seems important in practice) to saturate the peak-finding routine (Algorithm 1), since this contributes an additive improvement to the margin bound. Noting that the objective is very smooth (the k-th derivative scales as R_x^k), it may be beneficial to refine the samples using a few steps of gradient descent with a very small learning rate, or an accelerated algorithm for finding approximate local minima of smooth non-convex functions; see, e.g., Agarwal et al. (2017).

A.2 PROJECTION ONTO THE SVM DUAL FEASIBLE SET

A quick note on projection onto the feasible set K = {0 ≤ α ≤ C, yᵀα = 0} of the SVM dual convex program: it typically suffices in practice to use alternating projection. This feasible set is the intersection of a hyperplane and a hypercube, both of which admit a simple projection step. The alternating projection onto the intersection of two non-empty convex sets was originally proposed by von Neumann (1949); the convergence rate can be shown to be linear.
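A minimal runnable version of this procedure (mirroring Algorithm 3 below) might look as follows; the function name, tolerance, and iteration cap are our own choices for the sketch.

```python
import numpy as np

def project_dual(alpha, y, C, tol=1e-12, max_sweeps=100):
    """Alternating projection onto K = {0 <= alpha <= C, y^T alpha = 0}."""
    for _ in range(max_sweeps):
        alpha = np.clip(alpha, 0.0, C)              # project onto the box [0, C]^n
        alpha = alpha - y * (y @ alpha) / len(y)    # project onto the hyperplane y^T alpha = 0
        # stop once the box constraint also holds to within tol (the hyperplane step is exact)
        if alpha.min() > -tol and alpha.max() < C + tol:
            break
    return alpha

# Feasibility check on random inputs: both constraints hold to hardware precision.
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=1000)
alpha = project_dual(rng.uniform(-1.0, 2.0, size=1000), y, C=1.0)
print(abs(y @ alpha), alpha.min(), alpha.max())
```

Note that the hyperplane step uses ‖y‖² = n, since y ∈ {±1}^n, matching Algorithm 3.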
To obtain the dual variables in our experiments, we use 10 such alternating projections. This results in a dual feasible solution up to hardware precision, and is a negligible component of the total running time (for which the parallel gradient computations are the bottleneck). Algorithm 3 Alternating Projection for SVM Dual Constraints Input: α Rn Parameters: box constraint C, label vector y. repeat Project onto the box: α = clip(α, 0, C) Project onto the hyperplane: α α y T α n y until convergence return α Published as a conference paper at ICLR 2018 A.3 SAMPLING SPHERICAL HARMONICS As we note in Section 3.1, it is unclear how to define gradient Langevin dynamics on v(l, m) in the inner-product case, since no topology is available on the indices (l, m) of the spherical harmonics. One option is to emulate Langevin dynamics, by constructing a discrete Markov chain which mixes to λ(l, m) eβv(l,m). However, we find in our experiments that it suffices to compute λ(l, m) by examining all values of v(l, m) with j no more than some threshold J. One should view this as approximating the kerneltarget alignment objective function via Fourier truncation. This is highly parallelizable: it involves approximately N(m, d) degree-J polynomial evaluations on the same sample data, which can be expressed using matrix multiplication. In our experiments, it sufficed to examine the first 1000 coefficients; we remark that it is unnatural in any real-world datasets to expect that v(ω) only has large values outside the threshold J. Under Fourier truncation, the domain of λ becomes a finite-dimensional simplex. In the gametheoretic view of 4.1, an approximate Nash equilibrium becomes concretely achievable via Nesterov s excessive gap technique (Nesterov, 2005; Daskalakis et al., 2015), given that the kernel player s actions are restricted to a mixed strategy over a finite set of basis kernels. Finally, we note a significant advantage to this setting, where we have a discrete set of Fourier coefficients: the same feature might be found multiple times. When a duplicate feature is found, it need not be concatenated to the representation; instead, the existing feature is scaled appropriately. This accounts for the drastically smaller support of features required to achieve near-perfect classification accuracy. B MORE ON SPHERICAL HARMONICS In this section, we go into more detail about the spherical harmonics in d dimensions. Although all of this material is standard in harmonic analysis, we provide this section for convenience, isolating only the relevant facts. B.1 SPHERICAL HARMONIC EXPANSION OF A ROTATION-INVARIANT KERNEL First, we provide a proof sketch for Theorem 2.2. We rely on the following theorem, an analogue of Bochner s theorem on the sphere, which characterizes rotation-invariant kernels: Theorem B.1 (Schoenberg (1942)). A continuous rotation-invariant function k(x, x ) = k( x, x ) on Sd 1 Sd 1 is positive semi-definite if and only if its expansion into Gegenbauer polynomials P d i has only non-negative coefficients, i.e. k( x, x ) = m=0 λm P d m( x, x ), (7) with λm 0, m N+. The Gegenbauer polynomials P d m : [ 1, 1] R are a generalization of the Legendre and Chebyshev polynomials; they form an orthogonal basis on [ 1, 1] under a certain weighting function (see Stein & Weiss (2016); Gallier (2009); M uller (2012) for details). Note that we adopt a different notation from Schoenberg (1942), which denotes our P d m by P (d 1)/2 m . 
Unlike Theorem 2.1, Theorem B.1 alone does not provide a clear path for constructing a feature map for inner-product kernel, because the inputs x, x of the kernel are still coupled after the expansion into P d m, which acts on the inner product x, x instead of on x and x separately. Fortunately, (see, e.g., Proposition 1.18 in Gallier (2009)), the Gegenbauer polynomials evaluated on an inner product admits a decomposition into spherical harmonics Y d ℓ,m: P d m( x, x ) = |Sd 1| l=1 Y d ℓ,m(x)Y d ℓ,m(x ) x, x Sd 1, (8) Published as a conference paper at ICLR 2018 where |Sd 1| denotes the surface area of Sd 1, and N(d, ℓ) = d 1+ℓ ℓ d 1+ℓ ℓ 2 . Here, we use the normalization convention that R Sd 1 |Y d ℓ,m(x)|2 dx = 1. From the existence of this expansion follows the claim from Theorem 2.2. For a detailed reference, see M uller (2012). B.2 PAIRING THE SPHERICAL HARMONICS We now specify the involution ω 7 ω on indices of spherical harmonics, which gives a pairing that takes the role of opposite Fourier coefficients in the X = Rd case. In particular, the Fourier transform λ of a real-valued function k : Rn R satisfies λ(ω) = λ( ω), so that the dual measure can be constrained to be symmetric. Now, consider the X = Sd 1 case, where the Fourier coefficients are on the set Ωof valid indices of spherical harmonics. We would like a permutation on the indices σ so that σ(σ(ω)) = ω, and λ(ω) = λ(σ(ω)) whenever λ is the spherical harmonic expansion of a real kernel. For a fixed dimension d and degree ℓ, the spherical harmonics form an N := N(d, ℓ)-dimensional orthogonal basis for a vector space V of complex polynomials from Sd 1 to C, namely the ℓhomogeneous polynomials (with domain extended to Rn) which are harmonic (eigenfunctions of the Laplace-Beltrami operator Sd 1). This basis is not unique; an arbitrary choice of basis may not contain pairs of conjugate functions. Fortunately, such a basis exists constructively, by separating the Laplace-Beltrami operator over spherical coordinates (θ1, . . . , θd 1). Concretely, for a fixed d and ℓ, the d-dimensional spherical harmonic, indexed by |ℓ1| ℓ2 ℓ3 . . . ℓd 1, is defined as: Yℓ1,...,ℓN def = 1 2π eiℓ1θ1 d 1 Y k P ℓd 1 ℓk (θk), where the functions in the product come from a certain family of associated Legendre functions in sin θ. For a detailed treatment, we adopt the construction and conventions from Higuchi (1987). In the scope of this paper, the relevant fact is that the desired involution is well-defined in all dimensions. Namely, consider the permutation σ that sends a spherical harmonic indexed by ω = (d, ℓ, ℓ1, ℓ2, . . . , ℓd 1) to σ(ω) = (d, ℓ, ℓ1, ℓ2, . . . , ℓd 1). Then, Y ω(x) def = Yσ(ω)(x) = Yω(x) for all ω, x. The symmetry condition in Theorem 2.2 follows straightforwardly. By orthonormality, we know that every square-integrable function f : Rn C has a unique decomposition into spherical harmonics, with coefficients λω = f, Yω , so that λ ω = f, Yω = λω. When f is real-valued, we conclude that λ ω = λω, as claimed. C PROOF OF THE MAIN THEOREM In this section, we prove the main theorem, which quantifies convergence of Algorithm 2 to the Nash equilibrium. We restate it here: Theorem 4.1. Assume that during each timestep t, the call to Algorithm 1 returns an εt-approximate global maximizer ωt (i.e. ˆvαt(ωt) maxω K ˆvαt(ω) εt). 
Then, Algorithm 2 returns a dual measure λ, which satisfies max α K F(α, λ) min λ L max α K F(α, λ) + PT t=1 εt 2T + O 1 Alternatively with the assumption that at each timestep t, ˆvαt(ωt) ρ, λ, satisfies max α K F(α, λ) ρ + O 1 If Algorithm 2 is used on a l2-SVM, the regret bound can be improved to be O( log T Proof. We will make use of the regret bound of online gradient ascent (see, e.g., (Hazan, 2016)). Here we only prove the theorem in the case for l1-SVM with box constraint C, under the assumption of εt-approximate optimality of Algorithm 1. Extending the proof to other cases is straightfoward. Published as a conference paper at ICLR 2018 Lemma C.1 (Regret bound for online gradient ascent). Let D be the diameter of the constraint set K, and G a Lipschitz constant for an arbitrary sequence of concave functions ft(α) def = F(α, λt) on K. Online gradient ascent on ft(α), with step size schedule ηt = G D t, guarantees the following for all T 1: t=1 ft(αt) max α K 1 T t=1 ft(α) 3GD Here, D C n by the box constraint, and we have G sup α K, ω Rd 1 2Y Re( Φt, Yαt Φt) 2 (1 + 2C) n. Thus, our regret bound is 3n(C+2C2) T ; that is, for all T 1, t=1 F(αt, δωt + δ ωt) max α K 1 T t=1 F(α, δωt + δ ωt)) 3n(C + 2C2) Since at each timestep t, ˆvαt(ωt) maxω K ˆvαt(ω) εt, we have by assumption F(αt, δωt + δ ωt) min ω K F(αt, δω + δ ω) + εt, and by concavity of F in α, we know that, for any fixed sequence α1, α2, . . . , αT , min λ L max α K F(α, λ) min λ L 1 T t=1 F(αt, λ), it follows that min λ L max α K F(α, λ) min λ L 1 T t=1 F(αt, λ) t=1 min ω K F αt, δω + δ ω t=1 F(αt, δωt + δ ωt) PT t=1 εt max α K 1 2T t=1 F(α, δωt + δ ωt) 3n(C + 2C2) T PT t=1 εt 2T = max α K 1 2T t=1 F(α, δωt + δ ωt) 3n(C + 2C2) T PT t=1 εt 2T . To complete the proof, note that for a given α K, 1 2T PT t=1 F(α, δωt + δ ωt) = F(α, λ) by linearity of F in the dual measure; here, λ = 1 2T PT t=1 δωt +δ ωt is the approximate Nash equilibrium found by the no-regret learning dynamics. D PROOF OF THEOREM 4.3 In this section, we compute the Rademacher complexity of the composition of the learned kernel and the classifier, proving Theorem 4.3. Published as a conference paper at ICLR 2018 Theorem 4.3. For any SVM decision function f : X R with a kernel kλ constrained by λ 1 Rλ trained on samples S drawn i.i.d. from distribution D, the generalization error is bounded by Pr (x,y) D[yf(x) 0] min θ 1 n i=1 1yif(xi) θ + 6RωRx Proof. For a fixed sample S = {xi}n i=1, we compute the empirical Rademacher complexity of the composite hypothesis class. Below, σ Rn is a vector of i.i.d. Rademacher random variables. sup λ 1 Rλ α Gkλα 1 j=1 σiαjkλ(xi, xj) sup λ 1 Rλ sup α Gkλα 1 σ Gkλα sup λ 1 Rλ σ Gkλσ v u u t sup λ 1 Rλ i=1 σieι ω,xi v u u t Rλ sup ω Ω i=1 σieι ω,xi i=1 σieι ω,xi i=1 σi cos( ω, xi ) i=1 σi sin( ω, xi ) i=1 σi cos( ω, xi ) i=1 σi sin( ω, xi ) i=1 σi cos( ω, xi ) i=1 σi sin( ω, xi ) Note that each term in the formula is the empirical Rademacher complexity of the composition of a 1-Lipschitz function (cosine or sine) with a linear function with ℓ2 norm bounded by Rω. By the composition property of Rademacher complexity, we have the following: i=1 σi cos( ω, xi ) i=1 σi cos( ω, xi ) i=1 σi cos( ω, xi ) Published as a conference paper at ICLR 2018 and because sin( ) is an odd function, i=1 σi sin( ω, xi ) Putting everything together, we finally obtain that from which we apply the result from Koltchinskii & Panchenko (2002) to conclude the theorem. 
E SAMPLE COMPLEXITY FOR THE FOURIER POTENTIAL In this section, we prove the following theorem about concentration of the Fourier potential. It suffices to disregard the reweighting vector α; to recover a guarantee in this case, simply replace ε with ε/C2. Note that we only argue this in the translation-invariant case. Theorem 4.4. Let S = {(xi, yi)}n i=1 be a set of i.i.d. training samples. Then, if n 8R2 x R4 ω(d log 2Rω we have that with probability at least 1 δ, 1 n2 v(n)(ω) videal(ω) ε for all ω Ωsuch that ω Rω. Let P(n), Q(n) denote the empirical measures from a sample S of size n, arising from i.i.d. samples from the true distribution, whose classes have measures P, Q, adopting the convention P(X) + Q(X) = 1. Then, in expectation over the sampling, and adopting the same normalization conventions as in the paper, we have E[P(n)/n] = P and E[Q(n)/n] = Q for every n. Let v(n)(ω) denote the empirical Fourier potential, computed from P(n), Q(n), so that we have E[v(n)(ω)/n2] = videal(ω). The result follows from a concentration bound on v(n)(ω)/n2. We first show that it is Lipschitz: Lemma E.1 (Lipschitzness of v(n) in ω). The function i yieι ω,xi is 2n2Rx-Lipschitz with respect to ω. Proof. We have i,j [n] yiyjeι ω,xi xj i,j [n] yiyji(xi xj)eι ω,xi xj n2 max i,j ||xi xj|| 2n2Rx. Thus, the Lipschitz constant of v(n)(ω)/n2 scales linearly with the norm of the data, a safe assumption. Next, we show Lipschitzness with respect to a single data point: Lemma E.2 (Lipschitzness of v(n) in xi). v(n)(ω) is 2n Rω-Lipschitz with respect to any xi. Published as a conference paper at ICLR 2018 Proof. Similarly as above: xiv(n)(ω) = i,j [n] yiyjeι ω,xi xj j [n] yiyj iωeι ω,xi xj Now, we complete the proof. By Lemma E.2, replacing one xi with another x i (whose norm is also bounded by Rx) changes v(n)(ω)/n2 by at most 4Rx Rω/n. Then, by Mc Diarmid s inequality (Mc Diarmid, 1989), we have the following fact: Lemma E.3 (Pointwise concentration of v(n)). For all ω Ωand ε > 0, Pr 1 n2 v(n)(ω) videal(ω) ε exp nε2 Now, let W be an (ε/Rω)-net of the set {ω Ω: ω Rω}. Applying the union bound on Lemma E.3 with ω ranging over each element of W, we have Pr 1 n2 v(n)(ω) videal(ω) ε, ω Rω Pr 1 n2 v(n)(ω) videal(ω) ε Rω , ω W |W| exp nε2 To make the LHS smaller than δ, it suffices to choose n 8R2 x R4 ω(d log 2Rω which is the claimed sample complexity.
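As a quick numerical illustration of this bound (not part of the analysis above), one can draw increasingly many i.i.d. samples from a fixed synthetic distribution and watch the normalized empirical potential stabilize at a fixed frequency; the two-Gaussian class-conditional distribution and the particular ω below are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
omega = rng.normal(size=d)                     # fixed frequency at which to evaluate the potential

def normalized_potential(n):
    # Balanced classes: class +1 ~ N(+1, I), class -1 ~ N(-1, I); mu = P - Q.
    y = rng.choice([-1.0, 1.0], size=n)
    X = rng.normal(size=(n, d)) + y[:, None]
    z = np.sum(y * np.exp(1j * (X @ omega)))   # integral of e^{i<omega, x>} d mu(x) over the sample
    return np.abs(z) ** 2 / n ** 2             # v^{(n)}(omega) / n^2

for n in [100, 1000, 10000, 100000]:
    print(n, normalized_potential(n))          # values settle as n grows, as the bound predicts
```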