# High-Dimensional Optimization in Adaptive Random Subspaces

Jonathan Lacotte, Department of Electrical Engineering, Stanford University, lacotte@stanford.edu
Mert Pilanci, Department of Electrical Engineering, Stanford University
Marco Pavone, Department of Aeronautics & Astronautics, Stanford University

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

We propose a new randomized optimization method for high-dimensional problems which can be seen as a generalization of coordinate descent to random subspaces. We show that an adaptive sampling strategy for the random subspace significantly outperforms the oblivious sampling method, which is the common choice in the recent literature. The adaptive subspace can be efficiently generated by a correlated random matrix ensemble whose statistics mimic the input data. We prove that the improvement in the relative error of the solution can be tightly characterized in terms of the spectrum of the data matrix, and provide probabilistic upper bounds. We then illustrate the consequences of our theory with data matrices of different spectral decay. Extensive experimental results show that the proposed approach offers significant speed-ups in machine learning problems including logistic regression, kernel classification with random convolution layers and shallow neural networks with rectified linear units. Our analysis is based on convex analysis and Fenchel duality, and establishes connections to sketching and randomized matrix decomposition.

1 Introduction

Random Fourier features, the Nyström method and sketching techniques have been successful in large-scale machine learning problems. The common practice is to employ oblivious sampling or sketching matrices, which are typically randomized and fixed ahead of time. However, it is not clear whether one can do better by adapting the sketching matrices to data. In this paper, we show that adaptive sketching matrices can significantly improve the approximation quality. We characterize the approximation error on the optimal solution in terms of the smoothness of the function and spectral properties of the data matrix.

Many machine learning problems end up being high-dimensional optimization problems, which typically follow from forming the kernel matrix of a large dataset or mapping the data through a high-dimensional feature map, such as random Fourier features [20] or convolutional neural networks [13]. Such high-dimensional representations induce higher computational and memory complexities, and result in slower training of the models. Random projections are a classical way of performing dimensionality reduction, and are widely used in many algorithmic contexts [25]. Nevertheless, only recently have these methods attracted attention as an effective way of performing dimensionality reduction in convex optimization. In the context of solving a linear system $Ax = b$ and least-squares optimization, the authors of [14] propose a randomized iterative method with linear convergence rate which, at each iteration, performs a proximal update $x^{(k+1)} = \arg\min_{x \in \mathcal{T}} \|x - x^\star\|_2^2$, where the next iterate $x^{(k+1)}$ is restricted to lie within an affine subspace $\mathcal{T} = x^{(k)} + \mathrm{range}(A^\top S)$, and $S$ is an $n \times m$ dimension-reduction matrix with $m \ll \min\{n, d\}$.
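For concreteness, the following is a minimal sketch of a sketch-and-project style iteration in the spirit of [14], specialized to a consistent linear system $Ax = b$ with the Euclidean norm and a fresh Gaussian sketch at every step; the function name, dimensions and parameters are illustrative and not taken from [14].

```python
import numpy as np

def sketch_and_project(A, b, m=10, iters=500, seed=0):
    """Sketch-and-project for a consistent linear system Ax = b.

    Each step projects x onto the solution set of the sketched system
    S^T A x = S^T b, so the update stays inside x + range(A^T S).
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(iters):
        S = rng.standard_normal((n, m))        # fresh n x m Gaussian sketch
        AtS = A.T @ S                          # d x m, spans the update subspace
        G = np.linalg.pinv(S.T @ (A @ AtS))    # pseudo-inverse of S^T A A^T S
        x = x - AtS @ (G @ (S.T @ (A @ x - b)))
    return x

# quick check on a small consistent system
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 50))
x_true = rng.standard_normal(50)
b = A @ x_true
x_hat = sketch_and_project(A, b)
print("residual:", np.linalg.norm(A @ x_hat - b))
```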
In the context of kernel ridge regression, the authors of [31] propose to approximate the $n$-dimensional kernel matrix by sketching its columns to a lower $m$-dimensional subspace, chosen uniformly at random. From the low-dimensional kernel ridge solution $\alpha^\star \in \mathbb{R}^m$, they show how to reconstruct an approximation $\tilde{x} \in \mathbb{R}^n$ of the high-dimensional solution $x^\star \in \mathbb{R}^n$. Provided that the sketching dimension $m$ is large enough as measured by the spectral properties of the kernel matrix $K$, the estimate $\tilde{x}$ retains some statistical properties of $x^\star$, e.g., minimaxity. Similarly, in the broader context of classification through convex loss functions, the authors of [32, 33] propose to project the $d$-dimensional features of a given data matrix $A$ onto a lower $m$-dimensional subspace, chosen independently of the data. After computing the optimal low-dimensional classifier $\alpha^\star \in \mathbb{R}^m$, their algorithm returns an estimate $\tilde{x} \in \mathbb{R}^d$ of the optimal classifier $x^\star \in \mathbb{R}^d$. Even though they provide formal guarantees on the estimation error $\|\tilde{x} - x^\star\|_2$, their results rely on restrictive assumptions: the data matrix $A$ must be low-rank, or the classifier $x^\star$ must lie in the span of the top few left singular vectors of $A$. Further, random subspace optimization has also been explored for large-scale trust-region problems [27], also using a subspace chosen uniformly at random.

Our proposed approach draws connections with the Gaussian Kaczmarz method proposed in [14] and the kernel sketching method in [31]. In contrast, we are interested in smooth, convex optimization problems with ridge regularization, and, unlike [32, 33], we do not make any assumption on the optimal solution $x^\star$.

Our work relates to the considerable amount of literature on randomized approximations of high-dimensional kernel matrices $K$. The typical approach consists of building a low-rank factorization of the matrix $K$, using a random subset of its columns [28, 23, 9, 17]. The so-called Nyström method has proven to be effective empirically [30], and many research efforts have been devoted to improving and analyzing the performance of its many variants (e.g., uniform column sub-sampling, leverage-score based sampling), especially in the context of kernel ridge regression [1, 2]. In a related vein, sketching methods have been proposed to reduce the space complexity of storing high-dimensional data matrices [19, 31], by projecting their rows onto a randomly chosen lower-dimensional subspace. Our theoretical findings build on known results for low-rank factorization of positive semi-definite (p.s.d.) matrices [15, 5, 12, 29], and show intimate connections with kernel matrix sketching [31]. Lastly, our problem setting also draws connections with compressed sensing [7], where the goal is to recover a high-dimensional structured signal from a small number of randomized, and usually oblivious, measurements.

1.1 Contributions

In this work, we propose a novel randomized subspace optimization method with strong solution approximation guarantees which outperform those of oblivious sampling methods. We derive probabilistic bounds on the approximation error for general convex functions. We show that our method provides a significant improvement over the oblivious version, and we theoretically quantify this observation as a function of the spectral properties of the data matrix. We also introduce an iterative version of our method, which converges to the optimal solution by iterative refinement.
1.2 An overview of our results

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex and $\mu$-strongly smooth function, i.e., $\nabla^2 f(w) \preceq \mu I_n$ for all $w \in \mathbb{R}^n$, and let $A \in \mathbb{R}^{n \times d}$ be a high-dimensional matrix. We are interested in solving the primal problem
$$x^\star = \arg\min_{x \in \mathbb{R}^d}\; f(Ax) + \frac{\lambda}{2}\|x\|_2^2. \qquad (1)$$
Given a random matrix $S \in \mathbb{R}^{d \times m}$ with $m \ll d$, we consider instead the sketched primal problem
$$\alpha^\star \in \arg\min_{\alpha \in \mathbb{R}^m}\; f(AS\alpha) + \frac{\lambda}{2}\,\alpha^\top S^\top S\,\alpha, \qquad (2)$$
where we effectively restrict the optimization domain to a lower $m$-dimensional subspace. In this work, we explore the following questions. How can we estimate the original solution $x^\star$ given the sketched solution $\alpha^\star$? Is a uniformly random subspace the optimal choice, e.g., $S$ Gaussian i.i.d.? Or can we come up with an adaptive sampling distribution, related to the matrix $A$, which yields stronger guarantees?

By Fenchel duality analysis, we exhibit a natural candidate for an approximate solution to $x^\star$, given by $\tilde{x} = -\lambda^{-1} A^\top \nabla f(AS\alpha^\star)$. Our main result (Section 2) establishes that, for an adaptive sketching matrix of the form $S = A^\top \tilde{S}$ where $\tilde{S}$ is typically Gaussian i.i.d., the relative error satisfies a high-probability guarantee of the form $\|\tilde{x} - x^\star\|_2 / \|x^\star\|_2 \le \varepsilon$, with $\varepsilon < 1$. Our error bound $\varepsilon$ depends on the smoothness parameter $\mu$, the regularization parameter $\lambda$, the shape of the domain of the Fenchel conjugate $f^*$ and the spectral decay of the matrix $A$. Further, we show that this error can be explicitly controlled in terms of the singular values of $A$, and we derive concrete bounds for several standard spectral profiles which arise in data analysis and machine learning. In particular, we show that using the adaptive matrix $S = A^\top \tilde{S}$ provides much stronger guarantees than oblivious sketching, where $S$ is independent of $A$. Then, we take advantage of the error contraction (i.e., $\varepsilon < 1$), and extend our adaptive sketching scheme to an iterative version (Section 3), which, after $T$ iterations, returns a higher-precision estimate $\tilde{x}^{(T)}$ that satisfies $\|\tilde{x}^{(T)} - x^\star\|_2 / \|x^\star\|_2 \le \varepsilon^T$. Throughout this work, we specialize our formal guarantees and empirical evaluations (Section 5) to Gaussian matrices $\tilde{S}$, which is a standard choice and yields the tightest error bounds. However, our approach extends to a broader class of matrices $\tilde{S}$, such as Rademacher matrices, sub-sampled randomized Fourier (SRFT) or Hadamard (SRHT) transforms, and column sub-sampling matrices. Thus, it provides a general framework for random subspace optimization with strong solution guarantees.

2 Convex optimization in adaptive random subspaces

We introduce the Fenchel conjugate of $f$, defined as $f^*(z) := \sup_{w \in \mathbb{R}^n}\{w^\top z - f(w)\}$, which is convex and whose domain $\mathrm{dom} f^* := \{z \in \mathbb{R}^n \mid f^*(z) < +\infty\}$ is a closed, convex set. Our control of the relative error $\|\tilde{x} - x^\star\|_2 / \|x^\star\|_2$ is closely tied to controlling a distance between the respective solutions of the dual problems of (1) and (2). The proofs of the next two propositions follow from standard convex analysis arguments [22], and are deferred to Appendix C.

Proposition 1 (Fenchel Duality). Under the previous assumptions on $f$, it holds that
$$\min_x\; f(Ax) + \frac{\lambda}{2}\|x\|_2^2 \;=\; \max_z\; -f^*(z) - \frac{1}{2\lambda}\big\|A^\top z\big\|_2^2.$$
There exist a unique primal solution $x^\star$ and a unique dual solution $z^\star$. Further, we have $Ax^\star \in \partial f^*(z^\star)$, $z^\star = \nabla f(Ax^\star)$ and $x^\star = -\frac{1}{\lambda} A^\top z^\star$.

Proposition 2 (Fenchel Duality on Sketched Program). Strong duality holds for the sketched program
$$\min_\alpha\; f(AS\alpha) + \frac{\lambda}{2}\|S\alpha\|_2^2 \;=\; \max_y\; -f^*(y) - \frac{1}{2\lambda}\big\|P_S A^\top y\big\|_2^2,$$
where $P_S = S(S^\top S)^\dagger S^\top$ is the orthogonal projector onto the range of $S$. There exist a sketched primal solution $\alpha^\star$ and a unique sketched dual solution $y^\star$.
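As a concrete, hedged illustration of this recipe, the snippet below works out the quadratic case $f(w) = \frac{1}{2}\|w - b\|_2^2$ (so $\mu = 1$ and both programs have closed forms), comparing the recovery $\tilde{x} = -\lambda^{-1}A^\top\nabla f(AS\alpha^\star)$ under an oblivious Gaussian sketch and the adaptive choice $S = A^\top\tilde{S}$; the dimensions, synthetic spectrum and function names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def solve_sketched(A, b, S, lam):
    """Sketched program (2) for f(w) = 0.5*||w - b||^2:
       min_a 0.5*||A S a - b||^2 + lam/2 * a^T S^T S a,
    then recover x_tilde = -A^T grad f(A S a*) / lam = A^T (b - A S a*) / lam."""
    AS = A @ S
    alpha = np.linalg.solve(AS.T @ AS + lam * (S.T @ S), AS.T @ b)
    return A.T @ (b - AS @ alpha) / lam

rng = np.random.default_rng(0)
n, d, m, lam = 500, 2000, 50, 1e-2
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((d, n)))
A = U @ np.diag(np.arange(1.0, n + 1.0) ** -1.0) @ V.T       # decaying spectrum

b = rng.standard_normal(n)
x_star = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(n), b)  # exact solution of (1)

S_oblivious = rng.standard_normal((d, m))                 # S independent of A
S_adaptive = A.T @ rng.standard_normal((n, m))            # S = A^T S_tilde
for name, S in [("oblivious", S_oblivious), ("adaptive", S_adaptive)]:
    x_tilde = solve_sketched(A, b, S, lam)
    print(name, np.linalg.norm(x_tilde - x_star) / np.linalg.norm(x_star))
```

On such a decaying spectrum, the adaptive choice typically yields a far smaller relative error for the same $m$, which is the phenomenon quantified in Section 2.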
Further, for any solution $\alpha^\star$, it holds that $AS\alpha^\star \in \partial f^*(y^\star)$ and $y^\star = \nabla f(AS\alpha^\star)$.

We define the following deterministic functional $Z_f$, which depends on $f^*$, the data matrix $A$ and the sketching matrix $S$, and plays an important role in controlling the approximation error,
$$Z_f \equiv Z_f(A, S) = \sup_{\Delta \in (\mathrm{dom} f^* - z^\star)} \frac{\big\|P_S^\perp A^\top \Delta\big\|_2}{\|\Delta\|_2}, \qquad (3)$$
where $P_S^\perp = I - P_S$ is the orthogonal projector onto $\mathrm{range}(S)^\perp$. The relationship $x^\star = -\lambda^{-1} A^\top \nabla f(Ax^\star)$ suggests the point $\tilde{x} = -\lambda^{-1} A^\top \nabla f(AS\alpha^\star)$ as a candidate for approximating $x^\star$. The Fenchel dual programs of (1) and (2) only differ in their quadratic regularization terms, $\|A^\top z\|_2^2$ and $\|P_S A^\top y\|_2^2$, and this difference is tied to the quantity $\|P_S^\perp A^\top (z - y)\|_2$. As it holds that $\|\tilde{x} - x^\star\|_2 = \lambda^{-1}\|A^\top(z^\star - y^\star)\|_2$, we show that the error $\|\tilde{x} - x^\star\|_2$ can be controlled in terms of the spectral norm $\|P_S^\perp A^\top\|_2$, or, more sharply, in terms of the quantity $Z_f$, which satisfies $Z_f \le \|P_S^\perp A^\top\|_2$. We formalize this statement in our next result, whose proof is deferred to Appendix B.1.

Theorem 1 (Deterministic bound). Let $\alpha^\star$ be any minimizer of the sketched program (2). Then, under the condition $\lambda \ge 2\mu Z_f^2$, we have
$$\|\tilde{x} - x^\star\|_2 \le \sqrt{\frac{\mu}{2\lambda}}\; Z_f\; \|x^\star\|_2, \qquad (4)$$
which further implies
$$\|\tilde{x} - x^\star\|_2 \le \sqrt{\frac{\mu}{2\lambda}}\; \big\|P_S^\perp A^\top\big\|_2\; \|x^\star\|_2. \qquad (5)$$

For an adaptive sketching matrix $S = A^\top \tilde{S}$, we rewrite $\|P_S^\perp A^\top\|_2^2 = \|K - K\tilde{S}(\tilde{S}^\top K \tilde{S})^\dagger \tilde{S}^\top K\|_2$, where $K = AA^\top$ is p.s.d. Combining our deterministic bound (5) with known results [15, 12, 5] for randomized low-rank factorizations of the form $K\tilde{S}(\tilde{S}^\top K\tilde{S})^\dagger \tilde{S}^\top K$ of p.s.d. matrices $K$, we can give guarantees with high probability (w.h.p.) on the relative error for various types of matrices $\tilde{S}$. For conciseness, we specialize our next result to adaptive Gaussian sketching, i.e., $\tilde{S}$ Gaussian i.i.d. Given a target rank $k \ge 2$, we introduce a measure of the spectral tail of $A$ as $R_k(A) = \big(\sigma_{k+1}^2 + \frac{1}{k}\sum_{j=k+1}^{\rho} \sigma_j^2\big)^{1/2}$, where $\rho$ is the rank of the matrix $A$ and $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_\rho$ its singular values. The proof of the next result follows from a combination of Theorem 1 and Corollary 10.9 in [15], and is deferred to Appendix B.2.

Corollary 1 (High-probability bound). Given $k \le \min(n, d)/2$ and a sketching dimension $m = 2k$, let $S = A^\top\tilde{S}$, with $\tilde{S} \in \mathbb{R}^{n \times m}$ Gaussian i.i.d. Then, for some universal constant $c_0 \le 36$, provided $\lambda \ge 2\mu c_0^2 R_k^2(A)$, it holds with probability at least $1 - 12 e^{-k}$ that
$$\|\tilde{x} - x^\star\|_2 \le \sqrt{\frac{\mu}{2\lambda}}\; c_0\, R_k(A)\; \|x^\star\|_2. \qquad (6)$$

Remark 1. The quantity $Z_f^2 = \sup_{\Delta \in (\mathrm{dom} f^* - z^\star)\cap \mathcal{S}^{n-1}} \Delta^\top A P_S^\perp A^\top \Delta$ is the maximal eigenvalue of the matrix $A P_S^\perp A^\top$ restricted to the spherical cap $\mathcal{K} := (\mathrm{dom} f^* - z^\star) \cap \mathcal{S}^{n-1}$, where $\mathcal{S}^{n-1}$ is the unit sphere in dimension $n$. Thus, depending on the geometry of $\mathcal{K}$, the deterministic bound (4) might be much tighter than (5), and yield a probabilistic bound better than (6). The investigation of such a result is left for future work.

2.1 Theoretical predictions as a function of spectral decay

We study the theoretical predictions given by (5) on the relative error, for different spectral decays of $A$ and different sketching methods, in particular, adaptive Gaussian sketching versus oblivious Gaussian sketching and leverage-score column sub-sampling [12]. We denote by $\nu_k = \sigma_k^2$ the eigenvalues of $AA^\top$. For conciseness, we absorb $\mu$ into the eigenvalues by setting $\nu_k \leftarrow \mu\,\nu_k$ and $\mu \leftarrow 1$. This re-scaling leaves the right-hand side of the bound (5) unchanged, and does not affect the analysis below. Then, we assume that $\nu_1 = \mathcal{O}(1)$, $\lambda \in (\nu_\rho, \nu_1)$, and $\lambda \to 0$ as $n \to +\infty$. These assumptions are standard in empirical risk minimization and kernel regression methods [11], which we focus on in Sections 4 and 5. We consider three decaying schemes of practical interest: the matrix $A$ has either a finite rank $\rho$, a $\kappa$-exponential decay where $\nu_j \asymp e^{-\kappa j}$ and $\kappa > 0$, or a $\beta$-polynomial decay where $\nu_j \asymp j^{-2\beta}$ and $\beta > 1/2$.
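To make the quantities in Corollary 1 concrete, here is a small numerical sketch (assuming $\mu = 1$ and taking the universal constant at its stated upper bound $c_0 = 36$) that evaluates $R_k(A)$ and the right-hand side of the relative-error bound (6); the function names and the synthetic matrix are illustrative assumptions.

```python
import numpy as np

def R_k(A, k):
    """Spectral tail R_k(A) = ( sigma_{k+1}^2 + (1/k) * sum_{j>=k+1} sigma_j^2 )^{1/2}."""
    s = np.linalg.svd(A, compute_uv=False)
    tail = s[k:]                                  # sigma_{k+1}, ..., sigma_rho
    return np.sqrt(tail[0] ** 2 + np.sum(tail ** 2) / k)

def corollary1_rhs(A, k, lam, mu=1.0, c0=36.0):
    """Relative-error level sqrt(mu/(2*lam)) * c0 * R_k(A) appearing in (6).
    The bound holds w.p. >= 1 - 12*exp(-k) when m = 2k and lam >= 2*mu*c0^2*R_k(A)^2."""
    return np.sqrt(mu / (2.0 * lam)) * c0 * R_k(A, k)

rng = np.random.default_rng(0)
# synthetic matrix whose rows are damped to mimic a polynomially decaying spectrum
A = rng.standard_normal((300, 800)) * (np.arange(1.0, 301.0) ** -1.5)[:, None]
for k in (5, 10, 20, 40):
    print(k, R_k(A, k), corollary1_rhs(A, k, lam=1e-2))
```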
Among other examples, these decays are characteristic of various standard kernel functions, such as the polynomial, Gaussian and first-order Sobolev kernels [3]. Given a precision $\varepsilon > 0$ and a confidence level $\eta \in (0, 1)$, we denote by $m_A$ (resp. $m_O$, $m_S$) a sufficient dimension for which adaptive (resp. oblivious, leverage-score) sketching yields the following $(\varepsilon, \eta)$-guarantee on the relative error: with probability at least $1 - \eta$, it holds that $\|\tilde{x} - x^\star\|_2 / \|x^\star\|_2 \le \varepsilon$. We determine $m_A$ from our probabilistic regret bound (6). For $m_S$, using our deterministic regret bound (5), it suffices to bound the spectral norm $\|P_{A^\top\tilde{S}}^\perp A^\top\|_2$ in terms of the eigenvalues $\nu_k$ when $\tilde{S}$ is a leverage-score column sub-sampling matrix. To the best of our knowledge, the tightest such bound has been given by [12] (see Lemma 5). For $m_O$, we leverage results from [32]. The authors provide an upper bound on the relative error $\|\tilde{x} - x^\star\|_2 / \|x^\star\|_2$ when $S$ is Gaussian i.i.d. with variance $1/d$. It should be noted that their sketched solution $\alpha^\star$ is slightly different from ours: they solve $\alpha^\star = \arg\min_\alpha f(AS\alpha) + \frac{\lambda}{2}\|\alpha\|_2^2$, whereas we do include the matrix $S$ in the regularization term. One might wonder which regularizer works best when $S$ is Gaussian i.i.d. Through extensive numerical simulations, we observed strongly similar performance. Further, standard Gaussian concentration results yield that $\|S\alpha\|_2^2 \approx \|\alpha\|_2^2$ in this case. Our theoretical findings are summarized in Table 1, and we give the mathematical details of our derivations in Appendix D. For the sake of clarity, we provide in Table 1 lower bounds on the predicted values $m_O$ and $m_S$, and, thus, lower bounds on the ratios $m_O/m_A$ and $m_S/m_A$. Overall, adaptive Gaussian sketching provides stronger guarantees on the relative error $\|\tilde{x} - x^\star\|_2 / \|x^\star\|_2$.

Table 1: Sketching dimensions for an $(\varepsilon, \eta)$-guarantee on the relative error $\|\tilde{x} - x^\star\|_2 / \|x^\star\|_2$.

| | $\rho$-rank matrix ($\rho \ll n \le d$) | $\kappa$-exponential decay ($\kappa > 0$) | $\beta$-polynomial decay ($\beta > 1/2$) |
|---|---|---|---|
| Adaptive Gaussian ($m_A$) | $\rho + 1 + \log\frac{12}{\eta}$ | $\kappa^{-1}\log\frac{1}{\lambda\varepsilon} + \log\frac{12}{\eta}$ | $\lambda^{-\frac{1}{2\beta}}\varepsilon^{-\frac{1}{\beta}} + \log\frac{12}{\eta}$ |
| Oblivious Gaussian ($m_O$) | $(\rho+1)\,\varepsilon^{-2}\log\frac{2\rho}{\eta}$ | $\kappa^{-1}\varepsilon^{-2}\log\frac{1}{\lambda}\,\log\frac{2d}{\eta}$ | $\lambda^{-\frac{1}{2\beta}}\varepsilon^{-2}\log\frac{2d}{\eta}$ |
| Leverage score ($m_S$) | $(\rho+1)\log\frac{4\rho}{\eta}$ | $\kappa^{-1}\log\frac{1}{\lambda\varepsilon}\,\log\frac{1}{\eta}$ | $\lambda^{-\frac{1}{\beta}}\,2^{\frac{\beta}{\beta-1}}\log\frac{1}{\eta}$ |
| Lower bound on $m_O/m_A$ | $\varepsilon^{-2}\log\rho$ | $\varepsilon^{-2+h}\log\frac{2d}{\eta}$, $h > 0$ | $\varepsilon^{\frac{1}{\beta}-2}\log\frac{2d}{\eta}$ |
| Lower bound on $m_S/m_A$ | $\log\rho$ | $\min\big\{\log\frac{1}{\eta},\,\kappa^{-1}\log\frac{1}{\lambda}\big\}$ | $\lambda^{-\frac{1}{2\beta}}\varepsilon^{\frac{1}{\beta}}\,2^{\frac{\beta}{\beta-1}}$ |

We illustrate numerically our predictions for adaptive Gaussian sketching versus oblivious Gaussian sketching. With $n = 1000$ and $d = 2000$, we generate matrices $A_{\mathrm{exp}}$ and $A_{\mathrm{poly}}$, with spectral decays satisfying respectively $\nu_j \asymp n\,e^{-0.1 j}$ and $\nu_j \asymp n\,j^{-2}$. First, we perform binary logistic regression, with $f(Ax) = n^{-1}\sum_{i=1}^n \ell_{y_i}(a_i^\top x)$, where $\ell_{y_i}(z) = y_i\log(1+e^{-z}) + (1-y_i)\log(1+e^{z})$, $y \in \{0,1\}^n$ and $a_i$ is the $i$-th row of $A$. For the polynomial (resp. exponential) decay, we expect the relative error $\|\tilde{x} - x^\star\|_2/\|x^\star\|_2$ to follow w.h.p. a decay proportional to $m^{-1}$ (resp. $e^{-0.05 m}$). Figure 1 confirms those predictions. We repeat the same experiments with a second loss function, $f(Ax) = (2n)^{-1}\sum_{i=1}^n \big[(a_i^\top x)_+^2 - 2\,(a_i^\top x)\,y_i\big]$. The latter is a convex relaxation of the penalty $\frac{1}{2}\|(Ax)_+ - y\|_2^2$ for fitting a shallow neural network with a ReLU non-linearity. Again, Figure 1 confirms our predictions, and we observe that the adaptive method performs much better than the oblivious sketch.
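As a hedged illustration of how the sketched logistic problem in these experiments can be solved in practice, the snippet below applies the adaptive sketch $S = A^\top\tilde{S}$ to the binary logistic loss above and recovers $\tilde{x} = -\lambda^{-1}A^\top\nabla f(AS\alpha^\star)$ with an off-the-shelf L-BFGS solver. It is a minimal sketch (it omits the re-scaling of Section 3.1), and the helper names and problem sizes are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def f_logistic(w, y):
    # f(w) = (1/n) * sum_i [ y_i*log(1+e^{-w_i}) + (1-y_i)*log(1+e^{w_i}) ]
    return np.mean(y * np.logaddexp(0.0, -w) + (1.0 - y) * np.logaddexp(0.0, w))

def grad_f_logistic(w, y):
    return (1.0 / (1.0 + np.exp(-w)) - y) / y.size      # (sigmoid(w) - y) / n

def adaptive_sketch_logistic(A, y, m, lam, seed=0):
    """Solve the sketched program (2) with S = A^T S_tilde, then recover x_tilde."""
    rng = np.random.default_rng(seed)
    S = A.T @ rng.standard_normal((A.shape[0], m))
    AS, StS = A @ S, S.T @ S
    obj = lambda a: f_logistic(AS @ a, y) + 0.5 * lam * a @ (StS @ a)
    jac = lambda a: AS.T @ grad_f_logistic(AS @ a, y) + lam * (StS @ a)
    alpha = minimize(obj, np.zeros(m), jac=jac, method="L-BFGS-B").x
    return -A.T @ grad_f_logistic(AS @ alpha, y) / lam   # x_tilde = -A^T grad f / lam

# usage on a small synthetic problem
rng = np.random.default_rng(1)
A = rng.standard_normal((400, 1000)) * (np.arange(1.0, 401.0) ** -1.0)[:, None]
y = (rng.uniform(size=400) < 0.5).astype(float)
x_tilde = adaptive_sketch_logistic(A, y, m=64, lam=1e-4)
```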
Figure 1: Relative error versus sketching dimension $m \in \{2^k \mid 3 \le k \le 10\}$ of adaptive Gaussian sketching (red) and oblivious Gaussian sketching (green), for the ReLU and logistic models and the exponential and polynomial decays: (a) ReLU polynomial, (b) ReLU exponential, (c) Logistic polynomial, (d) Logistic exponential. We use $\lambda = 10^{-4}$ for all simulations. Results are averaged over 10 trials. Bar plots show (twice) the empirical standard deviations.

3 Algorithms for adaptive subspace sketching

3.1 Numerical conditioning and generic algorithm

A standard quantity characterizing the capacity of a convex program to be solved efficiently is its condition number [6], which, for the primal (1) and the (adaptive) sketched program (2), is given respectively by
$$\kappa = \frac{\lambda + \sup_x \sigma_1\!\big(A^\top \nabla^2 f(Ax)\, A\big)}{\lambda + \inf_x \sigma_d\!\big(A^\top \nabla^2 f(Ax)\, A\big)}, \qquad
\kappa_S = \frac{\sup_\alpha \sigma_1\!\big(\tilde{S}^\top A\,(\lambda I + A^\top \nabla^2 f(AA^\top\tilde{S}\alpha)\, A)\, A^\top\tilde{S}\big)}{\inf_\alpha \sigma_m\!\big(\tilde{S}^\top A\,(\lambda I + A^\top \nabla^2 f(AA^\top\tilde{S}\alpha)\, A)\, A^\top\tilde{S}\big)}.$$
The latter can be significantly larger than $\kappa$, up to $\kappa_S \approx \frac{\sigma_1(\tilde{S}^\top AA^\top\tilde{S})}{\sigma_m(\tilde{S}^\top AA^\top\tilde{S})}\,\kappa \gg \kappa$. A simple change of variables overcomes this issue. With $A_{S,R} = AS(S^\top S)^{-\frac{1}{2}}$, we solve instead the optimization problem
$$\alpha_R^\star = \arg\min_{\alpha \in \mathbb{R}^m}\; f(A_{S,R}\,\alpha) + \frac{\lambda}{2}\|\alpha\|_2^2. \qquad (7)$$
It holds that $\tilde{x} = -\lambda^{-1}A^\top\nabla f(A_{S,R}\,\alpha_R^\star)$. The additional complexity induced by this change of variables comes from computing the (square-root) pseudo-inverse of $S^\top S$, which requires $\mathcal{O}(m^3)$ flops via a singular value decomposition. When $m$ is small, this additional computation is negligible and numerically stable, and the re-scaled sketched program (7) is actually better conditioned than the original primal program (1), as stated in the next result, which we prove in Appendix C.3.

Proposition 3. Under adaptive sketching, the condition number $\kappa_R$ of the re-scaled sketched program (7) satisfies $\kappa_R \le \kappa$ with probability 1.

Algorithm 1: Generic algorithm for adaptive sketching.
Input: Data matrix $A \in \mathbb{R}^{n\times d}$, random matrix $\tilde{S} \in \mathbb{R}^{n\times m}$ and parameter $\lambda > 0$.
1. Compute the sketching matrix $S = A^\top\tilde{S}$ and the sketched matrix $A_S = AS$.
2. Compute the re-scaling matrix $R = (S^\top S)^{-\frac{1}{2}}$ and the re-scaled sketched matrix $A_{S,R} = A_S R$.
3. Solve the convex optimization problem (7), and return $\tilde{x} = -\frac{1}{\lambda}A^\top\nabla f(A_{S,R}\,\alpha_R^\star)$.

We observed a drastic practical performance improvement between solving the sketched program as formulated in (2) and its well-conditioned version (7). If the chosen sketch dimension $m$ is itself prohibitively large for computing the matrix $(S^\top S)^{-\frac{1}{2}}$, one might consider a pre-conditioning matrix $Q$ which is faster to compute and such that the matrix $SQ$ is well-conditioned. Typically, one might compute a matrix $Q$ based on an approximate singular value decomposition of the matrix $S^\top S$. Then, one solves the optimization problem $\alpha_Q^\star = \arg\min_{\alpha\in\mathbb{R}^m} f(ASQ\,\alpha) + \frac{\lambda}{2}\|SQ\alpha\|_2^2$. Provided that $Q$ is invertible, it holds that $\tilde{x} = -\lambda^{-1}A^\top\nabla f(ASQ\,\alpha_Q^\star)$.

3.2 Error contraction and almost exact recovery of the optimal solution

The estimate $\tilde{x}$ satisfies a guarantee of the form $\|\tilde{x} - x^\star\|_2 \le \varepsilon\,\|x^\star\|_2$ w.h.p., with $\varepsilon < 1$ provided that $\lambda$ is large enough. Here, we extend Algorithm 1 to an iterative version which takes advantage of this error contraction, and which is relevant when a high-precision estimate of $x^\star$ is needed.

Algorithm 2: Iterative adaptive sketching.
Input: Data matrix $A \in \mathbb{R}^{n\times d}$, random matrix $\tilde{S} \in \mathbb{R}^{n\times m}$, number of iterations $T$, parameter $\lambda > 0$.
1. Compute the sketched matrix $A_{S,R}$ as in Algorithm 1. Set $\tilde{x}^{(0)} = 0$.
2. For $t = 1, 2, \dots, T$ do:
3. Compute $a^{(t)} = A\tilde{x}^{(t-1)}$ and $b^{(t)} = (S^\top S)^{-\frac{1}{2}} S^\top \tilde{x}^{(t-1)}$.
4. Solve the convex optimization problem
$$\alpha^{(t)} = \arg\min_{\alpha\in\mathbb{R}^m}\; f\big(A_{S,R}\,\alpha + a^{(t)}\big) + \frac{\lambda}{2}\big\|\alpha + b^{(t)}\big\|_2^2. \qquad (8)$$
5. Update the solution by $\tilde{x}^{(t)} = -\frac{1}{\lambda}A^\top\nabla f\big(A_{S,R}\,\alpha^{(t)} + a^{(t)}\big)$.
6. Return the last iterate $\tilde{x}^{(T)}$.

A key advantage is that, at each iteration, the same sketching matrix $S$ is used. Thus, the sketched matrix $A_{S,R}$ has to be computed only once, at the beginning of the procedure. The output $\tilde{x}^{(T)}$ satisfies the following recovery property, whose empirical benefits are illustrated in Figure 2.

Theorem 2. After $T$ iterations of Algorithm 2, provided that $\lambda \ge 2\mu Z_f^2$, it holds that
$$\big\|\tilde{x}^{(T)} - x^\star\big\|_2 \le \left(\frac{\mu Z_f^2}{2\lambda}\right)^{\frac{T}{2}} \|x^\star\|_2. \qquad (9)$$
Further, if $S = A^\top\tilde{S}$ where $\tilde{S} \in \mathbb{R}^{n\times m}$ has i.i.d. Gaussian entries and $m = 2k$ for some target rank $k \ge 2$, then, for some universal constant $c_0 \le 36$, after $T$ iterations of Algorithm 2, provided that $\lambda \ge 2c_0^2\mu R_k^2(A)$, the approximate solution $\tilde{x}^{(T)}$ satisfies, with probability at least $1 - 12e^{-k}$,
$$\big\|\tilde{x}^{(T)} - x^\star\big\|_2 \le \left(\frac{c_0^2\,\mu\, R_k^2(A)}{2\lambda}\right)^{\frac{T}{2}} \|x^\star\|_2. \qquad (10)$$

Figure 2: Relative error versus sketching dimension $m \in \{2^k \mid 3 \le k \le 10\}$ of adaptive Gaussian sketching for (a) the iterative method (Algorithm 2, $T \in \{1, 2, 3\}$) and (b) the power method (see Remark 2, $q \in \{0, 1, 2\}$). We use the MNIST dataset with images mapped through 10000-dimensional random Fourier features [20] for even-vs-odd classification using the binary logistic loss, and $\lambda = 10^{-5}$. Results are averaged over 20 trials. Bar plots show (twice) the empirical standard deviations.

Remark 2. An immediate extension of Algorithms 1 and 2 consists in using the power method [15]. Given $q \in \mathbb{N}$, one uses the sketching matrix $S = (A^\top A)^q A^\top\tilde{S}$. The larger $q$, the smaller the approximation error $\|AA^\top - AS(S^\top S)^\dagger S^\top A^\top\|_2$ (see Corollary 10.10 in [15]). Of practical interest are data matrices $A$ with a spectral profile starting with a fast decay and then becoming flat. This happens typically for $A$ of the form $A = \bar{A} + W$, where $\bar{A}$ has a fast spectral decay and $W$ is a noise matrix with, for instance, independent sub-Gaussian rows [26]. Our results easily extend to this setting, and we illustrate its empirical benefits in Figure 2.

4 Application to empirical risk minimization and kernel methods

By the representer theorem, the primal program (1) can be re-formulated as
$$w^\star \in \arg\min_{w\in\mathbb{R}^n}\; f(Kw) + \frac{\lambda}{2}\, w^\top K w, \qquad (11)$$
where $K = AA^\top$. Clearly, it holds that $x^\star = A^\top w^\star$. Given a matrix $\tilde{S}$ with i.i.d. Gaussian entries, we consider the sketched version of the kernelized primal program (11),
$$\alpha^\star \in \arg\min_{\alpha\in\mathbb{R}^m}\; f(K\tilde{S}\alpha) + \frac{\lambda}{2}\,\alpha^\top\tilde{S}^\top K\tilde{S}\,\alpha. \qquad (12)$$
The sketched program (12) is exactly our adaptive Gaussian sketched program (2). Thus, setting $\tilde{w} = -\lambda^{-1}\nabla f(K\tilde{S}\alpha^\star)$, it holds that $\tilde{x} = A^\top\tilde{w}$. Since the relative error $\|\tilde{x} - x^\star\|_2/\|x^\star\|_2$ is controlled by the decay of the eigenvalues of $K$, so is the relative error $\|A^\top(\tilde{w} - w^\star)\|_2 / \|A^\top w^\star\|_2$. More generally, the latter statements remain true if $K$ is any positive semi-definite matrix and if we replace $A$ by any square-root matrix of $K$. Here, we denote $Z_f \equiv Z_f\big(K^{\frac{1}{2}}, K^{\frac{1}{2}}\tilde{S}\big)$ (see Eq. (3)).

Theorem 3. Let $K \in \mathbb{R}^{n\times n}$ be any positive semi-definite matrix. Let $w^\star$ be any minimizer of the kernel program (11) and $\alpha^\star$ be any minimizer of its sketched version (12). Define the approximate solution $\tilde{w} = -\frac{1}{\lambda}\nabla f(K\tilde{S}\alpha^\star)$. If $\lambda \ge 2\mu Z_f^2$, then it holds that
$$\big\|K^{\frac{1}{2}}(\tilde{w} - w^\star)\big\|_2 \le \sqrt{\frac{\mu}{2\lambda}}\; Z_f\; \big\|K^{\frac{1}{2}} w^\star\big\|_2. \qquad (13)$$

For a positive definite kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ and a data matrix $A = [a_1, \dots, a_n]^\top \in \mathbb{R}^{n\times d}$, let $K$ be the empirical kernel matrix, with $K_{ij} = k(a_i, a_j)$. Let $\varphi(\cdot) \in \mathbb{R}^D$ be a random feature map [20, 8], such as random Fourier features or a random convolutional neural network.
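To illustrate the kernel formulation, here is a minimal numerical sketch for the quadratic case $f(w) = \frac{1}{2}\|w - y\|_2^2$ (kernel ridge regression, $\mu = 1$), where both (11) and (12) have closed forms and the error is measured in the $K^{1/2}$-norm of Theorem 3; the RBF kernel, dimensions and variable names are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, gamma):
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

rng = np.random.default_rng(0)
n, m, lam, gamma = 800, 40, 1e-3, 0.5
X = rng.standard_normal((n, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)
K = rbf_kernel(X, gamma)

# exact kernel solution of (11) for f(w) = 0.5*||w - y||^2:  (K + lam*I) w = y
w_star = np.linalg.solve(K + lam * np.eye(n), y)

# sketched kernel program (12) with Gaussian S_tilde, then w_tilde = (y - K S a*) / lam
S = rng.standard_normal((n, m))
KS = K @ S
alpha = np.linalg.solve(KS.T @ KS + lam * S.T @ KS, KS.T @ y)
w_tilde = (y - KS @ alpha) / lam

num = np.sqrt((w_tilde - w_star) @ K @ (w_tilde - w_star))
den = np.sqrt(w_star @ K @ w_star)
print("relative error in the K^{1/2}-norm:", num / den)
```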
We are interested in the computational complexity of forming the sketched versions of the primal (1), the kernel primal (11), and the primal (1) with $\varphi(A)$ in place of $A$. We compare the complexities of adaptive sketching, oblivious sketching and uniform column sub-sampling. Table 2 shows that all three methods have similar complexities for computing $AS$ and $\varphi(A)S$. Adaptive sketching exhibits an additional factor 2 that comes from computing the correlated sketching matrices $S = A^\top\tilde{S}$ and $S = \varphi(A)^\top\tilde{S}$. In practice, the latter is negligible compared to the cost of forming $\varphi(A)$ which, for instance, corresponds to a forward pass over the whole dataset in the case of a convolutional neural network. On the other hand, uniform column sub-sampling is significantly faster at forming the sketched kernel matrix $K\tilde{S}$, which relates to the well-known computational advantages of kernel Nyström methods [30].

Table 2: Complexity of forming the sketched programs, given $A \in \mathbb{R}^{n\times d}$. We denote by $d_k$ the number of flops to evaluate the kernel product $k(a, a')$, and by $d_\varphi$ the number of flops for a forward pass $\varphi(a)$. Note that these complexities could be reduced through parallelization.

| | $AS$ | $\varphi(A)S$ | $K\tilde{S}$ |
|---|---|---|---|
| Adaptive sketching | $\mathcal{O}(2mdn)$ | $\mathcal{O}(d_\varphi n) + \mathcal{O}(2mDn)$ | $\mathcal{O}(d_k n^2) + \mathcal{O}(mn^2)$ |
| Oblivious sketching | $\mathcal{O}(mdn)$ | $\mathcal{O}(d_\varphi n) + \mathcal{O}(mDn)$ | -- |
| Uniform column sub-sampling | $\mathcal{O}(mdn)$ | $\mathcal{O}(d_\varphi n) + \mathcal{O}(mDn)$ | $\mathcal{O}(d_k nm)$ |

5 Numerical evaluation of adaptive Gaussian sketching

We evaluate Algorithm 1 on MNIST and CIFAR10. First, we aim to show that the sketching dimension can be considerably smaller than the original dimension while retaining (almost) the same test classification accuracy. Second, we aim to obtain significant speed-ups in reaching a high-accuracy classifier. To solve the primal program (1), we use two standard algorithms, stochastic gradient descent (SGD) with (best) fixed step size and the stochastic variance reduced gradient method (SVRG) [16] with (best) fixed step size and update frequency of the gradient correction. To solve the adaptive sketched program (2), we use SGD, SVRG and the sub-sampled Newton method [4, 10], which we refer to as Sketch-SGD, Sketch-SVRG and Sketch-Newton. The latter is well-suited to the sketched program, as the low-dimensional Hessian matrix can be quickly inverted at each iteration. For both datasets, we use 50000 training and 10000 testing images. We transform each image using a random Fourier feature map $\varphi(\cdot) \in \mathbb{R}^D$, i.e., $\langle\varphi(a), \varphi(a')\rangle \approx \exp\!\big(-\gamma\|a - a'\|_2^2\big)$ [20, 18]. For MNIST and CIFAR10, we choose respectively $D = 10000$, $\gamma = 0.02$ and $D = 60000$, $\gamma = 0.002$, so that the primal is respectively 10000-dimensional and 60000-dimensional. Then, we train a classifier via a sequence of binary logistic regressions, using a one-vs-all procedure; this allows for efficient computation of the Hessian and implementation of the Sketch-Newton algorithm. First, we evaluate the test classification error of $\tilde{x}$. We solve to optimality the primal and sketched programs for values of $\lambda \in \{10^{-4}, 5\cdot10^{-5}, 10^{-5}, 5\cdot10^{-6}\}$ and sketching dimensions $m \in \{64, 128, 256, 512, 1024\}$. The results, averaged over 20 trials for MNIST and 10 trials for CIFAR10, are reported in Table 3; empirical variances are reported in Appendix A. Overall, the adaptive sketched program yields a high-accuracy classifier for most pairs $(\lambda, m)$. Further, we match the best primal classifier with values of $m$ as small as 256 for MNIST and 512 for CIFAR10, which respectively corresponds to a dimension reduction by a factor of 40 and 120.
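The random Fourier feature map used here can be written in a few lines; the snippet below is a standard implementation of the map for the Gaussian kernel (assumed, not taken from our experiment code), followed by the adaptive sketch of the resulting feature matrix as accounted for in the $\varphi(A)S$ column of Table 2. The dimensions are illustrative.

```python
import numpy as np

def random_fourier_features(X, D, gamma, seed=0):
    """phi(X) in R^{n x D} with <phi(a), phi(a')> ~= exp(-gamma * ||a - a'||_2^2)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], D)) * np.sqrt(2.0 * gamma)  # freqs ~ N(0, 2*gamma*I)
    phase = rng.uniform(0.0, 2.0 * np.pi, size=D)                    # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + phase)

# adaptive sketch of the feature matrix, matching the phi(A)S entry of Table 2
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 20))                  # raw inputs (n x d), sizes illustrative
Phi = random_fourier_features(A, D=2000, gamma=0.02)
S_tilde = rng.standard_normal((Phi.shape[0], 64))   # n x m Gaussian matrix
S = Phi.T @ S_tilde                                 # correlated sketching matrix, O(m*D*n)
PhiS = Phi @ S                                      # sketched feature matrix, another O(m*D*n)
```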
These results additionally suggest that adaptive Gaussian sketching introduces an implicit regularization effect, which might be related to the benefits of spectral cut-off estimators. For instance, on CIFAR10, using $\lambda = 10^{-5}$ and $m = 512$, we obtain an improvement in test accuracy of more than 2% compared to $x^\star$. Further, above a sketching dimension threshold (below which the performance is poor), the test classification error of $\tilde{x}$ increases with $m$ toward that of $x^\star$, until matching it. We also evaluate the test classification error of two sketching baselines: oblivious Gaussian sketching, for which the matrix $S$ has i.i.d. Gaussian entries, and adaptive column sub-sampling (Nyström method), for which $S = A^\top\tilde{S}$ with $\tilde{S}$ a column sub-sampling matrix. As reported in Table 4, adaptive Gaussian sketching performs better over a wide range of values of the sketching size $m$ and the regularization parameter $\lambda$.

Table 3: Test classification error of adaptive Gaussian sketching on MNIST and CIFAR10 datasets.

| $\lambda$ | $x^\star$ (MNIST) | $\tilde{x}_{64}$ | $\tilde{x}_{128}$ | $\tilde{x}_{256}$ | $\tilde{x}_{512}$ | $\tilde{x}_{1024}$ | $x^\star$ (CIFAR) | $\tilde{x}_{64}$ | $\tilde{x}_{128}$ | $\tilde{x}_{256}$ | $\tilde{x}_{512}$ | $\tilde{x}_{1024}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $10^{-4}$ | 5.4 | 4.8 | 4.8 | 5.2 | 5.3 | 5.4 | - | - | - | - | - | - |
| $5\cdot10^{-5}$ | 4.6 | 4.1 | 3.8 | 4.0 | 4.3 | 4.5 | 51.6 | 52.1 | 50.5 | 50.6 | 50.8 | 51.0 |
| $10^{-5}$ | 2.8 | 8.1 | 3.4 | 2.4 | 2.5 | 2.8 | 48.2 | 60.1 | 54.5 | 47.7 | 45.9 | 46.2 |
| $5\cdot10^{-6}$ | 2.5 | 11.8 | 4.9 | 2.8 | 2.6 | 2.4 | 47.6 | 63.6 | 59.8 | 51.9 | 47.7 | 45.8 |

Table 4: Test classification error on MNIST and CIFAR10. "AG": adaptive Gaussian sketch, "Ob": oblivious Gaussian sketch, "N": Nyström method.

| $\lambda$ | $x^\star$ (MNIST) | $\tilde{x}^{AG}_{256}$ | $\tilde{x}^{AG}_{1024}$ | $\tilde{x}^{Ob}_{256}$ | $\tilde{x}^{Ob}_{1024}$ | $\tilde{x}^{N}_{256}$ | $\tilde{x}^{N}_{1024}$ |
|---|---|---|---|---|---|---|---|
| $5\cdot10^{-5}$ | 4.6% | 4.0% | 4.5% | 25.2% | 8.5% | 5.0% | 4.6% |
| $5\cdot10^{-6}$ | 2.5% | 2.8% | 2.4% | 30.1% | 9.4% | 3.0% | 2.7% |

| $\lambda$ | $x^\star$ (CIFAR) | $\tilde{x}^{AG}_{256}$ | $\tilde{x}^{AG}_{1024}$ | $\tilde{x}^{Ob}_{256}$ | $\tilde{x}^{Ob}_{1024}$ | $\tilde{x}^{N}_{256}$ | $\tilde{x}^{N}_{1024}$ |
|---|---|---|---|---|---|---|---|
| $5\cdot10^{-5}$ | 51.6% | 50.6% | 51.0% | 88.2% | 70.5% | 55.8% | 53.1% |
| $5\cdot10^{-6}$ | 47.6% | 51.9% | 45.8% | 88.9% | 80.1% | 57.2% | 55.8% |

Then, we compare the test classification error versus wall-clock time of the optimization algorithms mentioned above. Figure 3 shows results for some values of $m$ and $\lambda$. We observe speed-ups on the 10000-dimensional MNIST problem, in particular for Sketch-SGD and for Sketch-SVRG, for which computing the gradient correction is relatively fast. Such speed-ups are even more significant on the 60000-dimensional CIFAR10 problem, especially for Sketch-Newton. A few iterations of Sketch-Newton suffice to almost reach the minimizer $\tilde{x}$, with a per-iteration time which is relatively small thanks to the dimensionality reduction. Hence, it is more than 10 times faster to reach the best test accuracy using the sketched program. In addition to the random Fourier feature mapping, we carry out another set of experiments on the CIFAR10 dataset, in which we pre-process the images. That is, similarly to [24, 21], we map each image through a random convolutional layer. Then, we kernelize these processed images using a Gaussian kernel with $\gamma = 2\cdot10^{-5}$. Using our implementation, the best test accuracy of the kernel primal program (11) we obtained is 73.1%. Sketch-SGD, Sketch-SVRG and Sketch-Newton applied to the sketched kernel program (12) match this test accuracy, with significant speed-ups, as reported in Figure 3.

Figure 3: Test classification error (percentage) versus wall-clock time (seconds) for SGD, SVRG, Sketch-SGD, Sketch-SVRG and Sketch-Newton: (a) MNIST, $m = 512$, $\lambda = 10^{-5}$; (b) CIFAR, $m = 256$, $\lambda = 10^{-5}$; (c) CIFAR (random layer), $m = 1024$, $\lambda = 5\cdot10^{-6}$.
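Our Sketch-Newton runs use the sub-sampled Newton method of [4, 10] on the sketched program; the snippet below is a simplified, hedged sketch that instead takes exact Newton steps on the re-scaled sketched program (7) for the binary logistic loss, just to illustrate why the $m \times m$ sketched Hessian is cheap to form and factor. All function names and problem sizes are illustrative.

```python
import numpy as np

def sketch_newton_logistic(A, y, m, lam, iters=15, seed=0):
    """Exact Newton steps on the re-scaled sketched program (7), binary logistic loss."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    S = A.T @ rng.standard_normal((n, m))               # adaptive sketch S = A^T S_tilde
    _, sv, Vt = np.linalg.svd(S, full_matrices=False)
    R = Vt.T @ np.diag(1.0 / sv) @ Vt                   # R = (S^T S)^{-1/2}
    ASR = A @ (S @ R)                                   # re-scaled sketched matrix A_{S,R}
    alpha = np.zeros(m)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(ASR @ alpha)))        # sigmoid of the margins
        g = ASR.T @ ((p - y) / n) + lam * alpha         # gradient of (7)
        w = p * (1.0 - p) / n                           # diagonal of nabla^2 f
        H = ASR.T @ (w[:, None] * ASR) + lam * np.eye(m)   # m x m Hessian: cheap to factor
        alpha -= np.linalg.solve(H, g)
    p = 1.0 / (1.0 + np.exp(-(ASR @ alpha)))
    return -A.T @ ((p - y) / n) / lam                   # recover x_tilde = -A^T grad f / lam

# usage: A could be the random Fourier feature matrix phi(A) from the experiments above
rng = np.random.default_rng(2)
A = rng.standard_normal((1000, 3000)) * (np.arange(1.0, 1001.0) ** -0.5)[:, None]
y = (rng.uniform(size=1000) < 0.5).astype(float)
x_tilde = sketch_newton_logistic(A, y, m=128, lam=1e-4)
```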
Acknowledgements

This work was partially supported by the National Science Foundation under grant IIS-1838179 and the Office of Naval Research, ONR YIP Program, under contract N00014-17-1-2433.

References

[1] A. Alaoui and M. W. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems, pages 775–783, 2015.
[2] F. Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory, pages 185–209, 2013.
[3] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic, 2004.
[4] R. Bollapragada, R. H. Byrd, and J. Nocedal. Exact and inexact subsampled Newton methods for optimization. IMA Journal of Numerical Analysis, 39(2):545–578, 2018.
[5] C. Boutsidis and A. Gittens. Improved matrix algorithms via the subsampled randomized Hadamard transform. SIAM Journal on Matrix Analysis and Applications, 34(3):1301–1340, 2013.
[6] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[7] E. J. Candes, J. K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(8):1207–1223, 2006.
[8] A. Coates and A. Y. Ng. Learning feature representations with k-means. In Neural Networks: Tricks of the Trade, pages 561–580. Springer, 2012.
[9] P. Drineas and M. W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(Dec):2153–2175, 2005.
[10] M. A. Erdogdu and A. Montanari. Convergence rates of sub-sampled Newton methods. arXiv preprint arXiv:1508.02810, 2015.
[11] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer, 2001.
[12] A. Gittens and M. W. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. The Journal of Machine Learning Research, 17(1):3977–4041, 2016.
[13] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[14] R. M. Gower and P. Richtárik. Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015.
[15] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[16] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
[17] S. Kumar, M. Mohri, and A. Talwalkar. Sampling methods for the Nyström method. Journal of Machine Learning Research, 13(Apr):981–1006, 2012.
[18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[19] M. Pilanci and M. J. Wainwright. Randomized sketches of convex programs with sharp guarantees. IEEE Transactions on Information Theory, 61(9):5096–5115, 2015.
[20] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
[21] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do CIFAR-10 classifiers generalize to CIFAR-10?
arXiv preprint arXiv:1806.00451, 2018.
[22] R. T. Rockafellar. Convex Analysis. Princeton University Press, 2015.
[23] A. J. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199–222, 2004.
[24] S. Tu, R. Roelofs, S. Venkataraman, and B. Recht. Large scale kernel learning using block coordinate descent. arXiv preprint arXiv:1602.05310, 2016.
[25] S. S. Vempala. The Random Projection Method, volume 65. American Mathematical Society, 2005.
[26] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
[27] K. Vu, P.-L. Poirion, C. D'Ambrosio, and L. Liberti. Random projections for trust region subproblems. arXiv preprint arXiv:1706.02730, 2017.
[28] C. K. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.
[29] R. Witten and E. Candes. Randomized algorithms for low-rank matrix factorizations: sharp performance bounds. Algorithmica, 72(1):264–281, 2015.
[30] T. Yang, Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems, pages 476–484, 2012.
[31] Y. Yang, M. Pilanci, M. J. Wainwright, et al. Randomized sketches for kernels: Fast and optimal nonparametric regression. The Annals of Statistics, 45(3):991–1023, 2017.
[32] L. Zhang, M. Mahdavi, R. Jin, T. Yang, and S. Zhu. Recovering the optimal solution by dual random projection. In Conference on Learning Theory, pages 135–157, 2013.
[33] L. Zhang, M. Mahdavi, R. Jin, T. Yang, and S. Zhu. Random projections for classification: A recovery approach. IEEE Transactions on Information Theory, 60(11):7300–7316, 2014.