# Scalable Adaptive Stochastic Optimization Using Random Projections

Gabriel Krummenacher (gabriel.krummenacher@inf.ethz.ch), Brian McWilliams (brian@disneyresearch.com), Yannic Kilcher (yannic.kilcher@inf.ethz.ch), Joachim M. Buhmann (jbuhmann@inf.ethz.ch), Nicolai Meinshausen (meinshausen@stat.math.ethz.ch)

Institute for Machine Learning, Department of Computer Science, ETH Zürich, Switzerland; Seminar for Statistics, Department of Mathematics, ETH Zürich, Switzerland; Disney Research, Zürich, Switzerland

Authors contributed equally. 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Adaptive stochastic gradient methods such as ADAGRAD have gained popularity in particular for training deep neural networks. The most commonly used and studied variant maintains a diagonal matrix approximation to second order information by accumulating past gradients, which are used to tune the step size adaptively. In certain situations the full-matrix variant of ADAGRAD is expected to attain better performance, but in high dimensions it is computationally impractical. We present ADA-LR and RADAGRAD, two computationally efficient approximations to full-matrix ADAGRAD based on randomized dimensionality reduction. They are able to capture dependencies between features and achieve similar performance to full-matrix ADAGRAD but at a much smaller computational cost. We show that the regret of ADA-LR is close to the regret of full-matrix ADAGRAD, which can have an up to exponentially smaller dependence on the dimension than the diagonal variant. Empirically, we show that ADA-LR and RADAGRAD perform similarly to full-matrix ADAGRAD. On the tasks of training convolutional and recurrent neural networks, RADAGRAD achieves faster convergence than diagonal ADAGRAD.

## 1 Introduction

Recently, so-called adaptive stochastic optimization algorithms have gained popularity for large-scale convex and non-convex optimization problems. Among these, ADAGRAD [9] and its variants [21] have received particular attention and have proven to be among the most successful algorithms for training deep networks. Although these problems are inherently highly non-convex, recent work has begun to explain the success of such algorithms [3].

ADAGRAD adaptively sets the learning rate for each dimension by means of a time-varying proximal regularizer. The most commonly studied and utilised version considers only a diagonal matrix proximal term. As such it incurs almost no additional computational cost over standard stochastic gradient descent (SGD). However, when the data has low effective rank the regret of ADAGRAD may have a much worse dependence on the dimensionality of the problem than its full-matrix variant (which we refer to as ADA-FULL). Such settings are common in high dimensional data where there are many correlations between features, and can also be observed in the convolutional layers of neural networks. The computational cost of ADA-FULL is substantially higher than that of ADAGRAD: it requires computing the inverse square root of the matrix of gradient outer products to evaluate the proximal term, which grows with the cube of the dimension. As such it is rarely used in practice. In this work we propose two methods that approximate the proximal term used in ADA-FULL, drastically reducing computational and storage complexity with little adverse effect on optimization performance.
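As a point of reference for the methods developed below, the diagonal variant described above only accumulates one scalar per coordinate. The following is a minimal numpy sketch of such an update (our own illustration; the variable names and the choice of $\delta$ are not taken from the paper):

```python
import numpy as np

def adagrad_diag_step(beta, g, G_diag, eta=0.1, delta=1e-8):
    """One step of diagonal ADAGRAD (sketch).

    beta   : parameter vector, shape (p,)
    g      : stochastic (sub)gradient at beta, shape (p,)
    G_diag : running sum of squared gradient coordinates, shape (p,)
    """
    G_diag = G_diag + g ** 2                 # accumulate diag(G_t) coordinate-wise
    step = eta / (delta + np.sqrt(G_diag))   # per-coordinate adaptive learning rate
    return beta - step * g, G_diag
```

Each step costs $O(p)$ time and memory, which is why the diagonal variant adds essentially no overhead over SGD.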
First, in Section 3.1 we develop ADA-LR, a simple approximation using random projections. This procedure reduces the computational complexity of ADA-FULL by a factor of $p$ but retains similar theoretical guarantees. In Section 3.2 we systematically profile the most computationally expensive parts of ADA-LR and introduce further randomized approximations, resulting in a truly scalable algorithm, RADAGRAD. In Section 3.3 we outline a simple modification to RADAGRAD which reduces the variance of the stochastic gradients and greatly improves practical performance. Finally, we perform an extensive comparison of RADAGRAD with several widely used optimization algorithms on a variety of deep learning tasks. For image recognition with convolutional networks and language modeling with recurrent neural networks we find that RADAGRAD, and in particular its variance-reduced variant, achieves faster convergence.

### 1.1 Related work

Motivated by the problem of training deep neural networks, very recently many new adaptive optimization methods have been proposed. Most computationally efficient among these are first order methods similar in spirit to ADAGRAD, which suggest alternative normalization factors [21, 28, 6]. Several authors propose efficient stochastic variants of classical second order methods such as L-BFGS [5, 20]. Efficient algorithms exist to update the inverse of the Hessian approximation by applying the matrix-inversion lemma or by directly updating the Hessian-vector product using the double-loop algorithm, but these are not applicable to ADAGRAD-style algorithms. In the convex setting these methods can show great theoretical and practical benefit over first order methods, but they have yet to be extensively applied to training deep networks.

On a different note, the growing zoo of variance reduced SGD algorithms [19, 7, 18] has shown vastly superior performance to ADAGRAD-style methods for standard empirical risk minimization and convex optimization. Recent work has aimed to move these methods into the non-convex setting [1]. Notably, [22] combine variance reduction with second order methods.

Most similar to RADAGRAD are methods which propose factorized approximations of second order information. Several methods focus on the natural gradient method [2], which leverages second order information through the Fisher information matrix. [14] approximate the inverse Fisher matrix using a sparse graphical model. [8] use low-rank approximations whereas [26] propose an efficient Kronecker product based factorization. Concurrently with this work, [12] propose a randomized preconditioner for SGD. However, their approach requires access to all of the data at once in order to compute the preconditioning matrix, which is impractical for training deep networks. [23] propose a theoretically motivated algorithm similar to ADA-LR and a faster alternative based on Oja's rule to update the SVD.

Fast random projections. Random projections are low-dimensional embeddings $\Pi: \mathbb{R}^p \to \mathbb{R}^\tau$ which preserve, up to a small distortion, the geometry of a subspace of vectors. We concentrate on the class of structured random projections, among which the Subsampled Randomized Fourier Transform (SRFT) has particularly attractive properties [15]. The SRFT consists of a preconditioning step after which $\tau$ columns of the new matrix are subsampled uniformly at random as $\Pi = \sqrt{p/\tau}\, S \Theta D$ with the definitions:
(i) $S \in \mathbb{R}^{\tau \times p}$ is a subsampling matrix.
(ii) $D \in \mathbb{R}^{p \times p}$ is a diagonal matrix whose entries are drawn independently from $\{-1, 1\}$.
(iii) $\Theta \in \mathbb{R}^{p \times p}$ is a unitary discrete Fourier transform (DFT) matrix.

This formulation allows very fast implementations using the fast Fourier transform (FFT), for example using the popular FFTW package (http://www.fftw.org/). Applying the FFT to a $p$-dimensional vector can be achieved in $O(p \log \tau)$ time. Similar structured random projections have gained popularity as a way to speed up [24] and robustify [27] large-scale linear regression, and for distributed estimation [17, 16].

### 1.2 Problem setting

The problem considered by [9] is online stochastic optimization where the goal is, at each step, to predict a point $\beta_t \in \mathbb{R}^p$ which achieves low regret with respect to a fixed optimal predictor, $\beta^{opt}$, for a sequence of (convex) functions $F_t(\beta)$. After $T$ rounds, the regret can be defined as $R(T) = \sum_{t=1}^T F_t(\beta_t) - \sum_{t=1}^T F_t(\beta^{opt})$. Initially, we will consider functions $F_t$ of the form $F_t(\beta) := f_t(\beta) + \varphi(\beta)$ where $f_t$ and $\varphi$ are convex loss and regularization functions respectively. Throughout, the vector $g_t \in \partial f_t(\beta_t)$ refers to a particular subgradient of the loss function. Standard first order methods update $\beta_t$ at each step by moving in the opposite direction of $g_t$ according to a step-size parameter, $\eta$.

The ADAGRAD family of algorithms [9] instead uses an adaptive learning rate which can be different for each feature. This is controlled using a time-varying proximal term which we briefly review. Defining $G_t = \sum_{i=1}^t g_i g_i^\top$ and $H_t = \delta I_p + (G_{t-1} + g_t g_t^\top)^{1/2}$, the ADA-FULL proximal term is given by
$$\psi_t(\beta) = \tfrac{1}{2} \langle \beta, H_t \beta \rangle.$$
Clearly, when $p$ is large, constructing $G_t$ and finding its root and inverse at each iteration is impractical. In practice, rather than the full outer product matrix, ADAGRAD uses a proximal function consisting of the diagonal of $G_t$:
$$\psi_t(\beta) = \tfrac{1}{2} \big\langle \beta, \big(\delta I_p + \operatorname{diag}(G_t)^{1/2}\big) \beta \big\rangle.$$
Although the diagonal proximal term is computationally cheaper, it is unable to capture dependencies between coordinates in the gradient terms. Despite this, ADAGRAD has been found to perform very well empirically. One reason for this is that modern high-dimensional datasets are typically also very sparse. Under these conditions, coordinates in the gradient are approximately independent.
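For contrast with the diagonal variant, the full-matrix proximal term defined above leads to the following update, sketched here in numpy (a hedged illustration, not the paper's implementation; an eigendecomposition is used to form the inverse square root):

```python
import numpy as np

def ada_full_step(beta, g, G, eta=0.1, delta=1e-8):
    """One step of ADA-FULL (sketch): beta <- beta - eta * H_t^{-1} g
    with H_t = delta * I + G_t^{1/2} and G_t = G_{t-1} + g g^T."""
    G = G + np.outer(g, g)                      # full outer-product accumulation: O(p^2) memory
    lam, V = np.linalg.eigh(G)                  # O(p^3) eigendecomposition of the p x p matrix
    lam = np.clip(lam, 0.0, None)               # guard against tiny negative eigenvalues
    inv_root = V @ ((1.0 / (np.sqrt(lam) + delta))[:, None] * V.T)  # (delta*I + G^{1/2})^{-1}
    return beta - eta * inv_root @ g, G
```

The $O(p^3)$ factorization at every iteration is exactly what makes ADA-FULL impractical for large $p$, and it is the step that the randomized approximations of Section 3 replace.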
## 2 Stochastic optimization in high dimensions

ADAGRAD has attractive theoretical and empirical properties and adds essentially no overhead above a standard first order method such as SGD. This begs the question of what we might hope to gain by introducing additional computational complexity. In order to motivate our contribution, we first present an analogue of the discussion in [10], focusing on the setting where data is high-dimensional and dense. We argue that if the data has low-rank (rather than sparse) structure, ADA-FULL can effectively adapt to the intrinsic dimensionality. We also show in Section 3.1 that ADA-LR has the same property. First, we review the theoretical properties of ADAGRAD algorithms, borrowing the $g_{1:T,j}$ notation of [9].

Proposition 1. ADAGRAD and ADA-FULL achieve the following regret (Corollaries 6 & 11 from [9]) respectively:
$$R_D(T) \le 2\|\beta^{opt}\|_\infty \sum_{j=1}^p \|g_{1:T,j}\| + \delta\|\beta^{opt}\|_1, \qquad R_F(T) \le 2\|\beta^{opt}\| \operatorname{tr}\big(G_T^{1/2}\big) + \delta\|\beta^{opt}\|. \quad (1)$$

The major difference between $R_D(T)$ and $R_F(T)$ is the inclusion of the final diagonal and full-matrix proximal terms, respectively. Under a sparse data generating distribution ADAGRAD achieves an up to exponential improvement over SGD which is optimal in a minimax sense [10]. While data sparsity is often observed in practice in high-dimensional datasets (particularly web/text data), many other problems are dense. Furthermore, in practice applying ADAGRAD to dense data results in a learning rate which tends to decay too rapidly. It is therefore natural to ask how dense data affects the performance of ADA-FULL.

For illustration, consider data points $x_i$ sampled i.i.d. from a Gaussian distribution $P_X = \mathcal{N}(0, \Sigma)$. The resulting variables will clearly be dense. A common feature of high dimensional data is low effective rank, defined for a matrix $\Sigma$ as $r(\Sigma) = \operatorname{tr}(\Sigma)/\|\Sigma\| \le \operatorname{rank}(\Sigma) \le p$. Low effective rank implies that $r \ll p$ and therefore the eigenvalues of the covariance matrix decay quickly. We will consider distributions parameterised by covariance matrices $\Sigma$ with eigenvalues $\lambda_j(\Sigma) = \lambda_0 j^{-\alpha}$ for $j = 1, \ldots, p$. Functions of the form $F_t(\beta) = F_t(\beta^\top x_t)$ have gradients which satisfy $\|g_t\| \le M\|x_t\|$. For example, the least squares loss $F_t(\beta^\top x_t) = \frac{1}{2}(y_t - \beta^\top x_t)^2$ has gradient $g_t = x_t(y_t - x_t^\top \beta_t) = x_t \varepsilon_t$, such that $|\varepsilon_t| \le M$.

Let us consider the effect of distributions parametrised by $\Sigma$ on the proximal terms of full-matrix and diagonal ADAGRAD. Plugging $X$ into the proximal terms of (1) and taking expectations with respect to $P_X$ we obtain, for ADAGRAD and ADA-FULL respectively,
$$\mathbb{E}\Big[\sum_{j=1}^p \|g_{1:T,j}\|\Big] \le M \sum_{j=1}^p \sqrt{\mathbb{E}\Big[\sum_{t=1}^T x_{t,j}^2\Big]} \le M p\sqrt{T\lambda_0}, \qquad \mathbb{E}\Big[\operatorname{tr}\Big(\big(\sum_{t=1}^T g_t g_t^\top\big)^{1/2}\Big)\Big] \le M\sqrt{T\lambda_0}\sum_{j=1}^p j^{-\alpha/2}, \quad (2)$$
where the first inequality is from Jensen and the second from noticing that the sum of $T$ squared Gaussian random variables is a $\chi^2$ random variable. We can consider the effect of a fast-decaying spectrum: for $\alpha \ge 2$, $\sum_{j=1}^p j^{-\alpha/2} = O(\log p)$ and for $\alpha \in (1,2)$, $\sum_{j=1}^p j^{-\alpha/2} = O(p^{1-\alpha/2})$.

When the data (and thus the gradients) are dense yet have low effective rank, ADA-FULL is able to adapt to this structure. On the contrary, although ADAGRAD is computationally practical, in the worst case it may have exponentially worse dependence on the data dimension ($p$ compared with $\log p$). In fact, the discrepancy between the regret of ADA-FULL and that of ADAGRAD is analogous to the discrepancy between ADAGRAD and SGD for sparse data.
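The scaling argument above is straightforward to check numerically. The sketch below (our own illustration; all parameter values are arbitrary) draws dense gradients from a rotated covariance with eigenvalues $\lambda_0 j^{-\alpha}$ and compares the two proximal quantities appearing in (1): $\operatorname{tr}(G_T^{1/2})$ for ADA-FULL and $\sum_j \|g_{1:T,j}\|$ for ADAGRAD.

```python
import numpy as np

def proximal_terms(p, T=2000, alpha=2.0, lam0=1.0, seed=0):
    """Compare tr(G_T^{1/2}) with sum_j ||g_{1:T,j}|| for gradients drawn from
    N(0, Q diag(lam) Q^T) with lam_j = lam0 * j^{-alpha} (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((p, p)))         # random rotation: eigenvectors not axis-aligned
    lam = lam0 * np.arange(1, p + 1) ** (-alpha)              # fast-decaying spectrum
    X = (rng.standard_normal((T, p)) * np.sqrt(lam)) @ Q.T    # rows x_t ~ N(0, Sigma), used as gradients
    G = X.T @ X                                               # G_T = sum_t g_t g_t^T
    full_term = np.sqrt(np.clip(np.linalg.eigvalsh(G), 0.0, None)).sum()  # tr(G_T^{1/2})
    diag_term = np.sqrt(np.diag(G)).sum()                     # sum_j ||g_{1:T,j}||_2
    return full_term, diag_term

for p in (50, 100, 200, 400):
    print(p, proximal_terms(p))
    # The full-matrix term grows slowly with p (roughly like log p for alpha = 2),
    # while the diagonal term keeps growing polynomially with the ambient dimension.
```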
Algorithm 1 ADA-LR
Input: $\eta > 0$, $\delta \ge 0$, $\tau$
1: for $t = 1 \ldots T$ do
2:  Receive $g_t = \nabla f_t(\beta_t)$.
3:  $G_t = G_{t-1} + g_t g_t^\top$
4:  Project: $\tilde{G}_t = G_t \Pi$
5:  $Q, R = \tilde{G}_t$  {QR-decomposition}
6:  $B = Q^\top G_t$
7:  $U, \Sigma, V = B$  {SVD}
8:  $\beta_{t+1} = \beta_t - \eta V(\Sigma^{1/2} + \delta I)^{-1} V^\top g_t$
9: end for
Output: $\beta_T$

Algorithm 2 RADAGRAD
Input: $\eta > 0$, $\delta \ge 0$, $\tau$
1: for $t = 1 \ldots T$ do
2:  Receive $g_t = \nabla f_t(\beta_t)$.
3:  Project: $\tilde{g}_t = \Pi g_t$
4:  $\tilde{G}_t = \tilde{G}_{t-1} + g_t \tilde{g}_t^\top$
5:  $Q_t, R_t \leftarrow \mathrm{qr\_update}(Q_{t-1}, R_{t-1}, g_t, \tilde{g}_t)$
6:  $B = \tilde{G}_t^\top Q_t$
7:  $U, \Sigma, W = B$  {SVD}
8:  $V = Q_t W$
9:  $\gamma_t = \eta(g_t - V V^\top g_t)$
10: $\beta_{t+1} = \beta_t - \eta V(\Sigma^{1/2} + \delta I)^{-1} V^\top g_t - \gamma_t$
11: end for
Output: $\beta_T$

## 3 Approximating ADA-FULL using random projections

It is clear that in certain regimes ADA-FULL provides stark optimization advantages over ADAGRAD in terms of the dependence on $p$. However, ADA-FULL requires maintaining a $p \times p$ matrix, $G_t$, and computing its square root and inverse. Therefore, computationally the dependence of ADA-FULL on $p$ scales with the cube, which is impractical in high dimensions.

A naïve approach would be to simply reduce the dimensionality of the gradient vector, $\tilde{g}_t = \Pi g_t \in \mathbb{R}^\tau$. ADA-FULL is now directly applicable in this low-dimensional space, returning a solution vector $\tilde{\beta}_t \in \mathbb{R}^\tau$ at each iteration. However, for many problems the original coordinates may have some intrinsic meaning or, in the case of deep networks, may be parameters in a model, in which case it is important to return a solution in the original space. Unfortunately, in general it is not possible to recover such a solution from $\tilde{\beta}_t$ [30]. Instead, we consider a different approach to maintaining and updating an approximation of the ADAGRAD matrix while retaining the original dimensionality of the parameter updates $\beta$ and gradients $g$.

### 3.1 Randomized low-rank approximation

As a first approach we approximate the inverse square root of $G_t$ using a fast randomized singular value decomposition (SVD) [15]. We proceed in two stages: first we compute an approximate basis $Q$ for the range of $G_t$; then we use $Q$ to compute an approximate SVD of $G_t$ by forming the smaller matrix $B = Q^\top G_t$ and computing the low-rank SVD $U\Sigma V^\top = B$. This is faster than computing the SVD of $G_t$ directly if $Q$ has few columns. An approximate basis $Q$ can be computed efficiently by forming the matrix $\tilde{G}_t = G_t \Pi$ by means of a structured random projection and then constructing an orthonormal basis for the range of $\tilde{G}_t$ by QR-decomposition.

The randomized SVD allows us to quickly compute the square root and pseudo-inverse of the proximal term $H_t$ by setting $H_t^{-1} = V(\Sigma^{1/2} + \delta I)^{-1} V^\top$. We call this approximation ADA-LR and describe the steps in full in Algorithm 1. In practice, using a structured random projection such as the SRFT leads to an approximation of the original matrix $G_t$ of the form $\|G_t - Q Q^\top G_t\| \le \epsilon$ with high probability [15], where $\epsilon$ depends on $\tau$ (the number of columns of $Q$), on $p$, and on the $\tau$-th singular value of $G_t$. Briefly, if the singular values of $G_t$ decay quickly and $\tau$ is chosen appropriately, $\epsilon$ will be small (this is stated more formally in Proposition 2). We leverage this result to derive the following regret bound for ADA-LR (see C.1 for the proof).

Proposition 2. Let $\sigma_{k+1}$ be the $(k+1)$-th largest singular value of $G_t$. Setting the projection dimension as $4\big(\sqrt{k} + \sqrt{8\log(kn)}\big)^2 \le \tau \le p$ and defining $\epsilon = \sqrt{1 + 7p/\tau}\,\sigma_{k+1}$, with failure probability at most $O(k^{-1})$ ADA-LR achieves regret
$$R_{LR}(T) \le 2\|\beta^{opt}\| \operatorname{tr}\big(G_T^{1/2}\big) + (2\tau\epsilon + \delta)\|\beta^{opt}\|.$$

Due to the randomized approximation we incur an additional $2\tau\epsilon\|\beta^{opt}\|$ compared with the regret of ADA-FULL (eq. 1). So, under the earlier stated assumption of fast decaying eigenvalues, we can use an identical argument as in eq. (2) to similarly obtain a dimension dependence of $O(\log p + \tau)$.

Approximating the inverse square root decreases the complexity of each iteration from $O(p^3)$ to $O(\tau p^2)$. We summarize the cost of each step in Algorithm 1 and contrast it with the cost of ADA-FULL in Table A.1 in Section A. Even though ADA-LR removes one factor of $p$ from the runtime of ADA-FULL, it still needs to store the large matrix $G_t$. This prevents ADA-LR from being a truly practical algorithm. In the following section we propose a second algorithm which directly stores a low dimensional approximation to $G_t$ that can be updated cheaply. This allows for an improvement in runtime to $O(\tau^2 p)$.
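The core of Algorithm 1 (lines 4–7 and the final update) is a randomized SVD in the style of [15]. A minimal numpy sketch is given below; a Gaussian test matrix is used for simplicity, whereas ADA-LR uses the structured SRFT projection, and the function name is ours:

```python
import numpy as np

def randomized_inv_sqrt(G, tau, delta=1e-8, rng=None):
    """Approximate H_t^{-1} = V (Sigma^{1/2} + delta*I)^{-1} V^T from a randomized
    SVD of the symmetric PSD matrix G (cf. Algorithm 1)."""
    rng = np.random.default_rng() if rng is None else rng
    p = G.shape[0]
    Pi = rng.standard_normal((p, tau))                 # test matrix (Gaussian stand-in for the SRFT)
    Y = G @ Pi                                         # sketch of the range of G          (line 4)
    Q, _ = np.linalg.qr(Y)                             # orthonormal basis for range(Y)    (line 5)
    B = Q.T @ G                                        # small (tau x p) projected matrix  (line 6)
    _, s, Vt = np.linalg.svd(B, full_matrices=False)   # low-rank SVD                      (line 7)
    V = Vt.T                                           # approximate top eigenvectors of G
    return V @ ((1.0 / (np.sqrt(s) + delta))[:, None] * V.T)

# Usage inside an ADA-LR-style loop (sketch): after accumulating G += np.outer(g, g),
# update with beta -= eta * randomized_inv_sqrt(G, tau) @ g.
```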
### 3.2 RADAGRAD: A faster approximation

From Table A.1, the expensive steps in Algorithm 1 are the update of $G_t$ (line 3), the random projection (line 4) and the projection onto the approximate range of $G_t$ (line 6). In the following we propose RADAGRAD, an algorithm that reduces the complexity to $O(\tau^2 p)$ by only approximately solving some of the expensive steps in ADA-LR while maintaining similar performance in practice.

To compute the approximate range $Q$, we do not need to store the full matrix $G_t$. Instead we only require the low dimensional matrix $\tilde{G}_t = G_t \Pi \in \mathbb{R}^{p \times \tau}$, which can be computed iteratively by setting $\tilde{G}_t = \tilde{G}_{t-1} + g_t (\Pi g_t)^\top$. This directly reduces the cost of the random projection to $O(p \log \tau)$, since we only project the vector $g_t$ instead of the matrix $G_t$; it also makes the update of $\tilde{G}_t$ faster and saves storage.

We then project $\tilde{G}_t$ onto the approximate range of $\tilde{G}_t$ and use the SVD to compute the inverse square root. Since $G_t$ is symmetric, its row and column spaces are identical, so little information is lost by projecting $\tilde{G}_t$ instead of $G_t$ onto the approximate range of $\tilde{G}_t$ (this idea is similar to bilinear random projections [13]). The advantage is that we can now compute the SVD in $O(\tau^3)$ and the matrix-matrix product on line 6 in $O(\tau^2 p)$. See Algorithm 2 for the full procedure.

The most expensive steps are now the QR decomposition and the matrix multiplications in steps 6 and 8 (see Algorithm 2 and Table A.1). Since at each iteration we only update the matrix $\tilde{G}_t$ with the rank-one matrix $g_t \tilde{g}_t^\top$, we can use faster rank-1 QR-updates [11] instead of recomputing the full QR decomposition. To speed up the matrix-matrix product $\tilde{G}_t^\top Q$ for very large problems (e.g. backpropagation in convolutional neural networks), a multithreaded BLAS implementation can be used.

### 3.3 Practical algorithms

Here we outline several simple modifications to the RADAGRAD algorithm to improve practical performance.

Corrected update. The random projection step only retains at most $\tau$ eigenvalues of $G_t$. If the assumption of low effective rank does not hold, important information from the $p - \tau$ smallest eigenvalues might be discarded. RADAGRAD therefore makes use of the corrected update
$$\beta_{t+1} = \beta_t - \eta V(\Sigma^{1/2} + \delta I)^{-1} V^\top g_t - \gamma_t, \qquad \gamma_t = \eta(I - VV^\top)g_t.$$
$\gamma_t$ is the projection of the current gradient onto the space orthogonal to the one captured by the random projection of $G_t$. This ensures that important variation in the gradient which is poorly approximated by the random projection is not completely lost. Consequently, if the data has rank less than $\tau$, $\gamma_t \to 0$. This correction only requires quantities which have already been computed but greatly improves practical performance.

Variance reduction. Variance reduction methods based on SVRG [19] obtain lower-variance gradient estimates by means of computing a pivot point over larger batches of data. Recent work has shown improved theoretical and empirical convergence in non-convex problems [1], in particular in combination with ADAGRAD. We modify RADAGRAD to use the variance reduction scheme of SVRG. The full procedure is given in Algorithm 3 in Section B. The majority of the algorithm is as RADAGRAD, except for the outer loop which computes the pivot point $\mu$ every epoch; this is used to reduce the variance of the stochastic gradient (line 4). The important additional parameter is $m$, the update frequency for $\mu$. As in [1] we set this to $m = 5n$. Practically, as is standard practice, we initialise RADA-VR by running ADAGRAD for several epochs.

We study the empirical behaviour of ADA-LR, RADAGRAD and its variance reduced variant in the next section.
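Putting Sections 3.2 and 3.3 together, a single RADAGRAD step with the corrected update can be sketched as follows. This is our own illustration of Algorithm 2: a DCT-based projection stands in for the paper's DFT-based SRFT (to keep everything real-valued), and the QR factorization is recomputed from scratch rather than updated rank-one; neither simplification is from the paper.

```python
import numpy as np
from scipy.fft import dct  # fast orthogonal transform, used here in place of the unitary DFT

class RadagradSketch:
    """Sketch of Algorithm 2 plus the corrected update of Section 3.3."""

    def __init__(self, p, tau, eta=0.1, delta=1e-8, seed=0):
        rng = np.random.default_rng(seed)
        self.signs = rng.choice([-1.0, 1.0], size=p)        # D: random sign flips
        self.idx = rng.choice(p, size=tau, replace=False)   # S: uniform subsampling of tau coordinates
        self.scale = np.sqrt(p / tau)
        self.acc = np.zeros((p, tau))                       # low-dimensional sketch G~_t
        self.eta, self.delta = eta, delta

    def project(self, g):
        """Structured projection Pi g = sqrt(p/tau) * S * Theta * D * g (DCT stand-in for Theta)."""
        return self.scale * dct(self.signs * g, norm="ortho")[self.idx]

    def step(self, beta, g):
        self.acc += np.outer(g, self.project(g))            # lines 3-4: G~_t = G~_{t-1} + g (Pi g)^T
        Q, _ = np.linalg.qr(self.acc)                       # line 5 (full QR; the paper uses qr_update)
        B = self.acc.T @ Q                                  # line 6: B = G~_t^T Q_t, a tau x tau matrix
        _, s, Wt = np.linalg.svd(B)                         # line 7: small SVD
        V = Q @ Wt.T                                        # line 8: V = Q_t W
        Vg = V.T @ g
        gamma = self.eta * (g - V @ Vg)                     # line 9: part of g outside range(V)
        return beta - self.eta * V @ (Vg / (np.sqrt(s) + self.delta)) - gamma  # line 10
```

With the SRFT and rank-one QR-updates the per-step cost drops to the $O(\tau^2 p)$ discussed above; this sketch trades that efficiency for brevity.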
## 4 Experiments

### 4.1 Low effective rank data

Figure 1: Comparison of (a) loss and (b) the largest eigenvalues (normalised by their sum) of the proximal term on simulated data.

We compare the performance of our proposed algorithms against both the diagonal and full-matrix ADAGRAD variants in the idealised setting where the data is dense but has low effective rank. We generate binary classification data with $n = 1000$ and $p = 125$. The data is sampled i.i.d. from a Gaussian distribution $\mathcal{N}(\mu_c, \Sigma)$ where $\Sigma$ has rapidly decaying eigenvalues $\lambda_j(\Sigma) = \lambda_0 j^{-\alpha}$ with $\alpha = 1.3$, $\lambda_0 = 30$. Each of the two classes has a different mean, $\mu_c$. For each algorithm learning rates are tuned using cross validation. The results for 5 epochs are averaged over 5 runs with different permutations of the data set and instantiations of the random projection for ADA-LR and RADAGRAD. For the random projection we use an oversampling factor, so $\Pi \in \mathbb{R}^{(10+\tau)\times p}$, to ensure accurate recovery of the top $\tau$ singular values and then set the values of $\lambda_{[\tau:p]}$ to zero [15].

Figure 1a shows the mean loss on the training set. The performance of ADA-LR and RADAGRAD matches that of ADA-FULL. On the other hand, ADAGRAD converges to the optimum much more slowly. Figure 1b shows the largest eigenvalues (normalized by their sum) of the proximal matrix for each method at the end of training. The spectrum of $G_t$ decays rapidly, which is matched by the randomized approximation. This illustrates the dependencies between the coordinates in the gradients and suggests that $G_t$ can be well approximated by a low-dimensional matrix which considers these dependencies. On the other hand, the spectrum of ADAGRAD (equivalent to the diagonal of $G$) decays much more slowly. The learning rates $\eta$ chosen by RADAGRAD and ADA-FULL are roughly one order of magnitude higher than for ADAGRAD.

Figure 2: Comparison of training loss (top row) and test accuracy (bottom row) on (a) MNIST, (b) CIFAR and (c) SVHN.

### 4.2 Non-convex optimization in neural networks

Here we compare RADAGRAD and RADA-VR against ADAGRAD and the combination of ADAGRAD+SVRG on the task of optimizing several different neural network architectures.

Convolutional Neural Networks. We used modified variants of standard convolutional network architectures for image classification on the MNIST, CIFAR-10 and SVHN datasets. These consist of three 5×5 convolutional layers generating 32 channels with ReLU non-linearities, each followed by 2×2 max-pooling. The final layer was a dense softmax layer and the objective was to minimize the categorical cross entropy. We used a batch size of 8 and trained the networks without momentum or weight decay in order to eliminate confounding factors. Instead, we used dropout regularization ($p = 0.5$) in the dense layers during training. Step sizes were determined by coarsely searching a log scale of possible values and evaluating performance on a validation set. We found RADAGRAD to have a higher impact with convolutional layers than with dense layers, due to the higher correlations between weights. Therefore, for computational reasons, RADAGRAD was only applied to the convolutional layers. The last dense classification layer was trained with ADAGRAD. In this setting ADA-FULL is computationally infeasible.
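For concreteness, a PyTorch sketch of the convolutional architecture described above is given below. The padding, the 32×32 input resolution (as for CIFAR-10 and SVHN) and the absence of additional dense layers are our assumptions; the paper does not specify these details.

```python
import torch.nn as nn

def build_convnet(in_channels=3, num_classes=10):
    """Three 5x5 conv layers with 32 channels and ReLU, each followed by 2x2
    max-pooling, then dropout and a dense softmax classifier (sketch)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),  # 32x32 -> 16x16
        nn.Conv2d(32, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),           # 16x16 -> 8x8
        nn.Conv2d(32, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),           # 8x8  -> 4x4
        nn.Flatten(),
        nn.Dropout(p=0.5),                   # dropout (p = 0.5) in the dense part, as in the text
        nn.Linear(32 * 4 * 4, num_classes),  # logits; nn.CrossEntropyLoss applies the softmax
    )

# Training would follow the text: batch size 8, no momentum or weight decay, and the
# adaptive method of choice (e.g. diagonal ADAGRAD via torch.optim.Adagrad) on the dense layer.
```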
The number of parameters in the convolutional layers is between 50–80k. Simply storing the full $G$ matrix using double precision would require more memory than is available on top-of-the-line GPUs.

The results of our experiments can be seen in Figure 2, where we show the objective value during training and the test accuracy. We find that both RADAGRAD variants consistently outperform both ADAGRAD and the combination of ADAGRAD+SVRG on these tasks. In particular, combining RADAGRAD with variance reduction results in the largest improvement for training, although both RADAGRAD variants quickly converge to very similar values for test accuracy. For all models, the learning rate selected by RADAGRAD is approximately an order of magnitude larger than the one selected by ADAGRAD. This suggests that RADAGRAD can make more aggressive steps than ADAGRAD, which results in the relative success of RADAGRAD over ADAGRAD, especially at the beginning of the experiments.

We observed that RADAGRAD performed 5–10× slower than ADAGRAD per iteration. This can be attributed to the lack of GPU-optimized SVD and QR routines. These numbers are comparable with other similar recently proposed techniques [23]. However, due to the faster convergence we found that the overall optimization time of RADAGRAD was lower than for ADAGRAD.

Figure 3: Comparison of training loss (left) and test loss (right) on the language modelling task with the T-LSTM.

Recurrent Neural Networks. We trained the strongly-typed variant of the long short-term memory network (T-LSTM, [4]) for language modelling, which consists of the following task: given a sequence of words from an original text, predict the next word. We used pre-trained GloVe embedding vectors [29] as input to the T-LSTM layer and a softmax over the vocabulary (10k words) as output. The loss is the mean categorical cross entropy. The memory size of the T-LSTM units was set to 256. We trained and evaluated our network on the Penn Treebank dataset [25]. We subsampled strings of length 20 from the dataset and asked the network to predict each word in the string, given the words up to that point. Learning rates were selected by searching over a log scale of possible values and measuring performance on a validation set. We compared RADAGRAD with ADAGRAD, without variance reduction.

The results of this experiment can be seen in Figure 3. During training, we found that RADAGRAD consistently outperforms ADAGRAD: RADAGRAD both reduces the training loss more quickly and reaches a smaller value ($5.62 \times 10^{-4}$ vs. $1.52 \times 10^{-3}$, a 2.7× reduction in loss). Again, we found that the selected learning rate is an order of magnitude higher for RADAGRAD than for ADAGRAD. RADAGRAD is able to exploit the fact that T-LSTMs perform type-preserving update steps which should preserve any low-rank structure present in the weight matrices. The relative improvement of RADAGRAD over ADAGRAD in training is also reflected in the test loss ($1.15 \times 10^{-2}$ vs. $3.23 \times 10^{-2}$, a 2.8× reduction).

## 5 Discussion

We have presented ADA-LR and RADAGRAD, which approximate the full proximal term of ADAGRAD using fast, structured random projections. ADA-LR enjoys similar regret to ADA-FULL and both methods achieve similar empirical performance at a fraction of the computational cost. Importantly, RADAGRAD can easily be modified to make use of standard improvements such as variance reduction.
Using variance reduction in combination with RADAGRAD in particular has stark benefits for non-convex optimization in convolutional and recurrent neural networks. We observe a marked improvement over widely-used techniques such as ADAGRAD and SVRG, the combination of which has recently been proven to be an excellent choice for non-convex optimization [1]. Furthermore, we tried to incorporate exponential forgetting schemes similar to RMSPROP and ADAM into the RADAGRAD framework but found that these methods degraded performance. A downside of such methods is that they require additional parameters to control the rate of forgetting.

Optimization for deep networks has understandably been a very active research area. Recent work has concentrated on either improving estimates of second order information or investigating the effect of variance reduction on the gradient estimates. It is clear from our experimental results that a thorough study of the combination provides an important avenue for further investigation, particularly where parts of the underlying model might have low effective rank.

Acknowledgements. We are grateful to David Balduzzi, Christina Heinze-Deml, Martin Jaggi, Aurelien Lucchi, Nishant Mehta and Cheng Soon Ong for valuable discussions and suggestions.

## References

[1] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[2] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[3] D. Balduzzi. Deep online convex optimization with gated games. arXiv preprint arXiv:1604.01952, 2016.
[4] D. Balduzzi and M. Ghifary. Strongly-typed recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[5] R. H. Byrd, S. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optimization. arXiv preprint arXiv:1401.7020, 2014.
[6] Y. N. Dauphin, H. de Vries, J. Chung, and Y. Bengio. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
[7] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
[8] G. Desjardins, K. Simonyan, R. Pascanu, et al. Natural neural networks. In Advances in Neural Information Processing Systems, pages 2062–2070, 2015.
[9] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
[10] J. C. Duchi, M. I. Jordan, and H. B. McMahan. Estimation, optimization, and parallelism when data is sparse. In Advances in Neural Information Processing Systems, 2013.
[11] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
[12] A. Gonen and S. Shalev-Shwartz. Faster SGD using sketched conditioning. arXiv preprint arXiv:1506.02649, 2015.
[13] Y. Gong, S. Kumar, H. Rowley, and S. Lazebnik. Learning binary codes for high-dimensional data using bilinear projections. In Proceedings of CVPR, pages 484–491, 2013.
[14] R. Grosse and R. Salakhudinov. Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix. In Proceedings of the 32nd International Conference on Machine Learning, pages 2304–2313, 2015.
[15] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[16] C. Heinze, B. McWilliams, and N. Meinshausen. Dual-LOCO: Distributing statistical estimation using random projections. In Proceedings of AISTATS, 2016.
[17] C. Heinze, B. McWilliams, N. Meinshausen, and G. Krummenacher. LOCO: Distributing ridge regression with random projections. arXiv preprint arXiv:1406.3469, 2014.
[18] T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, 2015.
[19] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
[20] N. S. Keskar and A. S. Berahas. adaQN: An adaptive quasi-Newton algorithm for training RNNs. Nov. 2015.
[21] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[22] A. Lucchi, B. McWilliams, and T. Hofmann. A variance reduced stochastic Newton method. arXiv preprint arXiv:1503.08316, 2015.
[23] H. Luo, A. Agarwal, N. Cesa-Bianchi, and J. Langford. Efficient second order online learning via sketching. arXiv preprint arXiv:1602.02202, 2016.
[24] M. W. Mahoney. Randomized algorithms for matrices and data. arXiv:1104.5557v3 [cs.DS], Apr. 2011.
[25] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[26] J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
[27] B. McWilliams, G. Krummenacher, M. Lucic, and J. M. Buhmann. Fast and robust least squares estimation in corrupted linear models. In Advances in Neural Information Processing Systems, volume 27, 2014.
[28] B. Neyshabur, R. R. Salakhutdinov, and N. Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2413–2421, 2015.
[29] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
[30] L. Zhang, M. Mahdavi, R. Jin, T. Yang, and S. Zhu. Recovering optimal solution by dual random projection. arXiv preprint arXiv:1211.3046, 2012.