# VARIANCE REDUCTION WITH SPARSE GRADIENTS

Published as a conference paper at ICLR 2020

Melih Elibol, Michael I. Jordan
University of California, Berkeley
{elibol,jordan}@cs.berkeley.edu

Lihua Lei
Stanford University
lihualei@stanford.edu

ABSTRACT

Variance reduction methods such as SVRG (Johnson & Zhang, 2013) and SpiderBoost (Wang et al., 2018) use a mixture of large and small batch gradients to reduce the variance of stochastic gradients. Compared to SGD (Robbins & Monro, 1951), these methods require at least double the number of operations per update to model parameters. To reduce the computational cost of these methods, we introduce a new sparsity operator: the random-top-k operator. Our operator reduces computational complexity by estimating the gradient sparsity exhibited in a variety of applications, combining the top-k operator (Stich et al., 2018; Aji & Heafield, 2017) with the randomized coordinate descent operator. With this operator, large batch gradients offer an extra benefit beyond variance reduction: a reliable estimate of gradient sparsity. Theoretically, our algorithm is at least as good as the best existing algorithm (SpiderBoost), and it further excels whenever the random-top-k operator captures gradient sparsity. Empirically, our algorithm consistently outperforms SpiderBoost using various models on various tasks including image classification, natural language processing, and sparse matrix factorization. We also provide empirical evidence to support the intuition behind our algorithm via a simple gradient entropy computation, which serves to quantify gradient sparsity at every iteration.

1 INTRODUCTION

Optimization tools for machine learning applications seek to minimize the finite-sum objective

    min_{x ∈ R^d} f(x) ≜ (1/n) Σ_{i=1}^{n} f_i(x),    (1)

where x is a vector of parameters and f_i : R^d → R is the loss associated with sample i. Batch SGD serves as the prototype for modern stochastic gradient methods. It updates the iterate x with x ← x − η ∇f_I(x), where η is the learning rate and ∇f_I(x) = (1/|I|) Σ_{i ∈ I} ∇f_i(x) is the stochastic gradient over a sampled batch I. The batch size |I| in batch SGD directly impacts both the stochastic variance and the gradient query complexity of each iteration of the update rule. In recent years, variance reduction techniques have been proposed that carefully blend large and small batch gradients (e.g. Roux et al., 2012; Johnson & Zhang, 2013; Defazio et al., 2014; Xiao & Zhang, 2014; Allen-Zhu & Yuan, 2016; Allen-Zhu & Hazan, 2016; Reddi et al., 2016a;b; Allen-Zhu, 2017; Lei & Jordan, 2017; Lei et al., 2017; Allen-Zhu, 2018b; Fang et al., 2018; Zhou et al., 2018; Wang et al., 2018; Pham et al., 2019; Nguyen et al., 2019; Lei & Jordan, 2019). They are alternatives to batch SGD and are provably better than SGD in various settings. While these methods allow for larger learning rates than batch SGD and have appealing theoretical guarantees, they require a per-iteration query complexity that is more than double that of batch SGD. Defazio (2019) questions the utility of variance reduction techniques in modern machine learning problems, empirically identifying query complexity as one issue. In this paper, we show that gradient sparsity (Aji & Heafield, 2017) can be used to significantly reduce the query complexity of variance reduction methods. Our work is motivated by the observation that gradients tend to be sparse, having only a small fraction of large coordinates.
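To fix ideas, the sketch below contrasts the plain batch SGD update with the blended large/small-batch estimator used by SpiderBoost-style methods. It is a minimal illustration on a toy least-squares objective, not the authors' code, and the helper names (grad_f, spider_estimator_update) are our own; note that each blended update queries the same mini-batch at two points, which is the source of the doubled per-iteration query cost discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite-sum problem: f_i(x) = 0.5 * (a_i . x - y_i)^2.
n, d = 1000, 20
A, y = rng.normal(size=(n, d)), rng.normal(size=n)

def grad_f(x, idx):
    """Mini-batch stochastic gradient (1/|I|) * sum_{i in I} grad f_i(x)."""
    Ai, yi = A[idx], y[idx]
    return Ai.T @ (Ai @ x - yi) / len(idx)

def sgd_step(x, eta=0.01, b=100):
    """Plain batch SGD: one size-b gradient query per parameter update."""
    I = rng.choice(n, size=b, replace=False)
    return x - eta * grad_f(x, I)

def spider_estimator_update(v, x_old, x_new, b=100):
    """SpiderBoost-style blended estimator: v <- v + grad_f_I(x_new) - grad_f_I(x_old).
    The same mini-batch is evaluated at two points, so each inner step costs
    roughly 2b gradient queries versus b for plain batch SGD."""
    I = rng.choice(n, size=b, replace=False)
    return v + grad_f(x_new, I) - grad_f(x_old, I)
```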
Specifically, if the indices of large gradient coordinates (measured in absolute value) are known before updating model parameters, we compute the derivative of only those coordinates while setting the remaining gradient coordinates to zero. In principle, if sparsity is exhibited, using large gradient coordinates will not effect performance and will significantly reduce the number of operations required to update model parameters. Nevertheless, this heuristic alone has three issues: (1) bias is introduced by setting other entries to zero; (2) the locations of large coordinates are typically unknown; (3) accessing a subset of coordinates may not be easily implemented for some problems like deep neural networks. We provide solutions for all three issues. First, we introduce a new sparse gradient operator: The random-top-k operator. The random-top-k operator is a composition of the randomized coordinate descent operator and the top-k operator. In prior work, the top-k operator has been used to reduce the communication complexity of distributed optimization (Stich et al., 2018; Aji & Heafield, 2017) applications. The random-top-k operator has two phases: Given a stochastic gradient and a pair of integers (k1, k2) that sum to k, the operator retains k1 coordinates which are most promising in terms of their likelihood to be large on average, then randomly selects k2 of the remaining coordinates with appropriate rescaling. The first phase captures sparsity patterns while the second phase eliminates bias. Second, we make use of large batch gradients in variance reduction methods to estimate sparsity patterns. Inspired by the use of a memory vector in Aji & Heafield (2017), the algorithm maintains a memory vector initialized with the absolute value of the large batch gradient at the beginning of each outer loop and updated by taking an exponential moving average over subsequent stochastic gradients. Coordinates with large values in the memory vector are more promising, and the random-top-k operator will pick the top k1 coordinate indices based on the memory vector. Since larger batch gradients have lower variance, the initial estimate is quite accurate. Finally, for software that supports dynamic computation graphs, we provide a cost-effective way (sparse back-propagation) to implement the random-top-k operator. In this work we apply the random-top-k operator to Spider Boost (Wang et al., 2018), a recent variance reduction method that achieves optimal query complexity, with a slight modification based on the geometrization technique introduced by Lei & Jordan (2019). Theoretically, we show that our algorithm is never worse than Spider Boost and can strictly outperform it when the random-top-k operator captures gradient sparsity. Empirically, we demonstrate the improvements in computation for various tasks including image classification, natural language processing, and sparse matrix factorization. The rest of the paper is organized as follows. In Section 2, we define the random-top-k operator, our optimization algorithm, and a description of sparse backpropagation. The theoretical analyses are presented in Section 3, followed by experimental results in Section 4. All technical proofs are relegated to Appendix A, and additional experimental details can be found in Appendix B. 
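Issue (1) above can be seen with a short numerical check: deterministically zeroing all but the top-k gradient coordinates yields a biased estimator, whereas keeping k uniformly random coordinates rescaled by d/k is unbiased. The snippet below is only an illustration of this point under our own toy setup; the operator actually used in this paper, which combines both ideas, is defined in Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 3
y = rng.normal(size=d)          # stands in for a stochastic gradient

def top_k_zeroed(y, k):
    """Keep only the k largest-magnitude entries (deterministic, biased)."""
    out = np.zeros_like(y)
    idx = np.argsort(np.abs(y))[-k:]
    out[idx] = y[idx]
    return out

def random_k_rescaled(y, k):
    """Keep k uniformly random entries, rescaled by d/k (unbiased)."""
    out = np.zeros_like(y)
    idx = rng.choice(len(y), size=k, replace=False)
    out[idx] = y[idx] * len(y) / k
    return out

avg = np.mean([random_k_rescaled(y, k) for _ in range(100_000)], axis=0)
print(np.abs(top_k_zeroed(y, k) - y).max())   # noticeable gap: bias from dropped entries
print(np.abs(avg - y).max())                  # near zero: unbiased on average
```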
2 STOCHASTIC VARIANCE REDUCTION WITH SPARSE GRADIENTS

Generally, variance reduction methods reduce the variance of stochastic gradients by taking a snapshot ∇f(y) of the full gradient every m steps of optimization and using the gradient information in this snapshot to reduce the variance of subsequent smaller batch gradients ∇f_I(x) (Johnson & Zhang, 2013; Wang et al., 2018). Methods such as SCSG (Lei & Jordan, 2017) instead use a large batch gradient, whose batch size is typically some multiple of the small batch size b; this is much more practical and is what we do in this paper. To reduce the cost of computing these additional gradients, we exploit sparsity by computing only k of the d coordinates of each gradient in R^d. For d, k, k1, k2 ∈ Z_+, let k = k1 + k2 with 1 ≤ k ≤ d, where d is the dimension of the parametric model. In what follows, we define an operator that takes vectors x, y and outputs y', where y' retains only k of the entries in y: k1 of them are selected according to the coordinates of x with the k1 largest absolute values, and the remaining k2 entries are randomly selected from y. The k1 coordinate indices and the k2 coordinate indices are disjoint. Formally, the operator rtop_{k1,k2} : R^d × R^d → R^d is defined for x, y ∈ R^d as

    (rtop_{k1,k2}(x, y))_ℓ =  y_ℓ,                  if k1 > 0 and |x|_ℓ ≥ |x|_{(k1)},
                              ((d − k1)/k2) · y_ℓ,  if ℓ ∈ S,
                              0,                    otherwise,

where |x| denotes the vector of absolute values, |x|_{(1)} ≥ |x|_{(2)} ≥ ... ≥ |x|_{(d)} denotes the order statistics of the coordinates of x in absolute value, and S denotes a random subset of size k2 drawn uniformly from the set {ℓ : |x|_ℓ < |x|_{(k1)}}. For instance, if x = (11, 12, 13, 14, 15), y = (25, 24, 13, 12, 11) and k1 = k2 = 1, then S is a singleton drawn uniformly from {1, 2, 3, 4}. Suppose S = {2}; then rtop_{1,1}(x, y) = (0, 4y_2, 0, 0, y_5) = (0, 96, 0, 0, 11). If k1 + k2 = d, then rtop_{k1,k2}(x, y) = y. On the other hand, if k1 = 0, then rtop_{0,k2}(x, y) does not depend on x and returns a rescaled random subset of y; this is the operator used in randomized coordinate descent methods. Finally, rtop_{k1,k2}(x, y) is linear in y. The following lemma shows that rtop_{k1,k2}(x, y) is an unbiased estimator of y, which is a crucial property in our later analysis.

Lemma 1. Given any x, y ∈ R^d,

    E[rtop_{k1,k2}(x, y)] = y,    Var(rtop_{k1,k2}(x, y)) = ((d − k1 − k2)/k2) ‖top_{−k1}(x, y)‖²,

where E is taken over the random subset S involved in the rtop_{k1,k2} operator and

    (top_{−k1}(x, y))_ℓ = y_ℓ if k1 > 0 and |x|_ℓ < |x|_{(k1)}, and 0 otherwise.

Our algorithm is detailed below.

Algorithm 1: SpiderBoost with Sparse Gradients.
Input: Learning rate η, inner loop size m, outer loop size T, large batch size B, small batch size b, initial iterate x0, memory decay factor α, sparsity parameters k1, k2.
 1: I_0 ~ Unif({1, ..., n}) with |I_0| = B
 2: M_0 := |∇f_{I_0}(x_0)|
 3: for j = 1, ..., T do
 4:     x^(j)_0 := x_{j−1},  M^(j)_0 := M_{j−1}
 5:     I_j ~ Unif({1, ..., n}) with |I_j| = B
 6:     ν^(j)_0 := ∇f_{I_j}(x^(j)_0)
 7:     N_j := m (for implementation) or N_j ~ geometric distribution with mean m (for theory)
 8:     for t = 0, ..., N_j − 1 do
 9:         x^(j)_{t+1} := x^(j)_t − η ν^(j)_t
10:         I^(j)_t ~ Unif([n]) with |I^(j)_t| = b
11:         ν^(j)_{t+1} := ν^(j)_t + rtop_{k1,k2}(M^(j)_t, ∇f_{I^(j)_t}(x^(j)_{t+1}) − ∇f_{I^(j)_t}(x^(j)_t))
12:         M^(j)_{t+1} := α |ν^(j)_{t+1}| + (1 − α) M^(j)_t
13:     x_j := x^(j)_{N_j},  M_j := M^(j)_{N_j}
Output: x_out = x_T (for implementation) or x_out = x_{T'} with T' ~ Unif([T]) (for theory)

The algorithm includes an outer loop and an inner loop. In the theoretical analysis, we generate N_j as a geometric random variable. This trick is called "geometrization"; it was proposed by Lei & Jordan (2017) and given this name by Lei & Jordan (2019).
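Before continuing with the role of geometrization, the following sketch renders the rtop_{k1,k2} operator directly in code and checks the unbiasedness claim of Lemma 1 by Monte Carlo on the worked example above. It is a plain NumPy sketch of the definition (ties in |x| are broken arbitrarily by argsort), not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def rtop(x, y, k1, k2):
    """Sketch of rtop_{k1,k2}(x, y): keep the k1 entries of y at the coordinates
    where |x| is largest, keep k2 uniformly random remaining entries of y rescaled
    by (d - k1) / k2, and set every other entry to zero."""
    d = len(y)
    out = np.zeros_like(y, dtype=float)
    order = np.argsort(-np.abs(x))            # coordinates by decreasing |x|
    top_idx, rest_idx = order[:k1], order[k1:]
    out[top_idx] = y[top_idx]
    S = rng.choice(rest_idx, size=k2, replace=False)
    out[S] = y[S] * (d - k1) / k2
    return out

# Monte Carlo check of Lemma 1 (E[rtop_{k1,k2}(x, y)] = y) on the example above.
x = np.array([11.0, 12.0, 13.0, 14.0, 15.0])
y = np.array([25.0, 24.0, 13.0, 12.0, 11.0])
est = np.mean([rtop(x, y, k1=1, k2=1) for _ in range(200_000)], axis=0)
print(np.round(est, 2))    # close to y
```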
It greatly simplifies analysis (e.g. Lei et al., 2017; Allen-Zhu, 2018a). In practice, as observed by Lei et al. (2017), setting Nj to m does not impact performance in any significant way. We only use geometrization in our theoretical analysis for clarity. Similarly, for our theoretical analysis, the output of our algorithm is selected uniformly at random from the set of outer loop iterations. Like the use of average iterates in convex optimization, this is a common technique for nonconvex optimization proposed by Nemirovski et al. (2009). In practice, we simply use the last iterate. Similar to Aji & Heafield (2017), we maintain a memory vector M (j) t at each iteration of our algorithm. The memory vector is initialized to the large batch gradient computed before every pass through the inner loop, which provides a relatively accurate gradient sparsity estimate of x(j) 0 . The exponential moving average gradually incorporates information from subsequent small batch gradients to account for changes to gradient sparsity. We then use M (j) t as an approximation to the variance of each gradient coordinate in our rtopk1,k2 operator. With M (j) t as input, the rtopk1,k2 Published as a conference paper at ICLR 2020 operator targets k1 high variance gradient coordinates in addition to k2 randomly selected coordinates. The cost of invoking rtopk1,k2 is dominated by the algorithm for selecting the top k coordinates, which has linear worst case complexity when using the introselect algorithm (Musser, 1997). 2.1 SPARSE BACK-PROPAGATION A weakness of our method is the technical difficulty of implementing a sparse backpropagation algorithm in modern machine learning libraries, such as Tensorflow (Abadi et al., 2015) and Pytorch (Paszke et al., 2017). Models implemented in these libraries generally assume dense structured parameters. The optimal implementation of our algorithm makes use of a sparse forward pass and assumes a sparse computation graph upon which backpropagation is executed. Libraries that support dynamic computation graphs, such as Pytorch, will construct the sparse computation graph in the forward pass, which makes the required sparse backpropagation trivial. We therefore expect our algorithm to perform quite well on libraries which support dynamic computation graphs. Consider the forward pass of a deep neural network, where φ is a deep composition of parametric functions, φ(x; θ) = φL(φL 1(. . . φ0(x; θ0) . . . ; θL 1); θL). (2) The unconstrained problem of minimizing over the θℓcan be rewritten as a constrained optimization problem as follows: i=1 loss(z(L+1) i , yi) s.t. z(L+1) i = φL(z(L) i ; θL) ... z(ℓ+1) i = φℓ(z(ℓ) i ; θℓ) ... z(1) i = φ0(xi; θ0). In this form, z L+1 i is the model estimate for data point i. Consider φℓ(x; θℓ) = σ(x T θℓ) for 1 ℓ< L, φL be the output layer, and σ be some subdifferentiable activation function. If we apply the rtopk1,k2 operator per-layer in the forward-pass, with appropriate scaling of k1 and k2 to account for depth, we see that the number of multiplications in the forward pass is reduced to k1 + k2: σ(rtopk1,k2(v, x)T rtopk1,k2(v, θℓ)). A sparse forward-pass yields a computation graph for a (k1 + k2)-parameter model, and back-propagation will compute the gradient of the objective with respect to model parameters in linear time (Chauvin & Rumelhart, 1995). 3 THEORETICAL COMPLEXITY ANALYSIS 3.1 NOTATION AND ASSUMPTIONS Denote by the Euclidean norm and by a b the minimum of a and b. For a random vector Y Rd, i=1 Var(Yi). 
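Returning briefly to Algorithm 1 before the analysis, the sketch below summarizes the outer and inner loops, the memory vector update, and the sparsified gradient difference in one function. It is a minimal illustration, not the authors' implementation: it uses N_j = m, takes the hypothetical grad_f and rtop helpers sketched earlier as arguments, and sparsifies dense mini-batch gradients for clarity rather than performing the sparse back-propagation of Section 2.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_spiderboost(grad_f, rtop, x0, n, T=5, m=10, B=1000, b=100,
                       eta=0.1, alpha=0.5, k1=None, k2=None):
    """Sketch of Algorithm 1 (SpiderBoost with sparse gradients), with N_j = m.
    grad_f(x, idx) returns a mini-batch gradient; rtop(x, y, k1, k2) is the
    random-top-k operator."""
    d = len(x0)
    k1 = d // 20 if k1 is None else k1        # 5% of parameters by default
    k2 = d // 20 if k2 is None else k2
    x = x0.astype(float).copy()
    M = np.abs(grad_f(x, rng.choice(n, size=B, replace=False)))         # memory M_0
    for j in range(T):
        I = rng.choice(n, size=B, replace=False)
        v = grad_f(x, I)                                                # large-batch estimate
        for t in range(m):
            x_new = x - eta * v                                         # line 9
            It = rng.choice(n, size=b, replace=False)                   # line 10
            v = v + rtop(M, grad_f(x_new, It) - grad_f(x, It), k1, k2)  # line 11
            M = alpha * np.abs(v) + (1 - alpha) * M                     # line 12
            x = x_new
    return x
```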
We say a random variable N has a geometric distribution, N Geom(m), if N is supported on the non-negative integers with P(N = k) = γk(1 γ), k = 0, 1, . . . , for some γ such that EN = m. Here we allow N to be zero to facilitate the analysis. Published as a conference paper at ICLR 2020 Assumption A1 on the smoothness of individual functions will be made throughout the paper. A1 fi is differentiable with fi(x) fi(y) L x y , for some L < and for all i {1, . . . , n}. As a direct consequence of assumption A1, it holds for any x, y Rd that 2 x y 2 fi(x) fi(y) fi(y), x y L 2 x y 2. (4) To formulate our complexity bounds, we define f = inf x f(x), f = f(x0) f . Further we define σ2 as an upper bound on the expected norm of the stochastic gradients: σ2 = sup x 1 n i=1 fi(x) 2. (5) By Cauchy-Schwarz inequality, it is easy to see that σ2 is also a uniform bound of f(x) 2. Finally, we assume that sampling an index i and accessing the pair fi(x) incur a unit of cost and accessing the truncated version rtopk1,k2(m, fi(x)) incur (k1 + k2)/d units of cost. Note that calculating rtopk1,k2(m, f I(x)) incurs |I|(k1 + k2)/d units of computational cost. Given our framework, the theoretical complexity of the algorithm is B + 2b Nj k1 + k2 3.2 WORST-CASE GUARANTEE Theorem 1. Under the following setting of parameters k2 6dm, B = 2σ2 For any T T(ϵ) 4 f/ηmϵ2, E f(xout) ϵ. If we further set m = Bd b(k1 + k2), the complexity to achieve the above condition is ECcomp(ϵ) = O Recall that the complexity of Spider Boost (Wang et al., 2018) is Thus as long as b = O(1), k1 = O(k2), our algorithm has the same complexity as Spider Boost under appropriate settings. The penalty term O( p b(k1 + k2)/k2) is due to the information loss by sparsification. Published as a conference paper at ICLR 2020 3.3 DATA ADAPTIVE ANALYSIS Let g(j) t = top k1(M (j) t , f(x(j) t+1) f(x(j) t )) 2, and i=1 top k1(M (j) t , fi(x(j) t+1) fi(x(j) t )) 2. By Cauchy-Schwarz inequality and the linearity of top k1, it is easy to see that g(j) t G(j) t . If our algorithm succeeds in capturing sparsity, both g(j) t and G(j) t will be small. In this subsection we will analyze the complexity under this case. Further define Rj as Rj = Ejg(j) Nj + Ej G(j) Nj b , (7) where Ej is taken over all randomness in j-th outer loop (line 4-13 of Algorithm 1). Theorem 2. Under the following setting of parameters 3m , B = 3σ2 For any T T(ϵ) 6 f/ηmϵ2, E f(xout) 2 2ϵ2 3 + (d k1 k2)m If E RT ϵ2 k2 3(d k1 k2)m, then E f(xout) ϵ. If we further set m = Bd b(k1 + k2), the complexity to achieve the above condition is ECcomp(ϵ) = O In practice, m is usually much larger than b. As a result, the complexity of our algorithm is O( p (k1 + k2)/d) smaller than that of Spider Boost if our algorithm captures gradient sparsity. Although this type of data adaptive analyses is not as clean as the worst-case guarantee (Theorem 1), it reveals the potentially superior performance of our algorithm. Similar analyses have been done for various other algorithms, including Ada Grad (Duchi et al., 2011) and Adam (Kingma & Ba, 2014). 4 EXPERIMENTS In this section, we present a variety of experiments to illustrate gradient sparsity and demonstrate the performance of Sparse Spider Boost. By computing the entropy of the empirical distribution of the absolute value of stochastic gradient coordinates, we show that certain models exhibit gradient sparsity during optimization. 
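The experiments below report progress against gradient queries, counted according to the Section 3.1 convention that a sparsified size-b gradient costs b(k1 + k2)/d query units. A small sketch of the resulting per-outer-loop cost, with and without sparsification, is given here; the function names are ours and the example plugs in the default settings stated below (B = 1000, b = 100, m = 10, k1 = k2 = 0.05d).

```python
def sparse_queries_per_outer_loop(B, b, m, k1, k2, d):
    """Expected query cost of one outer loop of Algorithm 1: B for the large batch,
    plus (on average) m inner steps, each querying two size-b gradients restricted
    to k1 + k2 of the d coordinates."""
    return B + 2 * b * m * (k1 + k2) / d

def dense_queries_per_outer_loop(B, b, m):
    """Same accounting without sparsification (k1 + k2 = d), i.e. SpiderBoost."""
    return B + 2 * b * m

d = 10**6   # illustrative parameter count
print(dense_queries_per_outer_loop(1000, 100, 10))                          # 3000
print(sparse_queries_per_outer_loop(1000, 100, 10, 0.05 * d, 0.05 * d, d))  # 1200.0
```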
To evaluate the performance of variance reduction with sparse gradients, we compute the loss versus gradient queries per epoch of Sparse SpiderBoost and SpiderBoost for a number of image classification problems. We also compare Sparse SpiderBoost, SpiderBoost, and SGD on a natural language processing task and on sparse matrix factorization. For all experiments, unless otherwise specified, we run SpiderBoost and Sparse SpiderBoost with a learning rate η = 0.1, large-batch size B = 1000, small-batch size b = 100, inner loop length m = 10, memory decay factor α = 0.5, and k1 and k2 both set to 5% of the total number of model parameters. We call the sum k1 + k2 = k = 10% the sparsity of the optimization algorithm.

4.1 GRADIENT SPARSITY AND IMAGE CLASSIFICATION

Our experiments in this section test a number of image classification tasks for gradient sparsity and plot the learning curves of some of these tasks. We test a 2-layer fully connected neural network with hidden layers of width 100, a simple convolutional neural network which we describe in detail in Appendix B, and Resnet-18 (He et al., 2015). All models use ReLU activations. For datasets, we use CIFAR-10 (Krizhevsky et al.), SVHN (Netzer et al., 2011), and MNIST (LeCun & Cortes, 2010). None of our experiments include Resnet-18 on MNIST, as MNIST is an easier dataset; it is included primarily to provide variety. Our method relies partially on the assumption that the magnitudes of the derivatives of some model parameters are greater than those of others. To measure this, we compute the entropy of the empirical distribution of the absolute values of stochastic gradient coordinates. In Algorithm 1, the following term updates our estimate of the variance of each coordinate's derivative: M^(j)_{t+1} := α |ν^(j)_{t+1}| + (1 − α) M^(j)_t. Consider the entropy of the probability vector p^(j)_t = M^(j)_t / ‖M^(j)_t‖_1. The entropy of p provides a measure of how much structure there is in our gradients. To see this, consider the hypothetical scenario where p_i = 1/d. In this scenario we have no structure; the top-k1 component of our sparsity operator provides no value and the entropy is maximized. On the other hand, if a single entry p_i = 1 and all other entries p_j = 0, then the top-k1 component of our sparsity operator is effectively identifying the only relevant model parameter. To measure the potential of our sparsity operator, we compute the entropy of p while running SpiderBoost on a variety of datasets and model architectures. The results of this experiment are summarized in the following table.

Table 1: Entropy of Memory Vectors

                 FC NN                    Conv NN                  Resnet-18
          Max     Before  After    Max     Before  After    Max     Before  After
CIFAR-10  18.234  16.41   8.09     15.920  13.38   2.66     23.414  22.59   21.70
SVHN      18.234  15.36   8.05     15.920  13.00   2.97     23.414  22.62   21.31
MNIST     18.234  14.29   9.77     15.920  14.21   2.77     -       -       -

Table 1 provides the maximum entropy as well as the entropy of the memory vector before and after training for 150 epochs, for each dataset and each model. For each model, the entropy at the beginning of training is almost maximal; this is due to the random initialization of model parameters. After 150 epochs, the entropy of M_t for the convolutional model drops to approximately 3, which suggests a substantial amount of gradient structure. Note that for the datasets we tested, the gradient structure depends primarily on the model and not on the dataset. A minimal sketch of this entropy computation is given below.
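Entropy here is measured in bits, so the maximum is log2(d) (for reference, 2^23.414 is roughly 1.1 × 10^7, on the order of the parameter count of Resnet-18). The helper name and the example vectors below are our own illustrations.

```python
import numpy as np

def memory_entropy(M):
    """Entropy (in bits) of p = M / ||M||_1 for a memory vector M from Algorithm 1.
    Values near log2(d) indicate no exploitable structure; small values indicate
    that a few coordinates dominate the gradient magnitudes."""
    p = np.abs(M) / np.abs(M).sum()
    p = p[p > 0]                              # treat 0 * log 0 as 0
    return float(-(p * np.log2(p)).sum())

d = 300_000
print(np.log2(d))                                  # maximum possible entropy
print(memory_entropy(np.ones(d)))                  # uniform memory: equals log2(d)
print(memory_entropy(np.exp(-np.arange(d) / 50)))  # concentrated memory: much smaller
```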
In particular, for Resnet-18, the entropy appears to vary minimally after 150 epochs. Figure 1: Spider Boost with 10% sparsity (k = 0.1d) compared to Spider Boost without sparsity. Left figure compares the two algorithms using Resnet-18 on Cifar-10. Right figure compares the two algorithms using a convolutional neural network trained on MNIST. The x-axis measures gradient queries over N, where N is the size of the respective datasets. Plots are in log-scale. Published as a conference paper at ICLR 2020 Figure 1 compares Spider Boost alone to Spider Boost with 10% sparsity (10% of parameter derivatives). All experiments in this section are run for 50 epochs. In our comparison to Spider Boost, we measure the number of gradient queries over the size of the dataset N. A single gradient query is taken to be the cost of computing a gradient for a single data point. If i is the index of a single sample, then fi(x) is a single gradient query. Using the batch gradient to update model parameters for a dataset of size B has a gradient query cost of B. For a model with d parameters, using a single sample to update k model parameters has a gradient query cost of k/d, etc. Our results of fitting the convolutional neural network to MNIST show that sparsity provides a significant advantage compared to using Spider Boost alone. We only show 2 epochs of this experiment since the MNIST dataset is fairly simple and convergence is rapidly achieved. The results of training Resnet-18 on CIFAR-10 suggests that our sparsity algorithm works well on large neural networks, and non-trivial datasets. We believe Resnet-18 on CIFAR-10 does not do as well due to the gradient density we observe for Resnet-18 in general. Sparsity here not only has the additional benefit of reducing gradient query complexity, but also provides a dampening effect on variance due to the additional covariates in Spider Boost s update to model parameters. Results for the rest of these experiments can be found in Appendix B. 4.2 NATURAL LANGUAGE PROCESSING We evaluate Sparse Spider Boost s performance on an LSTM-based (Hochreiter & Schmidhuber, 1997) generative language model. We compare Sparse Spider Boost, Spider Boost, and SGD. We train our LSTM model on the Penn Treebank (Marcus et al., 1994) corpus. The natural language processing model consists of a word embedding of dimension 128 of 1000 tokens, which is jointly learned with the task. The LSTM has a hidden and cell state dimension of 1024. All three optimization algorithms operate on this model. The variance reduction training algorithm for this type of model can be found in Appendix B. We run Spider Boost and Sparse Spider Boost with a learning rate η = 0.2, large-batch size B = 40, small-batch size b = 20, inner loop length of m = 2. We run SGD with learning rate 0.2 and batch size is 20. Figure 2 shows Spider Boost is slightly worse than SGD, and sparsity provides a noticeable improvement over SGD. Figure 2: (a): SGD learning rate is 0.2 and batch size is 20. (b): SGD batch size is 103 and learning rate schedule is 0.1 for epochs 0 10, 0.01 for epochs 10 20, and 0.001 for epochs 20 40. The x-axis measures gradient queries over N, where N is the size of the respective datasets. Plots are in log-scale. 4.3 SPARSE MATRIX FACTORIZATION For our experiments with sparse matrix factorization, we perform Bayesian Personalized Ranking (Rendle et al., 2009) on the Movie Lens database (Harper & Konstan, 2015) with a latent dimension of 20. 
To satisfy m = B/b, we run Spider Boost and Sparse Spider Boost with a large-batch size B = 1030, small-batch size b = 103, inner loop length of m = 10. For this experiment, we run Spider Boost with the following learning rate schedule: η(a, b, t) = b + (a b)m t where a = 1.0 and b = 0.1. The schedule interpolates from a to b as the algorithm progresses through the inner loop. For instance, within the inner loop, at iteration 0 the learning rate is 1.0, and at iteration m the learning rate is 0.1. We believe this is a natural way to utilize the low variance Published as a conference paper at ICLR 2020 at the beginning of the inner loop, and is a fair comparison to an exponential decay learning rate schedule for SGD. Details of the SGD baselines are provided in Figure 2. We see Spider Boost is slightly worse than SGD, and sparsity provides a slight improvement over SGD, especially in the first few epochs. 5 CONCLUSION In this paper, we show how sparse gradients with memory can be used to improve the gradient query complexity of SVRG-type variance reduction algorithms. While we provide a concrete sparse variance reduction algorithm for Spider Boost, the techniques developed in this paper can be adapted to other variance reduction algorithms. We show that our algorithm provides a way to explicitly control the gradient query complexity of variance reduction methods, a problem which has thus far not been addressed. Assuming our algorithm captures the sparsity structure of the optimization problem, we also prove that the complexity of our algorithm is an improvement over Spider Boost. The results of our comparison to Spider Boost validates this assumption, and entropy measures provided in Table 1 empirically support our hypothesis that gradient sparsity exists. Table 1 also supports the results in Aji & Heafield (2017), which shows that the top-k operator generally outperforms the random-k operator. Our random-top-k operator takes advantage of the superior performance of the top-k operator while eliminating bias via a secondary random-k operator. Not every problem we tested exhibited sparsity structure. While this is true, our analysis proves that our algorithm performs no worse than Spider Boost in these settings. Even when there is no structure, our algorithm reduces to a random sampling of k1 + k2 coordinates, which is essentially a randomized coordinate descent analogue of Spider Boost. Empirically, we see that Sparse Spider Boost outperforms Spider Boost when no sparsity structure is present. We believe this is due to the variance introduced by additional covariates in the Spider Boost update, which is mitigated in Sparse Spider Boost by our random-top-k operator. The results of our experiments on natural language processing and matrix factorization demonstrate that, with additional effort, variance reduction methods are competitive with SGD. While we view this as progress toward improving the practical viability of variance reduction algorithms, we believe further improvements can be made, such as better utilization of reduced variance during training, and better control over increased variance in very high dimensional models such as dense net (Defazio, 2019). We recognize these issues and hope to make progress on them in future work. Mart ın Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. 
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Man e, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Vi egas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensor Flow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org. Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. Co RR, abs/1704.05021, 2017. URL http://arxiv.org/abs/1704.05021. Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1200 1205. ACM, 2017. Zeyuan Allen-Zhu. Katyusha x: Practical momentum method for stochastic sum-of-nonconvex optimization. ar Xiv preprint ar Xiv:1802.03866, 2018a. Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than sgd. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett Published as a conference paper at ICLR 2020 (eds.), Advances in Neural Information Processing Systems 31, pp. 2675 2686. Curran Associates, Inc., 2018b. URL http://papers.nips.cc/paper/ 7533-natasha-2-faster-non-convex-optimization-than-sgd.pdf. Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. Ar Xiv e-prints abs/1603.05643, 2016. Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In International conference on machine learning, pp. 1080 1089, 2016. Yves Chauvin and David E. Rumelhart (eds.). Backpropagation: Theory, Architectures, and Applications. L. Erlbaum Associates Inc., Hillsdale, NJ, USA, 1995. ISBN 0-8058-1259-8. Aaron Defazio. On the ineffectiveness of variance reduced optimization for deep learning, 2019. URL https://openreview.net/forum?id=B1MIBs05F7. Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646 1654, 2014. John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121 2159, 2011. Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pp. 689 699, 2018. F. Maxwell Harper and Joseph A. Konstan. The movielens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4):19:1 19:19, December 2015. ISSN 2160-6455. doi: 10.1145/ 2827872. URL http://doi.acm.org/10.1145/2827872. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Co RR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385. Sepp Hochreiter and J urgen Schmidhuber. Long short-term memory. Neural computation, 9(8): 1735 1780, 1997. Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, NIPS 13, pp. 
315 323, USA, 2013. Curran Associates Inc. URL http: //dl.acm.org/citation.cfm?id=2999611.2999647. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-10 (canadian institute for advanced research). URL http://www.cs.toronto.edu/ kriz/cifar.html. Yann Le Cun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann. lecun.com/exdb/mnist/. Lihua Lei and Michael Jordan. Less than a Single Pass: Stochastically Controlled Stochastic Gradient. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54, pp. 148 156. PMLR, 2017. Lihua Lei and Michael I Jordan. On the adaptivity of stochastic gradient-based optimization. ar Xiv preprint ar Xiv:1904.04480, 2019. Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finitesum optimization via scsg methods. In Advances in Neural Information Processing Systems 30, pp. 2348 2358. 2017. URL http://papers.nips.cc/paper/ 6829-non-convex-finite-sum-optimization-via-scsg-methods.pdf. Published as a conference paper at ICLR 2020 Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert Mac Intyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. The penn treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, HLT 94, pp. 114 119, Stroudsburg, PA, USA, 1994. Association for Computational Linguistics. ISBN 1-55860357-3. doi: 10.3115/1075812.1075835. URL https://doi.org/10.3115/1075812. 1075835. David R Musser. Introspective sorting and selection algorithms. Software: Practice and Experience, 27(8):983 993, 1997. Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574 1609, 2009. Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011. URL http://ufldl.stanford.edu/ housenumbers/nips2011_housenumbers.pdf. Lam M Nguyen, Marten van Dijk, Dzung T Phan, Phuong Ha Nguyen, Tsui-Wei Weng, and Jayant R Kalagnanam. Optimal finite-sum smooth non-convex optimization with sarah. ar Xiv preprint ar Xiv:1901.07648, 2019. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De Vito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017. Nhan H Pham, Lam M Nguyen, Dzung T Phan, and Quoc Tran-Dinh. Proxsarah: An efficient algorithmic framework for stochastic composite nonconvex optimization. ar Xiv preprint ar Xiv:1902.05679, 2019. Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. ar Xiv preprint ar Xiv:1603.06160, 2016a. Sashank J Reddi, Suvrit Sra, Barnab as P oczos, and Alex Smola. Fast incremental method for nonconvex optimization. ar Xiv preprint ar Xiv:1603.06159, 2016b. Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. Bpr: Bayesian personalized ranking from implicit feedback. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pp. 452 461. AUAI Press, 2009. H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400 407, 1951. 
Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pp. 2663 2671, 2012. Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. Co RR, abs/1809.07599, 2018. URL http://arxiv.org/abs/1809.07599. Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. Spiderboost: A class of faster variance-reduced algorithms for nonconvex optimization. Co RR, abs/1810.10690, 2018. URL http://arxiv.org/abs/1810.10690. Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057 2075, 2014. Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduction for nonconvex optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3925 3936. Curran Associates Inc., 2018. Published as a conference paper at ICLR 2020 A TECHNICAL PROOFS A.1 PREPARATORY RESULTS Lemma 2 (Lemma 3.1 of Lei & Jordan (2019)). Let N Geom(m). Then for any sequence D0, D1, . . . with E|DN| < , E(DN DN+1) = 1 m (D0 EDN) . Remark 1. The requirement E|DN| < is essential. A useful sufficient condition if |Dt| = O(Poly(t)) because a geometric random variable has finite moments of any order. Lemma 3 (Lemma B.2 of Lei & Jordan (2019)). Let z1, . . . , z M Rd be an arbitrary population and J be a uniform random subset of [M] with size m. Then j=1 zj 2 2. Proof of Lemma 1. WLOG, assume that |x1| |x2| . . . |xd|. Let S be a random subset of {k1 + 1, . . . , d} with size k2. Then rtopk1,k2(x, y) I(ℓ k1) + d k1 k2 I(ℓ S) . As a result, E h rtopk1,k2(x, y) I(ℓ k1) + d k1 k2 I(ℓ> k1)P(ℓ S) = yℓ, Var h rtopk1,k2(x, y) 2 y2 ℓI(ℓ> k1)P(ℓ S)(1 P(ℓ S)) k2 y2 ℓI(ℓ> k1). Var rtopk1,k2(x, y) = d k1 k2 ℓ>k1 y2 ℓ= d k1 k2 k2 top k1(x, y) 2. A.2 ANALYSIS OF A SINGLE INNER LOOP Lemma 4. For any j, t, Ej,t(ν(j) t+1 ν(j) t ) = f(x(j) t+1) f(x(j) t ) Varj,t(ν(j) t+1 ν(j) t ) η2L2 b ν(j) t 2 + d k1 k2 g(j) t + G(j) t b where Ej,t and Varj,t are taken over the randomness of I(j) t and the random subset S involved in the rtopk1,k2 operator. Proof. By definition, ν(j) t+1 ν(j) t = rtopk1,k2 M (j) t , f I(j) t (x(j) t+1) f I(j) t (x(j) t ) . Published as a conference paper at ICLR 2020 Let S be the random subset involved in rtopk1,k2. Then S is independent of (I(j) t , M (j) t , x(j) t+1, x(j) t ). By Lemma 1, ES ν(j) t+1 ν(j) t = f I(j) t (x(j) t+1) f I(j) t (x(j) t ) Var S ν(j) t+1 ν(j) t = d k1 k2 top k1 M (j) t , f I(j) t (x(j) t+1) f I(j) t (x(j) t ) 2 . Since I(j) t is independent of (M (j) t , x(j) t+1, x(j) t ), the tower property of conditional expectation and variance implies that Ej,t ν(j) t+1 ν(j) t = EI(j) t f I(j) t (x(j) t+1) f I(j) t (x(j) t ) = f(x(j) t+1) f(x(j) t ), Varj,t ν(j) t+1 ν(j) t = EI(j) t Var S ν(j) t+1 ν(j) t + Var I(j) t ES ν(j) t+1 ν(j) t . (8) To bound the first term, we note that top k1 is linear in y and thus top k1 M (j) t , f I(j) t (x(j) t+1) f I(j) t (x(j) t ) 2 = EI(j) t top k1 M (j) t , f I(j) t (x(j) t+1) f I(j) t (x(j) t ) 2 + Var I(j) t h top k1 M (j) t , f I(j) t (x(j) t+1) f I(j) t (x(j) t ) i = g(j) t + Var I(j) t top k1(M (j) t , fi(x(j) t+1) fi(x(j) t )) g(j) t + G(j) t b , (9) where the last inequality uses Lemma 3. 
To bound the second term of (8), by Lemma 3, ES ν(j) t+1 ν(j) t = Var I(j) t f I(j) t (x(j) t+1) f I(j) t (x(j) t ) i=1 fi(x(j) t+1) fi(x(j) t ) 2 (i) L2 b x(j) t+1 x(j) t 2 (ii) = η2L2 b ν(j) t 2, where (i) uses assumption A1 and (ii) uses the definition that x(j) t+1 = x(j) t ην(j) t . Lemma 5. For any j, t, Ej,t ν(j) t+1 f(x(j) t+1) 2 ν(j) t f(x(j) t ) 2 + η2L2 b ν(j) t 2 + d k1 k2 g(j) t + G(j) t b where Ej,t and Varj,t are taken over the randomness of I(j) t and the random subset S involved in the rtopk1,k2 operator. Proof. By Lemma 4, we have ν(j) t+1 f(x(j) t+1) = ν(j) t f(x(j) t ) + ν(j) t+1 ν(j) t Ej,t(ν(j) t+1 ν(j) t ) . Since I(j) t is independent of (ν(j) t , x(j) t ), Covj,t ν(j) t f(x(j) t ), ν(j) t+1 ν(j) t = 0. As a result, Ej,t ν(j) t+1 f(x(j) t+1) 2 = ν(j) t f(x(j) t ) 2 + Varj,t(ν(j) t+1 ν(j) t ). The proof is then completed by Lemma 4. Published as a conference paper at ICLR 2020 Lemma 6. For any j, Ej ν(j) Nj f(x(j) Nj) 2 mη2L2 b Ej ν(j) Nj 2 + σ2I(B < n) B + (d k1 k2)m where Ej is taken over all randomness in j-th outer loop (line 4-13 of Algorithm 1). 4. Proof. By definition, ν(j) t+1 ν(j) t + rtopk1,k2 M (j) t , f I(j) t (x(j) t+1) f I(j) t (x(j) t ) ν(j) t + f I(j) t (x(j) t+1) f I(j) t (x(j) t ) fi(x(j) t+1) fi(x(j) t ) fi(x(j) t+1) fi(x(j) t ) 2 v u u u u t fi(x(j) t+1) 2 + X fi(x(j) t ) 2 fi(x(j) t+1) 2 + 1 fi(x(j) t ) 2 ! As a result, ν(j) t ν(j) 0 + t 2nσ, (10) Thus, ν(j) t f(x(j) t ) 2 2 ν(j) t 2 + 2 f(x(j) t ) 2 = Poly(t). This implies that we can apply Lemma 2 on the sequence Dt = ν(j) t f(x(j) t ) 2. Letting j = Nj in Lemma 5 and taking expectation over all randomness in Ej, we have Ej ν(j) Nj+1 f(x(j) Nj+1) 2 Ej ν(j) Nj f(x(j) Nj) 2 + η2L2 b Ej ν(j) Nj 2 + d k1 k2 g(j) Nj + G(j) Nj b = Ej ν(j) Nj f(x(j) Nj) 2 + η2L2 b Ej ν(j) Nj 2 + d k1 k2 k2 Rj. (11) By Lemma 2, Ej ν(j) Nj f(x(j) Nj) 2 Ej ν(j) Nj+1 f(x(j) Nj+1) 2 ν(j) 0 f(x(j) 0 ) 2 Ej ν(j) Nj f(x(j) Nj) 2 Ej ν(j) 0 f(xj 1) 2 Ej ν(j) Nj f(xj) 2 , (12) where the last line uses the definition that xj 1 = x(j) 0 , xj = x(j) Nj. By Lemma 3, Ej ν(j) 0 f(xj 1) 2 σ2I(B < n) The proof is completed by putting (11), (12) and (13) together. Published as a conference paper at ICLR 2020 Lemma 7. For any j, t, f(x(j) t+1) f(x(j) t ) + η 2 ν(j) t f(x(j) t ) 2 η 2 f(x(j) t ) 2 η 2(1 ηL) ν(j) t 2. Proof. By (4), f(x(j) t+1) f(x(j) t ) + D f(x(j) t ), x(j) t+1 x(j) t E + L 2 x(j) t x(j) t+1 2 = f(x(j) t ) η D f(x(j) t ), ν(j) t E + η2L = f(x(j) t ) + η 2 ν(j) t f(x(j) t ) 2 η 2 f(x(j) t ) 2 η 2 ν(j) t 2 + η2L 2 ν(j) t 2. The proof is then completed. Lemma 8. For any j, Ej f(xj) 2 2 ηm Ej(f(xj 1) f(xj)) + Ej ν(j) Nj f(xj) 2 (1 ηL)Ej ν(j) Nj 2, where Ej is taken over all randomness in j-th outer loop (line 4-13 of Algorithm 1). Proof. Since f(x) σ for any x, |f(x(j) t+1) f(x(j) t )| σ ν(j) t . This implies that |f(x(j) t )| σ k=0 ν(j) t + |f(x(j) 0 )|. As shown in (10), ν(j) t = Poly(t) and thus |f(x(j) t )| = Poly(t). This implies that we can apply Lemma 2 on the sequence Dt = f(x(j) t ). Letting j = Nj in Lemma 7 and taking expectation over all randomness in Ej, we have Ejf(x(j) Nj+1) Ejf(x(j) Nj) + η 2 ν(j) Nj f(x(j) Nj) 2 η 2 f(x(j) Nj) 2 η 2(1 ηL) ν(j) Nj 2. By Lemma 2, Ejf(x(j) Nj) Ejf(x(j) Nj+1) = 1 m Ej(f(x(j) 0 ) f(x(j) Nj)) = 1 m Ej(f(xj 1) f(xj)). The proof is then completed. Combining Lemma 6 and Lemma 8, we arrive at the following key result on one inner loop. Theorem 3. For any j, E f(xj) 2 2 ηm Ej(f(xj 1) f(xj)) + σ2I(B < n) B + (d k1 k2)m Ej ν(j) Nj 2. A.3 COMPLEXITY ANALYSIS Proof of Theorem 1. 
By definition (7) of Rj and the smoothness assumption A1, b L2E x(j) Nj+1 x(j) Nj 2 2η2L2E ν(j) Nj 2. Published as a conference paper at ICLR 2020 By Theorem 3, E f(xj) 2 2 ηm Ej(f(xj 1) f(xj)) + σ2I(B < n) b 2(d k1 k2)mη2L2 Ej ν(j) Nj 2. Since ηL = p b + 2(d k1 k2)mη2L2 As a result, E f(xj) 2 2 ηm Ej(f(xj 1) f(xj)) + σ2I(B < n) Since xout = x T where T Unif([T]), we have E f(xout) 2 2 ηm T E(f(x0) f(x T +1)) + σ2I(B < n) ηm T + σ2I(B < n) The setting of T and B guarantees that 2 f ηm T ϵ2 2 , σ2I(B < n) Therefore, E f(xout) 2 ϵ2. By Cauchy-Schwarz inequality, E f(xout) p E f(xout) 2 ϵ. In this case, the average computation cost is ECcomp(ϵ) = T(ϵ) B + 2(k1 + k2) d bm = 3BT(ϵ) The proof is then proved by the setting of B. Proof of Theorem 2. Under the setting of η, By Theorem 3, E f(xj) 2 2 ηm Ej(f(xj 1) f(xj)) + σ2I(B < n) B + d k1 k2 By definition of xout, E f(xout) 2 2 f ηm T + σ2I(B < n) B + (d k1 k2)m Under the settings of T and B, 2 f ηm T ϵ2 3 , σ2I(B < n) This proves the first result. The second result follows directly. For the computation cost, similar to the proof of Theorem 1, we have ECcomp(ϵ) = O(BT) = O The proof is then completed by trivial algebra. Published as a conference paper at ICLR 2020 Figure 3: Spider Boost with various values of sparsity, where (sparsity=k/d) corresponds to Spider Boost with sparsity k/d. Both figures use MNIST. The x-axis measures gradient queries over N, where N is the size of the respective datasets. Plots are in log-scale. B EXPERIMENTS B.1 DESCRIPTION OF SIMPLE CONVOLUTIONAL NEURAL NETWORK The simple convolutional neural network used in the experiments consists of a convolutional layer with a kernel size of 5, followed by a max pool layer with kernel size 2, followed by another convolutional layer with kernel size 5, followed by a fully connected layer of input size 16 side2 120 (side is the size of the second dimension of the input), followed by a fully connected layer of size 120 84, followed by a final fully connected layer of size 84 the output dimension. B.2 NATURAL LANGUAGE PROCESSING The natural language processing model consists of a word embedding of dimension 128 of 1000 tokens, which is jointly learned with the task. The LSTM has a hidden and cell state dimension of 1024. Algorithm 2: Spider Boost for Natural Language Processing. Input: Learning rate η, inner loop size m, number of iterations T, large batch matrix Z2 with ℓ2 batches of size B, small batch matrix Z1 with ℓ1 batches of size b, initial iterate x0, initial states s0 and S0. 1 for t = 0, ..., T 1 do 2 i = mod(t, ℓ1) 3 j = mod(t, ℓ2) 4 if i = 0 then 6 if j = 0 then 8 if mod(t, m) = 0 then 9 νt, St+1 := f Z2j(xt, St) 10 st+1 = st 12 gp := f Z1i(xt 1, st 1) 13 gc, st+1 := f Z1i(xt, st) 14 νt := νt 1 + (gc gb) 15 St+1 = St 16 xt+1 := xt ηνt Output: x T Before describing Algorithm 2, let us derive the full batch gradient of a generative language model. We encode the vocabulary of our dataset of length N so that D NN is a sequence of integers corresponding to one-hot encodings of each token. We model the transition p(Di+1|Di, si) using an RNN model M as M(Di, si) = Di+1, si+1, where si is the sequential model state at step i. The Published as a conference paper at ICLR 2020 model M can be thought of as a classifier with cross entropy loss L and additional dependence on si. The batch gradient objective can therefore be formulated by considering the full sequence of predictions from i = 0 to i = N 1, generating for each step i the output ˆDi+1, si+1. 
Each token is one-hot encoded as an integer (from 0 to the size of the vocabulary), so the empirical risk is given by J(D; x) = 1 i=0 L( ˆDi, Di). Thus, the full batch gradient is simply the gradient of J with respect to x. In Algorithm 2, D is split into b contiguous sequences of length ℓ1 = N/b and stored in a matrix Z1 Nb ℓ1. Taking a pass over Z1 requires maintaining a state si Nb for each entry in a batch, which is reset before every pass over Z1. To deal with maintaining state for batches at different time scales, we define a different matrix Z2 Nb ℓ2 which maintains a different set of states Si NB for each entry of batch size B. We denote by g, st+1 = f Z1j(x, st) the gradient of our model with respect to x, where f Z1j denotes the gradient function corresponding to the jth batch of matrix Z1. The function f Z1j simply computes the loss of the jth batch of matrix Z1.
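To make the preceding description concrete, the sketch below gives one possible PyTorch rendering of the stateful gradient function g, s_{t+1} = f_{Z1,j}(x, s_t): it evaluates the next-token cross-entropy on one batch of Z1 and returns parameter gradients together with a detached recurrent state. The class and function names are our own, the dimensions follow the description above (1000-token vocabulary, embedding size 128, hidden size 1024), and the state-reset bookkeeping of Algorithm 2 is omitted. In Algorithm 2, ν_t would then be formed from two such calls on consecutive iterates (g_p and g_c).

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Minimal stand-in for the language model: embedding -> LSTM -> vocab logits."""
    def __init__(self, vocab=1000, emb=128, hidden=1024):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, tokens, state=None):
        h, state = self.lstm(self.emb(tokens), state)
        return self.out(h), state

def grad_Z1j(model, Z1j, state):
    """Sketch of g, s_{t+1} = f_{Z1,j}(x, s_t): next-token cross-entropy over the
    j-th batch of Z1 (a LongTensor of token ids, shape [batch, length]), returning
    parameter gradients and the updated, detached recurrent state."""
    inputs, targets = Z1j[:, :-1], Z1j[:, 1:]
    logits, new_state = model(inputs, state)
    loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                       targets.reshape(-1))
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return grads, tuple(s.detach() for s in new_state)

model = TinyLM()
g, s = grad_Z1j(model, torch.randint(0, 1000, (20, 35)), None)
```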