# safe_adaptive_importance_sampling__4047e365.pdf

Safe Adaptive Importance Sampling

Sebastian U. Stich EPFL sebastian.stich@epfl.ch

Anant Raj Max Planck Institute for Intelligent Systems anant.raj@tuebingen.mpg.de

Martin Jaggi EPFL martin.jaggi@epfl.ch

Importance sampling has become an indispensable strategy to speed up optimization algorithms for large-scale applications. Improved adaptive variants using importance values deﬁned by the complete gradient information which changes during optimization enjoy favorable theoretical properties, but are typically computationally infeasible. In this paper we propose an efﬁcient approximation of gradient-based sampling, which is based on safe bounds on the gradient. The proposed sampling distribution is (i) provably the best sampling with respect to the given bounds, (ii) always better than uniform sampling and ﬁxed importance sampling and (iii) can efﬁciently be computed in many applications at negligible extra cost. The proposed sampling scheme is generic and can easily be integrated into existing algorithms. In particular, we show that coordinate-descent (CD) and stochastic gradient descent (SGD) can enjoy signiﬁcant a speed-up under the novel scheme. The proven efﬁciency of the proposed sampling is veriﬁed by extensive numerical testing.

1 Introduction

Modern machine learning applications operate on massive datasets. The algorithms that are used for data analysis face the difﬁcult challenge to cope with the enormous amount of data or the vast dimensionality of the problems. A simple and well established strategy to reduce the computational costs is to split the data and to operate only on a small part of it, as for instance in coordinate descent (CD) methods and stochastic gradient (SGD) methods. These kind of methods are state of the art for a wide selection of machine learning, deep leaning and signal processing applications [9, 11, 35, 27]. The application of these schemes is not only motivated by their practical preformance, but also well justiﬁed by theory [18, 19, 2].

Deterministic strategies are seldom used for the data selection examples are steepest coordinate descent [4, 34, 20] or screening algorithms [14, 15]. Instead, randomized selection has become ubiquitous, most prominently uniform sampling [27, 29, 7, 8, 28] but also non-uniform sampling based on a ﬁxed distribution, commonly referred to as importance sampling [18, 19, 2, 33, 16, 6, 25, 24]. While these sampling strategies typically depend on the input data, they do not adapt to the information of the current parameters during optimization. In contrast, adaptive importance sampling strategies constantly re-evaluate the relative importance of each data point during training and thereby often surpass the performance of static algorithms [22, 5, 26, 10, 21, 23]. Common strategies are gradientbased sampling [22, 36, 37] (mostly for SGD) and duality gap-based sampling for CD [5, 23].

The drawbacks of adaptive strategies are twofold: often the provable theoretical guarantees can be worse than the complexity estimates for uniform sampling [23, 3] and often it is computationally

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

inadmissible to compute the optimal adaptive sampling distribution. For instance gradient based sampling requires the computation of the full gradient in each iteration [22, 36, 37]. Therefore one has to rely on approximations based on upper bounds [36, 37], or stale values [22, 1]. But in general these approximations can again be worse than uniform sampling.

This makes it necessary to develop adaptive strategies that can efﬁciently be computed in every iteration and that come with theoretical guarantees that show their advantage over ﬁxed sampling.

Our contributions. In this paper we propose an efﬁcient approximation of the gradient-based sampling in the sense that (i) it can efﬁciently be computed in every iteration, (ii) is provably better than uniform or ﬁxed importance sampling and (iii) recovers the gradient-based sampling in the fullinformation setting. The scheme is completely generic and can easily be added as an improvement to both CD and SGD type methods.

As our key contributions, we

(1) show that gradient-based sampling in CD methods is theoretically better than the classical ﬁxed sampling, the speed-up can reach a factor of the dimension n (Section 2); (2) propose a generic and efﬁcient adaptive importance sampling strategy that can be applied in CD and SGD methods and enjoys favorable properties such as mentioned above (Section 3); (3) demonstrate how the novel scheme can efﬁciently be integrated in CD and SGD on an important class of structured optimization problems (Section 4); (4) supply numerical evidence that the novel sampling performs well on real data (Section 5).

Notation. For x Rn deﬁne [x]i := x, ei with ei the standard unit vectors in Rn. We abbreviate if := [ f]i. A convex function f : Rn R with L-Lipschitz continuous gradient satisﬁes

f(x + ηu) f(x) + η u, f(x) + η2Lu

2 u 2 x Rn, η R , (1)

for every direction u Rn and Lu = L. A function with coordinate-wise Li-Lipschitz continuous gradients1 for constants Li > 0, i [n] := {1, . . . , n}, satisﬁes (1) just along coordinate directions, i.e. u = ei, Lei = Li for every i [n]. A function is coordinate-wise L-smooth if Li L for i = 1, . . . , n. For convenience we introduce vector l = (L1, . . . , n) and matrix L = diag(l). A probability vector p n := {x Rn 0 : x 1 = 1} deﬁnes a probability distribution P over [n] and we denote by i p a sample drawn from P.

2 Adaptive Importance Sampling with Full Information

In this section we argue that adaptive sampling strategies are theoretically well justiﬁed, as they can lead to signiﬁcant improvements over static strategies. In our exhibition we focus ﬁrst on CD methods, as we also propose a novel stepsize strategy for CD in this contribution. Then we revisit the results regarding stochastic gradient descent (SGD) already present in the literature.

2.1 Coordinate Descent with Adaptive Importance Sampling

We address general minimization problems minx f(x). Let the objective f : Rn R be convex with coordinate-wise Li-Lipschitz continuous gradients. Coordinate descent methods generate sequences {xk}k 0 of iterates that satisfy the relation

xk+1 = xk γk ikf(xk)eik . (2)

Here, the direction ik is either chosen deterministically (cyclic descent, steepest descent), or randomly picked according to a probability vector pk n. In the classical literature, the stepsize is often chosen such as to minimize the quadratic upper bound (1), i.e. γk = L 1 ik . In this work we propose to set γk = αk[pk] 1 ik where αk does not depend on the chosen direction ik. This leads to

1| if(x + ηei) if(x)| Li |η| , x Rn, η R.

directionally-unbiased updates, like it is common among SGD-type methods. It holds

Eik pk [f(xk+1) | xk] (1) Eik pk

f(xk) αk [pk]ik ( ikf(xk))2 + Likα2 k 2[pk]2 ik ( ikf(xk))2 | xk

= f(xk) αk f(xk) 2 2 +

Liα2 k 2[pk]i ( if(xk))2 . (3)

In adaptive strategies we have the freedom to chose both variables αk and pk as we like. We therefore propose to chose them in such a way that they minimize the upper bound (3) in order to maximize the expected progress. The optimal pk in (3) is independent of αk, but the optimal αk depends on pk. We can state the following useful observation. Lemma 2.1. If αk = αk(pk) is the minimizer of (3), then xk+1 := xk αk [pk]ik ikf(xk)eik satisﬁes

Eik pk [f(xk+1) | xk] f(xk) αk(pk)

2 f(xk) 2 2 . (4)

Consider two examples. In the ﬁrst one we pick a sub-optimal, but very common [18] distribution: Example 2.2 (Li-based sampling). Let p L n deﬁned as [p L]i = Li Tr[L] for i [n], where L = diag(L1, . . . , Ln). Then αk(p L) = 1 Tr[L].

The distribution p L is often referred to as (ﬁxed) importance sampling. In the special case when Li = L for all i [n], this boils down to uniform sampling.

Example 2.3 (Optimal sampling2). Equation (3) is minimized for probabilities [p k]i = Li| if(xk)|

L f(xk) 1 and αk(p k) = f(xk) 2 2

1 . Observe 1 Tr[L] αk(p k) 1 Lmin , where Lmin := mini [n] Li.

To prove this result, we rely on the following Lemma the proof of which, as well as for the claims above, is deferred to Section A.1 of the appendix. Here | | is applied entry-wise.

Lemma 2.4. Deﬁne V (p, x) := Pn i=1 Li[x]2 i [p]i . Then arg minp n V (p, x) = |

The ideal adaptive algorithm. We propose to chose the stepsize and the sampling distribution for CD as in Example 2.3. One iteration of the resulting CD method is illustrated in Algorithm 1. Our bounds on the expected one-step progress can be used to derive convergence rates of this algorithm with the standard techniques. This is exempliﬁed in Appendix A.1. In the next Section 3 we develop a practical variant of the ideal algorithm.

Efﬁciency gain. By comparing the estimates provided in the examples above, we see that the expected progress of the proposed method is always at least as good as for the ﬁxed sampling. For instance in the special case where L = Li for i [n], the Li-based sampling is just uniform sampling with αk(punif) = 1 Ln. On the other hand αk(p k) = f(xk) 2 2 L f(xk) 2 1 , which can be n times larger than αk(punif). The expected one-step progress in this extreme case coincides with the one-step progress of steepest coordinate descent [20].

2.2 SGD with Adaptive Sampling

SGD methods are applicable to objective functions which decompose as a sum

n Pn i=1 fi(x) (5)

with each fi : Rd R convex. In previous work [22, 36, 37] is has been argued that the following gradient-based sampling [ p k]i = fi(xk) 2 Pn i=1 fi(xk) 2 is optimal in the sense that it maximizes the expected progress (3). Zhao and Zhang [36] derive complexity estimates for composite functions. For non-composite functions it becomes easier to derive the complexity estimate. For completeness, we add this simpler proof in Appendix A.2.

2Here optimal refers to the fact that p k is optimal with respect to the given model (1) of the objective function. If the model is not accurate, there might exist a sampling that yields larger expected progress on f.

Algorithm 1 Optimal sampling

(compute full gradient) Compute f(xk)

(deﬁne optimal sampling) Deﬁne (p k, α k) as in Example 2.3 ik p k

xk+1 := xk α k [p k]ik ikf(xk)

Algorithm 2 Proposed safe sampling

(update l.- and u.-bounds) Update ℓ, u

(compute safe sampling) Deﬁne (ˆpk, ˆαk) as in (7) ik ˆpk Compute ikf(xk)

xk+1 := xk ˆαk [ˆpk]ik ikf(xk)

Algorithm 3 Fixed sampling

(deﬁne ﬁxed sampling) Deﬁne (p L, α) as in Example 2.2 ik p L Compute ikf(xk)

xk+1 := xk α [p L]ik ikf(xk)

Figure 1: CD with different sampling strategies. Whilst Alg. 1 requires to compute the full gradient, the compute operation in Alg. 2 is as cheap as for ﬁxed importance sampling, Alg. 3. Deﬁning the safe sampling ˆpk requires O(n log n) time.

3 Safe Adaptive Importance Sampling with Limited Information

In the previous section we have seen that gradient-based sampling (Example 2.3) can yield a massive speed-up compared to a static sampling distribution (Example 2.2). However, sampling according to p k in CD requires the knowledge of the full gradient f(xk) in each iteration. And likewise, sampling from p k in SGD requires the knowledge of the gradient norms of all components both these operations are in general inadmissible, i.e. the compute cost would void all computational beneﬁts of the iterative (stochastic) methods over full gradient methods.

However, it is often possible to efﬁciently compute approximations of p k or p k instead. In contrast to previous contributions, we here propose a safe way to compute such approximations. By this we mean that our approximate sampling is provably never worse than static sampling, and moreover, we show that our solution is the best possible with respect to the limited information at hand.

3.1 An Optimization Formulation for Sampling

Formally, we assume that we have in each iteration access to two vectors ℓk, uk Rn 0 that provide safe upper and lower bounds on either the absolute values of the gradient entries ([ℓk]i | if(xk)| [uk]i) for CD, or of the gradient norms in SGD: ([ℓk]i fi(xk) 2 [uk]i). We postpone the discussion of this assumption to Section 4, where we give concrete examples.

The minimization of the upper bound (3) amounts to the equivalent problem3

min αk min pk n

αk ck 2 2 + α2 k 2 V (pk, ck) min pk n V (pk, ck)

where ck Rn represents the unknown true gradient. That is, with respect to the bounds ℓk, uk, we can write ck Ck := {x Rn : [ℓk]i [x]i [uk]i, i [n]}. In Example 2.3 we derived the optimal solution for a ﬁxed ck Ck. However, this is not sufﬁcient to ﬁnd the optimal solution for an arbitrary ck Ck. Just computing the optimal solution for an arbitrary (but ﬁxed) ck Ck is unlikely to yield a good solution. For instance both extreme cases ck = ℓk and ck = uk (the latter choice is quite common, cf. [36, 23]) might be poor. This is demonstrated in the next example. Example 3.1. Let ℓ= (1, 2) , u = (2, 3) , c = (2, 2) and L1 = L2 = 1. Then V ℓ ℓ 1 , c =

9 4 c 2 2, V u u 1 , c = 25

12 c 2 2, whereas for uniform sampling V c c 1 , c = 2 c 2 2.

The proposed sampling. As a consequence of these observations, we propose to solve the following optimization problem to ﬁnd the best sampling distribution with respect to Ck:

vk := min p n max c Ck V (p, c)

c 2 2 , and to set (αk, pk) := 1

vk , ˆpk , (7)

where ˆpk denotes a solution of (7). The resulting algorithm for CD is summarized in Alg. 2.

In the remainder of this section we discuss the properties of the solution ˆpk (Theorem 3.2) and how such a solution can be efﬁciently be computed (Theorem 3.4, Algorithm 4).

3Although only shown here for CD, an equivalent optimization problem arises for SGD methods, cf. [36].

3.2 Proposed Sampling and its Properties

Theorem 3.2. Let (ˆp, ˆc) n Rn 0 denote a solution of (7). Then Lmin vk Tr [L] and

(i) max c Ck V (ˆp, c)

c 2 2 max c Ck V (p, c)

c 2 2 , p n; (ˆp has the best worst-case guarantee)

(ii) V (ˆp, c) Tr [L] c 2 2, c Ck. (ˆp is always better than Li-based sampling) Remark 3.3. In the special case Li = L for all i [n], the Li-based sampling boils down to uniform sampling (Example 2.2) and ˆp is better than uniform sampling: V (ˆp, c) Ln c 2 2, c Ck.

Proof. Property (i) is an immediate consequence of (7). Moreover, observe that the Li-based sampling p L is a feasible solution in (7) with value V (p L,c)

c 2 2 Tr [L] for all c Ck. Hence

Lc 2 1 c 2 2

2.4 = min p n V (p, c)

c 2 2 V (ˆp, c)

( ) V (ˆp, ˆc)

(7) max c Ck V (p L, c)

c 2 2 = Tr [L] , (8)

for all c Ck, thus vk [Lmin, Tr [L]] and (ii) follows. We prove inequality ( ) in the appendix, by showing that min and max can be interchanged in (7).

A geometric interpretation. We show in Appendix B that the optimization problem (7) can equivalently be written as vk = maxc Ck

Lc 1 c 2 = maxc Ck

l,c c 2 , where [l]i = Li for i [n]. The maximum is thus attained for vectors c Ck that minimize the angle with the vector l.

Theorem 3.4. Let c Ck, p =

Lc 1 and denote m = c 2 2

Lc 1 1 . If

[uk]i if [uk]i Lim , [ℓk]i if [ℓk]i Lim , Lim otherwise, i [n] , (9)

then (p, c) is a solution to (7). Moreover, such a solution can be computed in time O(n log n).

Proof. This can be proven by examining the optimality conditions of problem (7). This is deferred to Section B.1 of the appendix. A procedure that computes such a solution is depicted in Algorithm 4. The algorithm makes extensive use of (9). For simplicity, assume ﬁrst L = In for now. In each iteration t , a potential solution vector ct is proposed, and it is veriﬁed whether this vector satisﬁes all optimality conditions. In Algorithm 4, ct is just implicit, with [ct]i = [c]i for decided indices i D and [ct]i = [

Lm]i for undecided indices i / D. After at most n iterations a valid solution is found. By sorting the components of

L 1uk by their magnitude, at most a linear number of inequality checks in (9) have to be performed in total. Hence the running time is dominated by the O(n log n) complexity of the sorting algorithm. A formal proof is given in the appendix.

Algorithm 4 Computing the Safe Sampling for Gradient Information ℓ, u

1: Input: 0n ℓ u, L, Initialize: c = 0n, u = 1, ℓ= n, D = . 2: ℓsort := sort_asc(

L 1ℓ), usort := sort_asc(

L 1u), m = max(ℓsort) 3: while u ℓdo 4: if [ℓsort]ℓ> m then (largest undecided lower bound is violated) 5: Set corresponding [c]index := [

Lℓsort]ℓ; ℓ:= ℓ 1; D := D {index} 6: else if [usort]u < m then (smallest undecided upper bound is violated) 7: Set corresponding [c]index := [

Lusort]u; u := u + 1; D := D {index} 8: else 9: break (no constraints are violated) 10: end if 11: m := c 2 2

Lc 1 1 (update m as in (9)) 12: end while 13: Set [c]i :=

Lim for all i / D and Return c, p =

Lc 2 1 c 2 2

Competitive Ratio. We now compare the proposed sampling distribution ˆpk with the optimal sampling solution in hindsight. We know that if the true (gradient) vector c Ck would be given to us, then the corresponding optimal probability distribution would be p ( c) =

L c 1 (Example 2.3).

Thus, for this c we can now analyze the ratio V (ˆpk, c) V (p ( c), c). As we are interested in the worst case ratio among all possible candidates c Ck, we deﬁne

ρk := max c Ck V (ˆp, c) V (p (c), c) = max c Ck V (ˆp, c)

Lc 2 1 . (10)

Lemma 3.5. Let wk := minc Ck

Lc 2 1 c 2 2 . Then Lmin wk vk, and ρk vk

wk ( vk Lmin ).

Lemma 3.6. Let γ 1. If [Ck]i γ[Ck]i = and γ 1[Ck]i [Ck]i = for all i [n] (here [Ck]i denotes the projection on the i-th coordinate), then ρk γ4.

These two lemma provide bounds on the competitive ratio. Whilst Lemma 3.6 relies on a relative accuracy condition, Lemma 3.5 can always be applied. However, the corresponding minimization problem is non-convex. Note that knowledge of ρk is not needed to run the algorithm.

4 Example Safe Gradient Bounds

In this section, we argue that for a large class of objective functions of interest in machine learning, suitable safe upper and lower bounds ℓ, u on the gradient along every coordinate direction can be estimated and maintained efﬁciently during optimization. A similar argument can be given for the efﬁcient approximation of component wise gradient norms in ﬁnite sum objective based stochastic gradient optimization.

As the guiding example, we will here showcase the training of generalized linear models (GLMs) as e.g. in regression, classiﬁcation and feature selection. These models are formulated in terms of a given data matrix A Rd n with columns ai Rd for i [n].

Coordinate Descent - GLMs with Arbitrary Regularizers. Consider general objectives of the form f(x) := h(Ax) + Pn i=1 ψi([x]i) with an arbitrary convex separable regularizer term given by the ψi : R R for i [n]. A key example is when h: Rd R describes the least-squares regression objective h(Ax) = 1

2 Ax b 2 2 for a b Rd. Using that this h is twice differentiable with 2h(Ax) = In, it is easy to see that we can track the evolution of all gradient entries, when performing CD steps, as follows:

if(xk+1) if(xk) = γk ai, aik , i = ik . (11)

for ik being the coordinate changed in step k (here we also used the separability of the regularizer).

Therefore, all gradient changes can be tracked exactly if the inner products of all datapoints are available, or approximately if those inner products can be upper and lower bounded. For computational efﬁciency, we in our experiments simply use Cauchy-Schwarz | ai, aik | ai aik . This results in safe upper and lower bounds [ℓk+1]i if(xk+1) [uk+1]i for all inactive coordinates i = ik. (For the active coordinate ik itself one observes the true value without uncertainty). These bounds can be updated in linear time O(n) in every iteration.

For general smooth h (again with arbitrary separable regularizers ψi), (11) can readily be extended to hold [32, Lemma 4.1], the inner product change term becoming ai, 2f(A x)aik instead, when assuming h is twice-differentiable. Here x will be an element of the line segment [xk, xk+1].

Stochastic Gradient Descent - GLMs. We now present a similar result for ﬁnite sum problems (5) for the use in SGD based optimization, that is f(x) := 1

n Pn i=1 fi(x) = 1

n Pn i=1 hi(a i x).

Lemma 4.1. Consider f : Rd R as above, with twice differentiable hi : R R. Let xk, xk+1 Rd denote two successive iterates of SGD, i.e. xk+1 := xk ηk aik hik(a ikxk) = xk + γk aik. Then there exists x Rd on the line segment between xk and xk+1, x [xk, xk+1] with

fi(xk+1) fi(xk) = γk 2hi(a i x) ai, aik ai , i = ik . (12)

This leads to safe upper and lower bounds for the norms of the partial gradient, [ℓk]i fi(xk) 2 [uk]i, that can be updated in linear time O(n), analogous to the coordinate case discussed above.4

We note that there are many other ways to track safe gradient bounds for relevant machine learning problems, including possibly more tight ones. We here only illustrate the simplest variants, highlighting the fact that our new sampling procedure works for any safe bounds ℓ, u.

Computational Complexity. In this section, we have demonstrated how safe upper and lower bounds ℓ, u on the gradient information can be obtained for GLMs, and argued that these bounds can be updated in time O(n) per iteration of CD and SGD. The computation of the proposed sampling takes O(n log n) time (Theorem 3.4). Hence, the introduced overhead in Algorithm 2 compared to ﬁxed sampling (Algorithm 3) is of the order O(n log n) in every iteration. The computation of one coordinate of the gradient, ikf(xk), takes time Θ(d) for general data matrices. Hence, when d = Ω(n), the introduced overhead reduces to O(log n) per iteration.

5 Empirical Evaluation

In this section we evaluate the empirical performance of our proposed adaptive sampling scheme on relevant machine learning tasks. In particular, we illustrate performance on generalized linear models with L1 and L2 regularization, as of the form (5),

min x Rd 1 n

i=1 hi(a i x) + λ r(x) (13)

We use square loss, squared hinge loss as well as logistic loss for the data ﬁtting terms hi, and x 1 and x 2 2 for the regularizer r(x). The datasets used in the evaluation are rcv1, real-sim and news20.5 The rcv1 dataset consists of 20,242 samples with 47,236 features, real-sim contains 72,309 datapoints and 20,958 features and news20 contains 19,996 datapoints and 1,355,191 features. For all datasets we set unnormalized features with all the non-zero entries set to 1 (bag-of-words features). By real-sim and rcv1 we denote a subset of the data chosen by randomly selecting 10,000 features and 10,000 datapoints. By news20 we denote a subset of the data chose by randomly selecting 15% of the features and 15% of the datapoints. A regularization parameter λ = 0.1 is used for all experiments.

Our results show the evolution of the optimization objective over time or number of epochs (an epoch corresponding to n individual updates). To compute safe lower and upper bounds we use the methods presented in Section 4 with no special initialization, i.e. ℓ0 = 0n, u0 = n.

Coordinate Descent. In Figure 2 we compare the effect of the ﬁxed stepsize αk = 1 Ln (denoted as small ) vs. the time varying optimal stepsize (denoted as big ) as discussed in Section 2. Results are shown for optimal sampling p k (with optimal stepsize αk(p k), cf. Example 2.3), our proposed sampling ˆpk (with optimal stepsize αk(ˆpk) = v 1 k , cf. (7)) and uniform sampling (with optimal stepsize αk(p L) = 1 Ln, as here L = LIn, cf. Example 2.2). As the experiment aligns with theory conﬁrming the advantage of the varying big stepsizes we only show the results for Algorithms 1 3 in the remaining plots.

Performance for squared hinge loss, as well as logistic regression with L1 and L2 regularization is presented in Figure 3 and Figure 4 respectively. In Figures 5 and 6 we report the iteration complexity vs. accuracy as well as timing vs. accuracy results on the full dataset for coordinate descent with square loss and L1 (Lasso) and L2 regularization (Ridge).

Theoretical Sampling Quality. As part of the CD performance results in Figures 2 6 we include an additional evolution plot on the bottom of each ﬁgure to illustrate the values vk which determine the stepsize (ˆαk = v 1 k ) for the proposed Algorithm 2 (blue) and the optimal stepsizes of Algorithm 1 (black) which rely on the full gradient information. The plots show the normalized values vk Tr[L], i.e. the relative improvement over Li-based importance sampling. The results show that despite only relying on very loose safe gradient bounds, the proposed adaptive sampling is able to strongly beneﬁt from the additional information.

4Here we use the efﬁcient representation fi(x) = θ(x) ai for θ(x) R. 5All data are available at www.csie.ntu.edu.tw/ cjlin/libsvmtools/datasets/

Uniform Proposed (big step) Proposed (small step)

0 -1 -2 -3 -4

(a) rcv1 , L1 reg.

Epochs 0 1 2 5 6

Optimal (big step) Optimal (small step)

0 -1 -2 -3 -4

(b) rcv1 , L2 reg. Figure 2: (CD, square loss) Fixed vs. adaptive sampling strategies, and dependence on stepsizes. With big αk = v 1 k and small αk = 1 Tr[L].

0.98 0.96 0.94 0.92 0.90 0.88 0.86

0 -1 -2 -3 -4 Epochs 0 1 2 5 6

Uniform Proposed Optimal

(a) rcv1 , L1 reg.

Uniform Proposed Optimal

Epochs 0 0.5 1 2.5 3

0 -1 -2 -3 -4

(b) real-sim , L2 reg. Figure 3: (CD, squared hinge loss) Function value vs. number of iterations for optimal stepsize αk = v 1 k .

0 -1 -2 -3 -4

Epochs 0 1 2 5 6

Uniform Proposed Optimal 6.90

(a) rcv1 , L1 reg.

Epochs 0 1 2 5 6

Uniform Proposed Optimal

0 -1 -2 -3 -4

(b) rcv1 , L2 reg.

Epochs 0 0.5 1 2.5 3

Uniform Proposed Optimal

0 -1 -2 -3 -4

(c) real-sim , L1 reg.

Uniform Proposed Optimal

Epochs 0 0.5 1 2.5 3

0 -1 -2 -3 -4

(d) real-sim , L2 reg. Figure 4: (CD, logistic loss) Function value vs. number of iterations for different sampling strategies. Bottom: Evolution of the value vk which determines the optimal stepsize (ˆαk = v 1 k ). The plots show the normalized values vk Tr[L], i.e. the relative improvement over Li-based importance sampling.

Uniform Proposed Optimal

Epochs 0 0.5 1 1.5 3 3.5

0 -1 -2 -3 -4

(a) rcv1, L1 reg.

Uniform Proposed Optimal

Epochs 0 0.5 1 2

0 -1 -2 -3 -4

(b) real-sim, L1 reg. Figure 5: (CD, square loss) Function value vs. number of iterations on the full datasets.

Time 0 2 4 14 6 16 12

Uniform Proposed

0 -1 -2 -3 -4

(a) real-sim, L1 reg.

Uniform Proposed

Time 0 2 4 14 6 16 12

0 -1 -2 -3 -4

(b) real-sim, L2 reg. Figure 6: (CD, square loss) Function value vs. clock time on the full datasets. (Data for the optimal sampling omitted, as this strategy is not competitive time-wise.)

Uniform Proposed Optimal

Epochs 0 0.5 1 2.5 2 (a) rcv1 , L1 reg.

Uniform Proposed Optimal

Epochs 0 0.5 1 2

(b) rcv1 , L2 reg.

Uniform Proposed Optimal

Epochs 0 0.5 1 2.5 2

(c) real-sim , L1 reg.

Uniform Proposed Optimal

Epochs 0 0.5 1 2.5 2

(d) real-sim , L2 reg.

Figure 7: (SGD, square loss) Function value vs. number of iterations.

Uniform Proposed Optimal

Epochs 0 1 2 4

(a) news20 , L1 reg. Figure 8: (SGD, square loss) Function value vs. number of iterations.

Uniform Proposed 40 35 30 25 20 15 10

Time 0 5 10 25 20

(a) news20 , L1 reg. Figure 9: (SGD square loss) Function value vs. clock time.

Stochastic Gradient Descent. Finally, we also evaluate the performance of our approach when used within SGD with L1 and L2 regularization and square loss. In Figures 7 8 we report the iteration complexity vs. accuracy results and in Figure 9 the timing vs. accuracy results. The time units in Figures 6 and 9 are not directly comparable, as the experiments were conducted on different machines.

We observe that on all three datasets SGD with the optimal sampling performs only slightly better than uniform sampling. This is in contrast with the observations for CD, where the optimal sampling yields a signiﬁcant improvement. Consequently, the effect of the proposed sampling is less pronounced in the three SGD experiments.

Summary. The main ﬁndings of our experimental study can be summarized as follows:

Adaptive importance sampling signiﬁcantly outperforms ﬁxed importance sampling in iterations and time. The results show that (i) convergence in terms of iterations is almost as good as for the optimal (but not efﬁciently computable) gradient-based sampling and (ii) the introduced computational overhead is small enough to outperform ﬁxed importance sampling in terms of total computation time. Adaptive sampling requires adaptive stepsizes. The adaptive stepsize strategies of Algorithms 1 and 2 allow for much faster convergence than conservative ﬁxed-stepsize strategies. In the experiments, the measured value vk was always signiﬁcantly below the worst case estimate, in alignment with the observed convergence. Very loose safe gradient bounds are sufﬁcient. Even the bounds derived from the the very naïve gradient information obtained by estimating scalar products resulted in signiﬁcantly better sampling than using no gradient information at all. Further, no initialization of the gradient estimates is needed (at the beginning of the optimization process the proposed adaptive method performs close to the ﬁxed sampling but accelerates after just one epoch).

6 Conclusion

In this paper we propose a safe adaptive importance sampling scheme for CD and SGD algorithms. We argue that optimal gradient-based sampling is theoretically well justiﬁed. To make the computation of the adaptive sampling distribution computationally tractable, we rely on safe lower and upper bounds on the gradient. However, in contrast to previous approaches, we use these bounds in a novel way: in each iteration, we formulate the problem of picking the optimal sampling distribution as a convex optimization problem and present an efﬁcient algorithm to compute the solution. The novel sampling provably performs better than any ﬁxed importance sampling a guarantee which could not be established for previous samplings that were also derived from safe lower and upper bounds.

The computational cost of the proposed scheme is of the order O(n log n) per iteration this is on many problems comparable with the cost to evaluate a single component (coordinate, sum-structure) of the gradient, and the scheme can thus be implemented at no extra computational cost. This is veriﬁed by timing experiments on real datasets.

We discussed one simple method to track the gradient information in GLMs during optimization. However, we feel that the machine learning community could proﬁt from further research in that direction, for instance by investigating how such safe bounds can efﬁciently be maintained on more complex models. Our approach can immediately be applied when the tracking of the gradient is delegated to other machines in a distributed setting, like for instance in [1].

[1] Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, and Yoshua Bengio. Variance Reduction in SGD by Distributed Importance Sampling. ar Xiv.org, February 2015.

[2] Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, and Yang Yuan. Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling. In ICML 2017 - Proceedings of the 34th International Conference on Machine Learning, pages 1110 1119. June 2016.

[3] Ichiro Takeuchi Atsushi Shibagaki. Stochastic Primal Dual Coordinate Method with Non-Uniform Sampling Based on Optimality Violations. ar Xiv.org, October 2017.

[4] Stephen P Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.

[5] Dominik Csiba, Zheng Qu, and Peter Richtárik. Stochastic Dual Coordinate Ascent with Adaptive Probabilities. In ICML 2015 - Proceedings of the 32th International Conference on Machine Learning, February 2015.

[6] Dominik Csiba and Peter Richtárik. Importance Sampling for Minibatches. ar Xiv.org, February 2016.

[7] Jerome Friedman, Trevor Hastie, Holger Höﬂing, and Robert Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302 332, December 2007.

[8] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1):1 22, 2010.

[9] Wenjiang J. Fu. Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397 416, 1998.

[10] Xi He and Martin Takáˇc. Dual Free Adaptive Mini-batch SDCA for Empirical Risk Minimization. ar Xiv.org, October 2015.

[11] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S Sathiya Keerthi, and S Sundararajan. A Dual Coordinate Descent Method for Large-scale Linear SVM. In ICML 2008 - the 25th International Conference on Machine Learning, pages 408 415, New York, USA, 2008. ACM Press.

[12] Hidetoshi Komiya. Elementary proof for sion s minimax theorem. Kodai Math. J., 11(1):5 7, 1988.

[13] Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an O(1/t) convergence rate for projected stochastic subgradient descent. ar Xiv.org, December 2012.

[14] Jun Liu, Zheng Zhao, Jie Wang, and Jieping Ye. Safe Screening with Variational Inequalities and Its Application to Lasso. In ICML 2014 - Proceedings of the 31st International Conference on Machine Learning, pages 289 297, 2014.

[15] Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, and Joseph Salmon. Gap Safe screening rules for sparsity enforcing penalties. JMLR, 2017.

[16] Deanna Needell, Rachel Ward, and Nathan Srebro. Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm. In NIPS 2014 - Advances in Neural Information Processing Systems 27, pages 1017 1025, 2014.

[17] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574 1609, 2009.

[18] Yurii Nesterov. Efﬁciency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM Journal on Optimization, 22(2):341 362, 2012.

[19] Yurii Nesterov and Sebastian U. Stich. Efﬁciency of the accelerated coordinate descent method on structured optimization problems. SIAM Journal on Optimization, 27(1):110 123, 2017.

[20] Julie Nutini, Mark W Schmidt, Issam H Laradji, Michael P Friedlander, and Hoyt A Koepke. Coordinate Descent Converges Faster with the Gauss-Southwell Rule Than Random Selection. In ICML, pages 1632 1641, 2015.

[21] Anton Osokin, Jean-Baptiste Alayrac, Isabella Lukasewitz, Puneet K. Dokania, and Simon Lacoste-Julien. Minding the gaps for block frank-wolfe optimization of structured svms. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML 16, pages 593 602. JMLR.org, 2016.

[22] Guillaume Papa, Pascal Bianchi, and Stéphan Clémençon. Adaptive Sampling for Incremental Optimization Using Stochastic Gradient Descent. ALT 2015 - 26th International Conference on Algorithmic Learning Theory, pages 317 331, 2015.

[23] Dmytro Perekrestenko, Volkan Cevher, and Martin Jaggi. Faster Coordinate Descent via Adaptive Importance Sampling. In AISTATS 2017 - Proceedings of the 20th International Conference on Artiﬁcial Intelligence and Statistics, volume 54, pages 869 877. PMLR, 20 22 Apr 2017.

[24] Zheng Qu, Peter Richtárik, and Tong Zhang. Randomized Dual Coordinate Ascent with Arbitrary Sampling. ar Xiv.org, November 2014.

[25] Peter Richtárik and Martin Takáˇc. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 10(6):1233 1243, 2016.

[26] Mark Schmidt, Reza Babanezhad, Mohamed Ahmed, Aaron Defazio, Ann Clifton, and Anoop Sarkar. Non-Uniform Stochastic Average Gradient Method for Training Conditional Random Fields. In AISTATS 2015 - Proceedings of the Eighteenth International Conference on Artiﬁcial Intelligence and Statistics, volume 38, pages 819 828. PMLR, 09 12 May 2015.

[27] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal Estimated Sub-Gradient Solver for SVM. Mathematical Programming, 127(1):3 30, October 2010.

[28] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic Methods for l1-regularized Loss Minimization. JMLR, 12:1865 1892, June 2011.

[29] Shai Shalev-Shwartz and Tong Zhang. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. JMLR, 14:567 599, February 2013.

[30] Maurice Sion. On general minimax theorems. Paciﬁc Journal of Mathematics, 8(1):171 176, 1958.

[31] S. U. Stich, C. L. Müller, and B. Gärtner. Variable metric random pursuit. Mathematical Programming, 156(1):549 579, Mar 2016.

[32] Sebastian U. Stich, Anant Raj, and Martin Jaggi. Approximate steepest coordinate descent. In Doina Precup and Yee Whye Teh, editors, ICML 2017 - Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3251 3259. PMLR, 06 11 Aug 2017.

[33] Thomas Strohmer and Roman Vershynin. A randomized kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262, 2008.

[34] Paul Tseng and Sangwoon Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117(1):387 423, 2009.

[35] Stephen J Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3 34, 2015.

[36] Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In ICML 2015 - Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1 9. PMLR, 07 09 Jul 2015.

[37] Rong Zhu. Gradient-based sampling: An adaptive importance sampling for least-squares. In NIPS - Advances in Neural Information Processing Systems 29, pages 406 414. 2016.