# Adaptive Sampling Probabilities for Non-Smooth Optimization

Hongseok Namkoong¹, Aman Sinha², Steve Yadlowsky², John C. Duchi²³

¹Management Science & Engineering, Stanford University, USA. ²Electrical Engineering, Stanford University, USA. ³Statistics, Stanford University, USA. Correspondence to: Hongseok Namkoong, Aman Sinha. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

**Abstract.** Standard forms of coordinate and stochastic gradient methods do not adapt to structure in data; their good behavior under random sampling is predicated on uniformity in data. When gradients in certain blocks of features (for coordinate descent) or examples (for SGD) are larger than others, there is a natural structure that can be exploited for quicker convergence. Yet adaptive variants often suffer nontrivial computational overhead. We present a framework that discovers and leverages such structural properties at low computational cost. We employ a bandit optimization procedure that learns probabilities for sampling coordinates or examples in (non-smooth) optimization problems, allowing us to guarantee performance close to that of the optimal stationary sampling distribution. When such structure exists, our algorithms achieve tighter convergence guarantees than their non-adaptive counterparts, and we complement our analysis with experiments on several datasets.

## 1. Introduction

Identifying and adapting to structural aspects of problem data can often improve the performance of optimization algorithms. In this paper, we study two forms of such structure: variance in the relative importance of different features and of different observations (as well as blocks thereof). As a motivating concrete example, consider the $\ell_p$ regression problem

$$\mathop{\mathrm{minimize}}_{x}\ f(x) := \|Ax - b\|_p, \qquad (1)$$

where $a_i$ denote the rows of $A \in \mathbb{R}^{n \times d}$. When the columns (features) of $A$ have highly varying norms, say because certain features are infrequent, we wish to leverage this during optimization. Likewise, when the rows $a_i$ have disparate norms, heavy rows of $A$ influence the objective more than others. We develop optimization algorithms that automatically adapt to such irregularities for general non-smooth convex optimization problems.

Standard (stochastic) subgradient methods (Nemirovski et al., 2009), as well as more recent accelerated variants for smooth, strongly convex incremental optimization problems (e.g. Johnson and Zhang, 2013; Defazio et al., 2014), follow deterministic or random procedures that choose the data used to compute updates obliviously to conditioning and structure. As our experiments demonstrate, choosing blocks of features or observations (for instance, all examples belonging to a particular class in a classification problem) can be advantageous. Adapting to such structure can lead to substantial gains, and we propose a method that adaptively updates the sampling probabilities from which it draws blocks of features/observations (columns/rows in problem (1)) as it performs subgradient updates. Our method applies to both coordinate descent (feature/column sampling) and mirror descent (observation/row sampling). Heuristically, our algorithm learns to sample informative features/observations using their gradient values, and it requires overhead only logarithmic in the number of blocks over which it samples.
We show that our method optimizes a particular bound on convergence, roughly sampling from the optimal stationary probability distribution in hindsight, leading to substantial improvements when the data has pronounced irregularity.

When the objective $f(\cdot)$ is smooth and the desired solution accuracy is reasonably low, (block) coordinate descent methods are attractive because of their tractability (Nesterov, 2012; Necoara et al., 2011; Beck and Tetruashvili, 2013; Lee and Sidford, 2013; Richtárik and Takáč, 2014; Lu and Xiao, 2015). In this paper, we consider potentially non-smooth functions and present an adaptive block coordinate descent method, which iterates over $b$ blocks of coordinates, reminiscent of AdaGrad (Duchi et al., 2011). Choosing a good sampling distribution for coordinates in coordinate descent procedures is nontrivial (Lee and Sidford, 2013; Necoara et al., 2011; Shalev-Shwartz and Zhang, 2012; Richtárik and Takáč, 2015; Csiba et al., 2015; Allen-Zhu and Yuan, 2015). Most work focuses on choosing a good stationary distribution using problem-specific knowledge, which may not be available; this motivates automatically adapting to individual problem instances. For example, Csiba et al. (2015) provide an updating scheme for the probabilities in stochastic dual coordinate ascent, but the update requires $O(b)$ time per iteration, making it impractical for large-scale problems. Similarly, Nutini et al. (2015) observe that the Gauss-Southwell rule (choosing the coordinate with maximum gradient value) achieves better performance, but this also requires $O(b)$ time per iteration. Our method roughly emulates this behavior via careful adaptive sampling and bandit optimization, and we are able to provide a number of a posteriori optimality guarantees.

In addition to coordinate descent methods, we also consider the finite-sum minimization problem

$$\mathop{\mathrm{minimize}}_{x \in X}\ f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x),$$

where the $f_i$ are convex and may be non-smooth. Variance-reduction techniques for finite-sum problems often yield substantial gains (Johnson and Zhang, 2013; Defazio et al., 2014), but they generally require smoothness. More broadly, importance sampling estimators (Strohmer and Vershynin, 2009; Needell et al., 2014; Zhao and Zhang, 2014; 2015; Csiba and Richtárik, 2016) can yield improved convergence, but the only work we are aware of that allows online, problem-specific adaptation of sampling probabilities is Gopal (2016); those updates require $O(b)$ computation per iteration and come without optimality guarantees.

We develop these ideas in the coming sections, focusing first in Section 2 on adaptive procedures for (non-smooth) coordinate descent methods and developing the necessary bandit optimization and adaptivity machinery. In Section 3, we translate our development into convergence results for finite-sum convex optimization problems. Complementing our theoretical results, we provide a number of experiments in Section 4 that show the importance of our algorithmic development and the advantages of exploiting block structure in problem data.

## 2. Adaptive sampling for coordinate descent

We begin with the convex optimization problem

$$\mathop{\mathrm{minimize}}_{x \in X}\ f(x), \qquad (2)$$

where $X = X_1 \times \cdots \times X_b \subset \mathbb{R}^d$ is a Cartesian product of closed convex sets $X_j \subset \mathbb{R}^{d_j}$ with $\sum_j d_j = d$, and $f$ is convex and Lipschitz. When there is a natural block structure in the problem, some blocks have larger gradient norms than others, and we wish to sample these blocks more often in the coordinate descent algorithm.
To that end, we develop an adaptive procedure that exploits variability in block importance online. In the coming sections, we show that we obtain certain near-optimal guarantees, and that the computational overhead over a simple random choice of block $j \in [b]$ is at most $O(\log b)$. In addition, under some natural structural assumptions on the blocks and problem data, we show how our adaptive sampling scheme provides convergence guarantees polynomially better in the dimension than those of naive uniform sampling or gradient descent.

**Notation for coordinate descent.** Without loss of generality, we assume that the first $d_1$ coordinates of $x \in \mathbb{R}^d$ correspond to $X_1$, the next $d_2$ to $X_2$, and so on. We let $U_j \in \{0,1\}^{d \times d_j}$ be the matrix identifying the $j$th block, so that $I_d = [U_1 \ \cdots \ U_b]$. We define the projected subgradient vectors for each block $j$ by $G_j(x) = U_j U_j^\top f'(x) \in \mathbb{R}^d$, where $f'(x) \in \partial f(x)$ is a fixed element of the subdifferential $\partial f(x)$. Define $x_{[j]} := U_j^\top x \in \mathbb{R}^{d_j}$ and $G_{[j]}(x) := U_j^\top G_j(x) = U_j^\top f'(x) \in \mathbb{R}^{d_j}$. Let $\psi_j$ denote a differentiable 1-strongly convex function on $X_j$ with respect to the norm $\|\cdot\|_j$, meaning that for all $\Delta \in \mathbb{R}^{d_j}$ we have

$$\psi_j(x_{[j]} + \Delta) \ge \psi_j(x_{[j]}) + \nabla\psi_j(x_{[j]})^\top \Delta + \tfrac{1}{2}\|\Delta\|_j^2,$$

and let $\|\cdot\|_{j,*}$ be the dual norm of $\|\cdot\|_j$. Let $B_j(u,v) = \psi_j(u) - \psi_j(v) - \nabla\psi_j(v)^\top(u - v)$ be the Bregman divergence associated with $\psi_j$, and define the tensorized divergence $B(x,y) := \sum_{j=1}^b B_j(x_{[j]}, y_{[j]})$. Throughout the paper, we assume the following.

**Assumption 1.** For all $x, y \in X$, we have $B(x,y) \le R^2$ and $\|G_{[j]}(x)\|_{j,*}^2 \le L^2/b$ for $j = 1, \ldots, b$.

### 2.1. Coordinate descent for non-smooth functions

The starting point of our analysis is the simple observation that if a coordinate block $J \in [b]$ is chosen according to a probability vector $p > 0$, then the importance sampling estimator $G_J(x)/p_J$ satisfies $\mathbb{E}_p[G_J(x)/p_J] = f'(x) \in \partial f(x)$. Thus the randomized coordinate subgradient method of Algorithm 1 is essentially a stochastic mirror descent method (Nemirovski and Yudin, 1983; Beck and Teboulle, 2003; Nemirovski et al., 2009), and as long as $\sup_{x\in X}\mathbb{E}[\|p_J^{-1}G_J(x)\|_*^2] < \infty$ it converges at rate $O(1/\sqrt{T})$.

**Algorithm 1** (Non-smooth Coordinate Descent)
Input: stepsize $\alpha_x > 0$, probabilities $p_1, \ldots, p_T$.
Initialize: $x_1 \in X$.
for $t = 1, \ldots, T$:
 Sample $J_t \sim p_t$;
 $x_{t+1,[J_t]} \leftarrow \mathop{\mathrm{argmin}}_{x \in X_{J_t}} \big\{ \langle G_{[J_t]}(x_t)/p_{t,J_t},\, x \rangle + \tfrac{1}{\alpha_x} B_{J_t}(x, x_{t,[J_t]}) \big\}$, leaving the other blocks unchanged.
return $\bar{x}_T = \frac{1}{T}\sum_{t=1}^T x_t$

With this insight, a variant of the standard stochastic mirror descent analysis yields the following convergence guarantee for Algorithm 1 with non-stationary probabilities (cf. Dang and Lan (2015), who do not quite as carefully track the dependence on the sampling distribution $p$). Throughout, we define the expected sub-optimality gap of an algorithm outputting an estimate $\hat{x}$ by $S(f,\hat{x}) := \mathbb{E}[f(\hat{x})] - \inf_{x^\star\in X} f(x^\star)$. See Section A.1 for the proof.

**Proposition 1.** Under Assumption 1, Algorithm 1 achieves

$$S(f,\bar{x}_T) \le \frac{R^2}{\alpha_x T} + \frac{\alpha_x}{2T}\,\mathbb{E}\Bigg[\sum_{t=1}^T \sum_{j=1}^b \frac{\|G_{[j]}(x_t)\|_{j,*}^2}{p_{t,j}}\Bigg], \qquad (3)$$

where $S(f,\bar{x}_T) = \mathbb{E}[f(\bar{x}_T)] - \inf_{x\in X} f(x)$.

As an immediate consequence, if $p_{t,j} \ge p_{\min} > 0$ and $\alpha_x = \frac{R}{L}\sqrt{2p_{\min}/T}$, then $S(f,\bar{x}_T) \le RL\sqrt{2/(T p_{\min})}$. To make this more concrete, consider sampling from the uniform distribution $p_t \equiv \frac{1}{b}\mathbf{1}$, so that $p_{\min} = 1/b$, and assume homogeneous block sizes $d_j = d/b$ for simplicity. Algorithm 1 then solves problem (2) to $\epsilon$-accuracy within $O(b R^2 L^2/\epsilon^2)$ iterations, where each iteration costs approximately $O(d/b)$ plus the cost of projecting onto $X_j$. In contrast, mirror descent with the same constraints and divergence $B$ achieves the same accuracy within $O(R^2 L^2/\epsilon^2)$ iterations, taking $O(d)$ time plus the cost of projecting onto $X$ per iteration. As the projection costs are linear in the number $b$ of blocks, the two algorithms are comparable.
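To make the block update concrete, the following is a minimal sketch of Algorithm 1 for the unconstrained Euclidean case ($\psi_j = \frac{1}{2}\|\cdot\|_2^2$ and $X_j = \mathbb{R}^{d_j}$), where the block mirror step reduces to an importance-weighted subgradient step. The helper names (`subgrad_block`, `blocks`) are our own illustration, not the paper's notation.

```python
import numpy as np

def coordinate_descent(subgrad_block, blocks, probs, step, T, d, seed=0):
    """Sketch of Algorithm 1 in the Euclidean, unconstrained case.

    subgrad_block(x, j): returns G_[j](x), the subgradient of f
        restricted to block j (length blocks[j][1] - blocks[j][0]).
    blocks: list of (start, end) index ranges partitioning range(d).
    probs:  stationary sampling probabilities p (length b).
    """
    rng = np.random.default_rng(seed)
    b = len(blocks)
    x, x_avg = np.zeros(d), np.zeros(d)
    for t in range(T):
        j = rng.choice(b, p=probs)        # J_t ~ p
        s, e = blocks[j]
        g = subgrad_block(x, j)           # G_[J_t](x_t)
        x[s:e] -= step * g / probs[j]     # unbiased importance-weighted step
        x_avg += (x - x_avg) / (t + 1)    # running average, returns x-bar_T
    return x_avg
```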
In practice, coordinate descent procedures can significantly outperform full gradient updates through efficient memory usage: for huge problems, coordinate descent methods can exploit data locality by choosing block sizes so that each gradient block fits in local memory.

### 2.2. Optimal stepsizes by doubling

In the upper bound (3), we wish to choose the stepsize $\alpha_x$ that minimizes the bound. However, the term $\sum_{t=1}^T \|G_{[j]}(x_t)\|_{j,*}^2/p_{t,j}$ is unknown a priori. We circumvent this issue by using the doubling trick (e.g. Shalev-Shwartz, 2012, Section 2.3.1) to achieve the best possible rate in hindsight. To simplify our analysis, we assume there is some $p_{\min} > 0$ such that the probabilities lie in

$$\mathcal{P} := \big\{p \in \mathbb{R}^b_+ : p^\top \mathbf{1} = 1,\ p \ge p_{\min}\big\}.$$

**Algorithm 2** (Stepsize Doubling Coordinate Descent)
Initialize: $x_1 \in X$, $p_1 = p$, $k = 1$.
while $t \le T$:
 while $\sum_{l \le t} \|G_{[J_l]}(x_l)\|_{J_l,*}^2 / p_{l,J_l}^2 \le 4^k$ and $t \le T$:
  run the inner loop of Algorithm 1 with $\alpha_{x,k} = \sqrt{2}\,R\,2^{-k}$; $t \leftarrow t+1$
 $k \leftarrow k+1$
return $\bar{x}_T = \frac{1}{T}\sum_{t=1}^T x_t$

Since maintaining the running sum $\sum_{l \le t} \|G_{[J_l]}(x_l)\|_{J_l,*}^2/p_{l,J_l}^2$ requires incremental time $O(d_{J_t})$ at each iteration $t$, choosing the stepsizes adaptively via Algorithm 2 requires only a constant factor of extra computation over using a fixed step size. The result below shows that the doubling trick in Algorithm 2 achieves (up to logarithmic factors) the performance of the optimal stepsize that minimizes the regret bound (3).

**Proposition 2.** Under Assumption 1, Algorithm 2 achieves

$$S(f,\bar{x}_T) \le \frac{6R}{T}\,\mathbb{E}\Bigg[\Bigg(\sum_{t=1}^T \frac{\|G_{[J_t]}(x_t)\|_{J_t,*}^2}{p_{t,J_t}^2}\Bigg)^{1/2}\Bigg] + \frac{RL}{p_{\min}\,T}\log_4 T,$$

where $S(f,\bar{x}_T) = \mathbb{E}[f(\bar{x}_T)] - \inf_{x\in X} f(x)$.

### 2.3. Adaptive probabilities

We now present an adaptive updating scheme for the sampling probabilities $p_t$. From Proposition 2, the stationary distribution achieving the smallest regret upper bound minimizes the criterion

$$\mathbb{E}\Bigg[\sum_{t=1}^T \frac{\|G_{[J_t]}(x_t)\|_{J_t,*}^2}{p_{t,J_t}^2}\Bigg] = \mathbb{E}\Bigg[\sum_{t=1}^T \sum_{j=1}^b \frac{\|G_{[j]}(x_t)\|_{j,*}^2}{p_{t,j}}\Bigg],$$

where the equality follows from the tower property. Since the $x_t$ depend on the $p_t$, we view this as an online convex optimization problem and choose $p_1, \ldots, p_T$ to minimize the regret

$$\sum_{t=1}^T \sum_{j=1}^b \frac{\|G_{[j]}(x_t)\|_{j,*}^2}{p_{t,j}} - \min_{p \in \mathcal{P}}\ \sum_{t=1}^T \sum_{j=1}^b \frac{\|G_{[j]}(x_t)\|_{j,*}^2}{p_j}. \qquad (4)$$

Note that due to the block coordinate nature of Algorithm 1, we only compute $\|G_{[j]}(x_t)\|_{j,*}$ for the sampled block $j = J_t$ at each iteration. Hence, we treat this as a multi-armed bandit problem where the arms are the blocks $j = 1, \ldots, b$, and we observe only the loss associated with the arm $J_t$ pulled at time $t$.

**Algorithm 3** (Coordinate Descent with Adaptive Sampling)
Input: stepsize $\alpha_p > 0$, threshold $p_{\min} > 0$ with $\mathcal{P} = \{p \in \mathbb{R}^b_+ : p^\top\mathbf{1} = 1,\ p \ge p_{\min}\}$.
Initialize: $x_1 \in X$, $p_1 = p$.
for $t = 1, \ldots, T$:
 Sample $J_t \sim p_t$; choose $\alpha_{x,k}$ according to Algorithm 2.
 Update $x$: $x_{t+1,[J_t]} \leftarrow \mathop{\mathrm{argmin}}_{x\in X_{J_t}}\big\{\langle G_{[J_t]}(x_t)/p_{t,J_t},\, x\rangle + \tfrac{1}{\alpha_{x,k}} B_{J_t}(x, x_{t,[J_t]})\big\}$.
 Update $p$: for $\hat{b}_{t,j}(x)$ defined in (5),
  $w_{t+1} \leftarrow p_t \exp\big(-\alpha_p (\hat{b}_{t,J_t}(x_t)/p_{t,J_t})\, e_{J_t}\big)$, $\quad p_{t+1} \leftarrow \mathop{\mathrm{argmin}}_{q\in\mathcal{P}} D_{kl}(q\|w_{t+1})$.
return $\bar{x}_T = \frac{1}{T}\sum_{t=1}^T x_t$

By using a bandit algorithm (itself another mirror descent method) to update $p$, we show that our updates achieve performance comparable to that of the best stationary probability vector in hindsight. To this end, we first bound the regret (4) by the regret of a linear bandit problem. By convexity of $x \mapsto 1/x$ and $\frac{d}{dx}x^{-1} = -x^{-2}$, we have

$$\sum_{j=1}^b \frac{\|G_{[j]}(x_t)\|_{j,*}^2}{p_{t,j}} - \sum_{j=1}^b \frac{\|G_{[j]}(x_t)\|_{j,*}^2}{p_j} \le \sum_{j=1}^b \underbrace{\Bigg(-\frac{\|G_{[j]}(x_t)\|_{j,*}^2}{p_{t,j}^2}\Bigg)}_{(\star)}\,(p_{t,j} - p_j).$$

Now, view the vector $(\star)$ as the loss vector for a constrained linear bandit problem with feasible region $\mathcal{P}$. We wish to apply EXP3 (Auer et al., 2002), or equivalently a 1-sparse mirror descent on $p$ with $\psi_{\mathcal{P}}(p) = \sum_j p_j\log p_j$ (see, for example, Section 5.3 of Bubeck and Cesa-Bianchi (2012) for the connections). However, EXP3 requires the loss values to be nonnegative in order to remain in the region where $\psi_{\mathcal{P}}$ is strongly convex, so we shift the losses, using the fact that $p$ and the $p_t$ are probability vectors (so adding a constant to every coordinate leaves the regret unchanged). Namely, we define

$$\hat{b}_{t,j}(x) := \frac{L^2}{b\,p_{\min}^2} - \frac{\|G_{[j]}(x)\|_{j,*}^2}{p_{t,j}^2} \ \ge\ 0, \qquad (5)$$

so that $\langle \hat{b}_t(x_t),\, p_t - p\rangle$ bounds the per-round regret. Using these scaled loss values, we perform EXP3 (Algorithm 3).
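As one concrete rendering, the sketch below combines the coordinate step with the bandit update of Algorithm 3 in the Euclidean case, using a fixed stepsize in place of the doubling scheme and a dense form of the KL projection of Proposition 4 below (Section 2.4 shows how to make that projection $O(\log b)$). The argument `L2` plays the role of $L^2$ from Assumption 1; the helper names are ours.

```python
import numpy as np

def adaptive_cd(subgrad_block, blocks, T, step_x, step_p, p_min, L2, d, seed=0):
    """Sketch of Algorithm 3 (Euclidean case, fixed stepsizes)."""
    rng = np.random.default_rng(seed)
    b = len(blocks)
    x, x_avg = np.zeros(d), np.zeros(d)
    p = np.full(b, 1.0 / b)
    for t in range(T):
        j = rng.choice(b, p=p)
        s, e = blocks[j]
        g = subgrad_block(x, j)
        x[s:e] -= step_x * g / p[j]                  # mirror step on block J_t
        # bandit loss (5), shifted so it is nonnegative
        loss = L2 / (b * p_min ** 2) - (g @ g) / p[j] ** 2
        w_j = p[j] * np.exp(-step_p * loss / p[j])   # EXP3 reweight of arm J_t
        rest = 1.0 - p[j]                            # mass on the other blocks
        if w_j >= p_min * rest / (1.0 - p_min):      # Proposition 4, case 1
            p[j] = w_j
            p /= rest + w_j                          # renormalize
        else:                                        # case 2: clamp at the floor
            p *= (1.0 - p_min) / rest
            p[j] = p_min
        x_avg += (x - x_avg) / (t + 1)
    return x_avg
```

Because the loss in (5) is nonnegative, the exponential reweight can only shrink $p_{J_t}$, so the renormalization in case 1 never pushes the remaining coordinates below $p_{\min}$.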
Intuitively, we penalize the probability of the sampled block according to the strength of the signal on that block. The scaling (5) ensures that we penalize blocks with low signal (as opposed to rewarding those with high signal), which also enforces diversity in the sampled coordinates. In Section A.3, we show how this scaling plays a key role in proving the optimality of Algorithm 3. Here, the signal is measured by the relative size of the gradient in the block against the probability of sampling the block: blocks with large surprises, those with higher gradient norms relative to their sampling probability, are sampled more frequently in subsequent iterations. Algorithm 3 guarantees low regret for the online convex optimization problem (4), which in turn yields the following guarantee.

**Theorem 3.** Under Assumption 1, the adaptive updates in Algorithm 3 with $\alpha_p = \frac{p_{\min}^2}{L^2}\sqrt{2b\log b/T}$ achieve

$$S(f,\bar{x}_T) \le \frac{6R}{T}\Bigg(\underbrace{\min_{p\in\mathcal{P}}\ \mathbb{E}\Bigg[\sum_{t=1}^T\sum_{j=1}^b \frac{\|G_{[j]}(x_t)\|_{j,*}^2}{p_j}\Bigg]}_{(a):\ \text{best in hindsight}} + \underbrace{\frac{L^2}{b\,p_{\min}^2}\sqrt{2Tb\log b}}_{(b):\ \text{regret for bandit problem}}\Bigg)^{1/2}, \qquad (6)$$

where $S(f,\bar{x}_T) = \mathbb{E}[f(\bar{x}_T)] - \inf_{x\in X} f(x)$.

See Section A.3 for the proof. Note the trade-off in the regret bound (6) with respect to $p_{\min}$: for small $p_{\min}$, the first term is small, as the set $\mathcal{P}$ is large, but the second (regret) term is large, and vice versa. To interpret the bound (6), take $p_{\min} = \delta/b$ for some $\delta \in (0,1)$. The first term dominates the remainder as long as $T = \Omega(b\log b)$; since we require $T = \Omega(bR^2L^2/\epsilon^2)$ to guarantee convergence of coordinate descent in Proposition 1, we roughly expect the first term in the bound (6) to dominate. Thus, Algorithm 3 attains the convergence guarantee of the optimal stationary sampling distribution in hindsight.

### 2.4. Efficient updates for p

The updates for $p$ in Algorithm 3 can be performed in $O(\log b)$ time using a balanced binary tree. Let $D_{kl}(p\|q) := \sum_{i=1}^b p_i \log\frac{p_i}{q_i}$ denote the Kullback-Leibler divergence between $p$ and $q$. Ignoring the subscript $t$, so that $w = w_{t+1}$, $p = p_t$, and $J = J_t$, the new probability vector $q$ is given by the minimizer of

$$D_{kl}(q\|w) \quad \text{s.t.} \quad q^\top\mathbf{1} = 1,\ q \ge p_{\min}, \qquad (7)$$

where $w$ is the previous probability vector $p$ modified only at the index $J$. We store $w$ in a binary tree, keeping values up to a normalization factor. At each node, we also store the sum of the elements in the left/right subtrees for efficient sampling (for completeness, pseudo-code for sampling from the binary tree in $O(\log b)$ time is given in Section B.3). The total mass of the tree can be read off the root alone. The following proposition shows that it suffices to touch at most one element of the tree to perform the update. See Section B for the proof.

**Proposition 4.** The solution to (7) is given, for $j \ne J$, by

$$q_j = \begin{cases} \dfrac{w_j}{1 - p_J + w_J} & \text{if } w_J \ge \dfrac{p_{\min}(1 - p_J)}{1 - p_{\min}} \\[2ex] \dfrac{1 - p_{\min}}{1 - p_J}\,w_j & \text{otherwise,} \end{cases} \qquad\text{and}\qquad q_J = \begin{cases} \dfrac{w_J}{1 - p_J + w_J} & \text{if } w_J \ge \dfrac{p_{\min}(1 - p_J)}{1 - p_{\min}} \\[2ex] p_{\min} & \text{otherwise.} \end{cases}$$

Note that since the bandit losses (5) are nonnegative, we always have $w_J \le p_J$, so in the first case the remaining coordinates can only grow and thus never fall below $p_{\min}$.

**Algorithm 4** (KL Projection)
1: Input: $J$, $p_J$, $w_J$, $\text{mass} = \sum_i w_i$
2: $w_{\mathrm{cand}} \leftarrow p_J\, e^{-\alpha_p \hat{b}_{t,J}(x_t)/p_J} \cdot \text{mass}$
3: if $w_{\mathrm{cand}}/(\text{mass} - w_J + w_{\mathrm{cand}}) \le p_{\min}$ then
4:  $w_{\mathrm{cand}} \leftarrow \frac{p_{\min}}{1 - p_{\min}}(\text{mass} - w_J)$
5: Update($w_{\mathrm{cand}}$, $J$)

As seen in Algorithm 4, we modify at most one element of the binary tree: because the tree stores unnormalized values, the common rescaling of the other coordinates in Proposition 4 is absorbed into the total mass. Here, Update modifies the value at index $J$ and propagates it up the tree so that the left/right subtree sums remain correct. We provide the full pseudocode in Section B.2.
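For completeness, here is a small sketch of the sum-augmented binary tree that makes both sampling $J_t \sim p_t$ and the single-leaf update of Algorithm 4 run in $O(\log b)$. This is a generic sum-tree in our own naming, storing unnormalized weights as the paper describes, not the paper's Section B pseudocode itself.

```python
import numpy as np

class SumTree:
    """Sum-augmented binary tree over b leaves (unnormalized weights):
    sampling proportional to the weights and updating a single leaf both
    take O(log b), as required by the p-updates in Algorithm 3."""

    def __init__(self, weights):
        self.b = len(weights)
        self.tree = np.zeros(2 * self.b)
        self.tree[self.b:] = weights
        for i in range(self.b - 1, 0, -1):       # fill internal sums bottom-up
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def total(self):
        return self.tree[1]                      # root holds the total mass

    def update(self, j, w):
        i = self.b + j
        self.tree[i] = w
        while i > 1:                             # propagate the new sum upward
            i //= 2
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def sample(self, rng):
        u = rng.uniform(0.0, self.tree[1])
        i = 1
        while i < self.b:                        # descend by subtree mass
            if u <= self.tree[2 * i]:
                i = 2 * i
            else:
                u -= self.tree[2 * i]
                i = 2 * i + 1
        return i - self.b
```

For example, `tree = SumTree(p0)` followed by `j = tree.sample(np.random.default_rng(0))` draws a block, and after the bandit step one calls `tree.update(j, w_cand)` with the value computed in Algorithm 4.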
### 2.5. Example

The optimality guarantee of Theorem 3 is not directly interpretable, since the term (a) in the upper bound (6) is optimal only given the iterates $x_1, \ldots, x_T$, even though the $x_t$ themselves depend on the sampling probabilities. Hence, we now study a setting in which we can further bound (6) to obtain an explicit regret bound for Algorithm 3 that is provably better than those of non-adaptive counterparts. Indeed, under structural assumptions on the problem similar to those of McMahan and Streeter (2010) and Duchi et al. (2011), our adaptive sampling algorithm provably achieves regret polynomially better in the dimension than either uniform sampling or gradient descent.

Consider the SVM objective

$$f(x) = \frac{1}{n}\sum_{i=1}^n \big[1 - \langle z_i, x\rangle\big]_+,$$

where $n$ is small and $d$ is large. Here $\partial f(x) \ni -\frac{1}{n}\sum_{i : \langle z_i, x\rangle \le 1} z_i$. Assume that for some fixed $\alpha \in (1,\infty)$ and $L_j^2 := \beta j^{-\alpha}$, we have $|\partial_j f(x)|^2 \le \frac{1}{n}\sum_{i=1}^n |z_{i,j}|^2 \le L_j^2$. In particular, this is the case if we have sparse features $z_U \in \{-1,0,+1\}^d$ with power-law tails $\mathbb{P}(|z_{U,j}| = 1) = \beta j^{-\alpha}$, where $U$ is a uniform random variable over $\{1,\ldots,n\}$. Take $C_j = \{j\}$ for $j = 1,\ldots,d$ (so $b = d$).

First, we show that although for the uniform distribution $p = \mathbf{1}/d$ we have $\mathbb{E}[\sum_j \|G_j(x_t)\|_j^2/p_j] = O(d\log d)$, the term (a) in (6) can be orders of magnitude smaller.

**Proposition 5.** Let $b = d$, $p_{\min} = \delta/d$ for some $\delta \in (0,1)$, and $C_j = \{j\}$. If $\|G_j(x)\|_j^2 \le L_j^2 := \beta j^{-\alpha}$ for some $\alpha \in (1,\infty)$, then

$$\min_{p\in\Delta_b,\ p\ge p_{\min}}\ \sum_{j=1}^d \frac{L_j^2}{p_j} = \begin{cases} O(\log^2 d) & \text{if } \alpha \in [2,\infty) \\ O(d^{2-\alpha}) & \text{if } \alpha \in (1,2). \end{cases}$$

We defer the proof of the proposition to Section A.5. Using this bound, we can show explicit regret bounds for Algorithm 3: from Theorem 3 and Proposition 5, Algorithm 3 attains

$$S(f,\bar{x}_T) = \begin{cases} O\big(R\log d/\sqrt{T} + R\,d^{3/4}T^{-3/4}\log^{5/4} d\big) & \text{if } \alpha \in [2,\infty) \\ O\big(R\,d^{1-\alpha/2}/\sqrt{T} + R\,d^{3/4}T^{-3/4}\log^{5/4} d\big) & \text{if } \alpha \in (1,2). \end{cases}$$

Setting the above to be at most $\epsilon$ and inverting with respect to $T$, we obtain the iteration complexities compared in Table 1.

**Table 1.** Runtime comparison (computations needed to guarantee an $\epsilon$-optimality gap) under heavy-tailed block structures, for $\alpha \in [2,\infty)$ and $\alpha \in (1,2)$. ACD = adaptive coordinate descent, UCD = uniform coordinate descent, GD = gradient descent.

To see the runtime bounds for uniformly sampled coordinate descent and gradient descent, recall the regret bound (3) of Proposition 1. Plugging $p_{t,j} = 1/d$ and $\alpha_x = \sqrt{2R^2/(\bar{L}^2 T d)}$ into the bound, we obtain $S(f,\bar{x}_T) \le O(R\bar{L}\sqrt{d/T})$, where $\bar{L}^2 := \sum_{j=1}^d L_j^2$. Similarly, gradient descent with $\alpha_x = \sqrt{2R^2/(\bar{L}^2 T)}$ attains $S(f,\bar{x}_T) \le O(R\bar{L}/\sqrt{T})$; since each gradient descent update takes $O(d)$ time, we obtain the same runtime bound.

While non-adaptive algorithms such as uniformly sampled coordinate descent or gradient descent have the same runtime for all $\alpha$, our adaptive sampling method automatically tunes itself to the value of $\alpha$. Note that for $\alpha \in (1,\infty)$, the first term in the runtime bound for our adaptive method in Table 1 is strictly better than that of uniform coordinate descent or gradient descent. In particular, for $\alpha \in [2,\infty)$ the best stationary sampling distribution in hindsight yields an improvement of up to a factor $O(d)$ in the dimension. However, due to the remainder terms for the bandit problem, this improvement only matters (i.e., the first term is larger than the second) once $T \gtrsim d^3\log d$ when $\alpha \in [2,\infty)$, and once $T \gtrsim d^{3-2(\alpha-1)}\log^{5/2} d$ when $\alpha \in (1,2)$. In Section 4, we show that these remainder terms can be smaller than their upper bounds indicate: empirically, our adaptive method outperforms the uniformly sampled counterpart well beyond the regime these thresholds suggest.
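To see where the two regimes in Proposition 5 come from, ignore the $p \ge p_{\min}$ constraint for a moment. By the Cauchy-Schwarz inequality (a standard computation included here as a sanity check, not the paper's Section A.5 proof), the best stationary distribution for the criterion in (a) samples proportionally to the block Lipschitz constants:

```latex
\min_{p \in \Delta_b}\ \sum_{j=1}^b \frac{L_j^2}{p_j}
  \;=\; \Big(\sum_{j=1}^b L_j\Big)^2,
\qquad \text{attained at } p_j^\star = \frac{L_j}{\sum_{k=1}^b L_k},
```

since $\big(\sum_j L_j\big)^2 = \big(\sum_j (L_j/\sqrt{p_j})\sqrt{p_j}\big)^2 \le \big(\sum_j L_j^2/p_j\big)\big(\sum_j p_j\big)$. With $L_j = \sqrt{\beta}\,j^{-\alpha/2}$ and $b = d$, the sum $\sum_{j\le d} j^{-\alpha/2}$ is $O(\log d)$ for $\alpha \ge 2$ and $O(d^{1-\alpha/2})$ for $\alpha \in (1,2)$, which squares to the two regimes above, while the uniform distribution pays $d\sum_j L_j^2$.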
## 3. Adaptive probabilities for stochastic gradient descent

Consider the empirical risk minimization problem

$$\mathop{\mathrm{minimize}}_{x \in X}\ f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x),$$

where $X \subset \mathbb{R}^d$ is a closed convex set and the $f_i(\cdot)$ are convex functions. Let $C_1, \ldots, C_b$ be a partition of the $n$ samples, so that each example belongs to some $C_j$ of size $n_j := |C_j|$ (note that the index $j$ now refers to blocks of examples instead of coordinates). Such block structures arise naturally, for example, when the $C_j$ collect the examples with the same label in a multi-class classification problem. In this stochastic optimization setting, we sample a block $J_t \sim p_t$ at each iteration $t$ and perform gradient updates using a gradient estimate on the block $C_{J_t}$. We show how a similar adaptive updating scheme for the $p_t$ again achieves the optimality guarantees of Section 2.

### 3.1. Mirror descent with non-stationary probabilities

Following the approach of Nemirovski et al. (2009), we run mirror descent for the updates on $x$. At iteration $t$, a block $J_t$ is drawn from a $b$-dimensional probability vector $p_t$. We assume access to unbiased stochastic gradients $G_j(x)$ for each block, that is, $\mathbb{E}[G_j(x)] = \frac{1}{n_j}\sum_{i\in C_j} f_i'(x)$. In particular, the estimate $G_{J_t}(x_t) := f_{I_t}'(x_t)$, where $I_t$ is drawn uniformly from $C_{J_t}$, gives the usual unbiased stochastic gradient of minibatch size 1; the other extreme is a minibatch of size $n_j$, namely $G_{J_t}(x_t) := \frac{1}{n_{J_t}}\sum_{i\in C_{J_t}} f_i'(x_t)$. Then the importance sampling estimator $\frac{n_{J_t}}{n\,p_{t,J_t}} G_{J_t}(x_t)$ is an unbiased estimate of a subgradient of the objective.

Let $\psi$ be a differentiable 1-strongly convex function on $X$ with respect to the norm $\|\cdot\|$ as before, and denote by $\|\cdot\|_*$ the dual norm of $\|\cdot\|$. Let $B(x,y) = \psi(x) - \psi(y) - \nabla\psi(y)^\top(x-y)$ be the Bregman divergence associated with $\psi$. In this section, we assume the following (standard) bound.

**Assumption 2.** For all $x, y \in X$, we have $B(x,y) \le R^2$ and $\|G_j(x)\|_*^2 \le L^2$ for $j = 1, \ldots, b$.

We use these stochastic gradients to perform mirror descent updates, replacing the update in Algorithm 1 with

$$x_{t+1} \leftarrow \mathop{\mathrm{argmin}}_{x\in X}\Big\{\Big\langle \frac{n_{J_t}}{n\,p_{t,J_t}}\,G_{J_t}(x_t),\ x\Big\rangle + \frac{1}{\alpha_x}B(x, x_t)\Big\}. \qquad (8)$$

From a standard argument (e.g., Nemirovski et al., 2009), we obtain the following convergence guarantee; the proof follows an argument similar to that of Proposition 1.

**Proposition 6.** Under Assumption 2, the updates (8) attain

$$S(f,\bar{x}_T) \le \frac{R^2}{\alpha_x T} + \frac{\alpha_x}{2T}\,\mathbb{E}\Bigg[\sum_{t=1}^T\sum_{j=1}^b \frac{n_j^2\,\|G_j(x_t)\|_*^2}{n^2\,p_{t,j}}\Bigg], \qquad (9)$$

where $S(f,\bar{x}_T) = \mathbb{E}[f(\bar{x}_T)] - \inf_{x\in X} f(x)$.

Again, we wish to choose the optimal stepsize $\alpha_x$ that minimizes the regret bound (9). To this end, we modify the doubling trick of Algorithm 2 as follows: use $\sum_{l\le t}\frac{n_{J_l}^2\|G_{J_l}(x_l)\|_*^2}{n^2 p_{l,J_l}^2}$ in the second while-condition, and stepsizes $\alpha_{x,k} = \sqrt{2}\,R\,2^{-k}$. Then, similarly to Proposition 2, we have

$$S(f,\bar{x}_T) \le \frac{6R}{T}\,\mathbb{E}\Bigg[\Bigg(\sum_{t=1}^T \frac{n_{J_t}^2\,\|G_{J_t}(x_t)\|_*^2}{n^2\,p_{t,J_t}^2}\Bigg)^{1/2}\Bigg] + \frac{2RL}{p_{\min}\,T}\log_4 T.$$

### 3.2. Adaptive probabilities

We now consider an adaptive updating scheme for the $p_t$, analogous to Section 2.3. Using the scaled gradient estimate

$$\hat{b}_{t,j}(x) := -\frac{n_j^2\,\|G_j(x)\|_*^2}{n^2\,p_{t,j}^2} + L^2\Big(\frac{\max_j n_j}{n\,p_{\min}}\Big)^2 \qquad (10)$$

to run EXP3, we obtain Algorithm 5. Again, the additive scaling $L^2(\max_j n_j/(n\,p_{\min}))^2$ ensures that $\hat{b}_{t,j} \ge 0$. As in Section 2.4, the updates for $p$ in Algorithm 5 can be done in $O(\log b)$ time. We can also show optimality guarantees for Algorithm 5 similar to those before; the proof is essentially the same as that given in Section A.3.

**Algorithm 5** (Mirror Descent with Adaptive Sampling)
Input: stepsize $\alpha_p > 0$.
Initialize: $x_1 \in X$, $p_1 = p$.
for $t = 1, \ldots, T$:
 Sample $J_t \sim p_t$; choose $\alpha_{x,k}$ according to (modified) Algorithm 2.
 Update $x$ via (8).
 Update $p$: $w_{t+1} \leftarrow p_t \exp\big(-\alpha_p(\hat{b}_{t,J_t}(x_t)/p_{t,J_t})\,e_{J_t}\big)$, $\quad p_{t+1} \leftarrow \mathop{\mathrm{argmin}}_{q\in\mathcal{P}} D_{kl}(q\|w_{t+1})$.
return $\bar{x}_T = \frac{1}{T}\sum_{t=1}^T x_t$
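Below is a minimal sketch of Algorithm 5 in the Euclidean case with minibatch size 1 and a fixed stepsize in place of the doubling scheme; only the importance weight $n_j/(n\,p_{t,j})$ and the shifted loss (10) differ from the coordinate descent sketch above. Projection onto $X$ and the tree-based bookkeeping are omitted, and the helper names are ours.

```python
import numpy as np

def adaptive_sgd(subgrad_i, example_blocks, n, d, T,
                 step_x, step_p, p_min, L, seed=0):
    """Sketch of Algorithm 5: SGD over blocks of examples with adaptive p.

    subgrad_i(x, i):  a subgradient of f_i at x.
    example_blocks:   index arrays C_1, ..., C_b partitioning range(n).
    L:                upper bound on ||G_j(x)||_* from Assumption 2.
    """
    rng = np.random.default_rng(seed)
    b = len(example_blocks)
    x, x_avg = np.zeros(d), np.zeros(d)
    p = np.full(b, 1.0 / b)
    shift = (L * max(len(C) for C in example_blocks) / (n * p_min)) ** 2
    for t in range(T):
        j = rng.choice(b, p=p)
        C = example_blocks[j]
        g = subgrad_i(x, rng.choice(C))           # minibatch-1 estimate of G_j
        x -= step_x * (len(C) / (n * p[j])) * g   # importance-weighted step (8)
        loss = shift - (len(C) / n) ** 2 * (g @ g) / p[j] ** 2   # loss (10)
        w_j = p[j] * np.exp(-step_p * loss / p[j])
        rest = 1.0 - p[j]
        if w_j >= p_min * rest / (1.0 - p_min):   # KL projection (Prop. 4)
            p[j] = w_j
            p /= rest + w_j
        else:
            p *= (1.0 - p_min) / rest
            p[j] = p_min
        x_avg += (x - x_avg) / (t + 1)
    return x_avg
```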
**Theorem 7.** Let $W := \frac{L\,\max_j n_j}{p_{\min}\,n}$. Under Assumption 2, Algorithm 5 with $\alpha_p = \frac{1}{W^2}\sqrt{\frac{2\log b}{bT}}$ achieves

$$S(f,\bar{x}_T) \le \frac{6R}{T}\Bigg(\min_{p\in\mathcal{P}}\ \mathbb{E}\Bigg[\sum_{t=1}^T\sum_{j=1}^b \frac{n_j^2\,\|G_j(x_t)\|_*^2}{n^2\,p_j}\Bigg] + W^2\sqrt{2Tb\log b}\Bigg)^{1/2} + \frac{2RW}{T}\log_4 T,$$

where $S(f,\bar{x}_T) = \mathbb{E}[f(\bar{x}_T)] - \inf_{x\in X} f(x)$.

With equal block sizes $n_j = n/b$ and $p_{\min} = \delta/b$ for some $\delta \in (0,1)$, the first term inside the parentheses in the bound of Theorem 7 is $O(TL^2)$, which dominates the second term if $T = \Omega(b\log b)$. Since we usually have $T = \Omega(n)$ for SGD, as long as $n = \Omega(b\log b)$ we have

$$S(f,\bar{x}_T) \le O\Bigg(\frac{R}{T}\sqrt{\min_{p\in\mathcal{P}}\ \mathbb{E}\Bigg[\sum_{t=1}^T\sum_{j=1}^b \frac{n_j^2\,\|G_j(x_t)\|_*^2}{n^2\,p_j}\Bigg]}\Bigg).$$

That is, Algorithm 5 attains the best regret bound achieved by the optimal stationary distribution in hindsight, had the $x_t$ remained the same. Further, under structural assumptions $\|G_j(x)\|_*^2 \propto j^{-\alpha}$ as in Section 2.5, we can prove that the regret bound for our algorithm is better than that of the uniform distribution.

## 4. Experiments

We compare the performance of our adaptive approach with stationary sampling distributions on real and synthetic datasets. To minimize parameter tuning, we fix $\alpha_p$ at the value suggested by the theory in Theorems 3 and 7. However, we make a heuristic modification to our adaptive algorithm, since the rescaling in the bandit losses (5) and (10) dwarfs the signal in the gradient values when $L$ is too large: we present the performance of our algorithm with respect to multiple estimates $\hat{L} = L/c$ of the Lipschitz constant for $c > 1$, where $L$ is the actual upper bound (we guarantee a positive loss by taking $\max(\hat{b}_{t,j}(x), 0)$). We tune the stepsize $\alpha_x$ for both methods, using the form $\beta/\sqrt{t}$ and tuning $\beta$. For all our experiments, we compare our method against the uniform distribution and against the blockwise-Lipschitz sampling distribution $p_j \propto L_j$, where $L_j$ is the Lipschitz constant of the $j$th block (Zhao and Zhang, 2015). We observe that the latter method often performs very well with respect to iteration count; however, since computing the blockwise-Lipschitz sampling distribution takes $O(nd)$ time, the method is not competitive in large-scale settings. Our algorithm, on the other hand, adaptively learns the latent structure and often outperforms stationary counterparts with respect to runtime. While all of our plots are for a single run with a random seed, we can reject the null hypothesis $f(\bar{x}_T^{\mathrm{uniform}}) < f(\bar{x}_T^{\mathrm{adaptive}})$ at 99% confidence for all instances where our theory guarantees it. We take $\|\cdot\| = \|\cdot\|_2$ throughout this section.

### 4.1. Adaptive sampling for coordinate descent

**Synthetic data.** We begin with coordinate descent, first verifying the intuition of Section 2.5 on a synthetic dataset. We consider the problem

$$\mathop{\mathrm{minimize}}_{\|x\|_1 \le 1}\ \frac{1}{n}\|Ax - b\|_1,$$

and we endow $A \in \mathbb{R}^{n\times d}$ with the following block structure: the columns are drawn as $a_j \sim j^{-\alpha/2} N(0, I)$. Thus, the gradients over the columns decay in a heavy-tailed manner as in Section 2.5, with $L_j^2 \asymp j^{-\alpha}$. We set $n = d = b = 256$; the effects of changing the ratios $n/d$ and $b/d$ manifest themselves via the relative norms of the gradients in the columns, which we instead control via $\alpha$. We run all experiments with $p_{\min} = 0.1/b$ and multiple values of $c$. Results are shown in Figure 1, with the optimality gap vs. runtime in (a) and the learned sampling distribution in (b). Increasing $\alpha$ (stronger block structure) improves our relative performance with respect to uniform sampling and our ability to accurately learn the underlying block structure. Experiments over more values of $\alpha$ and $c$ in Section C further elucidate the phase transition from uniform-like behavior to regimes where the algorithm learns and exploits structure.
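As an illustration, the synthetic design above can be generated as follows. This is a sketch under our own naming; the planted signal `x_star` and noiseless targets are our assumption for producing a concrete right-hand side, which Section 4.1 does not specify.

```python
import numpy as np

def make_synthetic(n=256, d=256, alpha=2.2, seed=0):
    """Columns a_j ~ j^{-alpha/2} N(0, I), so column gradient norms decay
    with the power-law tail of Section 2.5 (L_j^2 ~ j^{-alpha})."""
    rng = np.random.default_rng(seed)
    scales = (1.0 + np.arange(d)) ** (-alpha / 2.0)
    A = rng.normal(size=(n, d)) * scales[None, :]  # scale column j
    x_star = rng.normal(size=d)                    # hypothetical planted signal
    b = A @ x_star                                 # targets for (1/n)||Ax - b||_1
    return A, b
```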
We also compare our method with (non-preconditioned) SGD using the leverage scores $p_j \propto \|a_j\|_1$ of Yang et al. (2016). The leverage scores (i.e., the sampling distribution proportional to the blockwise Lipschitz constants) roughly correspond to using $p_j \propto j^{-\alpha/2}$, which is the stationary distribution that minimizes the bound (3); in this synthetic setting, this sampling probability coincides with the actual block structure. Although it is expensive to compute, taking $O(nd)$ time, it exploits the latent block structure very well, as expected. Our method quickly learns the structure and performs comparably with this optimal distribution.

*Figure 1. Adaptive coordinate descent for $\alpha = 0.4$ (left) and $\alpha = 2.2$ (right): (a) optimality gap, (b) learned sampling distribution.*

**Model selection.** Our algorithm's ability to learn underlying block structure can be useful in its own right as an online feature selection mechanism. We present one example of this task, studying an aptamer selection problem (Cho et al., 2013) consisting of $n = 2900$ nucleotide sequences (aptamers), one-hot encoded with all $k$-grams of the sequence for $1 \le k \le 5$, so that $d = 105{,}476$. We train an $\ell_1$-regularized SVM on the binary labels, which denote (thresholded) binding affinity of the aptamer. We set the block size to 50 features ($b = 2110$) and $p_{\min} = 0.01/b$. Results are shown in Figure 2, where we see that adaptive feature selection clearly improves training time in (a). The learned sampling distribution depicted in (b) for the best case ($c = 10^7$) places larger weight on features known as G-complexes; these features are well known to affect binding affinities (Cho et al., 2013).

*Figure 2. Model selection for nucleotide sequences: (a) optimality gap, (b) learned distribution.*

### 4.2. Adaptive sampling for SGD

**Synthetic data.** We use the same setup as in Section 4.1, but now endow the block structure on the rows of $A$ rather than the columns. In Figure 3, we see that when there is little block structure ($\alpha = 0.4$), all sampling schemes perform similarly. When the block structure is pronounced ($\alpha = 6$), our adaptive method again learns the underlying structure and outperforms uniform sampling. We provide more experiments in Section C to illustrate behavior over more values of $c$ and $\alpha$. We note that, unlike stationary methods such as leverage scores, our method can handle online data streams.

*Figure 3. Adaptive SGD for $\alpha = 0.4$ (left) and $\alpha = 6$ (right): (a) optimality gap, (b) learned sampling distribution.*

**CUB-200-2011 / ALOI.** We apply our method to two multi-class object detection datasets: Caltech-UCSD Birds-200-2011 (Wah et al., 2011) and ALOI (Geusebroek et al., 2005). Labels are used to form blocks, so that $b = 200$ for CUB and $b = 1000$ for ALOI. We use the softmax loss for CUB-200-2011 and a binary SVM loss for ALOI, where in the latter we perform binary classification between shells and non-shell objects. We set $p_{\min} = 0.5/b$ to enforce sufficient exploration. For the CUB features, we use the outputs of the last fully-connected layer of ResNet-50 (He et al., 2016), giving 2049-dimensional features; since our classifier $x$ is $(b \times d)$-dimensional, this is a fairly large-scale problem. For ALOI, we use the default histogram features ($d = 128$).
In each case, we have $n = 5994$ and $n = 108{,}000$ examples, respectively. We use $X := \{x \in \mathbb{R}^m : \|x\|_2 \le r\}$, where $r = 100$ for CUB and $r = 10$ for ALOI. We observe in Figure 4 that our adaptive sampling method outperforms its stationary counterparts.

*Figure 4. Optimality gap for CUB-200-2011 (a) and ALOI (b).*

## Acknowledgements

HN was supported by the Samsung Scholarship. AS and SY were supported by Stanford Graduate Fellowships, and AS was also supported by a Fannie & John Hertz Foundation Fellowship. JCD was supported by NSF CAREER award 1553086.

## References

Z. Allen-Zhu and Y. Yuan. Even faster accelerated coordinate descent using non-uniform sampling. arXiv preprint arXiv:1512.09103, 2015.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175, 2003.

A. Beck and L. Tetruashvili. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4):2037–2060, 2013.

S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

M. Cho, S. S. Oh, J. Nie, R. Stewart, M. Eisenstein, J. Chambers, J. D. Marth, F. Walker, J. A. Thomson, and H. T. Soh. Quantitative selection and parallel characterization of aptamers. Proceedings of the National Academy of Sciences, 110(46), 2013.

D. Csiba and P. Richtárik. Importance sampling for minibatches. arXiv preprint arXiv:1602.02283, 2016.

D. Csiba, Z. Qu, and P. Richtárik. Stochastic dual coordinate ascent with adaptive probabilities. arXiv preprint arXiv:1502.08053, 2015.

C. D. Dang and G. Lan. Stochastic block mirror descent methods for nonsmooth and stochastic optimization. SIAM Journal on Optimization, 25(2):856–881, 2015.

A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, 2014.

J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

J.-M. Geusebroek, G. J. Burghouts, and A. W. Smeulders. The Amsterdam Library of Object Images. International Journal of Computer Vision, 61(1):103–112, 2005.

S. Gopal. Adaptive sampling for SGD by exploiting side information. In Proceedings of the 33rd International Conference on Machine Learning, pages 364–372, 2016.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, 2013.

Y. T. Lee and A. Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In 54th Annual Symposium on Foundations of Computer Science, pages 147–156. IEEE, 2013.

Z. Lu and L. Xiao. On the complexity analysis of randomized block-coordinate descent methods. Mathematical Programming, 152(1-2):615–642, 2015.

B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the Twenty Third Annual Conference on Computational Learning Theory, 2010.
I. Necoara, Y. Nesterov, and F. Glineur. A random coordinate descent method on large optimization problems with linear constraints. University Politehnica Bucharest, Tech. Rep., 2011.

D. Needell, R. Ward, and N. Srebro. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems 27, pages 1017–1025, 2014.

A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

J. Nutini, M. Schmidt, I. H. Laradji, M. Friedlander, and H. Koepke. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. arXiv preprint arXiv:1506.00552, 2015.

P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.

P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 2015. URL http://link.springer.com/article/10.1007/s10107-015-0901-6.

S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv preprint arXiv:1211.2717, 2012.

T. Strohmer and R. Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262–278, 2009.

C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

J. Yang, Y.-L. Chow, C. Ré, and M. W. Mahoney. Weighted SGD for ℓp regression with randomized preconditioning. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 558–569. Society for Industrial and Applied Mathematics, 2016.

P. Zhao and T. Zhang. Accelerating minibatch stochastic gradient descent using stratified sampling. arXiv:1405.3080 [stat.ML], 2014.

P. Zhao and T. Zhang. Stochastic optimization with importance sampling. In Proceedings of the 32nd International Conference on Machine Learning, 2015.