# Online Variance Reduction with Mixtures

Zalán Borsos¹, Sebastian Curi¹, Kfir Y. Levy¹, Andreas Krause¹

¹Department of Computer Science, ETH Zurich. Correspondence to: Zalán Borsos.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

## Abstract

Adaptive importance sampling for stochastic optimization is a promising approach that offers improved convergence through variance reduction. In this work, we propose a new framework for variance reduction that enables the use of mixtures over predefined sampling distributions, which can naturally encode prior knowledge about the data. While these sampling distributions are fixed, the mixture weights are adapted during the optimization process. We propose VRM, a novel and efficient adaptive scheme that asymptotically recovers the best mixture weights in hindsight and can also accommodate sampling distributions over sets of points. We empirically demonstrate the versatility of VRM in a range of applications.

## 1. Introduction

In the framework of Empirical Risk Minimization (ERM), we are provided with a set of samples $D = \{x_1, \dots, x_n\} \subseteq \mathcal{X}$ drawn from the underlying data distribution, and our goal is to minimize the average loss over $D$, known as the empirical risk. Sequential ERM solvers (SGD, SVRG, etc.) proceed in multiple passes over the dataset and usually require an unbiased estimate of the loss in each round of the optimization. Typically, the estimate is generated by sampling uniformly from $D$, which is oblivious to the fact that different points can affect the optimization differently. This ignorance can hinder the performance of the optimizer due to the high variance of the obtained estimates. A promising direction that has recently received increased interest is represented by (adaptive) importance sampling techniques: clever sampling distributions can account for characteristics of datapoints relevant to the optimization in order to improve performance (Zhao & Zhang, 2015; Namkoong et al., 2017; Katharopoulos & Fleuret, 2018).

**Why mixtures?** The majority of existing works on adaptive sampling distributions are unable to exploit similarities between points. Thus, only after several passes over the dataset do these methods become effective. This can become a major bottleneck for large datasets. Fortunately, in many situations it is possible to exploit the structure present in the data in the form of a prior. One way of capturing prior knowledge in the framework of importance sampling for variance reduction is to propose plausible fixed sampling distributions before performing the optimization. This is very natural for problems where similar objects can be grouped together, e.g., based on the class label or on clustering in feature space. In such cases it is sensible to employ standard sampling distributions that draw similar members with the same probability, e.g., as considered by Zhao & Zhang (2014). Another option is to employ sampling distributions that encourage diverse sets of samples, e.g., Determinantal Point Processes (Kulesza et al., 2012). Suppose that several such proposals for sampling distributions are available. A natural idea is to combine them into a mixture and adapt the mixture weights during the optimization process, in order to achieve variance reduction.
This setup has the potential of being much more efficient than learning an arbitrary sampling distribution over individual points, provided that one can propose plausible sampling distributions prior to the optimization. Another advantage of this setting is that it enables efficient handling of distributions not only over individual points, but also over sets of points. While in principle the latter can still be treated using existing approaches, their computational complexity then increases proportionally to the number of sets, possibly exponentially in the number of points.

In this work, we develop an online learning approach to variance reduction and ask the following question: given $k$ fixed sampling distributions, how can we choose the mixture weights in order to achieve the largest reduction in the variance of the estimates? We provide a simple yet efficient algorithm for doing so. As for our main contributions, we:

- formulate the task of adaptive importance sampling for variance reduction with mixtures as an online learning problem,
- propose a novel algorithm for this setting with sublinear regret of $O(T^{4/5})$,
- substantiate our findings experimentally with an accompanying efficient implementation.¹

¹The code is available at https://github.com/zalanborsos/variance-reduction-mixtures

**Related Work:** There is a large body of work on employing importance sampling distributions for variance reduction. Prior knowledge of gradient norm bounds on each datapoint has been utilized for fixed importance sampling by Needell et al. (2014) and Zhao & Zhang (2015). Adaptive strategies were presented by Bouchard et al. (2015), who propose parametric importance sampling distributions whose parameters are updated during the course of the optimization. Stich et al. (2017) derive a safe adaptive sampling scheme that is guaranteed to outperform any fixed sampling strategy. A significant body of work is concerned with non-uniform sampling of coordinates in coordinate descent (Allen-Zhu et al., 2016; Perekrestenko et al., 2017; Salehi et al., 2018). All the works presented above provide importance sampling schemes over points or coordinates; sampling over sets of points and exploiting similarities between points remains an open question in these works.

Importance sampling has found applications in optimizing deep neural networks. Johnson & Guestrin (2018) and Katharopoulos & Fleuret (2018) develop methods for choosing the importance sampling distributions over points proportionally to their corresponding approximate gradient norm bounds. Johnson & Guestrin (2018) also propose to adapt the learning rate based on the gains in gradient norm reduction. Loshchilov & Hutter (2015) perform sampling based on the latest known loss value, with exponentially decaying selection probability in the rank. In the context of reinforcement learning, Schaul et al. (2016) suggest a smoothed importance sampling scheme over the experiences present in the replay buffer, based on the last observed TD-error.

Most closely related to our setting, importance sampling for variance reduction has been considered through the lens of online learning in the recent works of Namkoong et al. (2017), Salehi et al. (2017) and Borsos et al. (2018). These works pose the ERM solver as an adversary responsible for generating the losses. The goal of the learner (player) is to minimize the cumulative variance by choosing the sampling distributions adaptively, based on partial feedback.
However, similarly to existing work on adaptive sampling, these methods are not designed to exploit similarities between points.

## 2. Problem Setup

In the framework of ERM, the goal is to minimize the empirical risk,

$$\min_{\theta \in \Theta} L(\theta) = \min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \ell(x_i, \theta),$$

where $\ell: \mathcal{X} \times \Theta \to \mathbb{R}_{\geq 0}$ is the loss function and $\Theta \subseteq \mathbb{R}^d$ is a compact domain. A typical sequential ERM solver runs over $T$ rounds and updates the parameters based on an unbiased estimate $\tilde{L}_t$ of the empirical loss in each round $t \in [T]$. A common approach for producing $\tilde{L}_t$ is to sample a point $i_t \in \{1, \dots, n\}$ uniformly at random, thus ignoring the underlying structure of the data. However, using importance sampling, we can produce these estimates by sampling with any distribution $q \in \Delta_n$, where $\Delta_n$ is the $n$-dimensional probability simplex, provided that we compensate for the bias through importance weights.

Suppose we are provided with $k$ sampling distributions $p_1, \dots, p_k \in \Delta_n$. We combine these distributions into a mixture, in which the probability of sampling $x_i$ is given by $w^\top p(i)$, where $w \in \Delta_k$ is the mixture weight vector and $p(i) := [p_1(i), \dots, p_k(i)]^\top$. Using the mixture, we obtain the loss estimate $\tilde{L}_t(\theta) = r_i\, \ell(x_i, \theta)$, where $r_i = \frac{1}{n\, w^\top p(i)}$ is the importance weight of point $i$.

The performance of solvers such as SGD, SAGA (Defazio et al., 2014) and SVRG (Johnson & Zhang, 2013) is known to improve when the variance of $\tilde{L}_t$ is smaller. Thus, a natural performance measure for our mixture sampling distribution is the cumulative variance of $\tilde{L}_t$ over the $T$ rounds of optimization,

$$\sum_{t=1}^{T} \mathrm{Var}\big(\tilde{L}_t(\theta_t)\big) = \frac{1}{n^2} \sum_{t=1}^{T} f_t(w_t) - \sum_{t=1}^{T} L^2(\theta_t).$$

Since only the cumulative second moments depend on $w$, we define our cost function at time $t$ as $\frac{1}{n^2} f_t(w)$, where

$$f_t(w) := \sum_{i=1}^{n} \frac{\ell_t^2(i)}{w^\top p(i)},$$

and where we have introduced the shorthand $\ell_t(i) := \ell(x_i, \theta_t)$. Through the lens of online learning, it is natural to regard the sequential solver as an adversary responsible for generating the losses $\{\ell_t\}_{t \in [T]}$ and to measure performance by the cumulative regret,

$$\mathrm{Regret}_T = \frac{1}{n^2} \left( \sum_{t=1}^{T} f_t(w_t) - \min_{w \in \Delta_k} \sum_{t=1}^{T} f_t(w) \right).$$

By devising a no-regret algorithm, we are guaranteed to compete asymptotically with the best mixture weights in hindsight. The protocol for online variance reduction with mixtures (OVRM) is presented in Figure 1. Motivated by empirical insights, we impose a natural, mild restriction on our setting, which is easily verified in practice:

**Assumption 1.** Throughout the work, we assume that the losses are bounded, $\ell_t^2(i) \leq L$ for all $t \in [T]$, $i \in [n]$, and that every mixture component places a probability mass of at most $p_{\max} = \frac{c}{n}$ on any specific point, where $c \in [1, n]$. That is, $p_j(i) \leq \frac{c}{n}$ for all $i \in [n]$, $j \in [k]$.

The choice of $c$ is in the hands of the designer, who determines the fixed sampling distributions in the mixture. Although $c$ can be as large as $n$, our experiments show that we can obtain a large speedup in the optimization due to the reduced variance by using mixtures with small values of $c$ (the maximal value of $c$ is less than 50 in the experiments). We note that our setup also allows choosing $k = n$, i.e., as many mixture components as points, where each component puts all its probability mass on a single point, $p_j(i) = \delta_{ij}$ for $i \in [n]$, $j \in [k]$, and consequently $c = n$. Under this perspective, our setting is a strict generalization of adaptively choosing sampling distributions over the points, which is the main objective of Salehi et al. (2017), Namkoong et al. (2017) and Borsos et al. (2018). However, in practical scenarios, $k$ is usually small due to the limited number of available proposal distributions.
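To make these quantities concrete, the following minimal numpy sketch (our illustration with hypothetical helper names, not taken from the paper's released code) draws a point from the mixture $w^\top p$, forms the unbiased loss estimate $\tilde{L}_t$, and evaluates the cost $f_t(w)$:

```python
import numpy as np

def sample_and_estimate(P, w, losses, rng):
    """One round of mixture importance sampling.

    P      : (k, n) array; row j is the fixed distribution p_j over points.
    w      : (k,) mixture weights on the simplex.
    losses : (n,) per-point losses ell_t(i) at the current iterate.
    """
    n = P.shape[1]
    q = w @ P                      # mixture probabilities, q[i] = w^T p(i)
    i = rng.choice(n, p=q)         # draw I_t ~ w^T p
    r = 1.0 / (n * q[i])           # importance weight of the drawn point
    L_est = r * losses[i]          # unbiased: E[L_est] = (1/n) * sum_i ell_t(i)
    f_t = np.sum(losses**2 / q)    # cost f_t(w); note E[L_est^2] = f_t(w) / n^2
    return i, L_est, f_t

rng = np.random.default_rng(0)
k, n = 3, 100
P = rng.dirichlet(np.ones(n), size=k)   # k fixed sampling distributions
w = np.ones(k) / k                      # uniform mixture weights
losses = rng.uniform(size=n)
print(sample_and_estimate(P, w, losses, rng))
```

Note that evaluating $f_t$ exactly requires all $n$ losses; the algorithm of Section 4 only ever observes the loss of the drawn point.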
OVRM Protocol
Input: dataset $D = \{x_1, \dots, x_n\}$, sampling distributions $p = [p_1, \dots, p_k]$.
for $t = 1, \dots, T$ do
    player chooses $w_t \in \Delta_k$
    adversary chooses $\ell_t \in \mathbb{R}^n$
    player draws $I_t \sim w_t^\top p$
    player incurs cost $f_t(w_t)/n^2$ and receives $\ell_t(I_t)$ as partial feedback
end for

Figure 1. Online variance reduction protocol with mixtures and partial feedback.

We have formulated our objective as minimizing the cumulative second moment of the loss estimates. If we substitute $\ell_t(i)$ with $\|\nabla \ell(x_i, \theta_t)\|$, the norm of the loss gradient, the corresponding cumulative second moment has a stark relationship to the quality of optimization; for example, this quantity appears directly in the regret bounds of AdaGrad (Duchi et al., 2011). For a more detailed discussion, see Borsos et al. (2018).

Let us discuss some properties of our setting. Since $f_1, \dots, f_T$ are convex functions on $\Delta_k$, the problem is an instance of online convex optimization (OCO). While the OCO framework offers a wide range of well-understood tools, our biggest challenge here is posed by the fact that the cost functions are unbounded, together with the fact that we only have partial feedback; the majority of existing regret analyses assume bounded cost functions. For simplicity, we focus on choosing datapoints; nevertheless, our method also applies to choosing coordinates or blocks of coordinates in coordinate descent, and can work on top of any sequential solver that builds on unbiased loss estimates. As we will see in Section 4, the complexity and the performance guarantee of our algorithm are independent of $n$, which broadens its applicability significantly. For example, instead of learning mixtures of distributions over points, we can learn a mixture for variance-reduced sampling of minibatches, where each mixture component is a fixed k-Determinantal Point Process (Kulesza et al., 2012).

## 3. Full Information Setting

Let us assume for the moment that in each round of Protocol 1, the player receives full information feedback, i.e., sees the losses associated with all points, $[\ell_t(1), \dots, \ell_t(n)]$, instead of observing only the loss $\ell_t(I_t)$ of the chosen point. This setup, referred to as the full information setting, is unrealistic, yet it serves as the main tool for the analysis of the partial information setting, which we discuss in Section 4. Here, we first present an efficient algorithm for the full information setting (Algorithm 1), ensuring a regret bound of $O(k^{1/2} T^{2/3})$.

Unfortunately, even under Assumption 1, our cost function is unbounded. In order to tackle this, we require that the last mixture component (the $k$-th one) is always the uniform distribution. If this is not the case in practice, we can simply attach the uniform distribution to the given distributions. This is w.l.o.g., since the optimal $w$ in hindsight is allowed to assign weight 0 to any component. Thus, we have $p(i) = [p_1(i), \dots, p_{k-1}(i), 1/n]^\top$ for all $i \in [n]$. For the analysis, we consider the restricted simplex $\Delta_k^\gamma = \{w \in \Delta_k \mid w(k) \geq \gamma\}$, where the weight of the uniform component is at least some $\gamma \in (0, 1]$. This allows for decomposing the regret as follows:

$$\mathrm{Regret}_T = \frac{1}{n^2} \Bigg( \underbrace{\sum_{t=1}^{T} f_t(w_t) - \min_{w \in \Delta_k^\gamma} \sum_{t=1}^{T} f_t(w)}_{(A)} + \underbrace{\min_{w \in \Delta_k^\gamma} \sum_{t=1}^{T} f_t(w) - \min_{w \in \Delta_k} \sum_{t=1}^{T} f_t(w)}_{(B)} \Bigg).$$

This decomposition introduces a trade-off: by choosing a larger $\gamma$, we pay more in term (B) for potentially missing the optimal $w$. Nevertheless, a larger $\gamma$ makes the cost function "nicer": not only does it reduce the upper bounds on the costs, but it also turns $f_t$ into an exp-concave function, as we will show.
First, let us focus on bounding (B), which captures the excess regret of working in $\Delta_k^\gamma$ instead of $\Delta_k$.

**Lemma 1.** The reduction to the restricted simplex $\Delta_k^\gamma$ incurs an excess regret of $(B) \leq n^2 \gamma L T$.

Proof. Let $w^* = \arg\min_{w \in \Delta_k} \sum_{t=1}^{T} f_t(w)$ and let $\tilde{w} = (1 - \gamma) w^* + \gamma e_k$, where $e_k = [0, \dots, 0, 1]^\top$. Clearly $\tilde{w} \in \Delta_k^\gamma$. We can observe that for all $i \in [n]$,

$$\frac{1}{\tilde{w}^\top p(i)} - \frac{1}{w^{*\top} p(i)} = \frac{\gamma\, (w^* - e_k)^\top p(i)}{\tilde{w}^\top p(i)\; w^{*\top} p(i)} = \gamma\, \frac{w^{*\top} p(i) - 1/n}{\tilde{w}^\top p(i)\; w^{*\top} p(i)}.$$

If for some $i$ we have $w^{*\top} p(i) - 1/n < 0$, i.e., $w^{*\top} p(i) < 1/n$, this term is negative and we can ignore it. Otherwise, if $w^{*\top} p(i) \geq 1/n$, then evidently also $\tilde{w}^\top p(i) \geq 1/n$. Denote by $I^+$ the set of indices $i$ for which $w^{*\top} p(i) \geq 1/n$. Using the previous observations, we can now bound (B):

$$(B) \leq \sum_{t=1}^{T} \sum_{i \in I^+} \gamma\, \ell_t^2(i)\, \frac{w^{*\top} p(i) - 1/n}{\tilde{w}^\top p(i)\; w^{*\top} p(i)} \leq \gamma L \sum_{t=1}^{T} \sum_{i \in I^+} \frac{1}{\tilde{w}^\top p(i)} \leq n^2 \gamma L T,$$

where the last inequality uses the fact that $\tilde{w}^\top p(i) \geq 1/n$ for all $i \in I^+$ and that $|I^+| \leq n$. This proves the claim. ∎

By constraining our convex set to the restricted simplex $\Delta_k^\gamma$, we achieve desirable properties of our cost function: $f_t$ and its gradient norm are bounded. The first natural option for solving the problem is Online Gradient Descent (OGD). However, OGD can only guarantee an $O(\sqrt{T})$ bound on (A); we elaborate on this in the supplementary material. We can obtain better regret bounds by noticing that restricting the domain to $\Delta_k^\gamma$ has another advantage: it allows for exploiting curvature information, since $f_t$ is exp-concave on this domain. A convex function $g: \mathcal{K} \to \mathbb{R}$, where $\mathcal{K}$ is a convex set, is called $\alpha$-exp-concave if $e^{-\alpha g(x)}$ is concave. Exp-concavity is a weaker property than strong convexity, but it can still be exploited to achieve logarithmic regret bounds (Hazan et al., 2006). In the following result, we establish the exp-concavity of our cost function on the restricted simplex $\Delta_k^\gamma$.

**Lemma 2.** $f_t$ is $\frac{2\gamma}{n^2 L}$-exp-concave on $\Delta_k^\gamma$ for all $t \in [T]$.

Proof sketch. In order to prove exp-concavity, we rely on the following result (Hazan et al., 2016): a twice-differentiable function $g$ is $\alpha$-exp-concave at $x$ iff

$$\nabla^2 g(x) \succeq \alpha\, \nabla g(x)\, \nabla g(x)^\top. \tag{2}$$

In our case, $\mathcal{K} = \Delta_k^\gamma$,

$$\nabla f_t(w) = -\sum_{i=1}^{n} \frac{\ell_t^2(i)\, p(i)}{(w^\top p(i))^2} \quad \text{and} \quad \nabla^2 f_t(w) = 2 \sum_{i=1}^{n} \frac{\ell_t^2(i)\, p(i)\, p(i)^\top}{(w^\top p(i))^3}.$$

We can prove exp-concavity using the following observation: for $x_1, \dots, x_n \in \mathbb{R}^d$ we have

$$\Big( \sum_{i=1}^{n} x_i \Big) \Big( \sum_{i=1}^{n} x_i \Big)^\top \preceq n \sum_{i=1}^{n} x_i x_i^\top, \tag{3}$$

which is a consequence of the definition of positive semidefiniteness and Jensen's inequality. If we instantiate $x_i := \frac{\ell_t^2(i)\, p(i)}{(w^\top p(i))^2}$ and plug $f_t$ into Equation 2, we can identify $\alpha = 2\gamma/(n^2 L)$ after an additional step of lower bounding $w^\top p(i)$ by $\gamma/n$. From this last step we can see that working in the restricted simplex $\Delta_k^\gamma$ is crucial for achieving exp-concavity.
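As a quick numerical sanity check of Lemma 2 (our addition, not part of the paper), the following numpy snippet verifies the exp-concavity condition of Equation 2 with $\alpha = 2\gamma/(n^2 L)$ on a random instance; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, gamma = 50, 4, 0.2

# k fixed distributions over n points; the last component is uniform.
P = rng.dirichlet(np.ones(n), size=k)
P[-1] = 1.0 / n
# A point in the restricted simplex: w(k) >= gamma.
w = (1 - gamma) * rng.dirichlet(np.ones(k)) + gamma * np.eye(k)[-1]

ell2 = rng.uniform(size=n)           # squared losses, so L = 1
q = P.T @ w                          # q[i] = w^T p(i)

grad = -np.sum(ell2[:, None] * P.T / q[:, None] ** 2, axis=0)
hess = 2 * (P * (ell2 / q ** 3)) @ P.T

alpha = 2 * gamma / (n ** 2 * 1.0)   # alpha = 2 * gamma / (n^2 L)
M = hess - alpha * np.outer(grad, grad)
print(np.linalg.eigvalsh(M).min())   # should be >= 0 up to numerical error
```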
Algorithm 1 ONS
input: dataset $D = \{x_1, \dots, x_n\}$, sampling distributions $p = [p_1, \dots, p_{k-1}, \mathrm{uniform}]$, parameters $\gamma, \beta, \varepsilon > 0$.
1: $w_1 = [1/k, \dots, 1/k]$
2: $H_0 = \varepsilon I$
3: for $t = 1$ to $T$ do
4:    play $w_t$, observe $f_t$
5:    update: $H_t = H_{t-1} + \nabla f_t(w_t)\, \nabla f_t(w_t)^\top$
6:    Newton step: $w' = w_t - \frac{1}{\beta} H_t^{-1} \nabla f_t(w_t)$
7:    project: $w_{t+1} = \arg\min_{w \in \Delta_k^\gamma} (w - w')^\top H_t\, (w - w')$
8: end for

Since the $f_t$'s are $\alpha$-exp-concave functions on the restricted simplex, we can bound (A) by employing Algorithm 1, known as Online Newton Step (ONS), which provides the following guarantee for appropriately chosen $\beta$ and $\varepsilon$ (Hazan et al., 2006):

$$\sum_{t=1}^{T} f_t(w_t) - \min_{w \in \Delta_k^\gamma} \sum_{t=1}^{T} f_t(w) \leq 5 \left( \frac{1}{\alpha} + GD \right) k \log T, \tag{4}$$

where $D = \sqrt{2}$ is the diameter of $\Delta_k^\gamma$ and $G \geq \sup_{w \in \Delta_k^\gamma,\, t \in [T]} \|\nabla f_t(w)\|_2$ is an upper bound on the gradient norm:

$$\|\nabla f_t(w)\|_2 = \Bigg\| \sum_{i=1}^{n} \frac{\ell_t^2(i)\, p(i)}{(w^\top p(i))^2} \Bigg\|_2 \leq \frac{n^2 L}{\gamma^2}\, \big\| (1, \dots, 1) \big\|_2 = \frac{n^2 L \sqrt{k}}{\gamma^2},$$

where the inequality uses that in $\Delta_k^\gamma$ we have $w^\top p(i) \geq \gamma/n$. Using these bounds together with Lemma 2 in Equation 4, we can finally bound (A):

**Lemma 3.** Algorithm 1 ensures $(A) \leq \frac{10 L k^{3/2} n^2 \log T}{\gamma^2}$.

Finally, we can combine the results of Lemmas 3 and 1 and optimize the parameter $\gamma$ that controls the trade-off, to arrive at the following regret bound with respect to the full simplex:

**Theorem 4.** The regret of Algorithm 1 is $\mathrm{Regret}_T \leq 5 L k^{1/2} T^{2/3} \log^{1/3} T$.

## 4. The Partial Information Setting

In practice, the player only receives partial feedback from the environment, corresponding to the loss of the chosen point, as presented in Figure 1. Even under partial feedback, the unbiasedness of the loss estimates must be ensured. For this, we propose our main algorithm, Variance Reduction with Mixtures (VRM), presented in Algorithm 2. VRM is inspired by the seminal work of Auer et al. (2002) in its approach to obtaining unbiased estimates under partial information. In round $t$, the algorithm samples $I_t \sim w_t^\top p$ (line 4) and receives only $\ell_t(I_t)$ as feedback. We estimate the squared losses by

$$\tilde{\ell}_t^2(i) = \frac{\ell_t^2(i)}{w_t^\top p(i)}\, \mathbb{1}_{\{I_t = i\}}, \tag{5}$$

which is clearly unbiased, since $\mathbb{E}\big[\tilde{\ell}_t^2(i) \,\big|\, \ell_t, w_t\big] = \ell_t^2(i)$. We can analogously define $\tilde{f}_t(w) = \sum_{i=1}^{n} \frac{\tilde{\ell}_t^2(i)}{w^\top p(i)}$. With this choice, the estimates can readily be used, similarly to the full information setting, in Algorithm 2.

Algorithm 2 VRM
input: dataset $D = \{x_1, \dots, x_n\}$, sampling distributions $p = [p_1, \dots, p_{k-1}, \mathrm{uniform}]$, parameters $\gamma, \beta, \varepsilon > 0$.
1: $w_1 = [1/k, \dots, 1/k]$
2: $H_0 = \varepsilon I$
3: for $t = 1$ to $T$ do
4:    sample $I_t \sim w_t^\top p$, receive $\ell_t(I_t)$, set $\tilde{f}_t(w) = \frac{\tilde{\ell}_t^2(I_t)}{w^\top p(I_t)}$
5:    update: $H_t = H_{t-1} + \nabla \tilde{f}_t(w_t)\, \nabla \tilde{f}_t(w_t)^\top$
6:    Newton step: $w' = w_t - \frac{1}{\beta} H_t^{-1} \nabla \tilde{f}_t(w_t)$
7:    project: $w_{t+1} = \arg\min_{w \in \Delta_k^\gamma} (w - w')^\top H_t\, (w - w')$
8: end for
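To illustrate lines 4 to 7 of Algorithm 2, here is a minimal numpy sketch of one VRM round (our illustration; the $H_t$-weighted projection of line 7 is approximated by a Euclidean projection onto $\Delta_k^\gamma$, using the `project_restricted_simplex` helper sketched after Algorithm 3 in Section 5):

```python
import numpy as np

def vrm_round(P, w, H_inv, loss_fn, beta, gamma, rng):
    """One round of VRM (Algorithm 2) with an approximate projection step.

    P: (k, n) matrix of fixed sampling distributions; H_inv: inverse of H_{t-1}.
    loss_fn(i) returns the loss of point i at the current iterate (the feedback).
    """
    n = P.shape[1]
    q = w @ P                                  # mixture probabilities
    i = rng.choice(n, p=q)                     # sample I_t, observe ell_t(I_t)
    ell2_tilde = loss_fn(i) ** 2 / q[i]        # unbiased squared-loss estimate, Eq. (5)
    grad = -ell2_tilde * P[:, i] / q[i] ** 2   # gradient of f~_t at w_t

    # Sherman-Morrison update of H_t^{-1} (see Section 5).
    Hg = H_inv @ grad
    H_inv = H_inv - np.outer(Hg, Hg) / (1.0 + grad @ Hg)

    w_next = w - (1.0 / beta) * (H_inv @ grad)           # Newton step
    return project_restricted_simplex(w_next, gamma), H_inv
```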
In the partial information setting, the natural performance measure of the player is the expected regret $\mathbb{E}[\mathrm{Regret}_T]$, where the expectation is taken with respect to the randomized choices of the player and the actions of the adversary. Crucially, we allow the adversary to adapt to the player's past behavior. This non-oblivious setting arises naturally in stochastic optimization, where $\ell_t$ depends on $w_{t-1}$. For analyzing the expected regret incurred by VRM under partial information, we can reuse the full information analysis. However, the exp-concavity constant and the gradient norm bounds change, and the non-oblivious behavior requires further analysis, resulting in the no-regret guarantee of Theorem 5, which is independent of $n$.

**Theorem 5.** VRM achieves the expected regret $\mathbb{E}[\mathrm{Regret}_T] = O\big(k^{3/8} c^{1/5} L T^{4/5}\big)$.

Proof sketch. We first bound the pseudo-regret, which compares the cost incurred by VRM to the cost incurred by the optimal mixture weights in expectation. It can be shown that $\tilde{f}_t(w)$ is $\frac{2\gamma^2}{n^2 L}$-exp-concave on $\Delta_k^\gamma$ and has the gradient bound

$$\|\nabla \tilde{f}_t(w)\|_2 = \frac{\tilde{\ell}_t^2(I_t)\, \|p(I_t)\|_2}{(w^\top p(I_t))^2} \leq \frac{L n^2 c \sqrt{k}}{\gamma^3},$$

where the inequality uses the fact that $w^\top p(I_t) \geq \gamma/n$ and Assumption 1, which implies $\|p(i)\|_2 \leq c\sqrt{k}/n$ for all $i \in [n]$. Combined with the guarantee in Equation 4, this gives the bound on the expectation of (A) from the regret decomposition. The upper bound on (B) from Lemma 1 does not change under expectation and the modified losses. For bounding the expected regret, we rely on Freedman's lemma (Freedman, 1975) for the martingale difference sequence $\{Z_t := \sum_{i=1}^{n} \tilde{\ell}_t^2(i) - \sum_{i=1}^{n} \ell_t^2(i)\}_{t \in [T]}$, in order to account for the non-oblivious nature of the adversary.

## 5. Efficient Implementation

We now address practical aspects of VRM. Naively implemented, each iteration of the algorithm has a complexity of $O(k^3)$. One might argue that this can become a bottleneck when performed in each round of stochastic optimization. In practice, however, one usually has a limited number of available proposal distributions, limiting $k$ to the small regime. Moreover, in the following, we present several tricks that improve the per-iteration complexity.

The Newton update and step in lines 5 and 6 of Algorithm 2 can be implemented in $O(k^2)$ due to the Sherman-Morrison formula (Hazan et al., 2006):

$$H_t^{-1} = H_{t-1}^{-1} - \frac{H_{t-1}^{-1}\, \nabla \tilde{f}_t(w_t)\, \nabla \tilde{f}_t(w_t)^\top\, H_{t-1}^{-1}}{1 + \nabla \tilde{f}_t(w_t)^\top\, H_{t-1}^{-1}\, \nabla \tilde{f}_t(w_t)}.$$

Thus, the most costly operation of the algorithm is the projection step, which requires solving a quadratic program with complexity $O(k^3)$. In practice, we can trade off accuracy for efficiency by solving the quadratic program approximately, employing only a few steps of a projection-based iterative solver (e.g., projected gradient descent). The key to the success of such a proposal is an efficient projection step onto the restricted simplex $\Delta_k^\gamma$, which captures the constraints of the quadratic program. Our proposed method, Algorithm 3, is a two-stage projection procedure inspired by the efficient projection onto the simplex (Gafni & Bertsekas, 1984; Duchi et al., 2008), and has $O(k \log k)$ time complexity due to the sorting.

Algorithm 3 Projection
1: function proj_simplex($w \in \mathbb{R}^d$, $z \in (0, 1]$)
2:    sort $w$ decreasingly into $u$
3:    $\rho = \max \big\{ j \in [d] : u_j - \big( \sum_{\tau=1}^{j} u_\tau - z \big)/j > 0 \big\}$
4:    $\lambda = \big( \sum_{\tau=1}^{\rho} u_\tau - z \big)/\rho$
5:    return $\max\{w - \lambda, 0\}$
6: end function
7:
8: $w = \mathrm{proj\_simplex}(w, 1)$
9: if $w(k) < \gamma$ then
10:   $w(k) = \gamma$
11:   $w(1{:}k{-}1) = \mathrm{proj\_simplex}(w(1{:}k{-}1),\, 1 - \gamma)$
12: end if
13: return $w$

The idea behind projecting onto $\Delta_k^\gamma$ is the following: if the projection with respect to the full simplex results in a point in the restricted simplex, we are done. Otherwise, we set the last coordinate of $w$ to $\gamma$ and project the first $k-1$ coordinates so that they carry the remaining mass $1 - \gamma$.

**Lemma 6.** Algorithm 3 returns $\bar{x} = \arg\min_x \|x - w\|_2^2$ s.t. $x \in \Delta_k^\gamma$.

Proof. As shown by Duchi et al. (2008), the proj_simplex function solves the following minimization problem:

$$\min_x \|x - w\|_2^2 \quad \text{s.t.} \quad \sum_{i=1}^{d} x_i = z, \; x_i \geq 0.$$

Denoting $x^* = \arg\min_{x \in \Delta_k} \|x - w\|_2$ and $\bar{x} = \arg\min_{x \in \Delta_k^\gamma} \|x - w\|_2$, we only need to inspect the case $x^* \neq \bar{x}$. In this case, we have $\bar{x}(k) = \gamma$. To see this by contradiction, assume $\bar{x}(k) > \gamma$. We then have² $\|x^* - w\| < \|\bar{x} - w\|$, and there exists a small $\epsilon > 0$ such that $y := (1 - \epsilon)\bar{x} + \epsilon x^*$ satisfies $y(k) \geq \gamma$, i.e., $y \in \Delta_k^\gamma$. This contradicts the optimality of $\bar{x}$, since

$$\|y - w\|^2 \leq (1 - \epsilon)\|\bar{x} - w\|^2 + \epsilon\|x^* - w\|^2 < \|\bar{x} - w\|^2.$$

As a consequence, if $x^* \neq \bar{x}$, we can set $w(k) = \gamma$ and call the proj_simplex function on the first $k-1$ coordinates with the leftover mass $1 - \gamma$. ∎

²This is because the projection objective $\|x - w\|^2$ is strongly convex, and hence the optimum is unique.
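For reference, here is a direct numpy transcription of Algorithm 3 (a sketch consistent with the pseudo-code above, not the authors' released implementation); `project_restricted_simplex` is the helper assumed in the VRM sketch of Section 4:

```python
import numpy as np

def proj_simplex(w, z):
    """Euclidean projection of w onto {x : sum(x) = z, x >= 0} (Duchi et al., 2008)."""
    u = np.sort(w)[::-1]                       # sort decreasingly
    cssv = np.cumsum(u)
    js = np.arange(1, len(w) + 1)
    rho = js[u - (cssv - z) / js > 0][-1]      # largest j with a positive gap
    lam = (cssv[rho - 1] - z) / rho
    return np.maximum(w - lam, 0.0)

def project_restricted_simplex(w, gamma):
    """Two-stage projection onto {x in Delta_k : x(k) >= gamma} (Algorithm 3)."""
    x = proj_simplex(w, 1.0)
    if x[-1] < gamma:                          # second stage, justified by Lemma 6
        x[-1] = gamma
        x[:-1] = proj_simplex(x[:-1], 1.0 - gamma)
    return x

w = np.array([0.7, 0.5, -0.2, 0.0])
print(project_restricted_simplex(w, gamma=0.1))   # [0.55, 0.35, 0.0, 0.1]
```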
Thus, we have reduced the cost of one iteration of VRM to $O(k^2)$; we further investigate its efficiency in the experiments.

## 6. Experiments

In this section, we evaluate our method experimentally. The experiments are designed to illustrate the underlying principles of the algorithm, as well as the beneficial effects of variance reduction in various real-world domains. We emphasize that it is crucial to design good sampling distributions for the mixture, and that this is an application-specific task. The following experiments provide guidance for this design process, but deriving better sampling distributions is an open question for future work.

### 6.1. SVM on blobs

Consider the toy dataset consisting of $n = 10\,000$ datapoints arranged in 6 well-separated, balanced Gaussian blobs, illustrated on the left of Figure 2. Points belonging to the leftmost three blobs are assigned negative class labels, and points in the rightmost three are labelled as positive. In this setting, it is natural to propose $k = 6$ sampling distributions, one corresponding to each blob: a specific component assigns uniformly large probability to its associated points and uniformly small probability everywhere else. Notice that in this case $c = k$.

We run 5 epochs of online gradient descent for SVM with step size $0.01/\sqrt{t}$ at iteration $t$. At each iteration, the sampler receives as feedback the norm of the gradient of the hinge loss. This way, VRM is expected to propose critical points (those producing loss gradients of high norm) more frequently, i.e., to sample the two middle blobs often, since they contain the support vectors. This intuition is confirmed in the middle plot of Figure 2, where the color intensity of each point represents its blob's mixture weight obtained by VRM at the end of training. As a result, VRM reaches a given level of accuracy faster than uniform sampling, due to discovering the support vectors earlier.

Figure 2. Left: toy dataset consisting of 6 blobs; green indicates positive labels. Middle: illustration of mixture weights after 10 000 iterations, where high transparency corresponds to low weight. Due to large mixture weights, points from the two middle blobs are sampled more often, leading to faster discovery of support vectors. Right: mean squared error achieved by the samplers on the regression task; VRM with k-DPPs provides a 1.4x speedup over uniform sampling in terms of iterations.
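Such per-group components could be built as in the following sketch (our illustration; the 0.9/0.1 mass split is an assumption for concreteness, as the paper does not state the exact probabilities):

```python
import numpy as np

def blob_components(blob_ids, high_mass=0.9):
    """One sampling distribution per blob: component j places a total mass of
    high_mass uniformly on blob j and spreads the rest over the other points."""
    k, n = blob_ids.max() + 1, len(blob_ids)
    P = np.empty((k, n))
    for j in range(k):
        inside = blob_ids == j
        P[j, inside] = high_mass / inside.sum()
        P[j, ~inside] = (1.0 - high_mass) / (~inside).sum()
    return P  # rows sum to 1; for balanced blobs, max entry ~ k/n, i.e. c ~ k

blob_ids = np.repeat(np.arange(6), 5)          # toy example: 6 blobs, 5 points each
print(blob_components(blob_ids).sum(axis=1))   # each row sums to 1
```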
### 6.2. k-DPPs

The following experiment illustrates that our method can handle distributions over sets of points. A k-determinantal point process (k-DPP) (Kulesza et al., 2012) over a discrete ground set is a distribution over all of its subsets of size k. Being members of the family of repulsive point processes, their diversity-inducing effect has recently been used by Zhang et al. (2017) for sampling minibatches in stochastic optimization. In this experiment, we take a similar path and investigate variance reduction in linear regression with batches sampled from a mixture of k-DPPs. This is rendered possible by our theoretical results, which show that the regret is independent of the number of sampled units, which is $\binom{n}{k}$ sets in this case.

We solve linear regression on a synthetic dataset of size $n = 1\,000$ and dimension $d = 10$, generated as follows: the features are drawn from a multivariate normal distribution with random means and variances for each dimension. In order to change the relative importance of the samples, the features of 10 randomly selected points are scaled up by a factor of 10. The dependent variables $Y$ are generated by $Y = X w_0 + \epsilon$, where $X$ is the feature matrix, $w_0$ is a vector drawn from a normal distribution with mean 0 and variance 25, and $\epsilon$ is standard normal noise. The optimization is performed with minibatch SGD with step size $10^{-4}/\sqrt{t}$ in round $t$, over 100 epochs with a batch size of 5. The feedback to the samplers is the norm of the gradient of the mean squared error. Our mixture consists of three k-DPPs with regularized linear kernels $L = X X^\top + \lambda I$, where $\lambda \in \{1, 10, 100\}$. We introduce a small bias by applying soft truncation to the importance weights: $r' = 0.8\, r + 0.2$. The results of 10 runs of the optimization process with different random seeds are shown on the right of Figure 2: VRM significantly outperforms uniform sampling in terms of the number of iterations needed to reach a given error level. However, since we use exact k-DPP sampling, the computational overhead outweighs the practical benefits of our method in this setting.³

³Efficient k-DPP samplers are available, e.g., Li et al. (2016); we leave the investigation of time-performance trade-offs with these samplers for future work.

### 6.3. Prioritized Experience Replay

In this experiment, our goal is to identify good hyperparameters for prioritized experience replay (Schaul et al., 2016) with Deep Q-Learning (DQN) (Mnih et al., 2015) on the Cartpole environment of the Gym (Brockman et al., 2016). Prioritized experience replay is an importance sampling scheme that samples observations from the replay buffer approximately proportionally to their last associated temporal difference (TD) error. The sampling probability of a point $j$ in the buffer is $p(j) \propto (|\delta_j| + \epsilon)^\alpha$, where $\delta_j$ is the last observed TD-error associated with experience $j$, and $\epsilon$ and $\alpha$ are hyperparameters for smoothing the probabilities. With appropriately chosen hyperparameters, prioritized experience replay can significantly improve the performance of DQN.

In this experiment, we show how VRM allows for automatic hyperparameter selection in a single run without performance loss. We generate 9 mixture components of prioritized experience replays with all parameter combinations of $\epsilon \in \{0.01, 0.1, 1\}$ and $\alpha \in \{0.1, 0.5, 0.9\}$. The feedback to VRM is the TD-error incurred by the sampled experiences. During the optimization process, the prioritized replay buffers are also updated: new observations are inserted and the TD-errors are refreshed. This is a deviation from our presentation, where we relied on fixed sampling distributions; however, our framework easily extends to sampling distributions that change over time, i.e., the probability of sampling point $i$ in round $t$ is $w_t^\top p_t(i)$, where we allow $p_t$ to depend on $t$.

The results of 50 runs with different random seeds over 200 episodes are presented in Figure 4, where the shaded areas represent 68% confidence intervals. VRM successfully identifies the mixture component corresponding to the best hyperparameter setting in the early stages and assigns the largest mixture weight to it. As a consequence, VRM performs hyperparameter selection in a single run without loss of performance compared to the best setting.

Figure 4. Evolution of rewards over 200 episodes of the different experience replay samplers on Cartpole, over 50 runs with different random seeds. VRM identifies the mixture component corresponding to the best hyperparameter setting in early stages and assigns a large mixture weight to it.
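The nine replay components above can be formed directly from the buffer's TD-errors, as in this sketch (our illustration with hypothetical names; the actual experiments additionally refresh the buffers online):

```python
import itertools
import numpy as np

def per_components(td_errors, eps_grid=(0.01, 0.1, 1.0),
                   alpha_grid=(0.1, 0.5, 0.9)):
    """One prioritized-replay distribution per (eps, alpha) pair:
    p(j) proportional to (|delta_j| + eps) ** alpha."""
    comps = []
    for eps, alpha in itertools.product(eps_grid, alpha_grid):
        prio = (np.abs(td_errors) + eps) ** alpha
        comps.append(prio / prio.sum())
    return np.stack(comps)              # shape (9, buffer_size)

td_errors = np.random.default_rng(2).normal(size=1000)
print(per_components(td_errors).shape)  # (9, 1000)
```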
### 6.4. k-means

Next, we investigate the gains of our sampler for minibatch k-means (Sculley, 2010). We reproduce the experimental setup of Borsos et al. (2018), as well as their proposed algorithm, the Variance Reducer Bandit (VRB). We compare VRM to uniform sampling and to VRB; the parameters of VRB were chosen as indicated by its authors. For both VRM and VRB, the points in the batch are sampled independently according to the samplers, and the feedback, the norm of the minibatch k-means loss gradient, is given in a delayed fashion, once per batch.

It remains to specify how to construct our mixture sampler. We use a mixture with 10 components. Inspired by VRB, we choose each component's sampling distribution proportional to the square root of the distances to a randomly chosen center, with small uniform smoothing. More formally, for each component $j$, we define its sampling distribution as

$$p_j(i) = 0.9\, \frac{\sqrt{d^2(x_i, \mu_j)}}{\sum_{k=1}^{n} \sqrt{d^2(x_k, \mu_j)}} + 0.1\, \frac{1}{n},$$

where $\mu_j$ is the randomly chosen center for component $j$. We note that this design of sampling distributions leads to low values of $c$, as presented in Table 1.

We use a batch size of $b = 100$ and $k = 100$ clusters, and initialize the centers via k-means++ (Arthur & Vassilvitskii, 2007), with shared initialization across all methods. We generate 10 different sets of initial centers and run each method 10 times on each set. We train the algorithms on 80% of the data. For the mixture sampler, we perform an additional 80%-20% split of the training data in order to choose the hyperparameters $\beta$ and $\gamma$. We report the loss on the test sets of the datasets presented in Table 1 (KDD Cup 2004; Faulkner et al., 2011; LeCun et al., 1998), with more details in the supplementary material.

Table 1. Dataset details.

|                 | KDD     | CSN    | MNIST  |
|-----------------|---------|--------|--------|
| nr. of points   | 145 751 | 80 000 | 70 000 |
| nr. of features | 74      | 17     | 10     |
| c               | 48.18   | 42.42  | 3.09   |

We are ultimately interested in the trade-off between performance and computational time; thus, for the samplers, we include the setup and sampling time in the time measurement. The results are shown in Figure 3, where we measure the relative error of minibatch k-means combined with the different samplers, with batch k-means as the reference; the shaded areas represent 95% confidence intervals. VRM initially suffers from a high setup time due to the calculation of the proposed sampling distributions of the mixture, but eventually outperforms the other methods. Similarly to Borsos et al. (2018), we observe no advantage on MNIST, where the best-in-hindsight mixture weights are uniform.

Figure 3. k-means loss evolution on the test set. VRM suffers from a larger setup time due to the cost of initializing the mixture components, but eventually outperforms the other methods in terms of relative error, where the reference is batch k-means.
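The distance-based component construction above could be sketched as follows (assuming Euclidean distances; the helper name is ours):

```python
import numpy as np

def kmeans_components(X, n_components=10, rng=None):
    """Mixture components for minibatch k-means: component j samples
    proportionally to sqrt(d^2(x_i, mu_j)), smoothed with a uniform term."""
    rng = rng or np.random.default_rng()
    n = X.shape[0]
    centers = X[rng.choice(n, size=n_components, replace=False)]
    P = np.empty((n_components, n))
    for j, mu in enumerate(centers):
        dist = np.linalg.norm(X - mu, axis=1)   # sqrt(d^2(x_i, mu_j))
        P[j] = 0.9 * dist / dist.sum() + 0.1 / n
    return P

X = np.random.default_rng(3).normal(size=(1000, 10))
P = kmeans_components(X)
print(P.shape, float(P.max() * X.shape[0]))     # n * max p_j(i) approximates c
```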
## 7. Conclusion

We proposed a novel framework for online variance reduction with mixtures, in which structure in the data can easily be captured by formulating fixed sampling distributions as mixture components. We devised VRM, a novel importance sampling method for this setting that relies on the Online Newton Step algorithm, and we showed that it asymptotically recovers the optimal mixture weights in hindsight. After several considerations for improving efficiency, including a novel projection step onto the restricted simplex, we empirically demonstrated the versatility of VRM in a range of applications.

## Acknowledgements

This research was supported by the SNSF grant 407540_167212 through the NRP 75 Big Data program and by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No 815943. K.Y.L. is supported by the ETH Zurich Postdoctoral Fellowship and the Marie Curie Actions for People COFUND program.

## References

Allen-Zhu, Z., Qu, Z., Richtárik, P., and Yuan, Y. Even faster accelerated coordinate descent using non-uniform sampling. In International Conference on Machine Learning, pp. 1110-1119, 2016.

Arthur, D. and Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027-1035. Society for Industrial and Applied Mathematics, 2007.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.

Borsos, Z., Krause, A., and Levy, K. Y. Online variance reduction for stochastic optimization. In Proceedings of the 31st Conference on Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 324-357. PMLR, 2018.

Bouchard, G., Trouillon, T., Perez, J., and Gaidon, A. Online learning to sample. arXiv preprint arXiv:1506.09016, 2015.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646-1654, 2014.

Duchi, J., Shalev-Shwartz, S., Singer, Y., and Chandra, T. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pp. 272-279. ACM, 2008.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.

Faulkner, M., Olson, M., Chandy, R., Krause, J., Chandy, K. M., and Krause, A. The next big one: Detecting earthquakes and other rare events from community-based sensors. In Information Processing in Sensor Networks (IPSN), 2011 10th International Conference on, pp. 13-24. IEEE, 2011.

Freedman, D. A. On tail probabilities for martingales. The Annals of Probability, pp. 100-118, 1975.

Gafni, E. M. and Bertsekas, D. P. Two-metric projection methods for constrained optimization. SIAM Journal on Control and Optimization, 22(6):936-964, 1984.

Hazan, E., Kalai, A., Kale, S., and Agarwal, A. Logarithmic regret algorithms for online convex optimization. In Lecture Notes in Computer Science, volume 4005, pp. 499-513. Springer-Verlag Berlin Heidelberg, 2006.

Hazan, E. et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157-325, 2016.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315-323, 2013.

Johnson, T. B. and Guestrin, C. Training deep models faster with robust, approximate importance sampling. In Advances in Neural Information Processing Systems 31, pp. 7276-7286. Curran Associates, Inc., 2018.
Katharopoulos, A. and Fleuret, F. Not all samples are created equal: Deep learning with importance sampling. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2525-2534. PMLR, 2018.

KDD Cup 2004. Protein Homology Dataset. http://osmot.cs.cornell.edu/kddcup/, 2004. Accessed: 10.11.2016.

Kulesza, A., Taskar, B., et al. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3):123-286, 2012.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Li, C., Jegelka, S., and Sra, S. Efficient sampling for k-determinantal point processes. In Artificial Intelligence and Statistics, pp. 1328-1337, 2016.

Loshchilov, I. and Hutter, F. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343, 2015.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Namkoong, H., Sinha, A., Yadlowsky, S., and Duchi, J. C. Adaptive sampling probabilities for non-smooth optimization. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2574-2583. PMLR, 2017.

Needell, D., Ward, R., and Srebro, N. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems, pp. 1017-1025, 2014.

Perekrestenko, D., Cevher, V., and Jaggi, M. Faster coordinate descent via adaptive importance sampling. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54. PMLR, 2017.

Salehi, F., Celis, L. E., and Thiran, P. Stochastic optimization with bandit sampling. arXiv e-prints, August 2017.

Salehi, F., Thiran, P., and Celis, E. Coordinate descent with bandit sampling. In Advances in Neural Information Processing Systems 31, pp. 9267-9277. Curran Associates, Inc., 2018.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In International Conference on Learning Representations, 2016.

Sculley, D. Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, pp. 1177-1178. ACM, 2010.

Stich, S. U., Raj, A., and Jaggi, M. Safe adaptive importance sampling. In Advances in Neural Information Processing Systems 30, pp. 4384-4394. Curran Associates, Inc., 2017.

Zhang, C., Kjellstrom, H., and Mandt, S. Determinantal point processes for mini-batch diversification. In Conference on Uncertainty in Artificial Intelligence (UAI), 2017.

Zhao, P. and Zhang, T. Accelerating minibatch stochastic gradient descent using stratified sampling. arXiv preprint arXiv:1405.3080, 2014.
Zhao, P. and Zhang, T. Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1-9, 2015.