# Accelerating Stochastic Composition Optimization

Mengdi Wang, Ji Liu, and Ethan X. Fang

Princeton University, University of Rochester, Pennsylvania State University

mengdiw@princeton.edu, ji.liu.uwisc@gmail.com, xxf13@psu.edu

**Abstract.** Consider the stochastic composition optimization problem where the objective is a composition of two expected-value functions. We propose a new stochastic first-order method, namely the accelerated stochastic compositional proximal gradient (ASC-PG) method, which updates based on queries to the sampling oracle using two different timescales. The ASC-PG is the first proximal gradient method for the stochastic composition problem that can deal with a nonsmooth regularization penalty. We show that the ASC-PG exhibits faster convergence than the best known algorithms, and that it achieves the optimal sample-error complexity in several important special cases. We further demonstrate the application of ASC-PG to reinforcement learning and conduct numerical experiments.

## 1 Introduction

The popular stochastic gradient methods are well suited for minimizing expected-value objective functions or the sum of a large number of loss functions. Stochastic gradient methods find wide applications in estimation, online learning, and training of deep neural networks. Despite their popularity, they do not apply to the minimization of a nonlinear function involving expected values or a composition between two expected-value functions. In this paper, we consider the stochastic composition problem, given by

$$\min_{x \in \mathbb{R}^n} \; H(x) := \mathbb{E}_v\big[f_v\big(\mathbb{E}_w[g_w(x)]\big)\big] + R(x), \tag{1}$$

where $f_v$ and $g_w$ are the outer and inner functions indexed by the random variables $v$ and $w$, respectively, and $R$ is a convex penalty function that may be nonsmooth.

### 1.1 Motivating Examples

One motivating example is the mean-variance minimization problem

$$\min_{x} \; \mathbb{E}_{a,b}\big[h(x;a,b)\big] + \lambda\, \mathrm{Var}_{a,b}\big[h(x;a,b)\big],$$

where $h(x;a,b)$ is the loss of decision $x$ on the random data $(a,b)$ and $\lambda > 0$ is a regularization parameter. Its batch version takes the form

$$\min_x \; \frac{1}{N}\sum_{i=1}^{N} h(x;a_i,b_i) + \frac{\lambda}{N}\sum_{i=1}^{N}\Big( h(x;a_i,b_i) - \frac{1}{N}\sum_{j=1}^{N} h(x;a_j,b_j) \Big)^2.$$

Here the variance term is the composition of the mean square function and an expected loss function.

A second example is on-policy reinforcement learning, where evaluating the value of a fixed policy leads to a Bellman error minimization problem of the form

$$\min_x \; \big\| \mathbb{E}(A)\,x - \mathbb{E}(b) \big\|^2, \tag{2}$$

where $A$ and $b$ are a random matrix and a random vector generated by observed state-to-state transitions and rewards; see Section 4. Although the stochastic composition problem (1) has barely been studied, it actually finds a broad spectrum of emerging applications in estimation and machine learning (see Wang et al. [2016] for a list of applications). Fast optimization algorithms with theoretical guarantees will lead to new computation tools and online learning methods for a broader problem class, no longer limited to the expectation minimization problem.

### 1.2 Related Works and Contributions

Contrary to the expectation minimization problem, "unbiased" gradient samples are no longer available for the stochastic composition problem (1). The objective is nonlinear in the joint probability distribution of $(w, v)$, which substantially complicates the problem. In a recent work, Dentcheva et al. [2015] studied a special case of the stochastic composition problem, namely risk-averse optimization. A central limit theorem has been established, showing that the $K$-sample batch problem converges to the true problem at the rate of $O(1/\sqrt{K})$ in a proper sense. For the case where $R(x) = 0$, Wang et al. [2016] proposed and analyzed a class of stochastic compositional gradient/subgradient methods (SCGD). The SCGD involves two iterations on different timescales: one for estimating $x^*$ by a stochastic quasi-gradient iteration, the other for maintaining a running estimate of $g(x^*)$. Wang and Liu [2016] study the SCGD in the setting where samples are corrupted by Markov noise (instead of i.i.d. zero-mean noise). Both works establish almost sure convergence of the algorithm and several convergence rate results, which were the best-known convergence rates prior to the current paper.
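Returning to the mean-variance example of Section 1.1, the following sketch (ours, not from the paper) makes the composition structure concrete: the batch objective equals $f(\mathbb{E}_w[g_w(x)])$ with inner function $g_w(x) = \big(h(x;a,b),\, h(x;a,b)^2\big)$ and outer function $f(y_1,y_2) = y_1 + \lambda(y_2 - y_1^2)$. The squared loss $h$ and the synthetic data are hypothetical choices for illustration.

```python
# A minimal numerical sketch (ours, not from the paper): the mean-variance
# objective is a composition of two expected-value functions, with
# g_w(x) = (h, h^2) and f(y1, y2) = y1 + lam * (y2 - y1^2).
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 1000, 5, 0.1
a_data = rng.normal(size=(N, d))   # hypothetical features a_i
b_data = rng.normal(size=N)        # hypothetical targets b_i
x = rng.normal(size=d)             # a candidate decision vector

def h(x, a, b):
    """Per-sample loss h(x; a_i, b_i); a squared loss chosen for illustration."""
    return (a @ x - b) ** 2

# Batch mean-variance objective: (1/N) sum h_i + (lam/N) sum (h_i - mean)^2
losses = h(x, a_data, b_data)
batch_obj = losses.mean() + lam * ((losses - losses.mean()) ** 2).mean()

# The same objective written as f(E_w[g_w(x)])
g_mean = np.array([losses.mean(), (losses ** 2).mean()])   # inner expectation E_w[g_w(x)]
f = lambda y: y[0] + lam * (y[1] - y[0] ** 2)              # outer function
composed_obj = f(g_mean)

print(batch_obj, composed_obj)     # the two values coincide
assert np.isclose(batch_obj, composed_obj)
```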
The idea of using two-timescale quasi-gradients dates back to the earlier work of Ermoliev [1976]. The incremental treatment of the proximal gradient iteration has been studied extensively for the expectation minimization problem; see for example Beck and Teboulle [2009], Bertsekas [2011], Ghadimi and Lan [2015], Gurbuzbalaban et al. [2015], Nedić [2011], Nedić and Bertsekas [2001], Nemirovski et al. [2009], Rakhlin et al. [2012], Shamir and Zhang [2013], Wang and Bertsekas [2016], Wang et al. [2015]. However, except for Wang et al. [2016] and Wang and Liu [2016], all of these works focus on variants of the expectation minimization problem and do not apply to the stochastic composition problem (1).

Another work partially related to this paper is by Dai et al. [2016]. They consider a special case of problem (1) arising in kernel estimation, where they assume that all functions $f_v$ are convex and that their conjugate functions $f_v^*$ can be easily obtained or sampled. Under these additional assumptions, they essentially rewrite the problem as a saddle-point optimization involving functional variables.

In this paper, we propose a new accelerated stochastic compositional proximal gradient (ASC-PG) method that applies to the penalized problem (1), which is more general than the problem considered in Wang et al. [2016]. We use a coupled martingale stochastic analysis to show that ASC-PG achieves significantly better sample-error complexity in various cases. We also show that ASC-PG exhibits optimal sample-error complexity in two important special cases: the case where the outer function is linear and the case where the inner function is linear. Our contributions are summarized as follows:

1. We propose the first stochastic proximal-gradient method for the stochastic composition problem. This is the first algorithm that is able to address the nonsmooth regularization penalty $R(\cdot)$ without deteriorating the convergence rate.
2. We obtain a convergence rate $O(K^{-4/9})$ for smooth optimization problems that are not necessarily convex, where $K$ is the number of queries to the stochastic first-order oracle. This improves the best known convergence rate and provides a new benchmark for the stochastic composition problem.
3. We provide a comprehensive analysis and results that apply to various special cases. In particular, our results contain as special cases the known optimal rate results for the expectation minimization problem, i.e., $O(1/\sqrt{K})$ for general objectives and $O(1/K)$ for strongly convex objectives.
4. In the special case where the inner function $g(\cdot)$ is a linear mapping, we show that it is sufficient to use a single timescale to guarantee convergence. Our result achieves the non-improvable rate of convergence $O(1/K)$ for optimally strongly convex optimization and $O(1/\sqrt{K})$ for nonconvex smooth optimization. This implies that inner linearity does not add fundamental difficulty to the stochastic composition problem.
5. We show that the proposed method leads to a new on-policy reinforcement learning algorithm. The new learning algorithm achieves the optimal convergence rate $O(1/\sqrt{K})$ for solving Bellman equations (or $O(1/K)$ for solving the least-squares problem) based on $K$ observations of state-to-state transitions.

In comparison with Wang et al. [2016], our analysis is more succinct and leads to stronger results. To the best of our knowledge, Theorems 1 and 2 in this paper provide the best-known rates for the stochastic composition problem.
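As a concrete reference point for contribution 1, the proximal step used to handle a nonsmooth penalty such as $R(x) = \mu\|x\|_1$ is the standard soft-thresholding operator; the helper below is a generic sketch with our own naming, not code from the paper.

```python
# Standard proximal operator for the l1 penalty R(x) = mu * ||x||_1
# (soft-thresholding). The ASC-PG x-update (Section 2) applies such a prox
# after the stochastic compositional gradient step; this is a generic sketch.
import numpy as np

def prox_l1(v, thresh):
    """prox_{thresh * ||.||_1}(v): elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

# Example: shrink a vector with stepsize alpha and penalty weight mu
alpha, mu = 0.1, 0.5
print(prox_l1(np.array([1.2, -0.03, 0.4]), alpha * mu))
```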
**Paper Organization.** Section 2 states the sampling oracle and the accelerated stochastic compositional proximal gradient algorithm (ASC-PG). Section 3 states the convergence rate results for general nonconvex objectives and for strongly convex objectives, respectively. Section 4 describes an application of ASC-PG to reinforcement learning and gives numerical experiments.

**Notations and Definitions.** For $x \in \mathbb{R}^n$, we denote by $\|x\|$ its Euclidean norm. For two sequences $\{y_k\}$ and $\{z_k\}$, we write $y_k = O(z_k)$ if there exists a constant $c > 0$ such that $\|y_k\| \le c\|z_k\|$ for each $k$. We denote by $I_{\text{value}}\{\text{condition}\}$ the indicator function, which returns "value" if the condition is satisfied, and 0 otherwise. We denote by $H^*$ the optimal objective function value of problem (1), by $X^*$ the set of optimal solutions, and by $P_S(x)$ the Euclidean projection of $x$ onto $S$ for any convex set $S$. For short, we also write $f(y) = \mathbb{E}_v[f_v(y)]$ and $g(x) = \mathbb{E}_w[g_w(x)]$.

## 2 Algorithm

We focus on the black-box sampling environment. Suppose that we have access to a stochastic first-order oracle, which returns random realizations of first-order information upon queries. This is a typical simulation oracle that is available in both online and batch learning. More specifically, we assume that we are given a Sampling Oracle (SO) such that

- given some $x \in \mathbb{R}^n$, the SO returns a random function value $g_w(x)$ and a noisy gradient sample $\nabla g_w(x)$; and
- given some $y \in \mathbb{R}^m$, the SO returns a noisy gradient sample $\nabla f_v(y)$.

The ASC-PG method (Algorithm 1) maintains a main iterate $x_k$, an extrapolated auxiliary iterate $z_k$, and a running estimate $y_k$ of $g(x_k)$. At iteration $k$, given stepsize sequences $\{\alpha_k\}$ and $\{\beta_k\}$, it queries the SO and updates

$$x_{k+1} = \operatorname{prox}_{\alpha_k R(\cdot)}\Big( x_k - \alpha_k\, \nabla g_{w_k}(x_k)^\top \nabla f_{v_k}(y_k) \Big),$$

$$z_{k+1} = \Big(1 - \tfrac{1}{\beta_k}\Big) x_k + \tfrac{1}{\beta_k}\, x_{k+1}, \qquad y_{k+1} = (1-\beta_k)\, y_k + \beta_k\, g_{w_{k+1}}(z_{k+1}).$$

The $(y,z)$-step, i.e., the extrapolation-smoothing step, first extrapolates from $x_k$ and $x_{k+1}$ to obtain the query point $z_{k+1}$ and then averages the fresh sample $g_{w_{k+1}}(z_{k+1})$ into the running estimate $y_{k+1}$. To explain the intuition, define for $0 \le t \le k$ the weights

$$\zeta_t^{(k)} = \begin{cases} \beta_t \prod_{i=t+1}^{k}(1-\beta_i), & \text{if } t < k, \\ \beta_k, & \text{if } t = k. \end{cases} \tag{3}$$

Then we can verify the following important relations:

$$x_{k+1} = \sum_{t=0}^{k} \zeta_t^{(k)} z_{t+1}, \qquad y_{k+1} = \sum_{t=0}^{k} \zeta_t^{(k)} g_{w_{t+1}}(z_{t+1}),$$

which essentially say that $x_{k+1}$ is a damped weighted average of $\{z_{t+1}\}_{t=0}^{k}$ and $y_{k+1}$ is a damped weighted average of $\{g_{w_{t+1}}(z_{t+1})\}_{t=0}^{k}$.

**An Analytical Example of the Extrapolation-Smoothing Scheme.** Now consider the special case where $g_w(\cdot)$ is always a linear mapping $g_w(z) = A_w z + b_w$ and $\beta_k = 1/(k+1)$. We can verify that $\zeta_t^{(k)} = 1/(k+1)$ for all $t$. Then we have

$$g(x_{k+1}) = \frac{1}{k+1}\sum_{t=0}^{k} \mathbb{E}[A_w]\, z_{t+1} + \mathbb{E}[b_w], \qquad y_{k+1} = \frac{1}{k+1}\sum_{t=0}^{k} A_{w_{t+1}} z_{t+1} + \frac{1}{k+1}\sum_{t=0}^{k} b_{w_{t+1}}.$$

In this way, we can see that the scaled error

$$(k+1)\big(y_{k+1} - g(x_{k+1})\big) = \sum_{t=0}^{k}\Big( \big(A_{w_{t+1}} - \mathbb{E}[A_w]\big) z_{t+1} + \big(b_{w_{t+1}} - \mathbb{E}[b_w]\big) \Big)$$

is a zero-mean and zero-drift martingale. Under additional technical assumptions, we have $\mathbb{E}\big[\|y_{k+1} - g(x_{k+1})\|^2\big] \le O(1/k)$. Note that the zero-drift property of the error martingale is the key to the fast convergence rate. The zero-drift property comes from the near-unbiasedness of $y_k$, which is due to the special construction of the extrapolation-smoothing scheme. In the more general case where $g_w$ is not necessarily linear, a similar argument shows that $y_k$ is a nearly unbiased estimate of $g(x_k)$. As a result, the extrapolation-smoothing $(y,z)$-step ensures that $y_k$ tracks the unknown quantity $g(x_k)$ efficiently.
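The following is a minimal sketch of the ASC-PG iteration as written above, assuming the SO is exposed through two callables and the penalty through its proximal map; the function names, signatures, and stepsize defaults are our assumptions rather than the paper's code.

```python
# Minimal sketch of the ASC-PG iteration as reconstructed above; the oracle
# callables and their signatures are our assumptions, not the paper's code.
import numpy as np

def asc_pg(x0, y0, sample_g, sample_grad_f, prox_R, K, a=5/9, b=4/9):
    """Run K ASC-PG iterations with stepsizes alpha_k = k^{-a}, beta_k = 2*k^{-b}.

    sample_g(x)      -> (g_w(x), Jacobian of g_w at x); a fresh random sample per call
    sample_grad_f(y) -> gradient of f_v at y; a fresh random sample per call
    prox_R(v, t)     -> prox_{t R(.)}(v); the identity map if R = 0
    """
    x = np.asarray(x0, dtype=float).copy()
    y = np.asarray(y0, dtype=float).copy()
    for k in range(1, K + 1):
        alpha, beta = k ** (-a), 2.0 * k ** (-b)           # beta_k may exceed 1 for small k
        _, jac_g = sample_g(x)                             # sample the inner Jacobian at x_k
        grad = jac_g.T @ sample_grad_f(y)                  # stochastic compositional gradient estimate
        x_new = prox_R(x - alpha * grad, alpha)            # proximal gradient (x-)step
        z = (1.0 - 1.0 / beta) * x + (1.0 / beta) * x_new  # extrapolation (z-)step
        g_z, _ = sample_g(z)                               # fresh inner-function sample at z_{k+1}
        y = (1.0 - beta) * y + beta * g_z                  # smoothing (y-)step: running estimate of g
        x = x_new
    return x
```

With the Theorem 1 choice $a = 5/9$ and $b = 4/9$, the proximal stepsize $\alpha_k$ decays faster than the smoothing weight $\beta_k$; this separation is the two-timescale behavior referred to in the abstract.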
## 3 Main Results

We present our main theoretical results in this section. Let us begin by stating our assumptions. Note that all assumptions involving random realizations of $v, w$ hold with probability 1.

**Assumption 1.** The samples generated by the SO are unbiased in the following sense:

1. $\mathbb{E}_{\{w_k, v_k\}}\big( \nabla g_{w_k}^\top(x)\, \nabla f_{v_k}(y) \big) = \nabla g^\top(x)\, \nabla f(y)$ for all $k = 1, 2, \ldots, K$ and all $x, y$.
2. $\mathbb{E}_{w_k}\big( g_{w_k}(x) \big) = g(x)$ for all $x$.

Note that $w_k$ and $v_k$ are not necessarily independent.

**Assumption 2.** The sample gradients and values generated by the SO satisfy $\mathbb{E}_w\big( \|g_w(x) - g(x)\|^2 \big) \le \sigma^2$ for all $x$.

**Assumption 3.** The sample gradients generated by the SO are uniformly bounded, and the penalty function $R$ has bounded subgradients:

$$\|\nabla f_v(x)\| \le O(1), \quad \|\nabla g_w(x)\| \le O(1), \quad \|\partial R(x)\| \le O(1) \qquad \text{for all } x, w, v.$$

**Assumption 4.** There exist $L_F, L_f, L_g > 0$ such that

1. $F(z) - F(x) \le \langle \nabla F(x), z - x \rangle + \frac{L_F}{2}\|z - x\|^2$ for all $x, z$.
2. $\|\nabla f_v(y) - \nabla f_v(w)\| \le L_f \|y - w\|$ for all $y, w, v$.
3. $\|g(x) - g(z) - \nabla g(z)^\top (x - z)\| \le \frac{L_g}{2}\|x - z\|^2$ for all $x, z$.

Our first main result concerns general optimization problems that are not necessarily convex.

**Theorem 1 (Smooth (Nonconvex) Optimization).** Let Assumptions 1, 2, 3, and 4 hold. Denote $F(x) := (\mathbb{E}_v f_v \circ \mathbb{E}_w g_w)(x)$ for short, suppose that $R(x) = 0$ in (1), and suppose that $\mathbb{E}(F(x_k))$ is bounded from above. Choose $\alpha_k = k^{-a}$ and $\beta_k = 2k^{-b}$, where $a \in (0,1)$ and $b \in (0,1)$, in Algorithm 1. Then we have

$$\frac{\sum_{k=1}^{K} \mathbb{E}\big(\|\nabla F(x_k)\|^2\big)}{K} \le O\!\left( K^{a-1} + L_f^2 L_g^2\, K^{4b-4a}\, I_{\log K}\{4a-4b=1\} + L_f^2 K^{-b} + K^{-a} \right). \tag{4}$$

If $L_g \ne 0$ and $L_f \ne 0$, choose $a = 5/9$ and $b = 4/9$, yielding

$$\frac{\sum_{k=1}^{K} \mathbb{E}\big(\|\nabla F(x_k)\|^2\big)}{K} \le O\big(K^{-4/9}\big). \tag{5}$$

If $L_g = 0$ or $L_f = 0$, choose $a = b = 1/2$, yielding

$$\frac{\sum_{k=1}^{K} \mathbb{E}\big(\|\nabla F(x_k)\|^2\big)}{K} \le O\big(K^{-1/2}\big). \tag{6}$$

The result of Theorem 1 strictly improves the best-known results given by Wang et al. [2016]. First, the result (5) improves the finite-sample error bound from $O(K^{-2/7})$ to $O(K^{-4/9})$ for general convex and nonconvex optimization. This improves the best known convergence rate and provides a new benchmark for the stochastic composition problem. Note that it is possible to relax the condition "$\mathbb{E}(F(x_k))$ is bounded from above" in Theorem 1. However, doing so would make the analysis more cumbersome and would add a $\log K$ factor to the error bound.

Our second main result concerns strongly convex objective functions. We say that the objective function $H$ is optimally strongly convex with parameter $\lambda > 0$ if

$$H(x) - H\big(P_{X^*}(x)\big) \ge \lambda \big\| x - P_{X^*}(x) \big\|^2 \qquad \text{for all } x \tag{7}$$

(see Liu and Wright [2015]). Note that any strongly convex function is optimally strongly convex, but the reverse does not hold. For example, the objective function (2) in on-policy reinforcement learning is always optimally strongly convex (even if $\mathbb{E}(A)$ is a rank-deficient matrix), but it is not necessarily strongly convex.

**Theorem 2 (Strongly Convex Optimization).** Suppose that the objective function $H(x)$ in (1) is optimally strongly convex with parameter $\lambda > 0$ as defined in (7). Set $\alpha_k = C_a k^{-a}$ and $\beta_k = C_b k^{-b}$, where $C_a > 4\lambda^{-1}$, $C_b > 2$, $a \in (0,1]$, and $b \in (0,1]$, in Algorithm 1. Under Assumptions 1, 2, 3, and 4, we have

$$\mathbb{E}\big(\|x_K - P_{X^*}(x_K)\|^2\big) \le O\!\left( K^{-a} + L_f^2 L_g^2\, K^{-4a+4b} + L_f^2 K^{-b} \right). \tag{8}$$

If $L_g \ne 0$ and $L_f \ne 0$, choose $a = 1$ and $b = 4/5$, yielding

$$\mathbb{E}\big(\|x_K - P_{X^*}(x_K)\|^2\big) \le O\big(K^{-4/5}\big). \tag{9}$$

If $L_g = 0$ or $L_f = 0$, choose $a = 1$ and $b = 1$, yielding

$$\mathbb{E}\big(\|x_K - P_{X^*}(x_K)\|^2\big) \le O\big(K^{-1}\big). \tag{10}$$

Let us discuss the results of Theorem 2. In the general case where $L_f \ne 0$ and $L_g \ne 0$, the convergence rate in (9) is consistent with the result of Wang et al. [2016]. Now consider the special case where $L_g = 0$, i.e., the inner mapping is linear. This result finds an immediate application to the Bellman error minimization problem (2) arising in reinforcement learning (without and with $\ell_1$-norm regularization). The proposed ASC-PG algorithm is able to achieve the optimal rate $O(1/K)$ without any extra assumption on the outer function $f_v$. To the best of our knowledge, this is the best (and optimal) sample-error complexity for on-policy reinforcement learning.

**Remarks.** Theorems 1 and 2 have important implications for the special cases where $L_f = 0$ or $L_g = 0$. In these cases, we argue that our convergence rate (10) is "optimal" with respect to the sample size $K$. To see this, it is worth pointing out that the $O(1/K)$ rate of convergence is optimal for the strongly convex expectation minimization problem. Because the expectation minimization problem is a special case of problem (1), the $O(1/K)$ convergence rate must be optimal for the stochastic composition problem too.

Consider the case where $L_f = 0$, which means that the outer function $f_v(\cdot)$ is linear with probability 1. Then the stochastic composition problem (1) reduces to an expectation minimization problem, since

$$(\mathbb{E}_v f_v \circ \mathbb{E}_w g_w)(x) = \mathbb{E}_v\big( f_v(\mathbb{E}_w g_w(x)) \big) = \mathbb{E}_v \mathbb{E}_w (f_v \circ g_w)(x).$$

Therefore, it makes perfect sense to obtain the optimal convergence rate.
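Stepping back to the stepsize choices in Theorems 1 and 2, the following snippet is a quick sanity check (ours, assuming the $L_f^2 L_g^2$ term in (4) is active whenever $L_f, L_g \neq 0$) showing why $(a,b) = (5/9, 4/9)$ balances the slowest-decaying terms of the bound at $K^{-4/9}$, while $a = b = 1/2$ only helps when that term vanishes.

```python
# A small sketch (ours, not from the paper) of the exponent arithmetic behind
# Theorem 1: bound (4) decays like K^max(a-1, 4b-4a, -b, -a), and
# (a, b) = (5/9, 4/9) equalizes the three dominant exponents at -4/9.
from fractions import Fraction as F

def dominant_exponent(a, b):
    """Largest (slowest-decaying) exponent among the terms of bound (4)."""
    return max(a - 1, 4 * b - 4 * a, -b, -a)

print(dominant_exponent(F(5, 9), F(4, 9)))   # -4/9, the rate in (5)
print(dominant_exponent(F(1, 2), F(1, 2)))   # 0: unless L_g = 0 or L_f = 0,
                                             # the bound does not decay for a = b = 1/2
```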
Consider the case where $L_g = 0$, which means that the inner function $g(\cdot)$ is a linear mapping. The result here is quite surprising. Note that even if $g(\cdot)$ is a linear mapping, problem (1) does not reduce to an expectation minimization problem. Nevertheless, ASC-PG still achieves the optimal convergence rate. This suggests that, when inner linearity holds, the stochastic composition problem (1) is not fundamentally more difficult than the expectation minimization problem.

The convergence rate results unveiled in Theorems 1 and 2 are the best known results for the composition problem. We believe that they provide important new results and insights into the complexity of the stochastic composition problem.

## 4 Application to Reinforcement Learning

In this section, we apply the proposed ASC-PG algorithm to conduct policy value evaluation in reinforcement learning through attacking the Bellman equations. Suppose that there are $S$ states in total. Let the policy of interest be $\pi$. Denote the value function of the states under $\pi$ by $V^{\pi} \in \mathbb{R}^{S}$.
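To illustrate the problem structure used in this section, here is a toy sketch (ours, not the paper's experiment): for a small synthetic Markov chain under a fixed policy, the Bellman-residual objective has the composition form (1) with linear inner function $g(x) = (I - \gamma P^{\pi})x - r^{\pi}$ and quadratic outer function $f(y) = \|y\|^2$, so $L_g = 0$ and the $O(1/K)$ regime of Theorem 2 applies. The transition matrix, rewards, and discount factor below are synthetic stand-ins.

```python
# Toy sketch (ours) of the Bellman-residual objective for policy evaluation,
# illustrating why the inner function is linear (L_g = 0): with
# g(x) = (I - gamma * P_pi) x - r_pi and f(y) = ||y||^2, the objective
# ||E(A) x - E(b)||^2 has the composition form (1).
import numpy as np

rng = np.random.default_rng(1)
S, gamma = 6, 0.9
P_pi = rng.random((S, S))
P_pi /= P_pi.sum(axis=1, keepdims=True)   # row-stochastic transition matrix under pi
r_pi = rng.random(S)                      # expected one-step rewards under pi

A, b = np.eye(S) - gamma * P_pi, r_pi
bellman_residual = lambda x: np.linalg.norm(A @ x - b) ** 2   # f(g(x)) with linear inner g

V_true = np.linalg.solve(A, b)            # true value function V^pi
print(bellman_residual(V_true))           # ~0: V^pi minimizes the Bellman residual
```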