# A Composite Randomized Incremental Gradient Method

Junyu Zhang¹, Lin Xiao²

¹Department of Industrial and Systems Engineering, University of Minnesota, Minneapolis, Minnesota, USA. ²Microsoft Research, Redmond, Washington, USA. Correspondence to: Junyu Zhang, Lin Xiao. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

We consider the problem of minimizing the composition of a smooth function (which can be nonconvex) and a smooth vector mapping, where both can be expressed as the average of a large number of components. We propose a composite randomized incremental gradient method based on a SAGA type of construction. The gradient sample complexity of our method matches that of several recently developed methods based on SVRG in the general case. However, for structured problems where linear convergence rates can be obtained, our method can be much better for ill-conditioned problems. In addition, when the finite-sum structure appears only in the inner mapping, the sample complexity of our method is the same as that of SAGA for minimizing a finite sum of smooth nonconvex functions, despite the additional outer composition and the stochastic composite gradients being biased in our case.

## 1. Introduction

We consider composite optimization problems of the form

$$
\underset{x\in\mathbb{R}^d}{\text{minimize}}\quad f\Big(\frac{1}{n}\sum_{i=1}^n g_i(x)\Big) + r(x), \tag{1}
$$

where $f:\mathbb{R}^p\to\mathbb{R}$ is a smooth and possibly nonconvex function, each $g_i:\mathbb{R}^d\to\mathbb{R}^p$ is a smooth vector mapping for $i=1,\dots,n$, and $r:\mathbb{R}^d\to\mathbb{R}\cup\{\infty\}$ is a convex but possibly nonsmooth function. Problem (1) is a special case of the following more general problem with composite finite-sum (average) structure:

$$
\underset{x\in\mathbb{R}^d}{\text{minimize}}\quad \frac{1}{m}\sum_{j=1}^m f_j\Big(\frac{1}{n}\sum_{i=1}^n g_i(x)\Big) + r(x), \tag{2}
$$

where each $f_j:\mathbb{R}^p\to\mathbb{R}$ is smooth and can be nonconvex. Such problems often arise as finite-sample approximations of the stochastic composite optimization problem

$$
\underset{x\in\mathbb{R}^d}{\text{minimize}}\quad \mathbb{E}_\nu\Big[f_\nu\big(\mathbb{E}_\xi[g_\xi(x)]\big)\Big] + r(x), \tag{3}
$$

where $f_\nu$ and $g_\xi$ are parametrized by the random variables $\nu$ and $\xi$ of some unknown probability distributions. Many interesting tasks in reinforcement learning (e.g., Sutton & Barto, 1998) and risk-averse learning can be formulated as optimization problems with such composite structure. Algorithms for solving such problems have received lots of attention in recent years, for both the stochastic version (e.g., Wang et al., 2017a;b; Wang & Liu, 2016) and the finite-sum version (e.g., Huo et al., 2018; Lian et al., 2017; Lin et al., 2018; Liu et al., 2018).

In this paper, we focus on randomized algorithms for solving problem (1). While our algorithms and complexity results all extend to the general case (2), focusing on (1) greatly simplifies the presentation and also makes the main ideas and analysis much clearer. Nevertheless, our results for the general case are given in the supplementary materials. Another reason to focus on (1) is that most of the interesting applications we know of can be put into this form. A well-known example is the policy evaluation problem in reinforcement learning (RL). With linear value function approximation, it can be formulated as

$$
\underset{x\in\mathbb{R}^d}{\text{minimize}}\quad \big\|\mathbb{E}[A]\,x-\mathbb{E}[b]\big\|^2, \tag{4}
$$

where $A$ and $b$ are a random matrix and a random vector generated by the transition probability matrix and rewards of a Markov decision process (MDP) (e.g., Dann et al., 2014). A finite-sample version of (4) is in the form of (1) with $f(\cdot)=\|\cdot\|^2$.
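To make the notation concrete, here is a minimal NumPy sketch (with randomly generated, purely illustrative data) of such a finite-sample instance of (1): $g_i(x)=A_ix-b_i$, $f(u)=\|u\|^2$, with the full gradient computed by the chain rule.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 50, 8, 8
A = rng.standard_normal((n, p, d))       # per-sample matrices A_i
b = rng.standard_normal((n, p))          # per-sample vectors b_i

def g_bar(x):                            # (1/n) sum_i g_i(x) with g_i(x) = A_i x - b_i
    return (A @ x - b).mean(axis=0)

def F(x):                                # f(g_bar(x)) with f(u) = ||u||^2
    return g_bar(x) @ g_bar(x)

def grad_F(x):                           # chain rule: (g'(x))^T f'(g_bar(x))
    return 2.0 * A.mean(axis=0).T @ g_bar(x)
```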
A more interesting example is risk-averse optimization, which has many applications in RL and financial mathematics. We consider a general formulation of the mean-variance trade-off:

$$
\underset{x\in\mathbb{R}^d}{\text{maximize}}\quad \frac{1}{n}\sum_{j=1}^n h_j(x) \;-\; \frac{\lambda}{n}\sum_{i=1}^n\Big(h_i(x)-\frac{1}{n}\sum_{j=1}^n h_j(x)\Big)^2, \tag{5}
$$

where each $h_j:\mathbb{R}^d\to\mathbb{R}$ is a reward function. The goal of problem (5) is to maximize the average reward with a penalty on the variance. If we use the following mappings:

$$
g_i(x) = \begin{bmatrix} x \\ h_i(x) \end{bmatrix}:\ \mathbb{R}^d\to\mathbb{R}^{d+1}, \tag{6}
$$
$$
f_j(y,z) = h_j(y) - \lambda\big(h_j(y)-z\big)^2:\ \mathbb{R}^{d+1}\to\mathbb{R}, \tag{7}
$$

then problem (5) can be put into the form of (2) with $m=n$ and $r(x)\equiv 0$. However, note that

$$
\frac{1}{n}\sum_{i=1}^n\Big(h_i(x)-\frac{1}{n}\sum_{j=1}^n h_j(x)\Big)^2 \;=\; \frac{1}{n}\sum_{i=1}^n h_i^2(x) \;-\; \Big(\frac{1}{n}\sum_{i=1}^n h_i(x)\Big)^2.
$$

This allows us to reformulate problem (5) into the form of (1) by using the mappings

$$
g_i(x) = \begin{bmatrix} h_i(x) \\ h_i^2(x) \end{bmatrix}:\ \mathbb{R}^d\to\mathbb{R}^2, \tag{8}
$$
$$
f(y,z) = y + \lambda y^2 - \lambda z:\ \mathbb{R}^2\to\mathbb{R}. \tag{9}
$$

This leads to a simpler structure and a much lower intermediate dimension, i.e., $p=2$ instead of $p=d+1$ as in (6) and (7).

Besides these applications, structured problems such as (1) and (2) are of independent interest for research on randomized algorithms. Let us denote the Jacobian matrix of $g_i(x)$ by $g_i'(x)$, which is a matrix in $\mathbb{R}^{p\times d}$. Due to the composition of an expectation inside a nonlinear function, it is very hard to obtain an unbiased estimate of the gradient

$$
\nabla\Big[f\Big(\frac{1}{n}\sum_{i=1}^n g_i(x)\Big)\Big] \;=\; \Big(\frac{1}{n}\sum_{i=1}^n g_i'(x)\Big)^{T} f'\Big(\frac{1}{n}\sum_{i=1}^n g_i(x)\Big).
$$

Specifically, let $S\subset\{1,\dots,n\}$ be a random subset and

$$
\tilde g(x)=\frac{1}{|S|}\sum_{i\in S} g_i(x), \qquad \tilde g'(x)=\frac{1}{|S|}\sum_{i\in S} g_i'(x).
$$

Then $\tilde g'(x)^T f'(\tilde g(x))$ is always a biased estimate of the gradient, unless one is willing to calculate the full average $(1/n)\sum_{i=1}^n g_i(x)$. Such difficulties often arise in optimizing more sophisticated objective functions than the empirical risk (e.g., Chaudhari et al., 2016; Hazan et al., 2016; Gulcehre et al., 2016; Mobahi & Fisher, 2015). As one of the simplest such models, the analysis of randomized algorithms for (1) may provide insights into more general problems.
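To make the bias concrete, here is a small numerical sketch (with $f(u)=\|u\|^2$, linear $g_i$, and randomly generated, purely illustrative data): even after averaging the plug-in estimator $\tilde g'(x)^T f'(\tilde g(x))$ over many random subsets, a gap to the true gradient remains.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p, s = 200, 5, 3, 4
A = rng.standard_normal((n, p, d))                 # g_i(x) = A_i x, f(u) = ||u||^2
x = rng.standard_normal(d)

A_bar = A.mean(axis=0)
true_grad = 2.0 * A_bar.T @ (A_bar @ x)            # gradient of ||(1/n) sum_i A_i x||^2

est, trials = np.zeros(d), 20000
for _ in range(trials):
    S = rng.integers(0, n, size=s)                 # random subset (with replacement)
    A_S = A[S].mean(axis=0)                        # Jacobian estimate; here g_S(x) = A_S x
    est += 2.0 * A_S.T @ (A_S @ x)                 # plug-in estimate g'_S(x)^T f'(g_S(x))
est /= trials
print(np.linalg.norm(est - true_grad))             # gap stays bounded away from zero
```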
### 1.1. Related Work

Wang et al. (2017a) considered the problem of minimizing $F(x)=\mathbb{E}_\nu[f_\nu(\mathbb{E}_\xi[g_\xi(x)])]$, i.e., problem (3) with $r\equiv 0$. They derived algorithms that find $\epsilon$-optimal solutions¹ with sample complexities $O(\epsilon^{-4})$, $O(\epsilon^{-3.5})$ and $O(\epsilon^{-1.25})$ for the smooth nonconvex case, the smooth convex case and the smooth strongly convex case, respectively. Wang et al. (2017b) considered problem (3) with nontrivial $r(x)$ and obtained improved sample complexities of $O(\epsilon^{-2.25})$ for the smooth nonconvex case, and $O(\epsilon^{-2})$ and $O(\epsilon^{-1})$ for the general convex and strongly convex cases, respectively.

¹Here by $\epsilon$-optimal solution, we mean some $x\in\mathbb{R}^d$ that satisfies $\mathbb{E}[F(x)-F^*]<\epsilon$ in the convex case, where $F^*=\inf_x F(x)$, and $\mathbb{E}[\|\nabla F(x)\|^2]<\epsilon$ in the smooth nonconvex case.

For the finite-sum problem (2), several authors applied the variance reduction technique of SVRG (Johnson & Zhang, 2013; Xiao & Zhang, 2014) to obtain improved sample complexities. Lian et al. (2017) considered problem (2) when the objective function is strongly convex and $r(x)=0$. They derived two algorithms with sample complexities $O((m+n+\kappa^3)\log(1/\epsilon))$ and $O((m+n+\kappa^4)\log(1/\epsilon))$ respectively, where $\kappa$ is a suitably defined condition number. Huo et al. (2018) developed algorithms based on the SVRG scheme that attain an $O(m+n+(m+n)^{2/3}\epsilon^{-1})$ sample complexity in the smooth nonconvex case and an $O((m+n+\kappa^3)\log(1/\epsilon))$ sample complexity for strongly convex problems with nonsmooth $r$. This problem was also studied by Yu & Huang (2017), who proposed an ADMM-style algorithm with complexity $O((m+n+\kappa^4)\log(1/\epsilon))$ in the strongly convex case. More recently, Lin et al. (2018) and Liu et al. (2018) made further improvements over these complexities under various conditions.

### 1.2. Contributions and Outline

Most previous work on solving problems (1) and (2) is based on the variance reduction scheme of SVRG. In this paper, we propose a composite randomized incremental gradient method using the scheme of SAGA (Defazio et al., 2014). We show that it attains the same sample complexity as the methods based on SVRG in general, but for structured problems that allow linear convergence, we obtain better complexity bounds than previous methods based on SVRG.

In Section 2, we present the C-SAGA method for solving problem (1). In Section 3, we show that it has $O(n+n^{2/3}\epsilon^{-1})$ sample complexity if the composite part is smooth and $r$ is convex. In Section 4, we show that under additional conditions, i.e., the gradient-dominant or optimally strongly convex condition, an $O((n+\kappa n^{2/3})\log(1/\epsilon))$ sample complexity can be obtained. We provide numerical experiments in Section 5.

We emphasize that these complexities are the same as those obtained by SVRG-type methods for solving the problem

$$
\underset{x\in\mathbb{R}^d}{\text{minimize}}\quad \frac{1}{n}\sum_{i=1}^n g_i(x) + r(x), \tag{10}
$$

where each $g_i:\mathbb{R}^d\to\mathbb{R}$ is a scalar mapping (Allen-Zhu & Hazan, 2016; Reddi et al., 2016a;b; Lei et al., 2017). These results indicate that the additional smooth composition over finite-sum problems does not incur higher complexity.

Our algorithm and results extend to the general case (2), with sample complexity $O(m+n+(m+n)^{2/3}\epsilon^{-1})$ for smooth nonconvex problems and $O((m+n+\kappa(m+n)^{2/3})\log(1/\epsilon))$ under gradient-dominant or strongly convex conditions. Compared with the $O((m+n+\kappa^3)\log(1/\epsilon))$ complexity of Lian et al. (2017) and Huo et al. (2018), our result provides a (much) better bound as long as $\kappa > (m+n)^{1/3}$. The details for the general case are given in the supplementary materials.

## 2. The Composite SAGA Method

For ease of presentation, we define the following functions related to the objective in (1):

$$
g(x) \triangleq \frac{1}{n}\sum_{i=1}^n g_i(x), \qquad F(x)\triangleq f(g(x)), \tag{11}
$$
$$
\Phi(x)\triangleq F(x)+r(x) = f(g(x))+r(x). \tag{12}
$$

First we review the proximal gradient method for solving problems of the form

$$
\underset{x\in\mathbb{R}^d}{\text{minimize}}\quad \Phi(x)=F(x)+r(x), \tag{13}
$$

where $F$ is smooth and $r$ is convex and may be nonsmooth. The proximal operator of $r$ with parameter $\eta$ is defined as

$$
\mathrm{prox}_{\eta r}(x) := \arg\min_{y}\Big\{\, r(y) + \frac{1}{2\eta}\|y-x\|^2 \Big\}. \tag{14}
$$

We assume that the function $r$ is relatively simple, meaning that its proximal operator has a closed-form solution or can be computed efficiently. The proximal gradient method for solving problem (13) is

$$
x_{t+1} = \mathrm{prox}_{\eta r}\big(x_t-\eta F'(x_t)\big), \tag{15}
$$

where $F'(x_t)$ denotes the gradient of $F$ at $x_t$ and $\eta$ is an appropriate step size (e.g., Nesterov, 2013). For convenience, we define the proximal gradient mapping of $\Phi$ as

$$
\mathcal{G}(x) \triangleq \frac{1}{\eta}\Big(x - \mathrm{prox}_{\eta r}\big(x-\eta F'(x)\big)\Big). \tag{16}
$$

As a result, the proximal gradient method (15) can be written as $x_{t+1}=x_t-\eta\,\mathcal{G}(x_t)$. Notice that when $r(x)\equiv 0$, we have $\mathcal{G}(x)\equiv F'(x)$ for any $\eta>0$.

For problem (1) and any $\bar x$ generated by some randomized algorithm, we call $\bar x$ an $\epsilon$-stationary point in expectation if

$$
\mathbb{E}\big[\|\mathcal{G}(\bar x)\|^2\big] \le \epsilon. \tag{17}
$$

The aim of an efficient (randomized) algorithm is to find such a solution with low sample complexity of the individual mappings $g_i$ and their Jacobians $g_i'$ (the total number of times they need to be evaluated). For the batch proximal gradient method (15), the iteration complexity is $O(L_F/\epsilon)$ (e.g., Nesterov, 2004), where $L_F$ is the Lipschitz constant of $F'$. This translates into a sample complexity of $O(L_F n/\epsilon)$ evaluations of the components $g_i$ and $g_i'$. Our goal is to develop a randomized algorithm that has lower sample complexity.
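Before moving on, here is a minimal sketch of the proximal machinery above for the common choice $r(x)=\beta\|x\|_1$, whose proximal operator is the closed-form soft-thresholding map; the helper names are my own.

```python
import numpy as np

def prox_l1(v, eta, beta=1.0):
    # Closed-form solution of (14) for r(x) = beta * ||x||_1: soft-thresholding.
    return np.sign(v) * np.maximum(np.abs(v) - eta * beta, 0.0)

def prox_grad_step(x, grad_F_x, eta):
    # One iteration of the proximal gradient method (15).
    return prox_l1(x - eta * grad_F_x, eta)
```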
We propose to use a randomly selected subset of the mappings $g_i$ during each iteration to approximate the full gradient $F'(x)=(g'(x))^T f'(g(x))$. As we pointed out in the introduction, one can easily get an unbiased estimate of $g'(x)$ by $\frac{1}{|S|}\sum_{j\in S} g_j'(x)$ with $S$ being a uniformly sampled subset of $\{1,\dots,n\}$. However, an unbiased estimate of $F'(x)=(g'(x))^T f'(g(x))$ cannot be constructed without knowing $g(x)$. Therefore, we have to construct sufficiently accurate estimates of $g(x)$ and $g'(x)$ at the same time, and also deal with the bias in estimating $F'(x)$.

In Algorithm 1, we propose C-SAGA, a composite randomized incremental gradient method that employs the estimation scheme of SAGA (Defazio et al., 2014).

**Algorithm 1** Composite SAGA (C-SAGA)

1: **input:** initial point $x_0\in\mathbb{R}^d$, initial reference points $\alpha_i^0$ for $i=1,\dots,n$, and step size $\eta>0$
2: Initialize the average mapping and Jacobian: $Y_0=\frac{1}{n}\sum_{i=1}^n g_i(\alpha_i^0)$, $Z_0=\frac{1}{n}\sum_{i=1}^n g_i'(\alpha_i^0)$
3: **for** $t=0,\dots,T-1$ **do**
4: &nbsp;&nbsp;&nbsp; Sample with replacement a subset $S_t\subseteq\{1,\dots,n\}$ uniformly at random, with $|S_t|=s$.
5: &nbsp;&nbsp;&nbsp; Compute $g_j(x_t)$ and $g_j'(x_t)$ for all $j\in S_t$ and let
$$
y_t = Y_t + \frac{1}{s}\sum_{j\in S_t}\big(g_j(x_t)-g_j(\alpha_j^t)\big), \tag{18}
$$
$$
z_t = Z_t + \frac{1}{s}\sum_{j\in S_t}\big(g_j'(x_t)-g_j'(\alpha_j^t)\big). \tag{19}
$$
6: &nbsp;&nbsp;&nbsp; Let $\widetilde\nabla F(x_t)=z_t^T f'(y_t)$ and
$$
x_{t+1} = \mathrm{prox}_{\eta r}\big(x_t-\eta\,\widetilde\nabla F(x_t)\big). \tag{20}
$$
7: &nbsp;&nbsp;&nbsp; Update the reference points: $\alpha_j^{t+1}=x_t$ if $j\in S_t$, and $\alpha_j^{t+1}=\alpha_j^t$ otherwise.
8: &nbsp;&nbsp;&nbsp; Update the average mapping and Jacobian:
$$
Y_{t+1} = Y_t + \frac{1}{n}\sum_{j\in S_t}\big(g_j(x_t)-g_j(\alpha_j^t)\big), \tag{21}
$$
$$
Z_{t+1} = Z_t + \frac{1}{n}\sum_{j\in S_t}\big(g_j'(x_t)-g_j'(\alpha_j^t)\big). \tag{22}
$$
9: **end for**
10: **output:** Randomly choose $t'\in\{1,\dots,T\}$ and output $x_{t'}$

At each iteration $t$, we maintain the estimates of $g(x_t)$ and $g'(x_t)$ given by

$$
Y_t = \frac{1}{n}\sum_{i=1}^n g_i(\alpha_i^t) \qquad\text{and}\qquad Z_t=\frac{1}{n}\sum_{i=1}^n g_i'(\alpha_i^t), \tag{23}
$$

respectively, where the $\alpha_i^t$ are reference points associated with the $g_i$ at iteration $t$. However, we do not use them directly to estimate $F'(x_t)$: instead of using $Z_t^T f'(Y_t)$, we use $z_t^T f'(y_t)$ to estimate $F'(x_t)$, where $z_t$ and $y_t$ are constructed as in (19) and (18) respectively. We define the notation

$$
\widetilde\nabla F(x_t) \triangleq z_t^T f'(y_t), \tag{24}
$$

and use it to replace $F'(x_t)$ in the proximal gradient method (15), which results in the update formula (20). Then we use the standard SAGA scheme to update the reference points $\alpha_j^{t+1}$ as well as $Y_t$ and $Z_t$, as in (21) and (22).

If each $g_i$ is a scalar mapping, i.e., $g_i:\mathbb{R}^d\to\mathbb{R}$, and $f(\cdot)=\mathrm{id}(\cdot)$ is the scalar identity mapping, then problem (1) reduces to the standard finite-sum optimization problem (10). In this case, $\widetilde\nabla F(x_t)=z_t^T f'(y_t)=z_t^T$, hence the updates of $y_t$ and $Y_t$ are no longer necessary, and Algorithm 1 becomes the standard (mini-batch) SAGA algorithm.

We note that Reddi et al. (2016a;b) developed extensions of SAGA for solving the finite-sum problem (10) with smooth nonconvex functions $g_i$. However, their method requires two independent sets of samples of size $s$ during each iteration: one for computing the counterpart of $\widetilde\nabla F(x_t)=z_t^T$, and the other for updating the reference points $\alpha_j^t$ and $Z_t$. In contrast, our method avoids such a double-sampling scheme and uses only one set of random samples to update both quantities. It reduces to the original SAGA scheme when applied to problem (10), even with nonconvex $g_i$'s. Nevertheless, it attains the same sample complexity as the methods of Reddi et al. (2016a;b).
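The following is a minimal NumPy sketch of Algorithm 1, assuming each $g_i$ and its Jacobian are supplied as callables; the function and argument names are my own, and this is an illustration rather than a tuned implementation.

```python
import numpy as np

def c_saga(gs, Jgs, grad_f, prox_r, x0, eta, T, s, seed=0):
    """Minimal sketch of Algorithm 1 (C-SAGA).

    gs[i](x)        -> g_i(x), a vector in R^p
    Jgs[i](x)       -> Jacobian g_i'(x), a (p, d) array
    grad_f(y)       -> f'(y), a vector in R^p
    prox_r(v, eta)  -> prox_{eta r}(v)
    """
    rng = np.random.default_rng(seed)
    n = len(gs)
    x = x0.copy()
    alpha = [x0.copy() for _ in range(n)]              # reference points alpha_i^t
    Y = sum(g(x0) for g in gs) / n                     # Y_0 in (23)
    Z = sum(J(x0) for J in Jgs) / n                    # Z_0 in (23)
    for t in range(T):
        S = rng.integers(0, n, size=s)                 # sample with replacement, |S_t| = s
        dg = np.stack([gs[j](x) - gs[j](alpha[j]) for j in S])
        dJ = np.stack([Jgs[j](x) - Jgs[j](alpha[j]) for j in S])
        y = Y + dg.mean(axis=0)                        # (18): E[y_t | x_t] = g(x_t)
        z = Z + dJ.mean(axis=0)                        # (19): E[z_t | x_t] = g'(x_t)
        x_next = prox_r(x - eta * (z.T @ grad_f(y)), eta)   # (20) with (24)
        for j in set(S.tolist()):                      # (21), (22) and step 7; iterating over
            Y = Y + (gs[j](x) - gs[j](alpha[j])) / n   # distinct indices keeps (23) exact even
            Z = Z + (Jgs[j](x) - Jgs[j](alpha[j])) / n # if an index repeats in S_t (values are
            alpha[j] = x.copy()                        # recomputed here for clarity; a real
        x = x_next                                     # implementation would cache them)
    return x
```

For instance, with `prox_r=prox_l1` from the earlier sketch, this runs C-SAGA on an $\ell_1$-regularized instance of (1).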
## 3. Convergence Analysis: Sublinear Rates

We make the following assumption concerning the functions $g_i$, $f$, $r$ and $g$ defined in (11).

**Assumption 1 (smoothness).** Each $g_i:\mathbb{R}^d\to\mathbb{R}^p$, for $i=1,\dots,n$, is a $C^1$-smooth vector mapping. In particular, each $g_i$ is $\ell_g$-Lipschitz and its Jacobian matrix $g_i'$ is $L_g$-Lipschitz. Consequently, $g$ is $\ell_g$-Lipschitz and $g'$ is $L_g$-Lipschitz. The function $f:\mathbb{R}^p\to\mathbb{R}$ is $C^1$-smooth, with $f$ being $\ell_f$-Lipschitz and its gradient $f'$ being $L_f$-Lipschitz. The function $r:\mathbb{R}^d\to\mathbb{R}\cup\{\infty\}$ is convex and can be nonsmooth.

As a result of Assumption 1, the gradient of $F$ defined in (11), denoted $F'$, is $L_F$-Lipschitz continuous with $L_F=\ell_g^2 L_f+\ell_f L_g$. Indeed, by the chain rule $F'(x)=(g'(x))^T f'(g(x))$, so for any $x$ and $y$,
$$
\|F'(x)-F'(y)\| \;\le\; \|g'(x)\|\,\big\|f'(g(x))-f'(g(y))\big\| + \|g'(x)-g'(y)\|\,\|f'(g(y))\| \;\le\; \big(\ell_g^2 L_f+\ell_f L_g\big)\|x-y\|.
$$
We also define the constant $G_0 = 18\,\ell_g^4 L_f^2 + 2\,\ell_f^2 L_g^2 = O(L_F^2)$.

**Theorem 1.** Suppose Assumption 1 holds and we choose $\alpha_i^0=x_0$ for all $i=1,\dots,n$. Let the sequence $\{x_t\}_{t=0}^T$ be generated by Algorithm 1, and let $t'$ be chosen uniformly at random from $\{1,\dots,T\}$.

1. If the batch size $s=1$, then we can choose the step size $\eta=\frac{1}{4n\sqrt{3G_0}}=O\big(\frac{1}{nL_F}\big)$, which gives
$$
\mathbb{E}\big[\|\mathcal{G}(x_{t'})\|^2\big] \;\le\; \frac{32n\sqrt{3G_0}}{T}\,\mathbb{E}\big[\Phi(x_0)-\Phi(x_T)\big].
$$
Therefore, finding $x_{t'}$ with $\mathbb{E}[\|\mathcal{G}(x_{t'})\|^2]\le\epsilon$ requires $O(L_F n/\epsilon)$ function and Jacobian evaluations.

2. If we choose the batch size $s=n^{2/3}$ and the step size $\eta=\frac{1}{\sqrt{L_F^2+48G_0}}=O\big(\frac{1}{L_F}\big)$, then
$$
\mathbb{E}\big[\|\mathcal{G}(x_{t'})\|^2\big] \;\le\; \frac{8}{T\eta}\,\mathbb{E}\big[\Phi(x_0)-\Phi(x_T)\big] \;=\; O\Big(\frac{L_F}{T}\Big)\,\mathbb{E}\big[\Phi(x_0)-\Phi(x_T)\big].
$$
As a result, finding $x_{t'}$ with $\mathbb{E}[\|\mathcal{G}(x_{t'})\|^2]\le\epsilon$ requires $O(n+L_F n^{2/3}/\epsilon)$ function and Jacobian evaluations.

Note that for the case $s=1$, the sample complexity is $O(L_F n/\epsilon)$, which matches the sample complexity of the full-gradient descent method. This is the first result in stochastic composite optimization that matches the performance of full-gradient descent while taking just one sample per iteration. By using $s=n^{2/3}$ and the large step size $\eta=O(1/L_F)$, we obtain the improved sample complexity $O(n+L_F n^{2/3}/\epsilon)$.

### 3.1. Outline of Analysis

Here we give an outline of the proof of Theorem 1, emphasizing the key ideas and steps. Detailed proofs of the individual lemmas are given in the supplementary materials.

First, as explained before, $y_t$ and $z_t$ are unbiased estimates of $g(x_t)$ and $g'(x_t)$ respectively, but $\widetilde\nabla F(x_t)=z_t^T f'(y_t)$ is a biased estimate of $F'(x_t)=(g'(x_t))^T f'(g(x_t))$. The following lemma bounds the variances of the unbiased estimates, as well as the squared error in estimating $F'(x_t)$.

**Lemma 1.** Suppose $\{x_t\}$ and $\{\alpha_j^t\}$ for $j=1,\dots,n$ are generated by Algorithm 1. Then we have
$$
\mathbb{E}[y_t\,|\,x_t]=g(x_t), \qquad \mathbb{E}\big[\|y_t-g(x_t)\|^2\,\big|\,x_t\big] \;\le\; \frac{\ell_g^2}{s}\cdot\frac{1}{n}\sum_{j=1}^n\|x_t-\alpha_j^t\|^2,
$$
where $\mathbb{E}[\,\cdot\,|\,x_t]$ denotes conditional expectation given $x_t$, and
$$
\mathbb{E}[z_t\,|\,x_t]=g'(x_t), \qquad \mathbb{E}\big[\|z_t-g'(x_t)\|^2\,\big|\,x_t\big] \;\le\; \frac{L_g^2}{s}\cdot\frac{1}{n}\sum_{j=1}^n\|x_t-\alpha_j^t\|^2.
$$
In addition, the bias in estimating $F'(x_t)$ can be bounded as
$$
\mathbb{E}\big[\|\widetilde\nabla F(x_t)-F'(x_t)\|^2\,\big|\,x_t\big] \;\le\; \frac{G_0}{s}\cdot\frac{1}{n}\sum_{j=1}^n\|x_t-\alpha_j^t\|^2.
$$
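For instance, the first identity in Lemma 1 follows directly from the construction (18) and the definition of $Y_t$ in (23): since each index in $S_t$ is drawn uniformly from $\{1,\dots,n\}$,

$$
\mathbb{E}[y_t\,|\,x_t] \;=\; Y_t + \mathbb{E}\Big[\frac{1}{s}\sum_{j\in S_t}\big(g_j(x_t)-g_j(\alpha_j^t)\big)\,\Big|\,x_t\Big]
\;=\; \frac{1}{n}\sum_{i=1}^n g_i(\alpha_i^t) + \frac{1}{n}\sum_{i=1}^n\big(g_i(x_t)-g_i(\alpha_i^t)\big) \;=\; g(x_t),
$$

and the same argument applied to (19) gives $\mathbb{E}[z_t\,|\,x_t]=g'(x_t)$.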
In order to quantify the optimality of $x_t$, a proper metric is the norm of the proximal gradient mapping $\frac{1}{\eta}(x_t-\hat x_{t+1})$, where

$$
\hat x_{t+1} = \mathrm{prox}_{\eta r}\big(x_t-\eta F'(x_t)\big). \tag{25}
$$

Eventually, we expect our algorithm to return a point $x$ with $\mathbb{E}[\|\mathcal{G}(x)\|^2]\le\epsilon$. However, in the algorithm, only the approximate proximal gradient mapping $\widetilde{\mathcal{G}}(x_t)\triangleq\frac{1}{\eta}(x_t-x_{t+1})$, where $x_{t+1}$ is given in (20), can be computed. Thus we have to ensure the closeness between $\mathcal{G}(x_t)$ and $\widetilde{\mathcal{G}}(x_t)$. This result is given in the next lemma.

**Lemma 2.** Let $\hat x_{t+1}$ be defined according to (25). Then
$$
\frac{1}{\eta}\,\|x_{t+1}-\hat x_{t+1}\| \;\le\; \big\|\widetilde\nabla F(x_t)-F'(x_t)\big\|,
$$
and consequently
$$
\mathbb{E}\big[\|\mathcal{G}(x_t)\|^2\big] \;\le\; 2\,\mathbb{E}\big[\|\widetilde{\mathcal{G}}(x_t)\|^2\big] + \frac{2G_0}{s}\,\mathbb{E}\Big[\frac{1}{n}\sum_{j=1}^n\|x_t-\alpha_j^t\|^2\Big].
$$

Using the two lemmas above, we can show the next result, which quantifies the expected descent over the iterations.

**Lemma 3.** If the sequence $\{x_t\}$ is generated by Algorithm 1, then the following descent property holds:
$$
\mathbb{E}[\Phi(x_{t+1})] \;\le\; \mathbb{E}[\Phi(x_t)] - \frac{\eta}{2}\,\mathbb{E}\big[\|\widetilde{\mathcal{G}}(x_t)\|^2\big] + \frac{G_0\eta}{2s}\,\mathbb{E}\Big[\frac{1}{n}\sum_{j=1}^n\|x_t-\alpha_j^t\|^2\Big]. \tag{26}
$$

To study the convergence of Algorithm 1, we also need to construct the following Lyapunov function:

$$
R_t = \mathbb{E}\Big[\Phi(x_t) + c_t\,\frac{1}{n}\sum_{j=1}^n\|x_t-\alpha_j^t\|^2\Big], \tag{27}
$$

where the coefficients $c_t$ satisfy $c_T=0$ and the recursion

$$
c_t = c_{t+1}(1-p)(1+\beta) + \frac{3G_0}{4s}\,\eta. \tag{28}
$$

The coefficient $p = 1-\big(1-\frac{1}{n}\big)^s \ge \frac{s}{2n}$ is the probability that a given index $j$ is chosen to enter the set $S_t$, and $\beta>0$ is an arbitrary coefficient that will be determined later. The following lemma bounds the evolution of the distances between $x_t$ and the reference points $\alpha_j^t$.

**Lemma 4.** Suppose $\{x_t\}$ and $\{\alpha_j^t\}$ for $j=1,\dots,n$ are generated by Algorithm 1. Then the following result holds:
$$
\mathbb{E}\Big[\frac{1}{n}\sum_{j=1}^n\|x_{t+1}-\alpha_j^{t+1}\|^2\Big] \;\le\; \Big(1+\frac{1}{\beta}\Big)\,\mathbb{E}\big[\|x_{t+1}-x_t\|^2\big] + (1-p)(1+\beta)\,\mathbb{E}\Big[\frac{1}{n}\sum_{j=1}^n\|x_t-\alpha_j^t\|^2\Big],
$$
where $\beta>0$ is the arbitrary constant appearing in (28).

Reddi et al. (2016a;b) derived a sharper bound, replacing the coefficient $(1+1/\beta)$ above by $(1+(1-p)/\beta)$, but required two independent sets of samples of size $s$ for updating $z_t$ and $\alpha_t$ respectively. Our bound in Lemma 4 is slightly looser, but it only requires one set of samples and does not deteriorate the final complexity. Combining Lemma 3 and Lemma 4, we have the following result.

**Lemma 5.** Let the Lyapunov function $R_t$ and the coefficients $c_t$ be defined according to (27) and (28) respectively. Then
$$
\mathbb{E}\Big[\gamma\,\|\widetilde{\mathcal{G}}(x_t)\|^2 + \frac{G_0\eta}{4s}\cdot\frac{1}{n}\sum_{j=1}^n\|x_t-\alpha_j^t\|^2\Big] \;\le\; R_t - R_{t+1}, \tag{29}
$$
where
$$
\gamma = \min_{0\le t\le T-1}\Big\{\frac{\eta}{2} - c_{t+1}\Big(1+\frac{1}{\beta}\Big)\eta^2\Big\}. \tag{30}
$$

One last lemma is needed to guarantee that $\gamma$ is sufficiently large, namely $\gamma=\Theta(\eta)$.

**Lemma 6.** Let $c_t$ be defined according to (28) and let $\gamma$ be defined in (30). If we choose $\beta=s/(4n)$, then for any $0\le t\le T$ we have $c_t \le 3nG_0\eta/s^2$. As a result, we can set

1. either $s=1$ and $\eta=\frac{1}{4n\sqrt{3G_0}}=O\big(\frac{1}{nL_F}\big)$,
2. or $s=n^{2/3}$ and $\eta=\frac{1}{\sqrt{L_F^2+48G_0}}=O\big(\frac{1}{L_F}\big)$.

Under both sets of parameters, we have $\gamma\ge\eta/4$.

Finally, Theorem 1 can be proved using Lemma 5 and Lemma 6; see the details in the supplementary materials.
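As a quick sanity check on Lemma 6, the following script (with arbitrary illustrative parameter values, not from the paper) iterates the recursion (28) backward from $c_T=0$ and compares the largest coefficient with the bound $3nG_0\eta/s^2$.

```python
# Sanity check of Lemma 6: iterate recursion (28) backward from c_T = 0 and
# compare the largest coefficient with the bound 3*n*G0*eta/s**2.
n, s, T = 1000, 100, 5000                 # arbitrary illustrative values
G0, eta = 4.0, 0.01
p = 1.0 - (1.0 - 1.0 / n) ** s            # probability that index j enters S_t
beta = s / (4.0 * n)                      # the choice of beta in Lemma 6
c, c_max = 0.0, 0.0                       # c_T = 0
for _ in range(T):
    c = c * (1 - p) * (1 + beta) + 3 * G0 * eta / (4 * s)   # recursion (28)
    c_max = max(c_max, c)
print(c_max, 3 * n * G0 * eta / s**2)     # c_max stays below the bound
```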
## 4. Linear Convergence of C-SAGA

In this section, we present the performance of Algorithm 1 under the gradient-dominant condition and the strong convexity condition, where the algorithm exhibits fast linear convergence.

### 4.1. Gradient-Dominant Function

First, we analyze the case where $r(x)\equiv 0$ and $F(x)$ is $\nu$-gradient dominant. Formally, we make the following assumption.

**Assumption 2.** The nonsmooth part $r(x)\equiv 0$ and the smooth part $F$ is $\nu$-gradient dominant, meaning that there exists some $\nu>0$ such that

$$
F(x) - \inf_y F(y) \;\le\; \frac{\nu}{2}\,\|F'(x)\|^2, \qquad \forall\, x\in\mathbb{R}^d. \tag{31}
$$

Note that strong convexity is a special case of gradient domination, which in turn is a special case of the Polyak-Łojasiewicz condition (e.g., Karimi et al., 2016). More specifically, a strongly convex function with convexity parameter $\mu$ is $\frac{1}{\mu}$-gradient dominant (e.g., Nesterov, 2004). We denote by $\kappa=\nu L_F$ the effective condition number.

Reddi et al. (2016a) proposed a restarted SAGA algorithm for solving the finite-sum problem (10) that achieves a sample complexity of $O((n+\kappa n^{2/3})\log(1/\epsilon))$. However, each of their restarts requires synchronization of all the reference points, and the frequent restarts make the algorithm more like an SVRG-type algorithm. Here we show that our C-SAGA method in Algorithm 1, without any modification, achieves the same sample complexity for the more general composite finite-sum problem (1).

We denote Algorithm 1 (C-SAGA) as a mapping

$$
(x_{t'}, \alpha_1^{t'}, \dots, \alpha_n^{t'}) = \text{C-SAGA}(x_0, \alpha_1^0, \dots, \alpha_n^0, \tau, \eta),
$$

where $\tau$ is the number of iterations and $\eta$ is the step size. In addition to $x_{t'}$, the outputs include the reference points $\alpha_i^{t'}$ for $i=1,\dots,n$ at time $t'$, which is chosen uniformly at random from $\{1,\dots,\tau\}$.

For ease of understanding, we imagine running the algorithm in $K$ stages, each with $\tau$ iterations and initialized with the previous output and reference points. Since the outputs are snapshots randomly chosen from the previous $\tau$ iterations, we do not really need to finish all $\tau$ iterations in each stage. We can simply generate a random $l_k$ from $\{1,\dots,\tau\}$ and move on to the next stage as soon as we reach the $l_k$-th iteration of the current stage. This is equivalent to partitioning the iterations of Algorithm 1 into virtually assigned stages. More precisely, let $\{x_t\}$ and $\{\alpha_1^t,\dots,\alpha_n^t\}$ be generated by Algorithm 1, and let each $l_k$, for $k=0,\dots,K-1$, be an independent uniform sample from $\{1,\dots,\tau\}$. Then we assign the subsequences $\{\tilde x_k\}$ and $\{\tilde\alpha_j^k\}$, for $j=1,\dots,n$, as follows:

$$
\tilde x_{k+1} = x_{T_k+l_k}, \qquad \tilde\alpha_j^{k+1} = \alpha_j^{T_k+l_k}, \quad j=1,\dots,n, \tag{32}
$$

with $\tilde x_0=x_0$, $\tilde\alpha_j^0=\alpha_j^0$, and

$$
T_{k+1} = T_k + l_k \quad \text{for } k=0,\dots,K-1, \qquad \text{with } T_0=0. \tag{33}
$$

**Theorem 2.** Suppose Assumptions 1 and 2 hold. Let the sequence $\{x_t\}$ be generated by Algorithm 1, and let $\{\tilde x_k\}$ be defined by (32) and (33), where each $l_k$ is uniformly sampled from $\{1,\dots,\tau\}$. If we set the virtual epoch length parameter $\tau = 16\nu/\eta + 24n/s$, then

$$
\mathbb{E}\Big[F(\tilde x_k)-F(x^*) + 3G_0\eta\,\frac{1}{n}\sum_{j=1}^n\|\tilde x_k-\tilde\alpha_j^k\|^2\Big] \;\le\; 2^{-k}\,\mathbb{E}\Big[F(\tilde x_0)-F(x^*) + 3G_0\eta\,\frac{1}{n}\sum_{j=1}^n\|\tilde x_0-\tilde\alpha_j^0\|^2\Big].
$$

Consequently, in order to guarantee $\mathbb{E}[F(\tilde x_k)-F(x^*)]\le\epsilon$:

1. If we choose $s=1$ and $\eta=O(1/(nL_F))$, then $\tau=O(n(\nu L_F+1))$ and the total sample complexity is $O(n(\kappa+1)\log(1/\epsilon))$;
2. If we choose $s=n^{2/3}$ and $\eta=O(1/L_F)$, then $\tau=O(\nu L_F+n^{1/3})$ and the total sample complexity is $O(\tau s\log(1/\epsilon)) = O((n+\kappa n^{2/3})\log(1/\epsilon))$.

### 4.2. Optimally Strongly Convex Function

In this part, we work with a general nonsmooth convex proper function $r(x)$ and a $\mu$-optimally strongly convex assumption on $\Phi(x)=F(x)+r(x)$. Formally, we assume $\Phi(x)$ satisfies the following assumption.

**Assumption 3.** We assume $F$ is convex and the overall objective function $\Phi(x)$ is $\mu$-optimally strongly convex, i.e.,

$$
\Phi(x) - \Phi(x^*) \;\ge\; \frac{\mu}{2}\,\|x-x^*\|^2, \qquad \forall\, x\in\mathbb{R}^d, \tag{34}
$$

for some constant $\mu>0$, where $x^*=\arg\min_y\Phi(y)$. In addition, we assume $r$ is convex and possibly nonsmooth.

**Theorem 3.** Suppose Assumptions 1 and 3 hold and $\alpha_i^0=x_0$ for all $i=1,\dots,n$. Let the sequence $\{x_t\}$ be generated by Algorithm 1, and suppose $\{\tilde x_k\}$ is assigned according to (32) and (33) with $l_k$ sampled uniformly from $\{1,\dots,\tau\}$. If we set the virtual epoch length to $\tau = 28/(3\eta\mu) + 96n/s$, then

$$
\mathbb{E}\big[\Phi(\tilde x_k)-\Phi(x^*)\big] \;\le\; \frac{1}{2^k}\,\mathbb{E}\big[\Phi(x_0)-\Phi(x^*)\big].
$$

Consequently, in order to ensure $\mathbb{E}[\Phi(\tilde x_k)-\Phi(x^*)]\le\epsilon$:

1. If $s=1$ and $\eta=O(1/(nL_F))$, then $\tau=O(n(\kappa+1))$ and the total sample complexity is $O(n(\kappa+1)\log(1/\epsilon))$;
2. If $s=n^{2/3}$ and $\eta=O(1/L_F)$, then $\tau=O(\kappa+n^{1/3})$ and the total sample complexity is $O(\tau s\log(1/\epsilon)) = O((n+\kappa n^{2/3})\log(1/\epsilon))$.
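The virtual-stage construction (32) and (33) can be expressed as the following driver sketch; the names are my own, and `run_c_saga` is assumed to run Algorithm 1 for a given number of iterations from a warm start and return the final iterate together with the reference-point table.

```python
import numpy as np

def staged_c_saga(run_c_saga, x0, alpha0, tau, eta, K, seed=0):
    """Sketch of the virtual-stage scheme (32)-(33): K stages, each
    warm-started from the previous snapshot."""
    rng = np.random.default_rng(seed)
    x, alpha = x0, alpha0
    for k in range(K):
        l_k = int(rng.integers(1, tau + 1))   # uniform sample from {1, ..., tau}
        # Stopping at iteration l_k is equivalent to running all tau iterations
        # and returning a uniformly chosen snapshot, so the remaining
        # iterations of the stage never need to be computed.
        x, alpha = run_c_saga(x, alpha, l_k, eta)
    return x
```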
## 5. Numerical Experiments

In this section, we present numerical experiments for two applications: risk-averse portfolio optimization and policy evaluation for Markov decision processes.

### 5.1. Risk-Averse Portfolio Optimization

Suppose there are $d$ different assets we can invest in during a time interval $\{1,\dots,n\}$. Let $R_{i,j}$ be the return of asset $j$ at time $i$, and let $R_i$ be the vector consisting of $R_{i,1},\dots,R_{i,d}$. Then the risk-averse portfolio optimization problem can be formulated as problem (5) by defining

$$
h_i(x) = \langle R_i, x\rangle, \qquad i=1,\dots,n,
$$

where $x\in\mathbb{R}^d$ is the vector of asset allocations. In addition, we add an $\ell_1$ regularization term $r(x)=\beta\|x\|_1$ for some $\beta>0$, in order to obtain sparse asset allocations.

As discussed in the introduction, using the mappings defined in (6) and (7), we can transform this problem into the form of (2). We solve this formulation with two algorithms that handle the finite-sum structure both inside and outside of the composition: ASC-PG (Wang et al., 2017b) and VRSC-PG (Huo et al., 2018). Alternatively, if we use the mappings defined in (8) and (9), then the problem is transformed into the form of (1), which we can solve with our C-SAGA algorithm.

In our experiments, the reward vectors $R_i$ are generated as $n$ i.i.d. Gaussian random vectors with a random correlation matrix $C=LL^T$, where $L\in\mathbb{R}^{d\times d}$ has i.i.d. $N(0,1)$ entries. We set the parameter $\lambda=0.1$ in (5) and the $\ell_1$ regularization parameter $\beta=1$. We test the algorithms on a randomly generated case with $d=500$ and $n=5000$, and plot the results in Figure 1. The curves are averaged over 20 runs and are plotted against the number of samples of the component functions.

Figure 1. Experiments on the risk-averse portfolio optimization problem ($n=5000$, $d=500$): gradient norm and objective gap versus number of samples for C-SAGA, ASC-PG and VRSC-PG.

Both VRSC-PG and C-SAGA use the same step size $\eta=0.001$ and batch size $s=n^{2/3}$, chosen by experimenting with $\{1, 0.1, 0.01, 0.001, 0.0001\}$; $\eta=0.001$ works best for both VRSC-PG and C-SAGA. For ASC-PG, we set its parameters $\alpha_k=0.001/k$ and $\beta_k=1/k$ (this $\beta_k$ is different from the sparsity penalty parameter $\beta$; see Wang et al. (2017b)). They are hand-tuned to make ASC-PG converge fast among a range of tested parameters.

As shown in Figure 1, ASC-PG is the slowest due to its lack of a variance reduction scheme. Both VRSC-PG and C-SAGA exhibit linear convergence in this case, and C-SAGA is faster than VRSC-PG in terms of sample efficiency. This supports our theory that C-SAGA can be more efficient (weaker dependence on the condition number) when linear convergence occurs.

### 5.2. Policy Evaluation for MDP

Consider a Markov decision process (MDP) with state space $\mathcal{S}=\{1,\dots,S\}$. Let the reward associated with the transition from state $i$ to state $j$ be $R_{i,j}$, and let $P^\pi\in\mathbb{R}^{S\times S}$ be the transition probability matrix under some fixed policy $\pi$. We would like to evaluate the value function $V^\pi:\mathcal{S}\to\mathbb{R}$ of the policy, which satisfies the following Bellman equation:

$$
V^\pi(i) = \sum_{j=1}^S P^\pi_{i,j}\big(R_{i,j}+\gamma V^\pi(j)\big) = \mathbb{E}_{j|i}\big[R_{i,j}+\gamma V^\pi(j)\big].
$$

We apply the linear function approximation $V^\pi(i)\approx\langle\Phi_i, w\rangle$ for a given set of feature vectors $\Phi_i$ (e.g., Dann et al., 2014; Wang et al., 2017b), and would like to compute the optimal vector $w^*$. This can be formulated as the following problem:

$$
\underset{w}{\text{minimize}}\quad F(w) \triangleq \sum_{i=1}^S\Big(\langle\Phi_i,w\rangle - \sum_{j=1}^S P^\pi_{i,j}\big(R_{i,j}+\gamma\langle\Phi_j,w\rangle\big)\Big)^2.
$$

Let us denote

$$
q_i^\pi(w) \;\triangleq\; \sum_{j=1}^S P^\pi_{i,j}\big(R_{i,j}+\gamma\langle\Phi_j,w\rangle\big) \;=\; \mathbb{E}_{j|i}\big[R_{i,j}+\gamma\langle\Phi_j,w\rangle\big].
$$

Then by defining

$$
g(w) = \big(\langle\Phi_1,w\rangle, \dots, \langle\Phi_S,w\rangle,\, q_1^\pi(w), \dots, q_S^\pi(w)\big)^T
$$

and

$$
f(y_1,\dots,y_S,z_1,\dots,z_S) = \|y-z\|^2 = \sum_{i=1}^S (y_i-z_i)^2,
$$

this problem is transformed into the form of (3).
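To make the construction concrete, here is a small NumPy sketch of these mappings, with randomly generated (purely illustrative) transition matrix, rewards and features.

```python
import numpy as np

rng = np.random.default_rng(0)
S, d, gamma = 10, 4, 0.9
P = rng.random((S, S)); P /= P.sum(axis=1, keepdims=True)  # transition matrix P^pi
R = rng.random((S, S))                                     # rewards R_{i,j}
Phi = rng.standard_normal((S, d))                          # feature vectors Phi_i

def g(w):
    y = Phi @ w                                     # first S components: <Phi_i, w>
    z = (P * (R + gamma * y[None, :])).sum(axis=1)  # last S components: q_i^pi(w)
    return np.concatenate([y, z])

def f(u):
    y, z = u[:S], u[S:]
    return np.sum((y - z) ** 2)                     # ||y - z||^2

w = rng.standard_normal(d)
print(f(g(w)))                                      # the objective value F(w)
```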
Note that the first $S$ components of $g$ are deterministic, so for SCGD (Wang et al., 2017a), ASCGD (Wang et al., 2017a) and ASC-PG (Wang et al., 2017b), the running-average estimation of $g$ is not applied to these components. The last $S$ components are $S$ independent expectations, so the variance reduction technique of both VRSC-PG (Huo et al., 2018) and C-SAGA is applied to each of these components.

In the experiments, $P^\pi$, $\Phi$ and $R^\pi$ are generated randomly. Figure 2 shows two experiments, with sizes $S=10$ and $S=100$ respectively. We plot the objective values and gradient norms against the number of samples drawn by the algorithms, averaged over 20 runs.

Figure 2. Experiments on the MDP policy evaluation problem: objective gap $F(w_k)-F(w^*)$ and gradient norm versus number of samples for SCGD, ASCGD, ASC-PG, VRSC-PG and C-SAGA. Row 1: case with $S=10$. Row 2: case with $S=100$.

For the case $S=10$, both VRSC-PG and C-SAGA use the same batch size $s=1$. C-SAGA takes a step size $\eta=0.1$, while VRSC-PG takes a step size $\eta=0.03$, because it diverges under $\eta=0.1$ and $\eta=0.03$ seems to work best for VRSC-PG. For $S=100$, we set $\eta=0.005$ and batch size $s=10$ for both C-SAGA and VRSC-PG. The step size is chosen as the best among $\{0.1, 0.05, 0.01, 0.005, 0.001\}$; under $\eta=0.005$ both VRSC-PG and C-SAGA attain their best performance. For SCGD, we choose $\alpha_k=0.01k^{-3/4}$ and $\beta_k=0.1k^{-1/2}$. For ASCGD, $\alpha_k=0.01k^{-5/7}$ and $\beta_k=0.1k^{-4/7}$. For ASC-PG, $\alpha_k=0.01k^{-1/2}$ and $\beta_k=0.1k^{-1}$. The meanings of these step size parameters can be found in Wang et al. (2017a;b); they are hand-tuned to yield fast convergence.

Figure 2 shows that C-SAGA has the best performance among the compared methods. In summary, C-SAGA is very effective for solving composite finite-sum problems due to its simple construction and fast convergence. More experiments are presented in the supplementary materials.

## References

Allen-Zhu, Z. and Hazan, E. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 699-707, 2016.

Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. Entropy-SGD: Biasing gradient descent into wide valleys. arXiv preprint, arXiv:1611.01838, 2016.

Dann, C., Neumann, G., and Peters, J. Policy evaluation with temporal differences: a survey and comparison. Journal of Machine Learning Research, 15(1):809-883, 2014.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646-1654, 2014.

Gulcehre, C., Moczulski, M., Visin, F., and Bengio, Y. Mollifying networks. arXiv preprint, arXiv:1608.04980, 2016.

Hazan, E., Levy, K. Y., and Shalev-Shwartz, S. On graduated optimization for stochastic non-convex problems. In International Conference on Machine Learning, pp. 1833-1841, 2016.

Huo, Z., Gu, B., Liu, J., and Huang, H. Accelerated method for stochastic composition optimization with nonsmooth regularization. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pp. 3287-3294, 2018.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315-323, 2013.

Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition.
In Machine Learning and Knowledge Discovery in Databases - European Conference, Proceedings, pp. 795-811, 2016.

Lei, L., Ju, C., Chen, J., and Jordan, M. I. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems 30, pp. 2348-2358, 2017.

Lian, X., Wang, M., and Liu, J. Finite-sum composition optimization via variance reduced gradient descent. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1159-1167, 2017.

Lin, T., Fan, C., Wang, M., and Jordan, M. I. Improved oracle complexity for stochastic compositional variance reduced gradient. arXiv preprint, arXiv:1806.00458, 2018.

Liu, L., Liu, J., Hsieh, C.-J., and Tao, D. Stochastically controlled stochastic gradient for the convex and non-convex composition problem. arXiv preprint, arXiv:1809.02505, 2018.

Mobahi, H. and Fisher, J. W. On the link between Gaussian homotopy continuation and convex envelopes. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pp. 43-56. Springer, 2015.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston, 2004.

Nesterov, Y. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125-161, 2013.

Reddi, S. J., Sra, S., Póczos, B., and Smola, A. Fast incremental method for smooth nonconvex optimization. In 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 1971-1977. IEEE, 2016a.

Reddi, S. J., Sra, S., Póczos, B., and Smola, A. J. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems, pp. 1145-1153, 2016b.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

Wang, M. and Liu, J. A stochastic compositional gradient method using Markov samples. In 2016 Winter Simulation Conference (WSC), pp. 702-713, 2016.

Wang, M., Fang, E. X., and Liu, H. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419-449, 2017a.

Wang, M., Liu, J., and Fang, E. Accelerating stochastic composition optimization. Journal of Machine Learning Research, 18(105):1-23, 2017b.

Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057-2075, 2014.

Yu, Y. and Huang, L. Fast stochastic variance reduced ADMM for stochastic composition optimization. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), pp. 3364-3370, 2017.