# debiasing_conditional_stochastic_optimization__f42b453b.pdf

Debiasing Conditional Stochastic Optimization

Lie He EPFL lie.he@epfl.ch

Shiva Prasad Kasiviswanathan Amazon kasivisw@gmail.com

In this paper, we study the conditional stochastic optimization (CSO) problem which covers a variety of applications including portfolio selection, reinforcement learning, robust learning, causal inference, etc. The sample-averaged gradient of the CSO objective is biased due to its nested structure, and therefore requires a high sample complexity for convergence. We introduce a general stochastic extrapolation technique that effectively reduces the bias. We show that for nonconvex smooth objectives, combining this extrapolation with variance reduction techniques can achieve a significantly better sample complexity than the existing bounds. Additionally, we develop new algorithms for the finite-sum variant of the CSO problem that also significantly improve upon existing results. Finally, we believe that our debiasing technique has the potential to be a useful tool for addressing similar challenges in other stochastic optimization problems.

1 Introduction

In this paper, we investigate the conditional stochastic optimization (CSO) problem as presented by Hu et al. [16], which is formulated as follows: min x Rd F(x) = Eξ[fξ(Eη|ξ[gη(x; ξ)])], (CSO)

where ξ and η represent two random variables, with η conditioned on ξ. The fξ : Rp R and gη : Rd Rp denote a stochastic function and a mapping respectively. The inner expectation is calculated with respect to the conditional distribution of η|ξ. In line with the established CSO framework [16, 15], throughout this paper, we assume access to samples from the distribution P(ξ) and the conditional distribution P(η|ξ).

Many machine learning tasks can be formulated as a CSO problem, such as policy evaluation and control in reinforcement learning [6, 24], and linearly-solvable Markov decision process [5]. Other examples of the CSO problem include instrumental variable regression [23] and invariant learning [16]. Moreover, the widely-used Model-Agnostic Meta-Learning (MAML) framework, which seeks to determine a meta-initialization parameter using metadata for related learning tasks that are trained through gradient-based algorithms, is another example of a CSO problem. In this context, tasks ξ are drawn randomly, followed by the drawing of samples η|ξ from the specified task [11]. It is noteworthy that the standard stochastic optimization problem minx Eξ[fξ(x)] represents a degenerate case of the CSO problem, achieved by setting gη as an identity function.

In numerous prevalent CSO problems, such as first-order MAML (FO-MAML) [11], the outer random variable ξ only takes value in a finite set (say in {1, . . . , n}). These problems can be reformulated to have a finite-sum structure in the outer loop and referred to as Finite-sum Coupled Compositional Optimization (FCCO) problem in [33, 19]. In this paper, we also study this problem, formulated as: min x Rd Fn(x) = 1

n Pn i=1 fi(Eη|i[gη(x; i)]). (FCCO)

The FCCO problem also has broad applications in machine learning for optimizing average precision, listwise ranking losses, neighborhood component analysis, deep survival analysis, deep latent variable models [33, 19].

Work initiated during internship at Amazon.

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

Although the CSO and FCCO problems are widespread, they present challenges for optimization algorithms. Based on the special composition structure of CSO, using chain rule, under mild conditions, the full gradient of CSO is given by

F(x) = Eξ Eη|ξ[ gη(x; ξ)] fξ(Eη|ξ[gη(x; ξ)]) .

Constructing an unbiased stochastic estimator for the gradient is generally computationally expensive (and even impossible). A straightforward estimation of F(x) is to estimate Eξ with 1 sample of ξ, estimate Eη|ξ[gη( )] with a set Hξ of m independent and identically distributed (i.i.d.) samples drawn from the conditional distribution P(η|ξ), and Eη|ξ[ gη( )] with a different set Hξ of m i.i.d. samples drawn from the same conditional distribution, i.e.,

ˆFm(x) := 1

η Hξ g η(x; ξ) fξ( 1

η Hξ gη(x; ξ)). (1)

Note that ˆFm(x) consists of two terms. The first term, (1/m) P

η Hξ g η(x; ξ), is an unbiased estimate of Eη|ξ[ gη(x; ξ)]. However, the second term is generally biased, i.e.,

Eη|ξ[ fξ( 1

η Hξ gη(x; ξ))] = fξ(Eη|ξ[gη(x; ξ)]).

Consequently, ˆFm(x) is a biased estimator of F(x). To reach the ε-stationary point of F(x) (Definition 1), the bias has to be sufficiently small.

Optimization with biased gradients converges only to a neighborhood of the stationary point. While the bias diminishes with increasing batch size, it also introduces additional sample complexity. For nonconvex objectives, Biased Stochastic Gradient Descent (BSGD) requires a total sample complexity of O(ε 6) to reach an ε-stationary point [16]. This contrasts with standard stochastic optimization, where sample-averaged gradients are unbiased with a sample complexity of O(ε 4) [12, 3]. This discrepancy has spurred a multitude of proposals aimed at reducing the sample complexities of both CSO and FCCO problems. Hu et al. [16] introduced Biased Spider Boost (BSpider Boost), which, based on the variance reduction technique Spider Boost from Wang et al. [38], reduces the variance of ξ to achieve a sample complexity of O(ε 5) for the CSO problem. Hu et al. [17] proposed multi-level Monte Carlo (MLMC) gradient methods V-MLMC and RT-MLMC to further enhance the sample complexity to O(ε 4). The SOX [33] and MSVR-V2 [19] algorithms concentrated on the FCCO problem and improved the sample complexity to O(nε 4) and O(nε 3), respectively.

Our Contributions. In this paper, we improve the sample complexities for both the CSO and FCCO problems (see Table 1). To facilitate a clear and concise presentation, we will suppress the dependence on specific problem parameters throughout the ensuing discussion.

(a) Our main technical tool in this paper is an extrapolation-based scheme that mitigates. bias in gradient estimations. Considering a suitably differentiable function q( ) and a random variable δ D, we show that we can approximate the value of q(E[δ]) via extrapolation from a limited number of evaluations of q(δ), while maintaining a minimal bias. In the context of CSO and FCCO problems, this scheme is used in gradient estimation, where the function q corresponds to fξ and the random variable δ corresponds to gη. (b) For the CSO problem, we present novel algorithms that integrate the above extrapolation-based scheme with BSGD and BSpider Boost algorithms of Hu et al. [16]. Our algorithms, referred to as E-BSGD and E-BSpider Boost, achieve a sample complexity of O(ε 4.5) and O(ε 3.5) respectively, in order to attain an ε-stationary point for nonconvex smooth objectives. Notably, the sample complexity of E-BSpider Boost improves the best-known sample complexity of O(nε 4) for the CSO problem from Hu et al. [17]. (c) For the FCCO problem2 we propose a new algorithm that again combines the extrapolationbased scheme with a multi-level variance reduction applied to both inner and outer parts of the problem. Our algorithm, referred to as E-Nested VR, achieves a sample complexity of O(nε 3) if n ε 2/3 and O(max{ nε 2.5, ε 4/ n}) if n > ε 2/3 for nonconvex smooth objectives and second-order extrapolation scheme. Our bound is never worse than the O(nε 3) bound of MSVR-V2 algorithm of Jiang et al. [19] and is in fact better if n = Ω(ε 2/3). As an illustration, when n = Θ(ε 1.5), our bound of O(ε 3.25) is significantly better than the MSVR-V2 bound of O(ε 4.5).

2 For the FCCO problem we focus on n = O(ε 2) case, for n = Ω(ε 2) we can just treat the FCCO problem as a CSO problem and get an O(ε 3.5) sample complexity bound via our E-BSpider Boost algorithm.

Problem Old Bounds Our Bounds

Algorithm Bound Algorithm Bound

CSO BSGD [16] O(ε 6) E-BSGD O(ε 4.5) CSO BSpider Boost [16] O(ε 5) E-BSpider Boost O(ε 3.5) CSO RT-MLMC [17] O(ε 4)

FCCO MSVR-V2 [19] O(nε 3) E-Nested VR

( O(nε 3) if n ε 2/3

O(max{ n ε2.5 , 1 nε4 }), if n > ε 2/3

Table 1: Sample complexities needed to reach ε-stationary point for FCCO and CSO problems with nonconvex smooth objectives. Assumptions are comparable, but our results require an additional mild regularity on fξ and gη. For FCCO also see Footnote 2. Note that Ω(ε 3) is a sample complexity lower bound for standard stochastic nonconvex optimization [3], and hence, also for the problems considered in this paper.

In terms of proof techniques, our approach diverges from conventional analyses for the CSO and FCCO problems in that we focus on explicitly bounding the bias and variance terms of the gradient estimator to establish the convergence guarantee. Compared to previous results, our improvements do require an additional mild regularity assumption on fξ and gη mainly that fξ is 4th order differentiable. Firstly, as we discuss in Remark 2 most common instantiations of CSO/FCCO framework such as: 1) invariant logistic regression [16], 2) instrumental variable regression [23], 3) first-order MAML for sine-wave few shot regression [11] and other problems, 4) deep average precision maximization [26, 34], tend to satisfy this assumption. Secondly, we highlight that the bounds derived from previous studies do not improve when incorporating this additional regularity assumption. Thirdly, Ω(ε 3) remains the lower bound for stochastic optimization even under the arbitrary smoothness constraint [2], demonstrating that our improvement is non-trivial. Our results show that, this regularity assumption, which seems to practically valid, can be exploited through a novel extrapolation-based bias reduction technique to provide substantial improvements in sample complexity.3

We defer some additional related work to Appendix B and conclude with some preliminaries.

Notation. Vectors are denoted by boldface letters. For a vector x, x 2 denotes its ℓ2-norm. A function with k continuous derivatives is called a Ck function. We use a b to denote that a Cb for some constant C > 0. We consider expectation over various randomness: Eξ[ ] denotes expectation over the random variable ξ, Eη|ξ[ ] denotes expectation over the conditional distribution of η|ξ. Unless otherwise specified, for a random variable X, E[X] denotes expectation over the randomness in X. We focus on nonconvex objectives in this paper and use the following standard convergence criterion for nonconvex optimization [18].

Definition 1 (ε-stationary point) For a differentiable function F( ), we say that x is a first-order ε-stationary point if F(x) 2 ε2.

For notational convenience, in the rest of this paper, we omit the dependence on ξ (or i in the FCCO context) in the function g and use gη(x) to represent gη(x; ξ).

2 Stochastic Extrapolation as a Tool for Bias Correction

In this section, we present an approach for tackling the bias problem as appears in optimization procedures such as BSGD, BSpider Boost, etc. Importantly, our approach addresses a general problem appearing in optimization settings and could be of independent interest. All missing details from this section are presented in Appendix C.

For ease of presentation, we start by considering the 1-dimensional case and assume a function q : R R, a constant s R. Let δ be a random variable drawn from an arbitrary distribution D over R. In Sections 3 and 4, we apply these ideas to the CSO and FCCO problems where the random variable δ is played by gη( ) and function q is played by fξ. Informally stated, our goal in this section will be to

Efficiently approximate q(s + E[δ]) with few evaluations of {q(s + δ)}δ D.

3Higher-order smoothness conditions have also been exploited in standard stochastic optimization for performance gains [4].

An interesting case is when s = 0, where we are approximating q(E[δ]) with evaluations of {q(δ)}δ D. Now, if q is an affine function, then q(s + E[δ]) = E[q(s + δ)]. However, the equality does not hold true for general q, and there exists a bias, i.e., |q(s + E[δ]) E[q(s + δ)]| > 0. In this section, we introduce a stochastic extrapolation-based method, where we use an affine combination of biased stochastic estimates, to achieve better approximation.

Suppose q C2k is a continuous differentiable up to 2k-th derivative and let h = E[δ]. We expand q(s + δ), the most straightforward approximation of q(s + E[δ]), using Taylor series at s + h, and take expectation,

E[q(s + δ)] =q(s + h) + q (s + h) E[δ h] + q (s+h)

2 E[(δ h)2] + q(3)(s+h)

6 E[(δ h)3]

+ . . . + q(2k 1)(s+h)

(2k 1)! E[(δ h)(2k 1)] + 1 (2k)! E[q(2k)(ϕδ)(δ h)2k], (2)

where ϕδ between s + δ and s + h. While E[q(s + δ)] matches q(s + h) in the first 2 terms, the third term is no longer zero. The approximation error (bias) is

| E[q(s + δ)] q(s + h)| = | q (s+h)

2 E[(δ h)2] + . . . + 1 (2k)! E[q(2k)(ϕδ)(δ h)2k]|.

In order to analyze the upper bound, we make the following assumption on D and q.

Assumption 1 (Bounded moments) For all δ D has bounded higher-order moments: σl := | E[(δ E[δ])l]| < for l = 2, 3, . . . 2k.

Assumption 2 (Bounded derivatives) The q C2k and has bounded derivatives, i.e., al := sups dom(q) |q(l)(s)| < for l = 1, 2, . . . , 2k.

In addition, we consider a sample averaged distribution Dm derived from D as follows.

Definition 2 Given a distribution D satisfying Assumption 1 and m N+, we define the distribution Dm that outputs δ where δ = 1

m Pm i=1 δi with δi i.i.d. D.

The moments of such distribution Dm decrease with batch size m as k 2, | E[(δ E[δ])k]| = O(m k/2 ) (see Lemma 1). Our desiderata would be to construct a scheme that uses some samples from the distribution Dm to construct an approximation of q(s + E[δ]) that satisfies the following requirement.

Definition 3 (kth-order Extrapolation Operator) Given a function q : R R satisfying Assumption 2 and distribution Dm satisfying Assumption 1, we define a kth-order extrapolation operator T (k) Dm as an operator from C2k C2k that given N = N(k) i.i.d. samples δ1, . . . , δN from Dm satisfies s R: | E[T (k) m q(s)] q(s + E[δ])| = O(m k).

We now propose a sequence of operators L(1) Dm, L(2) Dm, L(3) Dm, . . . that satisfy the above definition. The

L(k) Dmq(s) is designed to ensure its Taylor expansion at s+h has a form of q(s+h)+O(E[(δ h)2k]). The remainder O(E[(δ h)2k]) is bounded by O(m k) due to Lemma 1.

A First-order Extrapolation Operator. We define the simplest operator

L(1) Dmq : s 7 [q(s + δ)] where δ i.i.d. Dm.

In Proposition 2 (Appendix C), we show that L(1) Dm is a first-order extrapolation operator.4

A Second-order Extrapolation Operator. We define the following linear operator L(2) Dm which

transforms q C4 into L(2) Dmq which has lesser bias (but similar variance, as shown later).

Definition 4 (L(2) Dm Operator) Given Dm and q, define the following operator,

L(2) Dmq : s 7 h 2 q(s + δ1+δ2

2 ) q(s+δ1)+q(s+δ2)

2 i where δ1, δ2 i.i.d. Dm.

4Note that if the function q is only Lq-Lipschitz continuous, then |E [q(s + δ)] q(s + E[δ])| p

L2q E[|δ E[δ]|]2 Lq σ2

m1/2 . Therefore, in this case, q(s + δ) does not satisfy the first-order guarantee.

Number of estimates

|q(s+ [ ]) Avg( ( )q(s))|

|q(s + [ ]) [ (1)

(a) q(s) = s2/2, δ N(10, 100), m = 1.

|q(s + [ ]) [ ( )

(b) q(s) = s4, p(δ) = δ/2 where δ [0, 2]. Figure 1: The Fig. 1a investigates the estimation errors of L( )q(s) with their number of observations. The Fig. 1b compares the biases of E[L( )q(s)] with increasing inner batch size m.

Note that δ1+δ2

2 is same as sampling from D2m. The absolute difference in the Taylor expansion of L(2) Dmq at s + h differs from q(s + h) as,

O E 2( δ1+δ2

2((δ1 h)3 + (δ2 h)3) = O(|(E[(δ h)3]|) for δ i.i.d. Dm. (3)

The bias error of this scheme can be bounded through the following proposition.

Proposition 1 (Second-order Guarantee) Assume that distribution Dm and q( ) satisfies Assumption 1 and 2 respectively with k = 2. Then, for all s R, E h L(2) Dmq(s) i q(s + E[δ])

4a3σ3+9a4σ2 2 48m2 + 5a4

96 σ4 3σ2 2 m3 .

Remark 1 While extrapolation is motivated by Taylor expansion which requires smoothness, higher order derivatives are not explicitly computed. Appendix F.3 empirically shows that applying extrapolation to non-smooth functions achieves similar bias correction. Relaxing the smoothness conditions is a direction for future work.

The above proposition shows that L(2) Dm is in fact a second-order extrapolation operator with k = 2 under Definition 3. We will use this operator when we consider the CSO and FCCO problems later. Now, focusing on variance, we can relate the variance of L(2) Dmq(s) in terms of the variance of q(s+δ). In particular, a consequence of Lemma 2 is that

E h L(2) Dmq(s) E[L(2) Dmq(s)] 2i = O(E[(q(s + δ) E[q(s + δ)])2]).

Extension of L(2) Dm to Higher-dimensional Case. If q : Rp Rℓis a vector-valued function, then there is a straightforward extension of Definition 4. Now, for distribution D over Rp and corresponding sampled averaged distribution Dm, and s Rp

L(2) Dmq : s 7 h 2 q(s + δ1+δ2

2 ) q(s+δ1)+q(s+δ2)

2 i where δ1, δ2 i.i.d. Dm. (4)

Higher-order Extrapolation Operators. The idea behind the construction of L(2) Dm can be generalized to higher k s. For example, in Proposition 3, we construct a third-order extrapolation operator L(3) Dm through higher degree Taylor series approximation

L(3) Dmq : s 7 ( 1

36L(2) Dm + 5

9L(2) D2m 3

4L(2) D3m 16

9 L(2) D4m + 3L(2) D6m)q(s).

While this idea of expressing the k-th order operator as an affine combination of lower-order operators works for every k, explicit constructions soon become tedious.

In Fig. 1, we empirically demonstrate the effectiveness of extrapolation in stochastic estimation. 5 In Fig. 1a, we choose q(s) = s2/2, δ N(10, 100). For both L(2) D6mq(s) and L(3) Dmq(s), their estimation errors converge to 0 with increasing number of estimates. This coincides with Proposition 1 as a3 = 0

5We use L(1) D12m, L(2) D6m, L(3) Dm to ensure that each estimate uses same amount of samples (12m).

and a4 = 0 for quadratic q. In contrast, biased first order method only converges to a neighborhood. In Fig. 1b, we consider q(s) = s4 and p(δ) = δ/2 where δ [0, 2]. All three methods are biased and their biases decrease with m, i.e. O(m k) for kth order method. Depending on the constants (e.g. ai, σi), a higher-order extrapolation method may need decently large m (burn-in phase) to outperform lower-order methods.

3 Applying Stochastic Extrapolation in the CSO Problem

In this section, we apply the extrapolation-based scheme from the previous section to reduce the bias in the CSO problem. We focus on variants of BSGD and their accelerated version BSpider Boost based on our second-order approximation operator (Definition 4). Let Hξ, Hξ, and H ξ indicate different sets, each of which contains m i.i.d. random variables/samples drawn from the conditional distribution P(η|ξ). Remember that, as mentioned earlier, we use gη(x) to represent gη(x; ξ).

Extrapolated BSGD. At time t, BSGD constructs a biased estimator of F(xt) using one sample ξ and 2m i.i.d. samples from the conditional distribution as in (1)

Gt+1 BSGD = 1

η Hξ g η(xt) fξ 1

η Hξ gη(xt) . (5)

To reduce this bias, we apply the second-order extrapolation operator from (4). At time t, we define Dt+1 g,ξ to be the distribution of the random variable 1 m P

η Hξ gη(xt). Then we apply L(2) Dt+1 g,ξ by

setting q to fξ and s = 0, i.e.

L(2) Dt+1 g,ξ fξ(0) := 2 fξ 1 2m P

η Hξ gη(xt) + 1 2m P

η H ξ gη (xt))

η Hξ gη(xt)) + fξ( 1

η H ξ gη (xt)) , (6)

where 1 m P

η Hξ gη(xt) and 1 m P

η H ξ gη (xt) are i.i.d. drawn from Dt+1 g,ξ . In Algorithm 2 (Appendix A), we present our extrapolated BSGD (E-BSGD) scheme, where we replace fξ( 1

η Hξ gη(xt)) in (5) by L(2) Dt+1 g,ξ fξ(0) resulting in this following gradient estimate:

Gt+1 E-BSGD = 1 m P

η Hξ g η(xt) L(2) Dt+1 g,ξ fξ(0). (7)

Extrapolated BSpider Boost. BSpider Boost, proposed by Hu et al. [16], uses the variance reduction methods for nonconvex smooth stochastic optimization developed by Fang et al. [10], Wang et al. [38]. BSpider Boost builds upon BSGD and has two kinds of updates: a large batch and a small batch update. In each step, it decides which update to apply based on a random coin. With probability pout, it selects a large batch update with B1 outer samples of ξ. With remaining probability 1 pout, it selects a small batch update where the gradient estimator will be updated with gradient information in the current iteration generated with B2 outer samples of ξ and the information from the last iteration. Formally, it constructs a gradient estimate as follows,

( Gt BSB + 1 B2 P ξ B2,|B2|=B2(Gt+1 BSGD Gt BSGD) with prob. 1 pout 1 B1 P

ξ B1,|B1|=B1 Gt+1 BSGD with prob. pout. (8)

We propose our extrapolated BSpider Boost scheme (formally defined in Algorithm 3, Appendix A) by replacing the BSGD gradient estimates in (8) with E-BSGD.

Gt+1 E-BSB =

( Gt E-BSB + 1 B2 P

ξ B2,|B2|=B2(Gt+1 E-BSGD Gt E-BSGD) with prob. 1 pout 1 B1 P

ξ B1,|B1|=B1 Gt+1 E-BSGD with prob. pout. (9)

Sample Complexity Analyses of E-BSGD and E-BSpider Boost. We adopt the standard assumptions used in the literature [27, 35, 33, 41]. All proofs are deferred to Appendix D.

Assumption 3 (Lower bound) F is lower bounded by F .

Assumption 4 (Bounded variance) Assume that gη and gη have bounded variances, i.e., for all ξ in the support of P(ξ) and x Rp, σ2 g := Eη|ξ[ gη(x; ξ) Eη|ξ[gη(x; ξ)] 2 2] < and

ζ2 g := Eη|ξ[ gη(x; ξ) Eη|ξ[ gη(x; ξ)] 2 2] < .

Assumption 5 (Lipschitz continuity/smoothness of fξ and gη) For all ξ in the support of P(ξ) , fξ( ) is Cf-Lipschitz continuous (i.e., fξ(x) fξ(x ) 2 Cf x x 2 x, x Rp) and Lf Lipschitz smooth (i.e., fξ(x) fξ(x ) 2 Lf x x 2, x, x Rp) for any ξ. Similarly, for all ξ in the support of P(ξ) and η in the support of P(η|ξ), gη( ; ξ) is Cg-Lipschitz continuous and Lg-Lipschitz smooth.

The smoothness of fξ and gη naturally implies the smoothness of F. Zhang and Xiao [41, Lemma 4.2] show that Assumption 5 ensures F is: 1) CF -Lipschitz continuous with CF = Cf Cg; and 2) LF -Lipschitz smooth with LF = Lg Cf + C2 g Lf. We denote LF = ζg Cf + σg Cg Lf. Moreover, Assumption 5 also guarantees that fξ and gη have bounded gradients. In addition, fξ and gη are assumed to satisfy the following regularity condition in order to apply our extrapolation-based scheme from Section 2.

Assumption 6 (Regularity) For all ξ in the support of P(ξ), fξ is 4th-order differentiable with bounded derivatives (i.e., al := supg Rp (l)fξ(g) 2 < for l = 1, 2, 3, 4, x Rp) and gη has

bounded moments upto 4th-order (i.e., σk = supx Rd supξ Eη|ξ h Pp i=1 gη(x) Eη|ξ[gη(x)] k i

i < , k = 1, 2, 3, 4).

Remark 2 The core piece of Assumption 6 is the 4th order differentiability of fξ as other parts can be easily satisfied through appropriate boundedness assumptions. This condition though is satisfied by common instantiations of CSO/FCCO. We discuss some examples including invariant logistic regression, instrumental variable regression, first-order MAML for sine-wave few-shot regression task, deep average precision maximization in Section 5. Therefore, our improvements in sample complexity apply to all these problems.

Consider some time t > 0. Let Gt+1 be a stochastic estimate of F(xt) where xt is the current iterate. The next iterate xt+1 := xt γGt. Let E[ ] denote the conditional expectation, where we condition on all the randomness until time t. We consider the bias and variance terms coming from our gradient estimate. Formally, we define the following two quantities

Et+1 bias = F(xt) E[Gt+1] 2 2 , Et+1 var = E[ Gt+1 E[Gt+1] 2 2].

Our idea of getting to an ε-stationary point (Definition 1) will be to ensure that Et+1 bias and Et+1 var are bounded. The main technical component of our analyses is in fact analyzing these bias and variance terms for the various gradient estimates considered. For this purpose, we first analyze the bias and variance terms for the (original) BSGD (Lemma 5) and BSpider Boost (Lemma 7) algorithms, which are then used to get the corresponding bounds for our E-BSGD (Lemma 6) and E-BSpider Boost (Lemma 8) algorithms. Through these bias and variance bounds, we establish the following main results of this section.

Theorem 3 [E-BSGD Convergence] Consider the (CSO) problem. Suppose Assumptions 3, 4, 5, 6 hold true and LF , CF , LF , Cg, F are constants and Ce(f; g) := 8a3σ3+18a4σ2 2+5a4σ4 96 defined in Corollary 1 are associated with second order extrapolation in the CSO problem. Let step size γ 1/(2LF ). Then the output xs of E-BSGD (Algorithm 2) satisfies: E[ F(xs) 2 2] ε2, for nonconvex F, if the inner batch size m = Ω(Ce Cgε 1/2) , and the number of iterations

T = Ω(LF (F(x0) F )( L2 F/m + C2 F )ε 4).

The E-BSGD takes O(ε 4) iterations to converge and compute O(ε 0.5) gradients per iteration. Therefore, its resulting sample complexity is O(ε 4.5) which is more efficient than O(ε 6) of BSGD. Similar improvements can be observed for E-BSpider Boost in Theorem 4.

Theorem 4 [E-BSpider Boost Convergence] Consider the (CSO) problem under the same assumptions as Theorem 3. Let step size γ 1/(13LF ). Then the output xs of E-BSpider Boost (Algorithm 3)

satisfies: E[ F(xs) 2 2] ε2, for nonconvex F, if the inner batch size m = O(Ce Cgε 0.5), the hyperparameters of the outer loop of E-BSpider Boost B1 = ( L2 F /m + C2 F )ε 2, B2 = B1, pout = 1/B2, and the number of iterations T = Ω(LF (F(x0) F )ε 2).

The resulting sample complexity of E-BSpider Boost is O(ε 3.5), which improves O(ε 5) bound of BSpider Boost [16] and O(ε 4) bound of V-MLMC/RT-MLMC [17].

4 Applying Stochastic Extrapolation in the FCCO Problem

In this section, we apply the extrapolation-based scheme from Section 2 to the FCCO problem. We focus on case where n = O(ε 2). For larger n, we can treat the FCCO problem as a CSO problem and get an O(ε 3.5) bound from Theorem 4. All missing details are presented in Appendix E.

Now, a straightforward algorithm for FCCO is to use the finite-sum variant of Spider Boost (or SPIDER) [10, 38] in Algorithm 3. In this case, if we choose the outer batch sizes to be B1 = n, B2 = n and the inner batch size to be m = max{ε 2/n, ε 1/2}. The resulting sample complexity of E-BSpider Boost now becomes, O(max{ n/ε2.5, 1/ nε4}), which recovers O(ε 3.5) bound as in Theorem 4 for n = Θ(ε 2). However, when n is small, such as n = O(1), the sample complexity degenerates to O(ε 4) which is worse than the Ω(ε 3) lower bound of stochastic optimization [3]. We leave the details to Theorem 8. We still use Assumptions 3, 4, 5, 6 for the analysis of FCCO problem, replacing the role of ξ with i. Algorithm 1 E-Nested VR

1: Input: x0 Rd, step-size γ, batch sizes S1, S2, B1, B2, Probability pin, pout 2: for t = 0, 1, . . . , T 1 do 3: if (t = 0) or (with prob. pout) then Large outer batch 4: for i B1 [n] with |B1| = B1 do 5: draw yt+1 i from distribution Dt+1 y,i defined in (10)

6: compute zt+1 i using (11) and define ϕt i = xt

7: end for 8: Gt+1 E-NVR = 1 B1 P

i B1(zt+1 i ) L(2) Dt+1 y,i fi(0)

9: else Small outer batch 10: for i B2 with |B2| = B2 do 11: draw yt+1 i and yt i from distribution Dt+1 y,i and Dt y,i defined in (10)

12: compute zt+1 i using (11) and define ϕt i = xt

13: end for 14: Gt+1 E-NVR = Gt E-NVR + 1 B2 P

i B2(zt+1 i ) (L(2) Dt+1 y,i fi(0) L(2) Dt y,i fi(0))

15: end if 16: xt+1 = xt γGt+1 E-NVR 17: end for 18: Output: xs picked uniformly at random from {xt}T 1 t=0

Extrapolated Nested VR. We now introduce a nested variance reduction algorithm E-Nested VR which reaches low sample complexity for all choices of n. Missing proofs from this section are presented in Appendix E. For the stochasticities in the FCCO problem, our idea is to use two nested Spider Boost variance reduction components: one for the outer random variable i and the other for the inner random variable η|i. In each outer (resp. inner) Spider Boost step, we choose large batch B1 (resp. S1) with probability pout (resp. pin); otherwise we choose small batch. Let Hi denote a set of m i.i.d. samples drawn from the conditional distribution P(η|i). Similarly, let Hi denote another set of m i.i.d. samples drawn from the same conditional distribution. For each given i, we approximate Eη|i[gη(xt)] with yt+1 i from distribution Dt+1 y,i where,

η Hi gη(xt) with prob. pin or t = 0 yt i + 1 S2 P

η Hi(gη(xt) gη(ϕt i)) with prob. 1 pin. (10)

Similarly, we approximate E η|i[ g η(xt)] with zt+1 i defined as follows

η Hi g η(xt) with prob. pin or t = 0 zt i + 1 S2 P

η Hi( g η(xt) g η(ϕt i)) with prob. 1 pin, (11)

0 2e5 4e5 10

Sub-optimality

BSGD E-BSGD

0 2e5 4e5 # Samples

BSpider Boost E-BSpider Boost

Nested VR E-Nested VR

(a) CSO: Large n = 50000

0 1e6 2e6 # Samples

Sub-optimality

E-BSGD E-BSpider Boost E-Nested VR V-MLMC

(b) FCCO: small n = 50 Figure 2: Performances of algorithms and their extrapolated versions on the invariant logistic regression task. Algorithms in each subplot use the same amount of inner batch size in each iteration. The shaded region represents the 95%-confidence interval computed over 10 runs.

where ϕt i is the last time i is visited before time t. If i is not selected at time t, then yt+1 i = yt i and zt+1 i = zt i. Note that we use independent samples for yt+1 i and zt+1 i .

Finally, we present E-Nested VR in Algorithm 1 where second-order extrapolation operator L(2) is applied to each occurrence of fi. We now analyze its convergence guarantee. Our analysis works by first looking at the effect of multi-level variance reduction without the extrapolation (that we refer to as Nested VR, Theorem 10, Appendix E.2), and then showing how extrapolation could further help to drive down the sample complexity.

Theorem 5 [E-Nested VR Convergence] Consider the (FCCO) problem. Under the same assumptions as Theorem 3.

If n = O(ε 2/3), then we choose the hyperaparameters of E-Nested VR (Algorithm 1) as B1 = B2 = n, pout = 1, S1 = L2 F ε 2, S2 = LF ε 1, pin = L 1 F ε, γ = O( 1

If n = Ω(ε 2/3), then we choose the hyperaparameters of E-Nested VR as B1 = n, B2 = n, pout = 1/ n, S1 = S2 = max n Ce Cgε 1/2, L2 F /(nε2) o , pin = 1, γ = O( 1

Then the output xs of E-Nested VR satisfies: E[ F(xs) 2 2] ε2, for nonconvex F with iterations

T = Ω LF (F(x0) F )ε 2 .

From Theorem 5, E-Nested VR has a sample complexity of O(nε 3) in the small n regime (n = O(ε 2/3)) and O(max{ n/ε2.5, 1/ nε4}) in the large n regime (n = Ω(ε 2/3)). Therefore, in the large n regime, this improves the O(nε 3) sample complexity of MSVR-V2 [19].

5 Applications

In this section, we demonstrate the numerical performance of our proposed algorithms. We focus on the application of invariant logistic regression here. In Appendix F, we discuss performance of our proposed algorithms on other common CSO/FCCO applications.

5.1 Application of Invariant Risk Minimization Invariant learning has wide applications in machine learning and related areas [22, 1]. Invariant logistic regression [16] is formulated as follows:

min x Eξ=(a,b)[log(1 + exp( b Eη|ξ[η] x)],

where a and b represent a sample and its corresponding label, and η is a noisy observation of the sample a. This first part can be considered as a CSO objective, with fξ(y) := log(1+exp( by)) and gη(x; ξ) := η x. As the loss fξ C is smooth, our results from Sections 3 and 4 are applicable.

An ℓ2-regularizer is added to ensure the existence of an unique minimizer. Since the gradient of the penalization term is unbiased, we only have to consider the bias of the data-dependent term. We generate a synthetic dataset with d = 10 dimensions. The minimizer is drawn from Gaussian distribution x N(0, 1) Rd. We draw invariant samples {(ai, bi)}i where ai N(0, 1) Rd

and compute bi = sgn(a i x ) and its perturbed observation η N(ai, 100) Rd.

We consider drawing ξ from a large set (n = 50000) and a small set (n = 50) as CSO and FCCO problems respectively. As baselines, we implemented the BSGD and BSpider Boost methods

0.0 0.5 1.0 1e6

Sub-optimality

BSGD E-BSGD

0.0 0.5 1.0 # Samples 1e6

BSpider Boost E-BSpider Boost

0.0 0.5 1.0 1e6

Nested VR E-Nested VR

(a) CSO: Large n = 10000

0.0 0.5 1.0 # Samples 1e6

Sub-optimality

E-BSGD E-BSpider Boost E-Nested VR V-MLMC

(b) FCCO: small n = 50 Figure 3: Performances of algorithms and their extrapolated versions on the instrumental variable regression task. The shaded region represents the 95%-confidence interval.

from [16], V-MLMC approach from [17], and Nested VR approach from Appendix E.2 which achieves the same complexity as MSVR-V2 [19] for the FCCO problem. RT-MLMC shares the same sample complexity as V-MLMC, and is thus omitted from the experiment [17]. The results are shown in Fig. 2. In the CSO setting, we compare biased gradient methods with their extrapolated variants (BSGD vs. E-BSGD, BSpider Boost vs. E-BSpider Boost, and Nested VR vs. E-Nested VR). The extrapolated versions of BSGD, BSpider Boost, and Nested VR consistently reach lower error than their non-extrapolated counterparts, as is evident in Figure 2a. In this case, the performance of BSpider Boost is similar to BSGD as also noted by the authors of these techniques [16], and a drawback of BSpider Boost seems to be that it is much harder to tune in practice. However, it is clear that E-BSGD outperforms BSGD, and E-BSpider Boost outperforms BSpider Boost, respectively. In the FCCO setting, we compare extrapolation based methods and MLMC based methods. Figure 2a, shows that E-Nested VR outperforms all other extrapolated algorithms, including the V-MLMC approach of [17], matching our theoretical findings.

5.2 Application of Instrumental Variable Regression Instrumental variable regression is a popular technique in econometrics that aims to estimate the causal effect of input X on a confounded outcome Y with an instrument variable Z, which, for a given X, is conditionally independent of Y . As noted by [23], the instrumental variable regression is a special case of the CSO problem. The instrumental variable regression problem is phrased as:

min w Eξ=(Y,Z) ℓ(Y, EX|Z[g X(w)])

where ξ = (Y, Z), η = X. This can be viewed in the CSO framework with fξ( ) = ℓ(Y, ). We choose ℓ(y, ˆy) = log cosh(y ˆy) as regression loss function and g X(w) = w X to be a linear regressor. In this case, fξ C with fξ(ˆy) = tanh(ˆy Y ), and our results from Sections 3 and 4 apply. We generate the data similar to [31]

Z Uniform([ 3, 3]2), e N(0, 1), δ N(0, 0.1), γ Exponential(10)

2e + γ, Y = X + e + δ where z1 is the first dimension of Z, e is the confounding variable and δ, γ are noises. In this experiment, we solve the instrumental variable regression using BSGD, BSpider Boost, Nested VR and their extrapolated variants described in Sections 3 and 4. In each pair of experiments, the samples used per iteration are fixed same, i.e.: 1) BSGD uses m = 2 and E-BSGD uses m = 1; 2) For BSpider Boost and E-BSpider Boost, we use cycle length of 10, small batch and large batch in Spider to be 10 and 100 respectively, and we choose inner batch sizes m = 2 for BSpider Boost and m = 1 for E-BSpider Boost; 3) For Nested VR and E-Nested VR, we fix the outer batch size to 10 and 5 respectively, and choose fix the inner Spider Cycle to be 10 with large batch 100 and small batch 10. The results are presented in Figure 3. As is quite evident, the extrapolation variants achieve faster convergence in all 3 cases, confirming our theoretical findings.

6 Concluding Remarks

In this paper, we consider the conditional stochastic optimization CSO problem and its finite-sum variant FCCO. Due to the interplay between nested structure and stochasticity, most of the existing gradient estimates suffer from large biases and have large sample complexity of O(ε 5). We propose stochastic extrapolation-based algorithms that tackle this bias problem and improve the sample complexities for both these problems. While we focus on nonconvex objectives, our proposed algorithms can also be beneficial when used with strongly convex, convex objectives. We also believe that similar ideas could also prove helpful for multi-level stochastic optimization problems [41] with nested dependency.

Acknowledgements

We would like to thank Caner Turkmen, Sai Praneeth Karimireddy, and Martin Jaggi for helpful initial discussions surrounding this project.

[1] Fabio Anselmi, Joel Z Leibo, Lorenzo Rosasco, Jim Mutch, Andrea Tacchetti, and Tomaso Poggio. Unsupervised learning of invariant representations. Theoretical Computer Science, 633:112 121, 2016.

[2] Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Ayush Sekhari, and Karthik Sridharan. Second-order information in non-convex stochastic optimization: Power and limitations. In Jacob D. Abernethy and Shivani Agarwal, editors, Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], volume 125 of Proceedings of Machine Learning Research, pages 242 299. PMLR, 2020. URL http: //proceedings.mlr.press/v125/arjevani20a.html.

[3] Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming, pages 1 50, 2022.

[4] Sébastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, and Aaron Sidford. Near-optimal method for highly smooth convex optimization. In Alina Beygelzimer and Daniel Hsu, editors, Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, volume 99 of Proceedings of Machine Learning Research, pages 492 507. PMLR, 2019. URL http: //proceedings.mlr.press/v99/bubeck19a.html.

[5] Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual embeddings. In Aarti Singh and Xiaojin (Jerry) Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, volume 54 of Proceedings of Machine Learning Research, pages 1458 1467. PMLR, 2017. URL http://proceedings.mlr.press/v54/ dai17a.html.

[6] Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: convergent reinforcement learning with nonlinear function approximation. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 1133 1142. PMLR, 2018. URL http://proceedings.mlr.press/v80/dai18c.html.

[7] Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 1646 1654, 2014. URL https://proceedings.neurips.cc/paper/2014/hash/ ede7e2b6d13a41ddf9f4bdef84fdc737-Abstract.html.

[8] Bradley Efron. Bootstrap methods: another look at the jackknife. Springer, 1992.

[9] Yuri M Ermoliev and Vladimir I Norkin. Sample average approximation method for compound stochastic optimization problems. SIAM Journal on Optimization, 23(4):2231 2263, 2013.

[10] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: near-optimal nonconvex optimization via stochastic path-integrated differential estimator. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS 2018, December 3-8, 2018, Montréal, Canada, pages 687 697, 2018. URL https://proceedings.neurips.cc/paper/2018/ hash/1543843a4723ed2ab08e18053ae6dc5b-Abstract.html.

[11] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1126 1135. PMLR, 2017. URL http://proceedings.mlr.press/v70/finn17a.html.

[12] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341 2368, 2013.

[13] Takashi Goda and Wataru Kitade. Constructing unbiased gradient estimators with finite variance for conditional stochastic optimization. Ar Xiv preprint, abs/2206.01991, 2022. URL https: //arxiv.org/abs/2206.01991.

[14] Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Minimax estimation of divergences between discrete distributions. IEEE Journal on Selected Areas in Information Theory, 1(3):814 823, 2020.

[15] Yifan Hu, Xin Chen, and Niao He. Sample complexity of sample average approximation for conditional stochastic optimization. SIAM Journal on Optimization, 30(3):2103 2133, 2020.

[16] Yifan Hu, Siqi Zhang, Xin Chen, and Niao He. Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 612, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/ 1cdf14d1e3699d61d237cf76ce1c2dca-Abstract.html.

[17] Yifan Hu, Xin Chen, and Niao He. On the bias-variance-cost tradeoff of stochastic optimization. Advances in Neural Information Processing Systems, 34:22119 22131, 2021.

[18] Prateek Jain, Purushottam Kar, et al. Non-convex optimization for machine learning. Foundations and Trends in Machine Learning, 10(3-4):142 363, 2017.

[19] Wei Jiang, Gang Li, Yibo Wang, Lijun Zhang, and Tianbao Yang. Multi-block-single-probe variance reduced estimator for coupled compositional optimization. Ar Xiv preprint, abs/2207.08540, 2022. URL https://arxiv.org/abs/2207.08540.

[20] Jiantao Jiao and Yanjun Han. Bias correction with jackknife, bootstrap, and taylor series. IEEE Transactions on Information Theory, 66(7):4392 4418, 2020.

[21] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 315 323, 2013. URL https://proceedings.neurips.cc/paper/2013/hash/ ac1dd209cbcc5e5d1c6e28598e8cbbe8-Abstract.html.

[22] Youssef Mroueh, Stephen Voinea, and Tomaso A. Poggio. Learning with group invariant features: A kernel perspective. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1558 1566, 2015. URL https://proceedings.neurips. cc/paper/2015/hash/6602294be910b1e3c4571bd98c4d5484-Abstract.html.

[23] Krikamol Muandet, Arash Mehrjou, Si Kai Lee, and Anant Raj. Dual instrumental variable regression. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 612, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/ 1c383cd30b7c298ab50293adfecb7b18-Abstract.html.

[24] Ofir Nachum and Bo Dai. Reinforcement learning via fenchel-rockafellar duality. Ar Xiv preprint, abs/2001.01866, 2020. URL https://arxiv.org/abs/2001.01866.

[25] Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takác. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 2613 2621. PMLR, 2017. URL http://proceedings.mlr.press/ v70/nguyen17b.html.

[26] Qi Qi, Youzhi Luo, Zhao Xu, Shuiwang Ji, and Tianbao Yang. Stochastic optimization of areas under precision-recall curves with provable convergence. In Marc Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pages 1752 1765, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/ 0dd1bc593a91620daecf7723d2235624-Abstract.html.

[27] Qi Qi, Youzhi Luo, Zhao Xu, Shuiwang Ji, and Tianbao Yang. Stochastic optimization of areas under precision-recall curves with provable convergence. Advances in Neural Information Processing Systems, 34:1752 1765, 2021.

[28] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alexander J. Smola. Stochastic variance reduction for nonconvex optimization. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 314 323. JMLR.org, 2016. URL http://proceedings.mlr.press/v48/reddi16.html.

[29] Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Fast incremental method for nonconvex optimization. Ar Xiv preprint, abs/1603.06159, 2016. URL https://arxiv.org/ abs/1603.06159.

[30] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1):83 112, 2017.

[31] Rahul Singh, Maneesh Sahani, and Arthur Gretton. Kernel instrumental variable regression. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 4595 4607, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/ 17b3c7061788dbe82de5abe9f6fe22b3-Abstract.html.

[32] John Tukey. Bias and confidence in not quite large samples. Ann. Math. Statist., 29:614, 1958.

[33] Bokun Wang and Tianbao Yang. Finite-sum coupled compositional stochastic optimization: Theory and applications. In International Conference on Machine Learning, pages 23292 23317. PMLR, 2022.

[34] Guanghui Wang, Ming Yang, Lijun Zhang, and Tianbao Yang. Momentum accelerates the convergence of stochastic AUPRC maximization. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, International Conference on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022, Virtual Event, volume 151 of Proceedings of Machine Learning Research, pages 3753 3771. PMLR, 2022. URL https://proceedings.mlr. press/v151/wang22b.html.

[35] Guanghui Wang, Ming Yang, Lijun Zhang, and Tianbao Yang. Momentum accelerates the convergence of stochastic auprc maximization. In International Conference on Artificial Intelligence and Statistics, pages 3753 3771. PMLR, 2022.

[36] Mengdi Wang, Ji Liu, and Ethan X. Fang. Accelerating stochastic composition optimization. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 1714 1722, 2016. URL https://proceedings.neurips.cc/paper/2016/hash/ 92262bf907af914b95a0fc33c3f33bf6-Abstract.html.

[37] Mengdi Wang, Ethan X Fang, and Han Liu. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161:419 449, 2017.

[38] Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. Spiderboost and momentum: Faster variance reduction algorithms. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 2403 2413, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/ 512c5cad6c37edb98ae91c8a76c3a291-Abstract.html.

[39] Christopher Stroude Withers. Bias reduction by taylor series. Communications in Statistics Theory and Methods, 16(8):2369 2383, 1987.

[40] Yu M Yermol yev. A general stochastic programming problem. 1971.

[41] Junyu Zhang and Lin Xiao. Multilevel composite stochastic optimization via nested variance reduction. SIAM Journal on Optimization, 31(2):1131 1157, 2021.

A Missing Pseudocodes 15

B Missing Details from Section 1 15

B.1 Other Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

C Missing Details from Section 2 17

D Stationary Point Convergence Proofs from Section 3 (CSO) 21

D.1 Helpful Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

D.2 Convergence of BSGD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

D.3 Convergence of E-BSGD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

D.4 Convergence of BSpider Boost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

D.5 Convergence of E-BSpider Boost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

E Stationary Point Convergence Proofs from Section 4 (FCCO) 31

E.1 E-BSpider Boost for FCCO problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

E.2 Convergence of Nested VR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

E.3 Convergence of E-Nested VR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

F Missing Details from Section 5 46

F.1 Application of First-order MAML . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

F.2 Application of Deep Average Precision Maximization . . . . . . . . . . . . . . . . . . 47

F.3 Necessity of Additional Smoothness Conditions . . . . . . . . . . . . . . . . . . . . . 47

A Missing Pseudocodes

We present pseudocodes of E-BSGD and E-BSpider Boost scheme in Algorithms 2 and 3 respectively.

Algorithm 2 E-BSGD

1: Input: x0 Rd, step-size γ, batch sizes m 2: for t = 0, 1, . . . , T 1 do 3: Draw one sample ξ and compute extrapolated gradient Gt+1 E-BSGD from (7) 4: xt+1 xt γGt+1 E-BSGD 5: end for 6: Output: xs picked uniformly at random from {xt}T 1 t=0

B Missing Details from Section 1

B.1 Other Related Work

CSO. Dai et al. [5] proposed a primal-dual stochastic approximation algorithm to solve a min-max reformulation of CSO, employing the kernel embedding techniques. However, this method requires convexity of fξ and linearity of gη, which are not satisfied by general applications when neural networks are involved. Goda and Kitade [13] showed that a special class of CSO problems can be

Algorithm 3 E-BSpider Boost

1: Input: x0 Rd, step-size γ, batch sizes B1, B2, Probability pout 2: for t = 0, 1, . . . , T 1 do 3: Draw χout from Bernoulli(pout) 4: if (t = 0) or (χout = 1) then Large batch 5: Draw B1 outer samples {ξ1, . . . , ξB1} 6: Compute extrapolated gradient Gt+1 E-BSGD with (7)

Gt+1 E-BSB = 1 B1 P ξ B1 Gt+1 E-BSGD

7: else Small batch 8: Draw B2 outer samples {ξ1, . . . , ξB2} 9: Compute extrapolated gradient Gt+1 E-BSGD with (7)

Gt+1 E-BSB = Gt E-BSB + 1 B2 P

ξ B2(Gt+1 E-BSGD Gt E-BSGD)

10: end if 11: xt+1 = xt γGt+1 E-BSB 12: end for 13: Output: xs picked uniformly at random from {xt}T 1 t=0

unbiased, e.g., when fξ measures the squared error between some u(ξ) and Eη|ξ[gη(x; ξ], giving rise to this objective function Eξ[(u(ξ) Eη|ξ[gη(x; ξ])2]. However, they did not show any improvement over the sample complexity of BSGD (i.e., O(ε 6)). Hu et al. [16] also analyzed lower bounds on the minimax error for the CSO problem and showed that for a specific class of biased gradients with O(ε) bias (same bias as BSGD) and variance O(1) the bound achieved by BSpider Boost is tight. However, these lower bounds are not applicable in settings such as ours (and also to [17]) where the bias is smaller than the BSGD bias.

Variance Reduction. The reduction of variance in stochastic optimization is a crucial approach to decrease sample complexity, particularly when dealing with finite-sum formulations of the form minx 1

n Pn i=1 fi(x). Pioneering works such as Stochastic Average Gradient (SAG) [30], Stochastic Variance Reduced Gradient (SVRG) [21, 28], and SAGA [7, 29] improved the iteration complexity from O(ε 4) in Stochastic Gradient Descent (SGD) to O(ε 2). Subsequent research, including Stochastic Path-Integrated Differential Estimator (SPIDER) [10] and Stochastic Recursive Gradient Algorithm (SARAH) [25], expanded the application of these techniques to both finite-sum and online scenarios, where n is large or possibly infinite. These methods boast an improved sample complexity of min( nε 2, ε 3). Spider Boost [38], achieves the same near-optimal complexity performance as SPIDER, but allows a much larger step size and hence runs faster in practice than SPIDER. In this paper, we use a probabilistic variant of Spider Boost as the variance reduction module for CSO and FCCO problems. We highlight that alternative techniques, such as SARAH, can also be applied and offer similar guarantees.

Bias Correction. One of the classic problems in statistics is to design procedures to reduce the bias of estimators. Well-established general bias correction techniques, such as the jackknife [32], bootstrap [8], Taylor series [39, 14], have been extensively studied and applied in various contexts [20]. However, these methods are predominantly examined in relation to standard statistical distributions, with limited emphasis on their adaptability to optimization problems. Our proposed extrapolationbased approach is derived from sample-splitting methods [14], specifically tailored and analyzed for optimization problems involving unknown distributions.

Stochastic Composition Optimization. Finally, a closely related class of problems, called stochastic composition optimization, has been extensively studied (e.g., [40, 9, 36, 37]) in the literature where the goal is:

min x Rd Eξ[fξ(Eη[gη(x)])]. (12)

Despite having nested expectations in their formulations (CSO) and (12) are fundamentally different: a) in stochastic composite optimization the inner randomness η is conditionally dependent on the outer randomness ξ and b) in CSO the inner random function gη(x, ξ) depends on both ξ and η. These differences lead to quite different sample complexity bounds for these problems, as explored in Hu et al. [15]. In fact, Zhang and Xiao [41] presented a near optimal complexity of O(min(ε 3, nε 2))

for stochastic composite optimization problems using nested variance reduction. While Wang et al. [36] also use the "extrapolation" technique, their motivation and formula are significantly different from ours and cannot reduce the bias in the CSO problem.

C Missing Details from Section 2

Lemma 1 (Moments of Dm) The moments of δ Dm are bounded as follows

E[(δ E[δ])2] = σ2

m , | E[(δ E[δ])3]| = σ3

m2 , E[(δ E[δ])4] = σ4

m3 + 3(m 1)σ2 2 m3 .

More generally, for k 2, | E[(δ E[δ])k]| = O(m k/2 ).

Proof: Define ˆδ = δ E[δ] as the centered random variable. Now

E[(δ E[δ])k] = E[ˆδk].

So we focus on E[ˆδk] in the remainder of the proof. For k = 2,

| E[ˆδ2]| = 1 m2 | E[Pm i=1 ˆδi]2| = 1 m2 E h P i ˆδ2 i + 2 P i<j ˆδiˆδj i = σ2

| E[ˆδ3]| = 1 m3 | E[Pm i=1 ˆδi]3|

= 1 m3 E h P i ˆδ3 i + 3 P i =j ˆδ2 i ˆδj + 6 P i<j<k ˆδiˆδjˆδk i

m2 . For k = 4,

| E[ˆδ4]| = 1 m4 | E[Pm i=1 ˆδi]4|

= 1 m4 E h P

i ˆδ4 i + 4 P

i =j ˆδ3 i ˆδj + 6 P

i<j ˆδ2 i ˆδ2 j + 24 P

i<j<k<l ˆδiˆδjˆδkˆδl i

= 1 m4 m E[ˆδ4 i ] + 6 m(m 1)

2 E[ˆδ2 i ] E[ˆδ2 j ]

m3 + 3(m 1)σ2 2 m3 . For k = 5,

| E[ˆδ5]| = 1 m5 | E[Pm i=1 ˆδi]5|

= 1 m5 E h P

i ˆδ5 i + 10 P

i =j ˆδ3 i ˆδ2 j i

= 1 m5 m E[ˆδ5 i ] + 10m(m 1) E[ˆδ3 i ] E[ˆδ2 j ]

m4 + 10(m 1)σ3σ2

m4 . For general k > 0, we expand the following term as a function of m

| E[ˆδk]| = 1 mk | E[Pm i=1 ˆδi]k|.

As E[ˆδi] = 0 and ˆδi and ˆδj are independent for different i and j, the outcome has the following form

| E[ˆδk]| = 1 mk O

2a2+3a3+ +kak=k ai 0 i

m Pk i=2 aiσa2 2 σa3 3 σak k

where Pk i=2 ai is the count of independent {ˆδi} used in σa2 2 σa3 3 σa4 4 . Among the terms in (13), the dominating one in terms of m is one with largest Pk i=2 ai, i.e.

| E[ˆδk]| =

( 1 mk O(mk/2)σk/2 2 if k even, 1 mk O(m k/2 )σ k/2 1 2 σ3 if k odd.

Then, we can simplify the upper right-hand side with

| E[ˆδk]| = O(m k+ k/2 ), which gives all the desired results.

Proposition 2 (First-order Guarantee) Assume that Dm and q( ) satisfy Assumption 1 and 2 re-

spectively with k = 1. Then, s R, E h L(1) Dmq(s) i q(s + E[δ]) a2σ2/(2m).

Proof: Let h = E[δ]. If the function q C2, then the Taylor expansion at s + h with remainders leads to

E[q(s + δ)] = q(s + h) + q (s + h) E[δ h] + 1

2 E[q (ϕ1)(δ h)2]

where ϕ1 between s + h and s + δ. Then the error of extrapolation becomes

| E[q(s + δ)] q(s + h)| = 1

2 E[q (ϕ1)(δ h)2] a2

2 E[(δ h)2].

By Assumption 2 and Lemma 1, we have that

| E[q(s + δ)] q(s + h)| a2

2 E[(δ h)2] = a2

2 E[(δ h)2] = a2σ2

This completes the proof.

Proposition 1 (Second-order Guarantee) Assume that distribution Dm and q( ) satisfies Assumption 1 and 2 respectively with k = 2. Then, for all s R, E h L(2) Dmq(s) i q(s + E[δ])

4a3σ3+9a4σ2 2 48m2 + 5a4

96 σ4 3σ2 2 m3 .

Proof: Let h = E[δ]. If the function q C4, then the Taylor expansion at s + h with remainders leads to

E[q(s + δ1)] =q(s + h) + q (s + h) E[δ1 h] + q (s+h)

2 E[(δ1 h)2] + q(3)(s+h)

6 E[(δ1 h)3]

24 E[q(4)(ϕ1)(δ1 h)4]

E[q(s + δ2)] =q(s + h) + q (s + h) E[δ2 h] + q (s+h)

2 E[(δ2 h)2] + q(3)(s+h)

6 E[(δ2 h)3]

24 E[q(4)(ϕ2)(δ2 h)4]

E[q(s + δ1+δ2

2 )] =q(s + h) + q (s + h) E[ δ1+δ2

2 h] + q (s+h)

2 E[( δ1+δ2

+ q(3)(s+h)

24 E[q(4)(ϕ3) δ1+δ2

where ϕ1, ϕ2, ϕ3 between s + h and s + δ1, s + δ2, s + δ3 respectively.

As E[δ h] = 0, the error of extrapolation becomes

| E[L2 Dmq(s)] q(s + h)|

2 E h q(3)(s+h)

2 E[ q(3)(s+h)

6 (δ1 h)3] + E[ q(3)(s+h)

+ 2 E h q(4)(ϕ3)

2 E[ q(4)(ϕ1)

24 (δ1 h)4] + E[ q(4)(ϕ2)

24 (δ2 h)4]

6 2 E h δ1+δ2

2 E[(δ1 h)3] + E[(δ2 h)3]

+ 2 E h q(4)(ϕ3)

2 E[ q(4)(ϕ1)

24 (δ1 h)4] + E[ q(4)(ϕ2)

24 (δ2 h)4]

6 2 E h δ1+δ2

2 E[(δ1 h)3] + E[(δ2 h)3]

+ 2 E h |q(4)(ϕ3)|

2 E[ |q(4)(ϕ1)|

24 (δ1 h)4] + E[ |q(4)(ϕ2)|

24 (δ2 h)4]

6 2 E h δ1+δ2

2 E[(δ1 h)3] + E[(δ2 h)3]

24 2 E h δ1+δ2

2 E[(δ1 h)4] + E[(δ2 h)4] .

where the second inequality uses the upper bound on q(3)( ) (Assumption 2) and the third inequality uses (δ h)4 is non-negative and the last inequality uses the uniform bound on q(4)( ) (Assumption 2).

| E[L2 Dmq(s)] q(s + h)|

12| E(δ1 h)3| + a4

24 2 E δ1+δ2

2 h 4 + E(δ1 h)4

24 σ4 4m3 + 3(2m 1)σ2 2 4m3 + σ4

m3 + 3(m 1)σ2 2 m3

24 9σ2 2 2m2 + 5(σ4 3σ2 2) 4m3

4a3σ3+9a4σ2 2 48m2 + 5a4

96 σ4 3σ2 2 m3 .

we first use that E[(δ1 h)3] = E[(δ2 h)3] = 4 E[( δ1+δ2

2 h)3] and the uses the bound on moments

in Lemma 1. Note that E δ1+δ2

2 h 4 can be seen as the 4th order moments of a batch size of 2m.

Proposition 3 Assume q C6. Then L(3) Dm as defined below is a third-order extrapolation operator.

L(3) Dmq : s 7 ( 1

36L(2) Dm + 5

9L(2) D2m 3

4L(2) D3m 16

9 L(2) D4m + 3L(2) D6m)q(s).

Proof: Let h = E[δ]. If q C2k, then q has the following Taylor expansion

E[q(s + δ)] = q(s + h) | {z } zero order term

+q (s + h) E[δ h] + q (s+h)

2 E[(δ h)2] | {z } second order term

+ q(2k 1)(s+h)

(2k 1)! E[(δ h)2k 1] + 1 2k! E[q(2k)(ϕ)(δ h)2k].

Eliminate the third order term in the Taylor expansion. Consider the following affine combination which

F(3) Dmq : s 7 α1L(2) Dmq(s) + α2L(2) D2mq (s) .

We determine α1 and α2 by expanding L(2) Dmq(s) and L(2) D2mq(s) and analyze the coefficients of terms:

(Affine). Taylor expansion of F(3) Dmq(s) at s + h should have zero order term q(s + h), i.e.

α1q(s + h) + α2q(s + h) = q(s + h).

(Eliminate third term). Taylor expansion of F(3) Dmq(s) at s + h should have third order term E[(δ h)3]. That is,

α1 E[(δ1 h)3] + α2 E h δ1+δ2

2 h 3i = 0.

This is equivalent to

α1 E[(δ1 h)3] + α2

4 E h (δ1 h)3i = 0.

Therefore, α1 and α2 can be determined through the following linear system

α1 + α2 = 1

The solution is α1 = 1

3 and α2 = 4

For k = 3 order extrapolation, consider the following

L(3) Dmq : s 7 α 1F(3) Dmq(s) + α 2F(3) D2mq (s) + α 3F(3) D3mq (s) .

We determine α 1, α 2 and α 3 by satisfying the following two conditions

(Affine). Taylor expansion of L(3) Dmq(s) at s + h should have zero order term q(x + h), i.e.

(α 1 + α 2 + α 3)q(x + h) = q(x + h).

Taylor expansion of L(3) Dmq(s) at s + h should have 4th order term E[(δ h)4]. That is

α 1 E[(δ1 h)4] + α 2 E h δ1+δ2

2 h 4i + α 3 E h δ1+δ2+δ3

3 h 4i = 0.

This is equivalent to α 1 + α 2 8 + α 3 27 E[(δ1 h)4] = 0 3

9α 2 E[(δ1 h)2] 2 = 0.

Therefore, α 1, α 2 and α 3 can be determined through the following linear system

α 1 + α 2 + α 3 = 1

The solution is α 1 = 1 12, α 2 = 4

3 and α 3 = 9

4. Then consider the Taylor expansion of L(3) Dmq(s) at s + h with (2), we can

| E[L(3) Dmq(s)] q(s + h)| q(5)(s + h) E[(δ h)5] + E[q(6)(ϕδ)(δ h)6] O((a5 + a6)m 3)

where the first inequality uses the fact that L(3) Dm is an affine mapping and the last inequality uses

Lemma 1. Therefore, L(3) Dm is a 3rd-order extrapolation operator. We can expand it into

L(3) Dmq : s 7 1 12 1

3L(2) Dmq(s) + 4

3L(2) D2mq (s) 4

3L(2) D2mq(s) + 4

3L(2) D4mq (s)

3L(2) D3mq(s) + 4

3L(2) D6mq (s)

36L(2) Dm + 5

9L(2) D2m 3

4L(2) D3m 16

9 L(2) D4m + 3L(2) D6m)q(s).

Lemma 2 (Variance Bound) Assume that q : Rp Rℓis in C4 and Dm is the distribution in Assumption 1. Suppose that the variance of q(s + δ) is bounded as

E[ q(s + δ) E[q(s + δ)] 2 2] V 2

Then the variance of extrapolation L(2) Dmq(s) is upper bounded by

E L(2) Dmq(s) E[L(2) Dmq(s)] 2

Proof: Let us use the definition of L(2) Dmq(s):

E L(2) Dmq(s) E[L(2) Dmq(s)] 2

E 2q(s + δ1+δ2

2 ) q(s+δ1)+q(s+δ2)

2 E h 2q(s + δ1+δ2

2 ) q(s+δ1)+q(s+δ2)

3 E h 2q(s + δ1+δ2

2 ) E 2q(s + δ1+δ2

i + 3 E q(s+δ1)

2 E h q(s+δ1)

+ 3 E[ q(s+δ2)

2 E h q(s+δ2)

2m + C) + 3

This completes the proof.

D Stationary Point Convergence Proofs from Section 3 (CSO)

In this section, we provide the convergence proofs for the CSO problem. We start by establishing some helpful lemmas in Appendix D.1. In Appendix D.2, we reanalyze the BSGD algorithm to obtain explicit bias and variance bounds, which are then useful when we analyze E-BSGD in Appendix D.3. Similarly, we reanalyze BSpider Boost in Appendix D.4 and use the resulting bias and variance bounds for the analysis of E-BSpider Boost in Appendix D.5.

Note that throughout our analyses, we define Et+1[ |t] as the expectation of randomness at time t + 1 conditioning on the randomness until time t. When there is no ambiguity, we use E[ ] instead of Et+1[ |t].

D.1 Helpful Lemmas

Lemma 3 (Sufficient Decrease) Suppose Assumption 5 holds true and γ 1 2LF then

F(xt) 2 2 2(E[F (xt+1)] F (xt))

γ + LF γEt+1 var + Et+1 bias ,

where E[ ] denote conditional expectation over the randomness at time t conditioned on all of the past randomness until time t.

Proof: In this proof, we use E[ ] to denote conditional expectation over the randomness at time t conditioned on all the past randomness until time t.

Let us expand F(xt+1) and apply the LF -smoothness of F

E[F(xt+1)] F(xt) γ E[ F(xt), Gt+1 ] + LF γ2

2 E[ Gt+1 2 2].

Since E[ Gt+1 2 2] = E[ Gt+1 E[Gt+1] 2 2] + E[Gt+1] 2 2 = Et+1 var + E[Gt+1] 2 2, then

E[F(xt+1)] F(xt) γ E[ F(xt), Gt+1 ] + LF γ2

2 (Et+1 var + E[Gt+1] 2 2).

Expand the middle term with

γ E[ F(xt), Gt+1 ] = γ

2 F(xt) 2 2 γ

2 E[Gt+1] 2 2 + γ

2 F(xt) E[Gt+1] 2 2 = γ

2 F(xt) 2 2 γ

2 E[Gt+1] 2 2 + γ

2 Et+1 bias .

Combine with the inequality

E[F(xt+1)] F(xt) γ

2 F(xt) 2 2 γ

2 (1 LF γ) E[Gt+1] 2 2 + γ

2 Et+1 bias + LF γ2

2 Et+1 var .

By taking γ 1 2LF , we have that

E[F(xt+1)] F(xt) γ

2 F(xt) 2 2 γ

4 E[Gt+1] 2 2 + γ

2 Et+1 bias + LF γ2

2 Et+1 var .

Re-arranging the terms we get the desired inequality.

A consequence of Lemma 3 is the following result.

Lemma 4 (Descent Lemma) Suppose Assumption 5 holds true. By taking γ 1 2LF , we have,

1 T PT 1 t=0 E h F(xt) 2 2 i + 1

2 1 T PT 1 t=0 E h Et[Gt+1|t] 2 2

2(F (x0) F )

T PT 1 t=0 E[Et+1 bias ] + LF γ

T PT 1 t=0 E[Et+1 var ]

where the expectation is taken over all randomness from t = 0 to T.

Proof: We denote the conditional expectation at time t in the descent lemma (Lemma 3) as Et+1[ |t] which conditions on all past randomness until time t. Then the descent lemma can be written as

Et+1[F(xt+1)|t] F(xt) γ

2 F(xt) 2 2 γ

4 Et+1[Gt+1|t] 2 2 + γ

2 Et+1[Et+1 bias |t] + LF γ2

2 Et+1[Et+1 var |t].

If we additionally consider the randomness at time t 1, and apply Et[ |t 1] to both sides

Et Et+1[F(xt+1)|t]|t 1 Et[F(xt)|t 1] γ

2 E[ F(xt) 2 2 |t 1]

Et h γ 4 Et+1[Gt+1|t] 2 2 |t 1 i + γ

2 Et Et+1[Et+1 bias |t]|t 1

2 Et 1] Et+1[Et+1 var |t]|t 1 .

By the law of iterative expectations, we have Et Et+1[ |t]|t 1 = Et Et+1 [ |t 1]

Et[Et+1 F(xt+1)|t 1 ] Et[F(xt)|t 1] γ

2 E[ F(xt) 2 2 |t 1]

Et h γ 4 Et+1[Gt+1|t] 2 2 |t 1 i + γ

2 Et Et+1 Et+1 bias |t 1

2 Et[Et Et+1 var |t 1 ].

Similarly, we can apply Et 1[ |t 2], Et 2[ |t 3], . . ., E2[ |1] and finally E1[ ]

E1 . . . [Et+1 F(xt+1) ] E1 . . . [Et[F(xt)]] γ

2 E1 . . . [Et[ F(xt) 2 2]]

E1 . . . [Et[ γ

4 Et+1[Gt+1|t] 2 2]] + γ

2 E1 . . . [Et+1 Et+1 bias ]

2 E1 . . . [Et Et+1 var ].

Now that both sides of the inequality have no randomness, we can simplify the notation by applying Et+1 . . . [Et[ ]] to both sides and by denoting

E[ ] = E1 . . . [E+1[ ]].

Then the descent lemma becomes

E[F(xt+1)] E[F(xt)] γ

2 E[ F(xt) 2 2] γ

4 E[ Et+1[Gt+1|t] 2 2] + γ

2 E[Et+1 bias ] + LF γ2

2 E[Et+1 var ].

Now we can sum the descent lemmas from t = 0 to T 1 PT 1 t=0 E[F(xt+1)] PT 1 t=0 E[F(xt)] γ

2 PT 1 t=0 E h F(xt) 2 2 i

4 PT 1 t=0 E h Et+1[Gt+1|t] 2 2

2 PT 1 t=0 E[Et+1 bias ] + LF γ2

2 PT 1 t=0 E[Et+1 var ].

After simplification and division by T, we get

1 T PT 1 t=0 E h F(xt) 2 2 i + 1

2 1 T PT 1 t=0 E h Et+1[Gt+1|t] 2 2

2(E[F (x T )] E[F (x0)])

2 1 T PT 1 t=0 E[Et+1 bias ] + LF γ2

2 1 T PT 1 t=0 E[Et+1 var ]

2(E[F (x T )] F )

T PT 1 t=0 E[Et+1 bias ] + LF γ

T PT 1 t=0 E[Et+1 var ].

The following corollary is a consequence of Proposition 1.

Corollary 1 Assume fξ in CSO satisfies

al := sup x sup ξ

l+1fξ(x) 2 < , l = 1, 2, 3, 4.

Let s further assume that the higher order moments of gη( ) are bounded,

σk = sup x sup ξ Eη|ξ h Pp i=1 gη(x) Eη|ξ[gη(x)] k i

i < , k = 1, 2, 3, 4

where [ ]i refers to the i-th coordinate of a vector. Consider the L(2) Dt+1 g,ξ fξ(0) defined in (6), then

E L(2) Dt+1 g,ξ fξ(0) fξ(Eη|ξ[gη(xt)])

2 C2 e m4 ξ,

where C2 e(f; g) := 8a3σ3+18a4σ2 2+5a4σ4 96 2 .

Proof: The Proposition 1 gives the following upper bound

E L(2) Dt+1 g,ξ fξ(0) fξ(Eη|ξ[gη(xt)])

2 4a3σ3+9a4σ2 2 48m2 + 5a4

96 σ4 3σ2 2 m3 2 .

For simplicity, we can relax the upper bound to

E L(2) Dt+1 g,ξ fξ(0) fξ(Eη|ξ[gη(xt)])

2 1 m4 8a3σ3+18a4σ2 2+5a4σ4 96 2 .

D.2 Convergence of BSGD

In this section, we reanalyze the BSGD algorithm of [16] to obtain bounds on bias and variance of its gradient estimates. Theorem 6 shows that BSGD achieves an O(ε 6) sample complexity.

Lemma 5 (Bias and Variance of BSGD) The bias and variance of BSGD are

Et+1 bias σ2 bias m , Et+1 var σ2 in m + σ2 out

where σ2 in = ζ2 g C2 f + σ2 g C2 g L2 f, σ2 out = C2 F , and σ2 bias = σ2 g C2 g L2 f.

Proof: Denote Gt+1 = Gt+1 BSGD (5) and denote E[ ] as the conditional expectation Et+1[ |t] which conditions on all past randomness until time t. Note that the g η can be estimated without bias, i.e.

E η|ξ h 1 m P

η Hξ g η(x) i = E η|ξ [ g η(x)] ,

Then let s first look at the bias of BSGD

Et+1 bias = F(xt+1) E[Gt+1] 2 2

= Eξ h (E η|ξ[ g η(xt)]) fξ(Eη|ξ[gη(xt)]) Eη|ξ[ fξ( 1

η Hξ gη(xt))] i 2

fξ(Eη|ξ[gη(xt)]) Eη|ξ[ fξ( 1

η Hξ gη(xt))] 2

C2 g L2 f Eξ

Eη|ξ[gη(xt)] 1

η Hξ gη(xt) 2

C2 g L2 f m Eξ h Eη|ξ h Eη|ξ[gη(xt)] gη(xt) 2 2

= σ2 g C2 g L2 f m = σ2 bias m .

For the first inequality, we take the expectation outside the norm and bound g η with Cg.

On the other hand, the variance of BSGD can be decomposed into inner variance and outer variance

Et+1 var = Eξ[Eη|ξ, η|ξ[ Gt+1 Eξ[Eη|ξ, η|ξ[Gt+1]] 2 2]]

= Eξ[Eη|ξ, η|ξ[ (Gt+1 Eη|ξ, η|ξ[Gt+1]) + (Eη|ξ, η|ξ[Gt+1] Eξ[Eη|ξ, η|ξ[Gt+1]]) 2 2]]

= Eξ[Eη|ξ, η|ξ[ Gt+1 Eη|ξ, η|ξ[Gt+1] 2 2]] | {z } Inner variance

+ Eξ[ Eη|ξ, η|ξ[Gt+1] Eξ[Eη|ξ, η|ξ[Gt+1]] 2 2] | {z } Outer variance

The inner variance is bounded as follows

Eξ[Eη|ξ, η|ξ[ Gt+1 Eξ[Eη|ξ, η|ξ[Gt+1] 2 2]]

η Hξ g η(xt) E η|ξ[ g η(xt)]) fξ( 1

m P η Hξ gη(xt)) 2

(E η|ξ[ g η(xt)]) ( fξ( 1

η Hξ gη(xt)) Eη|ξ[ fξ( 1

η Hξ gη(xt))]) 2

η Hξ g η(xt) E η|ξ[ g η(xt)] 2

η Hξ gη(xt)) Eη|ξ[ fξ( 1

η Hξ gη(xt))] 2

η Hξ g η(xt) E η|ξ[ g η(xt)] 2

η Hξ gη(xt)) fξ(Eη[gη(xt)]) 2

fξ(Eη[gη(xt)]) Eη|ξ[ fξ( 1

η Hξ gη(xt))] 2

ζ2 g C2 f m + C2 g Lf Eξ

η Hξ gη(xt) Eη|ξ[gη(xt)] 2

C2 f ζ2 g m + C2 g Lf

m Eξ h Eη|ξ h gη(xt) Eη|ξ[gη(xt)] 2 2

ζ2 g C2 f +σ2 g C2 g L2 f m = σ2 in m .

The outer variance is independent of the inner batch size and can be bounded by

Eξ[ Eη|ξ, η|ξ[Gt+1] Eξ[Eη|ξ, η|ξ[Gt+1]] 2 2] Eξ[ Eη|ξ, η|ξ[Gt+1] 2 2] C2 f C2 g = C2 F = σ2 out

Therefore, the variance is bounded as follows

Et+1 var σ2 in m + σ2 out.

This completes the proof.

Theorem 6 (BSGD Convergence) Consider the (CSO) problem. Suppose Assumptions 3, 4, 5 holds true. Let step size γ 1/(2LF ). Then for BSGD, xs picked uniformly at random among {xt}T 1 t=0 satisfies: E[ F(xs) 2 2] ε2, for nonconvex F, if the inner batch size m = Ω σ2 biasε 2 and the number of iterations T = Ω (F(x0) F )LF (σ2 in/m + σ2 out)ε 4 , where σ2 in = ζ2 g C2 f + σ2 g C2 g L2 f, σ2 out = C2 F , and σ2 bias = σ2 g C2 g L2 f.

Proof: Denote Gt+1 = Gt+1 BSGD (1). Using descent lemma (Lemma 4) and bias-variance bounds of BSGD (Lemma 5)

1 T PT 1 t=0 E h F(xt) 2 2 i + 1

2 1 T PT 1 t=0 E h Et+1[Gt+1|t] 2 2

2(E[F (x T )] F )

γT + LF γ( σ2 in m + σ2 out) + σ2 bias m

Then we can minimize the right-hand size by optimizing γ to

2(F (x0) F ) LF (σ2 in/m+σ2 out)T

which is smaller than the bound of step size γ 1 2LF if T is greater than the following constant which does not rely on the target precision ε

T 8LF (F (x0) F )

σ2 in/m+σ2 out .

Then the upper bound of gradient becomes

1 T PT 1 t=0 E h F(xt) 2 2 i q

2(F (x0) F )LF (σ2 in/m+σ2 out) T + σ2 bias m .

By taking inner batch size of at least

m σ2 bias ε2 ,

and iteration T greater than

T 2(F (x0) F )LF (σ2 in/m+σ2 out) ε4 ,

we have that

1 T PT 1 t=0 E h F(xt) 2 2 i 2ε2.

By picking xs uniformly at random among {xt}T 1 t=0 , we get the desired guarantee.

The resulting sample complexity of BSGD to get to an ε-stationary point is O(ε 6).

D.3 Convergence of E-BSGD

In this section, we analyze the sample complexity of Algorithm 2 (E-BSGD) for the CSO problem.

Lemma 6 (Bias and Variance of E-BSGD) The bias and variance of E-BSGD are

Et+1 bias σ2 bias m4 , Et+1 var 14( σ2 in m + σ2 out)

where σ2 in = ζ2 g C2 f + σ2 g C2 g L2 f, σ2 out = C2 F , and σ2 bias = C2 g C2 e with C2 e defined in Corollary 1.

Proof: Denote Gt+1 = Gt+1 E-BSGD (7). Like previously (Lemma 5), let E[ ] denote the conditional expectation Et+1[ |t] which conditions on all past randomness until time t. In E-BSGD, we apply extrapolation to fξ( ). The bias can be estimated with the help of Corollary 1 as

Et+1 bias = F(xt+1) E[Gt+1] 2 2

" fξ(Eη|ξ[gη(xt)]) E L(2) Dt+1 g,ξ fξ(0)

C2 g C2 e m4 .

Since the variance of BSGD in Lemma 5 is upper bounded by σ2 in m + σ2 out, then Lemma 2 gives

Et+1 var 14(σ2 in/m + σ2 out).

This proves the claimed bounds.

Theorem 3 [E-BSGD Convergence] Consider the (CSO) problem. Suppose Assumptions 3, 4, 5, 6 hold true and LF , CF , LF , Cg, F are constants and Ce(f; g) := 8a3σ3+18a4σ2 2+5a4σ4 96 defined in Corollary 1 are associated with second order extrapolation in the CSO problem. Let step size γ 1/(2LF ). Then the output xs of E-BSGD (Algorithm 2) satisfies: E[ F(xs) 2 2] ε2, for nonconvex F, if the inner batch size m = Ω(Ce Cgε 1/2) , and the number of iterations

T = Ω(LF (F(x0) F )( L2 F/m + C2 F )ε 4).

Proof: The proof is very similar to Theorem 6. Denote Gt+1 = Gt+1 E-BSGD (7). Using descent lemma (Lemma 4) and bias-variance bounds of E-BSGD (Lemma 6)

1 T PT 1 t=0 E h F(xt) 2 2 i + 1

2 1 T PT 1 t=0 E h Et+1[Gt+1|t] 2 2

2(E[F (x T )] F )

γT + 14LF γ( σ2 in m + σ2 out) + C2 g C2 e m4 .

Then we optimize γ to

(F (x0) F ) 7LF (σ2 in/m+σ2 out)T

which is smaller than the bound of step size γ 1 2LF if T is greater than the following constant which does not rely on the target precision ε

T 4LF (F (x0) F )

7(σ2 in/m+σ2 out) .

Then the gradient norm has the following upper bound.

1 T PT 1 t=0 F(xt) 2 2 4 q

7(F (x0) F )LF (σ2 in+σ2 out) T + σ2 bias m4 .

In order to reach ε-stationary point, i.e.

1 T PT 1 t=0 F(xt) 2 2 ε2,

we can enforce

7(F (x0) F )LF (σ2 in/m+σ2 out) T ε2, C2 g C2 e m4 ε2.

By taking inner batch size of at least

m = Ω( σ1/2 bias ε 1/2),

and iteration T greater than

T 112(F (x0) F )LF ( σ2 in m +σ2 out) ε4 ,

we have that

1 T PT 1 t=0 F(xt) 2 2 3ε2.

By picking xs uniformly at random among {xt}T 1 t=0 , we get the desired guarantee.

D.4 Convergence of BSpider Boost

In this section, we reanalyze the BSpider Boost algorithm of [16] to obtain bounds on bias and variance of its gradient estimates. Theorem 6 shows that BSpider Boost achieves an O(ε 5) sample complexity.

Let Gt+1 BSB as the BSpider Boost gradient estimate

Gt+1 BSB = Gt BSB + 1 B2 P

ξ B2(Gt+1 BSGD Gt BSGD) with prob. 1 pout 1 B1 P ξ B1 Gt+1 BSGD with prob. pout.

Lemma 7 (Bias and Variance of BSpider Boost) If γ min{ 1 2LF , B2

6LF }, then the bias and variance of BSpider Boost are

1 T PT 1 t=0 E[Et+1 bias ] 2σ2 bias m + (1 pout)3

pout B2 56L2 F γ2

T PT 1 t=0 E[ Et+1[Gt+1|t] 2 2] + ( 1 T pout + 1) 4(1 pout)2

B1 ( σ2 in m + σ2 out)

1 T PT 1 t=0 E[Et+1 var ] 28(1 pout)L2 F γ2

B2 1 T PT 1 t=0 E[ Et+1[Gt+1|t] 2 2] + ( 1

T + pout) 2

B1 ( σ2 in m + σ2 out),

where σ2 in = ζ2 g C2 f + σ2 g C2 g L2 f, σout = C2 F , and σ2 bias = σ2 g C2 g L2 f.

Proof: Denote Gt+1 = Gt+1 BSB (8). Like previously (Lemma 5), let E[ ] denote the conditional expectation Et+1[ |t] which conditions on all past randomness until time t. Denote Gt+1 L and Gt+1 S as the large batch and small batch in BSpider Boost separately, i.e., Gt+1 L = 1 B1 P

ξ B1 Gt+1 BSGD with prob. pout Gt+1 S = Gt + 1 B2 P

ξ B2(Gt+1 BSGD Gt BSGD) with prob. 1 pout.

The bias of BSpider Boost can be decomposed to its distance to BSGD and the distance from BSGD to the full gradient, i.e.,

Et+1 bias = F(xt+1) E[Gt+1] 2 2 2 F(xt+1) E[Gt+1 BSGD] 2 2 + 2 E[Gt+1 BSGD] E[Gt+1] 2 2

2σ2 bias m + 2 E[Gt+1 BSGD] E[Gt+1] 2 2 .

where the last inequality uses the bias of BSGD from Lemma 5. Then the second term can be bounded as follows E[Gt+1 BSGD] E[Gt+1] 2 2 = (1 pout)2 E[Gt+1 BSGD] E[Gt+1 S ] 2 2 = (1 pout)2 E[Gt BSGD] Gt 2 2 .

By taking the expectation of randomness of Gt

E[Gt+1 BSGD] E[Gt+1] 2 2 = (1 pout)2 E[Gt BSGD] E[Gt] 2 2 + E Gt E[Gt] 2 2

= (1 pout)2 E[Gt BSGD] E[Gt] 2 2 + Et var

Note that E[G1 BSGD] E[G1] 2 2 = 0 as the first iteration always chooses the large batch. Then as we always use large batch at t = 0 we know that

1 T PT 1 t=0 E[Gt+1 BSGD] E[Gt+1] 2 2 (1 pout)2

pout 1 T PT 1 t=0 Et+1 var . (15)

Therefore combine (14) and (15) we can upper bound the bias

1 T PT 1 t=0 Et+1 bias 2σ2 bias m + 2(1 pout)2

pout 1 T PT 1 t=0 Et+1 var . (16)

Variance. Now we consider the variance,

Et+1 var = E h Gt+1 E[Gt+1] 2 2

(1 pout) E h Gt+1 S E[Gt+1 S ] 2 2

i + pout E h Gt+1 L E[Gt+1 L ] 2 2

B2 E h Gt+1 BSGD Gt BSGD E[Gt+1 BSGD Gt BSGD] 2 2

B1 E h Gt+1 BSGD E[Gt+1 BSGD] 2 2

B2 E h Gt+1 BSGD Gt BSGD E[Gt+1 BSGD Gt BSGD] 2 2

B1 ( σ2 in m + σ2 out) (17)

where the last equality is because the large batch in BSpider Boost is similar to BSGD.

E1 var = E h G1 E[G1] 2 2

i = E h G1 L E[G1 L] 2 2

i = 1 B1 E h G1 BSGD E[G1 BSGD] 2 2

i 1 B1 ( σ2 in m +σ2 out). (18) Finally, we expand the variance at small batch size epoch

E h Gt+1 BSGD Gt BSGD E[Gt+1 BSGD Gt BSGD] 2 2

= E h Gt+1 BSGD Gt BSGD Eη|ξ, η|ξ[Gt+1 BSGD Gt BSGD] 2 2

| {z } Inner variance Tin

+ Eξ h Eη|ξ, η|ξ[Gt+1 BSGD Gt BSGD] E[Gt+1 BSGD Gt BSGD] 2 2

| {z } Outer variance Tout

The outer variance Tout can be upper bounded as

Tout Eξ h Eη|ξ, η|ξ[Gt+1 BSGD Gt BSGD] 2 2

(E η|ξ[ g η(xt)]) Eη|ξ[ fξ( 1

η Hξ gη(xt))] (E η|ξ[ g η(xt 1)]) Eη|ξ[ fξ( 1

η Hξ gη(xt 1))] 2

(E η|ξ[ g η(xt)] E η|ξ[ g η(xt 1)]) Eη|ξ[ fξ( 1

η Hξ gη(xt))] 2

(E η|ξ[ g η(xt 1)]) Eη|ξ[ fξ( 1

η Hξ gη(xt)) fξ( 1

η Hξ gη(xt 1))] 2

2L2 g C2 f xt xt 1 2 2 + 2C4 g L2 f xt xt 1 2 2 = 2L2 F xt xt 1 2 2 = 2L2 F γ2 Gt 2 2 .

The inner variance can be bounded by

Tin 4 E ( 1

η Hξ( g η(xt) g η(xt 1)) E η|ξ[ g η(xt) g η(xt 1)]) fξ( 1

η Hξ gη(xt)) 2

+ 4 E (E η|ξ[ g η(xt) g η(xt 1)]) ( fξ( 1

η Hξ gη(xt)) Eη|ξ[ fξ( 1

η Hξ gη(xt))]) 2

η Hξ g η(xt 1)) fξ( 1

η Hξ gη(xt)) fξ( 1

η Hξ gη(xt 1))

Eη|ξ h fξ( 1

η Hξ gη(xt)) fξ( 1

η Hξ gη(xt 1)) i

η Hξ g η(xt 1) E η|ξ[ g η(xt 1)]) ( fξ( 1

η Hξ gη(xt)) fξ( 1

η Hξ gη(xt 1))) 2

4C2 f m E h g η(xt) g η(xt 1) E η|ξ[ g η(xt) g η(xt 1)] 2 2

i + 4L2 g C2 f m xt xt 1 2 2

+ 4C2 g E fξ( 1

m P η Hξ gη(xt)) fξ( 1

m P η Hξ gη(xt 1))

Eη|ξ[ fξ( 1

m P η Hξ gη(xt)) fξ( 1

m P η Hξ gη(xt 1))]

+ 4C4 g L2 f m xt xt 1 2 2 .

Then we have that

Tin 4L2 g C2 f m xt xt 1 2 2 + 4L2 g C2 f m xt xt 1 2 2

+ 4C2 g E fξ( 1

m P η Hξ gη(xt)) fξ( 1

m P η Hξ gη(xt 1)) ( fξ(Eη|ξ[gη(xt)]) fξ(Eη|ξ[gη(xt 1)])) 2

+ 4C2 g E ( fξ(Eη|ξ[gη(xt)]) fξ(Eη|ξ[gη(xt 1)]))

Eη|ξ[ fξ( 1

η Hξ gη(xt)) fξ( 1

η Hξ gη(xt 1))]

+ 4C4 g L2 f m xt xt 1 2 2

8L2 g C2 f m xt xt 1 2 2 + 8C4 g L2 f m xt xt 1 2 2 + 4C4 g L2 f m xt xt 1 2 2

12L2 F m xt xt 1 2 2

= 12L2 F γ2

To sum up, the variance is bounded by

Et+1 var 2(1 pout)L2 F γ2

m) Gt 2 2 + pout

B1 ( σ2 in m + σ2 out)

14(1 pout)L2 F γ2

B2 Gt 2 2 + pout

B1 ( σ2 in m + σ2 out)

= 14(1 pout)L2 F γ2

B2 Et[ Gt 2 2 |t 1] + pout

B1 ( σ2 in m + σ2 out).

Then averaging over time and now we redefine E[ ] = ET . . . [E0[ ]]

1 T PT 1 t=0 E[Et+1 var ] 14(1 pout)L2 F γ2

B2 1 T PT 1 t=0 E[ Gt+1 2 2] + E1 var T + pout

B1 ( σ2 in m + σ2 out)

= 14(1 pout)L2 F γ2

B2 1 T PT 1 t=0 (E[Et+1 var ] + E[ Et[Gt+1|t] 2 2]) + E1 var T + pout

B1 ( σ2 in m + σ2 out).

If we take γ B2

6LF , then 14(1 pout)L2 F γ2

2, therefore

1 T PT 1 t=0 E[Et+1 var ] 28(1 pout)L2 F γ2

B2 1 T PT 1 t=0 E[ Et[Gt+1|t] 2 2] + E1 var T + 2pout

B1 ( σ2 in m + σ2 out)

(18) 28(1 pout)L2 F γ2

B2 1 T PT 1 t=0 E[ Et[Gt+1|t] 2 2] + ( 1

T + pout) 2

B1 ( σ2 in m + σ2 out) Then with (16), we can bound the bias by

1 T PT 1 t=0 E[Et+1 bias ] 2σ2 bias m + (1 pout)3

pout B2 56L2 F γ2

T PT 1 t=0 E[ Et[Gt+1|t] 2 2] + ( 1 T pout + 1) 4(1 pout)2

B1 ( σ2 in m + σ2 out)

Theorem 7 (BSpider Boost Convergence) Consider the (CSO) problem. Suppose Assumptions 3, 4, 5 holds true. Let step size γ 1/(13LF ). Then for BSpider Boost, xs picked uniformly at random among {xt}T 1 t=0 satisfies: E[ F(xs) 2 2] ε2, for nonconvex F, if the inner batch size m = Ω(σ2 biasε 2), the hyperparameters of the outer loop of BSpider Boost are B1 = (σ2 in/m + σ2 out)ε 2, B2 = O(ε 1), pout = 1/B2, and the number of iterations T = Ω LF (F(x0) F )ε 2 , where σ2 in = ζ2 g C2 f + σ2 g C2 g L2 f, σ2 out = C2 F , and σ2 bias = σ2 g C2 g L2 f.

Proof: Denote Gt+1 = Gt+1 BSB (8). Using descent lemma (Lemma 4) and bias-variance bounds of BSpider Boost (Lemma 7)

1 T PT 1 t=0 E[ F(xt) 2 2] + 1 2T PT 1 t=0 E[ Et[Gt+1|t] 2 2]

2(F (x0) F )

γT + LF γ 1

T PT 1 t=0 E[Et+1 var ] + 1

T PT 1 t=0 E[Et+1 bias ]

2(F (x0) F )

γT + LF γ 1

T PT 1 t=0 E[Et+1 var ] + 2σ2 bias m + 2 pout 1 T PT 1 t=0 E[Et+1 var ]

2(F (x0) F )

γT + 2σ2 bias m + 3 pout 1 T PT 1 t=0 E[Et+1 var ]

where the last inequality use γ 1 2LF . Use the variance estimation of Gt+1 and choose B2pout = 1

1 T PT 1 t=0 E[ F(xt) 2 2] + 1 2T PT 1 t=0 E[ Et[Gt+1|t] 2 2]

2(F (x0) F )

γT + 2σ2 bias m + 84L2 F γ2 1

T PT 1 t=0 E[ Et[Gt+1|t] 2 2] + ( 1 T pout + 1) 6

B1 ( σ2 in m + σ2 out).

Now we can let γ 1 13LF such that 84L2 F γ2 1

1 T PT 1 t=0 E[ F(xt) 2 2] 2(F (x0) F )

γT + 2σ2 bias m + ( 1 T pout + 1) 6

B1 ( σ2 in m + σ2 out).

In order for the right-hand side to be ε2, the inner batch size

m 2σ2 bias ε2 , and the outer batch size

B1 = σ2 in/m+σ2 out ε2 , B2 = p

B1, pout = 1 B2 .

The step size γ is upper bounded by min{ 1 2LF , B2

6LF , 1 13LF }. As B2 1, we can take γ = 1 13LF . So we need iteration T greater than

T 26LF (F (x0) F )

By picking xs uniformly at random among {xt}T 1 t=0 , we get the desired guarantee.

The resulting sample complexity of BSpider Boost to get to an ε-stationary point is O(ε 5).

D.5 Convergence of E-BSpider Boost

In this section, we analyze the sample complexity of Algorithm 3 (E-BSpider Boost) for the CSO problem.

Lemma 8 (Bias and Variance of E-BSpider Boost) The bias and variance of E-BSpider Boost are

1 T PT 1 t=0 E[Et+1 var ] 28(1 pout)L2 F γ2

B2 1 T PT 1 t=0 E[ Et[Gt+1|t] 2 2] + ( 1 T pout + 1) 28pout

B1 ( σ2 in m + σ2 out)

1 T PT 1 t=0 E[Et+1 bias ] 2 σ2 bias m4 + 2 pout (1 pout)2

T PT 1 t=0 E[Et+1 var ],

where σ2 in := ζ2 g C2 f + σ2 g C2 g L2 f, σout = C2 F , and σ2 bias = C2 g C2 e with C2 e defined in Corollary 1.

Proof: Denote Gt+1 = Gt+1 E-BSB (9). Like previously (Lemma 5), let E[ ] denote the conditional expectation Et[ |t] which conditions on all past randomness until time t. Let Gt+1 = Gt+1 E-BSB be the E-BSpider Boost update. We expand the bias as follows

Et+1 bias = F(xt+1) E[Gt+1] 2 2 2 F(xt+1) E[Gt+1 E-BSGD 2 2] + 2 E[Gt+1 E-BSGD] E[Gt+1] 2 2 .

From Lemma 6, we know that F(xt+1) E[Gt+1 E-BSGD] 2 2 σ2 bias m4 .

The distance between E[Gt+1 E-BSGD] and E[Gt+1] can be bounded as follows. E[Gt+1 E-BSGD] E[Gt+1] 2 2 = (1 pout)2 E[Gt+1 E-BSGD] (Gt + E[Gt+1 E-BSGD Gt E-BSGD]) 2 2 = (1 pout)2 E[Gt E-BSGD] Gt 2 2

Taking expectation with respect to Gt

E[Gt+1 E-BSGD] E[Gt+1] 2 2 (1 pout)2( E[Gt E-BSGD] E[Gt] 2 2 + Gt E[Gt] 2 2).

where E[G1 E-BSGD] E[G1] 2 2 = 0. By averaging over time we have

1 T PT 1 t=0 E[Gt+1 E-BSGD] E[Gt+1] 2 2 1 pout (1 pout)2

T PT 1 t=0 E[Et+1 var ].

Then the bias is bounded by

1 T PT 1 t=0 E[Et+1 bias ] 2 σ2 bias m4 + 2 pout (1 pout)2

T PT 1 t=0 E[Et+1 var ].

Variance. Since the extrapolation only gives a constant overhead given Lemma 2

1 T PT 1 t=0 E h Gt+1 BSB Et[Gt+1 BSB|t] 2 2

i 28(1 pout)L2 F γ2

B2 1 T PT 1 t=0 E[ Et[Gt+1 BSB|t] 2 2] + ( 1

T + pout) 28

B1 ( σ2 in m + σ2 out).

Then the variance is bounded by

1 T PT 1 t=0 E[Et+1 var ] 28(1 pout)L2 F γ2

B2 1 T PT 1 t=0 E[ Et[Gt+1|t] 2 2] + ( 1 T pout + 1) 28pout

B1 ( σ2 in m + σ2 out).

Theorem 4 [E-BSpider Boost Convergence] Consider the (CSO) problem under the same assumptions as Theorem 3. Let step size γ 1/(13LF ). Then the output xs of E-BSpider Boost (Algorithm 3) satisfies: E[ F(xs) 2 2] ε2, for nonconvex F, if the inner batch size m = O(Ce Cgε 0.5), the hyperparameters of the outer loop of E-BSpider Boost B1 = ( L2 F /m + C2 F )ε 2, B2 = B1, pout = 1/B2, and the number of iterations

T = Ω(LF (F(x0) F )ε 2).

Proof: Denote Gt+1 = Gt+1 E-BSB (9). Using descent lemma (Lemma 4) and bias-variance bounds of E-BSpider Boost (Lemma 8)

1 T PT 1 t=0 E[ F(xt) 2 2] + 1 2T PT 1 t=0 E[ Et[Gt+1|t] 2 2]

2(F (x0) F )

γT + LF γ 1

T PT 1 t=0 E[Et+1 var ] + 1

T PT 1 t=0 E[Et+1 bias ]

2(F (x0) F )

γT + LF γ 1

T PT 1 t=0 E[Et+1 var ] + 2 σ2 bias m4 + 2 pout 1 T PT 1 t=0 E[Et+1 var ]

2(F (x0) F )

γT + 2 σ2 bias m4 + 3 pout 1 T PT 1 t=0 E[Et+1 var ]

where the last inequality use γ 1 2LF . Use the variance estimation of Gt+1 and choose B2pout = 1

1 T PT 1 t=0 E[ F(xt) 2 2] + 1 2T PT 1 t=0 E[ Et[Gt+1|t] 2 2]

2(F (x0) F )

γT + 2 σ2 bias m4 + 84L2 F γ2 1

T PT 1 t=0 E[ Et[Gt+1|t] 2 2] + ( 1 T pout + 1) 84

B1 ( σ2 in m + σ2 out).

Now we can let γ 1 13LF such that 84L2 F γ2 1

1 T PT 1 t=0 E[ F(xt) 2 2] 2(F (x0) F )

γT + 2 σ2 bias m4 + ( 1 T pout + 1) 84

B1 ( σ2 in m + σ2 out). (19)

In order to make the right-hand side ε2, the inner batch size

m = Ω( σ2 biasε 0.5),

and the outer batch size

B1 = (σ2 in/m+σ2 out) ε2 , B2 = p

B1, pout = 1 B2 .

The step size γ is upper bounded by min{ 1 2LF , B2

6LF , 1 13LF }. As B2 1, we can take γ = 1 13LF . So we need iteration T greater than

T 26LF (F (x0) F )

By picking xs uniformly at random among {xt}T 1 t=0 , we get the desired guarantee.

E Stationary Point Convergence Proofs from Section 4 (FCCO)

In this section, we provide the convergence proofs for the FCCO problem. We start by analyzing a variant of BSpider Boost (Algorithm 3) for this case in Appendix E.1. In Appendix E.2, we present a multi-level variance reduction approach (called Nested VR) that applies variance reduction in both outer (over the random variable i) and inner (over the random variable η|i) loops. In Appendix E.3, we analyze E-Nested VR. As in the case of CSO analyses, our proofs go via bounds on bias and variance terms of these algorithms.

E.1 E-BSpider Boost for FCCO problem

Theorem 8 Consider the (FCCO) problem. Suppose Assumptions 3, 4, 5, 6 holds true. Let step size γ = O(1/LF ). Then the output of E-BSpider Boost (Algorithm 3) satisfies: E[ F(xs) 2 2] ε2, for nonconvex F, if the inner batch size m = Ω(max{Ce Cgε 1/2, σ2 inn 1ε 2}), the hyperparameters of the outer loop of E-BSpider Boost B1 = n, B2 = n, pout = 1/B2, and the number of iterations T = Ω LF (F(x0) F )ε 2 . The resulting sample complexity is

O LF (F(x0) F ) max n n Ce Cg

ε2.5 , σ2 in nε4 o .

Remark 9 The sample complexity depends on the relation between n and ε

When n = O(1), we have a complexity of O(ε 4). This happens because we did not apply variance reduction for the inner loop.

When n = Θ(ε 2/3), E-BSpider Boost has same performance as MSVR-V2 [19] of O(nε 3) = O(ε 11/3).

When n = Θ(ε 1.5), E-BSpider Boost achieves a better sample complexity of O(ε 3.25) than O(ε 4.5) from MSVR-V2 [19].

When n = Θ(ε 2), we recover O(ε 3.5) sample complexity as in Theorem 4.

Proof: Denote Gt+1 = Gt+1 E-BSB (9). As we are using the finite-sum variant of Spider Boost for the outer loop of the CSO problem, we only need to change the (17) and (18) to reflect that the outer variance is 0 now instead of σ2 out B1 in the general CSO case. More concretely, we update (17) to

Et+1 var = E[ Gt+1 E[Gt+1] 2 2]

(1 pout) E[ Gt+1 S E[Gt+1 S ] 2 2] + pout E[ Gt+1 L E[Gt+1 L ] 2 2]

B2 E[ Gt+1 E-BSGD Gt E-BSGD E[Gt+1 E-BSGD Gt E-BSGD] 2 2] + pout

B1 E[ Gt+1 E-BSGD E[Gt+1 E-BSGD] 2 2]

B2 E[ Gt+1 E-BSGD Gt E-BSGD E[Gt+1 E-BSGD Gt E-BSGD] 2 2] + pout

B1 σ2 in m . (20)

and change (18) to

E1 var = E[ G1 E[G1] 2 2] = E[ G1 L E[G1 L] 2 2] = 1 B1 E[ G1 E-BSGD E[G1 E-BSGD] 2 2] 1 B1 σ2 in m . (21) Then our analysis only has to start from the updated version of (19)

1 T PT 1 t=0 E[ F(xt) 2 2] 2(F (x0) F )

γT + 2 σ2 bias m4 + ( 1 T pout + 1) 84

B1 σ2 in m .

We would like all terms on the right-hand side to be bounded by ε2. From 2 σ2 bias m4 ε2 we know that

m = Ω( σ1/2 bias ε1/2 ).

From ( 1 T pout + 1) 84

B1 σ2 in m ε2, we know that

m = Ω( σ2 in nε2 ).

From 2(F (x0) F )

γT ε2, we can choose that

LF ), T = Ω LF (F (x0) F )

Now the total sample complexity for E-BSpider Boost for the FCCO problem becomes

B2m T = O L2 F (F(x0) F ) max n σ1/2 bias ε2.5 , σ2 in nε4

By picking xs uniformly at random among {xt}T 1 t=0 , we get the desired guarantee.

E.2 Convergence of Nested VR

Nested VR Algorithm. We start by describing the Nested VR construction. We maintain states yt+1 i and zt+1 i to approximate

yt+1 i Eη|i[gη(xt)], zt+1 i E η|i[ g η(xt)].

In iteration t + 1, if i is selected, then the state yt+1 i is updated as follows

η Hi gη(xt) with prob. pin yt i + 1 S2 P

η Hi(gη(xt) gη(ϕt i)) with prob. 1 pin,

where ϕt i is the last time node i is visited. If i is not selected, then

yt+1 i = yt i.

In this case, yt+1 i was never used to compute fi(yt+1 i ) because i is not selected at the time t + 1. We use the following quantities

ˆzt+1 i = E η|i[ g η(xt)], zt+1 i = 1

η Hi g η(xt). (22)

We use Gt+1 NVR as the actual updates,

Gt+1 NVR = 1

i B1(zt+1 i ) fi(yt+1 i ) with prob. pout Gt NVR + 1 B2 P

i B2((zt+1 i ) fi(yt+1 i ) (zt i) fi( yt i)) with prob. 1 pout.

We can also use the following quantity ˆGt+1 NVR as an auxiliary

ˆGt+1 NVR =

i B1(ˆzt+1 i ) fi(yt+1 i ) with prob. pout ˆGt NVR + 1 B2 P

i B2((ˆzt+1 i ) fi(yt+1 i ) (ˆzt i) fi( yt i)) with prob. 1 pout.

Here we use yt i to represent an i.i.d. copy of yt i where i is selected at time t.

The iterate xt+1 is therefore updated

xt+1 = xt γGt+1 NVR.

Lemma 9 The error between Gt+1 NVR and ˆGt+1 NVR can be upper bounded as follows

1 T PT 1 t=0 E Gt+1 NVR ˆGt+1 NVR 2

C2 f σ2 g m + 4(1 pout)

B2mpout 1 T PT 1 t=0 E[ Gt+1 i Gt i 2 2] .

Proof: In this proof, we ignore the subscript in Gt+1 NVR and ˆGt+1 NVR, we bound the error between Gt+1

and associated ˆGt+1 where

Gt+1 i = ( 1

η Hi g η(x)) fi(yt+1 i ),

ˆGt+1 i = (E η|i[ g η(x)]) fi(yt+1 i ).

Let s only consider the expectation over the randomness of g η,

Gt+1 i ˆGt+1 i 2

η Hi g η(x) E η|i[ g η(x))] 2

E[ fi(yt+1 i ) 2 2]

C2 f m E η|i h g η(x) E η|i[ g η(x))] 2 2

C2 f σ2 g m .

Then we can bound the error as follows

Gt+1 ˆGt+1 2

Gt+1 i ˆGt+1 i 2

+ (1 pout) Gt ˆGt 2

2 + 1 B2 E EHi

Gt+1 i Gt i ˆGt+1 i ˆGt i 2

C2 f σ2 g m + (1 pout) Gt ˆGt 2

B2m E h Gt+1 i Gt i 2 2

C2 f σ2 g m + (1 pout) Gt ˆGt 2

2 + (1 pout)

B2m E[ Gt+1 i Gt i 2 2] .

Unroll the recursion gives

1 T PT 1 t=0 E Gt+1 ˆGt+1 2

C2 f σ2 g m + 4(1 pout)

B2mpout 1 T PT 1 t=0 E[ Gt+1 i Gt i 2 2].

Lemma 10 (Staleness) Define the staleness of iterates at time t as Ξt := 1

n Pn j=1 xt ϕt j 2 2 and let Gt+1 be the gradient estimate, then

1 T PT 1 t=0 E[Ξt] 6n2

T PT 1 t=0 E[ Gt+1 2 2]. (23)

Proof: Like previously (Lemma 5), let E[ ] denote the expectation conditioned on all previous randomness until t 1. It is clear that Ξ0 = 0, so we only consider t > 0. We upper bound E[Ξt] as follows,

E[Ξt] = (1 pout) 1

j=1 E[ xt ϕt j 2 2]

| {z } if time t takes B2

j=1 E[ xt ϕt j 2 2]

| {z } if time t takes B1(ϕt j=xt 1)

Then we can expand E[Ξt] as follows

E[Ξt] = 1 pout

j=1 E[ xt ϕt j 2 2] + pout

j=1 E[ xt xt 1 2 2]

β ) Ei[ xt 1 ϕt j 2

2] + (1 + β) xt 1 xt 2 2

+ poutγ2 E[ Gt 2 2]

β ) Ei[ xt 1 ϕt j 2

2]+(1+β)γ2 E[ Gt 2 2]

where we use Cauchy-Schwarz inequality with coefficient β > 0. By the definition of ϕt j,

xt 1 ϕt 1 j 2

xt 1 xt 1 2 2

+(1+β)γ2 E[ Gt 2 2]

n )Ξt 1 + (1 + β)γ2 E[ Gt 2 2].

By taking β = 2n/B2, we have that (1 + 1

2n and thus

E[Ξt] (1 B2

2n )Ξt 1 + (1 + 2n

B2 )γ2 E[ Gt 2 2].

Note that E[Ξ0] = 0.

1 T PT 1 t=0 E[Ξt] 2n

T PT 1 t=0 E[ Gt+1 2 2]

T PT 1 t=0 E[ Gt+1 2 2].

The following lemma describes how the inner variable changes inside the variance.

Lemma 11 Denote Et+1 y := E h yt+1 i Eη|i[gη(xt)] 2 2

i to be the error from inner variance and pout T 1. Then

1 T PT 1 t=0 Et+1 y (1 pin)C2 g pin S2 1 T PT 1 t=0 Ξt + 2σ2 g S1 .

Meanwhile, E1 y = E[ y1 i Eη|i[gη(x0)] 2 2] = σ2 g S1 .

Et+1 y pin σ2 g S1 + (1 pin) Ei[Eη|i[ yt i Eη|i[gη(ϕt i)] 2 2]]

S2 Ei[Eη|i[ gη(xt) gη(ϕt i) 2 2]]

(1 pin)Et y + (1 pin)C2 g S2 Ei[Eη|i[ xt ϕt i 2 2]] + pin σ2 g S1 .

As t = 0 always uses the large batch, E1 y = E[ y1 i Eη|i[gη(x0)] 2 2] = σ2 g S1 . Then

1 T PT 1 t=0 Et+1 y (1 pin)C2 g pin S2 1 T PT 1 t=0 Ei[Eη|i[ xt ϕt i 2 2]] + σ2 g S1 + E1 y pin T

(1 pin)C2 g pin S2 1 T PT 1 t=0 Ξt + 2σ2 g S1 .

Lemma 12 The error Ei[Epin[Eη|i[ yt+1 i yt i 2 2]]] satisfies

1 T PT 1 t=1 Ei[Epin[Eη|i[ yt+1 i yt i 2 2]]] 4C2 gγ2

T PT 1 t=0 E[ Gt+1 2 2] + 4(1 pin)C2 g S2 1 T PT 1 t=0 Ξt+1

T PT 1 t=0 Et+1 y .

Note that when pin = 1 and S1 = S2 = m, we recover the following

1 T PT 1 t=1 Ei[Epin[Eη|i[ yt+1 i yt i 2 2]]] = 1

T PT 1 t=1 Ei[Epin[Eη|i[ 1

m P η Hξ gη(xt) gη(xt 1) 2

T PT 1 t=0 E[ Gt+1 2 2].

Proof: For t 2, Ei[Epin[Eη|i[ yt+1 i yt i 2 2]]] can be upper bounded as follows

Ei[Epin[Eη|i[ yt+1 i yt i 2 2]]] = pin Ei

η Hi(gη(xt) gη(xt 1)) 2

+ (1 pin) Ei

yt i yt 1 i + 1 S2 P

η Hi(gη(xt) gη(ϕt i)) (gη(xt 1) gη(ϕt 1 i )) 2

pin C2 g xt xt 1 2 2 + 1 pin

S2 Ei h Eη|i h (gη(xt) gη(ϕt i)) (gη(xt 1) gη(ϕt 1 i )) 2 2

+ 3(1 pin) Ei h yt i Eη|i[gη(ϕt i)] 2 2

i + Ei h yt 1 i Eη|i[gη(ϕt 1 i )] 2 2

i + C2 g xt xt 1 2 2

pin C2 g xt xt 1 2 2 + 2(1 pin)C2 g S2 Ξt + Ξt 1 + 3(1 pin) Et y + Et 1 y + C2 g xt xt 1 2 2

(pin + 3(1 pin))C2 g xt xt 1 2 2 + 2(1 pin)C2 g S2 Ξt + Ξt 1 + 3(1 pin) Et y + Et 1 y .

For t = 1, we choose y1 i = y1 i

Ei[Epin[Eη|i[ y2 i y1 i 2 2]]] = pin Ei

η Hi(gη(x1) gη(x0)) 2

+ (1 pin) Ei

y1 i 1 S2 P η Hi(gη(x1) gη(x0)) y1 i 2

C2 g x1 x0 2 2 .

Then for summing up t = 1 to T 1 PT 1 t=2 Ei[Epin[Eη|i[ yt+1 i yt i 2 2]]] + Ei[Epin[Eη|i[ y2 i y1 i 2 2]]]

(pin + 3(1 pin))C2 g PT 1 t=2 xt xt 1 2 2 + 2(1 pin)C2 g S2

PT 1 t=2 Ξt + PT 1 t=2 Ξt 1

+ 3(1 pin) PT 1 t=2 Et y + PT 1 t=2 Et 1 y + C2 g x1 x0 2 2

4C2 g PT 1 t=1 xt xt 1 2 2 + 4(1 pin)C2 g S2 PT 1 t=0 Ξt+1 + 6(1 pin) PT 1 t=0 Et+1 y .

Finally, the error has the following upper bound

1 T PT 1 t=1 Ei[Epin[Eη|i[ yt+1 i yt i 2 2]]]

T PT 1 t=0 E[ Gt+1 2 2] + 4(1 pin)C2 g S2 1 T PT 1 t=0 Ξt+1 + 6(1 pin)

T PT 1 t=0 Et+1 y .

Lemma 13 (Bias and Variance of Nested VR) If the step size γ satisfies,

γ2L2 F max n (1 pin)

pin S2 18 B2 , 1 pout

pin S2 18n2

B2 2 , (1 pout)

then the variance and bias of Nested VR are

1 T PT 1 t=0 Et+1 var 32 pout

B1 + 1 pout

pin S2 18n2

B2 2 + (1 pout)

γ2L2 F T PT 1 t=0 E[Gt+1] 2 2

B1 + (1 pin)(1 pout)

L2 F S1 + (1 pout)

T 8 L2 F B1S1

1 T PT 1 t=0 Et+1 bias 12(1 pin)

T PT 1 t=0 E[Gt+1] 2 2 + 4 L2 F S1

+ 12(1 pin)

B2 2 L2 F γ2 + 2(1 pout)2

1 T PT 1 t=0 Et+1 var .

Proof: Notations. Let us define the following terms,

Gt+1 i := (zt+1 i ) fi(yt+1 i ), Gt i := (zt i) fi( yt i).

Note that the Gt computed at time t + 1 has same expectation as Gt

Et+1[ Gt|t] = Et[Gt|t 1]. (24)

Computing the bias. First consider the two cases in the outer loop

Et+1 bias = F(xt) Et+1[Gt+1|t] 2 2 2 F(xt) Et+1[Gt+1 i |t] 2 2 | {z } At+1 1

+2 Et+1[Gt+1 i |t] Et+1[Gt+1|t] 2 2 | {z } At+1 2

We expand At+1 2 as follows

At+1 2 = Et+1[Gt+1 i |t] Et+1[Gt+1|t] 2 2

= Et+1[Gt+1 i |t] pout Et+1[Gt+1 i |t] (1 pout)(Gt + Et+1[Gt+1 i Gt i|t]) 2

= (1 pout)2 Gt Et+1[ Gt i|t] 2

2 = (1 pout)2 Gt Et[Gt i|t 1] 2 2 where we use (24) in the last equality. Now we take expectation with respect to randomness at t such that Gt is a random variable, then

At+1 2 = (1 pout)2 Et h Gt Et[Gt i|t 1] 2 2 |t 1 i

= (1 pout)2 Et[Gt|t 1] Et[Gt i|t 1] 2 2 + Et var

= (1 pout)2 At 2 + Et var

while at initialization we always use large batch

A1 2 = E1[G1 i ] E1[G1] 2 2 = E1[G1 i ] E1[G1 i ] 2 2 = 0.

Therefore, when we average over time t

1 T PT 1 t=0 At+1 2 (1 pout)2

pout 1 T PT 1 t=0 Et+1 var . (25)

On the other hand, let us consider the upper bound on At+1 1

At+1 1 C2 g L2 f E[ yt+1 i Eη|i[gη(xt)] 2 2] = C2 g L2 f Et+1 y .

From Lemma 11 we know that

1 T PT 1 t=0 At+1 1 C2 g L2 f (1 pin)C2 g pin S2 1 T PT 1 t=0 Ξt + 2σ2 g S1

(1 pin)L2 F pin S2 1 T PT 1 t=0 Ξt + 2 L2 F S1 .

From Lemma 10 we know that

1 T PT 1 t=0 At+1 1 (1 pin)L2 F pin S2

T PT 1 t=0 E[ Gt+1 2 2] + 2 L2 F S1

T PT 1 t=0 E[Gt+1] 2 2 + 2 L2 F S1 + 6(1 pin)

T PT 1 t=0 Et+1 var .

Therefore, the bias has the following bound

1 T PT 1 t=0 Et+1 bias 12(1 pin)

T PT 1 t=0 E[Gt+1] 2 2 + 4 L2 F S1

+ 12(1 pin)

B2 2 L2 F γ2 + 2(1 pout)2

1 T PT 1 t=0 Et+1 var . (26)

Note that when pin = 1 and S1 = S2 = m, then this bias recovers BSpider Boost in (16)

1 T PT 1 t=0 Et+1 bias 4 L2 F m + 2(1 pout)2

pout 1 T PT 1 t=0 Et+1 var .

Computing the variance. Let us decompose the variance into 3 parts:

Et+1 var = E h Gt+1 E[Gt+1] 2 2

= E Gt+1 ˆGt+1 Eη|i[ ˆGt+1] Ei[Eη|i[ ˆGt+1]] 2

= E Gt+1 ˆGt+1 2

| {z } Et+1 g

+ Ei[ Eη|i[Gt+1] Ei[Eη|i[Gt+1]] 2 2] | {z } Et+1 var,out

+ E[ Gt+1 Eη|i[Gt+1] 2 2] | {z } Et+1 var,in

where Et+1 var,out and Et+1 var,in are the variance of outer loop and inner loop.

Inner Variance. For t 1, we expand the inner variance

Et+1 var,in = pout E 1

i(E η|i[ g η(xt)]) ( fi(yt+1 i ) Eη|i[ fi(yt+1 i )]) 2

+ (1 pout) E 1

i(Gt+1 i Gt i) Eη|i[Gt+1 i Gt i] 2

B1 C2 g E h fi(yt+1 i ) Eη|i[ fi(yt+1 i )] 2 2

Gt+1 i Gt i 2

B1 4C2 g L2 f Et+1 y + 1 pout

Gt+1 i Gt i 2

We bound the outer variance as

Gt+1 i Gt i 2

Gt+1 i (E η|i[ g η(xt 1)]) fi(yt+1 i ) Gt i 2

2 Ei h Epin h Eη|i h Gt+1 i (E η|i[ g η(xt 1)]) fi(yt+1 i ) 2

(E η|i[ g η(xt 1)]) fi(yt+1 i ) Gt i 2

2C2 f L2 g xt xt 1 2 2 + 2C2 g L2 f Ei[Epin[Eη|i[ yt+1 i yt i 2 2]]]. (28)

For t = 0, as we only use large and small batch in the

E1 var,in = E 1

i(E η|i[ g η(x0)]) ( fi(y1 i ) Eη|i[ fi(y1 i )]) 2

1 B1 C2 g E[ fi(y1 i ) Eη|i[ fi(y1 i )] 2 2]

1 B1 4C2 g L2 f E1 y

1 B1 4C2 g L2 f σ2 g S1

4 L2 F B1S1 .

Therefore, average over time t = 0, . . . T 1 gives

1 T PT 1 t=0 Et+1 var,in pout

B1 4C2 g L2 f 1 T PT 1 t=1 Et+1 y + 1 pout

B2 1 T PT 1 t=1 Ei[Epin[Eη|i[ Gt+1 i Gt i 2

2]]] + E1 var,in

B1 4C2 g L2 f 1 T PT 1 t=0 Et+1 y + 1 pout

B2 1 T PT 1 t=1 Ei

Gt+1 i Gt i 2

+ (1 pout)E1 var,in T

B1 4C2 g L2 f 1 T PT 1 t=0 Et+1 y + (1 pout)E1 var,in T

+ 2(1 pout)

C2 f L2 g T PT 1 t=1 xt xt 1 2 2 + C2 g L2 f 1 T PT 1 t=1 Ei[Epin[Eη|i[ yt+1 i yt i 2 2]]]

B1 4C2 g L2 f 1 T PT 1 t=0 Et+1 y + (1 pout)E1 var,in T

+ 2(1 pout)

C2 f L2 gγ2

T PT 1 t=0 E[ Gt+1 2 2]

+ 2(1 pout)

B2 C2 g L2 f 1 T PT 1 t=1 Ei[Epin[Eη|i[ yt+1 i yt i 2 2]]]. Let us first apply Lemma 12

1 T PT 1 t=0 Et+1 var,in pout

B1 4C2 g L2 f 1 T PT 1 t=0 Et+1 y + (1 pout)E1 var,in T + 2(1 pout)

C2 f L2 gγ2

T PT 1 t=0 E[ Gt+1 2 2]

+ 2(1 pout)C2 g L2 f B2

T PT 1 t=0 E[ Gt+1 2 2] + 4(1 pin)C2 g S2 1 T PT 1 t=0 Ξt + 6(1 pin)

T PT 1 t=0 Et+1 y

B1 + (1 pin)(1 pout)

12C2 g L2 f T PT 1 t=0 Et+1 y + 8(1 pout)L2 F γ2

B2 1 T PT 1 t=0 E[ Gt+1 2 2]

+ 8(1 pout)(1 pin)C4 g L2 f B2S2 1 T PT 1 t=0 Ξt + (1 pout)E1 var,in T Then we apply Lemma 11 on the bound of 1

T PT 1 t=0 Et+1 y

1 T PT 1 t=0 Et+1 var,in 24 pout

B1 + (1 pin)(1 pout)

(1 pin)L2 F pin S2 1 T PT 1 t=0 Ξt + L2 F S1

+ 8(1 pout)L2 F γ2

B2 1 T PT 1 t=0 E[ Gt+1 2 2]

+ 8(1 pout)(1 pin)C4 g L2 f B2S2 1 T PT 1 t=0 Ξt + (1 pout)E1 var,in T

B1 + (1 pin)(1 pout)

B2 + pin(1 pout)

(1 pin)L2 F pin S2 1 T PT 1 t=0 Ξt

+ 8(1 pout)L2 F γ2

B2 1 T PT 1 t=0 E[ Gt+1 2 2]

B1 + (1 pin)(1 pout)

L2 F S1 + (1 pout)E1 var,in T

B1 + 1 pout

(1 pin)L2 F pin S2 1 T PT 1 t=0 Ξt

+ 8(1 pout)L2 F γ2

B2 1 T PT 1 t=0 E[ Gt+1 2 2]

B1 + (1 pin)(1 pout)

L2 F S1 + (1 pout)

T E1 var,in.

From Lemma 10, we plug in the upper bound of 1

T PT 1 t=0 Ξt

1 T PT 1 t=0 Et+1 var,in 24 pout

B1 + 1 pout

(1 pin)L2 F pin S2

T PT 1 t=0 E[ Gt+1 2 2]

+ 8(1 pout)L2 F γ2

B2 1 T PT 1 t=0 E[ Gt+1 2 2]

B1 + (1 pin)(1 pout)

L2 F S1 + (1 pout)

T E1 var,in

B1 + 1 pout

pin S2 18n2

B2 2 + (1 pout)

γ2L2 F T PT 1 t=0 E[ Gt+1 2 2]

B1 + (1 pin)(1 pout)

L2 F S1 + (1 pout)

T E1 var,in.

Finally, we add the upper bound on with E1 var,in with (29)

1 T PT 1 t=0 Et+1 var,in 8 pout

B1 + 1 pout

pin S2 18n2

B2 2 + (1 pout)

γ2L2 F T PT 1 t=0 E[ Gt+1 2 2]

B1 + (1 pin)(1 pout)

L2 F S1 + (1 pout)

T 4 L2 F B1S1 . (30)

Outer Variance. Now we consider the outer variance for t 1

Et+1 var,out (1 pout)2

Eη|i[Gt+1 i ] Eη|i[ Gt i] 2

Gt+1 i Gt i 2

Compared to (27) we know that the upper bound of is smaller than that of Et+1 var,in. Besides, whereas E1 var,out = 0 as we use large batch at t = 0. Therefore, the upper bound of Et+1 var is upper bounded by 2*(30).

Variance of g η. From Lemma 9, we know that

1 T PT 1 t=0 E[Et+1 g ] 1 B1

C2 f σ2 g m + 4(1 pout)

B2mpout 1 T PT 1 t=0

E Gt+1 i Gt i 2

C2 f σ2 g m + 1

Finally, we use E[ Gt+1 2 2] = E[Gt+1] 2 2 + Et+1 var .

1 T PT 1 t=0 Et+1 var 16 pout

B1 + 1 pout

pin S2 18n2

B2 2 + (1 pout)

γ2L2 F T PT 1 t=0 E[Gt+1] 2 2

B1 + 1 pout

pin S2 18n2

B2 2 + (1 pout)

γ2L2 F T PT 1 t=0 Et+1 var

B1 + (1 pin)(1 pout)

L2 F S1 + (1 pout)

T 8 L2 F B1S1 .

By taking step size γ to satisfy

γ2L2 F max n pout

pin S2 18n2

B2 2 , 1 pout

pin S2 18n2

B2 2 , (1 pout)

which can be simplified to

γ2L2 F max n (1 pin)

pin S2 18 B2 , 1 pout

pin S2 18n2

B2 2 , (1 pout)

Then the coefficient of 1

T PT 1 t=0 Et+1 var is bounded by 1

B1 + 1 pout

pin S2 18n2

B2 2 + (1 pout)

The the variance has the following bound

1 T PT 1 t=0 Et+1 var 32 pout

B1 + 1 pout

pin S2 18n2

B2 2 + (1 pout)

γ2L2 F T PT 1 t=0 E[Gt+1] 2 2

B1 + (1 pin)(1 pout)

L2 F S1 + (1 pout)

T 8 L2 F B1S1 .

Theorem 10 Consider the (FCCO) problem. Suppose Assumptions 3, 4, 5 holds true. Let step size γ = O( 1 n LF ). Then for Nested VR, xs picked uniformly at random among {xt}T 1 t=0 satisfies: E[ F(xs) 2 2] ε2, for nonconvex F, if the hyperparameters of the inner loop S1 = O( L2 F ε 2), S2 = O( LF ε 1), pin = O(1/S2), the hyperparameters of the outer loop B1 = n, B2 = n, pout = 1/B2, and the number of iterations

T = Ω n LF (F (x0) F )

The resulting sample complexity is

O n LF LF (F (x0) F )

In fact, it reaches this sample complexity for all pinpout 1 pin ε.

Proof: Using descent lemma (Lemma 4) and bias-variance bounds of Nested VR (Lemma 13)

1 T PT 1 t=0 F(xt) 2 2 + 1 2T PT 1 t=0 E[Gt+1] 2 2

2(F (x0) F )

T PT 1 t=0 Et+1 var + 1

T PT 1 t=0 Et+1 bias

2(F (x0) F )

γT | {z } T0

+ 4 L2 F S1 |{z} T1

+ 12(1 pin)

T PT 1 t=0 E[Gt+1] 2 2 | {z } T2

+ 12(1 pin)

B2 2 L2 F γ2 + 2(1 pout)2

pout + γLF 1 T PT 1 t=0 Et+1 var | {z } T3

Compute T0. In order to let T0 ε2, we require that

γT ε 2. (31)

Compute T1. In order to let T1 to be smaller than ε2, we need

S1 = 4 L2 F ε2 .

Compute T2. In order to let the coefficient of 1

T PT 1 t=0 E[Gt+1] 2 2 in T2 to be less than 1

B2 2 L2 F γ2 1

which requires γ γ B2 pin S2 7LF n 1 pin = poutpin LF 7εLF 1 pin . (33)

Compute T3. Let us now focus on T3 and notice that the middle term 2(1 pout)2

pout 1 T PT 1 t=0 Et+1 var .

Using Lemma 13 we have that

pout 1 T PT 1 t=0 Et+1 var

32 2(1 pout)2

B1 + 1 pout

pin S2 18n2

B2 2 + (1 pout)

γ2L2 F 1 T PT 1 t=0 E[Gt+1] 2 2 | {z } T3,1

+ 96 2(1 pout)2

B1 + (1 pin)(1 pout)

L2 F S1 | {z } T3,2

+ 2(1 pout)2

pout (1 pout)

T 8 L2 F B1S1 | {z } T3,3

Compute T3,3: As we already know that S1 = O(ε 2) and T 1 and B1pout 1. This imposes no more constraints, i.e.

Compute T3,2: As S1 = O(ε 2) and B1 = n and B2 = B1pout, then it requires

(1 pin)(1 pout)3

Compute T3,1: In order to satisfy the following

32 2(1 pout)2

B1 + 1 pout

pin S2 18n2

B2 2 + (1 pout)

γ2L2 F 1 12

we need to enforce

γ pinpout LF εLF (1 pin)1/2(1 pout)3/2 . (34)

Now we go back to T3 and compare the other two coefficients

B2 2 L2 F γ2 + 2(1 pout)2

pout + γLF .

2 2(1 pout)2

pout we can safely ignore γLF . On the other hand, from (32) we know that the first term is also have

B2 2 L2 F γ2 1

4 2(1 pout)2

Constraints from the Bias-Variance Lemma (Lemma 13). By setting B1 = n and S1 = O( L2 F ε2 ), this constraint translates to

γ2L2 F max n (1 pin)

p2 in ε2 B2 , 1 pout

B2 (1 pin)ε2

p2 in 1 p2 out , (1 pout)

which is weaker than (33).

Summary on the Limit on γ. Combine (33) and (34) and γ 1 2LF , we have a final limit on step size γ

γ min n poutpin LF εLF 1 pin , 1 LF

Then the total sample complexity of Nested VR can be computed as

(# of iters T) (Avg. outer batch size B2 = B1pout) (Avg. inner batch size S2 = S1pin).

This sample complexity has the following requirement

B2S2T = B2S2(Tγ)

The lower bound nε 3 is reached when in (35) we have

poutpin LF εLF 1 pin 1 LF .

That is, poutpin 1 pin ε.

In particular, we can choose the following hyperparameters to reach O(nε 3) sample complexity

B1 = n, B2 = n, pout = 1 n, S1 = O( L2 F ε 2), S2 = O( LF ε 1), pin = O( L 1 F ε)

The step size γ can be chosen as

and the iteration complexity

T = Ω n LF (F (x0) F )

Putting these together gives the claimed sample complexity bound. By picking xs uniformly at random among {xt}T 1 t=0 , we get the desired guarantee.

E.3 Convergence of E-Nested VR

In this section, we analyze the sample complexity of Algorithm 1 (E-Nested VR) for the FCCO problem with

Gt+1 E-NVR =

i(zt+1 i ) L(2) Dt+1 y,i fi(0) with prob. pout

Gt E-NVR + 1 B2 P

(zt+1 i ) L(2) Dt+1 y,i fi(0) (zt i) L(2) Dt y,i fi(0) with prob. 1 pout.

Lemma 14 (Bias and Variance of E-Nested VR) If the step size γ satisfies

γ2L2 F max n (1 pin)

pin S2 18 B2 , 1 pout

pin S2 18n2

B2 2 , (1 pout)

then the variance and bias of E-Nested VR are

1 T PT 1 t=0 Et+1 var 14 32 pout

B1 + 1 pout

pin S2 18n2

B2 2 + (1 pout)

γ2L2 F T PT 1 t=0 E[Gt+1] 2 2

+ 14 96 pout

B1 + (1 pin)(1 pout)

L2 F S1 + (1 pout)

T 8 L2 F B1S1 .

1 T PT 1 t=0 Et+1 bias (1 pin)3 L2 F pin S2 6n2

T PT 1 t=0 E[Gt+1] 2 2 + 2(1 pin)2 L2 F S1 + C2 e S4 2

+ (1 pin)3 L2 F pin S2 6n2

B2 2 γ2 + (1 pout)2

1 T PT 1 t=0 Et+1 var .

Proof: Note that this proof is very similar to Nested VR so we highlight the differences. Let Gt+1 = Gt+1 E-NVR (36) be the E-Nested VR update and define

Gt+1 i := (zt+1 i ) L(2) Dt+1 y,i fi(0)

We expand the bias by inserting Ei,pin,η|i[Gt+1 i ]

Et+1 bias = F(xt+1) E[Gt+1] 2 2 2 F(xt+1) Ei,pin,η, η|i[Gt+1 i ] 2 2 | {z } At+1 1

+2 Ei,pin,η, η|i[Gt+1 i ] E[Gt+1] 2 2 | {z } At+1 2

Consider At+1 1 . The term At+1 1 captures the difference between full gradient and extrapolated gradient

At+1 1 = Ei

(E η|i[ g η(xt)]) fi(E[gη(xt)]) Epin,η|i

(E η|i[ g η(xt)]) L(2) Dt+1 y,i fi(0)

" fi(Eη|i[gη(xt)]) Epin,η|i

L(2) Dt+1 y,i fi(0)

2C2 g Ei h fi(Eη|i[gη(xt)]) Epin[ fi(Eη|i[yt+1 i ])] 2 2

| {z } =:At+1 1,1

" Epin[ fi(Eη|i[yt+1 i ])] Epin,η|i

L(2) Dt+1 y,i fi(0)

| {z } =:At+1 1,2

The first term At+1 1,1 can be upper bounded through smoothness of fξ, for t 1

At+1 1,1 = Ei h fi(Eη|i[gη(xt)]) pin fi(Eη|i[gη(xt)]) (1 pin) fi(yt i + Eη|i[gη(xt) gη(ϕt i)]) 2 2

= (1 pin)2 Ei h fi(E[gη(xt)]) fi(yt i + Eη|i[gη(xt) gη(ϕt i)]) 2 2

(1 pin)2L2 f Ei h Eη|i[gη(xt)] (yt i + Eη|i[gη(xt) gη(ϕt i)]) 2 2

= (1 pin)2L2 f Ei[ yt i Eη|i[gη(ϕt i)] 2 2]

= (1 pin)2L2 f Et y.

For t = 0, A1 1,1 = 0, then

1 T PT 1 t=0 At+1 1,1 (1 pin)2L2 f C2 g 1 T PT 1 t=0 Et+1 y . (37)

On the other hand, with Lemma 6

At+1 1,2 pin Ei

" fi(Eη|i[gη(xt)]) Eη|i

L(2) Dt+1 y,S1,i fi(0)

+ (1 pin) Ei

fi(yt i + Eη|i[gη(xt) gη(ϕt i)]) Eη|i h L(2) Dy,S2,i fi(0) i 2

pin C2 e S4 1 + (1 pin)C2 e S4 2

where Dt+1 y,S1,i is the distribution of 1 S1 P

η S1 gη(xt) and Dt+1 y,S2,i is the distribution of

yt i + 1 S2 P η S2(gη(xt) E[gη(ϕt i)]).

Thus the At+1 1 has the following upper bound

1 T PT 1 t=0 At+1 1 (1 pin)2L2 f C2 g 1 T PT 1 t=0 Et+1 y + C2 e S4 2 . (38)

Consider At+1 2 . Let us expand At+1 2 through recursion

At+1 2 = Ei,pin,η, η|i[Gt+1 i ] E[Gt+1] 2 2

= (1 pout)2 Gt Ei[Eη, η|i[ Gt i]] 2

= (1 pout)2 E[Gt] Ei[Eη, η|i[ Gt i]] 2

= (1 pout)2 At 2 + Et var .

For t = 0, we have that A1 2 = 0, then average over time gives

1 T PT 1 t=0 At+1 2 (1 pout)2

pout 1 T PT 1 t=0 Et+1 var .

Therefore, the bias has the following bound

1 T PT 1 t=0 Et+1 bias (1 pin)2L2 f C2 g 1 T PT 1 t=0 Et+1 y + C2 e S4 2 + (1 pout)2

pout 1 T PT 1 t=0 Et+1 var .

Using Lemma 11

1 T PT 1 t=0 Et+1 bias (1 pin)2L2 f C2 g (1 pin)C2 g pin S2 1 T PT 1 t=0 Ξt + 2σ2 g S1

+ C2 e S4 2 + (1 pout)2

pout 1 T PT 1 t=0 Et+1 var

(1 pin)3 L2 F pin S2 1 T PT 1 t=0 Ξt + 2(1 pin)2 L2 F S1 + C2 e S4 2 + (1 pout)2

pout 1 T PT 1 t=0 Et+1 var .

Using Lemma 10 we have that

1 T PT 1 t=0 Et+1 bias (1 pin)3 L2 F pin S2

T PT 1 t=0 E[ Gt+1] 2 2

+ 2(1 pin)2 L2 F S1 + C2 e S4 2 + (1 pout)2

pout 1 T PT 1 t=0 Et+1 var

(1 pin)3 L2 F pin S2 6n2

T PT 1 t=0 E[Gt+1] 2 2 + 2(1 pin)2 L2 F S1 + C2 e S4 2

+ (1 pin)3 L2 F pin S2 6n2

B2 2 γ2 + (1 pout)2

1 T PT 1 t=0 Et+1 var .

Variance. Combine the variance of Nested VR in Lemma 13 and Lemma 2 gives

1 T PT 1 t=0 Et+1 var 14 32 pout

B1 + 1 pout

pin S2 18n2

B2 2 + (1 pout)

γ2L2 F T PT 1 t=0 E[Gt+1] 2 2

+ 14 96 pout

B1 + (1 pin)(1 pout)

L2 F S1 + (1 pout)

T 8 L2 F B1S1 .

Theorem 5 [E-Nested VR Convergence] Consider the (FCCO) problem. Under the same assumptions as Theorem 3.

If n = O(ε 2/3), then we choose the hyperaparameters of E-Nested VR (Algorithm 1) as B1 = B2 = n, pout = 1, S1 = L2 F ε 2, S2 = LF ε 1, pin = L 1 F ε, γ = O( 1

If n = Ω(ε 2/3), then we choose the hyperaparameters of E-Nested VR as B1 = n, B2 = n, pout = 1/ n, S1 = S2 = max n Ce Cgε 1/2, L2 F /(nε2) o , pin = 1, γ = O( 1

Then the output xs of E-Nested VR satisfies: E[ F(xs) 2 2] ε2, for nonconvex F with iterations

T = Ω LF (F(x0) F )ε 2 .

Proof: Denote. Gt+1 = Gt+1 E-NVR (36). Using descent lemma (Lemma 3) and bias-variance of E-Nested VR (Lemma 14)

1 T PT 1 t=0 F(xt) 2 2 + 1 2T PT 1 t=0 E[Gt+1] 2 2

2(F (x0) F )

T PT 1 t=0 Et+1 var + 1

T PT 1 t=0 Et+1 bias

2(F (x0) F )

γT + (1 pin)3 L2 F pin S2 6n2

T PT 1 t=0 E[Gt+1] 2 2 + 2(1 pin)2 L2 F S1 + C2 e S4 2

+ (1 pin)3 L2 F pin S2 6n2

B2 2 γ2 + (1 pout)2

pout + LF γ 1 T PT 1 t=0 Et+1 var .

As we would like the right-hand side to be bounded by either 1

T PT 1 t=0 E[Gt+1]] 2 2 or ε2.

Bound on 2(F (x0) F )

γT with ε2 , i.e.

γT (F(x0) F )ε 2 (39)

Coefficient of 1

T PT 1 t=0 E[Gt+1] 2 2 is bounded by 1

(1 pin)3 L2 F pin S2 6n2

which can be achieved by choosing the following step size

γ poutpin S1 5 LF (1 pin)3/2 . (40)

Bound on 2(1 pin)2 L2 F S1 with ε2

2(1 pin)2 L2 F S1 ε2. (41)

Bound C2 e S4 2 with ε2. This leads to

Bound on the variance. First notice from (40) and γ 1 2LF ,

(1 pin)3 L2 F pin S2 6n2

4 (1 pout)2

2 (1 pout)2

Therefore, we only need to consider the upper bound on

pout 1 T PT 1 t=0 Et+1 var

14 32 (1 pout)2

B1 + 1 pout

pin S2 18n2

B2 2 + (1 pout)

γ2L2 F T PT 1 t=0 E[Gt+1] 2 2

+ 14 96 (1 pout)2

B1 + (1 pin)(1 pout)

L2 F S1 + (1 pout)3

pout T 8 L2 F B1S1 .

We impose the constraints for each term

pin S2 18n2

B2 2 L2 F γ2 1

pout 1 pout

pin S2 18n2

B2 2 L2 F γ2 1

pout 1 pout

B2 L2 F γ2 1

pout (1 pin)(1 pout)

pout T 8 L2 F B1S1 ε2.

These can be simplified as

γ pinpout B1 S1 (1 pout) 1 pin 1 LF (43)

γ pinp2 out B1 S1 (1 pout)3/2 1 pin 1 LF (44)

γ B1 (1 pout)3/2 1 LF (45)

B1S1 (1 pout)2 L2 F ε2 (46)

B1S1 (1 pout)3(1 pin) L2 F ε2p2 out (47)

B1S1 (1 pout)3 L2 F T ε2pout . (48)

Constraints from Lemma 14

γ2L2 F max n (1 pin)

pin S2 18 B2 , 1 pout

pin S2 18n2

B2 2 , (1 pout)

which can be translated to

γ pin S1 B2 LF 1 pin (49)

γ pinpout S1 B2 LF 1 pin 1 pout (50)

γ B2 LF 1 pout (51)

Constraint from sufficient decrease lemma:

γ 1 2LF . (52)

We simplify the conditions noticing that 1) (48) is weaker than (46); 2) (45) and (51) are weaker than (52). Combine all the constraints on γ, i.e. (43), (44), (49), (50), (52)

γ 1 LF min n min n 1, pout 1 pout

o pinpout B1 S1 (1 pout) 1 pin 1 LF , min n 1, pout 1 pout

o pin S1 B2 1 pin , 1, poutpin S1 5 LF (1 pin)3/2 o .

This can be simplified as an upper bound

γ 1 LF min n pinpout

S1 1 pin , pinpout

S1 B1 1 pout , pinp2 out S1 B1 1 pin 1 pout , 1 o .

Now we consider two sets of hyperparameters depending on the size of n Case 1: For n = O(ε 2/3), we choose the following set of hyperparameters

B1 = B2 = n, pout = 1, S1 = L2 F ε 2, S2 = LF ε 1, pin = L 1 F ε.

Then we have γ 1 LF min{ pin S1 1 pin , 1} = 1 LF , we have the total sample complexity of

B2S2T = B2S2T γ

γ (39) = F (x0) F

γ = (F (x0) F )n LF LF

0 25000 50000 75000

Suboptimality

m=1 m=10 m=100

0 25000 50000 75000

m=1 m=10 m=100

Figure 4: Performance of BSGD vs. E-BSGD on the few-shot sinsuoid regression task.

Case 2: For n = Ω(ε 2/3), we choose the following set of hyperparameters

B1 = n, B2 = n, pout = 1 n.

In this case, (46) is stronger than (47) which requires S1 L2 F nε2

S1 = S2 = max n σ1/2 bias ε 1/2, σ2 in nε2 o , pin = 1

Then we have γ 1 LF min{ pout n S1 1 pout , 1} = 1 LF , we have the total sample complexity of

B2S2T = B2S2T γ

γ (39) = F (x0) F

γ = (F(x0) F ) max n σ1/2 bias ε2.5 , σ2 in nε4

By picking xs uniformly at random among {xt}T 1 t=0 , we get the desired guarantee.

F Missing Details from Section 5

F.1 Application of First-order MAML

Over the past few years, the MAML framework [11] has become quite popular for few-shot supervised learning and meta reinforcement learning tasks. The first-order Model-Agnostic Meta-Learning (MAML) can be formulated mathematically as follows:

min x Ei p,Diquery ℓi EDisupp(x α ℓi(x, Di supp)), Di query

where α is the step size, Di supp and Di query are meta-training and meta-testing data respectively and ℓi being the loss function of task i. Stated in the CSO framework, fξ(x) := ℓi(x, Di query) and gη(x, ξ) := x α ℓi(x, Di supp) where ξ = (i, Di query) and η = Di supp.

In this context, lots of popular choices for fξ are smooth. For illustration purposes, we now discuss a widely used sine-wave few-shot regression task as appearing from the work of Finn et al. [11], where the goal is to do a few-shot learning of a sine wave, A sin(t ϕ), using a neural network Φx(t) with smooth activations, where A and ϕ represent the unknown amplitude and phase, and x denotes the model weight. Each task i is characterized by (Ai, ϕi, Di query). In the first-order MAML training process, we randomly select a task i, and draw training data η = Di supp. Define the loss function

for a given dataset D as ℓi(Φx; D) = 1

2 Et D Ai sin(t ϕi) Φx(t) 2 2. We then establish the outer function fi(x) = ℓi(Φx; Di query) and inner function gη(x) = x α xℓi(Φx; Di supp). As fi is smooth, our results are applicable.

In Figure 4, we show the results of BSGD and E-BSGD applied to this problem. In this experiment, the amplitude A is drawn from a uniform distribution U(0.1, 5) and the phase ϕ is drawn from U(0, π). Both Dsupp and Dquery are independently drawn from U( 5, 5). The step size is set to

α = 0.01. The batch size is fixed to 10. The performances of BSGD and E-BSGD are very close. This is not surprising because finetuning step size α is chosen to be small which significantly reduces the variance of gη, making the bias of meta gradient to be very small (O(α2)). Therefore, we observe similar performance of BSGD and E-BSGD. Similar trend also holds for BSpider Boost and Nested VR compared to their extrapolated variants.

F.2 Application of Deep Average Precision Maximization

The areas under precision-recall curve (AUPRC) has an unbiased point estimator that maximizes average precision (AP) [26, 34]. Let S+ and S be the set of positive and negative samples and S = S S+. Let hw( ) be a classifier parameterized with w and ℓbe a surrogate function, such as logistic or sigmoid. A smooth surrogate objective for maximizing average precision can be formulated as [33]:

x S+ ℓ(hw(x) hw(xi)) P

x S ℓ(hw(x) hw(xi))

This problem can be seen as a conditional stochastic optimization problem with gi(w) = [P

x S+ ℓ(hw(x) hw(xi)), P

x S ℓ(hw(x) hw(xi))] and fi : R R\{0} R is defined

as fi(y) = [y]1

[y]2 where [y]k denotes the kth coordinate of a vector y R R\{0}. During the stochastic optimization of this objective, we draw uniformly at random ξ := xi (drawn from the set S+) as a positive sample and η|ξ = [Fx1, Fx2] where set x1 is drawn uniformly at random from S+ and x2 is drawn uniformly at random from S and functional Fx(w) := ℓ(hw(x) hw(xi)). Note that fi C is smooth with gradient

" 1 [y]2 [y]1 ([y]2)2

Therefore, our results from Sections 3 and 4 again apply.

F.3 Necessity of Additional Smoothness Conditions

Throughout the paper, we assume bounded moments (Assumption 1) and a smoothness condition (Assumption 2) to derive our extrapolation technique. However, it is worth noting that the technique itself does not explicitly depend on higher-order derivatives. Our theoretical framework does not address the behavior of extrapolation in the absence of these smoothness constraints. In this section, we investigate the application of extrapolation to two non-smooth functions:

Re LU function given by q(x) = max{x, 0};

Perturbed quadratics represented as q(x) = x2/2 + Triangle Wave(x) + 1. The function Triangle Wave(x) has a period of 2 and spans the range [-1,1], defined as:

Triangle Wave(x) = 2 2 x

Visual representations of these functions can be found in Figure 5c. We set s = 0 and consider a random variable δ N(10, 100) with m = 1. We then apply first-, second-, and third-order extrapolation. The outcomes are depicted in Figure 5. Remarkably, both the Re LU and the perturbed quadratic functions do not conform to the differentiability assumptions inherent to our stochastic extrapolation schemes. Nonetheless, as indicated by Figure 5a and Figure 5b, our proposed secondand third-order extrapolation techniques yield a superior approximation of q(E[δ]).

Number of estimates

|q(s+ [ ]) Avg( ( )q(s))|

Number of estimates

|q(s+ [ ]) Avg( ( )q(s))|

Perturbed quadratics

5.0 2.5 0.0 2.5 5.0 x

Perturbed quad. Quadratic Re LU

Figure 5c Figure 5: (a) Fig. 5a: Error in estimating q(s + E[δ]) for our proposed first-, second-, and third-order extrapolation schemes applied to Re LU q(x) = max{x, 0}, s = 0, δ N(10, 100), m = 1. (b) Fig 5b: Error in estimating q(s + E[δ]) for our proposed first-, second-, and third-order extrapolation schemes applied to a perturbed quadratic q(x) = x2/2+Triangle Wave(x)+1, s = 0, δ N(10, 100), m = 1. The Triangle Wave(x) has a period of 2 and spans the range [-1,1], i.e. 2|2 x

2 | 1. (c) Fig 5c: The Re LU and perturbed quadratic used in the Fig. 5a and 5b along with quadratic curves.