Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction

Zijian Liu*1, Ta Duy Nguyen*1, Alina Ene1, Huy L. Nguyen2

*Equal contribution. 1Department of Computer Science, Boston University. 2Khoury College of Computer and Information Science, Northeastern University. Correspondence to: Ta Duy Nguyen. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

In this paper, we study the finite-sum convex optimization problem, focusing on the general convex case. Recently, the study of variance reduced (VR) methods and their accelerated variants has made exciting progress. However, the step size used in existing VR algorithms typically depends on the smoothness parameter, which is often unknown and requires tuning in practice. To address this problem, we propose two novel adaptive VR algorithms: Adaptive Variance Reduced Accelerated Extra-Gradient (AdaVRAE) and Adaptive Variance Reduced Accelerated Gradient (AdaVRAG). Our algorithms do not require knowledge of the smoothness parameter. AdaVRAE uses O(n log log n + √(nβ/ϵ)) and AdaVRAG uses O(n log log n + √(nβ log β/ϵ)) gradient evaluations to attain an O(ϵ)-suboptimal solution, where n is the number of functions in the finite sum and β is the smoothness parameter. This result matches the best-known convergence rate of non-adaptive VR methods, and it improves upon the convergence of the state-of-the-art adaptive VR method, AdaSVRG. We demonstrate the superior performance of our algorithms compared with previous methods in experiments on real-world datasets.

1. Introduction

In this paper, we consider the finite-sum optimization problem of the form

    min_{x ∈ X} F(x) = (1/n) Σ_{i=1}^n f_i(x) + h(x),    (1)

where each function f_i is convex and β-smooth, h is convex and potentially nonsmooth but admits an efficient proximal operator, and X ⊆ R^d is a closed convex set. Additionally, we further assume that X is compact when β is unknown.
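To make the template (1) concrete, the following is a minimal numpy sketch of a finite-sum objective together with one example of a nonsmooth h that admits an efficient proximal operator (the ℓ1 norm, with X = R^d); the helper names are illustrative and are not part of the paper.

```python
import numpy as np

def make_finite_sum(A, b):
    """Least-squares components f_i(x) = 0.5*(a_i^T x - b_i)^2; f is their average."""
    n = A.shape[0]
    f = lambda x: 0.5 * float(np.mean((A @ x - b) ** 2))
    grad_f = lambda x: A.T @ (A @ x - b) / n          # full gradient of f
    grad_fi = lambda i, x: A[i] * (A[i] @ x - b[i])   # gradient of a single component f_i
    return f, grad_f, grad_fi

def prox_l1(v, gamma):
    """Proximal operator of h = gamma * ||.||_1 with X = R^d, i.e.
    argmin_x gamma*||x||_1 + 0.5*||x - v||^2 (soft-thresholding); an example of an
    'efficient prox' in the sense used above."""
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)
```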
Problem (1) has found a wide range of applications in machine learning, typically in empirical risk minimization problems, and has been extensively studied in the past few years. Among existing approaches to solving this problem, variance reduced (VR) methods (Johnson & Zhang, 2013; Defazio et al., 2014; Schmidt et al., 2017; Roux et al., 2012) have recently shown significant improvement over classic stochastic gradient methods such as stochastic gradient descent (SGD) and its variants. For example, in strongly convex problems, VR methods such as (Allen-Zhu, 2017; Lan et al., 2019; Lin et al., 2015) can achieve the optimal number of gradient evaluations of O((n + √(nκ)) log(1/ϵ)) to attain an O(ϵ)-suboptimal solution, where κ is the condition number, which improves over full-batch gradient descent (O(nκ log(1/ϵ))) and Nesterov's accelerated gradient descent (Nesterov, 1983; 2003) (O(n√κ log(1/ϵ))). For general convex problems, the current state-of-the-art VR method, namely VRADA (Song et al., 2020), can find an O(ϵ)-suboptimal solution using O(n log log n + √(nβ/ϵ)) gradient evaluations, which nearly matches the lower bound (Woodworth & Srebro, 2016). However, most existing VR gradient methods have the same limitation as classic gradient methods; that is, they require prior knowledge of the smoothness parameter in order to set the step size. Lacking this information, one may have to carefully perform hyperparameter tuning to avoid the situation where the algorithm diverges or converges too slowly due to a step size that is too large or too small.

This limitation of gradient methods motivates the development of methods that aim to adapt to unknown problem structures. A notable line of work, starting with the influential AdaGrad algorithm, has designed a family of gradient descent based methods that set the step size based on the gradients or iterates observed in previous iterations (McMahan & Streeter, 2010; Duchi et al., 2011; Kingma & Ba, 2014; Levy, 2017; Levy et al., 2018; Bach & Levy, 2019; Cutkosky, 2019; Kavis et al., 2019; Joulani et al., 2020; Ene et al., 2021; Antonakopoulos et al., 2021; Ene & Nguyen, 2021). Remarkably, these works have shown that, in the setting where we have access to the exact full gradient in each iteration, it is possible to match the convergence rates of both unaccelerated and accelerated gradient descent methods without any prior knowledge of the smoothness parameter. These methods have also been analyzed in the stochastic setting under a bounded variance assumption, and they achieve a convergence rate that is comparable to that of SGD.

Table 1. Our results and comparison with prior works.

| Algorithm | General convex | Adaptive |
| --- | --- | --- |
| SVRG (Johnson & Zhang, 2013) | - | No |
| SVRG++ (Allen-Zhu & Yuan, 2016) | O(n log(β/ϵ) + β/ϵ) | No |
| Katyusha (Allen-Zhu, 2017) | O(n log(β/ϵ) + √(nβ/ϵ)) | No |
| VARAG (Lan et al., 2019) | O(n min{log(β/ϵ), log n} + √(nβ/ϵ)) | No |
| VRADA (Song et al., 2020) | O(n min{log log(β/ϵ), log log n} + √(nβ/ϵ)) | No |
| AdaSVRG (Dubois-Taine et al., 2021) | O(nβ/ϵ) (fixed-size inner loop, only if ϵ = Ω(β/n)); O(n log(β/ϵ) + β/ϵ) (multi-stage) | Yes |
| AdaVRAE (unknown β) (this paper) | O(n min{log log(β/ϵ), log log n} + √(nβ/ϵ)) | Yes |
| VRAE (known β) (this paper) | O(n min{log log(β/ϵ), log log n} + √(nβ/ϵ)) | No |
| AdaVRAG (unknown β) (this paper) | O(n min{log log(β log β/ϵ), log log n} + √(nβ log β/ϵ)) | Yes |
| VRAG (known β) (this paper) | O(n min{log log(β/ϵ), log log n} + √(nβ/ϵ)) | No |
| Lower bound (Woodworth & Srebro, 2016) | Ω(n + √(nβ/ϵ)) | - |

Given the theoretical and practical success of adaptive methods, it is natural to ask whether one can design VR methods that achieve state-of-the-art convergence guarantees without any prior knowledge of the smoothness parameter. The recent work of (Dubois-Taine et al., 2021) gives the first adaptive VR method, AdaSVRG, with a gradient complexity of O(n log(β/ϵ) + β/ϵ). AdaSVRG builds on the AdaGrad (Duchi et al., 2011) and SVRG (Johnson & Zhang, 2013) algorithms, both of which are not accelerated.

Our contributions: In this work, we take this line of work further and design the first accelerated VR methods that do not require any prior knowledge of the smoothness parameter. Our algorithms, Adaptive Variance Reduced Accelerated Extra-Gradient (AdaVRAE) and Adaptive Variance Reduced Accelerated Gradient (AdaVRAG), use only O(n log log n + √(nβ/ϵ)) and O(n log log n + √(nβ log β/ϵ)) gradient evaluations, respectively, to attain an O(ϵ)-suboptimal solution when β is unknown, both of which significantly improve the convergence rate of AdaSVRG. Table 1 compares our algorithms and prior VR methods, and Section 2 discusses our algorithmic approaches and techniques. The convergence rate of AdaVRAE matches, up to constant factors, the best-known convergence rate of non-adaptive VR methods (Song et al., 2020; Joulani et al., 2020). Both of our algorithms follow a different approach from these methods and are based on extra-gradient and mirror descent, instead of dual averaging.
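Since our methods are built on the extra-gradient and mirror descent templates rather than dual averaging, the following is a minimal sketch of a plain projected extra-gradient step, included only as background; it is not the AdaVRAE update itself, and the function names are illustrative.

```python
import numpy as np

def extragradient_step(z, grad, step, project):
    """One classical projected extra-gradient step:
    take a look-ahead step using the gradient at z, then update z
    using the gradient evaluated at the look-ahead point."""
    x_half = project(z - step * grad(z))       # look-ahead (exploration) point
    z_next = project(z - step * grad(x_half))  # update with the look-ahead gradient
    return x_half, z_next

# usage sketch: project(v) should return the Euclidean projection of v onto the feasible set X
```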
We demonstrate the efficiency of our algorithms in practice on multiple real-world datasets. We show that AdaVRAG and AdaVRAE are competitive with existing standard and adaptive VR methods while having the advantage of not requiring hyperparameter tuning, and in many cases AdaVRAG outperforms these benchmarks.

1.1. Related work

Variance reduced gradient methods: The variance reduction technique (Roux et al., 2012; Schmidt et al., 2017; Shalev-Shwartz & Zhang, 2013; Mairal, 2013; Johnson & Zhang, 2013; Defazio et al., 2014) was proposed to improve the convergence rate of stochastic gradient descent algorithms for the finite-sum problem and has since become widely used in many successful algorithms. Notable improvements can be seen in strongly convex optimization problems, where the earliest algorithms such as SVRG (Johnson & Zhang, 2013) or SAGA (Defazio et al., 2014) obtain an O((n + κ) log(1/ϵ)) convergence rate, improving over the substantially slower rate of plain SGD; the latter additionally requires the variance to be bounded by σ², i.e., E_i[‖∇f_i(x) − ∇f(x)‖²] ≤ σ². However, these non-accelerated methods do not achieve the optimal convergence rate. Recent works such as (Lin et al., 2015; Allen-Zhu, 2017; Lan et al., 2019) focus on designing accelerated methods and successfully match the optimal lower bound for strongly convex optimization of Ω((n + √(nκ)) log(1/ϵ)) given by (Lan & Zhou, 2018). In non-strongly convex problems, however, existing works do not yet match the lower bound of Ω(n + √(nβ/ϵ)) in (Woodworth & Srebro, 2016). The best effort so far can be found in the line of accelerated methods started by (Allen-Zhu, 2017) and followed by (Allen-Zhu, 2018; Lan et al., 2019; Li, 2021), which rely on incorporating the checkpoint in each update. AdaVRAG follows the same idea but offers a simpler update and a more efficient choice of coefficients, resulting in a better convergence rate, equal to that of VRADA (Song et al., 2020). By comparison, while VRADA is a dual-averaging scheme, AdaVRAG is a mirror descent method and AdaVRAE is an extra-gradient algorithm. In a different line of research (Allen-Zhu & Hazan, 2016; Fang et al., 2018; Zhou et al., 2018), variance reduction has been applied to non-convex optimization to find critical points with much better convergence rates.

Adaptive methods with variance reduction: There has been extensive research on adaptive methods (Duchi et al., 2011; Kingma & Ba, 2014; Reddi et al., 2018; Tieleman et al., 2012; Dozat, 2016) in the setting where we compute a full gradient in each iteration. However, there are only a few works combining adaptive methods with VR techniques in the finite-sum setup. Most relevant for our work is AdaSVRG (Dubois-Taine et al., 2021). This algorithm is built upon SVRG, which, as mentioned earlier, is a non-accelerated method and has a slower convergence rate. AdaSVRG uses the gradient norm to update the step size, similarly to (Duchi et al., 2011), and the step size is reset in every epoch, which could lead to step sizes that are too large in later stages. In contrast, both AdaVRAG and AdaVRAE are accelerated VR methods and use a cumulative step size. AdaVRAG uses the iterate movement to update the step size, as in (Bach & Levy, 2019; Ene et al., 2021). AdaVRAE improves the convergence rate by a log β factor by using the gradient difference, similarly to (Mohri & Yang, 2016; Joulani et al., 2020; Ene & Nguyen, 2021).
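Throughout the related work above, the common building block is the SVRG-style variance-reduced gradient estimator, sketched below for reference (illustrative helper names; `grad_fi(i, x)` is assumed to return ∇f_i(x)).

```python
import numpy as np

def vr_gradient(grad_fi, grad_f_u, u, x, i):
    """SVRG-style estimator g = grad f_i(x) - grad f_i(u) + grad f(u), where u is the
    checkpoint and grad_f_u = grad f(u) is computed once per epoch. The estimator is
    unbiased for grad f(x), and its variance shrinks as x and u approach the optimum."""
    return grad_fi(i, x) - grad_fi(i, u) + grad_f_u
```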
In a different direction, (Xu et al., 2017) propose a VR algorithm that adapts to an unknown growth parameter instead of the smoothness parameter. A different line of work considers VR methods that set the step size using a stochastic line search (Schmidt et al., 2017; Mairal, 2013) or the Barzilai-Borwein step size (Tan et al., 2016; Li et al., 2020). The former methods do not have theoretical guarantees, and the latter methods require knowledge of the smoothness parameter in order to obtain theoretical bounds. Recent works design variance-reduced methods for non-convex optimization. STORM (Cutkosky & Orabona, 2019) and STORM+ (Levy et al., 2021) design an adaptive step size, though the former still requires the smoothness parameter in the step size. Super-Adam (Huang et al., 2021) also requires its parameters to satisfy an inequality involving the smoothness parameter, like STORM.

1.2. Notation and problem setup

Let [n] denote the set {1, 2, ..., n}. For simplicity, we only consider the Euclidean norm ‖·‖ := ‖·‖_2 (our work can be extended to ‖x‖_A := √(x^⊤ A x) for any A ≻ 0 with almost no change). (x)_+ represents max{x, 0}. We are interested in solving the following problem:

    min_{x ∈ X} {F(x) = f(x) + h(x)},

where f(x) := (1/n) Σ_{i=1}^n f_i(x) and, for i ∈ [n], f_i : R^d → R and h : X → R are convex functions, with X ⊆ R^d a closed convex set. Let x* = arg min_{x ∈ X} F(x). We say a function G is β-smooth if ‖∇G(x) − ∇G(y)‖ ≤ β‖x − y‖ for all x, y ∈ R^d. Equivalently, we have G(y) ≤ G(x) + ⟨∇G(x), y − x⟩ + (β/2)‖y − x‖². In this paper we always assume that each f_i is β-smooth, which implies that f is also β-smooth. We assume that we can efficiently solve optimization problems of the form arg min_{x ∈ X} {γ h(x) + (1/2)‖x − v‖²}, where γ ≥ 0 and v ∈ R^d. When the smoothness parameter β is unknown, we additionally assume that X is compact with diameter D, i.e., sup_{x,y ∈ X} ‖x − y‖ ≤ D.

2. Our algorithms and convergence guarantees

In this section, we describe our algorithms and state their convergence guarantees. Our algorithm AdaVRAE, shown in Algorithm 1, is a novel accelerated scheme that uses past extra-gradient update steps in the inner loop and a novel averaging to achieve acceleration. In each inner iteration, the

Algorithm 1 AdaVRAE
Input: initial point u^(0), domain diameter D. Parameters: {a^(s)}, {T_s}, A^(0)_{T_0} > 0, η > 0.
x^(1)_0 = z^(1)_0 = u^(0); compute ∇f(u^(0))
Initialize γ^(1)_0 = γ, where γ is any small constant
for s = 1 to S:
    A^(s)_0 = A^(s−1)_{T_{s−1}} − T_s (a^(s))²
    for t = 1 to T_s:
        x̄^(s)_t = arg min_{x ∈ X} { a^(s) ⟨g^(s)_{t−1}, x⟩ + a^(s) h(x) + (γ^(s)_{t−1}/2) ‖x − z^(s)_{t−1}‖² }
        Let A^(s)_t = A^(s)_{t−1} + a^(s) + (a^(s))²
        x^(s)_t = (1/A^(s)_t) (A^(s)_{t−1} x^(s)_{t−1} + a^(s) x̄^(s)_t + (a^(s))² u^(s−1))
        if t ≠ T_s:
            Pick i^(s)_t ∼ Uniform([n])
            g^(s)_t = ∇f_{i^(s)_t}(x̄^(s)_t) − ∇f_{i^(s)_t}(u^(s−1)) + ∇f(u^(s−1))
        else:
            g^(s)_t = ∇f(x̄^(s)_t)
        γ^(s)_t = (1/η) √( η² (γ^(s)_{t−1})² + (a^(s))² ‖g^(s)_t − g^(s)_{t−1}‖² )
        z^(s)_t = arg min_{z ∈ X} { a^(s) ⟨g^(s)_t, z⟩ + a^(s) h(z) + (γ^(s)_{t−1}/2) ‖z − z^(s)_{t−1}‖² + ((γ^(s)_t − γ^(s)_{t−1})/2) ‖z − x̄^(s)_t‖² }
    u^(s) = x^(s+1)_0 = x^(s)_{T_s}, z^(s+1)_0 = z^(s)_{T_s}, g^(s+1)_0 = g^(s)_{T_s}, γ^(s+1)_0 = γ^(s)_{T_s}
return u^(S)

Algorithm 2 AdaVRAG
Input: initial point u^(0), domain diameter D. Parameters: {a^(s)} with a^(s) ∈ (0, 1), {q^(s)}, {T_s}, η > 0.
x(1) 0 = u(0) Initialize γ(1) 0 = γ, where γ is any small constant for s = 1 to S: x(s) 0 = a(s)x(s) 0 +(1 a(s))u(s 1), compute f(u(s 1)) for t = 1 to Ts: Pick i(s) t Uniform ([n]) g(s) t = fi(s) t (x(s) t 1) fi(s) t (u(s 1)) + f(u(s 1)) x(s) t = arg minx X n D g(s) t , x E + h(x) + γ(s) t 1q(s) 2 x x(s) t 1 2 x(s) t = a(s)x(s) t + (1 a(s))u(s 1) Option I: γ(s) t = γ(s) t 1 x(s) t x(s) t 1 2 Option II: γ(s) t = γ(s) t 1 + x(s) t x(s) t 1 2 η2 u(s) = 1 Ts PTs t=1 x(s) t , x(s+1) 0 = x(s) Ts , γ(s+1) 0 = γ(s) Ts return u(S) new average iterate x(s) t is obtained by combining the old average iterate x(s) t 1, the new iterate x(s) t and the checkpoint u(s 1) with coefficients A(s) t 1, a(s) and a(s) 2 normalized by the sum of them, i.e by A(s) t = A(s) t 1 + a(s) + a(s) 2. We will explain the intuition behind this choice of coefficients in the analysis outline. At the beginning of each epoch, we set A(s) 0 = A(s 1) Ts 1 Ts a(s) 2 so that at the end, we only accumulate the coefficients of the new iterates x(s) t . Ada VRAE adaptively sets the step sizes based on the stochastic gradient difference. Our choice of step sizes is a novel adaptation to the VR setting of the step sizes used by the works (Mohri & Yang, 2016; Kavis et al., 2019; Joulani et al., 2020; Ene & Nguyen, 2021) in the batch/full-gradient setting. Our algorithm builds on the work (Ene & Nguyen, 2021), which provides an unaccelerated past extra-gradient algorithm in the batch/full-gradient setting. Theorem 2.1 states the parameter choices and the convergence guarantee for Ada VRAE, and we give its proof in Section A in the appendix. The convergence rate of Ada VRAE matches up to constant factors the rate of the state of the art non-adaptive VR methods (Joulani et al., 2020; Song et al., 2020). The initial step size γ(1) 0 can be set to any small constant γ, which in practice we choose γ = 0.01. Similarly to Ada Grad, setting η = Θ(D) gives us the optimal dependence of the convergence rate in the domain diameter. For simplicity, we state the convergence in Theorem 2.1 and 2.2 when η = Θ(D). We refer the reader to Theorems A.1 and B.1 in the appendix for the precise choice of parameters as well as the full dependence of the convergence rate on arbitrary choices of γ and η. In both Theorem 2.1 and 2.2, we measure convergence using the number of individual gradient evaluations fi, assuming that the exact computation of f takes n gradient evaluations. Theorem 2.1. (Convergence of Ada VRAE) Define s0 = log2 log2 4n , c = 3 2. Suppose we set the parameters of Algorithm 1 as follows: ( (4n) 0.5s 1 s s0 s s0 1+c 2c s0 < s , A(0) T0 = 5 Suppose that X is a compact convex set with diameter D and we set η = Θ(D). The number of individual gradient evaluations to achieve a solution u(S) such that E F(u(S)) F(x ) ϵ for Algorithm 1 is O n log log V1 O n log log n + q where V1 = O F(u(0)) F(x ) + (γ + β) D2 . Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction Our algorithm Ada VRAG is shown in Algorithm 2. Compared with Ada VRAE, Ada VRAG has a worse dependence on the smoothness parameter β but it performs only one projection onto X in each inner iteration. Additionally, as we discuss in more detail below, it uses adaptive step sizes based on the iterate movement. Ada VRAG follows a similar framework to existing VR methods such as VARAG (Lan et al., 2019) and VRADA (Song et al., 2020). Similarly to VRADA, the algorithm achieves acceleration at the epoch level, where an epoch is an iteration of the outer loop. 
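For orientation, the two-loop epoch structure shared by these VR methods can be sketched as below; this is a deliberately simplified skeleton (plain inner steps, no averaging, no adaptive coefficients), not the actual AdaVRAE or AdaVRAG updates given in Algorithms 1 and 2.

```python
import numpy as np

def vr_epoch_skeleton(u0, n, num_epochs, grad_f, grad_fi, inner_step, seed=0):
    """Generic SVRG-style outer/inner loop: each epoch recomputes the full gradient
    at the checkpoint u, then runs n inner steps driven by variance-reduced gradients."""
    rng = np.random.default_rng(seed)
    u = np.array(u0, dtype=float)
    for _ in range(num_epochs):
        mu = grad_f(u)                              # full gradient at the checkpoint
        x = u.copy()
        for _ in range(n):
            i = int(rng.integers(n))                # uniform component index
            g = grad_fi(i, x) - grad_fi(i, u) + mu  # variance-reduced estimator
            x = inner_step(x, g)                    # e.g. a projected / prox gradient step
        u = x                                       # next checkpoint (no averaging here)
    return u
```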
The iterations in an epoch update the main iterates via mirror descent with novel choices of step sizes and coefficients. The stochastic gradient is computed at a point that is a convex combination between the current iterate and the checkpoint; the coefficients of this combination remain fixed throughout the epoch. The step sizes are adaptively set based on the iterate movement. The structure of the inner iterations of our algorithm differs from both VARAG and VRADA in several notable aspects. VARAG also uses mirror descent to update the main iterates and it computes the stochastic gradient at suitable combinations of the iterates and the checkpoint. Ada VARAG uses a different averaging of the iterates to compute the snapshots. Moreover, it uses a very different and simpler choice for the coefficient used to combine the main iterates and the checkpoint in order to obtain the points at which the stochastic gradients are evaluated. In VARAG, this coefficient is set to a constant (namely, 1/2) in the initial iterations, whereas in Ada VRAG, it starts from a small number and is increased gradually. This choice is critical for improving the first term in the convergence from O(n log n) to O(n log log n). In a similar manner, VRADA attains the same convergence by a new choice of coefficient. However, this is achieved via a very different approach based on dual-averaging. The step sizes used by Ada VRAG have two components: the step γ(s) t that is updated based on the iterate movement and the per-epoch coefficient q(s) to achieve acceleration at the epoch level. Our analysis is flexible and allows the use of several approaches for updating the steps γ(s) t . One approach, shown as option I in Algorithm 2, is based on the multiplicative update rule of Ada Grad+ (Ene et al., 2021) which generalizes the Ada Grad update to the constrained setting. We also propose a different variant, shown as option II, that updates the steps in an additive manner. Our analysis shows a similar convergence guarantee for both options, with the main difference being in the dependence on the smoothness: option I incurs a dependence of β log β, whereas option II has a worse dependence of β. Option II achieved improved performance in our experiments. Theorem 2.2 states the parameter choices and the convergence guarantee for Ada VRAG, and we give its proof in Section B in the appendix. Analogously to Ada VRAE, the initial step size γ can be set to any small constant. Theorem 2.2. (Convergence of Ada VRAG) Define s0 = log2 log2 4n , c = 3+ 33 4 . Suppose we set the parameters of Algorithm 2 as follows: ( 1 (4n) 0.5s 1 s s0 c s s0+2c s0 < s , 1 (1 a(s))a(s) 1 s s0 8(2 a(s))a(s) 3(1 a(s)) s0 < s , Suppose that X is a compact convex set with diameter D and we set η = Θ(D). Additionally, we assume that 2η2 > D2 if Option I is used for setting the step size. The number of individual gradient evaluations to achieve a solution u(S) such that E F(u(S)) F(x ) ϵ for Algorithm 2 is O n log log V2 O n log log n + q O F(u(0)) F(x ) + γ + β log β γ D2 for Option I O F(u(0)) F(x ) + γ + β2 D2 for Option II Comparison to Ada SVRG: As noted in the introduction, the state of the art adaptive VR method is the Ada SVRG algorithm (Dubois-Taine et al., 2021), which is a nonaccelerated method. Both of our algorithms achieve a faster convergence using different approaches and step sizes. Ada SVRG resets the step sizes in each epoch, whereas our algorithms use a cumulative update approach for the step sizes. 
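To make the cumulative step-size updates concrete, here are minimal sketches of the two statistics involved: the gradient-difference recursion used by AdaVRAE (as reconstructed from the analysis in Appendix A) and the additive Option II iterate-movement rule of AdaVRAG. The multiplicative Option I rule is analogous but is omitted here, and all constants should be read as illustrative rather than as the paper's exact parameterization.

```python
import numpy as np

def adavrae_gamma(gamma_prev, a, g_t, g_prev, eta):
    """Gradient-difference update (AdaVRAE style):
    gamma_t = (1/eta) * sqrt(eta^2 * gamma_{t-1}^2 + a^2 * ||g_t - g_{t-1}||^2)."""
    diff_sq = float(np.dot(g_t - g_prev, g_t - g_prev))
    return np.sqrt((eta * gamma_prev) ** 2 + (a ** 2) * diff_sq) / eta

def adavrag_gamma_option2(gamma_prev, x_t, x_prev, eta):
    """Iterate-movement update (AdaVRAG, Option II, additive):
    gamma_t = gamma_{t-1} + ||x_t - x_{t-1}||^2 / eta^2."""
    move_sq = float(np.dot(x_t - x_prev, x_t - x_prev))
    return gamma_prev + move_sq / eta ** 2
```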
In our experimental evaluation, the resetting of the step sizes led to slower convergence. Ada SVRG (multi-stage variant) uses varying epoch lengths similarly to SVRG++ (Allen-Zhu & Yuan, 2016), whereas our algorithms use epoch lengths that are set to n. Using an epoch of length n allows for implementing the random sampling via a random permutation of [n] and is the preferred approach in practice. Both our algorithms and Ada SVRG require that the domain X has bounded diameter. This is a restriction that is shared by almost all existing adaptive methods. Recent work (Antonakopoulos et al., 2021; Ene & Nguyen, 2021) in the batch/full-gradient setting have proposed unaccelerated methods that are suitable for unbounded domains, at a loss of additional factors in the convergence. All of the existing accelerated methods require that the domain is bounded, Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction even in the batch/full-gradient setting. We note that our analysis holds for arbitrary compact domains, whereas the analysis of Ada SVRG only applies to domains that contain the global optimum. Similarly to Ada Grad, both our algorithms and Ada SVRG can be used in the unconstrained setting under the promise that the iterates do not move too far from the optimum. Non-adaptive variants of our algorithms: In the setting where the smoothness parameter is known, we can set the step sizes of our algorithms based on the smoothness, as shown in Algorithms 3 and 4 (Sections C and D in the appendix). Both algorithms match the convergence rates of the state of the art VR methods (Joulani et al., 2020; Song et al., 2020) using different algorithmic approaches based on mirror descent and extra-gradient instead of dual-averaging. We experimentally compare the non-adaptive algorithms to existing methods in Section E of the appendix. 2.1. Analysis outline We outline some of the key steps in the analysis of Ada VRAE. For the purpose of simplicity, we assume h = 0 and η = D. Starting from the observation x(s) t = 1 A(s) t A(s) t 1x(s) t 1 + a(s)x(s) t + a(s) 2 u(s 1) and A(s) t = A(s) t 1 + a(s) + a(s) 2, we have x(s) t x(s) t 1 = a(s) x(s) t x(s) t + u(s 1) x(s) t which allows us to carry out the analysis for the function progress in one iteration, i.e, f(x(s) t ) f(x(s) t 1) and obtain E A(s) t a(s) 2 f(x(s) t ) f(x ) A(s) t 1 f(x(s) t 1) f(x ) i a(s) D g(s) t , x(s) t x E | {z } stochastic regret + E a(s) 2 D f(x(s) t ), u(s 1) x(s) t E " A(s) t 1 2β f(x(s) t ) f(x(s) t 1) 2 # By building on the standard analysis of the stochastic regret for extra-gradient methods, we obtain the following result for the progress of one iteration: E A(s) t a(s) 2 f(x(s) t ) f(x ) A(s) t 1 f(x(s) t 1) f(x ) i z(s) t 1 x 2 γ(s) t 2 z(s) t x 2 # " γ(s) t γ(s) t 1 2 x(s) t x 2 # + E a(s) 2 D f(x(s) t ), u(s 1) x(s) t E g(s) t g(s) t 1 2 # " A(s) t 1 2β f(x(s) t ) f(x(s) t 1) 2 # | {z } gain In comparison to the standard analysis, the coefficient for the checkpoint appears in the coefficient of f(x(s) t ) f(x ), which becomes A(s) t a(s) 2 instead of the usual A(s) t , making the sum not telescope immediately. To resolve this, we first turn our attention to the analysis of the stochastic gradient difference g(s) t g(s) t 1 2 . The key idea is to split (a(s)) 2 g(s) t g(s) t 1 2 into 1 2γ(s) t 1 16β a(s) 2 g(s) t g(s) t 1 2 + 16β g(s) t g(s) t 1 2 , and bound each term in turn. 
For the first term, we build on the techniques from prior work in the batch/full-gradient setting (Ene & Nguyen, 2021) when taking the sum over the iterations and epochs. For intuition, part of the gain 1 16β a(s) 2 g(s) t g(s) t 1 2 will be used to cancel out the term E γ(s) t γ(s) t 1 2 x(s) t x 2 . We then only need to consider the time before γ(s) t goes above O(β), and notice that a(s) 2 g(s) t g(s) t 1 2 = γ(s) t 2 γ(s) t 1 2 , thus we can upperbound the first term via the last γ(s) t that is still small than O(β). We provide more details below. For the second term, we use Young s in- equality to write E g(s) t g(s) t 1 2 E 4 f(x(s) t ) g(s) t 2 + 4 f(x(s) t 1) g(s) t 1 2 + Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction E 2 f(x(s) t ) f(x(s) t 1) 2 . The gradient difference loss term is cancelled by the gain term in (2), and thus we can focus on the first two variance terms. We apply the usual variance reduction technique put forward by (Lan et al., 2019) (see Lemma A.2) to bound the two variance terms, as follows: E g(s) t f(x(s) t ) 2 E h 2β f(u(s 1)) f(x(s) t ) D f(x(s) t ), u(s 1) x(s) t E i . Thus we obtain an upper bound on (a(s)) 2 16β g(s) t g(s) t 1 2 in terms of a(s) 2 f(u(s 1)) f(x(s) t ) . This is the rea- son for setting the coefficient for the checkpoint to a(s) 2, so that the LHS of (2) can become the usual telescoping sum A(s) t f(x(s) t ) f(x ) A(s) t 1 f(x(s) t 1) f(x ) . Using the convexity of f, we obtain the following key result for the progress of each epoch: E h A(s) Ts f(x(s) Ts ) f(x ) A(s) 0 f(x(s) 0 ) f(x ) i z(s) 0 x 2 γ(s) Ts 2 z(s) Ts x 2 # γ(s) t γ(s) t 1 2 x(s) t x 2 # + E Ts a(s) 2 f(u(s 1)) f(x ) 2γ(s) t 1 16β ! a(s) 2 g(s) t g(s) t 1 2 # Intuitively, we want to have another telescoping sum when summing up the above inequality across all epochs s. To do so, we can set the starting points of the next epoch to be the ending points of the previous one, i.e., x(s) Ts = x(s+1) 0 = u(s), γ(s) Ts = γ(s+1) 0 , z(s) Ts = z(s+1) 0 . However, an extra term Ts a(s) 2 f(u(s 1)) f(x ) appears on the RHS. We need to reset the new starting coefficient in the new epoch A(s) 0 to A(s 1) Ts 1 Ts a(s) 2 so that we can telescope the LHS. To bound the term PS s=1 PTs t=1 γ(s) t γ(s) t 1 2 x(s) t x 2 + 1 2γ(s) t 1 16β a(s) 2 g(s) t g(s) t 1 2 , since γ(s) Ts = γ(s+1) 0 and, the sequence γ(s) t is not decreasing, we can make the first part of the sum telescope: PS s=1 PTs t=1 γ(s) t γ(s) t 1 2 x(s) t x 2 PS s=1 PTs t=1 γ(s) t γ(s) t 1 2 D2 = D2 2 γ(S) TS γ(1) 0 . More- over, g(s) Ts = g(s+1) 0 , we can consider the doubly indexed sequences γ(s) t and g(s) t as two singly indexed sequences (γk) and (gk) and the coefficient a(s) to be another sequence (ak). Then we can employ the following two inequalities: 2 (γK γ0) 1 48β k=1 a2 k gk gk 1 2 a2 k gk gk 1 2 Finally, we need to choose the parameters a(s) so that the conditions needed for our analysis are satisfied and A(s) Ts is sufficiently large, so that we attain a fast convergence. We have to choose a(s) such that a(s) 2 4A(s) t 1 for all s, t 1 and that A(s) 0 = A(s 1) Ts 1 Ts a(s) 2 0. The main idea is to divide the epochs into two phases: in the first phase, A(s) Ts quickly rises to Ω(n) and in the second phase, to achieve the optimal q ϵ rate, A(s) Ts = Ω(n2). The nearly-optimal choice of a(s) in the first phase is (4n) 0.5s, stopping at s = s0 = log2 log2 4n , while in the second phase, we have to be more conservative and choose a(s) = s s0+ 1 2 3 . 
With this, we can obtain the convergence rate of O(n min{log log(β/ϵ), log log n} + √(nβ/ϵ)).

3. Experiments

In this section we demonstrate the performance of AdaVRAG and AdaVRAE in comparison with existing standard and adaptive VR methods. We use the experimental setup and the code base of (Dubois-Taine et al., 2021) (their code can be found at https://github.com/bpauld/AdaSVRG).

Datasets and loss functions: We experiment with binary classification on four standard LIBSVM datasets: a1a, mushrooms, w8a and phishing (Chang & Lin, 2011). For each dataset, we show the results for three different objective functions: logistic, squared and Huber loss. Following the setting in (Dubois-Taine et al., 2021), we add an ℓ2-regularization term to the loss function, with the regularization parameter set to 1/n.

Constraint: In all experiments, we evaluate the algorithms under a ball constraint. That is, the domain of each problem in our experiment is a ball of radius R = 100 around the initial point, which means that for every algorithm, in the update step, we need to perform a projection onto this ball.

Figure 1. a1a: (a) Logistic loss, (b) Squared loss, (c) Huber loss.
Figure 2. mushrooms: (a) Logistic loss, (b) Squared loss, (c) Huber loss.
Figure 3. w8a: (a) Logistic loss, (b) Squared loss, (c) Huber loss.
Figure 4. phishing: (a) Logistic loss, (b) Squared loss, (c) Huber loss.

Algorithms and hyperparameter selection: We compare AdaVRAE and AdaVRAG with the common VR algorithms: SVRG (Johnson & Zhang, 2013), SVRG++ (Allen-Zhu & Yuan, 2016), VARAG (Lan et al., 2019), VRADA (Song et al., 2020), and AdaSVRG (Dubois-Taine et al., 2021) (in the experiments the multi-stage variant performs worse than the fixed-size inner loop variant, and we omit it from the plots). Among these, only AdaSVRG is an adaptive VR method that does not require parameter tuning. For the non-adaptive methods, we chose the step size (or equivalently, the inverse of the smoothness parameter 1/β for VRADA) via a hyperparameter search over {0.01, 0.05, 0.1, 0.5, 1, 5, 10, 100}. For each experiment, we used the choice that led to the best performance, and we report the parameters used in Table 2. The adaptive methods AdaSVRG, AdaVRAE, and AdaVRAG do not require any hyperparameter tuning, and we set their parameters as prescribed by the theoretical analysis. For AdaSVRG, we used η = D/√2 as recommended in the original paper. For AdaVRAE and AdaVRAG, we used γ = 0.01 and η = D/2 = R.

Implementation and initialization: For all algorithms, in the inner loop, we use a random permutation of the data points to select a function. We also fix the batch size to 1 in all cases to match the theoretical setting. We initialize u^(0) to be a random point in [0, 10]^d, where each dimension is chosen uniformly in [0, 10]. Each experiment is repeated five times with a different initial point, which is kept the same across all algorithms.

Results: The results are shown in Figures 1, 2, 3 and 4. For each experiment, we plot the mean value and standard deviation of the training objective against the number of gradient evaluations normalized by the number of examples.

Discussion: We observe that, in all experiments, AdaVRAG consistently performs competitively with all methods and generally has the best performance. The non-accelerated methods in general converge more slowly compared with accelerated methods, especially in the later epochs.
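For concreteness, the per-dataset objective and the ball projection used in this setup can be sketched as follows; the helper names are illustrative and this is not the code base of (Dubois-Taine et al., 2021). The (λ/2)‖x‖² convention for the regularizer is an assumption.

```python
import numpy as np

def regularized_logistic_objective(A, y, x, lam):
    """F(x) = (1/n) * sum_i log(1 + exp(-y_i <a_i, x>)) + (lam/2) * ||x||^2,
    with lam = 1/n as in the experiments (assuming the (lam/2)*||x||^2 convention)."""
    margins = y * (A @ x)
    return float(np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * np.dot(x, x))

def project_onto_ball(v, center, radius=100.0):
    """Euclidean projection onto the ball of radius R = 100 around the initial point,
    used as the constraint set X in every update step."""
    d = v - center
    nrm = np.linalg.norm(d)
    return v if nrm <= radius else center + (radius / nrm) * d
```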
In some cases, VARAG suffers from a slow convergence rate in the first phase. This is possibly due to the fact that it sets the coefficient for the checkpoint to 1/2 in the first phase. VRADA sometimes exhibits similar behavior, but to a lesser extent. In AdaVRAG and AdaVRAE, the coefficient for the checkpoint is set to be small in the beginning and is gradually increased over time as the quality of the checkpoint improves. The other adaptive method, AdaSVRG, exhibits slow convergence in many cases. One reason might be that AdaSVRG resets the step size in every epoch and, in later epochs, the step size may be too large for the algorithm to converge. In contrast, AdaVRAG and AdaVRAE use cumulative step sizes.

4. Conclusion and future work

In this paper, we propose two accelerated variance reduced algorithms for the general finite-sum convex optimization problem, with the step size set adaptively, without knowledge of the smoothness parameter. By a careful design of the coefficient choices, the first algorithm, AdaVRAE, an extra-gradient method which sets the step size via the gradient difference, uses O(n log log n + √(nβ/ϵ)) gradient evaluations to attain an O(ϵ)-suboptimal solution, matching the best-known convergence rate of non-adaptive VR methods while removing the requirement of knowledge of the smoothness parameter. The second algorithm, AdaVRAG, which uses the iterate movement in the step size, needs O(n log log n + √(nβ log β/ϵ)) gradient evaluations, but it has the advantage of using a single projection in each iteration and performing better in practice. For both algorithms, as well as the other state-of-the-art VR algorithms, there is still a gap to the lower bound of Ω(n + √(nβ/ϵ)) for the general finite-sum convex optimization problem. Finding an algorithm that can achieve this lower bound, and making it adaptive, remains an open question for future work.

Acknowledgments

ZL, TN, and AE were supported in part by NSF CAREER grant CCF-1750333, NSF grant III-1908510, and an Alfred P. Sloan Research Fellowship. HN was supported in part by NSF CAREER grant CCF-1750716 and NSF grant CCF-1909314.

References

Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. The Journal of Machine Learning Research, 18(1):8194-8244, 2017.

Allen-Zhu, Z. Katyusha X: Practical momentum method for stochastic sum-of-nonconvex optimization. arXiv preprint arXiv:1802.03866, 2018.

Allen-Zhu, Z. and Hazan, E. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pp. 699-707. PMLR, 2016.

Allen-Zhu, Z. and Yuan, Y. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In International Conference on Machine Learning, pp. 1080-1089. PMLR, 2016.

Antonakopoulos, K., Belmega, V., and Mertikopoulos, P. Adaptive extra-gradient methods for min-max optimization and games. In International Conference on Learning Representations (ICLR), 2021.

Bach, F. and Levy, K. Y. A universal algorithm for variational inequalities adaptive to smoothness and noise. In Conference on Learning Theory, pp. 164-194. PMLR, 2019.

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1-27, 2011.

Cutkosky, A. Anytime online-to-batch, optimism and acceleration. In International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pp. 1446-1454. PMLR, 2019.

Cutkosky, A. and Orabona, F.
Momentum-based variance reduction in non-convex sgd. Advances in neural information processing systems, 32, 2019. Defazio, A., Bach, F., and Lacoste-Julien, S. Saga: A fast incremental gradient method with support for nonstrongly convex composite objectives. In Advances in neural information processing systems, pp. 1646 1654, 2014. Dozat, T. Incorporating nesterov momentum into adam. 2016. Dubois-Taine, B., Vaswani, S., Babanezhad, R., Schmidt, M., and Lacoste-Julien, S. Svrg meets adagrad: Painless variance reduction. ar Xiv preprint ar Xiv:2102.09645, 2021. Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011. Ene, A. and Nguyen, H. L. Adaptive and universal algorithms for variational inequalities with optimal convergence s. ar Xiv preprint ar Xiv:2010.07799, 2021. Ene, A., Nguyen, H. L., and Vladu, A. Adaptive gradient methods for constrained convex optimization and variational inequalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 7314 7321, 2021. Fang, C., Li, C. J., Lin, Z., and Zhang, T. Spider: Near-optimal non-convex optimization via stochastic path integrated differential estimator. ar Xiv preprint ar Xiv:1807.01695, 2018. Huang, F., Li, J., and Huang, H. Super-adam: Faster and universal framework of adaptive gradients. ar Xiv preprint ar Xiv:2106.08208, 2021. Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. Advances in neural information processing systems, 26:315 323, 2013. Joulani, P., Raj, A., Gyorgy, A., and Szepesv ari, C. A simpler approach to accelerated optimization: iterative averaging meets optimism. In International Conference on Machine Learning, pp. 4984 4993. PMLR, 2020. Kavis, A., Levy, K. Y., Bach, F., and Cevher, V. Unixgrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. In Advances in Neural Information Processing Systems (Neur IPS), pp. 6257 6266, 2019. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. Lan, G. and Zhou, Y. Random gradient extrapolation for distributed and stochastic optimization. SIAM Journal on Optimization, 28(4):2753 2782, 2018. Lan, G., Li, Z., and Zhou, Y. A unified variance-reduced accelerated gradient method for convex optimization. ar Xiv preprint ar Xiv:1905.12412, 2019. Levy, K., Kavis, A., and Cevher, V. Storm+: Fully adaptive sgd with recursive momentum for nonconvex optimization. Advances in Neural Information Processing Systems, 34, 2021. Levy, K. Y. Online to offline conversions, universality and adaptive minibatch sizes. In Advances in Neural Information Processing Systems (Neur IPS), pp. 1613 1622, 2017. Levy, K. Y., Yurtsever, A., and Cevher, V. Online adaptive methods, universality and acceleration. In Advances in Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction Neural Information Processing Systems (Neur IPS), pp. 6500 6509, 2018. Li, B., Wang, L., and Giannakis, G. B. Almost tune-free variance reduction. In International Conference on Machine Learning, pp. 5969 5978. PMLR, 2020. Li, Z. Anita: An optimal loopless accelerated variance-reduced gradient method. ar Xiv preprint ar Xiv:2103.11333, 2021. Lin, H., Mairal, J., and Harchaoui, Z. A universal catalyst for first-order optimization. ar Xiv preprint ar Xiv:1506.02186, 2015. Mairal, J. Optimization with first-order surrogate functions. 
In International Conference on Machine Learning, pp. 783 791. PMLR, 2013. Mc Mahan, H. B. and Streeter, M. J. Adaptive bound optimization for online convex optimization. In Conference on Learning Theory (COLT), pp. 244 256. Omnipress, 2010. Mohri, M. and Yang, S. Accelerating online convex optimization via adaptive prediction. In Artificial Intelligence and Statistics (AISTATS), pp. 848 856, 2016. Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence o (1/kˆ 2). In Doklady an ussr, volume 269, pp. 543 547, 1983. Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003. Reddi, S. J., Kale, S., and Kumar, S. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018. Roux, N. L., Schmidt, M., and Bach, F. A stochastic gradient method with an exponential convergence rate for finite training sets. ar Xiv preprint ar Xiv:1202.6258, 2012. Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83 112, 2017. Shalev-Shwartz, S. and Zhang, T. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(2), 2013. Song, C., Jiang, Y., and Ma, Y. Variance reduction via accelerated dual averaging for finite-sum optimization. Advances in Neural Information Processing Systems, 33, 2020. Tan, C., Ma, S., Dai, Y.-H., and Qian, Y. Barzilai-borwein step size for stochastic gradient descent. Advances in Neural Information Processing Systems, 29:685 693, 2016. Tieleman, T., Hinton, G., et al. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4 (2):26 31, 2012. Woodworth, B. E. and Srebro, N. Tight complexity bounds for optimizing composite objectives. Advances in neural information processing systems, 29:3639 3647, 2016. Xu, Y., Lin, Q., and Yang, T. Adaptive svrg methods under error bound conditions with unknown growth parameter. Advances in Neural Information Processing Systems, 30, 2017. Zhou, D., Xu, P., and Gu, Q. Stochastic nested variance reduction for nonconvex optimization. ar Xiv preprint ar Xiv:1806.07811, 2018. Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction A. Analysis of algorithm 1 In this section, we analyze Algorithm 1 and prove the following convergence guarantee: Theorem A.1 (Convergence of Ada VRAE). Define s0 = log2 log2 4n , c = 3 2. If we choose parameters as follows ( (4n) 0.5s 1 s s0 s s0 1+c 2c s0 < s , A(0) T0 = 5 Assuming X is a compact convex set with diameter D, the number of individual gradient evaluations to achieve a solution u(S) such that E F(u(S)) F(x ) ϵ for Algorithm 1 is O n log log V n O n log log n + q where V = 5 2 F(u(0)) F(x ) + γ u(0) x 2 + 16β(D4+2η4) To start with, we state and prove the following variance reduction lemma commonly used in accelerated methods: Lemma A.2. (Variance Reduction) Let i Uniform([n]) and g = fi(x) fi(u) + f(u) be an estimate of the gradient of f at x. We have Ei h g f(x) 2i 2β (f(u) f(x) f(x), u x ) . Proof. By the definition of g, Ei h g f(x) 2i = Ei h fi(x) fi(u) + f(u) f(x) 2i (a) Ei h fi(u) fi(x) 2i (b) Ei [2β (fi(u) fi(x) fi(x), u x )] (c) = 2β (f(u) f(x) f(x), u x ) , where (a) is because Ei [ fi(u) fi(x)] = f(u) f(x) and E h X E [X] 2i E h X 2i , (b) is by the convexity and β-smoothness of fi, (c) is by i Uniform([n]) and the definition of f. 
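As a quick numerical illustration of Lemma A.2 (not part of the proof), the following sketch checks the stated variance bound on a random least-squares instance, where each f_i(x) = ½(a_i^⊤ x − b_i)² is ‖a_i‖²-smooth and we take β = max_i ‖a_i‖²; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
beta = max(float(np.dot(A[i], A[i])) for i in range(n))   # each f_i is ||a_i||^2-smooth

f = lambda x: 0.5 * float(np.mean((A @ x - b) ** 2))
grad_f = lambda x: A.T @ (A @ x - b) / n
grad_fi = lambda i, x: A[i] * (A[i] @ x - b[i])

x, u = rng.normal(size=d), rng.normal(size=d)
g = np.array([grad_fi(i, x) - grad_fi(i, u) + grad_f(u) for i in range(n)])

lhs = float(np.mean(np.sum((g - grad_f(x)) ** 2, axis=1)))        # E_i ||g - grad f(x)||^2
rhs = 2 * beta * (f(u) - f(x) - float(np.dot(grad_f(x), u - x)))  # bound from Lemma A.2
assert lhs <= rhs + 1e-8, (lhs, rhs)
```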
A.1. Single iteration progress We first analyze the progress in function value made in a single iteration of an epoch. The analysis follows the standard method as in (Ene & Nguyen, 2021); however, we need to pay attention to the extra term for the checkpoint that appears in the convex combination for x(s) t . We start off by the following observation Lemma A.3. For any s 1 and t [Ts], x(s) t x(s) t 1 = a(s) x(s) t x(s) t + u(s 1) x(s) t . Proof. We note that the definition x(s) t = 1 A(s) t A(s) t 1x(s) t 1 + a(s)x(s) t + a(s) 2 u(s 1) implies Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction A(s) t x(s) t = A(s) t 1x(s) t 1 + a(s)x(s) t + a(s) 2 u(s 1) (a) A(s) t 1(x(s) t x(s) t 1) = a(s)(x(s) t x(s) t ) + a(s) 2 (u(s 1) x(s) t ) x(s) t x(s) t 1 = a(s) x(s) t x(s) t + u(s 1) x(s) t , where (a) is by A(s) t = A(s) t 1 + a(s) + a(s) 2. Next, we bound the function progress in a single epoch via the stochastic regret. Note that, this lemma is somewhat weaker than we would desire, due to the appearance the coefficient of the checkpoint, making the LHS not immediately telescope. We will account for this factor later in the analysis. Lemma A.4. For all epochs s 1 and all iterations t [Ts] E A(s) t a(s) 2 F(x(s) t ) F(x ) A(s) t 1 F(x(s) t 1) F(x ) a(s) D g(s) t , x(s) t x E | {z } stochastic regret + a(s) 2 D f(x(s) t ), u(s 1) x(s) t E " A(s) t 1 2β f(x(s) t ) f(x(s) t 1) 2 # + E A(s) t a(s) 2 h(x(s) t ) h(x ) A(s) t 1 h(x(s) t 1) h(x ) . Proof. Using the observation in Lemma A.3, we have F(x(s) t ) F(x(s) t 1) =f(x(s) t ) f(x(s) t 1) + h(x(s) t ) h(x(s) t 1) (a) D f(x(s) t ), x(s) t x(s) t 1 E 1 f(x(s) t ) f(x(s) t 1) 2 + h(x(s) t ) h(x(s) t 1) D f(x(s) t ), x(s) t x(s) t E + D f(x(s) t ), u(s 1) x(s) t E f(x(s) t ) f(x(s) t 1) 2 + h(x(s) t ) h(x(s) t 1) where (a) is due to the smoothness of f and (b) comes from Lemma A.3. By the convexity of f, we also have F(x(s) t ) F(x ) =f(x(s) t ) f(x ) + h(x(s) t ) h(x ) D f(x(s) t ), x(s) t x E + h(x(s) t ) h(x ) Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction We combine the two inequalities and obtain A(s) t 1 F(x(s) t ) F(x(s) t 1) + a(s) F(x(s) t ) F(x ) a(s) D f(x(s) t ), x(s) t x E + a(s) 2 D f(x(s) t ), u(s 1) x(s) t E f(x(s) t ) f(x(s) t 1) 2 + A(s) t 1 h(x(s) t ) h(x(s) t 1) + a(s) h(x(s) t ) h(x ) =a(s) D g(s) t , x(s) t x E + a(s) 2 D f(x(s) t ), u(s 1) x(s) t E + a(s) D f(x(s) t ) g(s) t , x(s) t x E A(s) t 1 2β f(x(s) t ) f(x(s) t 1) 2 + A(s) t 1 h(x(s) t ) h(x(s) t 1) + a(s) h(x(s) t ) h(x ) . Note that we can rearrange the terms A(s) t 1 F(x(s) t ) F(x(s) t 1) + a(s) F(x(s) t ) F(x ) = A(s) t a(s) 2 F(x(s) t ) F(x ) A(s) t 1 F(x(s) t 1) F(x ) , A(s) t 1 h(x(s) t ) h(x(s) t 1) + a(s) h(x(s) t ) h(x ) = A(s) t a(s) 2 h(x(s) t ) h(x ) A(s) t 1 h(x(s) t 1) h(x ) . Thus we obtain A(s) t a(s) 2 F(x(s) t ) F(x ) A(s) t 1 F(x(s) t 1) F(x ) a(s) D g(s) t , x(s) t x E + a(s) 2 D f(x(s) t ), u(s 1) x(s) t E + a(s) D f(x(s) t ) g(s) t , x(s) t x E A(s) t 1 2β f(x(s) t ) f(x(s) t 1) 2 + A(s) t a(s) 2 h(x(s) t ) h(x ) A(s) t 1 h(x(s) t 1) h(x ) . (3) Observe that for t < Ts E h a(s) D f(x(s) t ) g(s) t , x(s) t x Ei = E h Ei(s) t h a(s) D f(x(s) t ) g(s) t , x(s) t x Eii and for t = Ts,we have f(x(s) t ) = g(s) t thus E h a(s) D f(x(s) t ) g(s) t , x(s) t x Ei = 0. By taking expectations w.r.t. 
both sides of (3), we get E A(s) t a(s) 2 F(x(s) t ) F(x ) A(s) t 1 F(x(s) t 1) F(x ) E a(s) D g(s) t , x(s) t x E + a(s) 2 D f(x(s) t ), u(s 1) x(s) t E " A(s) t 1 2β f(x(s) t ) f(x(s) t 1) 2 # + E A(s) t a(s) 2 h(x(s) t ) h(x ) A(s) t 1 h(x(s) t 1) h(x ) . Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction To analyze the stochastic regret, we split the inner product as follows D g(s) t , x(s) t x E = D g(s) t , z(s) t x E + D g(s) t g(s) t 1, x(s) t z(s) t E + D g(s) t 1, x(s) t z(s) t E . For each term we give a bound as stated in Lemma A.5. Lemma A.5. For any s 1 all iterations t [Ts], we have a(s) D g(s) t 1, x(s) t z(s) t E γ(s) t 1 z(s) t 1 z(s) t 2 γ(s) t 1 z(s) t 1 x(s) t 2 γ(s) t 1 x(s) t z(s) t 2 + a(s) h(z(s) t ) h(x(s) t ) . a(s) D g(s) t , z(s) t x E γ(s) t γ(s) t 1 2 x(s) t x 2 + γ(s) t 1 z(s) t 1 x 2 γ(s) t 2 z(s) t z(s) t 1 2 γ(s) t γ(s) t 1 2 x(s) t z(s) t 2 + a(s) h(x ) h(z(s) t ) . a(s) D g(s) t g(s) t 1, x(s) t z(s) t E g(s) t g(s) t 1 2 + γ(s) t 2 x(s) t z(s) t 2 . Proof. Since x(s) t = arg minx X a(s) D g(s) t 1, x E + a(s)h(x) + γ(s) t 1 2 x z(s) t 1 2 , by the optimality condition of x(s) t , we have D a(s)g(s) t 1 + a(s)h (x(s) t ) + γ(s) t 1 x(s) t z(s) t 1 , x(s) t z(s) t E 0, where h (x(s) t ) h(x(s) t ) is a subgradient of h at x(s) t . We rearrange the above inequality and obtain a(s) D g(s) t 1, x(s) t z(s) t E γ(s) t 1 D x(s) t z(s) t 1, z(s) t x(s) t E + a(s) D h (x(s) t ), z(s) t x(s) t E (a) γ(s) t 1 D x(s) t z(s) t 1, z(s) t x(s) t E + a(s) h(z(s) t ) h(x(s) t ) (b) = γ(s) t 1 z(s) t 1 z(s) t 2 γ(s) t 1 z(s) t 1 x(s) t 2 γ(s) t 1 x(s) t z(s) t 2 + a(s) h(z(s) t ) h(x(s) t ) , where (a) follows from the convexity of h and the fact that h (x(s) t ) h(x(s) t ), and (b) is due to the identity a, b = 1 2 a + b 2 a 2 b 2 . Using the optimality condition of z(s) t , we have D a(s)g(s) t + a(s)h (z(s) t ) + γ(s) t 1 z(s) t z(s) t 1 + γ(s) t γ(s) t 1 z(s) t x(s) t , z(s) t x E 0 Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction where h (z(s) t ) h(z(s) t ) is a subgradient of h at z(s) t . We rearrange the above inequality and obtain a(s) D g(s) t , z(s) t x E γ(s) t 1 D z(s) t z(s) t 1, x z(s) t E + γ(s) t γ(s) t 1 D z(s) t x(s) t , x z(s) t E + a(s) D h (z(s) t ), x z(s) t E (c) γ(s) t 1 D z(s) t z(s) t 1, x z(s) t E + γ(s) t γ(s) t 1 D z(s) t x(s) t , x z(s) t E + a(s) h(x ) h(z(s) t ) (d) = γ(s) t 1 z(s) t 1 x 2 z(s) t x 2 z(s) t z(s) t 1 2 + γ(s) t γ(s) t 1 2 x(s) t x 2 z(s) t x 2 x(s) t z(s) t 2 + a(s) h(x ) h(z(s) t ) = γ(s) t γ(s) t 1 2 x(s) t x 2 + γ(s) t 1 z(s) t 1 x 2 γ(s) t 2 z(s) t z(s) t 1 2 γ(s) t γ(s) t 1 2 x(s) t z(s) t 2 + a(s) h(x) h(z(s) t ) , where (c) follows from the convexity of h and the fact that h (z(s) t ) h(z(s) t ), and (d) is due to the identity a, b = 1 2 a + b 2 a 2 b 2 . For the third inequality, we have a(s) D g(s) t g(s) t 1, x(s) t z(s) t E (e) a(s) g(s) t g(s) t 1 x(s) t z(s) t g(s) t g(s) t 1 2 + γ(s) t 2 x(s) t z(s) t 2 . where (e) is by the Cauchy Schwarz inequality, (f) is by Young s inequality. With above results, we obtain the descent lemma for one iteration. A key idea to remove a(s) 2 from the co- efficient of F(x(s) t ) F(x ) is to split the term (a(s)) 2 g(s) t g(s) t 1 2 into (a(s)) 2 2γ(s) t (a(s)) 2 g(s) t g(s) t 1 2 + 16β g(s) t g(s) t 1 2 and apply the VR lemma for the second term. Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction Lemma A.6. 
For all epochs s 1 and all iterations t [Ts], we have E A(s) t a(s) 2 F(x(s) t ) F(x ) A(s) t 1 F(x(s) t 1) F(x ) z(s) t 1 x 2 γ(s) t 2 z(s) t x 2 + γ(s) t γ(s) t 1 2 x(s) t x 2 # + E a(s) 2 D f(x(s) t ), u(s 1) x(s) t E f(u(s 1)) f(x(s) t ) D f(x(s) t ), u(s 1) x(s) t E # f(u(s 1)) f(x(s) t 1) D f(x(s) t 1), u(s 1) x(s) t 1 E # ! g(s) t g(s) t 1 2 + 8β A(s) t 1 2β ! f(x(s) t ) f(x(s) t 1) 2 # + E a(s) 2 h(u(s 1)) h(x(s) t ) . Proof. By Lemma A.5, we can bound a(s) D g(s) t , x(s) t x E as follows a(s) D g(s) t , x(s) t x E γ(s) t 1 z(s) t 1 x 2 γ(s) t 2 z(s) t x 2 + γ(s) t γ(s) t 1 2 x(s) t x 2 γ(s) t 1 z(s) t 1 x(s) t 2 + a(s) h(x ) h(x(s) t ) + g(s) t g(s) t 1 2 . z(s) t 1 x 2 γ(s) t 2 z(s) t x 2 + γ(s) t γ(s) t 1 2 + a(s) h(x ) h(x(s) t ) + g(s) t g(s) t 1 2 . Combining the above result with Lemma A.4, we know E A(s) t a(s) 2 F(x(s) t ) F(x ) A(s) t 1 F(x(s) t 1) F(x ) z(s) t 1 x 2 γ(s) t 2 z(s) t x 2 + γ(s) t γ(s) t 1 2 x(s) t x 2 # + E a(s) 2 D f(x(s) t ), u(s 1) x(s) t E g(s) t g(s) t 1 2 A(s) t 1 2β f(x(s) t ) f(x(s) t 1) 2 # + E A(s) t a(s) 2 h(x(s) t ) h(x ) A(s) t 1 h(x(s) t 1) h(x ) + a(s) h(x ) h(x(s) t ) . (4) Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction Note that A(s) t a(s) 2 h(x(s) t ) h(x ) A(s) t 1 h(x(s) t 1) h(x ) + a(s) h(x ) h(x(s) t ) = A(s) t a(s) 2 h(x(s) t ) h(x ) + a(s) 2 h(u(s 1)) h(x ) a(s) 2 h(u(s 1)) h(x ) A(s) t 1 h(x(s) t 1) h(x ) a(s) h(x(s) t ) h(x ) (a) A(s) t a(s) 2 h(x(s) t ) h(x ) + a(s) 2 h(u(s 1)) h(x ) A(s) t (h(x(s) t ) h(x )) = a(s) 2 h(u(s 1)) h(x(s) t ) , (5) where (a) is by the convexity of h and A(s) t = A(s) t 1 + a(s) + a(s) 2. Plugging in (5) into (4), we know E A(s) t a(s) 2 F(x(s) t ) F(x ) A(s) t 1 F(x(s) t 1) F(x ) z(s) t 1 x 2 γ(s) t 2 z(s) t x 2 + γ(s) t γ(s) t 1 2 x(s) t x 2 # + E a(s) 2 D f(x(s) t ), u(s 1) x(s) t E g(s) t g(s) t 1 2 A(s) t 1 2β f(x(s) t ) f(x(s) t 1) 2 # + E a(s) 2 h(u(s 1)) h(x(s) t ) . Now for E g(s) t g(s) t 1 2 , when 1 < t < Ts, we have E g(s) t g(s) t 1 2 E 4 f(x(s) t ) g(s) t 2 + 4 f(x(s) t 1) g(s) t 1 2 + E 2 f(x(s) t ) f(x(s) t 1) 2 (b) E h 8β f(u(s 1)) f(x(s) t ) D f(x(s) t ), u(s 1) x(s) t E i + E h 8β f(u(s 1)) f(x(s) t 1) D f(x(s) t 1), u(s 1) x(s) t 1 E i + E 2 f(x(s) t ) f(x(s) t 1) 2 , (6) where (b) is by Lemma A.2 for all 1 < t < Ts. When t = 1, note that both f(x(s) t 1) g(s) t 1 2 and f(u(s 1)) f(x(s) t 1) D f(x(s) t 1), u(s 1) x(s) t 1 E are zero by our definition x(s) 0 = u(s 1) and f(x(s) 0 ) = f(x(s 1) Ts 1 ) = g(s 1) Ts 1 = g(s) 0 , which means the above inequality is still true. When t = Ts, note that f(x(s) t ) g(s) t 2 = 0 and f(u(s 1)) f(x(s) t ) D f(x(s) t ), u(s 1) x(s) t E is always non-negative due to the convexity of f. So the above inequality also holds in this case. Now we conlclude the above inequality is right for t [Ts]. Splitting (a(s)) 2 g(s) t g(s) t 1 2 into (a(s)) 2 2γ(s) t (a(s)) 2 g(s) t g(s) t 1 2 + (a(s)) 2 16β g(s) t g(s) t 1 2 and applying (6) to Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction 16β g(s) t g(s) t 1 2 , we have E A(s) t a(s) 2 F(x(s) t ) F(x ) A(s) t 1 F(x(s) t 1) F(x ) z(s) t 1 x 2 γ(s) t 2 z(s) t x 2 + γ(s) t γ(s) t 1 2 x(s) t x 2 # + E a(s) 2 D f(x(s) t ), u(s 1) x(s) t E f(u(s 1)) f(x(s) t ) D f(x(s) t ), u(s 1) x(s) t E # f(u(s 1)) f(x(s) t 1) D f(x(s) t 1), u(s 1) x(s) t 1 E # ! g(s) t g(s) t 1 2 + 8β A(s) t 1 2β ! f(x(s) t ) f(x(s) t 1) 2 # + E a(s) 2 h(u(s 1)) h(x(s) t ) . A.2. 
Single epoch progress and final output Even though Lemma A.6 looks somewhat more convoluted, when we sum up over all iterations in one epoch, many terms are canceled out nicely and we obtain the following lemma that states the progress of the function value in one epoch. The trick is to set the value for each term at the end of one epoch equal to its value in the next one, with an exception for A(s 1) Ts 1 . Due to the accumulation of the term F(u(s 1)) F(x ) throughout the epoch, we will set A(s) 0 = A(s 1) Ts 1 Ts a(s) 2. Lemma A.7. For all epochs s 1, if a(s) 2 4A(s) t 1, t [Ts] . E h A(s) Ts F(u(s)) F(x ) A(s 1) Ts 1 F(u(s 1)) F(x ) i z(s) 0 x 2 γ(s+1) 0 z(s+1) 0 x 2 + γ(s) t γ(s) t 1 2 x(s) t x 2 # ! g(s) t g(s) t 1 2 # Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction Proof. Using Lemma A.6, we know t=1 E A(s) t a(s) 2 F(x(s) t ) F(x ) A(s) t 1 F(x(s) t 1) F(x ) z(s) t 1 x 2 γ(s) t 2 z(s) t x 2 + γ(s) t γ(s) t 1 2 x(s) t x 2 # t=1 E a(s) 2 D f(x(s) t ), u(s 1) x(s) t E + a(s) 2 h(u(s 1)) h(x(s) t ) f(u(s 1)) f(x(s) t ) D f(x(s) t ), u(s 1) x(s) t E # f(u(s 1)) f(x(s) t 1) D f(x(s) t 1), u(s 1) x(s) t 1 E # ! g(s) t g(s) t 1 2 + 8β A(s) t 1 2β ! f(x(s) t ) f(x(s) t 1) 2 # z(s) 0 x 2 γ(s+1) 0 z(s+1) 0 x 2 + γ(s) t γ(s) t 1 2 x(s) t x 2 # a(s) 2 f(u(s 1)) f(x(s) t ) + a(s) 2 h(u(s 1)) h(x(s) t ) # f(u(s 1)) f(x(s) Ts ) + D f(x(s) Ts ), u(s 1) x(s) Ts ! g(s) t g(s) t 1 2 + 8β A(s) t 1 2β ! f(x(s) t ) f(x(s) t 1) 2 # z(s) 0 x 2 γ(s+1) 0 z(s+1) 0 x 2 + γ(s) t γ(s) t 1 2 x(s) t x 2 # a(s) 2 f(u(s 1)) f(x(s) t ) + a(s) 2 h(u(s 1)) h(x(s) t ) # ! g(s) t g(s) t 1 2 + 8β A(s) t 1 2β ! f(x(s) t ) f(x(s) t 1) 2 # z(s) 0 x 2 γ(s+1) 0 z(s+1) 0 x 2 + γ(s) t γ(s) t 1 2 x(s) t x 2 # a(s) 2 F(u(s 1)) F(x(s) t ) # ! g(s) t g(s) t 1 2 + 8β A(s) t 1 2β ! f(x(s) t ) f(x(s) t 1) 2 # where (a) is due to z(s+1) 0 = z(s) Ts , γ(s+1) 0 = γ(s) Ts , x(s) 0 = u(s 1), (b) is by the convexity of f D f(x(s) Ts ), u(s 1) x(s) Ts E f(u(s 1)) f(x(s) Ts ), Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction (c) is by the definition of F = f + h. By adding PTs t=1 a(s) 2 F(x(s) t ) F(x ) to both sides of 7. we obtain t=1 A(s) t F(x(s) t ) F(x ) A(s) t 1 F(x(s) t 1) F(x ) # z(s) 0 x 2 γ(s+1) 0 z(s+1) 0 x 2 + γ(s) t γ(s) t 1 2 x(s) t x 2 # + E Ts a(s) 2 F(u(s 1)) F(x ) ! g(s) t g(s) t 1 2 + 8β A(s) t 1 2β ! f(x(s) t ) f(x(s) t 1) 2 # t=1 A(s) t F(x(s) t ) F(x ) A(s) t 1 F(x(s) t 1) F(x ) # =E h A(s) Ts F(x(s) Ts ) F(x ) A(s) 0 F(x(s) 0 ) F(x ) i (d) =E h A(s) Ts F(u(s)) F(x ) A(s) 0 F(u(s 1)) F(x ) i , where (d) is due to the definition u(s) = x(s) Ts and x(s) 0 = u(s 1). Finally we have F(u(s)) F(x ) A(s) 0 + Ts a(s) 2 F(u(s 1)) F(x ) z(s) 0 x 2 γ(s+1) 0 z(s+1) 0 x 2 + γ(s) t γ(s) t 1 2 x(s) t x 2 # ! g(s) t g(s) t 1 2 + 8β A(s) t 1 2β ! f(x(s) t ) f(x(s) t 1) 2 # Combining the fact A(s) 0 = A(s 1) Ts 1 Ts a(s) 2 and our condition a(s) 2 4A(s) t 1, we get the desired result. The telescoping sum on the LSH allows us to obtain the guarantee for the final output u(S). Lemma A.8. For all S 1, assume we have a(s) 2 4A(s) t 1, t [Ts] , s [S] . E h A(S) TS F(u(S)) F(x ) i F(u(0)) F(x ) + γ γ(s) t γ(s) t 1 2 x(s) t x 2 + ! g(s) t g(s) t 1 2 # Proof. Note that our assumptions satisfy the requirements for Lemma A.7, by Applying Lemma A.7 and make the Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction telescoping sum from s = 1 to S, we obtain E h A(S) TS F(u(S)) F(x ) i F(u(0)) F(x ) + E z(s) 0 x 2 γ(S+1) 0 z(S+1) 0 x 2 # γ(s) t γ(s) t 1 2 x(s) t x 2 + ! 
g(s) t g(s) t 1 2 # F(u(0)) F(x ) + γ(1) 0 γ(s) t γ(s) t 1 2 x(s) t x 2 + ! g(s) t g(s) t 1 2 # F(u(0)) F(x ) + γ γ(s) t γ(s) t 1 2 x(s) t x 2 + ! g(s) t g(s) t 1 2 # where we use γ(1) 0 = γ and z(1) 0 = u(0). A.3. Bound for the residual term We turn to bound the term γ(s) t γ(s) t 1 2 x(s) t x 2 + ! g(s) t g(s) t 1 2 This follows the standard analysis used to bound the residual term in adaptive methods. We first admit Lemma A.10 to give the final bound for this term. Lemma A.9. If X is a compact convex set with diameter D, we have γ(s) t γ(s) t 1 2 x(s) t x 2 + ! g(s) t g(s) t 1 2 8β D4 + 2η4 Proof. It follows that γ(s) t γ(s) t 1 2 x(s) t x 2 + ! g(s) t g(s) t 1 2 γ(s) t γ(s) t 1 2 D2 + ! g(s) t g(s) t 1 2 (b) = γ(s) Ts γ(1) 0 2 D2 + ! g(s) t g(s) t 1 2 =γ(s) Ts γ(1) 0 2 D2 D4 16β (D4 + 2η4) a(s) 2 g(s) t g(s) t 1 2 8β (D4 + 2η4) ! a(s) 2 g(s) t g(s) t 1 2 (c) 8β D4 + 2η4 Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction where (a) is by γ(s) t γ(s) t 1 and x(s) t x D, (b) is by noticing γ(s+1) 0 = γ(s) Ts , (c) is by Lemma A.10. Lemma A.10. Under our update rule of γ(s) t , we have γ(S) TS γ(1) 0 D4 16β (D4 + 2η4) a(s) 2 g(s) t g(s) t 1 2 4β D4 + 2η4 8β (D4 + 2η4) ! a(s) 2 g(s) t g(s) t 1 2 4β D4 + 2η4 Proof. For simplicity, g(1) 0 , g(1) 1 , . . . , g(1) T1 = g(2) 0 , g(2) 1 , . . . , g(2) T2 , . . . as (gk)k 0 and γ(1) 0 , γ(1) 1 , . . . , γ(1) T1 = γ(2) 0 , γ(2) 1 , . . . , γ(2) T2 , . . . as (γk)k 0. For k 1, assume that g(s) t is the element that correspond to gk, and let ak = a(s). Then we can write γk = 1 η η2γ2 k 1 + a2 k gk gk 1 2. By writing η2γ2 k = η2γ2 k 1 + a2 k gk gk 1 2 we obtain η2γ2 k = η2γ2 0 + Pk t=1 a2 t gt gt 1 2 and hence γk = 1 η2γ0 + Pk t=1 a2 t gt gt 1 2. For 1). Using b we have γk γ0 + 1 q Pk t=1 a2 t gt gt 1 2. Therefore 2 (γk γ0) D4 16β (D4 + 2η4) t=1 a2 t gt gt 1 2 t=1 a2 t gt gt 1 2 D4 16β (D4 + 2η4) t=1 a2 t gt gt 1 2 (a) 4β D4 + 2η4 where for (a) we use ax bx2 a2 For 2). Let τ be the last index such that γτ 4β(D4+2η4) η4 or τ = 1 if γ0 > 4β(D4+2η4) η4 . If τ 0 we have Pk t=1 1 2γt η4 8β(D4+2η4) a2 t gt gt 1 2 0 for all k. Assume τ > 0 8β (D4 + 2η4) a2 t gt gt 1 2 1 2γt a2 t gt gt 1 2 γ2 t γ2 t 1 2γt (γt γt 1) (γt + γt 1) t=1 (γt γt 1) (c) 4β D4 + 2η4 Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction where (b) is due to γt 1 γt, (c) is by the definition of τ. Finally we give an explicit choice for the parameters to satisfy all conditions and give the final necessary bound. A.4. Parameter choice and bound The following lemma states the bound for the coefficients. Lemma A.11. Under the choice of parameters in Theorem A.1, s 1, we have a(s) 2 < 4A(s) 0 ( n(4n) 0.5s 1 s s0 n 4c(s s0)2 s0 < s Proof. As a reminder, we choose the parameters as follows, where c = 3 2 and s = s0 = log2 log2 4n ( (4n) 0.5s 1 s s0 s s0 1+c 2c s0 < s , A(0) T0 = 5 The idea in this choice is that we divide the time into two phases in which the convergence behaves differently. In the first phase, A(s) Ts quickly gets to Ω(n) and we can set the coefficients for the checkpoint relatively small. In the second phase, to achieve the optimal q ϵ rate, A(s) Ts = Ω(n2). In this phase, we need to be more conservative and set the coefficients for the checkpoint large. We analyze the two phases separately. First we show by induction that for 1 s s0, A(s) 0 = 1 + n k=0 (4n) 0.5k, (8) A(s) Ts = 1 + n k=0 (4n) 0.5k. 
(9) Indeed, we have A(1) 0 = A(0) T0 T1 a(1) 2 (a) = 5 4 n(4n) 1 = 5 A(1) T1 = A(1) 0 + T1 a(1) + a(1) 2 (b) = 1 + n (4n) 0.5 + (4n) 1 , where (a) and (b) are both by plugging in a(1) = (4n) 0.5 and T1 = n. Supposed that 8 and 9 hold for all k s < s0. For Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction k = s + 1 s0, we have A(s+1) 0 = A(s) Ts Ts+1 a(s+1) 2 k=0 (4n) 0.5k ! k=0 (4n) 0.5k, A(s+1) Ts+1 = A(s+1) 0 + Ts+1 a(s+1) + a(s+1) 2 k=0 (4n) 0.5k ! + n (4n) 0.5s+1 + (4n) 0.5s k=0 (4n) 0.5k, where (c) is by plugging a(s+1) = (4n) 0.5s+1, Ts+1 = n and the assumption on A(s) Ts , (d) is by plugging a(s+1) = (4n) 0.5s+1 and Ts+1 = n. Now the induction is completed. From this we can see that A(s) 0 1 > (a(s)) 2 4 and A(s) Ts > n(4n) 0.5s. Next, for s > s0, we show by induction that 4c(s s0 2 + 2c)(s s0 1) n 4c2 (s s0 1 + c)2, (10) A(s) Ts > n 4c(s s0 1 + 2c)(s s0). (11) Indeed we have A(s0) Ts0 = 1 + n Ps0 k=0(4n) 0.5k > n(4n) 0.5s0 n(4n) 0.5log2 log2 4n = n A(s0+1) 0 = A(s0) Ts0 Ts0+1 a(s0+1) 2 A(s0+1) Ts0+1 = A(s0+1) 0 + Ts0+1 a(s0+1) + a(s0+1) 2 where (e) and (f) are both by a(s0+1) = 1 2, Ts0+1 = n. Supposed that 10 and 11 hold for all s0 < k s. For k = s + 1 Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction A(s+1) 0 = A(s) Ts Ts+1 a(s+1) 2 4c(s s0 1 + 2c)(s s0) n s s0 + c 4c(s s0 1 + 2c)(s s0) n 4c2 (s s0 + c)2, A(s+1) Ts+1 = A(s+1) 0 + Ts+1 a(s+1) + a(s+1) 2 = A(s) Ts + Ts+1a(s+1) 4c(s s0 1 + 2c)(s s0) + n 2c (s s0 + c) 4c(s s0 + 2c)(s s0 + 1), where (g) and (h) are both due to Ts+1 = n, a(s+1) = s s0+c 2c and the assumption on A(s) Ts . Now the induction is completed. We can see that if c = 3 A(s) 0 > n 1 2 + (s s0 1 + c)2 s s0 1 + c2 2 + (s s0 1 + c)2 12c s s0 1 + c2 2 + (s s0 1 + c)2 16c2 + (s s0 1 + c)2 24c s s0 1 + c2 = n (s s0 1 + c)2 16c2 + (s s0 1)2 3(s s0 1) + (c2 6c2 + 12c) > (s s0 1 + c)2 and A(s) Ts > n A.5. Putting all together We are now ready to put everything together and complete the proof of Theorem A.1. Proof. (Theorem A.1) From Lemma A.11, we know a(s) 2 < 4A(s) 0 for any s 1, which implies for any s 1, t [Ts] a(s) 2 < 4A(s) t 1. Combining our parameters, we can find the requirements for Lemma A.8 are satisfied, which will give us E h A(S) TS F(u(S)) F(x ) i A(0) T0 F(u(0)) F(x ) + γ γ(s) t γ(s) t 1 2 x(s) t x 2 + ! g(s) t g(s) t 1 2 # Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction By using Lemma A.9, we know E h A(S) TS F(u(S)) F(x ) i A(0) T0 F(u(0)) F(x ) + γ u(0) x 2 + 8β D4 + 2η4 F(u(0)) F(x ) + γ u(0) x 2 + 8β D4 + 2η4 E h F(u(S)) F(x ) i V ( 2V (4n)1 0.5S 1 S s0 2c V n(S s0)2 s0 < S where (a) is by plugging in A(0) T0 = 5 4, (b) is by A.11. n , we choose S = log2 log2 4V ϵ log2 log2 4n = s0, so we have E h F(u(S)) F(x ) i 2V (4n)1 0.5S where (c) is by n V ϵ , (d) is by 4V ϵ 0.5S = 4V ϵ 0.5 log2 log2 4V ϵ 0.5log2 log2 4V 2. Note that the final full gradient computation in the last epoch is not needed, therefore the number of individual gradient evaluations is #grads = n + s=1 (2(Ts 1) + n) + 2(TS 1) = 3n log2 log2 4V = O n log log V n , we choose S = s0 + lq nϵ m s0 + 1, so we have E h F(u(S)) F(x ) i 2c V n(S s0)2 Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction The number of individual gradient evaluations is #grads = n + s=1 (2(Ts 1) + n) + 2(TS 1) = 3ns0 + 3n(S s0) = 3n log2 log2 4n + 3n n log log n + B. Analysis of algorithm 2 In this section, we analyze Algorithm 2 and prove the following convergence guarantee: Theorem B.1. 
(Convergence of Ada VRAG) Define s0 = log2 log2 4n , c = 3+ 33 4 . Suppose we set the parameters of Algorithm 2 as follows: ( 1 (4n) 0.5s 1 s s0 c s s0+2c s0 < s , 1 (1 a(s))a(s) 1 s s0 8(2 a(s))a(s) 3(1 a(s)) s0 < s , Suppose that X is a compact convex set with diameter D and we set η = Θ(D). Addtionally, we assume that 2η2 > D2 if Option I is used for setting the step size. The number of individual gradient evaluations to achieve a solution u(S) such that E F(u(S)) F(x ) ϵ for Algorithm 2 is O n log log V n O n log log n + q 1 2(F(u(0)) F(x )) + γ u(0) x 2 + h β 1 D2 D2 + 2(η2 + D2) log 2η2β 2η2 D2 for Option I 1 2(F(u(0)) F(x )) + γ u(0) x 2 + η2 D2 η2 + β γ + 2D2 η2 + β γ for Option II B.1. Single epoch progress and final output We first analyze the progress in function value made in a single iteration of an epoch. The analysis is done in a standard way by combining the smoothness and convexity of f, the convexity of h and the optimality condition of x(s) t . Lemma B.2. For all epochs s 1 and all iterations t [Ts], we have E h F(x(s) t ) F(x ) i E h 1 a(s) F(u(s 1)) F(x ) i " γ(s) t 1q(s)a(s) x(s) t 1 x 2 x(s) t x 2 # " β 2 a(s) a(s) 2 2 1 a(s) γ(s) t 1q(s)a(s) ! x(s) t x(s) t 1 2 # Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction Proof. We have E h f(x(s) t ) f(x(s) t 1) i (a) E D f(x(s) t 1), x(s) t x(s) t 1 E + β x(s) t x(s) t 1 2 =E D g(s) t , x(s) t x(s) t 1 E + D f(x(s) t 1) g(s) t , x(s) t x(s) t 1 E + β x(s) t x(s) t 1 2 , where (a) is due to f being β-smooth. Using Cauchy Schwarz inequality and Young s inequality (ab λ 2 a2 + 1 2λb2 with λ > 0) we have D f(x(s) t 1) g(s) t , x(s) t x(s) t 1 E f(x(s) t 1) g(s) t x(s) t x(s) t 1 f(x(s) t 1) g(s) t 2 + β 2(1 a(s)) x(s) t x(s) t 1 2 , also note that x(s) t x(s) t 1 = a(s)x(s) t + (1 a(s))u(s 1) a(s)x(s) t 1 + (1 a(s))u(s 1) = a(s) x(s) t x(s) t 1 . Hence, we obtain E h f(x(s) t ) f(x(s) t 1) i "D g(s) t , a(s) x(s) t x(s) t 1 E + 1 a(s) f(x(s) t 1) g(s) t 2 + β 2 a(s) a(s) 2 2 1 a(s) x(s) t x(s) t 1 2 # (b) E h D g(s) t , a(s) x(s) t x(s) t 1 E + 1 a(s) f(u(s 1)) f(x(s) t 1) f(x(s) t 1), u(s 1) x(s) t 1 i " β 2 a(s) a(s) 2 2 1 a(s) x(s) t x(s) t 1 2 # =E h D g(s) t , a(s) x(s) t x E + D g(s) t , a(s) x x(s) t 1 E D f(x(s) t 1), 1 a(s) u(s 1) x(s) t 1 Ei + E h 1 a(s) f(u(s 1)) f(x(s) t 1) i + E " β 2 a(s) a(s) 2 2 1 a(s) x(s) t x(s) t 1 2 # (c) =E h D g(s) t , a(s) x(s) t x E + D f(x(s) t 1), a(s) x x(s) t 1 1 a(s) u(s 1) x(s) t 1 Ei + E h 1 a(s) f(u(s 1)) f(x(s) t 1) i + E " β 2 a(s) a(s) 2 2 1 a(s) x(s) t x(s) t 1 2 # (d) =E h D g(s) t , a(s) x(s) t x E + D f(x(s) t 1), a(s) x x(s) t 1 E + 1 a(s) f(u(s 1)) f(x(s) t 1) i " β 2 a(s) a(s) 2 2 1 a(s) x(s) t x(s) t 1 2 # "D g(s) t , a(s) x(s) t x E + β 2 a(s) a(s) 2 2 1 a(s) x(s) t x(s) t 1 2 + 1 a(s) f(u(s 1)) + a(s)f(x ) f(x(s) t 1) i , (12) Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction where (b) is by Lemma A.2, (c) is because of E h D g(s) t , a(s) x x(s) t 1 Ei = E h D f(x(s) t 1), a(s) x x(s) t 1 Ei , (d) is by x(s) t 1 = a(s)x(s) t 1 + 1 a(s) u(s 1), (e) is due to the convexity of f which implies D f(x(s) t 1), a(s) x x(s) t 1 E a(s) f(x ) f(x(s) t 1) . By adding E h f(x(s) t 1) f(x ) i to both sides of (12), we obtain E h f(x(s) t ) f(x ) i "D g(s) t , a(s) x(s) t x E + β 2 a(s) a(s) 2 2 1 a(s) x(s) t x(s) t 1 2 + 1 a(s) f(u(s 1)) f(x ) # Next, we upper bound the inner product D g(s) t , a(s) x(s) t x E . 
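Before doing so, recall what the optimality condition about to be invoked refers to: x(s) t is the minimizer of the linearized term ⟨g(s) t, x⟩ plus h(x) plus a quadratic proximity term centered at x(s) t−1. A minimal sketch of such a proximal step for the illustrative special case X = R^d and h(x) = λ‖x‖_1 (choices made here only for concreteness; they are not assumed in the analysis), writing c for the quadratic coefficient, is the following:

    import numpy as np

    def prox_step(x_prev, g, c, lam):
        # argmin_x <g, x> + lam*||x||_1 + (c/2)*||x - x_prev||^2
        # has the closed-form soft-thresholding solution below.
        v = x_prev - g / c
        return np.sign(v) * np.maximum(np.abs(v) - lam / c, 0.0)

    # quick check against a brute-force 1-d grid search
    x_prev, g, c, lam = 0.3, 1.2, 2.0, 0.5
    grid = np.linspace(-5, 5, 200001)
    obj = g * grid + lam * np.abs(grid) + 0.5 * c * (grid - x_prev) ** 2
    assert abs(grid[np.argmin(obj)]
               - prox_step(np.array([x_prev]), np.array([g]), c, lam)[0]) < 1e-3

For a general compact X and a general h, the step is carried out instead by the proximal operator of h over X that Problem (1) assumes to be efficiently computable.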
By the optimality condition of x(s) t , we have D g(s) t + h (x(s) t ) + γ(s) t 1q(s) x(s) t x(s) t 1 , x(s) t x E 0, where h (x(s) t ) h(x(s) t ) is a subgradient of h at x(s) t . We rearrange the above inequality and obtain a(s) D g(s) t , x(s) t x E a(s) D h (x(s) t ) + γ(s) t 1q(s) x(s) t x(s) t 1 , x x(s) t E (f) a(s) h(x ) h(x(s) t ) + a(s)γ(s) t 1q(s) D x(s) t x(s) t 1, x x(s) t E (g) =a(s) h(x ) h(x(s) t ) + a(s)γ(s) t 1q(s) x(s) t 1 x 2 x(s) t x 2 x(s) t 1 x(s) t 2 , (14) where (f) follows from the convexity of h and the fact that h (x(s) t ) h(x(s) t ), and (g) is due to the identity a, b = 1 2 a + b 2 a 2 b 2 . We plug in (14) into (13), and obtain E h f(x(s) t ) f(x ) i E h 1 a(s) f(u(s 1)) f(x ) + a(s) h(x ) h(x(s) t ) i " γ(s) t 1q(s)a(s) x(s) t 1 x 2 x(s) t x 2 + β 2 a(s) a(s) 2 2 1 a(s) γ(s) t 1q(s)a(s) ! x(s) t x(s) t 1 2 # (h) = E h 1 a(s) F(u(s 1)) F(x ) + h(x ) a(s)h(x(s) t ) 1 a(s) h(u(s 1)) i " γ(s) t 1q(s)a(s) x(s) t 1 x 2 x(s) t x 2 + β 2 a(s) a(s) 2 2 1 a(s) γ(s) t 1q(s)a(s) ! x(s) t x(s) t 1 2 # (i) E h 1 a(s) F(u(s 1)) F(x ) + h(x ) h(x(s) t ) i " γ(s) t 1q(s)a(s) x(s) t 1 x 2 x(s) t x 2 + β 2 a(s) a(s) 2 2 1 a(s) γ(s) t 1q(s)a(s) ! x(s) t x(s) t 1 2 # Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction where (h) is by the definition of F = f + h, and (i) is by the convexity of h which implies h(x(s) t ) = h a(s)x(s) t + (1 a(s))u(s 1) a(s)h(x(s) t ) + (1 a(s))h(u(s 1)). Now we move the term E h h(x ) h(x(s) t ) i to the LHS, and obtain E h F(x(s) t ) F(x ) i " 1 a(s) F(u(s 1)) F(x ) + γ(s) t 1q(s)a(s) x(s) t 1 x 2 x(s) t x 2 # " β 2 a(s) a(s) 2 2 1 a(s) γ(s) t 1q(s)a(s) ! x(s) t x(s) t 1 2 # By Lemma B.2, if 1 Ts PTs t=1 x(s) t is defined as a new chekpoint like what we do in Algorithm 2, the following guarantee for the function value progress in one epoch comes up immediately by the convexity of F. Lemma B.3. For all epochs s 1, we have E h F(u(s)) F(x ) i E h 1 a(s) F(u(s 1)) F(x ) i γ(s) t 1q(s)a(s) x(s) t 1 x 2 x(s) t x 2 # β 2 a(s) a(s) 2 2 1 a(s) γ(s) t 1q(s)a(s) ! x(s) t x(s) t 1 2 # Proof. We have E h F(u(s)) F(x ) i F(x(s) t ) F(x ) # (b) E h 1 a(s) F(u(s 1)) F(x ) i γ(s) t 1q(s)a(s) x(s) t 1 x 2 x(s) t x 2 # β 2 a(s) a(s) 2 2 1 a(s) γ(s) t 1q(s)a(s) ! x(s) t x(s) t 1 2 # where (a) is by the convexity of F and the definition of u(s) = 1 Ts PTs t=1 x(s) t , and (b) is by Lemma B.2. Lemma A.8 is a quite general result without any assumptions on any parameters. To ensure that we can make the telescoping sum over the function value part, and also to simplify the term besides the function value part, we need some specific conditions on our parameters to be satisfied, which is stated in Lemma B.4. With these extra conditions, we can finally find the following guarantee for the function value gap of the final output u(S). Lemma B.4. For all S 1, if the parameters satisfy 2 a(s) a(s) 1 a(s) q(s), s [S] Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction and (1 a(s+1))Ts+1 q(s+1)a(s+1) Ts q(s)a(s) , s [S 1] . then we have E TS q(S)a(S) (F(u(S)) F(x )) q(1)a(1) (F(u(0)) F(x )) x(s) t 1 x 2 γ(s) t 1 x(s) t x 2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 # Proof. If (2 a(s))a(s) 1 a(s) q(s) for any s [S], by using Lemma B.3, we know E h F(u(s)) F(x ) i E h 1 a(s) F(u(s 1)) F(x ) i γ(s) t 1q(s)a(s) x(s) t 1 x 2 x(s) t x 2 # β 2 a(s) a(s) 2 2 1 a(s) γ(s) t 1q(s)a(s) ! 
x(s) t x(s) t 1 2 # E h (1 a(s))(F(u(s 1)) F(x )) i x(s) t 1 x 2 γ(s) t 1 x(s) t x 2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 !# Now multiply both sides by Ts q(s)a(s) , we have E Ts q(s)a(s) (F(u(s)) F(x )) E (1 a(s))Ts q(s)a(s) (F(u(s 1)) F(x )) x(s) t 1 x 2 γ(s) t 1 x(s) t x 2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 # If (1 a(s+1))Ts+1 q(s+1)a(s+1) Ts q(s)a(s) is satisfied for any s [S 1], we can make the telescoping sum from s = 1 to S to get E TS q(S)a(S) (F(u(S)) F(x )) q(1)a(1) (F(u(0)) F(x )) x(s) t 1 x 2 γ(s) t 1 x(s) t x 2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 # Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction B.2. Bound for the residual term By the analysis in the previous subsection, we get an upper bound for the function value gap of u(S) involving F(u(0)) F(x ) and x(s) t 1 x 2 γ(s) t 1 x(s) t x 2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 # In this subsection we will show how to bound 15 under the compact assumption of X. Before giving the detailed analysis of the two different update options, we first state the following lemma to simplify 15. Lemma B.5. If γ(s) t γ(s) t 1 for any s [S], t [Ts] and X is a compact convex set with diameter D, then we have x(s) t 1 x 2 γ(s) t 1 x(s) t x 2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 # u(0) x 2 + E γ(s) t γ(s) t 1 2 D2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 # Proof. It follows that x(s) t 1 x 2 γ(s) t 1 x(s) t x 2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 x(s) t 1 x 2 γ(s) t 2 x(s) t x 2 + γ(s) t γ(s) t 1 2 x(s) t x 2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 x(s) t 1 x 2 γ(s) t 2 x(s) t x 2 + γ(s) t γ(s) t 1 2 D2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 x(s) 0 x 2 γ(s) Ts 2 x(s) Ts x 2 + γ(s) t γ(s) t 1 2 D2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 ! x(s) 0 x 2 γ(s+1) 0 x(s+1) 0 x 2 + γ(s) t γ(s) t 1 2 D2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 ! x(1) 0 x 2 γ(S+1) 0 x(S+1) 0 x 2 + γ(s) t γ(s) t 1 2 D2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 x(1) 0 x 2 + γ(s) t γ(s) t 1 2 D2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 γ(s) t γ(s) t 1 2 D2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 , where (a) is due to γ(s) t γ(s) t 1 and x(s) t x D, (b) follows from the definition of x(s+1) 0 = x(s) Ts and γ(s+1) 0 = γ(s) Ts , (c) is by the definition of x(1) 0 = u(0) and γ(1) 0 = γ. Now Taking expectations with both sides yields what we want. With the above result, we can show the bound of 15 under Option I and Option II respectively. There are two key common parts in our analysis, the first one is to notice that we can reduce the doubly indexed sequence n x(s) t o and n γ(s) t o into two singly indexed sequences, which are much easier to bound. The second technique is to define a hitting time τ to upper bound γ(s) t . Read our proof for the details. Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction Lemma B.6. For Option I, if X is a compact convex set with diameter D and 2η2 > D2, we have x(s) t 1 x 2 γ(s) t 1 x(s) t x 2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 # u(0) x 2 + β D2 + 2(η2 + D2) log 2η2β 2η2 D2 Proof. For Option I, by the definition of γ(s) t , we have γ(s) t γ(s) t 1, s [S] , t [Ts] . By requiring that X is a compact convex set with diameter D, we can apply Lemma B.5 and obtain x(s) t 1 x 2 γ(s) t 1 x(s) t x 2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 # u(0) x 2 + E γ(s) t γ(s) t 1 2 D2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 # Note that the last element x(s) Ts (resp. γ(s) Ts ) in the s-th epoch is just the start element x(s+1) 0 (resp. 
γ(s+1) 0 ) in the (s + 1)-th epoch, which means we can consider the doubly indexed sequences {x(s) t } and {γ(s) t } as two singly indexed sequences {x t, t 0} and {γ t, t 0, γ 0 = γ} with the reformulated update rule as follows γ t = γ t 1 x t x t 1 2 Besides, by defining T = PS s=1 Ts, we have γ(s) t γ(s) t 1 2 D2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 = γ t γ t 1 2 D2 + β γ t 1 2 x t x t 1 2 . Note that we require 2η2 > D2, so if γ 2η2β 2η2 D2 β 4η2 γ t 1 0, by using the reformulated update rule, we have γ t γ t 1 2 D2 + β γ t 1 2 x t x t 1 2 (γ t)2 (γ t 1)2 2(γ t + γ t 1) D2 + β γ t 1 2 x t x t 1 2 2η2(γ t + γ t 1) x t x t 1 2 + β γ t 1 2 x t x t 1 2 γ t 1 4η2 D2 + β γ t 1 2 x t x t 1 2 x t x t 1 2 Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction where (a) is by γ t γ t 1. Now we assume γ < 2η2β 2η2 D2 , define τ = max t [T ], γ t 1 < 2η2β 2η2 D2 By our assumption on γ, we know τ 1, Combining the reformulated update rule, we have γ t γ t 1 2 D2 + β γ t 1 2 x t x t 1 2 x t x t 1 2 x t x t 1 2 x t x t 1 2 x t x t 1 2 ! t=1 η2 (γ t)2 (γ t 1)2 D2 + η2 + D2 τ 1 X (γ t)2 (γ t 1)2 D2 + 2 η2 + D2 τ 1 X t=1 log γ t γ t 1 γ D2 + 2 η2 + D2 log γ τ 1 D2 + 2 η2 + D2 log 2η2β 2η2 D2 where (b) is by γ t 1 γ, (c) is by x τ x τ 1 D, (d) is by the reformulated update rule, (e) is due to γ t = γ t 1 x t x t 1 2 (f) is by the inequality 1 1 x2 log x2 = 2 log x, (g) is by the definition of τ. Combining two cases of γ, we obtain the bound γ(s) t γ(s) t 1 2 D2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 = γ t γ t 1 2 D2 + β γ t 1 2 x t x t 1 2 D2 + 2 η2 + D2 log 2η2β 2η2 D2 Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction By plugging in (17) into (16), we have x(s) t 1 x 2 γ(s) t 1 x(s) t x 2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 # u(0) x 2 + β D2 + 2 η2 + D2 log 2η2β 2η2 D2 Lemma B.7. For Option II, if X is a compact set with diameter D, we have x(s) t 1 x 2 γ(s) t 1 x(s) t x 2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 # u(0) x 2 + η2 η2 + β γ + 2D2 Proof. For Option II, by the definition of γ(s) t , we have γ(s) t γ(s) t 1, s [S] , t [Ts] . By requiring that X is a compact convex set with diameter D, we can apply Lemma B.5 and obtain x(s) t 1 x 2 γ(s) t 1 x(s) t x 2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 # u(0) x 2 + E γ(s) t γ(s) t 1 2 D2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 # Note that the last element x(s) Ts (resp. γ(s) Ts ) in the s-th epoch is just the starting element x(s+1) 0 (resp. γ(s+1) 0 ) in the (s+1)-th epoch, which means we can consider the doubly indexed sequences {x(s) t } and {γ(s) t } as two singly indexed sequences {x t, t 0} and {γ t, t 0, γ 0 = γ} with the reformulated update rule as follows γ t = γ t 1 + x t x t 1 2 Besides, by defining T = PS s=1 Ts, we have γ(s) t γ(s) t 1 2 D2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 = γ t γ t 1 2 D2 + β γ t 1 2 x t x t 1 2 . 2η2 + β γ t 1 2 0, by using the reformulated update rule, we have γ t γ t 1 2 D2 + β γ t 1 2 x t x t 1 2 = 2η2 + β γ t 1 2 x t x t 1 2 Now we assume γ < D2 η2 + β. Define τ = max t [T ], γ t 1 < D2 Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction By our assumption on γ, we know τ 1. Combining the reformulated update rule, we have γ t γ t 1 2 D2 + β γ t 1 2 x t x t 1 2 2η2 + β γ t 1 2 x t x t 1 2 2η2 + β γ t 1 2 x t x t 1 2 x t x t 1 2 η2 γ t γ t 1 x τ x τ 1 2 η2 + β γ 2D2 where (a) is by the fact γ t 1 γ, (b) and (c) are by the reformulated update rule, (d) is by the definition of τ and x τ x τ 1 D. 
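Two elementary facts about the nondecreasing sequence (γ_t) appear in the Option I bound above: the identity ((γ_t)^2 − (γ_{t−1})^2) / (2(γ_t + γ_{t−1})) = (γ_t − γ_{t−1})/2, whose sum over t telescopes, and the inequality 1 − 1/x^2 ≤ 2 log x for x ≥ 1 used in step (f). A minimal numeric sanity check of these two facts (the random sequence is purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    gam = 1.0 + np.cumsum(rng.random(50))    # a positive nondecreasing surrogate for (gamma_t)
    lhs = np.sum((gam[1:] ** 2 - gam[:-1] ** 2) / (2 * (gam[1:] + gam[:-1])))
    assert abs(lhs - (gam[-1] - gam[0]) / 2) < 1e-10        # telescoping identity

    x = np.linspace(1.0, 50.0, 1000)
    assert np.all(1 - 1 / x ** 2 <= 2 * np.log(x) + 1e-12)  # inequality used in step (f)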
Combining two cases of γ, we obtain the bound γ(s) t γ(s) t 1 2 D2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 γ t γ t 1 2 D2 + β γ t 1 2 x t x t 1 2 η2 + β γ + 2D2 η2 + β γ . (19) By plugging in (19) into (18), we have x(s) t 1 x 2 γ(s) t 1 x(s) t x 2 + β γ(s) t 1 2 x(s) t x(s) t 1 2 # u(0) x 2 + η2 η2 + β γ + 2D2 B.3. Parameter bound Combining the previous two parts analysis on the function value gap and the residual term, we already can see the bound for F(u(S)) F(x ). However, we need to make sure that our choice stated in Theorem B.1 indeed satisfies the conditions used in previous lemmas, besides, we also need to give the bounds for our choice explicitly. The following two lemmas can help us to do this. Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction Lemma B.8. Under the choice of parameters in Theorem B.1, s 1, we have the following facts 2, 2 a(s) a(s) 1 a(s) q(s), (1 a(s+1))Ts+1 q(s+1)a(s+1) Ts q(s)a(s) . Proof. Under the choice of parameters in Theorem B.1, the first inequality follows that a(s0) = 1 (4n) 0.5s0 1 (4n) 0.5log2 log2 4n = 1 For the second inequality, note that 2 a(s) a(s) 1 a(s) q(s) = ( 2 a(s) a(s) 2 1 s s0 3 8 s0 < s . By noticing 2 a(s) a(s) 2 a(s) 1, the inequality (2 a(s))a(s) 1 a(s) q(s) becomes true immediately. For the third inequality, note that we have Ts n, we only need to prove for any s 1, there is q(s+1)a(s+1) 1 q(s)a(s) . We consider the following three cases: For 1 s s0 1, note that 1 a(s+1) 2 = (4n) 0.5s = 1 a(s), q(s) = 1 (1 a(s))a(s) . We know q(s+1)a(s+1) = (1 a(s+1))2 = 1 q(s)a(s) . For s = s0, note that a(s0+1) = c 1+2c = 9 33 8 , q(s0+1) = 8(2 a(s0+1))a(s0+1) 3(1 a(s0+1)) we have q(s0+1)a(s0+1) = 3(1 a(s0+1))2 8(2 a(s0+1)) a(s0+1) 2 2 (a) 1 a(s0) (b) = 1 q(s0)a(s0) , where (a) is by a(s0) 1 2, (b) is by q(s0) = 1 (1 a(s0))a(s0) . For s s0 + 1, note that q(s) = 8(2 a(s))a(s) 3(1 a(s)) , by plugging in q(s), we only need to show (1 a(s+1))2 (a(s+1))2(2 a(s+1)) 1 a(s) (a(s))2(2 a(s)). Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction Plug in a(s) = c s s0+2c, the above inequality is equivalent to (2(s s0) + 3c)(s s0 + 1 + 2c)(s s0 + 1 + c)2 (2(s s0) + 2 + 3c)(s s0 + c)(s s0 + 2c)2. Let y = s s0 1, we need to show (2y + 3c)(y + 1 + 2c)(y + 1 + c)2 (2y + 2 + 3c)(y + c)(y + 2c)2 is true for y 1. People can check when c = 3+ 33 4 , the above inequality is right for y 1. Lemma B.9. Under the choice of parameters in Theorem B.1, s 1, we have the following bounds q(1)a(1) = 1 and q(s)a(s) ( 4 (4n)1 0.5s 1 s s0 2(5+ 3n(s s0+2c)2 s0 < s . Proof. Note that a(1) = 1 1 2 n, T1 = n, q(1) = 1 (1 a(1))a(1) , plugging in these values, we obtain q(1)a(1) = (1 a(1))2T1 For 1 s s0, note that q(s) = 1 (1 a(s))a(s) in our choice, so we know Ts = 1 Ts(1 a(s)) (a) = 4 (4n)1 0.5s where (a) is by plugging in Ts = n and a(s) = 1 (4n) 0.5s. For s > s0, note that q(s) = 8(2 a(s))a(s) 3(1 a(s)) we have Ts = 8(2 a(s))(a(s))2 3Ts(1 a(s)) (b) = 8(2 a(s))(a(s))2 3n(s s0 + 2c)2 , where (b) is by plugging in Ts = n, (c) is by noticing 2 a(s) 1 a(s) 2 a(s0+1) 1 a(s0+1) = 5+ 33 4 for s > s0, and plug in a(s) = c s s0+2c. Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction B.4. Putting all together We are now ready to put everything together and complete the proof of Theorem B.1. Proof. (Theorem B.1) By Lemma B.8, s 1, we have 2 a(s) a(s) 1 a(s) q(s), (1 a(s+1))Ts+1 q(s+1)a(s+1) Ts q(s)a(s) . Hence all the conditions for Lemma B.4 are satisfied. 
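As a numerical illustration of these conditions, the sketch below evaluates the schedule of Theorem B.1 for a few values of n and checks the two inequalities of Lemma B.8 directly. The closed forms used here, a(s) = 1 − (4n)^(−0.5^s) and q(s) = 1/((1 − a(s))a(s)) for s ≤ s0, a(s) = c/(s − s0 + 2c) and q(s) = 8(2 − a(s))a(s)/(3(1 − a(s))) for s > s0, with T_s = n and s0 = ⌈log2 log2 4n⌉, are spelled out here as an explicit assumption for concreteness rather than quoted verbatim from the theorem statement:

    import math

    def schedule(n, S):
        s0 = math.ceil(math.log2(math.log2(4 * n)))
        c = (3 + math.sqrt(33)) / 4
        a, q = {}, {}
        for s in range(1, S + 1):
            if s <= s0:
                a[s] = 1 - (4 * n) ** (-0.5 ** s)
                q[s] = 1 / ((1 - a[s]) * a[s])
            else:
                a[s] = c / (s - s0 + 2 * c)
                q[s] = 8 * (2 - a[s]) * a[s] / (3 * (1 - a[s]))
        return s0, a, q

    for n in [100, 1000, 12345]:
        S = 40
        s0, a, q = schedule(n, S)
        T = {s: n for s in range(1, S + 1)}          # T_s = n in every epoch
        for s in range(1, S + 1):
            # per-epoch condition of Lemma B.4 / B.8
            assert (2 - a[s]) * a[s] / (1 - a[s]) <= q[s] + 1e-9
        for s in range(1, S):
            # telescoping condition; it holds with equality for s < s0, so allow rounding
            lhs = (1 - a[s + 1]) * T[s + 1] / (q[s + 1] * a[s + 1])
            rhs = T[s] / (q[s] * a[s])
            assert lhs <= rhs * (1 + 1e-9)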
Besides, we assume X is a compact convex set with diameter D, which satisfies the requirements for Lemma B.7 and B.6. 1. For Option I, by Lemma B.4 and B.6 E TS q(S)a(S) (F(u(S)) F(x )) (1 a(1))T1 q(1)a(1) (F(u(0)) F(x )) + γ D2 + 2(η2 + D2) log 2η2β 2η2 D2 2. For Option II, by Lemma B.4 and B.7 E TS q(S)a(S) (F(u(S)) F(x )) (1 a(1))T1 q(1)a(1) (F(u(0)) F(x )) + γ η2 + β γ + 2D2 Plugging in the bound (1 a(1))T1 q(1)a(1) = 1 4 from Lemma B.9, we have E TS q(S)a(S) (F(u(S)) F(x )) V E h F(u(S)) F(x ) i q(S)a(S)V 2V (4n)1 0.5S 1 S s0 33)c2V 3n(S s0+2c)2 s0 < S , where (a) is by the bound for q(S)a(S) TS from Lemma B.9. n , we choose S = log2 log2 4V ϵ log2 log2 4n = s0, so we have E h F(u(S)) F(x ) i 2V (4n)1 0.5S Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction where (b) is by n V ϵ , (c) is by 4V ϵ 0.5S = 4V ϵ 0.5 log2 log2 4V ϵ 0.5log2 log2 4V 2. The number of individual gradient evaluations is #grads = n S + = 3n log2 log2 4V = O n log log V n , we choose S = s0 + c q 33)V 3nϵ 15 = s0 + 1, so we have E h F(u(S)) F(x ) i (5 + 33)c2V 3n(S s0 + 2c)2 3n s0 + c q 33)V 3nϵ 15 33)V 3nϵ + c The number of individual gradient evaluations is #grads = n S + = 3ns0 + 3n(S s0) = 3n log2 log2 4n + 3n 33)V 3nϵ 15 n log log n + C. Ada VRAE for known β In this section, we give a non-adaptive version of our algorithm Ada VRAE. The algorithm is shown in Algorithm 3. The only change is in the step size: we set γ(s) t = 8β for all epochs s and iterations t. The analysis readily extends to show the following convergence guarantee: Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction Algorithm 3 VRAE Input: initial point u(0), smoothness parameter β. Parameters: {a(s)}, {Ts}, A(0) T0 > 0 x(1) 0 = z(1) 0 = u(0), compute f(u(0)) for s = 1 to S: A(s) 0 = A(s 1) Ts 1 Ts a(s) 2 for t = 1 to Ts: x(s) t = arg minx X a(s) D g(s) t 1, x E + a(s)h(x) + 4β x z(s) t 1 2 Let A(s) t = A(s) t 1 + a(s) + a(s) 2 x(s) t = 1 A(s) t A(s) t 1x(s) t 1 + a(s)x(s) t + a(s) 2 u(s 1) if t = Ts: Pick i(s) t Uniform ([n]) g(s) t = fi(s) t (x(s) t ) fi(s) t (u(s 1)) + f(u(s 1)) else: g(s) t = f(x(s) t ) z(s) t = arg minz X a(s) D g(s) t , z E + a(s)h(z) + 4β z z(s) t 1 2 u(s) = x(s+1) 0 = x(s) Ts , z(s+1) 0 = z(s) Ts , g(s+1) 0 = g(s) Ts return u(S) Theorem C.1. Let s0 = log2 log2 4n , c = 3 2. If we choose parameters as follows ( (4n) 0.5s 1 s s0 s s0 1+c 2c s0 < s , A(0) T0 = 5 The number of gradient evaluations to achieve a solution u(S) such that E F(u(S)) F(x ) ϵ for Algorithm 3 is O n log log V n O n log log n + q where V = 5 2 F(u(0)) F(x ) + 8β u(0) x 2. Proof. Note that Algorithm 3 is essentially the same as Algorithm 1 by choosing γ(s) t 8β with no other changes. Hence the requirements for Lemma A.8 still hold. So we can obtain E h A(S) TS F(u(S)) F(x ) i A(0) T0 F(u(0)) F(x ) + 4β u(0) x 2 . Then by the similar proof in Theorem A.1, we get the desired result. D. Ada VRAG for known β In this section, we give a non-adaptive version of our algorithm Ada VRAG. The algorithm is shown in Algorithm 4. 
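Before giving its guarantee, it may help to see the variance-reduced gradient estimator g(s) t = ∇f_i(y) − ∇f_i(u(s−1)) + ∇f(u(s−1)) shared by Algorithms 3 and 4 (with y the current x̄-type query point) written out in code. The sketch below uses a synthetic least-squares finite sum purely for illustration, with data and dimensions that are not those of our experiments, and checks unbiasedness, i.e., that averaging over the random index i recovers ∇f(y):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 10
    A = rng.standard_normal((n, d))
    b = rng.standard_normal(n)

    def grad_i(i, x):
        # gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2
        return (A[i] @ x - b[i]) * A[i]

    def full_grad(x):
        # gradient of f(x) = (1/n) * sum_i f_i(x)
        return A.T @ (A @ x - b) / n

    def vr_grad(i, y, u, g_u):
        # variance-reduced estimator queried at y, with reference point u
        # and stored full gradient g_u = grad f(u)
        return grad_i(i, y) - grad_i(i, u) + g_u

    u = rng.standard_normal(d)     # reference point u^(s-1)
    y = rng.standard_normal(d)     # query point
    g_u = full_grad(u)
    avg = np.mean([vr_grad(i, y, u, g_u) for i in range(n)], axis=0)
    assert np.allclose(avg, full_grad(y))   # E_i[g] = grad f(y)

Once ∇f(u(s−1)) has been computed and stored at the start of the epoch, each iteration costs only two component gradients, which is what the gradient-evaluation counts in Theorems C.1 and D.1 account for.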
VRAG admits the following convergence guarantee: Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction Algorithm 4 VRAG Input: initial point u(0), smoothness parameter β Parameters: {a(s)} where a(s) (0, 1), {Ts} x(1) 0 = u(0) for s = 1 to S: x(s) 0 = a(s)x(s) 0 + (1 a(s))u(s 1), calculate f(u(s 1)) for t = 1 to Ts: Pick i(s) t Uniform ([n]) g(s) t = fi(s) t (x(s) t 1) fi(s) t (u(s 1)) + f(u(s 1)) x(s) t = arg minx X D g(s) t , x E + h(x) + β(2 a(s))a(s) x x(s) t 1 2 x(s) t = a(s)x(s) t + (1 a(s))u(s 1) u(s) = 1 Ts PTs t=1 x(s) t , x(s+1) 0 = x(s) Ts return u(S) Theorem D.1. (Convergence of VRAG) Define s0 = log2 log2 4n , c = 3+ 33 4 . If we choose the parameters as follows ( 1 (4n) 0.5s 1 s s0 c s s0+2c s0 < s , The number of individual gradient evaluations to achieve a solution u(S) such that E F(u(S)) F(x ) ϵ for Algorithm 4 is O n log log V n O n log log n + q 2(F(u(0)) F(x )) + β u(0) x 2 . Before giving the proof of Theorem C.1, we state some intuition on our parameter choice. Note that by defining the following two auxiliary sequences 1 (1 a(s))a(s) 1 s s0 (2 a(s))a(s) 1 a(s) s0 < s , γ(s) t 1 = β 2 a(s) a(s) (1 a(s))q(s) , t [Ts] , the update rule of x(s) t in every epoch in Algorithm 4 is equivalent to the update rule of x(s) t in every epoch in Algorithm 2. Since γ(s) t 1 is a constant in the corresponding epoch now, we will use γ(s) without the subscript to simplify the notation. The above argument means that we can apply Lemma B.3 directly to obtain the following lemma. Lemma D.2. For all epochs s 1, we have E h F(u(s)) F(x ) i E 1 a(s) F(u(s 1)) F(x ) + γ(s)q(s)a(s) x(s) 0 x 2 x(s+1) 0 x 2 . Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction Proof. By applying Lemma B.3, we know E h F(u(s)) F(x ) i E h 1 a(s) F(u(s 1)) F(x ) i γ(s) t 1q(s)a(s) x(s) t 1 x 2 x(s) t x 2 # β 2 a(s) a(s) 2 2 1 a(s) γ(s) t 1q(s)a(s) ! x(s) t x(s) t 1 2 # " 1 a(s) F(u(s 1)) F(x ) + γ(s)q(s)a(s) x(s) t 1 x 2 x(s) t x 2 # =E 1 a(s) F(u(s 1)) F(x ) + γ(s)q(s)a(s) x(s) 0 x 2 x(s) Ts x 2 (b) =E 1 a(s) F(u(s 1)) F(x ) + γ(s)q(s)a(s) x(s) 0 x 2 x(s+1) 0 x 2 , where (a) is by γ(s) t 1q(s) = β(2 a(s))a(s) 1 a(s) and γ(s) = γ(s) t 1, t [Ts], (b) is by x(s+1) 0 = x(s) Ts . Now if we still multiply both sides by Ts q(s)a(s) , we need to ensure that γ(s) can help us to make a telescoping sum. However, this is not always true. So we need some different conditions as stated in the following lemma to obtain a bound for the function value gap of u(S). The new bound for the function value gap of u(S) for Algorithm 4 is as follows. Lemma D.3. If s = s0, we have a(s+1) a(s), (1 a(s+1))Ts+1 q(s+1)a(s+1) Ts q(s)a(s) . Additionally, for s0, assume we have 1 a(s0+1) 2 Ts0+1 2 a(s0+1) a(s0+1) 2 (1 a(s0))Ts0 2 a(s0) a(s0) 2 . Then for S s0, E TS q(S)a(S) F(u(S)) F(x ) 1 a(1) T1 q(1)a(1) F(u(0)) F(x ) + β For S > s0, " 2 a(s0) a(s0) 2 TS q(S)a(S) F(u(S)) F(x ) # 1 a(1) T1 q(1)a(1) F(u(0)) F(x ) + β Proof. 
By applying Lemma D.2 and multiply both sides by Ts q(s)a(s) ,we have E Ts q(s)a(s) F(u(s)) F(x ) E " 1 a(s) Ts q(s)a(s) F(u(s 1)) F(x ) + γ(s) x(s) 0 x 2 x(s+1) 0 x 2 # Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction E TS q(S)a(S) F(u(S)) F(x ) " 1 a(1) T1 q(1)a(1) F(u(0)) F(x ) + x(s) 0 x 2 x(s+1) 0 x 2 # 1 a(1) T1 q(1)a(1) F(u(0)) F(x ) + E β 2 a(s) a(s) 2 x(s) 0 x 2 x(s+1) 0 x 2 # 1 a(1) T1 q(1)a(1) F(u(0)) F(x ) + β 2 a(1) a(1) 2 β h 2 a(s+1) a(s+1) 2 2 a(s) a(s) 2i x(s+1) 0 x 2 " β 2 a(S) a(S) 2 x(S+1) 0 x 2 # 1 a(1) T1 q(1)a(1) F(u(0)) F(x ) + β " β 2 a(S) a(S) 2 x(S+1) 0 x 2 # where (a) is by the definition of γ(s) when s s0, (b) is by 2 a(1) a(1) 2 1 and x(1) 0 = u(0), additionally, note that our assumption a(s+1) a(s) 2 a(s+1) a(s+1) 2 2 a(s) a(s) 2. For S > s0, we can also make the telescoping sum from s = s0 + 1 to S by a similar argument to get E TS q(S)a(S) F(u(S)) F(x ) E " 1 a(s0+1) Ts0+1 q(s0+1)a(s0+1) F(u(s0)) F(x ) + β x(s0+1) 0 x 2 # Multiplying both sides by 2 a(s0) a(s0) 2, we have " 2 a(s0) a(s0) 2 TS q(S)a(S) F(u(S)) F(x ) # " 2 a(s0) a(s0) 2 1 a(s0+1) Ts0+1 q(s0+1)a(s0+1) F(u(s0)) F(x ) + β 2 a(s0) a(s0) 2 x(s0+1) 0 x 2 # " 2 a(s0) a(s0) 2 1 a(s0+1) 2 Ts0+1 (2 a(s0+1)) a(s0+1) 2 F(u(s0)) F(x ) + β 2 a(s0) a(s0) 2 x(s0+1) 0 x 2 # where (c) is by the definition q(s0+1) = (2 a(s0+1))a(s0+1) 1 a(s0+1) . Note that by our assumption 2 a(s0) a(s0) 2 1 a(s0+1) 2 Ts0+1 (2 a(s0+1)) a(s0+1) 2 (1 a(s0))Ts0 = Ts0 q(s0)a(s0) , Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction " 2 a(s0) a(s0) 2 TS q(S)a(S) F(u(S)) F(x ) # " Ts0 q(s0)a(s0) F(u(s0)) F(x ) + β 2 a(s0) a(s0) 2 x(s0+1) 0 x 2 # Now combining E Ts0 q(s0)a(s0) F(u(s0)) F(x ) 1 a(1) T1 q(1)a(1) F(u(0)) F(x ) + β " β 2 a(s0) a(s0) 2 x(s0+1) 0 x 2 # " 2 a(s0) a(s0) 2 TS q(S)a(S) F(u(S)) F(x ) # 1 a(1) T1 q(1)a(1) F(u(0)) F(x ) + β Using the above new lemma w.r.t. the function value gap of u(S), we finally can give the proof of Theorem D.1. Proof. (Theorem D.1) Note that by our choice a(s+1) a(s) is true for any s = s0. Besides, our parameters a(s) and q(s) are totally the same as the choice in Theorem B.1 when s s0. Hence we know (1 a(s+1))Ts+1 q(s+1)a(s+1) Ts q(s)a(s) is still true for s s0 1. For s s0 + 1, note that our new q(s) are only different from the choice in Theorem B.1 by a constant, which implies (1 a(s+1))Ts+1 q(s+1)a(s+1) Ts q(s)a(s) also holds for s s0 + 1. Besides, we can show 1 a(s0+1) 2 Ts0+1 2 a(s0+1) a(s0+1) 2 (1 a(s0))Ts0 2 a(s0) a(s0) 2 is true by plugging in the value of a(s0+1) = c 1+2c and noticing that a(s0) 1 2. Hence all the conditions for Lemma D.3 are satisfied, then we know for S s0, E TS q(S)a(S) F(u(S)) F(x ) 1 a(1) T1 q(1)a(1) F(u(0)) F(x ) + β For S > s0, " 2 a(s0) a(s0) 2 TS q(S)a(S) F(u(S)) F(x ) # 1 a(1) T1 q(1)a(1) F(u(0)) F(x ) + β By noticing a(s0) = 1 (4n) 0.5s0 1 (4n) 0.5(log2 log2 4n)+1 2 a(s0) a(s0) 2 2 Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction and 1 a(1) T1 q(1)a(1) = 1 combining the fact that our new q(s) for S > s0 have the same order of the choice in Theorem B.1. Following a similar proof, we can arrive the desired result. E. Hyperparameter choices and additional results Table 2 reports the hyperparameter choices used in the experiments. VRAG and VRAE are the non-adaptive versions our algorithms (Algorithms 3 and 4). We set their step sizes via a hyperparameter search as described in Section 3. Figures 5, 6, 7, 8 give the experimental evaluation of our non-adaptive algorithms. Table 2. 
Hyperparameters used in the experiments

Dataset    Loss      SVRG   SVRG++   VARAG   VRADA   VRAG   VRAE
a1a        logistic  0.5    0.5      1       1       1      1
a1a        squared   0.01   0.05     0.05    0.1     0.1    0.05
a1a        huber     0.05   0.1      0.1     0.5     0.1    0.1
mushrooms  logistic  0.5    1        1       1       1      1
mushrooms  squared   0.01   0.01     0.05    0.1     0.05   0.01
mushrooms  huber     0.05   0.1      0.1     0.1     0.1    0.05
w8a        logistic  0.1    1        1       100     1      5
w8a        squared   0.01   0.01     0.01    100     0.05   0.05
w8a        huber     0.01   0.1      0.1     100     0.1    0.5
phishing   logistic  50     100      100     100     100    100
phishing   squared   0.05   0.5      1       1       1      1
phishing   huber     0.5    1        1       5       5      5

Figure 5. a1a: (a) Logistic loss, (b) Squared loss, (c) Huber loss.
Figure 6. mushrooms: (a) Logistic loss, (b) Squared loss, (c) Huber loss.
Figure 7. w8a: (a) Logistic loss, (b) Squared loss, (c) Huber loss.
Figure 8. phishing: (a) Logistic loss, (b) Squared loss, (c) Huber loss.
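For completeness, a minimal sketch of the three per-example losses referenced in Table 2 and Figures 5-8 (logistic, squared, and Huber); the Huber threshold delta = 1 below is an illustrative choice and not necessarily the value used in our experiments:

    import numpy as np

    def logistic_loss(w, x, y):             # label y in {-1, +1}
        return np.log1p(np.exp(-y * (x @ w)))

    def squared_loss(w, x, y):
        return 0.5 * (x @ w - y) ** 2

    def huber_loss(w, x, y, delta=1.0):     # delta = 1 is an illustrative choice
        r = x @ w - y
        return np.where(np.abs(r) <= delta,
                        0.5 * r ** 2,
                        delta * (np.abs(r) - 0.5 * delta))

    # tiny usage example on random data
    rng = np.random.default_rng(0)
    x, w = rng.standard_normal(5), rng.standard_normal(5)
    print(logistic_loss(w, x, 1.0), squared_loss(w, x, 0.3), huber_loss(w, x, 0.3))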