Published as a conference paper at ICLR 2024

EFFICIENT CONTINUAL FINITE-SUM MINIMIZATION

Ioannis Mavrothalassitis, Stratis Skoulakis, Leello Tadesse Dadi, Volkan Cevher
LIONS, École Polytechnique Fédérale de Lausanne
{ioannis.mavrothalassitis, efstratios.skoulakis, leello.dadi, volkan.cevher}@epfl.ch
Equal contribution.

Given a sequence of functions $f_1, \dots, f_n$ with $f_i : D \to \mathbb{R}$, finite-sum minimization seeks a point $x^\star \in D$ minimizing $\sum_{j=1}^{n} f_j(x)/n$. In this work, we propose a key twist on finite-sum minimization, dubbed continual finite-sum minimization, which asks for a sequence of points $\hat{x}_1, \dots, \hat{x}_n \in D$ such that each $\hat{x}_i \in D$ minimizes the prefix-sum $\sum_{j=1}^{i} f_j(x)/i$. Assuming that each prefix-sum is strongly convex, we develop a first-order continual stochastic variance reduction gradient method (CSVRG) producing an $\epsilon$-optimal sequence with $\mathcal{O}(n/\epsilon^{1/3} + 1/\sqrt{\epsilon})$ overall first-order oracles (FO). An FO corresponds to the computation of a single gradient $\nabla f_j(x)$ at a given $x \in D$ for some $j \in [n]$. Our approach significantly improves upon the $\mathcal{O}(n/\epsilon)$ FOs that Stochastic Gradient Descent requires and the $\mathcal{O}(n^2 \log(1/\epsilon))$ FOs that state-of-the-art variance reduction methods such as Katyusha require. We also prove that there is no natural first-order method with $\mathcal{O}(n/\epsilon^{\alpha})$ gradient complexity for $\alpha < 1/4$, establishing that the first-order complexity of our method is nearly tight.

1 INTRODUCTION

Given $n$ data points describing a desired input-output relationship of a model, a cornerstone task in supervised machine learning (ML) is selecting the model's parameters so as to enforce data fidelity. In optimization terms, this task corresponds to minimizing an objective with a finite-sum structure.

Finite-sum minimization: Given a sequence of functions $f_1, \dots, f_n$ with $f_i : D \to \mathbb{R}$, let $x^\star \in \arg\min_{x \in D} g(x) := \sum_{i=1}^{n} f_i(x)/n$. Given an accuracy target $\epsilon > 0$, finite-sum minimization seeks an approximate solution $\hat{x} \in D$ such that
$$g(\hat{x}) - g(x^\star) \leq \epsilon. \quad (1)$$
We call such an $\hat{x}$ an $\epsilon$-accurate solution. If $\hat{x}$ is random then (1) takes the form $\mathbb{E}[g(\hat{x})] - g(x^\star) \leq \epsilon$.

In contemporary machine learning applications, $n$ can be in the order of billions, which makes it clear that methods tackling finite-sum minimization must scale efficiently with $n$ and the accuracy $\epsilon > 0$. First-order methods have been the standard choice for tackling (1) due to their efficiency and practical behavior. The complexity of a first-order method is captured through the notion of overall first-order oracles (FOs), where a first-order oracle corresponds to the computation of a single gradient $\nabla f_i(x)$ for some $i \in [n]$ and a point $x \in D$. In case each function $f_i$ is strongly convex, Stochastic Gradient Descent requires $\mathcal{O}(1/\epsilon)$ FOs Robbins and Monro (1951); Kiefer and Wolfowitz (1952) so as to solve (1). Over the last years, so-called variance reduction methods have been able to solve strongly convex finite-sum problems with $\mathcal{O}(n \log(1/\epsilon))$ FOs Nguyen et al. (2017b); Johnson and Zhang (2013); Xiao and Zhang (2014); Defazio et al. (2014); Roux et al. (2012b); Allen-Zhu (2017).

Continual Finite-Sum Minimization: In many modern applications new data constantly arrive over time. Hence it is important that a model is constantly updated so as to perform equally well both on the past and the new data Castro et al. (2018); Rosenfeld and Tsotsos (2018); Hersche et al. (2022). Unfortunately, updating the model only with respect to the new data can lead to a vast deterioration of its performance on the past data.
The latter phenomenon is known in the context of continual learning as catastrophic forgetting Castro et al. (2018); Goodfellow et al. (2014); Kirkpatrick et al. (2017); McCloskey and Cohen (1989). At the same time, retraining the model from scratch using both past and new data comes with a huge computational burden. Motivated by the above, we study a twist of (1), called continual finite-sum minimization.

Continual Finite-Sum Minimization: Given a sequence of functions $f_1, \dots, f_n$ with $f_i : D \to \mathbb{R}$, let $x^\star_i \in \arg\min_{x \in D} g_i(x) := \sum_{j=1}^{i} f_j(x)/i$ for all $i \in [n]$. Given an accuracy target $\epsilon > 0$, continual finite-sum minimization seeks a sequence of approximate solutions $\hat{x}_1, \dots, \hat{x}_n \in D$ such that
$$g_i(\hat{x}_i) \leq g_i(x^\star_i) + \epsilon \quad \text{for each } i \in [n]. \quad (2)$$
We call such a sequence an $\epsilon$-optimal sequence. If $\hat{x}_i$ is random then (2) takes the form $\mathbb{E}[g_i(\hat{x}_i)] - g_i(x^\star_i) \leq \epsilon$.

Remark 1. Throughout the paper we assume that each function $f_i(\cdot)$ is strongly convex and smooth, and that $D$ is convex and bounded. See Section 2 for the exact definitions.

Notice that $f_i(\cdot)$ can capture a new data point, meaning that (2) guarantees that the model (parameterized by $\hat{x}_i$) is well fitted to all of the data seen so far, at each stage $i \in [n]$. As already discussed, since $n$ can be in the order of millions, it is important to design first-order methods for continual finite-sum minimization that scale efficiently with $n$ and $\epsilon > 0$ in terms of overall FOs.

A first attempt at tackling (2) is via existing first-order methods for finite-sum minimization. As discussed above, Stochastic Gradient Descent can guarantee accuracy $\epsilon > 0$ for (1) using $\mathcal{O}(1/\epsilon)$ FOs. As a result, at each stage $i \in [n]$ one could perform $\mathcal{O}(1/\epsilon)$ stochastic gradient descent steps to guarantee accuracy $\epsilon > 0$. However, this approach would require $\mathcal{O}(n/\epsilon)$ FOs overall. On the other hand, one could use a variance reduction (VR) method to select $\hat{x}_i \in D$ at each stage $i \in [n]$. As already mentioned, such methods require $\mathcal{O}(i \log(1/\epsilon))$ FOs to guarantee accuracy $\epsilon > 0$ at stage $i \in [n]$. Thus the naive use of a VR method such as SVRG Johnson and Zhang (2013); Xiao and Zhang (2014), SAGA Defazio et al. (2014), SARAH Nguyen et al. (2017b) or Katyusha Allen-Zhu (2017) would require $\mathcal{O}(n^2 \log(1/\epsilon))$ FOs overall. We remark that for large values of $n$, and even for moderate accuracy $\epsilon > 0$, both $\mathcal{O}(n/\epsilon)$ and especially $\mathcal{O}(n^2 \log(1/\epsilon))$ are prohibitively large. As a result, the following question arises:

Question 1. Are there first-order methods for continual finite-sum minimization whose required number of FOs scales more efficiently with respect to $n$ and $\epsilon > 0$?

Our Contribution. The main contribution of this work is the design of a first-order method for continual finite-sum minimization, called CSVRG, with overall $\mathcal{O}(n/\epsilon^{1/3} + \log n/\sqrt{\epsilon})$ FO complexity. The latter significantly improves upon the $\mathcal{O}(n/\epsilon)$ FO complexity of SGD and the $\mathcal{O}(n^2 \log(1/\epsilon))$ complexity of variance reduction methods. As a naive baseline, we also present a sparse version of SGD tailored to (2) with $\mathcal{O}(\log n/\epsilon^2)$ FO complexity; the latter improves on the $n$-dependence of SGD at the cost of a worse dependence on $1/\epsilon$. Since variance-reduction methods such as Katyusha achieve $\mathcal{O}(n^2 \log(1/\epsilon))$ FO complexity, a natural question is whether there exists a first-order method maintaining the $\log(1/\epsilon)$ dependence with a subquadratic dependence on $n$, i.e., $\mathcal{O}(n^{2-\alpha} \log(1/\epsilon))$. We prove that the latter goal cannot be achieved for a wide class of first-order methods that we call natural first-order methods.
More precisely, we establish that there is no natural first-order method for (2) requiring $\mathcal{O}(n^{2-\alpha} \log(1/\epsilon))$ FOs for any $\alpha > 0$. We also establish that there is no natural first-order method for (2) requiring $\mathcal{O}(n/\epsilon^{\alpha})$ FOs for any $\alpha < 1/4$. The latter lower bound implies that the $\mathcal{O}(n/\epsilon^{1/3})$ FO complexity of CSVRG is close to being tight. Table 1 summarizes our results.

Comparing the methods at $\Theta(1/n)$ accuracy. Depending on the level of accuracy $\epsilon > 0$ that (2) requires, different methods might be more efficient than others. We remark that in the accuracy regime $\epsilon = \mathcal{O}(1/n)$ all previous first-order methods require $\mathcal{O}(n^2)$ FOs, while CSVRG requires only $\mathcal{O}(n^{4/3})$ FOs! The accuracy regime $\epsilon = \mathcal{O}(1/n)$ for (2) is of particular interest and has been the initial motivation of this work. The reason lies in the generalization bounds for the strongly convex case. More precisely, the statistical error of empirical risk minimization in the strongly convex case is $\Theta(1/n)$ Shalev-Shwartz et al. (2010). Thus the accuracy of the respective finite-sum problem should be selected as $\epsilon = \mathcal{O}(1/n)$, so as to match the unavoidable statistical error (see Bottou and Bousquet (2007) for the respective discussion). For accuracy $\epsilon = \Theta(1/n)$, a very interesting open question is bridging the gap between the $\mathcal{O}(n^{4/3})$ FO complexity of CSVRG and the $\Omega(n^{5/4})$ FO complexity that our lower bound suggests (setting $\epsilon = 1/n$ and $\alpha = 1/4$).

| Method | Number of FOs for $\epsilon$-accuracy at each stage $i \in [n]$ |
| --- | --- |
| Stochastic Gradient Descent | $\mathcal{O}\left(n/\epsilon\right)$ |
| SVRG/SAGA/SARAH | $\mathcal{O}\left(n^2 \log(1/\epsilon) + \frac{L}{\mu}\, n \log(1/\epsilon)\right)$ |
| Katyusha | $\mathcal{O}\left(n^2 \log(1/\epsilon) + \sqrt{\frac{L}{\mu}}\, n^{3/2} \log(1/\epsilon)\right)$ |
| SGD-sparse (naive baseline) | $\mathcal{O}\left(\frac{|D| G^3 \log n}{\mu \epsilon^2}\right)$ |
| CSVRG (this work) | $\mathcal{O}\left(\frac{L^{2/3} G^{2/3}}{\mu} \frac{n \log n}{\epsilon^{1/3}} + \frac{L^2 G}{\mu^{5/2}} \frac{\log n}{\sqrt{\epsilon}}\right)$ |
| Lower bound (this work) | $\Omega\left(n^2 \log(1/\epsilon)\right)$ |
| Lower bound (this work) | $\Omega\left(n/\epsilon^{1/4}\right)$ |

Table 1: Number of FOs for the strongly convex case for an $\epsilon$-accurate solution at each stage $i \in [n]$.

Our Techniques. As already discussed, the naive use of a VR method at each stage $i \in [n]$ results in $\mathcal{O}(n^2 \log(1/\epsilon))$ FOs overall. A first approach to alleviate this is to use such a method sparsely across the $n$ stages and, in most intermediate stages, compute $\hat{x}_i \in D$ via $\mathcal{O}(1)$ FOs using a stochastic gradient estimator. However, both the sparsity level at which a VR method is used and the stochastic gradient estimator itself are crucial design choices for establishing accuracy $\epsilon > 0$ at each stage $i \in [n]$. Our first-order method (CSVRG) adopts a variance-reduction pipeline along the above lines. CSVRG (Algorithm 1) maintains a direction $\Delta_i$ that at stage $i \in [n]$ tries to approximate $\sum_{k=1}^{i} \nabla f_k(\hat{x}_{i-1})/i$. Since computing the latter direction at each stage $i \in [n]$ would lead to $\Omega(n^2)$ FOs, CSVRG:
- sets $\Delta_i \leftarrow \sum_{k=1}^{i} \nabla f_k(\hat{x}_i)/i$ only in a very sparse sub-sequence of stages ($i$ FOs);
- sets $\Delta_i \leftarrow \left(1 - \frac{1}{i}\right)\Delta_{i-1} + \frac{1}{i}\nabla f_i(\hat{x}_{\mathrm{prev}})$ in most stages (1 FO).

In order to output an $\epsilon$-accurate point $\hat{x}_i \in D$ at stage $i \in [n]$, we initialize an internal subroutine (Algorithm 2) at the previous output $\hat{x}_{i-1} \in D$ and perform $T_i$ stochastic gradient descent steps using the following novel gradient estimator
$$\Delta^t_i = \left(1 - \tfrac{1}{i}\right)\left(\nabla f_{u_t}(x^t_i) - \nabla f_{u_t}(\hat{x}_{\mathrm{prev}}) + \Delta_{i-1}\right) + \tfrac{1}{i}\nabla f_i(x^t_i),$$
where the index $u_t$ is selected uniformly at random in $[i-1]$ and $\mathrm{prev}$ is the latest stage $\ell \leq i-1$ at which $\Delta_\ell = \sum_{k=1}^{\ell} \nabla f_k(\hat{x}_\ell)/\ell$. Notice that the estimator $\Delta^t_i$ requires only 3 FOs. Combining all of the above, we establish that with an appropriate parametrization of the sparsity of the sub-sequence at which $\Delta_\ell = \sum_{k=1}^{\ell} \nabla f_k(\hat{x}_\ell)/\ell$, and an appropriate selection of the iteration count $T_i$, our first-order method CSVRG requires $\mathcal{O}(n \log n/\epsilon^{1/3})$ FOs overall.
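To make the two update rules and the estimator concrete, the following minimal NumPy sketch spells them out. The function names, the `grads` list of per-function gradient oracles and the 0-based indexing are illustrative conventions of the sketch, not notation from the paper; the code only mirrors the formulas above.

```python
import numpy as np

def csvrg_estimator(i, x_t, x_hat_prev, delta_prev, grads, rng):
    """One draw of the stage-i estimator

        Delta_i^t = (1 - 1/i) * (grad f_u(x_t) - grad f_u(x_hat_prev) + Delta_{i-1})
                    + (1/i) * grad f_i(x_t),      u ~ Unif(1, ..., i-1).

    grads[j](x) is assumed to return the gradient of f_{j+1} at x, so each call
    below costs exactly 3 first-order oracles (FOs)."""
    u = rng.integers(0, i - 1)                      # uniform over {1, ..., i-1} (0-based)
    correction = grads[u](x_t) - grads[u](x_hat_prev) + delta_prev
    return (1.0 - 1.0 / i) * correction + (1.0 / i) * grads[i - 1](x_t)

def cheap_delta_update(i, delta_prev, grad_fi_at_x_hat_prev):
    """The 1-FO running update used in most stages:
       Delta_i = (1 - 1/i) * Delta_{i-1} + (1/i) * grad f_i(x_hat_prev)."""
    return (1.0 - 1.0 / i) * delta_prev + (1.0 / i) * grad_fi_at_x_hat_prev
```

The cheap update keeps $\Delta_i$ equal to the running average of $\nabla f_1(\hat{x}_{\mathrm{prev}}), \dots, \nabla f_i(\hat{x}_{\mathrm{prev}})$, which is exactly the property (Lemma 4 of Section 5) that makes the estimator unbiased.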
1.1 RELATED WORK

Apart from the close relation of our work to the long line of research on variance reduction methods (see also Appendix A for a detailed discussion), the motivation of our work also relates to the line of research on incremental learning, the goal of which is adapting a model to new information, data or tasks without forgetting the ones it was trained on earlier. The phenomenon of losing sight of old information is called catastrophic forgetting Castro et al. (2018); Goodfellow et al. (2014); Kirkpatrick et al. (2017); McCloskey and Cohen (1989); Mermillod et al. (2013), and it is one of the main challenges of incremental learning. There have been three main empirical approaches to tackling catastrophic forgetting: regularization-based Nguyen et al. (2017a), memory-based Tulving (1985) and architecture-based Yoon et al. (2017), as well as combinations of the above Sodhani et al. (2020).

2 PRELIMINARIES

In this section we introduce some basic definitions and notation. We denote by $\mathrm{Unif}(1, \dots, n)$ the uniform distribution over $\{1, \dots, n\}$ and $[n] := \{1, \dots, n\}$.

Definition 1 (Strong Convexity). A differentiable function $f : D \to \mathbb{R}$ is $\mu$-strongly convex on $D$ if and only if for all $x, y \in D$,
$$f(x) \geq f(y) + \nabla f(y)^\top (x - y) + \frac{\mu}{2}\|x - y\|^2. \quad (3)$$

In Problem (2) we make the assumption that $D$ is convex and compact. We denote by $|D|$ the diameter of $D$, $|D| = \max_{x, y \in D} \|x - y\|_2$. The compactness of $D$ also provides us with the property that each $f_i : D \to \mathbb{R}$ is $G$-Lipschitz and $L$-smooth. More precisely,
$$|f(x) - f(y)| \leq G\|x - y\| \quad (G\text{-Lipschitz}), \qquad \|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\| \quad (L\text{-smooth}),$$
where $G = \max_{x \in D} \|\nabla f(x)\|_2$ and $L = \max_{x \in D} \|\nabla^2 f(x)\|_2$. To simplify notation we denote by $g_i(x)$ the prefix function of stage $i \in [n]$,
$$g_i(x) := \sum_{j=1}^{i} f_j(x)/i. \quad (4)$$
We denote by $x^\star_i \in D$ the minimizer of the function $g_i(x)$, $x^\star_i := \arg\min_{x \in D} g_i(x)$. We denote the projection of $x \in \mathbb{R}^d$ onto the set $D$ as $\Pi_D(x) := \arg\min_{z \in D} \|x - z\|_2$. For a set $S \subseteq \mathbb{R}^d$, we denote $\Pi_D(S) := \{\Pi_D(x) : x \in S\}$. Throughout the paper we assume that each function $f_i(\cdot)$ in continual finite-sum minimization (2) is $\mu$-strongly convex, $L$-smooth and $G$-Lipschitz.

3 OUR RESULTS

We first state our main result establishing the existence of a first-order method for continual finite-sum minimization (2) with $\mathcal{O}(n/\epsilon^{1/3} + \log n/\sqrt{\epsilon})$ FO complexity.

Theorem 1. There exists a first-order method, CSVRG (Algorithm 1), for continual finite-sum minimization (2) with $\mathcal{O}\left(\frac{L^{2/3} G^{2/3}}{\mu} \frac{n \log n}{\epsilon^{1/3}} + \frac{L^2 G}{\mu^{5/2}} \frac{\log n}{\sqrt{\epsilon}}\right)$ FO complexity.

As already mentioned, our method admits $\mathcal{O}(n/\epsilon^{1/3} + 1/\sqrt{\epsilon})$ first-order complexity, significantly improving upon the $\mathcal{O}(n/\epsilon)$ of SGD and the $\mathcal{O}(n^2 \log(1/\epsilon))$ of VR methods. The description of our method (Algorithm 1) as well as the main steps for establishing Theorem 1 are presented in Section 4. In Theorem 2 we present the formal guarantees of the naive baseline, a sparse variant of SGD for continual finite-sum minimization (2) with $\mathcal{O}(1/\epsilon^2)$ FO complexity.

Theorem 2. There exists a first-order method, SGD-sparse (Algorithm 4), for continual finite-sum minimization (2) with $\mathcal{O}\left(\frac{G^3 |D| \log n}{\mu \epsilon^2}\right)$ FO complexity.

Algorithm 4 as well as the proof of Theorem 2 are presented in Appendix I.

Lower Bounds for Natural First-Order Methods. In view of the above, the question that naturally arises is whether there exists a method for continual finite-sum minimization (2) with $\mathcal{O}(n^{2-\alpha} \log(1/\epsilon))$ FO complexity. Unfortunately, the latter goal cannot be met for a wide class of first-order methods, which we call natural.
The formal definition of a natural first-order method is presented in Section 3.1. To this end, we remark that methods such as GD, SGD, SVRG, SAGA, SARAH, Katyusha and CSVRG (Algorithm 1) lie in this class.

Theorem 3. For any $\alpha > 0$, there is no natural first-order method for Problem (2) with $\mathcal{O}(n^{2-\alpha} \log(1/\epsilon))$ FO complexity. Moreover, for any $\alpha < 1/4$, there is no natural first-order method for continual finite-sum minimization (2) with $\mathcal{O}(n/\epsilon^{\alpha})$ FO complexity.

Due to space limitations, the proof of Theorem 3 is presented in Appendix H.

Remark 2. The $\mathcal{O}(n/\epsilon^{1/3} + 1/\sqrt{\epsilon})$ FO complexity of CSVRG is close to the $\Omega(n/\epsilon^{1/4})$ lower bound.

3.1 NATURAL FIRST-ORDER METHODS

In this section we provide the formal definition of a natural first-order method, to which the lower bound stated in Theorem 3 applies. Before doing so, we present some definitions characterizing general first-order methods. At an intuitive level, a first-order method for continual finite-sum minimization (2) determines $\hat{x}_i \in D$ by only requiring first-order access to the functions $f_1, \dots, f_i$. Specifically, at each stage $i \in [n]$, a first-order method makes queries of the form $\{\nabla f_j(x) \text{ for } j \leq i\}$ and uses this information to determine $\hat{x}_i \in D$. This intuitive description is formalized in Definition 2.

Definition 2. Let $A$ be a first-order method for continual finite-sum minimization (2). $\hat{x}_0 \in D$ denotes the initial point of $A$ and $\hat{x}_i \in D$ denotes the output of $A$ at stage $i \in [n]$; $x^t_i \in D$ denotes the intermediate point of $A$ at round $t \geq 1$ of stage $i \in [n]$, and $T_i \in \mathbb{N}$ the number of iterations during stage $i \in [n]$. At each round $t \in [T_i]$ of stage $i \in [n]$, $A$ performs a set of first-order oracle queries denoted $Q^t_i$. Each first-order oracle query $q \in Q^t_i$ admits the form $q = (q_{\mathrm{value}}, q_{\mathrm{index}})$, where $q_{\mathrm{value}} \in D$ denotes the queried point and $q_{\mathrm{index}} \leq i$ denotes the index of the queried function. $A$ computes the gradients $\{\nabla f_{q_{\mathrm{index}}}(q_{\mathrm{value}}) \text{ for all } q \in Q^t_i\}$ and uses this information to determine $x^{t+1}_i \in D$. The FO complexity of $A$ is thus $\sum_{i \in [n]} \sum_{t=1}^{T_i} |Q^t_i|$.

Example 1. Gradient Descent at stage $i \in [n]$ sets $x^0_i \leftarrow \hat{x}_{i-1}$ and for each round $t \in [T_i]$ performs $x^t_i \leftarrow \Pi_D\left[x^{t-1}_i - \gamma \sum_{j=1}^{i} \nabla f_j(x^{t-1}_i)/i\right]$. Finally it sets $\hat{x}_i \leftarrow x^{T_i}_i$. Thus, in terms of Definition 2, the first-order oracle queries of GD at round $t \leq T_i$ are $Q^t_i = \{(x^{t-1}_i, 1), \dots, (x^{t-1}_i, i)\}$.

In Definition 3 we formally define the notion of a natural first-order method. In simple terms, a first-order method is called natural if during each stage $i \in [n]$ it only computes FOs of the form $\nabla f_j(x)$ where $x \in D$ is a previously generated point and $j \leq i$.

Definition 3. A first-order method $A$ is called natural if and only if at each step $t \in [T_i]$ of stage $i \in [n]$:
- For any query $q \in Q^t_i$, $q_{\mathrm{index}} \leq i$ and $q_{\mathrm{value}} \in \mathrm{PP}^{t-1}_i$, where $\mathrm{PP}^{t-1}_i := \left(\bigcup_{m \leq i-1} \bigcup_{\tau \leq T_m} x^\tau_m\right) \cup \left(\bigcup_{\tau \leq t-1} x^\tau_i\right) \cup \hat{x}_0$ ($\mathrm{PP}^{t-1}_i$ stands for previous points).
- $x^t_i \in \Pi_D(S)$, where $S$ is the linear span of $\bigcup_{q \in Q^t_i} \nabla f_{q_{\mathrm{index}}}(q_{\mathrm{value}}) \cup \mathrm{PP}^{t-1}_i$.
- $\hat{x}_i \in \Pi_D(S)$, where $S$ is the linear span of $\mathrm{PP}^{T_i}_i$.

Remark 3. Definition 3 also captures randomized first-order methods such as SGD by considering $Q^t_i$ to be a random set.

Natural first-order methods are the straightforward extension of linear-span methods Bubeck (2015); Nesterov (2014) to the context of continual finite-sum minimization (2). Linear-span methods require that for any first-order oracle $\nabla f(x)$ computed at iteration $t$, $x \in \{x_0, x_1, \dots, x_{t-1}\}$. Moreover, $x_t$ is required to lie in the linear span of the previous points and the computed gradients. Linear-span methods capture all natural first-order optimization methods and have been used to establish the well-known $\Omega(1/\sqrt{\epsilon})$ and $\Omega\left(\sqrt{L/\mu}\,\log(1/\epsilon)\right)$ lower bounds for the convex and the strongly convex case respectively (see Section 3.5 in Bubeck (2015)).
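As a small worked illustration of the FO accounting in Definition 2 and Example 1, the sketch below runs projected GD on the stage-$i$ prefix objective while materializing its query sets $Q^t_i$; the function names and the Euclidean-ball projection are simplifications of the sketch, not part of the paper. Summing its FO count over all stages $i = 1, \dots, n$ gives $\sum_i T \cdot i = \Theta(T n^2)$, the quadratic bottleneck that motivates CSVRG.

```python
import numpy as np

def gd_stage(i, x_init, grads, gamma, T, radius=10.0):
    """Projected gradient descent at stage i with its first-order queries made explicit,
    as in Example 1: Q_i^t = {(x_i^{t-1}, 1), ..., (x_i^{t-1}, i)}, i.e. i FOs per round.
    grads[j](x) is assumed to return the gradient of f_{j+1} at x."""
    proj = lambda x: x if np.linalg.norm(x) <= radius else radius * x / np.linalg.norm(x)
    x, fo_count = x_init.copy(), 0
    for _ in range(T):
        queries = [(x.copy(), j) for j in range(1, i + 1)]      # the query set Q_i^t
        g = sum(grads[j - 1](q) for q, j in queries) / i        # average prefix gradient
        fo_count += len(queries)                                # |Q_i^t| = i
        x = proj(x - gamma * g)
    return x, fo_count
```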
4 CSVRG AND CONVERGENCE RESULTS

In this section we present our first-order method CSVRG, which achieves the guarantees established in Theorem 1. CSVRG is formally described in Algorithm 1.

Algorithm 1 CSVRG
1: $\hat{x}_0 \in D$, $\mathrm{prev} \leftarrow 0$, $\mathrm{update} \leftarrow \mathrm{false}$
2: $\hat{x}_1 \leftarrow \mathrm{GradientDescent}(\hat{x}_0)$, $\Delta_1 \leftarrow \nabla f_1(\hat{x}_1)$
3: for each stage $i = 2, \dots, n$ do
4:   if $i - \mathrm{prev} \geq \alpha \cdot i$ then
5:     $\Delta_{i-1} \leftarrow \frac{1}{i-1} \sum_{j=1}^{i-1} \nabla f_j(\hat{x}_{i-1})$   ▷ compute $\nabla g_{i-1}(\hat{x}_{i-1})$ with $i-1$ FOs
6:     $\mathrm{prev} \leftarrow i-1$
7:     $\mathrm{update} \leftarrow \mathrm{true}$
8:   end if
9:   $T_i \leftarrow \mathcal{O}\left(\frac{L^2 G}{\mu^{5/2} i \sqrt{\epsilon}} + \frac{L^2 G^2 \alpha^2}{\mu^2}\right)$   ▷ number of iterations of Algorithm 2 at stage $i \in [n]$
10:  $\hat{x}_i \leftarrow \mathrm{FUM}(\mathrm{prev}, \Delta_{i-1}, T_i)$   ▷ selection of $\hat{x}_i \in D$ by Algorithm 2
11:  if $\mathrm{update}$ then
12:    $\Delta_i \leftarrow \frac{1}{i} \sum_{j=1}^{i} \nabla f_j(\hat{x}_i)$   ▷ compute $\nabla g_i(\hat{x}_i)$ with $i$ FOs
13:    $\mathrm{prev} \leftarrow i$
14:    $\mathrm{update} \leftarrow \mathrm{false}$
15:  else
16:    $\Delta_i \leftarrow \left(1 - \frac{1}{i}\right)\Delta_{i-1} + \frac{1}{i}\nabla f_i(\hat{x}_{\mathrm{prev}})$   ▷ update $\Delta_i$ with 1 FO
17:  end if
18: end for

Algorithm 2 Frequent Update Method (FUM), called as $\mathrm{FUM}(\mathrm{prev}, \Delta_{i-1}, T_i)$
1: $\beta \leftarrow \frac{72 L^2}{\mu^2}$, $Z \leftarrow \frac{T_i(T_i-1)}{2} + (T_i+1)(\beta-1)$
2: $x^0_i \leftarrow \hat{x}_{i-1}$   ▷ initialization at $\hat{x}_{i-1} \in D$
3: for each round $t := 0, 1, \dots, T_i - 1$ do
4:   select $u_t \sim \mathrm{Unif}(1, \dots, i-1)$
5:   $\Delta^t_i \leftarrow \left(1 - \frac{1}{i}\right)\left(\nabla f_{u_t}(x^t_i) - \nabla f_{u_t}(\hat{x}_{\mathrm{prev}}) + \Delta_{i-1}\right) + \frac{1}{i}\nabla f_i(x^t_i)$   ▷ 3 FOs
6:   $\gamma_t \leftarrow 4/(\mu(t+\beta))$   ▷ step-size selection
7:   $x^{t+1}_i \leftarrow \Pi_D\left(x^t_i - \gamma_t \Delta^t_i\right)$   ▷ update $x^t_i \in D$
8: end for
9: Output: $\hat{x}_i \leftarrow \frac{1}{Z}\sum_{t=0}^{T_i-1}(t+\beta-1)\, x^{t+1}_i$   ▷ final output

We first remark that the computation of the output $\hat{x}_i \in D$ for each stage $i \in [n]$ is performed at Step 10 of Algorithm 1 by calling Algorithm 2. Algorithm 2 then initializes $x^0_i := \hat{x}_{i-1}$ and performs $T_i$ stochastic gradient steps using the estimator
$$\Delta^t_i = \left(1 - \tfrac{1}{i}\right)\left(\nabla f_{u_t}(x^t_i) - \nabla f_{u_t}(\hat{x}_{\mathrm{prev}}) + \Delta_{i-1}\right) + \tfrac{1}{i}\nabla f_i(x^t_i).$$
To this end we remark that computing $\Delta^t_i$ requires just 3 FOs. Before explaining the specific selection of the above estimator, we start by explaining the role of $\Delta_i$.

The role of $\Delta_i$: In order to keep the variance of the estimator $\Delta^t_i$ low, we would ideally want $\Delta_{i-1} := \nabla g_{i-1}(\hat{x}_{i-1})$. The problem is that computing $\nabla g_i(\hat{x}_i)$ requires $i$ FOs, meaning that performing such a computation at each stage $i \in [n]$ would directly result in $\Omega(n^2)$ FOs. In order to overcome this challenge, a full gradient $\nabla g_i(x)$ is computed only very sparsely by Algorithm 1 across the $n$ stages. Specifically,
- in case $i - \mathrm{prev} \geq \alpha \cdot i$, then $\Delta_i := \sum_{j=1}^{i} \nabla f_j(\hat{x}_i)/i$ ($i$ FOs);
- in case $i - \mathrm{prev} < \alpha \cdot i$, then $\Delta_i := \left(1 - \frac{1}{i}\right)\Delta_{i-1} + \frac{1}{i}\nabla f_i(\hat{x}_{\mathrm{prev}})$ (1 FO).

Thus the parameter $\alpha > 0$ controls the number of times Algorithm 1 reaches Steps 5 and 12, which at stage $i \in [n]$ require $i$ FOs. The latter is formally stated and established in Lemma 1.

Lemma 1. Over a sequence of $n$ stages, Algorithm 1 reaches Steps 5 and 12 at most $\log n/\alpha$ times.

Combining the latter with the fact that Algorithm 2 requires $3T_i$ FOs at each stage $i \in [n]$, and the fact that Steps 5 and 12 require at most $n$ FOs each, we get:

Corollary 1. Over a sequence of $n$ stages, Algorithm 1 requires $3\sum_{i=1}^{n} T_i + 2n\log n/\alpha$ FOs.

Up next we present Theorem 4, establishing that for a specific selection of the parameter $\alpha > 0$, Algorithm 1 guarantees that each $\hat{x}_i \in D$ is an $\epsilon$-optimal point for the function $g_i(x)$.

Theorem 4. Let $D$ be a convex and compact set and $f_1, \dots, f_n$ a sequence of $\mu$-strongly convex functions with $f_i : D \to \mathbb{R}$. Then Algorithm 1, with $T_i = 720 G L^2/(\mu^{5/2} i \sqrt{\epsilon}) + 9 L^{2/3} G^{2/3}/(\epsilon^{1/3}\mu) + 864 L^2/\mu^2$ and $\alpha = \mu \epsilon^{1/3}/(20 G^{2/3} L^{2/3})$, guarantees $\mathbb{E}[g_i(\hat{x}_i)] - g_i(x^\star_i) \leq \epsilon$ for each stage $i \in [n]$, where $\hat{x}_i \in D$ is the output of Algorithm 2 (Step 9) called at Step 10 of Algorithm 1.

The proof of Theorem 1 follows directly from Theorem 4 and Corollary 1; for completeness, it is presented in Appendix G.
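For readers who prefer code, the following is a minimal NumPy sketch of how Algorithms 1 and 2 fit together. It is a simplified rendering rather than the authors' implementation: $D$ is taken to be a Euclidean ball of radius `radius`, a single constant inner-iteration count `T` replaces the stage-dependent $T_i$ of Theorem 4, stage 1 is solved by a fixed number of plain gradient steps, and all names (`csvrg`, `grad_oracles`, ...) are illustrative.

```python
import numpy as np

def csvrg(grad_oracles, d, T=100, alpha=0.3, beta=10.0, mu=1.0, radius=10.0, seed=0):
    """Sketch of CSVRG (Algorithm 1) with the Frequent Update Method (Algorithm 2).

    grad_oracles : list of callables, grad_oracles[j](x) = grad f_{j+1}(x)  (1 FO per call)
    T            : inner FUM rounds per stage (constant here; Theorem 4 uses a stage-dependent T_i)
    alpha        : sparsity parameter of Step 4
    beta, mu     : step-size parameters of Algorithm 2 (beta = 72 L^2 / mu^2 in the paper)
    """
    rng = np.random.default_rng(seed)
    n = len(grad_oracles)
    proj = lambda x: x if np.linalg.norm(x) <= radius else radius * x / np.linalg.norm(x)
    full_grad = lambda i, x: sum(g(x) for g in grad_oracles[:i]) / i   # grad g_i(x): i FOs

    # Stage 1: plain projected gradient descent on f_1, then Delta_1 = grad f_1(x_hat_1).
    x_hat = [np.zeros(d)]
    x = x_hat[0]
    for _ in range(T):
        x = proj(x - (1.0 / beta) * grad_oracles[0](x))
    x_hat.append(x)
    delta, prev, update = grad_oracles[0](x), 1, False

    for i in range(2, n + 1):
        if i - prev >= alpha * i:                      # Step 4: sparse full refresh
            delta = full_grad(i - 1, x_hat[i - 1])     # i-1 FOs
            prev, update = i - 1, True
        # --- Algorithm 2 (FUM): T projected stochastic steps starting from x_hat[i-1] ---
        x, num, Z = x_hat[i - 1], np.zeros(d), 0.0
        for t in range(T):
            u = rng.integers(0, i - 1)                 # uniform over f_1, ..., f_{i-1}
            est = (1 - 1 / i) * (grad_oracles[u](x) - grad_oracles[u](x_hat[prev]) + delta) \
                  + (1 / i) * grad_oracles[i - 1](x)   # 3 FOs
            x = proj(x - 4.0 / (mu * (t + beta)) * est)
            num += (t + beta - 1) * x                  # weighted average of the iterates
            Z += (t + beta - 1)
        x_hat.append(num / Z)
        # --- Steps 11-17: keep Delta in sync with the refresh schedule ---
        if update:
            delta = full_grad(i, x_hat[i])             # i FOs
            prev, update = i, False
        else:
            delta = (1 - 1 / i) * delta + (1 / i) * grad_oracles[i - 1](x_hat[prev])   # 1 FO
    return x_hat[1:]
```

For a quick test one can feed the sketch a stream of ridge-regression components in the spirit of Section 6, e.g. `grad_oracles = [lambda x, a=a, b=b: 2 * (a @ x - b) * a + 2 * lam * x for (a, b) in data]` with `lam` the regularization weight ($\lambda = 10^{-3}$ in Section 6).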
In the rest of the section we present the key steps for proving Theorem 4 (see Appendix F.2 for the full proof). We first explain the role of the gradient estimator $\Delta^t_i$ in Step 5 of Algorithm 2. This estimator may seem unintuitive at first sight, since it subtracts the term $\nabla f_{u_t}(\hat{x}_{\mathrm{prev}})$. Interestingly enough, this estimator selection permits us to establish the following two key properties for $\Delta^t_i$.

Lemma 2 (Unbiased). Let $\Delta^t_i$ be the gradient estimator used in Step 5 of Algorithm 2. Then for all $t \in [T_i]$, $\mathbb{E}[\Delta^t_i] = \nabla g_i(x^t_i)$.

Lemma 3 (Bounded Variance). Let $\Delta^t_i$ be the gradient estimator used in Step 5 of Algorithm 2. Then for all $t \in [T_i]$,
$$\mathbb{E}\left[\|\Delta^t_i - \nabla g_i(x^t_i)\|_2^2\right] \leq 8L^2\,\mathbb{E}\left[\|x^t_i - x^\star_i\|_2^2\right] + \frac{64 G^2 L^2}{\mu^2}\alpha^2 + 16L^2\,\mathbb{E}\left[\|x^\star_{\mathrm{prev}} - \hat{x}_{\mathrm{prev}}\|_2^2\right],$$
where $\alpha > 0$ is the parameter used at Step 4 of Algorithm 1.

The proofs of Lemma 2 and Lemma 3 are presented in Appendix C and Appendix D respectively, and constitute one of the main technical contributions of this work. In Section 5 we explain why the specific estimator selection $\Delta^t_i$ is crucial for establishing both Lemma 2 and Lemma 3.

In the rest of the section, we provide the main steps for establishing Theorem 4. Let us inductively assume that $\mathbb{E}[g_j(\hat{x}_j)] - g_j(x^\star_j) \leq \epsilon$ for all $j \leq i-1$. We use the latter to establish that $\mathbb{E}[g_i(\hat{x}_i) - g_i(x^\star_i)] \leq \epsilon$. Notice that Step 4 of Algorithm 1 guarantees that $\mathrm{prev} \leq i-1$ at Step 10. Hence the induction hypothesis ensures that $\mathbb{E}[g_{\mathrm{prev}}(\hat{x}_{\mathrm{prev}})] - g_{\mathrm{prev}}(x^\star_{\mathrm{prev}}) \leq \epsilon$, and thus by strong convexity $\mathbb{E}[\|\hat{x}_{\mathrm{prev}} - x^\star_{\mathrm{prev}}\|^2] \leq 2\epsilon/\mu$. Thus the variance bound of Lemma 3 becomes
$$\mathbb{E}\left[\|\Delta^t_i - \nabla g_i(x^t_i)\|_2^2\right] \leq \mathcal{O}\left(L^2\,\mathbb{E}\left[\|x^t_i - x^\star_i\|_2^2\right] + \frac{G^2 L^2}{\mu^2}\alpha^2 + \frac{L^2}{\mu}\epsilon\right). \quad (5)$$
Using the fact that $\mathbb{E}[\Delta^t_i] = \nabla g_i(x^t_i)$ and the variance bound of Equation (5), in Lemma 8 of Appendix E we establish that
$$\mathbb{E}[g_i(\hat{x}_i)] - g_i(x^\star_i) \leq \mathcal{O}\left(\frac{L^4}{\mu^3}\frac{\mathbb{E}\left[\|x^0_i - x^\star_i\|_2^2\right]}{T_i^2} + \frac{L^2 G^2 \alpha^2}{\mu^3 i^2 T_i} + \frac{L^2}{\mu^2}\frac{\epsilon}{T_i}\right).$$
Notice that at Step 2 of Algorithm 2, $x^0_i := \hat{x}_{i-1} \in D$, meaning that
$$\mathbb{E}[g_i(\hat{x}_i)] - g_i(x^\star_i) \leq \mathcal{O}\left(\frac{L^4 G^2}{\mu^5 i^2 T_i^2} + \frac{L^4 \epsilon}{\mu^4 T_i^2} + \frac{L^2 G^2 \alpha^2}{\mu^3 i^2 T_i} + \frac{L^2 \epsilon}{\mu^2 T_i}\right),$$
where the last inequality follows from $\|\hat{x}_{i-1} - x^\star_i\|_2^2 \leq 2\|\hat{x}_{i-1} - x^\star_{i-1}\|_2^2 + 2\|x^\star_{i-1} - x^\star_i\|_2^2$, $\|x^\star_i - x^\star_{i-1}\| \leq \mathcal{O}(G/(\mu i))$ and $\mathbb{E}[\|\hat{x}_{i-1} - x^\star_{i-1}\|^2] \leq 2\epsilon/\mu$. As a result, by taking $T_i = \mathcal{O}\left(\frac{G L^2}{\mu^{5/2} i \sqrt{\epsilon}} + \frac{L^2 G^2 \alpha^2}{\mu^3 i^2 \epsilon} + \frac{L^2}{\mu^2}\right)$ we get that $\mathbb{E}[g_i(\hat{x}_i)] - g_i(x^\star_i) \leq \epsilon$.

5 ANALYZING THE ESTIMATOR $\Delta^t_i$

In this section we provide the key steps for proving Lemma 2 and Lemma 3, respectively.

Unbiased estimator: The basic step in showing that $\Delta^t_i$ of Step 5 in Algorithm 2 is an unbiased estimator, $\mathbb{E}[\Delta^t_i] = \nabla g_i(x^t_i)$, is establishing the following property for $\Delta_{i-1}$ defined in Algorithm 1.

Lemma 4. At Step 10 of Algorithm 1, it holds that $\Delta_{i-1} = \sum_{k=1}^{i-1} \nabla f_k(\hat{x}_{\mathrm{prev}})/(i-1)$.

The proof of Lemma 4 is presented in Appendix C and admits an inductive proof on the steps of Algorithm 1. Once Lemma 4 is established, the fact that $\mathbb{E}[\Delta^t_i] = \nabla g_i(x^t_i)$ easily follows from the selection of the estimator (see Step 5 of Algorithm 2). For the full proof of Lemma 2 we refer the reader to Appendix C.

Bounding the variance of $\Delta^t_i$: The first step towards establishing Lemma 3 is Lemma 5, which provides a first bound on the variance $\mathbb{E}[\|\Delta^t_i - \nabla g_i(x^t_i)\|_2^2]$. The proof of Lemma 5 uses ideas derived from the analysis of VR methods and is presented in Appendix D.

Lemma 5. For all rounds $t \leq T_i$ of stage $i \in [n]$,
$$\mathbb{E}\left[\|\Delta^t_i - \nabla g_i(x^t_i)\|_2^2\right] \leq 8L^2\,\mathbb{E}\left[\|x^t_i - x^\star_i\|_2^2\right] + 8L^2\,\mathbb{E}\left[\|x^\star_i - \hat{x}_{\mathrm{prev}}\|_2^2\right], \quad (6)$$
where $\mathrm{prev}$ is defined in Algorithm 1.
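The following toy Monte-Carlo check, on random strongly convex quadratics constructed purely for illustration, exercises Lemma 2 and the flavour of Lemma 5: once $\Delta_{i-1}$ equals the average of $\nabla f_1(\hat{x}_{\mathrm{prev}}), \dots, \nabla f_{i-1}(\hat{x}_{\mathrm{prev}})$ (the property guaranteed by Lemma 4), the sample mean of the estimator matches $\nabla g_i(x^t_i)$ up to Monte-Carlo noise, while its second moment stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
d, i = 5, 8                                     # toy dimension and current stage
Q, b = [], []
for _ in range(i):
    a = rng.standard_normal((d, d))
    Q.append(a.T @ a + np.eye(d))               # each f_j(x) = 0.5 x^T Q_j x + b_j^T x is strongly convex
    b.append(rng.standard_normal(d))
grad = lambda j, x: Q[j] @ x + b[j]             # gradient of f_{j+1} at x

x_t, x_hat_prev = rng.standard_normal(d), rng.standard_normal(d)
delta_prev = np.mean([grad(j, x_hat_prev) for j in range(i - 1)], axis=0)   # property of Lemma 4
grad_gi = np.mean([grad(j, x_t) for j in range(i)], axis=0)                 # grad g_i(x_t)

samples = []
for _ in range(100_000):
    u = rng.integers(0, i - 1)
    est = (1 - 1 / i) * (grad(u, x_t) - grad(u, x_hat_prev) + delta_prev) + (1 / i) * grad(i - 1, x_t)
    samples.append(est)
err = np.array(samples) - grad_gi
print(np.linalg.norm(err.mean(axis=0)))         # ~ 0: the estimator is unbiased (Lemma 2)
print((err ** 2).sum(axis=1).mean())            # its variance, bounded as in Lemmas 3 and 5
```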
The next step for establishing Lemma 3 is handling the term $\mathbb{E}[\|x^\star_i - \hat{x}_{\mathrm{prev}}\|_2^2]$ appearing in Equation (6). In order to do the latter, we exploit the strong convexity assumption so as to establish Lemma 6, the proof of which lies in Appendix B.

Lemma 6. Let $\hat{x}_j \in D$ be the output of Algorithm 1 at stage $j \in [n]$. Then for all $i \in \{j+1, \dots, n\}$,
$$\|\hat{x}_j - x^\star_i\|_2^2 \leq \frac{8}{\mu^2}\left(\frac{G(i-j)}{2j + (i-j)}\right)^2 + 2\|\hat{x}_j - x^\star_j\|_2^2.$$

Now, by applying Lemma 6 for $j := \mathrm{prev}$, we get that
$$\|x^\star_i - \hat{x}_{\mathrm{prev}}\|_2^2 \leq \frac{8}{\mu^2}\left(\frac{G(i-\mathrm{prev})}{2\,\mathrm{prev} + (i-\mathrm{prev})}\right)^2 + 2\|\hat{x}_{\mathrm{prev}} - x^\star_{\mathrm{prev}}\|_2^2.$$
Steps 4 and 6 of Algorithm 1 ensure that at Step 10 of Algorithm 1 (when Algorithm 2 is called), $i - \mathrm{prev} \leq \alpha \cdot i$. Thus,
$$\|x^\star_i - \hat{x}_{\mathrm{prev}}\|_2^2 \leq \frac{8}{\mu^2}\left(\frac{G\alpha i}{\mathrm{prev} + i}\right)^2 + 2\|\hat{x}_{\mathrm{prev}} - x^\star_{\mathrm{prev}}\|_2^2 \leq \frac{32 G^2}{\mu^2}\alpha^2 + 2\|\hat{x}_{\mathrm{prev}} - x^\star_{\mathrm{prev}}\|_2^2,$$
and thus $\mathbb{E}[\|x^\star_i - \hat{x}_{\mathrm{prev}}\|_2^2] \leq \frac{32 G^2}{\mu^2}\alpha^2 + 2\,\mathbb{E}[\|\hat{x}_{\mathrm{prev}} - x^\star_{\mathrm{prev}}\|_2^2]$. Combining the latter with Equation (6) we overall get
$$\mathbb{E}\left[\|\Delta^t_i - \nabla g_i(x^t_i)\|_2^2\right] \leq \mathcal{O}\left(L^2\,\mathbb{E}\left[\|x^t_i - x^\star_i\|_2^2\right] + \frac{G^2 L^2}{\mu^2}\alpha^2 + L^2\,\mathbb{E}\left[\|\hat{x}_{\mathrm{prev}} - x^\star_{\mathrm{prev}}\|_2^2\right]\right).$$

Table 2: Optimality gap as the stages progress on a ridge regression problem (averaged over 10 independent runs). CSVRG performs exactly the same number of FOs as SGD and slightly fewer than SGD-sparse. Katyusha and SVRG perform exactly the same number of FOs. CSVRG/SGD/SGD-sparse perform roughly 4% of the FOs of Katyusha/SVRG.

6 EXPERIMENTS

We experimentally evaluate the methods (SGD, SGD-sparse, Katyusha, SVRG and CSVRG) on a ridge regression task. Given a dataset $(a_i, b_i)_{i=1}^{n} \subseteq \mathbb{R}^d \times \mathbb{R}$, at each stage $i \in [n]$ we consider the finite-sum objective $g_i(x) := \sum_{j=1}^{i}(a_j^\top x - b_j)^2/i + \lambda\|x\|_2^2$ with $\lambda = 10^{-3}$. We choose this setting so as to be able to compute the exact optimal solution at each stage $i \in [n]$. For our experiments we use datasets from the LIBSVM package Chang and Lin (2011), for which we report our results. At each stage $i \in [n]$, we reveal a new data point $(a_i, b_i)$. In all of our experiments we run CSVRG with $\alpha = 0.3$ and $T_i = 100$. The inner iterations of SGD and SGD-sparse are selected so that their overall FO calls match the FO calls of CSVRG. At each stage $i \in [n]$ we run Katyusha Allen-Zhu (2017) and SVRG Johnson and Zhang (2013) on the prefix-sum function $g_i(x)$ with 10 outer iterations and 100 inner iterations. In Table 2 we present the error achieved by each method at each stage $i \in [n]$ and the overall number of FOs up to the respective stage. As our experimental evaluation reveals, CSVRG nicely interpolates between SGD/SGD-sparse and Katyusha/SVRG. More precisely, CSVRG achieves a significantly smaller suboptimality gap than SGD/SGD-sparse with the same number of FOs, while it achieves a gap comparable to Katyusha/SVRG with far fewer FOs. In Appendix K.2 we present further experimental evaluations as well as the exact parameters used for each method.

7 CONCLUSIONS

In this work we introduce continual finite-sum minimization, a twist on standard finite-sum minimization that, given a sequence of functions, asks for a sequence of $\epsilon$-accurate solutions for the respective prefix-sum functions. For the strongly convex case we propose a first-order method with $\mathcal{O}(n/\epsilon^{1/3})$ FO complexity. This significantly improves upon the $\mathcal{O}(n/\epsilon)$ FOs that Stochastic Gradient Descent requires and upon the $\mathcal{O}(n^2\log(1/\epsilon))$ FOs that VR methods require. We additionally prove that no natural method can achieve $\mathcal{O}(n^{2-\alpha}\log(1/\epsilon))$ FO complexity for any $\alpha > 0$, or $\mathcal{O}(n/\epsilon^{\alpha})$ FO complexity for $\alpha < 1/4$.

8 ACKNOWLEDGEMENTS

The authors acknowledge the constructive feedback of the reviewers and the work of the ICLR 2024 program and area chairs.
This work was supported by Hasler Foundation Program: Hasler Responsible AI (project number 21043), by the Swiss National Science Foundation (SNSF) under grant number 200021_20501, by the Army Research Office (Grant Number W911NF-24-1-0048) and by Innosuisse (contract agreement 100.960 IP-ICT). Limitations: The theoretical guarantees of our proposed method are restricted to the strongly convex case. Extending our results beyond the strong convexity assumptions is a very promising research direction. Reproducibility Statement In the appendix we have included the formal proofs of all the theorems and lemmas provided in the main part of the paper. In order to facilitate the reproducibility of our theoretical results, for each theorem we have created a separate self-contained section presenting its proof. Concerning the experimental evaluations of our work, we provide the code used in our experiments as well as the selected parameters in each of the presented methods. Ethics Statement The authors acknowledge that they have read and adhere to the ICLR Code of Ethics. Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200 1205, 2017. Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS 2018, December 3-8, 2018, Montréal, Canada, pages 2680 2691, 2018a. Zeyuan Allen-Zhu. Katyusha X: practical momentum method for stochastic sum-of-nonconvex optimization. Co RR, abs/1802.03866, 2018b. URL http://arxiv.org/abs/1802.03866. Zeyuan Allen-Zhu and Yang Yuan. Improved svrg for non-strongly-convex or sum-of-non-convex objectives. In International conference on machine learning, pages 1080 1089. PMLR, 2016. Doron Blatt, Alfred O Hero, and Hillel Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29 51, 2007. Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, pages 161 168. Curran Associates, Inc., 2007. Stephen P Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004. Sébastien Bubeck. Convex optimization: Algorithms and complexity, 2015. Francisco M Castro, Manuel J Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In Proceedings of the European conference on computer vision (ECCV), pages 233 248, 2018. Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):1 27, 2011. Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 1646 1654, 2014. Published as a conference paper at ICLR 2024 Bernard Delyon and Anatoli Juditsky. 
Accelerated stochastic approximation. SIAM Journal on Optimization, 3(4):868 881, 1993. Benjamin Dubois-Taine, Sharan Vaswani, Reza Babanezhad, Mark Schmidt, and Simon Lacoste Julien. SVRG meets adagrad: painless variance reduction. Mach. Learn., 111(12):4359 4409, 2022. Benjamin Dubois-Taine, Sharan Vaswani, Reza Babanezhad, Mark Schmidt, and Simon Lacoste Julien. Svrg meets adagrad: painless variance reduction. Machine Learning, pages 1 51, 2022. Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS 2018, December 3-8, 2018, Montréal, Canada, pages 687 697, 2018a. Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Advances in Neural Information Processing Systems, 31, 2018b. Ian J. Goodfellow, Mehdi Mirza, Xia Da, Aaron C. Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgeting in gradient-based neural networks. In Yoshua Bengio and Yann Le Cun, editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. Elad Hazan. Introduction to online convex optimization, 2023. Michael Hersche, Geethan Karunaratne, Giovanni Cherubini, Luca Benini, Abu Sebastian, and Abbas Rahimi. Constrained few-shot class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9057 9067, 2022. Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in neural information processing systems, 26, 2013. Ali Kavis, Stratis Skoulakis, Kimon Antonakopoulos, Leello Tadesse Dadi, and Volkan Cevher. Adaptive stochastic variance reduction for non-convex finite-sum minimization. In Neur IPS, 2022. Harry Kesten. Accelerated stochastic approximation. The Annals of Mathematical Statistics, pages 41 59, 1958. Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, pages 462 466, 1952. James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114 (13):3521 3526, 2017. Suhas Kowshik, Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli. Near-optimal offline and streaming algorithms for learning non-linear dynamical systems. Advances in Neural Information Processing Systems, 34:8518 8531, 2021a. Suhas Kowshik, Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli. Streaming linear system identification with reverse experience replay. Advances in Neural Information Processing Systems, 34:30140 30152, 2021b. Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. Mathematical programming, 171:167 215, 2018. Guanghui Lan, Zhize Li, and Yi Zhou. A unified variance-reduced accelerated gradient method for convex optimization. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alché-Buc, Emily B. 
Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 10462 10472, 2019a. Published as a conference paper at ICLR 2024 Guanghui Lan, Zhize Li, and Yi Zhou. A unified variance-reduced accelerated gradient method for convex optimization. Advances in Neural Information Processing Systems, 32, 2019b. Zhize Li and Jian Li. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS 2018, December 3-8, 2018, Montréal, Canada, pages 5569 5579, 2018. Zhize Li and Peter Richtárik. Zerosarah: Efficient nonconvex finite-sum optimization with zero full gradient computation. Co RR, abs/2103.01447, 2021. Zhize Li, Hongyan Bao, Xiangliang Zhang, and Peter Richtárik. PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 6286 6295. PMLR, 2021. Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. Catalyst acceleration for first-order convex optimization: from theory to practice. Journal of Machine Learning Research, 18(1):7854 7907, 2018. Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM Journal on Optimization, 25(4):2244 2273, 2015. David Lopez-Paz and Marc Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in neural information processing systems, 30, 2017. Michael Mc Closkey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109 165. Elsevier, 1989. Martial Mermillod, Aurélia Bugaiska, and Patrick Bonin. The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects, 2013. Aryan Mokhtari, Mert Gürbüzbalaban, and Alejandro Ribeiro. A double incremental aggregated gradient method with linear convergence rate for large-scale optimization. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017, pages 4696 4700. IEEE, 2017. Dheeraj Nagaraj, Xian Wu, Guy Bresler, Prateek Jain, and Praneeth Netrapalli. Least squares regression with markovian data: Fundamental limits and algorithms. Advances in neural information processing systems, 33:16666 16676, 2020. Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1 edition, 2014. ISBN 1461346916. Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning. ar Xiv preprint ar Xiv:1710.10628, 2017a. Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takác. SARAH: A novel method for machine learning problems using stochastic recursive gradient. 
In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 2613 2621. PMLR, 2017b. Nhan H. Pham, Lam M. Nguyen, Dzung T. Phan, and Quoc Tran-Dinh. Proxsarah: An efficient algorithmic framework for stochastic composite nonconvex optimization. J. Mach. Learn. Res., 21: 110:1 110:48, 2020. Published as a conference paper at ICLR 2024 Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alexander J. Smola. Stochastic variance reduction for nonconvex optimization. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 314 323. JMLR.org, 2016. Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400 407, 1951. Amir Rosenfeld and John K Tsotsos. Incremental learning through deep adaptation. IEEE transactions on pattern analysis and machine intelligence, 42(3):651 663, 2018. Nicolas Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence _rate for finite training sets. Advances in neural information processing systems, 25, 2012a. Nicolas Le Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, pages 2672 2680, 2012b. Paul Ruvolo and Eric Eaton. Ella: An efficient lifelong learning algorithm. In International conference on machine learning, pages 507 515. PMLR, 2013. Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. Advances in Neural Information Processing Systems, 26, 2013a. Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(1), 2013b. Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. J. Mach. Learn. Res., 11:2635 2670, 2010. Shagun Sodhani, Sarath Chandar, and Yoshua Bengio. Toward training recurrent neural networks for lifelong learning. Neural computation, 32(1):1 35, 2020. Shagun Sodhani, Mojtaba Faramarzi, Sanket Vaibhav Mehta, Pranshu Malviya, Mohamed Abdelsalam, Janarthanan Janarthanan, and Sarath Chandar. An introduction to lifelong supervised learning. ar Xiv preprint ar Xiv:2207.04354, 2022. Chaobing Song, Yong Jiang, and Yi Ma. Variance reduction via accelerated dual averaging for finite-sum optimization. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020a. Chaobing Song, Yong Jiang, and Yi Ma. Variance reduction via accelerated dual averaging for finite-sum optimization. Advances in Neural Information Processing Systems, 33:833 844, 2020b. Endel Tulving. How many memory systems are there? 
American psychologist, 40(4):385, 1985. Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. Spiderboost and momentum: Faster variance reduction algorithms. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 2403 2413, 2019. Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim., 24(4):2057 2075, 2014. Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. ar Xiv preprint ar Xiv:1708.01547, 2017. Published as a conference paper at ICLR 2024 Junyu Zhang and Lin Xiao. A stochastic composite gradient method with incremental variance reduction. Advances in Neural Information Processing Systems, 32, 2019. Lijun Zhang, Mehrdad Mahdavi, and Rong Jin. Linear convergence with condition number independent access of full gradients. Advances in Neural Information Processing Systems, 26, 2013. Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS 2018, December 3-8, 2018, Montréal, Canada, pages 3925 3936, 2018. Published as a conference paper at ICLR 2024 A FURTHER RELATED WORK In this section we present more extensively all work related to ours. This work lies in the intersection of convex finite sum minimization and incremental learning. Finite Sum Minimization: As already mentioned, our work is closely related with the long line of research on variance reduction methods for strongly convex Nguyen et al. (2017b); Defazio et al. (2014); Allen-Zhu (2018b); Allen-Zhu (2017); Roux et al. (2012b); Xiao and Zhang (2014); Mokhtari et al. (2017); Johnson and Zhang (2013); Allen-Zhu (2017); Allen-Zhu (2018b); Zhang et al. (2013); Shalev-Shwartz and Zhang (2013a); Nguyen et al. (2017b); Defazio et al. (2014); Allen-Zhu (2018b); Allen-Zhu (2017); Roux et al. (2012b); Xiao and Zhang (2014), convex Lan et al. (2019b); Dubois Taine et al. (2022); Song et al. (2020a); Lan et al. (2019a); Song et al. (2020b); Lin et al. (2018;?); Delyon and Juditsky (1993); Lin et al. (2015); Shalev-Shwartz and Zhang (2013b); Dubois-Taine et al. (2022) and non-convex Allen-Zhu and Yuan (2016); Allen-Zhu (2018a); Wang et al. (2019); Pham et al. (2020); Li and Richtárik (2021); Li et al. (2021); Fang et al. (2018a); Zhou et al. (2018); Reddi et al. (2016); Li and Li (2018); Kavis et al. (2022) finite-sum minimization (see also Appendix A for a more detailed discussion). As already mentioned the basic difference with our work comes from the fact these work concern the standard finite-sum setting while ours concerns its continual counter-part. Our work leverages two components of variance reduction techniques. 
Recent works in variance reduction such as the SVRG algorithm Johnson and Zhang (2013) and the Katyusha algorithm Allen Zhu (2017); Allen-Zhu (2018b) primarily leverage a full gradient computation, which is essentially used as a substitute for the gradients that are not computed in this iteration, in order to reduce the variance of their gradient estimator. Older methods, which seek to accelerate SGD, such as the RPDG method Lan and Zhou (2018) and the SAG method Roux et al. (2012a), which is a randomized variant of the IAG method Blatt et al. (2007), use a stochastic average of the current gradient that it sampled and substitutes the rest of the gradients with ones computed from previous iterations. Our method seeks to utilize the first technique, but the complexity constraint does not allow for this tool. We mitigate this problem using a technique similar to the incremental gradient methods. There are also several other methods and techniques that relate to accelerated stochastic gradient methods, both in the convex setting Song et al. (2020b); Allen-Zhu and Yuan (2016); Lan et al. (2019b); Lin et al. (2018); Kesten (1958); Delyon and Juditsky (1993); Zhang et al. (2013); Lin et al. (2015); Shalev-Shwartz and Zhang (2013b;a); Dubois-Taine et al. (2022) as well as in the non-convex setting Zhang and Xiao (2019); Fang et al. (2018b). In the past there have also been several works that focus on finite sum minimization in dynamic settings using SGD Kowshik et al. (2021b;a); Nagaraj et al. (2020) Incremental Learning: The other side of this work is incremental learning, the goal of which is adapting a model to new information, data or tasks without forgetting the ones that it was trained on earlier. The phenomenon of losing sight of old information is called catastrophic forgetting Castro et al. (2018); Goodfellow et al. (2014); Kirkpatrick et al. (2017); Mc Closkey and Cohen (1989) and it is one of the main challenges of incremental learning. The area closest to our work from the field of incremental learning is lifelong learning. In this area catastrophic forgetting is expressed in the "stability-plasticity dilemma" Mermillod et al. (2013). There have been three main approaches to tackle this issue, regularization based approaches Nguyen et al. (2017a), memory based approaches Tulving (1985) and architecture based approaches Yoon et al. (2017); Lopez-Paz and Ranzato (2017). As shown in the current literature, combinations of these methods can be utilized in practice Sodhani et al. (2020) as well. All of the previous directions are studied in depth in Sodhani et al. (2022) and a brief overview of the techniques of the field is also presented in Castro et al. (2018). The area of lifelong learning also includes algorithms such as ELLA Ruvolo and Eaton (2013) B PROOF OF LEMMA 6 In this section we provide the proof of Lemma 6. Lemma 6. Let ˆxj D the output of Algorithm 1 for stage j [n]. Then for all i {j + 1, . . . , n}, ˆxj x i 2 2 8 2 + 2 ˆxj x j 2 2 Published as a conference paper at ICLR 2024 The proof of Lemma 6 follows by the proof of Lemma 7 that we state and prove up next. The proof of Lemma 6 is deferred at the end of this section. Lemma 7. For all i [n 1] and j [n i], x i+j x i 2 2j G µ(2i + j) where x i = arg minx D gi(x) and x i+j = arg minx D gi+j(x). Proof of Lemma 7. By the optimality of x i+j D for the function gi+j(x) we get that gi+j(x i+j), x i x i+j 0 The inequality is shown in (Boyd and Vandenberghe, 2004, equation 4.21). 
From the strong convexity of gi+j(x) we get that x i+j x i 2 2 gi+j(x i ) gi+j(x i+j) gi+j(x i+j), x i x i+j gi+j(x i ) gi+j(x i+j) k=1 fk(x i ) 1 i + j k=1 fk(x i+j) = i i + j gi(x i ) gi(x i+j) + 1 i + j fk(x i ) fk(x i+j) x i+j x i 2 2 gi(x i ), x i+j x i fk(x i ) fk(x i+j) x i+j x i 2 2 k=i+1 G x i+j x i 2 x i+j x i 2 2 + j G x i+j x i 2 i + j We conclude the section with the proof of Lemma 6. Proof of Lemma 6. By the inequality a + b 2 2 2 a 2 2 + 2 b 2 2 we get that ˆxj x i 2 2 2 x i x j 2 2 + ˆxj x j 2 2 2 + ˆxj x j 2 2 The second inequality comes from Lemma 7 and strong convexity. C PROOF OF LEMMA 2 In this section we provide the proof of Lemma 2. Lemma 2 (Unbiased). Let t i the gradient estimator used in Step 5 of Algorithm 2. Then for all t [Ti], E [ t i] = gi(xt i). Published as a conference paper at ICLR 2024 The basic step in order to establish Lemma 2 is Lemma 4 that we present up next. The proof of Lemma 2 is deferred at the end of the section. Lemma 4. At Step 10 of Algorithm 1, it holds that i 1 = Pi 1 k=1 fk(ˆxprev)/(i 1) Proof of Lemma 4. We will inductively establish Lemma 4. Notice that after stage i = 1, Algorithm 1 sets 1 = f1(ˆx1). Up next we show that in case the induction hypothesis holds for stage i 1 then it must essentially hold for stage i. Up next we consider the following 3 mutually exclusive cases: 1. i prev αi meaning that prev is updated in this stage. 2. (i 1) prev α(i 1) meaning that prev was update in the previous stage. 3. i prev < αi and (i 1) prev < α(i 1). For the first case Algorithm 1 reaches Step 5 and thus i 1 = Pi 1 j=1 fj(ˆxi 1)/i 1. At the same time prev is set to i 1 meaning that prev = i 1 and i 1 = 1 i 1 Pi 1 k=1 fk(ˆxprev). For the second case, Algorithm 1 had reached Step 12 at stage i 1 meaning that i 1 = Pi 1 j=1 fj(ˆxi 1)/i 1. At the same time prev was set to i 1, meaning that prev = i 1 and i 1 = 1 i 1 Pi 1 k=1 fk(ˆxprev). For the third case, from the inductive hypothesis we have that: i 2 = 1 i 2 k=1 fk(ˆxprev) At stage i 1, i 1 was calculated according to Step 16 of Algorithm 1. As a result, i 1 = 1 1 i 1 i 2 + 1 i 1 fi 1(ˆxprev) k=1 fk(ˆxprev) + 1 i 1 fi 1(ˆxprev) k=1 fk(ˆxprev) Proof of Lemma 2. Let Fi t denote the filtration until step t [Ti] of stage i [n]. By the definition of t i at Step 5 of Algorithm 2 we know that E t i | Fi t = 1 1 E fut(xt i) fut(ˆxprev) | Fi t + i 1 + 1 k=1 Pr [ut = k] fk(xt i) fk(ˆxprev) + i 1 fk(xt i) fk(ˆxprev) + i 1 k=1 fk(xt i) 1 k=1 fk(ˆxprev) + i 1 Published as a conference paper at ICLR 2024 Lemma 4 ensures that i 1 i Pi 1 k=1 fk(ˆxprev) meaning that E t i | Fi t = 1 k=1 fk(xt i) + 1 i fi(xt i) = gi(xt i). D PROOF OF LEMMA 3 In this section we provide the proof of Lemma 3. Lemma 3 (Bounded Variance). Let t i the gradient estimator used in Step 5 of Algorithm 2. Then for all t [Ti], E h t i gi(xt) 2 2 i 8L2E h xt i x i 2 2 µ2 α2 + 16L2E h x prev ˆxprev 2 2 where α > 0 is the parameter used at Step 4 of Algorithm 1. In order to prove Lemma 3 we first state and establish Lemma 5 Lemma 5. For all rounds t Ti of stage i [n], E h t i gi(xt i) 2 2 i 8L2E h xt i x i 2 2 i + 8L2E h x i ˆxprev 2 2 i (6) where prev is defined in Algorithm 1. Proof of Lemma 5. 
By substituting the definition of t i: E[ t i gi(xt i) 2 2] = 1 1 2 E fut(xt i) fut(ˆxprev) + i 1 gi 1(xt i) 2 2 E h fut(xt i) fut(ˆxprev) 2 2 k=1 fk(xt i) i 1 2 E h fut(xt i) fut(ˆxprev) 2 2 k=1 ( fk(xt i) fk(ˆxprev)) 2 L2 E h xt i ˆxprev 2 2 2 1 (i 1)2 E k=1 ( fk(xt i) fk(ˆxprev)) 2 L2 E h xt i ˆxprev 2 2 2 i 1 (i 1)2 E fk(xt i) fk(ˆxprev) 2 2 2 L2 E h xt i ˆxprev 2 2 k=1 E h fk(xt i) fk(ˆxprev) 2 2 2 L2 E h xt i ˆxprev 2 2 k=1 E h fk(xt i) fk(ˆxprev) 2 2 2 L2 E h xt i ˆxprev 2 2 k=1 E h xt i ˆxprev 2 2 2 L2E h xt i ˆxprev 2 2 2 E[ xt i x i 2 2] + E[ x i ˆxprev 2 2] Published as a conference paper at ICLR 2024 Proof of Lemma 3. Applying Lemma 6 for j := prev we get that, E h x i ˆxprev 2 2 i 8 µ2 2 + 2E h x prev ˆxprev 2 2 + 2E h x prev ˆxprev 2 2 + 2E h x prev ˆxprev 2 µ2 α2 + 2E h x prev ˆxprev 2 The third inequality comes from the fact that i prev αi which is enforced by Step 4 of Algorithm 1. By Lemma 5 we get that E[ t i gi(xt i) 2 2] 8L2E[ xt i x i 2 2] + 8L2E[ x i ˆxprev 2 2] 8L2E[ xt i x i 2 2] + 64G2L2 µ2 α2 + 16L2E h x prev ˆxprev 2 which concludes the proof. E PROOF OF LEMMA 8 In this section we provide proof for Lemma 8. Lemma 8 (Convergence Bound). If E[gj(ˆxj)] gj(x j) ϵ for all stages j [i 1] then E [gi (ˆxi) gi(x i )] 1 8 (β 1)(β 2)E h x0 i x i 2 2 i + κ2G2α2 1 µTi + 2κ2ϵTi where Z = Ti(Ti 1) 2 + (Ti + 1)(β 1), κ = 6L/µ and β = 72L2/µ2. Proof of Lemma 8. In order to establish Lemma 8 we use Lemma 9 which is similar to (Allen-Zhu, 2018b, Lemma 3.2). The proof of Lemma 9 is presented in Appendix E.1. Lemma 9. In case E[ t i] = gi(xt i) then E gi(xt+1 i ) gi(x i ) E 9 16γt t i gi(xt i) 2 2 + 1 µγt x i xt i 2 2 1 x i xt+1 i 2 2 where xt+1 i := ΠD (xt i γt t i), γt = 4 µ(t+β) and β = 72L2/µ2 (see Step 1 of Algorithm 2). Lemma 2 (Unbiased). Let t i the gradient estimator used in Step 5 of Algorithm 2. Then for all t [Ti], E [ t i] = gi(xt i). Since Lemma 2 establishes that E[ t i] = gi(xt i) we can apply Lemma 9. In order to bound the variance E h t i gi(xt i) 2 2 i appearing in the RHS of Equation 8, we use Lemma 3 (see Appendix D for its proof). Lemma 3 (Bounded Variance). Let t i the gradient estimator used in Step 5 of Algorithm 2. Then for all t [Ti], E h t i gi(xt) 2 2 i 8L2E h xt i x i 2 2 µ2 α2 + 16L2E h x prev ˆxprev 2 2 where α > 0 is the parameter used at Step 4 of Algorithm 1. Published as a conference paper at ICLR 2024 As discussed above by combining Lemma 9 with Lemma 3, we get the following: E gi(xt+1 i ) gi(x i ) γ 1 t + 9L2γt µ 2 E h xt i x i 2 2 2 E h xt+1 i x i 2 2 µ2 α2γt + 9L2γt E h x prev ˆxprev 2 By the strong-convexity of gprev( ) we get x prev ˆxprev 2 µ gprev(ˆxprev) gprev(x prev) and from our inductive hypothesis E gprev(ˆxprev) gprev(x prev) ϵ Notice that by the selection of γt = 4/(µ(t + β)) and β = 72L2/µ2 of Algorithm 2 we get that 9L2γt = 36L2 µ(t + β) 36L2 µ72L2/µ2 36L2 Using the previous inequalities in Equation 9 we get that E [gi(xt+1) gi(x i )] γ 1 t µ/2 2 E h xt i x i 2 2 2 E h xt+1 i x i 2 2 + κ2G2α2γt + 18L2γt Since κ = 6L/µ. Multiplying both parts of Equation 10 with (t + β 1) and substituting γt = 4/ (µ(t + β)) we get that (t + β 1)E gi(xt+1 i ) gi(x i ) µ(t + β 1)(t + β 2) 8 E h xt i x i 2 2 µ(t + β 1)(t + β) 8 E h xt+1 i x i 2 2 + κ2G2α2 + 18L2 µ ϵ 4 µ(t + β) (t + β 1) where β = 72L2/µ2. By setting κ = 6L/µ we get that, (t + β 1)E gi(xt+1 i ) gi(x i ) µ(t + β 1)(t + β 2) 8 E h xt i x i 2 2 µ(t + β 1)(t + β) 8 E h xt+1 i x i 2 2 µ α2 + 2κ2ϵ By taking the summation over all iterations t = 0, . . . 
, Ti 1, we get t=0 (t + β 1)E[gi(xt+1 i ) gi(x i )] µ t=0 (t + β 1)(t + β 2)E h xt i x i 2 2 t=0 (t + β 1)(t + β)E h xt+1 i x i 2 2 µ α2 + 2κ2ϵ Published as a conference paper at ICLR 2024 As a result, we get that t=0 (t + β 1)E[gi(xt+1 i ) gi(x i )] µ 8 (β 1)(β 2)E h x0 i x i 2 2 8 (Ti + β 2)(Ti + β 1)E h x Ti x i 2 2 i µ α2 + 2κ2ϵ Dividing by Z = PTi 1 t=0 (t + β 1) = Ti(Ti 1)/2 + Ti(β 1) and using the convexity of gi we get that t=0 (t + β 1)xt+1 i 8 (β 1)(β 2)E h x0 i x i 2 2 As a result, E [gi (ˆxi) gi(x i )] 1 8 (β 1)(β 2)E h x0 i x i 2 2 i + κ2G2α2 1 µTi + 2κ2ϵTi E.1 PROOF OF LEMMA 9 In this section we provide the proof for Lemma 9. Lemma 9. In case E[ t i] = gi(xt i) then E gi(xt+1 i ) gi(x i ) E 9 16γt t i gi(xt i) 2 2 + 1 µγt x i xt i 2 2 1 x i xt+1 i 2 2 where xt+1 i := ΠD (xt i γt t i), γt = 4 µ(t+β) and β = 72L2/µ2 (see Step 1 of Algorithm 2). Proof of Lemma 9. We start with (Allen-Zhu, 2018b, Lemma 3.2), which we state below. Lemma 10. Allen-Zhu (2018b) If wt+1 = arg miny Rd{ 1 2η y wt 2 2 + ψ(y) + t, y } for some random vector t Rd satisfying E[ t] = f(wt), then for every u Rd, we have E[F(wt+1) F(u)] E " η 2(1 ηL) 2 + (1 µfη) u wt 2 2 (1 + µψη) u wt+1 2 2 2η where ψ is a proximal term, µψ is its strong convexity and µf is the strong convexity of the optimization function f and η is the step of the algorithm and the function F is defined as: F(x) = f(x) + ψ(x) For our setting the proximal term is: ψ(x) = 0, x D , x / D This means that µψ = 0. The step η for our analysis is γt and f gi. By taking u = x i = arg minx D gi(x) and since due to projection on D F(xt+1 i ) = gi(xt+1 i ) + ψ(xt+1 i ) = gi(xt+1 i ) Published as a conference paper at ICLR 2024 the inequality can be restated as: E[gi(xt+1 i ) gi(x i )] E γt 2(1 γt L) t i gi(xt i) 2 2 + 1 µγt x i xt i 2 2 1 x i xt+1 i 2 2 Notice that by the selection of γt = 4/ (µ(t + β)) in Step 6 of Algorithm 2 we can do the following simplification 1 (1 γt L) = 1 (1 4L µ(t+β)) 1 (1 4L µβ ) = 1 (1 4µ 72L) 1 (1 4 which gives the theorem statement. The last part of the proof required, is to show that the following two update rules are equivalent. xt+1 i = arg min y Rd { 1 y xt i 2 2 + ψ(y) + t i, y } xt+1 i = ΠD xt i γt t i which is a well known fact in the literature and this concludes the proof. F OMITTED PROOFS OF SECTION 4 F.1 PROOF OF LEMMA 1 In this section we prove Lemma 1, for the sake of exposition we restate it up next. Lemma 1. Over a sequence of n stages, Algorithm 1 reaches Step 5 and 12, log n/α times. Proof of Lemma 1. Step 5 and 12 are only executed when the following inequality is satisfied: i prev α i i 1 1 α prev (11) Once Algorithm 1 reaches Step 5 and 12 it necessarily, reaches Step 13 where prev is updated to i. Let z0 = 1, z1, . . . , zk, . . . the sequence of stages where zk denotes the stage at which Algorithm 1 reached Step 5 and 12 for the k-th time. By Equation 11 we get that zk+1 1 1 α zk implying that Since zk n we get that k log n log( 1 1 α). Notice that log 1 1 α = log(1 α) 1 (1 α) = α and thus k log n Using Lemma 1, we can now also show Corollary 1. Corollary 1. Over a sequence of n stages, Algorithm 1 requires 3 Pn i=1 Ti + 2n log n/α FOs. Proof of Corollary 1. At each iteration of Algorithm 2 requires 3 FOs (Step 5) and thus Algorithm 2 requires overall 3Ti FOs during stage i [n]. At Step 5 and 12 Algorithm 1 requires at most n FOs and thus by Lemma 1 it overall requires 2n log n/α FOs. 
F.2 PROOF OF THEOREM 4

In this section we provide the proof of Theorem 4; for the sake of exposition we restate it next.

Theorem 4. Let $D$ be a convex and compact set and $f_1, \ldots, f_n$ a sequence of $\mu$-strongly convex functions with $f_i : D \mapsto \mathbb{R}$. Then Algorithm 1, with $T_i = \frac{720GL^2}{\mu^{5/2}i\sqrt{\epsilon}} + \frac{9L^{2/3}G^{2/3}}{\epsilon^{1/3}\mu} + \frac{864L^2}{\mu^2}$ and $\alpha = \frac{\mu\epsilon^{1/3}}{20G^{2/3}L^{2/3}}$, guarantees $\mathbb{E}[g_i(\hat{x}_i)] - g_i(x_i^*) \le \epsilon$ for each stage $i \in [n]$, where $\hat{x}_i \in D$ is the output of Algorithm 2 at Step 9 of Algorithm 1.

Proof of Theorem 4. At stage $i := 1$, Algorithm 1 performs gradient descent on $f_1$ in order to produce $\hat{x}_1 \in D$. The latter requires $O(L\log(1/\epsilon)/\mu)$ FOs. Let us inductively assume that $\mathbb{E}[g_j(\hat{x}_j)] - g_j(x_j^*) \le \epsilon$ for all $j \in [i-1]$. Using the latter we will establish that $\mathbb{E}[g_i(\hat{x}_i)] - g_i(x_i^*) \le \epsilon$. To do so we first use Lemma 8, whose proof can be found in Appendix E and is based on the fact that the estimator $\nabla_i^t$ is unbiased and admits bounded variance (see Lemmas 2 and 3).

Lemma 8 (Convergence Bound). If $\mathbb{E}[g_j(\hat{x}_j)] - g_j(x_j^*) \le \epsilon$ for all stages $j \in [i-1]$, then
$$
\mathbb{E}\big[g_i(\hat{x}_i) - g_i(x_i^*)\big] \le \frac{1}{Z}\Big(\frac{\mu}{8}(\beta-1)(\beta-2)\,\mathbb{E}\big[\|x_i^0 - x_i^*\|_2^2\big] + \frac{\kappa^2G^2\alpha^2}{\mu}T_i + 2\kappa^2\epsilon T_i\Big), \quad (7)
$$
where $Z = \frac{T_i(T_i-1)}{2} + (T_i+1)(\beta-1)$, $\kappa = 6L/\mu$ and $\beta = 72L^2/\mu^2$.

Since the conditions of Lemma 8 are ensured by the induction hypothesis, we will appropriately select $T_i$ such that the right-hand side of Equation 7 is upper bounded by $\epsilon > 0$. At first, we upper bound the term $\mathbb{E}[\|x_i^0 - x_i^*\|_2^2]$ appearing in the first term of the RHS of Equation 7. Recall that in Step 2 of Algorithm 2 we set $x_i^0 \leftarrow \hat{x}_{i-1} \in D$. As a result, $\|x_i^0 - x_i^*\|_2^2 = \|\hat{x}_{i-1} - x_i^*\|_2^2$. In order to upper bound the term $\|\hat{x}_{i-1} - x_i^*\|_2^2$ we use Lemma 6, whose proof can be found in Appendix B.

Lemma 6. Let $\hat{x}_j \in D$ be the output of Algorithm 1 for stage $j \in [n]$. Then for all $i \in \{j+1, \ldots, n\}$,
$$
\|\hat{x}_j - x_i^*\|_2^2 \le \frac{8G^2}{\mu^2}\Big(\frac{i-j}{i+j}\Big)^2 + 2\,\|\hat{x}_j - x_j^*\|_2^2.
$$

Applying Lemma 6 for $j := i-1$ we get that
$$
\mathbb{E}\big[\|x_i^0 - x_i^*\|_2^2\big] = \mathbb{E}\big[\|\hat{x}_{i-1} - x_i^*\|_2^2\big] \le \frac{8G^2}{\mu^2(2i-1)^2} + 2\,\mathbb{E}\big[\|\hat{x}_{i-1} - x_{i-1}^*\|_2^2\big].
$$
By the inductive hypothesis we know that $\mathbb{E}[g_{i-1}(\hat{x}_{i-1})] - g_{i-1}(x_{i-1}^*) \le \epsilon$, and thus by the strong convexity of $g_{i-1}(\cdot)$ we get that $\mathbb{E}[\|\hat{x}_{i-1} - x_{i-1}^*\|_2^2] \le 2\epsilon/\mu$, which yields Equation 13:
$$
\mathbb{E}\big[\|x_i^0 - x_i^*\|_2^2\big] \le \frac{8G^2}{\mu^2(2i-1)^2} + \frac{4\epsilon}{\mu}. \quad (13)
$$
The value of $Z$ can be lower bounded as follows: $Z := T_i(T_i-1)/2 + (T_i+1)(\beta-1) \ge T_i(T_i-1)/2 \ge T_i^2/4$. So by combining Equation 7 with the previous two inequalities we get:
$$
\mathbb{E}[g_i(\hat{x}_i)] - g_i(x_i^*) \le \frac{4}{T_i^2}\Big(\frac{\mu}{8}(\beta-1)(\beta-2)\Big(\frac{8G^2}{\mu^2(2i-1)^2} + \frac{4\epsilon}{\mu}\Big) + \frac{\kappa^2G^2\alpha^2}{\mu}T_i + 2\kappa^2\epsilon T_i\Big)
= \frac{4G^2(\beta-1)(\beta-2)}{\mu T_i^2(2i-1)^2} + \frac{2\epsilon(\beta-1)(\beta-2)}{T_i^2} + \frac{4\kappa^2G^2\alpha^2}{T_i\mu} + \frac{8\kappa^2\epsilon}{T_i}.
$$
Now the upper bound on $\mathbb{E}[g_i(\hat{x}_i)] - g_i(x_i^*)$ admits four terms, all of which depend on $T_i$. Thus we can select $T_i$ so that each of the terms is upper bounded by $\epsilon/4$. Namely,
$$
\frac{4G^2(\beta-1)(\beta-2)}{\mu T_i^2(2i-1)^2} \le \frac{\epsilon}{4}, \qquad \frac{2\epsilon(\beta-1)(\beta-2)}{T_i^2} \le \frac{\epsilon}{4}, \qquad \frac{4\kappa^2G^2\alpha^2}{T_i\mu} \le \frac{\epsilon}{4}, \qquad \frac{8\kappa^2\epsilon}{T_i} \le \frac{\epsilon}{4}.
$$
Since $\kappa = 6L/\mu$, we get that we can set $T_i$ as follows:
$$
T_i = \max\Big\{\frac{192\,GL^2}{\mu^{5/2}i\sqrt{\epsilon}},\ \frac{204L^2}{\mu^2},\ \frac{576L^2G^2\alpha^2}{\mu^3\epsilon},\ \frac{1152L^2}{\mu^2}\Big\}
= \max\Big\{\frac{192\,GL^2}{\mu^{5/2}i\sqrt{\epsilon}},\ \frac{576L^2G^2\alpha^2}{\mu^3\epsilon},\ \frac{1152L^2}{\mu^2}\Big\}
\le \frac{192\,GL^2}{\mu^{5/2}i\sqrt{\epsilon}} + \frac{576L^2G^2\alpha^2}{\mu^3\epsilon} + \frac{1152L^2}{\mu^2}.
$$
The proof is completed by selecting $\alpha = \mu\epsilon^{1/3}/(9G^{2/3}L^{2/3})$.

G PROOF OF THEOREM 1

In this section we provide the proof of Theorem 1.

Theorem 1. There exists a first-order method, CSVRG (Algorithm 1), for continual finite-sum minimization (2) with $O\Big(\frac{G^{2/3}L^{2/3}}{\mu}\cdot\frac{n\log n}{\epsilon^{1/3}} + \frac{GL^2}{\mu^{5/2}}\cdot\frac{\log n}{\sqrt{\epsilon}}\Big)$ FO complexity.

Proof of Theorem 1. The proof of Theorem 1 follows by combining Theorem 4 and Corollary 1, whose proofs are respectively presented in Appendix F.2 and F.1.

Theorem 4. Let $D$ be a convex and compact set and $f_1, \ldots, f_n$ a sequence of $\mu$-strongly convex functions with $f_i : D \mapsto \mathbb{R}$.
Then Algorithm 1, with $T_i = \frac{720GL^2}{\mu^{5/2}i\sqrt{\epsilon}} + \frac{9L^{2/3}G^{2/3}}{\epsilon^{1/3}\mu} + \frac{864L^2}{\mu^2}$ and $\alpha = \frac{\mu\epsilon^{1/3}}{20G^{2/3}L^{2/3}}$, guarantees $\mathbb{E}[g_i(\hat{x}_i)] - g_i(x_i^*) \le \epsilon$ for each stage $i \in [n]$, where $\hat{x}_i \in D$ is the output of Algorithm 2 at Step 9 of Algorithm 1.

Corollary 1. Over a sequence of $n$ stages, Algorithm 1 requires $3\sum_{i=1}^n T_i + 2n\log n/\alpha$ FOs.

Using the selection of $T_i$ provided in Theorem 4 in Corollary 1 we get that
$$
\sum_{i=1}^n T_i \le \frac{720GL^2}{\mu^{5/2}\sqrt{\epsilon}}\sum_{i=1}^n\frac{1}{i} + n\,\frac{9L^{2/3}G^{2/3}}{\epsilon^{1/3}\mu} + n\,\frac{864L^2}{\mu^2}
\le \frac{720GL^2\log n}{\mu^{5/2}\sqrt{\epsilon}} + n\,\frac{9L^{2/3}G^{2/3}}{\epsilon^{1/3}\mu} + n\,\frac{864L^2}{\mu^2}.
$$
At the same time, using the selection of $\alpha$ provided in Theorem 4 we get
$$
\frac{2n\log n}{\alpha} = \frac{40\,G^{2/3}L^{2/3}}{\mu\epsilon^{1/3}}\,n\log n.
$$

H PROOF OF THEOREM 3

In this section we provide the proof of Theorem 3. We first present an intermediate theorem that is the main technical contribution of the section.

Theorem 5. Let $\mathcal{A}$ be a natural first-order method (even randomized) that, given a sequence of $n$ strongly convex functions $f_1, \ldots, f_n$, outputs a sequence $\hat{x}_1, \ldots, \hat{x}_n \in D$ by performing overall $o(n^2)$ FOs. Then there exists a sequence of strongly convex functions $f_1, \ldots, f_n$ with $f_i : [-1,1]^d \mapsto \mathbb{R}$ and $\mu, G, L = O(1)$ such that
$$
\mathbb{E}[g_i(\hat{x}_i)] - g_i(x_i^*) \ge \Omega(1/n^4) \quad \text{for some stage } i \in [n].
$$
The proof of Theorem 5 lies in Section H.1. We first use Theorem 5 to establish Theorem 3.

Theorem 3. For any $\alpha > 0$, there is no natural first-order method for Problem (2) with $O\big(n^{2-\alpha}\log(1/\epsilon)\big)$ FO complexity. Moreover, for any $\alpha < 1/4$, there is no natural first-order method for continual finite-sum minimization (2) with $O(n/\epsilon^{\alpha})$ FO complexity.

Proof of Theorem 3. Let us assume that there exists a natural first-order method $\mathcal{A}$ with overall complexity $O\big(n^{2-\alpha}\log(1/\epsilon)\big)$ for some $\alpha > 0$. By setting $\epsilon = O(1/n^5)$, we get that there exists a natural first-order method that for any sequence $f_1, \ldots, f_n$ guarantees $\mathbb{E}[g_i(\hat{x}_i)] - g_i(x_i^*) \le O(1/n^5)$ for each stage $i \in [n]$ with overall FO complexity $O(n^{2-\alpha}\log n) = o(n^2)$. However, the latter contradicts Theorem 5. Respectively, let us assume that there exists a natural first-order method $\mathcal{A}$ with overall complexity $O(n/\epsilon^{\alpha})$ for some $\alpha < 1/4$. By setting $\epsilon = O(1/n^4)$, we get that there exists a natural first-order method that for any sequence $f_1, \ldots, f_n$ guarantees $\mathbb{E}[g_i(\hat{x}_i)] - g_i(x_i^*) \le O(1/n^4)$ for each stage $i \in [n]$ with overall FO complexity $O(n^{1+4\alpha}) = o(n^2)$. However, the latter again contradicts Theorem 5.

H.1 PROOF OF THEOREM 5

Proof of Theorem 5. To simplify notation, we denote by $[x]_\ell$ the coordinate $\ell$ of a vector $x \in \mathbb{R}^d$. In our lower bound construction we consider $d := 2$. Since $\mathcal{A}$ performs $o(n^2)$ FOs, there exists a stage $i > 1$ such that $\sum_{t \le T_i}|Q_i^t| \le i/2$ (otherwise the overall number of FOs would be at least $\sum_i\sum_{t \le T_i}|Q_i^t| \ge n^2/4$). Using this specific index $i \in [n]$ we construct the following (random) sequence of functions $f_1, \ldots, f_n$, where each $f_\ell : [-1,1]^2 \mapsto \mathbb{R}$:
$$
f_\ell(x) = \begin{cases}
([x]_1)^2 + ([x]_2)^2, & \ell \neq i, k\\
([x]_1)^2 + ([x]_2)^2 + ([x]_1 - [x]_2)^2, & \ell = k\\
([x]_1 - 1)^2 + ([x]_1)^2 + ([x]_2)^2, & \ell = i,
\end{cases}
$$
where $k \sim \mathrm{Unif}(\{1, \ldots, i-1\})$. Before proceeding we remark that each function $f_\ell : [-1,1]^2 \mapsto \mathbb{R}$ is $G$-Lipschitz, $L$-smooth and $\mu$-strongly convex with $G, L, \mu = O(1)$.

Corollary 2. Each $f_\ell : [-1,1]^2 \mapsto \mathbb{R}$ is 2-strongly convex, 6-smooth and $6\sqrt{2}$-Lipschitz.

Let the initial point of $\mathcal{A}$ be $\hat{x}_0 = (0,0) \in [-1,1]^2$. Notice that $\nabla f_\ell(0,0) = (0,0)$ for each $\ell \neq i$. Since $\mathcal{A}$ is a natural first-order method, Definition 3 implies that $\mathcal{A}$ always remains at $(0,0)$ at all stages $j \le i-1$. The latter is formally stated and established in Lemma 11.

Lemma 11. For any stage $j \in [i-1]$, $[\hat{x}_j]_1 = [\hat{x}_j]_2 = 0$ and $[x_j^t]_1 = [x_j^t]_2 = 0$ for all rounds $t \in [T_j]$.
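Before turning to the proofs of Lemmas 11 and 12, we note that the construction above can be sanity-checked numerically. The following minimal sketch (our own, with illustrative variable names) evaluates $g_i$ at the best point whose second coordinate is forced to zero — the situation Lemma 12 establishes when $f_k$ is never queried — and at the true minimizer derived later in the proof, and compares the gap with the closed form $1/(i^4+5i^3+7i^2+2i)$.

```python
def g_i(x1, x2, i):
    """Prefix average of the construction above: (i-2) copies of x1^2 + x2^2,
    one copy of f_k and one copy of f_i (valid for i >= 3)."""
    plain = (i - 2) * (x1 ** 2 + x2 ** 2)
    f_k = x1 ** 2 + x2 ** 2 + (x1 - x2) ** 2
    f_i = (x1 - 1) ** 2 + x1 ** 2 + x2 ** 2
    return (plain + f_k + f_i) / i

for i in (3, 10, 100, 1000):
    # Best point with second coordinate forced to 0 (the case of Lemma 12).
    stuck = g_i(1.0 / (i + 2), 0.0, i)
    # True minimizer (w*, z*) = ((i+1)/D, 1/D) with D = i^2 + 3i + 1.
    D = i ** 2 + 3 * i + 1
    best = g_i((i + 1) / D, 1.0 / D, i)
    closed_form = 1.0 / (i ** 4 + 5 * i ** 3 + 7 * i ** 2 + 2 * i)
    print(i, stuck - best, closed_form)  # the two gaps agree and decay like 1/i^4
```

The gap indeed scales as $\Theta(1/i^4)$, matching the $\Omega(1/n^4)$ lower bound claimed in Theorem 5.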
The proof of Lemma 11 lies in Section H.2. It is based on the simple observation that if $\mathcal{A}$ has stayed at $(0,0)$ at all stages $\ell \le i-2$, then at stage $\ell+1 \le i-1$, Item 1 of Definition 3 implies that $\mathcal{A}$ always queries $\nabla f_j(0,0)$ for some $j \le i-1$, meaning that $\nabla f_j(0,0) = (0,0)$. Then Items 2 and 3 of Definition 3 directly imply that $x_{\ell+1}^t = (0,0)$ and $\hat{x}_{\ell+1} = (0,0)$.

Corollary 3. With probability greater than $1/2$, the function $f_k$ is never queried during stage $i$. In other words, $q_{\mathrm{index}} \neq k$ for all $q \in \bigcup_{t \le T_i} Q_i^t$.

Proof. By the selection of stage $i \in [n]$, we know that $\sum_{t \le T_i}|Q_i^t| \le i/2$. Since $k$ was sampled uniformly at random in $\{1, \ldots, i-1\}$, $\Pr\big[q_{\mathrm{index}} \neq k \text{ for all } q \in \bigcup_{t \le T_i}Q_i^t\big] \ge 1/2$.

Notice that for all $\ell \neq k$, $\nabla f_\ell(\cdot, 0) = (\cdot, 0)$. The main idea of the proof is that, in case $k$ is never queried during stage $i$ ($q_{\mathrm{index}} \neq k$ for all $q \in \bigcup_{t \le T_i}Q_i^t$), all FOs during stage $i$ admit the form $\nabla f_{q_{\mathrm{index}}}(q_{\mathrm{value}}) = \nabla f_{q_{\mathrm{index}}}(\cdot, 0) = (\cdot, 0)$. The latter implies that if $q_{\mathrm{index}} \neq k$ for all $q \in \bigcup_{t \le T_i}Q_i^t$, then $[\hat{x}_i]_2 = 0$.

Lemma 12. In case $q_{\mathrm{index}} \neq k$ for all $q \in \bigcup_{t \le T_i}Q_i^t$, then $[\hat{x}_i]_2 = 0$.

The proof of Lemma 12 is based on the intuition that we presented above and is given in Appendix H.2. We are now ready to complete the proof of Theorem 5. Combining Lemma 12 with Corollary 3 we get that
$$
\Pr\big[[\hat{x}_i]_2 = 0\big] \ge 1/2. \quad (14)
$$
Up next we lower bound the error $g_i(\hat{x}_i) - g_i(x_i^*)$ in case $[\hat{x}_i]_2 = 0$. By the definition of the sequence $f_1, \ldots, f_k, \ldots, f_i$, we get that
$$
g_i(x) := ([x]_1)^2 + ([x]_2)^2 + \frac{1}{i}([x]_1 - 1)^2 + \frac{1}{i}([x]_1 - [x]_2)^2.
$$
In case $[\hat{x}_i]_2 = 0$ we get that
$$
g_i(\hat{x}_i) = \Big(1+\frac{1}{i}\Big)([\hat{x}_i]_1)^2 + \frac{1}{i}\big([\hat{x}_i]_1 - 1\big)^2 \ge \min_{w\in[-1,1]}\Big[\Big(1+\frac{1}{i}\Big)w^2 + \frac{1}{i}(w-1)^2\Big].
$$
Notice that the minimizer is $w^* = \frac{1}{i+2}$, so that
$$
g_i(\hat{x}_i) \ge \frac{i+1}{i(i+2)^2} + \frac{(i+1)^2}{i(i+2)^2} = \frac{i+1}{i(i+2)}.
$$
At the same time, the minimum $g_i(x_i^*)$ equals
$$
g_i(x_i^*) = \min_{w,z\in[-1,1]}\Big[w^2 + z^2 + \frac{1}{i}(w-1)^2 + \frac{1}{i}(w-z)^2\Big].
$$
Notice that the minimizer is $(w^*, z^*) = \Big(\frac{i+1}{i^2+3i+1},\ \frac{1}{i^2+3i+1}\Big)$, so that
$$
g_i(x_i^*) = \Big(\frac{i+1}{i^2+3i+1}\Big)^2 + \Big(\frac{1}{i^2+3i+1}\Big)^2 + \frac{1}{i}\Big(\frac{i+1}{i^2+3i+1} - 1\Big)^2 + \frac{1}{i}\Big(\frac{i}{i^2+3i+1}\Big)^2.
$$
By taking the difference $g_i(\hat{x}_i) - g_i(x_i^*)$ we get that
$$
g_i(\hat{x}_i) - g_i(x_i^*) \ge \frac{i+1}{i(i+2)} - g_i(x_i^*) = \frac{1}{i^4 + 5i^3 + 7i^2 + 2i}.
$$
We finally get that
$$
\mathbb{E}\big[g_i(\hat{x}_i) - g_i(x_i^*)\big] \ge \Pr\big[[\hat{x}_i]_2 = 0\big]\,\mathbb{E}\big[g_i(\hat{x}_i) - g_i(x_i^*)\,\big|\,[\hat{x}_i]_2 = 0\big] \ge \frac{1}{2}\cdot\frac{1}{i^4+5i^3+7i^2+2i} = \Omega(1/n^4).
$$

H.2 OMITTED PROOFS

Corollary 2. Each $f_\ell : [-1,1]^2 \mapsto \mathbb{R}$ is 2-strongly convex, 6-smooth and $6\sqrt{2}$-Lipschitz.

Proof. Notice that the Hessians are
$$
\nabla^2 f_\ell(x) = \begin{pmatrix}2 & 0\\ 0 & 2\end{pmatrix} \ \text{for } \ell \neq i,k, \qquad \nabla^2 f_k(x) = \begin{pmatrix}4 & -2\\ -2 & 4\end{pmatrix}, \qquad \nabla^2 f_i(x) = \begin{pmatrix}4 & 0\\ 0 & 2\end{pmatrix},
$$
meaning that $f_\ell$ (for $\ell \neq i,k$) is 2-strongly convex and 2-smooth, $f_k$ is 2-strongly convex and 6-smooth, and $f_i$ is 2-strongly convex and 4-smooth. Respectively, notice that the gradients satisfy: $\nabla f_\ell(x) = 2([x]_1, [x]_2)$ and thus $\max_{[x]_1,[x]_2\in[-1,1]}\|\nabla f_\ell(x)\|_2 \le 2\sqrt{2}$, so $f_\ell$ is $2\sqrt{2}$-Lipschitz in $[-1,1]^2$; $\nabla f_k(x) = 2(2[x]_1 - [x]_2,\ 2[x]_2 - [x]_1)$ and thus $\max_{[x]_1,[x]_2\in[-1,1]}\|\nabla f_k(x)\|_2 \le 6\sqrt{2}$, so $f_k$ is $6\sqrt{2}$-Lipschitz in $[-1,1]^2$; $\nabla f_i(x) = 2(2[x]_1 - 1,\ [x]_2)$ and thus $\max_{[x]_1,[x]_2\in[-1,1]}\|\nabla f_i(x)\|_2 \le 2\sqrt{10} \le 6\sqrt{2}$, so $f_i$ is $6\sqrt{2}$-Lipschitz in $[-1,1]^2$.

Lemma 11. For any stage $j \in [i-1]$, $[\hat{x}_j]_1 = [\hat{x}_j]_2 = 0$ and $[x_j^t]_1 = [x_j^t]_2 = 0$ for all rounds $t \in [T_j]$.

Proof of Lemma 11. By the construction of the sequence
$f_1, \ldots, f_n$, we get that for all $\ell \le i-1$,
$$
\nabla f_\ell(x) = 2([x]_1, [x]_2) \ \text{ for } \ell \neq k, i, \qquad \nabla f_k(x) = 2(2[x]_1 - [x]_2,\ 2[x]_2 - [x]_1).
$$
Thus in case $x = (0,0)$, then $\nabla f_\ell(x) = \nabla f_k(x) = (0,0)$. Up next we inductively establish Lemma 11. For the induction base we first establish the claim at stage $\ell := 1$.

Corollary 4. For stage $\ell = 1$, $x_1^t = (0,0)$ for all rounds $t \in [T_1]$ and $\hat{x}_1 = (0,0)$.

Proof. At round $t := 1$ of stage $\ell := 1$, Item 1 of Definition 3 ensures that for all FOs $q \in Q_1^1$, $q_{\mathrm{index}} = 1$ and $q_{\mathrm{value}} = (0,0)$. Since $\nabla f_1(x) = 2x$, the latter implies that $\nabla f_{q_{\mathrm{index}}}(q_{\mathrm{value}}) = (0,0)$ for all $q \in Q_1^1$. Thus Item 2 of Definition 3 ensures that $x_1^1 = (0,0)$. We prove Corollary 4 through induction.

Induction Base: $x_1^1 = (0,0)$.
Induction Hypothesis: $x_1^\tau = (0,0)$ for all $\tau \in \{1, \ldots, t-1\}$.
Induction Step: $x_1^t = (0,0)$.

Item 1 of Definition 3 ensures that for all $q \in Q_1^t$, $q_{\mathrm{value}} \in \big(\bigcup_{\tau \le t-1}\{x_1^\tau\}\big) \cup \{\hat{x}_0\}$, and thus by the inductive hypothesis $q_{\mathrm{value}} = (0,0)$. Thus $\nabla f_{q_{\mathrm{index}}}(q_{\mathrm{value}}) = (0,0)$ for all $q = (q_{\mathrm{index}}, q_{\mathrm{value}}) \in \bigcup_{\tau \le t}Q_1^\tau$. Then Item 2 of Definition 3 implies that $x_1^t = (0,0)$.

To this end we have established that $x_1^t = (0,0)$ for all $t \in [T_1]$. Then Item 3 of Definition 3 implies that $\hat{x}_1 = (0,0)$.

We complete the proof of Lemma 11 with a similar induction.

Induction Base: $x_1^t = (0,0)$ for all $t \in [T_1]$ and $\hat{x}_1 = (0,0)$.
Induction Hypothesis: $x_\ell^t = (0,0)$ for all $t \in [T_\ell]$ and $\hat{x}_\ell = (0,0)$, for all stages $\ell \le i-2$.
Induction Step: $x_{\ell+1}^t = (0,0)$ for all $t \in [T_{\ell+1}]$ and $\hat{x}_{\ell+1} = (0,0)$.

Let us start with round $t := 1$ of stage $\ell+1$. Item 1 of Definition 3 together with the inductive hypothesis ensure that for any FO $q \in Q_{\ell+1}^1$, $q_{\mathrm{value}} = (0,0)$. Since $q_{\mathrm{index}} \le \ell+1 \le i-1$, we get that $\nabla f_{q_{\mathrm{index}}}(q_{\mathrm{value}}) = (0,0)$ for all queries $q \in Q_{\ell+1}^1$. The latter together with Item 2 of Definition 3 imply $x_{\ell+1}^1 = (0,0)$. Let us inductively assume that $x_{\ell+1}^t = (0,0)$. Then, with the exact same argument as before, we get that $\nabla f_{q_{\mathrm{index}}}(q_{\mathrm{value}}) = (0,0)$ for all queries $q \in Q_{\ell+1}^{t+1}$, and again Item 2 of Definition 3 implies that $x_{\ell+1}^{t+1} = (0,0)$. To this end we have established that $x_{\ell+1}^t = (0,0)$ for all $t \in [T_{\ell+1}]$. Thus Item 3 of Definition 3 implies that $\hat{x}_{\ell+1} = (0,0)$. The latter completes the induction step and the proof of Lemma 11.

Lemma 12. In case $q_{\mathrm{index}} \neq k$ for all $q \in \bigcup_{t \le T_i}Q_i^t$, then $[\hat{x}_i]_2 = 0$.

Proof of Lemma 12. First notice that
$$
\nabla f_\ell(x) = 2([x]_1, [x]_2) \ \text{ for } \ell \neq k, i, \qquad \nabla f_i(x) = 2(2[x]_1 - 1,\ [x]_2).
$$
Thus for any $x = ([x]_1, 0)$, $\nabla f_\ell(x) = 2([x]_1, 0)$ and $\nabla f_i(x) = 2(2[x]_1 - 1,\ 0)$. Using the latter observation and Lemma 11, we inductively establish that in case $q_{\mathrm{index}} \neq k$ for all $q \in \bigcup_{t \le T_i}Q_i^t$, then $x_i^t = ([x_i^t]_1, 0)$ for all $t \in [T_i]$. We start by establishing the latter for $t = 1$.

Corollary 5. In case $q_{\mathrm{index}} \neq k$ for all $q \in \bigcup_{t \le T_i}Q_i^t$, then $x_i^1 = ([x_i^1]_1, 0)$.

Proof. Lemma 11 ensures that for all stages $\ell \le i-1$, $x_\ell^t = (0,0)$ for all $t \in [T_\ell]$ and $\hat{x}_\ell = (0,0)$. The latter together with Item 1 of Definition 3 imply that $q_{\mathrm{value}} = (0,0)$ for all $q \in Q_i^1$. Since $q_{\mathrm{index}} \neq k$ for all $q \in Q_i^1$, we get that $\nabla f_{q_{\mathrm{index}}}(q_{\mathrm{value}}) = (\cdot, 0)$ for all $q \in Q_i^1$. Then Item 2 of Definition 3 implies that $x_i^1 = (\cdot, 0)$.

We complete the proof of Lemma 12 through an induction.

Induction Base: $x_i^1 = (\cdot, 0)$.
Induction Hypothesis: $x_i^\tau = (\cdot, 0)$ for all $\tau \le t$.
Induction Step: $x_i^{t+1} = (\cdot, 0)$.

The induction hypothesis together with Lemma 11 and Item 1 of Definition 3 imply that for each FO $q \in Q_i^{t+1}$, $q_{\mathrm{value}} = (\cdot, 0)$. Since $q_{\mathrm{index}} \neq k$ for all $q \in Q_i^{t+1}$, we get that $\nabla f_{q_{\mathrm{index}}}(q_{\mathrm{value}}) = (\cdot, 0)$ for all $q \in Q_i^{t+1}$. Then Item 2 of Definition 3 implies that $x_i^{t+1} = (\cdot, 0)$.
To this end our induction is complete and we have established that $x_i^t = (\cdot, 0)$ for all $t \in [T_i]$. The fact that $\hat{x}_i = (\cdot, 0)$ is then directly implied by Item 3 of Definition 3.

I SPARSE STOCHASTIC GRADIENT DESCENT

In this section we present the sparse version of Stochastic Gradient Descent, called SGD-sparse (Algorithm 4), which is able to ensure accuracy $\epsilon > 0$ at each stage $i \in [n]$ with $O\big(|D|G^3\log n/\epsilon^2\big)$ FO complexity. In Algorithm 3 we first present SGD for a specific stage $i \in [n]$.

Algorithm 3 Stochastic Gradient Descent($i, x, \gamma, T_i$)
1: $x_i^0 \leftarrow x \in D$
2: for each round $t := 1, \ldots, T_i$ do
3: Sample $j_t \sim \mathrm{Unif}(\{1, \ldots, i\})$
4: $x_i^t \leftarrow \Pi_D\big(x_i^{t-1} - \gamma\,\nabla f_{j_t}(x_i^{t-1})/t\big)$
5: end for
6: return $\hat{x}_i \leftarrow \sum_{t=1}^{T_i} x_i^t / T_i$

SGD-sparse is presented in Algorithm 4. Algorithm 4 tracks a sparse sub-sequence of the stages $i \in [n]$ (those with $\mathrm{prev}\cdot(1+\alpha) < i$) at which the returned point $\hat{x}_i$ is produced by running SGD (Algorithm 3) for $T_i$ iterations. In the remaining stages Algorithm 4 simply returns the solution of the last stage at which SGD was used ($\hat{x}_i \leftarrow \hat{x}_{\mathrm{prev}}$, Step 8 of Algorithm 4). Notice that the sparsity of the SGD calls (Step 4) is controlled by the selection of the parameter $\alpha \ge 0$. For example, for $\alpha = 0$, Step 4 is reached at each stage $i \in [n]$, while for large values of $\alpha$, Step 4 might be reached only rarely.

Algorithm 4 Stochastic Gradient Descent-sparse
1: $\hat{x}_0 \in D$ and $\mathrm{prev} \leftarrow 0$
2: for each stage $i := 1, \ldots, n$ do
3: if $\mathrm{prev}\cdot(1+\alpha) < i$ then
4: $\hat{x}_i \leftarrow$ Stochastic Gradient Descent($i, \hat{x}_{i-1}, \gamma, T_i$)  # output $\hat{x}_i \in D$ for stage $i$
5: $\hat{x}_{\mathrm{prev}} \leftarrow \hat{x}_i$
6: $\mathrm{prev} \leftarrow i$
7: else
8: $\hat{x}_i \leftarrow \hat{x}_{\mathrm{prev}}$  # output $\hat{x}_i \in D$ for stage $i$
9: end if
10: end for

In the rest of the section we establish the formal guarantees of SGD-sparse (Algorithm 4). We start by presenting the well-known guarantees of SGD.

Theorem 6 (Hazan (2023)). Let $\hat{x}_i :=$ Stochastic Gradient Descent($i, \hat{x}_{i-1}, \gamma, T_i$) where $\gamma = 1/\mu$, $T_i = O\big(G^2/(\epsilon\mu)\big)$ and $G = \max_{j\in[n]}\max_{x\in D}\|\nabla f_j(x)\|_2$. Then $\mathbb{E}\big[g_i(\hat{x}_i) - g_i(x_i^*)\big] \le \epsilon$.

Theorem 7. Let $D$ be a convex and compact set and $f_1, \ldots, f_n$ a sequence of $\mu$-strongly convex functions with $f_i : D \mapsto \mathbb{R}$. Then Algorithm 4, with $T_i = O\big(G^2/(\epsilon\mu)\big)$, $\alpha = \frac{\epsilon}{2|D|G}$ and $\gamma = 1/\mu$, guarantees $\mathbb{E}\big[g_i(\hat{x}_i) - g_i(x_i^*)\big] \le \epsilon$ for each stage $i \in [n]$. The FO complexity of Algorithm 4 is $O\big(|D|G^3\log n/\epsilon^2\big)$.

Proof. In case $i = \mathrm{prev}$, then by selecting $T = O\big(G^2/(\mu\epsilon)\big)$ and $\gamma = 1/\mu$, Theorem 6 implies that $\mathbb{E}[g_i(\hat{x}_i)] - g_i(x_i^*) \le \epsilon/2$. Up next we establish the claim for $\mathrm{prev} < i \le (1+\alpha)\,\mathrm{prev}$:
$$
\begin{aligned}
\mathbb{E}[g_i(\hat{x}_i)] - g_i(x_i^*) &= \mathbb{E}[g_i(\hat{x}_{\mathrm{prev}})] - g_i(x_i^*) \qquad (\hat{x}_i = \hat{x}_{\mathrm{prev}},\ \text{Step 8 of Algorithm 4})\\
&= \frac{1}{i}\sum_{j=1}^i \mathbb{E}[f_j(\hat{x}_{\mathrm{prev}})] - \frac{1}{i}\sum_{j=1}^i f_j(x_i^*)\\
&= \frac{\mathrm{prev}}{i}\big(\mathbb{E}[g_{\mathrm{prev}}(\hat{x}_{\mathrm{prev}})] - g_{\mathrm{prev}}(x_i^*)\big) + \frac{1}{i}\sum_{j=\mathrm{prev}+1}^i\big(\mathbb{E}[f_j(\hat{x}_{\mathrm{prev}})] - f_j(x_i^*)\big)\\
&\le \frac{\mathrm{prev}}{i}\Big(\mathbb{E}[g_{\mathrm{prev}}(\hat{x}_{\mathrm{prev}})] - g_{\mathrm{prev}}(x_{\mathrm{prev}}^*) + \underbrace{g_{\mathrm{prev}}(x_{\mathrm{prev}}^*) - g_{\mathrm{prev}}(x_i^*)}_{\le 0}\Big) + |D|G\,\frac{i-\mathrm{prev}}{i}\\
&\le \mathbb{E}[g_{\mathrm{prev}}(\hat{x}_{\mathrm{prev}})] - g_{\mathrm{prev}}(x_{\mathrm{prev}}^*) + |D|G\,\frac{i-\mathrm{prev}}{\mathrm{prev}}\\
&\le \frac{\epsilon}{2} + |D|G\,\frac{(1+\alpha)\mathrm{prev} - \mathrm{prev}}{\mathrm{prev}} = \frac{\epsilon}{2} + |D|G\alpha = \epsilon.
\end{aligned}
$$
To this end we have established that $\mathbb{E}[g_i(\hat{x}_i)] - g_i(x_i^*) \le \epsilon$ for each stage $i \in [n]$. We conclude the proof by upper bounding the overall FO complexity of Algorithm 4. An execution of Stochastic Gradient Descent at Step 4 of Algorithm 4 that computes an $\epsilon/2$-optimal solution requires $O\big(G^2/(\mu\epsilon)\big)$ FOs. Let us now upper bound the required number of executions of SGD. Let $k$ denote the number of times Algorithm 4 reached Step 4 over the course of $n$ stages.
Due to Step 3 of Algorithm 4, the stages at which SGD is executed grow geometrically, and thus
$$
\Big(1 + \frac{\epsilon}{2|D|G}\Big)^{k} \le n \;\Longrightarrow\; k \le \frac{\ln n}{\ln\big(1 + \frac{\epsilon}{2|D|G}\big)} \le \frac{\ln n}{1 - \frac{1}{1+\frac{\epsilon}{2|D|G}}} \le \frac{4|D|G\log n}{\epsilon}.
$$
Thus the overall FO complexity of Algorithm 4 is $O\big(G^3|D|\log n/\epsilon^2\big)$.

J EXPERIMENTAL DETAILS

In this section we present additional experimental evaluations on LIBSVM datasets.

Table 3: Optimality gap as the stages progress on a ridge regression problem (averaged over 10 independent runs). CSVRG performs the exact same number of FOs as SGD and slightly fewer than SGD-sparse. Katyusha and SVRG perform the exact same number of FOs. CSVRG/SGD/SGD-sparse perform roughly 4% of the FOs of Katyusha/SVRG.

In Tables 4 and 5 we present the parameters of the various methods used in our experimental evaluations. We remind that $\lambda = 10^{-3}$.

Table 4: Parameters used in our experiments.
- SGD (breast cancer / cod-rna / diabetes): step size $(t\lambda)^{-1}$, $T = 300$ on all three datasets.
- SGD-sparse: step size $(t\lambda)^{-1}$, $\alpha = 0.002$; $T = 414$ on breast cancer and $T = 480$ on cod-rna and diabetes.
- Katyusha: step size $1/(3L)$, 10 outer iterations, 100 inner iterations; $L = 0.0522$ (breast cancer), $L = 0.015$ (cod-rna), $L = 0.01262$ (diabetes).
- SVRG: step size $1/(3L)$, 10 outer iterations, 100 inner iterations; $L = 0.0522$ (breast cancer), $L = 0.015$ (cod-rna), $L = 0.01262$ (diabetes).
- CSVRG (all three datasets): step size at stage $i$: $(it\lambda)^{-1}$, $T_i = 100$, $\alpha = 0.3$.

Table 5: Parameters used in our experiments.
- SGD (german.numer / skin-nonskin): step size $(t\lambda)^{-1}$, $T = 300$ on both datasets.
- SGD-sparse: step size $(t\lambda)^{-1}$, $\alpha = 0.002$; $T = 414$ on german.numer and $T = 480$ on skin-nonskin.
- Katyusha: step size $1/(3L)$, 10 outer iterations, 100 inner iterations; $L = 0.0317$ (german.numer), $L = 0.068$ (skin-nonskin).
- SVRG: step size $1/(3L)$, 10 outer iterations, 100 inner iterations; $L = 0.0317$ (german.numer), $L = 0.068$ (skin-nonskin).
- CSVRG (both datasets): step size at stage $i$: $(it\lambda)^{-1}$, $T_i = 100$, $\alpha = 0.3$.

K EXPERIMENTAL EVALUATIONS WITH NEURAL NETWORKS

In this section we include additional experimental evaluations of CSVRG, SGD and SVRG for the continual finite-sum setting in the context of 2-layer neural networks and the MNIST dataset. More precisely, we experiment with neural networks with an input layer of size 784 (corresponding to the size of the inputs from the MNIST dataset), a hidden layer with 1000 nodes and a ReLU activation function, and an output layer with 10 nodes corresponding to the 10 MNIST classes. As the training loss we use the cross-entropy between the model's output and the one-hot encoding of the classes. We consider two different constructions of the data streams that are respectively presented in Section K.1 and Section K.2. For both settings the parameters used for the various methods are presented in Table 6. We also remark that the parameters of CSVRG and SGD were appropriately selected so that the two methods perform roughly the same number of FOs.

Table 6: Parameters used for training of the neural network.
- SGD: step size 0.001, $T = 1385$.
- SVRG: step size 0.05, 7 outer iterations, 40 inner iterations.
- CSVRG: step size at stage $i$: 0.001, $T_i = 600$, $\alpha = 0.01$.
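Before turning to the MNIST streams, we illustrate how the per-stage schedules of Tables 4 and 5 are used in the continual setting. The following minimal sketch runs the plain SGD baseline on a growing ridge-regression prefix with step size $(t\lambda)^{-1}$, $\lambda = 10^{-3}$ and $T = 300$ inner steps per stage; the synthetic data, the radius of $D$ and all helper names are our own stand-ins rather than the released experimental code.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, T, n, d, R = 1e-3, 300, 200, 10, 10.0  # lambda and T as in Table 4; R is the radius of D (our choice)
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def grad_f(x, j):
    # Gradient of the j-th ridge term f_j(x) = 0.5 * (a_j^T x - b_j)^2 + 0.5 * lam * ||x||^2.
    return A[j] * (A[j] @ x - b[j]) + lam * x

def project(x):
    # Projection onto D = {x : ||x||_2 <= R}, mirroring the Pi_D step of Algorithm 3.
    norm = np.linalg.norm(x)
    return x if norm <= R else x * (R / norm)

x_hat = np.zeros(d)
for i in range(1, n + 1):              # stage i: approximately minimize the prefix average g_i
    x = x_hat.copy()                   # warm-start from the previous stage's output
    for t in range(1, T + 1):
        j = rng.integers(i)            # sample uniformly among the first i functions
        x = project(x - grad_f(x, j) / (t * lam))  # step size (t * lambda)^{-1}
    x_hat = x                          # output for stage i, reported in the optimality-gap curves
```

The CSVRG schedule reported in Table 4 differs in that the step size at stage $i$ is $(it\lambda)^{-1}$ with $T_i = 100$ inner iterations.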
K.1 DATA STREAMS PRODUCED WITH THE STATIONARY DISTRIBUTION

In this experiment we create a stream of 3000 data points, where the data point at each stage $i$ is sampled uniformly at random (without replacement) from the MNIST dataset. Table 7 illustrates the training loss of each method across the different stages. As Table 7 reveals, in the early stages all methods achieve training loss close to 0 by overfitting the model, while in the later stages CSVRG and SVRG are able to achieve a much smaller loss than SGD. We remark that CSVRG is able to meet the latter goal with far fewer FO calls than SVRG, as demonstrated in Table 7.

Table 7: Training loss as the stages progress on a neural network trained on the MNIST dataset for the setting of sampling without replacement.

K.2 DATA STREAMS WITH ASCENDING LABELS

Motivated by continual learning, in which new tasks appear in a streaming fashion, we experiment with a sequence of MNIST data points in which new digits gradually appear. More precisely, we construct the stream of data points as follows:
1. For each of the classes $i \in \{0, \ldots, 9\}$, we randomly sample 300 data points.
2. For the first 600 stages we randomly mix the 0/1 data points.
3. The stages $\{601 + (i-2)\cdot 300, \ldots, 600 + (i-1)\cdot 300\}$ for $i \in \{2, \ldots, 9\}$ contain the data points of class $i$ (e.g. class 2 appears in stages $\{601, \ldots, 900\}$).

Table 8 illustrates the training loss of CSVRG and SGD for the various stages of the above data stream. As Table 8 reveals, CSVRG is able to achieve a significantly smaller loss than SGD with the same number of FOs. We also remark that the bumps appearing in both curves correspond to the stages at which a new digit is introduced.

Table 8: The training loss of CSVRG and SGD for the various stages.

In Table 9 we present the classification accuracy of the models respectively produced by CSVRG and SGD for the various stages. At each stage $i$ we report the accuracy of the produced model with respect to the categories revealed so far. The sudden drops in accuracy occur again at the stages where new digits are introduced. For example, at stage 601 the accuracy drops from roughly 1 to roughly 0.667, due to the fact that at stage 600 the accuracy is measured with respect to the 0/1 classification task, while at stage 601 it is measured with respect to the 0/1/2 classification task; at stage 601 both models misclassify all the examples of digit 2 and thus the accuracy drops to 2/3. In the left plot we report the classification accuracy with respect to the data used in the training set, while in the right plot we report the classification accuracy on unseen test data. Both plots reveal that CSVRG admits a noticeable advantage over SGD.

Table 9: Classification accuracy of CSVRG and SGD: training set (left) and test set (right).
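For concreteness, the stream of Section K.2 can be generated as in the minimal sketch below; the function name, the synthetic labels standing in for the MNIST targets, and the exact shuffling choices are our own assumptions, not the paper's released code.

```python
import numpy as np

def build_ascending_stream(labels, rng, per_class=300):
    """Stream of Section K.2: 300 random indices per digit; the first 600
    stages interleave digits 0 and 1, then each digit 2..9 occupies a
    contiguous block of 300 stages."""
    chosen = {c: rng.choice(np.where(labels == c)[0], size=per_class, replace=False)
              for c in range(10)}
    head = np.concatenate([chosen[0], chosen[1]])
    rng.shuffle(head)                            # stages 1..600: mixed 0/1 examples
    blocks = [chosen[c] for c in range(2, 10)]   # stages 601..900 hold digit 2, 901..1200 digit 3, ...
    return np.concatenate([head] + blocks)       # 3000 stages in total, one data point each

rng = np.random.default_rng(0)
fake_labels = rng.integers(0, 10, size=60000)    # stand-in for the true MNIST training labels
stream = build_ascending_stream(fake_labels, rng)
print(stream.shape)                              # (3000,)
```

At each stage $i$ the model is then updated on the prefix of the stream up to index $i$, exactly as in the continual finite-sum setting of (2).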