# Stochastic Optimization with Arbitrary Recurrent Data Sampling

William Powell *1, Hanbaek Lyu *1

*Equal contribution. 1 Department of Mathematics, University of Wisconsin-Madison, WI, USA. Correspondence to: William Powell, Hanbaek Lyu. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

**Abstract.** For obtaining optimal first-order convergence guarantees for stochastic optimization, it is necessary to use a recurrent data sampling algorithm that samples every data point with sufficient frequency. Most commonly used data sampling algorithms (e.g., i.i.d., MCMC, random reshuffling) are indeed recurrent under mild assumptions. In this work, we show that for a particular class of stochastic optimization algorithms, we do not need any further property (e.g., independence, exponential mixing, or reshuffling) beyond recurrence in data sampling to guarantee the optimal rate of first-order convergence. Namely, using regularized versions of Minimization by Incremental Surrogate Optimization (MISO), we show that for non-convex and possibly non-smooth objective functions with constraints, the expected optimality gap converges at an optimal rate $O(n^{-1/2})$ under general recurrent sampling schemes. Furthermore, the implied constant depends explicitly on the "speed of recurrence", measured by the expected amount of time to visit a farthest data point, either averaged (*target time*) or supremized (*hitting time*) over the initial locations. We discuss applications of our general framework to decentralized optimization and distributed non-negative matrix factorization.

## 1. Introduction

In this paper we consider the minimization of a non-convex weighted finite-sum objective $f : \mathbb{R}^p \to \mathbb{R}$:

$$\theta^* \in \arg\min_{\theta \in \Theta} \Big\{\, f(\theta) := \sum_{v \in V} f^v(\theta)\,\pi(v) \,\Big\} \tag{1}$$

where $\Theta \subseteq \mathbb{R}^p$ is a convex, but not necessarily compact, feasible set and $\theta$ represents the parameters of a model to be optimized. Here $V$ is a finite index set, where one can view each index $v \in V$ as representing (a batch of) data that can be accessed at once. Then $f^v(\theta)$ is the loss incurred using parameter $\theta$ with respect to the data at $v$, which is weighted by $\pi(v) \ge 0$ when forming the overall objective $f$ in (1). Without loss of generality, we assume the $\pi(v)$'s sum to one. When $\pi(v) \equiv \frac{1}{|V|}$, problem (1) becomes the classical finite-sum problem in the optimization literature. Instances of non-uniform $\pi$ arise when training a model with imbalanced data, as has been studied in (Steininger et al., 2021; Wang et al., 2022b; Sow et al., 2024).

We aim to solve this problem by developing an algorithm which produces iterative parameter updates $\theta_n$ given only access to an arbitrary sequence of data samples $(v_n)_{n \ge 1}$. In order to reach a first-order stationary point of (1) for general objectives, it is necessary to use a sampling algorithm that is *recurrent*, meaning that every data point is sampled infinitely often with sufficient frequency. Note that recurrence is satisfied by many common sampling schemes such as i.i.d. (independently and identically distributed) sampling, (irreducible) Markov chain Monte Carlo (MCMC), cyclic sampling (Bertsekas, 2011), and random reshuffling (Ying et al., 2017). The main question we ask in this work is the following: *Is there any class of stochastic optimization algorithms for which recurrent sampling is enough to obtain an optimal first-order convergence guarantee for (1)?*
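As a concrete reading of (1), the sketch below evaluates a weighted finite-sum objective; the quadratic per-index losses are hypothetical stand-ins chosen purely for illustration, since the paper allows general (possibly non-convex, non-smooth) $f^v$.

```python
import numpy as np

rng = np.random.default_rng(0)
V = range(10)                                   # finite index set of data
A = {v: rng.standard_normal((5, 3)) for v in V}
b = {v: rng.standard_normal(5) for v in V}
pi = np.ones(len(V)) / len(V)                   # weights pi(v), summing to one

def f_v(v, theta):
    """Per-index loss f^v(theta); a hypothetical least-squares choice."""
    r = A[v] @ theta - b[v]
    return 0.5 * r @ r

def f(theta):
    """Weighted finite-sum objective (1): sum over v of f^v(theta) * pi(v)."""
    return sum(f_v(v, theta) * pi[v] for v in V)

print(f(np.zeros(3)))
```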
In this paper, we show that for a class of suitable extensions of stochastic optimization algorithms known as Minimization by Incremental Surrogate Optimization (MISO) (Mairal, 2015), no additional property of a data sampling algorithm (e.g., independence, exponential mixing, reshuffling) other than recurrence is needed in order to guarantee convergence to first-order stationary points. Furthermore, we show that the rate of convergence depends crucially on either the averaged or supremized return time to the farthest data point, corresponding to the notions of *target time* and *hitting time* in Markov chain theory, respectively.

With the original MISO algorithm in (Mairal, 2013), even under general recurrent data sampling, we are able to obtain asymptotically optimal iteration complexity if we can use strongly convex surrogate functions. However, there is a significant technical bottleneck in showing asymptotic convergence to stationary points, which was classically established in (Mairal, 2015) in the case of i.i.d. data sampling. We find that using additional regularization helps improve the convergence rate and allows us to prove asymptotic convergence to stationary points under arbitrary recurrent data sampling. For these reasons, we propose a slight extension of MISO that we call Regularized Minimization by Incremental Surrogate Optimization (RMISO), which takes the following form:

Step 1. Sample $v_n$ according to a recurrent sampling algorithm.

Step 2. $g^{v_n}_n \leftarrow$ convex majorizing surrogate of $f^{v_n}$ at $\theta_{n-1}$; $g^v_n = g^v_{n-1}$ for $v \ne v_n$.

Step 3. $\bar g_n \leftarrow \sum_{v \in V} g^v_n\,\pi(v)$; compute
$$\theta_n \in \arg\min_{\theta \in \Theta}\ \big[\, \bar g_n(\theta) + \Psi(\|\theta - \theta_{n-1}\|) \,\big].$$

The algorithm maintains a list of majorizing surrogate functions, one for each data point $v$. At each step, a new data sample $v_n$ is drawn according to a recurrent data sampling algorithm. We then find a new majorizing convex surrogate $g^{v_n}_n$ that is tight at the current parameter $\theta_{n-1}$; all other surrogates are unchanged. The new parameter $\theta_n$ is then found by minimizing the empirical mean of the current surrogates plus a regularization term $\Psi(\|\theta - \theta_{n-1}\|)$ that penalizes large values of $\|\theta_n - \theta_{n-1}\|$. Many algorithms use some form of projection or regularization of this kind to handle dependent data (Lyu, 2023; Bhandari et al., 2018; Roy & Balasubramanian, 2023). The regularization allows one to control the bias introduced by dependent sampling schemes, and also to use the broader class of convex surrogates instead of requiring them to be strongly convex. The original MISO (Mairal, 2015) is recovered by omitting this regularization term. The particular choice of this term is crucial for the success of the analysis under the general recurrent data sampling setting.
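To fix ideas, here is a minimal sketch of the RMISO loop above for the unconstrained case $\Theta = \mathbb{R}^p$, with quadratic regularization $\Psi(t) = \frac{\rho}{2}t^2$ and prox-linear surrogates (introduced later in Section 4.1.2), for which Step 3 has a closed form. All names (`grad_f_v`, `sampler`, etc.) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rmiso(grad_f_v, sampler, pi, theta0, L, rho, n_iters):
    """Sketch of RMISO with prox-linear surrogates over Theta = R^p."""
    V = len(pi)
    anchors = [theta0.copy() for _ in range(V)]      # theta_{k_v(n)-1}, per v
    grads = [grad_f_v(v, theta0) for v in range(V)]  # stored surrogate slopes
    theta = theta0.copy()
    for _ in range(n_iters):
        v = sampler()                                # Step 1: recurrent sample
        anchors[v], grads[v] = theta.copy(), grad_f_v(v, theta)  # Step 2
        theta_bar = sum(p * a for p, a in zip(pi, anchors))
        grad_bar = sum(p * g for p, g in zip(pi, grads))
        # Step 3 in closed form: minimize the averaged quadratic surrogate
        # plus (rho/2)||theta - theta_{n-1}||^2 over R^p
        theta = (L * theta_bar + rho * theta - grad_bar) / (L + rho)
    return theta

# toy usage: f^v(x) = 0.5||x - c_v||^2, cyclic sampling over 4 data points
cs = [np.array([float(v), 1.0]) for v in range(4)]
it = iter(range(10**6))
theta = rmiso(lambda v, x: x - cs[v], lambda: next(it) % 4,
              [0.25] * 4, np.zeros(2), L=1.0, rho=1.0, n_iters=200)
print(theta)   # approaches the mean of the c_v's, here (1.5, 1.0)
```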
Applications of our work include distributed optimization over networks, where $V$ forms the vertex set of a connected graph $G = (V, E)$ and each vertex $v$ stores some data. Prior work (Johansson et al., 2010; 2007; Ram et al., 2009; Lopes & Sayed, 2007; Mao et al., 2020; Even, 2023; Sun et al., 2022) studies the performance of various optimization algorithms in this setting assuming the sequence $(v_n)_{n \ge 1}$ is a Markov chain on the graph $G$. Here $\pi$ is typically taken to be uniform, and it is frequently assumed that the Markov chain is an MCMC sampler converging to $\pi$; see (Sun et al., 2022; Johansson et al., 2010; Wang et al., 2022a).

In this setting we find, both theoretically and empirically, that convergence of our algorithm can be accelerated by choosing sampling schemes that guarantee a higher frequency of visits to each $v \in V$. Such schemes may be non-Markovian or not aperiodic, and so are not guaranteed to converge to a stationary distribution. Moreover, our analysis does not require $\pi$ to agree with the stationary measure of the sampling process if one exists. As remarked in (Even, 2023), this additional flexibility may be advantageous as it allows one to opt for more efficient sampling schemes whose stationary measure may not agree with the data-weighting distribution $\pi$. In our context, these are the schemes which minimize the measures of recurrence we define in the sequel.

### 1.1. Contribution

Our algorithms and analysis consider three cases, which we briefly summarize in the following bullet points.

- We show convergence rates of $O(n^{-1/2})$ for MISO with strongly convex surrogates or constant quadratic proximal regularization, matching the rate shown for SAG in (Even, 2023). The implied constant depends on the potentially much smaller target time rather than the hitting time.
- The same convergence rates hold for MISO with dynamic quadratic proximal regularization where, inspired by the dynamic step size used for SAG in (Even, 2023), the regularization parameter is adaptive to the state of the sampling process. Asymptotic convergence of stationarity measures in expectation is also proved.
- Convergence rates of $O(n^{-1/2}\log n)$ are shown for MISO with a diminishing search radius restriction, where averaged surrogates are minimized within a diminishing radius. We show almost sure convergence to stationarity for this method.
- We experimentally validate our results on the tasks of non-negative matrix factorization and logistic regression. We find that our method is robust to data heterogeneity, as it produces stable iterate trajectories while still maintaining fast convergence (see Sec. 4.2).

### 1.2. Related Work

MISO (Mairal, 2015) was originally developed to solve finite-sum problems under i.i.d. sampling and proceeds by repeatedly minimizing a surrogate of the empirical loss function. In (Mairal, 2015) it is shown that for MISO the expected objective optimality gap $\mathbb{E}[f(\theta_n) - f(\theta^*)]$ decays at rate $O(1/n)$ when the objective function is convex, and exponentially fast when it is strongly convex, just as batch gradient descent does (Bottou et al., 2018). For non-convex $f$, it is shown that the iterates produced by MISO converge to the set of stationary points of $f$ over a convex constraint set, but no convergence rate analysis is given. Convergence rates for non-convex objectives were later provided for unconstrained problems in (Qian et al., 2019), where it was shown that the expected gradient norm $\mathbb{E}[\|\nabla f(\theta_n)\|]$ decays at rate $O(n^{-1/2})$. This rate was matched for the constrained setting in (Karimi et al., 2022). However, both papers only consider i.i.d. sampling.

(R)MISO may be compared with Stochastic Averaged Gradient (SAG) (Schmidt et al., 2017), as both store the most recent information computed using the data $v$ and output new parameter updates $\theta_n$ depending on an average of this information over $V$. Recently in (Even, 2023), it was shown that for non-convex objectives SAG produces iterates such that the expected gradient norm decays at rate $O(n^{-1/2})$ under Markovian sampling.
In comparison, the expected gradient norm converges at rate $O(n^{-1/4})$ for other stochastic first-order methods such as Stochastic Gradient Descent (SGD) (Sun et al., 2018; Alacaoglu & Lyu, 2023; Even, 2023; Karimi et al., 2019). Other works devoted to the study of first-order optimization methods under Markovian sampling include (Beznosikov et al., 2023; Bhandari et al., 2018; Wang et al., 2022a; Huo et al., 2023; Lyu, 2023).

There has also been a recent focus on proving faster convergence for SGD using without-replacement sampling methods such as random reshuffling (Gürbüzbalaban et al., 2021; Ying et al., 2017). This has been further extended to variance-reduced algorithms (Huang et al., 2021; Malinovsky et al., 2023; Beznosikov & Takáč, 2023) and distributed optimization (Mishchenko et al., 2022; Horváth et al., 2022). New sampling algorithms that aim to improve over random reshuffling have been suggested in (Rajput et al., 2022; Lu et al., 2022a; Mohtashami et al., 2022). In particular, in (Lu et al., 2022b) the authors show that the convergence of SGD can be accelerated provided a certain concentration inequality holds, and they propose leveraging this using a greedy sample selection strategy.

To obtain our results, we adopt a new analytical approach which is inspired in part by the analysis of SAG in (Even, 2023). This strategy differs significantly from the mixing rate arguments used in the analysis of stochastic optimization methods with Markovian data (e.g., (Sun et al., 2018; Bhandari et al., 2018; Nagaraj et al., 2020; Lyu et al., 2020; 2022; Lyu, 2023; Alacaoglu & Lyu, 2023)). We give a short sketch of our proofs in Section 3.4 and a brief overview of mixing rate techniques and the challenges of adapting them to the analysis of MISO in Appendix B. We believe that these techniques may be of interest in their own right and may further contribute to analyzing other stochastic optimization methods with recurrent data sampling.

### 1.3. Notation

In this paper, we let $\mathbb{R}^p$ denote the ambient space for the parameter space $\Theta$, equipped with the standard inner product $\langle \cdot, \cdot \rangle$ and the induced Euclidean norm $\|\cdot\|$. For $\theta \in \mathbb{R}^p$ and $\varepsilon > 0$, we let $B_\varepsilon(\theta)$ denote the closed Euclidean ball of radius $\varepsilon$ centered at $\theta$. We let $\mathbb{1}(A)$ be the indicator function of an event $A$, which takes value 1 on $A$ and 0 on $A^c$. We denote $\pi_{\min} = \min_{v \in V} \pi(v)$. We let $a \wedge b = \min\{a, b\}$ for real numbers $a$ and $b$. For a set $X$ we let $|X|$ denote its cardinality.

## 2. Preliminary Definitions and Algorithm Statement

In this section we state the two main algorithms used to solve (1). To do this, we start by defining first-order surrogate functions, and then define a few random variables that will be important in both implementation and analysis.

**Definition 2.1** (First-order surrogates). A convex function $g : \mathbb{R}^p \to \mathbb{R}$ is a first-order surrogate function of $f$ at $\theta$ if (i) $g(\theta') \ge f(\theta')$ holds for all $\theta' \in \Theta$; and (ii) the approximation error $h := g - f$ is differentiable and $\nabla h$ is $L$-Lipschitz continuous for some $L > 0$; moreover, $h(\theta) = 0$ and $\nabla h(\theta) = 0$. We denote by $\mathcal{S}_L(f, \theta)$ the set of all first-order surrogates of $f$ at $\theta$ such that $\nabla h$ is $L$-Lipschitz. We further define $\mathcal{S}_{L,\mu}(f, \theta)$ to be the set of all surrogates $g \in \mathcal{S}_L(f, \theta)$ such that $g$ is $\mu$-strongly convex.
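As a quick sanity check of Definition 2.1, the following verifies numerically that the standard quadratic upper bound built from a gradient at an anchor point is a first-order surrogate of a hypothetical smooth loss; all specifics here are illustrative assumptions.

```python
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])
f = lambda x: 0.5 * x @ A @ x                 # hypothetical smooth loss
grad_f = lambda x: A @ x
L = np.linalg.eigvalsh(A).max()               # Lipschitz constant of grad f
a = np.array([1.0, -2.0])                     # anchor point theta

# g(x) = f(a) + <grad f(a), x - a> + (L/2)||x - a||^2, a convex majorizer
g = lambda x: f(a) + grad_f(a) @ (x - a) + 0.5 * L * np.sum((x - a) ** 2)

rng = np.random.default_rng(1)
xs = 5.0 * rng.standard_normal((1000, 2))
assert all(g(x) >= f(x) - 1e-9 for x in xs)   # (i): g majorizes f
assert np.isclose(g(a), f(a))                 # tightness: h(theta) = 0
eps = 1e-6                                    # grad h(theta) = 0, i.e. the
grad_g = np.array([(g(a + eps * e) - g(a - eps * e)) / (2 * eps)
                   for e in np.eye(2)])       # numerical gradient of g at a
assert np.allclose(grad_g, grad_f(a), atol=1e-4)
print("g is a first-order surrogate of f at the anchor")
```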
Certain properties of the data sampling process are crucial in our analysis, especially in proving Lemmas D.1, E.2, and E.4. Below we define the return time and the last passage time.

**Definition 2.2** (Return time). For $n \ge 0$ and $v \in V$, the time to return to data $v$ starting from time $n$ is defined as
$$\tau_{n,v} = \inf\{\, j \ge 1 : v_{n+j} = v \,\}. \tag{2}$$

That is, $\tau_{n,v}$ is the amount of time which one has to wait after time $n$ for the process to return to $v$. The return time may be viewed as a generalization of the return times of a Markov chain. Indeed, $\tau_{0,v} = \inf\{n \ge 1 : v_n = v\}$ agrees with the classical notion of return time from the Markov chain literature. This is closely related to the last passage time defined below.

**Definition 2.3** (Last passage time). For $n \ge 1$ and $v \in V$ we define the last passage time of $v$ before time $n$ as
$$k_v(n) = \sup\{\, j \le n : v_j = v \,\}. \tag{3}$$
If the process has not yet visited $v$, i.e., $\{j \le n : v_j = v\} = \emptyset$, then we set $k_v(n) = 1$.

**Algorithm 1** (Incremental Majorization-Minimization with Dynamic Proximal Regularization)

1. Input: initialize $\theta_0 \in \Theta$; $N > 0$; $\rho \ge 0$.
2. Option: Regularization $\in$ {Dynamic, Constant}.
3. Initialize surrogates $g^v_0 \in \mathcal{S}_{L,\mu}(f^v, \theta_0)$.
4. For $n = 1$ to $N$:
5. Sample a data point $v_n$.
6. Choose $g^{v_n}_n \in \mathcal{S}_{L,\mu}(f^{v_n}, \theta_{n-1})$; set $g^v_n = g^v_{n-1}$ for all $v \ne v_n$.
7. $\bar g_n \leftarrow \sum_{v \in V} g^v_n\,\pi(v)$.
8. If Regularization = Dynamic: $\rho_n \leftarrow \rho + \max_{v \in V}\,(n - k_v(n))$.
9. Else if Regularization = Constant: $\rho_n \leftarrow \rho$.
10. $\theta_n \in \arg\min_{\theta \in \Theta} \big[\, \bar g_n(\theta) + \frac{\rho_n}{2}\|\theta - \theta_{n-1}\|^2 \,\big]$.
11. End for.
12. Output: $\theta_N$.

**Algorithm 2** (Incremental Majorization-Minimization with Diminishing Radius)

1. Input: initialize $\theta_0 \in \Theta$; $N > 0$; radii $(r_n)_{n \ge 1}$.
2. Initialize surrogates $g^v_0 \in \mathcal{S}_L(f^v, \theta_0)$.
3. For $n = 1$ to $N$:
4. Sample a data point $v_n$.
5. Choose $g^{v_n}_n \in \mathcal{S}_L(f^{v_n}, \theta_{n-1})$; set $g^v_n = g^v_{n-1}$ for all $v \ne v_n$.
6. $\bar g_n \leftarrow \sum_{v \in V} g^v_n\,\pi(v)$.
7. $\theta_n \in \arg\min_{\theta \in \Theta \cap B_{r_n}(\theta_{n-1})} \bar g_n(\theta)$.
8. End for.
9. Output: $\theta_N$.

The last passage time $k_v(n)$ appears naturally, as it is the last time the surrogate for data point $v$ has been updated during the execution of either Algorithm 1 or 2. Thus, $g^v_n$ is a surrogate of $f^v$ at $\theta_{k_v(n)-1}$, and the corresponding surrogate error at this point, $h^v_n(\theta_{k_v(n)-1})$, and its gradient are equal to zero. We will use this fact crucially in the proof of the key lemma, Lemma D.1. Our algorithms are stated formally in Algorithms 1 and 2. In Algorithm 1, the regularization term uses proximal regularization (PR), while Algorithm 2 utilizes a diminishing radius (DR) restriction.

## 3. Main Results

### 3.1. Optimality Conditions

We now introduce the optimality conditions used in this paper and related quantities. Here we let $f$ denote a general objective function $f : \Theta \to \mathbb{R}$, but elsewhere $f$ refers to the objective function in (1) unless otherwise stated. For a given function $f$ and $\theta, \theta^* \in \Theta$, we define its directional derivative at $\theta^*$ in the direction $\theta - \theta^*$ as
$$\nabla f(\theta^*, \theta - \theta^*) := \lim_{\alpha \downarrow 0} \frac{f(\theta^* + \alpha(\theta - \theta^*)) - f(\theta^*)}{\alpha}. \tag{4}$$
A necessary first-order condition for $\theta^*$ to be a local minimum of $f$ is that $\nabla f(\theta^*, \theta - \theta^*) \ge 0$ for all $\theta \in \Theta$ (see (Mairal, 2015)). Thus we define the optimality of $f$ at $\theta^* \in \Theta$ as
$$\mathcal{O}f(\theta^*) := \sup_{\theta \in \Theta,\ \|\theta - \theta^*\| \le 1} \big( -\nabla f(\theta^*, \theta - \theta^*) \big). \tag{5}$$
Note that $\mathcal{O}f(\theta^*)$ is non-negative (since we may take $\theta = \theta^*$) and only positive if there exists some $\theta \in \Theta$ with $\nabla f(\theta^*, \theta - \theta^*) < 0$. Thus we say that $\theta^* \in \Theta$ is a stationary point of $f$ over $\Theta$ if $\mathcal{O}f(\theta^*) = 0$. If $f$ is differentiable and $\Theta$ is convex, this is equivalent to $-\nabla f(\theta^*)$ being in the normal cone of $\Theta$ at $\theta^*$; if $\theta^*$ is in the interior of $\Theta$, then it implies that $\nabla f(\theta^*) = 0$.
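For differentiable $f$, the directional derivative is $\nabla f(\theta^*, \theta - \theta^*) = \langle \nabla f(\theta^*), \theta - \theta^* \rangle$, so the measure (5) can be evaluated numerically as a small concave maximization. A minimal sketch, assuming a hypothetical box constraint $\Theta = [0, 1]^2$:

```python
import numpy as np
from scipy.optimize import minimize

def stationarity_gap(grad, theta, lo=0.0, hi=1.0):
    """Of(theta): sup of <-grad f(theta), theta' - theta> over theta' in the
    box [lo, hi]^p intersected with the unit ball around theta, as in (5)."""
    p = theta.size
    obj = lambda tp: float(grad @ (tp - theta))   # minimizing obj = sup of -obj
    ball = {"type": "ineq", "fun": lambda tp: 1.0 - np.sum((tp - theta) ** 2)}
    res = minimize(obj, theta, method="SLSQP",
                   bounds=[(lo, hi)] * p, constraints=[ball])
    return max(-res.fun, 0.0)                     # Of(theta) is non-negative

grad_f = lambda x: 2.0 * (x - np.array([2.0, 0.5]))  # f = ||x - (2, 0.5)||^2
for point in (np.array([1.0, 0.5]), np.array([0.2, 0.5])):
    print(point, stationarity_gap(grad_f(point), point))
# (1.0, 0.5): gap ~ 0, a stationary point on the boundary of the box
# (0.2, 0.5): gap > 0, not stationary
```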
For iterative algorithms, this stationary point condition can hardly be satisfied exactly in a finite number of iterations. A practically important question is how the worst-case number of iterations required to achieve an ε-approximate solution scales with the desired precision ε. We say that $\theta^* \in \Theta$ is an *ε-approximate stationary point* of $f$ over $\Theta$ if $\mathcal{O}f(\theta^*) \le \varepsilon$. This notion of ε-approximate solution is consistent with the corresponding notion for unconstrained problems: in fact, if $f$ is differentiable and $\theta^*$ is at distance at least one from the boundary $\partial\Theta$, then it reduces to $\|\nabla f(\theta^*)\| \le \varepsilon$. For each $\varepsilon > 0$, we then define the worst-case iteration complexity of an algorithm for solving (1) as
$$N_\varepsilon(\theta_0) := \inf\{\, n \ge 1 : \mathcal{O}f(\theta_n) \le \varepsilon \,\}, \tag{6}$$
where $(\theta_n)_{n \ge 0}$ is a sequence of iterates produced by the algorithm with initial estimate $\theta_0$.

### 3.2. Assumptions

In this subsection, we state our assumptions for establishing the main results. Throughout this paper, we denote by $\mathcal{F}_n$ the σ-algebra generated by the samples $v_1, \ldots, v_n$ and the parameters $\theta_0, \ldots, \theta_n$ produced by Algorithm 1 or 2. With this definition, $(\mathcal{F}_n)_{n \ge 1}$ defines a filtration. In what follows we will also define some important quantities in terms of the measure-theoretic definition of the $L^\infty$ norm for random variables:
$$\|X\|_\infty = \inf\{\, t > 0 : \mathbb{P}(|X| > t) = 0 \,\}. \tag{7}$$
This is due to the technical consideration that the conditional expectation $\mathbb{E}[\tau_{n,v} \mid \mathcal{F}_n]$ is random, and hence so is $\sup_{n \ge 1} \mathbb{E}[\tau_{n,v} \mid \mathcal{F}_n]$. Our analysis requires this supremum to be bounded by a non-random constant, while in the fully general case $\sup_{n \ge 1} \mathbb{E}[\tau_{n,v} \mid \mathcal{F}_n]$ may be unbounded.

We first state our main assumption on the sampling scheme.

**Assumption 3.1** (Recurrent data sampling). The sequence $(v_n)_{n \ge 1}$ of data samples defines a stochastic process which satisfies the following property: for each $v \in V$, $\sup_{n \ge 1} \big\|\mathbb{E}[\tau_{n,v} \mid \mathcal{F}_n]\big\|_\infty < \infty$, i.e., the expected return time conditioned on $\mathcal{F}_n$ is uniformly bounded.

Assumption 3.1 states that the data $(v_n)_{n \ge 1}$ are sampled in such a way that the expected time between visits to any particular data point is finite and uniformly bounded. Generalizing the notion of positive recurrence in Markov chain theory, we say a sampling algorithm is *recurrent* if Assumption 3.1 is satisfied. We emphasize that recurrence is the only requirement we make of the sampling process in order to prove the convergence rate guarantees in Theorem 3.8 and the asymptotic convergence in Theorem 3.9 (we do not assume independence, Markovian dependence, etc.). We include below a list of some commonly used recurrent sampling algorithms.

1. (i.i.d. sampling) Sampling data i.i.d. from a fixed distribution is the most common assumption in the literature (Mairal, 2013; 2015; Bottou et al., 2018; Schmidt et al., 2017; Johnson & Zhang, 2013). Suppose we sample $v_n$ i.i.d. from some distribution $\gamma$ on $V$. Then the $\tau_{n,v}$ are independent geometric random variables taking values in $\{1, 2, \ldots\}$ with success probability $\gamma(v)$, so $\mathbb{E}[\tau_{n,v} \mid \mathcal{F}_n] = 1/\gamma(v)$. In particular, if $\gamma$ is uniform, $\mathbb{E}[\tau_{n,v} \mid \mathcal{F}_n] = |V|$ for all $n$ and $v$.

2. (MCMC) Markov chain Monte Carlo methods (see, e.g., Ch. 3 of (Levin & Peres, 2017)) produce a Markov chain $(v_n)$ on $V$. If this chain is irreducible, then $\max_{v,w} \mathbb{E}_w[\tau_{0,v}]$ is finite (Levin & Peres, 2017). For any $n$, $v$, and initial distribution $\nu$, the Markov property implies $\mathbb{E}_\nu[\tau_{n,v} \mid \mathcal{F}_n] = \mathbb{E}_{v_n}[\tau_{0,v}]$. So any irreducible Markov chain satisfies Assumption 3.1.

3. (Cyclic sampling) In cyclic sampling, one samples data in order according to some enumeration until the dataset is exhausted; this process is then repeated until convergence. The authors of (Lu et al., 2022b) show that the iteration complexity of SGD can be improved from $O(\varepsilon^{-4})$ to $O(\varepsilon^{-3})$ using such methods. To see that Assumption 3.1 holds in this setting, we simply notice that $\tau_{n,v} \le |V|$ for all $n$ and $v$.
4. (Reshuffling) Reshuffling is similar to cyclic sampling except that the dataset is randomly permuted at the beginning of each epoch (Lu et al., 2022b). It was observed empirically in (Bottou, 2012) that random reshuffling performs better than i.i.d. sampling, and this was further studied in (Gürbüzbalaban et al., 2021; Lu et al., 2022b). The authors of (Lu et al., 2022b) show the same improvement in iteration complexity for SGD as for cyclic sampling. In this case, Assumption 3.1 is satisfied since $\tau_{n,v} \le 2|V|$.

In Markov chain theory, the quantity $\max_{v,w} \mathbb{E}_w[\tau_{0,v}]$ is commonly denoted $t_{\mathrm{hit}}$. Adapting this notion to general sampling schemes, when Assumption 3.1 holds we define
$$t_{\mathrm{hit}} := \max_{v \in V}\, \sup_{n \ge 1}\, \big\| \mathbb{E}[\tau_{n,v} \mid \mathcal{F}_n] \big\|_\infty. \tag{8}$$
Continuing the connection with Markov chains, we also let
$$t_\odot := \sup_{n \ge 1} \Big\| \sum_{v \in V} \mathbb{E}[\tau_{n,v} \mid \mathcal{F}_n]\, \pi(v) \Big\|_\infty, \tag{9}$$
where $\pi$ is as in (1). Note that if $(v_n)$ is an irreducible Markov chain, then the Markov property implies
$$t_\odot = \max_{w \in V} \sum_{v \in V} \mathbb{E}_w[\tau_{0,v}]\, \pi(v). \tag{10}$$
This is closely related to the *target time*, defined by
$$\sum_{v \in V} \mathbb{E}_w[\tau_v]\, \pi(v), \tag{11}$$
with the difference being that here $\tau_v = \inf\{n \ge 0 : v_n = v\}$ is the first hitting time of $v$ rather than the first return time to $v$. The random target lemma (Lemma 10.1 in (Levin & Peres, 2017)) states that if $\pi$ is the stationary distribution of the Markov chain, then the target time is independent of the starting state $w$. In this case, the quantities (10) and (11) differ only by one. This can be seen by first noting that $\mathbb{E}_w[\tau_v] = \mathbb{E}_w[\tau_{0,v}]$ if $v \ne w$. This leaves only the difference $\mathbb{E}_w[\tau_{0,w}]\pi(w) - \mathbb{E}_w[\tau_w]\pi(w)$. The second term is equal to zero, and the first equals one since, if $\pi$ is the unique stationary distribution for the chain, $\pi(w) = 1/\mathbb{E}_w[\tau_{0,w}]$ (Levin & Peres, 2017).

Figure 1. The lonely graph.

For transitive irreducible Markov chains, $t_{\mathrm{hit}}$ and $t_\odot$ are comparable (Levin & Peres, 2017). However, in other situations $t_\odot$ may be much smaller than $t_{\mathrm{hit}}$. For instance, consider the simple random walk on the graph in Figure 1, which we call the *lonely graph*, and let $\pi$ be its stationary distribution. In the lonely graph, $|V| - 1$ vertices form a clique and the remaining vertex has degree one. A random walk on this graph has worst-case hitting time $t_{\mathrm{hit}} = O(|V|^2)$: its value is tied to the lonely vertex of degree one, which has low probability of being visited. On the other hand, $t_\odot$ is only $O(|V|)$, since it depends on an average hitting time instead of the worst case. See Example 10.4 in (Levin & Peres, 2017) for more details. We again emphasize that in our general setting, we require neither that $(v_n)$ be a Markov chain nor that $\pi$ be a stationary measure for the process. The above analysis serves to motivate the definition given in (9) and to provide an example where $t_\odot$ may be much smaller than $t_{\mathrm{hit}}$.
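Returning to the lonely-graph example, the following is a rough Monte-Carlo sketch of the gap between $t_{\mathrm{hit}}$ and $t_\odot$ for the simple random walk; the graph size and replication counts are arbitrary illustrative choices, not the paper's experimental code.

```python
import numpy as np

def step(x, n, rng):
    # lonely graph: vertices 0..n-2 form a clique; vertex n-1 attaches to 0
    if x == n - 1:
        return 0
    nbrs = [w for w in range(n - 1) if w != x] + ([n - 1] if x == 0 else [])
    return int(rng.choice(nbrs))

def mean_return_time(start, target, n, rng, reps=100):
    total = 0
    for _ in range(reps):
        x, t = start, 0
        while True:
            x, t = step(x, n, rng), t + 1
            if x == target:
                break
        total += t
    return total / reps

n, rng = 12, np.random.default_rng(0)
deg = np.array([n - 1] + [n - 2] * (n - 2) + [1], dtype=float)
pi = deg / deg.sum()                        # stationary distribution
E = np.array([[mean_return_time(w, v, n, rng) for v in range(n)]
              for w in range(n)])           # estimates of E_w[tau_{0,v}]
print("t_hit  ~", E.max())                  # O(|V|^2): set by the lonely vertex
print("t_odot ~", (E @ pi).max())           # O(|V|): averaged over targets v
```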
We next list some assumptions on the functions $f^v$.

**Assumption 3.2** (Lower-bounded objective and directional derivatives). For all $v \in V$ and $\theta, \theta' \in \Theta$, the function $f^v$ is bounded below, i.e., $\inf_{\theta \in \Theta} f^v(\theta) > -\infty$. Moreover, the directional derivative $\nabla f^v(\theta, \theta' - \theta)$ exists.

Assumption 3.2 implies that the objective $f$ is bounded below. For the remainder of this paper we will denote $\Delta_0 := \bar g_0(\theta_0) - \inf_{\theta \in \Theta} f(\theta)$. It is important to note that if the initial surrogates $g^v_0$ are in $\mathcal{S}_L(f^v, \theta_0)$ (as is the case in both Algorithms 1 and 2), then $\bar g_0(\theta_0) = f(\theta_0)$, so $\Delta_0 = f(\theta_0) - \inf_{\theta \in \Theta} f(\theta)$. The regularity assumption in 3.2 was used in (Mairal, 2015) and is necessary for analyzing our algorithms with our definition of approximate stationarity.

For Algorithm 2 we make the following stronger but common assumption, which is crucial to our analysis.

**Assumption 3.3.** For each $v \in V$, the function $f^v$ is continuously differentiable and $\nabla f^v$ is $L$-Lipschitz continuous.

For simplicity, we assume that if Assumption 3.3 holds, then the Lipschitz constant of $\nabla f^v$ agrees with that of the corresponding approximation error $h^v$.

Finally, Assumption 3.4 states that the radii in Algorithm 2 decrease slowly, but not too slowly. This is analogous to the square summability of step sizes in gradient descent.

**Assumption 3.4** (Square-summable but non-summable radii). The sequence $(r_n)_{n \ge 1}$ is non-increasing, $\sum_{n=1}^\infty r_n = \infty$, and $\sum_{n=1}^\infty r_n^2 < \infty$.

### 3.3. Statement of main results

In this section we state the two main results of this work. We consider the following three cases, corresponding to the three variants of our main algorithm.

**Case 3.5.** Assumptions 3.1-3.2 hold. Use Algorithm 1 with Regularization = Constant.

**Case 3.6.** Assumptions 3.1-3.2 hold. Use Algorithm 1 with Regularization = Dynamic.

**Case 3.7.** Assumptions 3.1-3.4 hold. Use Algorithm 2.

Notice that in Case 3.5, if one chooses $\rho = 0$ and the surrogates $g^v_n$ are in $\mathcal{S}_{L,\mu}(f^v, \theta_{n-1})$ for some $\mu > 0$, then Algorithm 1 reduces to the classical MISO algorithm in (Mairal, 2015).

Our first main result, Theorem 3.8, gives worst-case upper bounds on the expected rate of convergence to optimality. For each of the Cases 3.5-3.7 we give rates of convergence for the objective function $f$.

**Theorem 3.8** (Rate of convergence to stationarity). Algorithms 1 and 2 satisfy the following for any $N \ge 1$:

(i) Assume Case 3.5. Further assume $\rho = 0$ and Algorithm 1 is run with $\mu$-strongly convex surrogates. Then
$$\min_{1 \le n \le N} \mathbb{E}[\mathcal{O}f(\theta_n)] \le L t_\odot \sqrt{\frac{2\Delta_0}{\mu N}}. \tag{12}$$

(ii) Assume Case 3.5. If $\rho \le L t_\odot \le \rho + \mu$, then
$$\min_{1 \le n \le N} \mathbb{E}[\mathcal{O}f(\theta_n)] \le 2\sqrt{\frac{\Delta_0 L t_\odot}{N}}. \tag{13}$$

(iii) Assume Case 3.6. If $\rho \le L t_\odot \le \rho + \mu$, then
$$\min_{1 \le n \le N} \mathbb{E}[\mathcal{O}f(\theta_n)] \le 2\sqrt{\frac{\Delta_0 \big( L t_\odot + (2 t_{\mathrm{hit}} + 1)\log_2(4|V|) \big)}{N}}. \tag{14}$$

(iv) Assume Case 3.7. Let $C_N = \sum_{n=1}^N r_n^2$. Then
$$\min_{1 \le n \le N} \mathbb{E}[\mathcal{O}f(\theta_n)] \le \frac{2L}{\pi_{\min}} \cdot \frac{\sqrt{C_N}\,\Delta_0 + 3 + t_\odot C_N L}{\sum_{n=1}^N (1 \wedge r_{n+1})}. \tag{15}$$

To the best of our knowledge, the rates of convergence given in Theorem 3.8 are entirely new for first-order algorithms with general recurrent data sampling. In contrast to the convergence result for SAG in (Even, 2023), which depends on the hitting time $t_{\mathrm{hit}}$, Algorithm 1 with constant proximal regularization (Case 3.5) depends on the possibly much smaller target time $t_\odot$. See Table 1 for a comparison of our results with other works concerning non-convex optimization with non-i.i.d. data.

We remark that items (i) and (ii) show the potential benefit of using proximal regularization even if the surrogates are already strongly convex. For non-convex $f$, the strong convexity parameter $\mu$ of any surrogate cannot be larger than the Lipschitz constant $L$. However, we are free to choose $\rho$, and an optimal choice of $\rho$ results in dependence on $\sqrt{L t_\odot}$ in (ii) instead of the linear dependence in (i). It is not necessary, however, to choose $\rho$ in the range given in items (ii) and (iii): we include a more general version of Theorem 3.8, Theorem A.1, in Appendix A, which shows that these convergence rates hold for arbitrary $\rho$ and $\mu$ so long as $\rho + \mu > 0$. Overall, our theory suggests that one can improve convergence by using sampling schemes that cover the dataset most efficiently, i.e., those that minimize $t_\odot$ and $t_{\mathrm{hit}}$. See the remarks in Appendix A for more on this topic. We also remark that Theorem 1 in (Even, 2023) gives a lower bound in terms of $t_{\mathrm{hit}}$.
While our results depend on $t_\odot$, they do not contradict this lower bound; see Appendix A for more discussion.

Our second result, Theorem 3.9, concerns the asymptotic behavior of Algorithms 1 and 2.

**Theorem 3.9** (Global convergence). Algorithms 1 and 2 have the following asymptotic convergence properties:

(i) For Case 3.6, we have $\lim_{n \to \infty} \mathbb{E}[\mathcal{O}f(\theta_n)] = 0$ and $\lim_{n \to \infty} \mathbb{E}[\mathcal{O}f(\theta_n)^2] = 0$.

(ii) For Case 3.7, almost surely every limit point of $(\theta_n)_{n \ge 1}$ is stationary for $f$ over $\Theta$.

Theorem 3.9 shows that although RMISO with diminishing radius requires computing a projection at each step and has higher-order dependence on $t_\odot$, it enjoys the strongest asymptotic guarantees. RMISO with dynamic proximal regularization is somewhere in the middle: it has lower-order dependence on $t_\odot$ than the diminishing radius version, but also depends on $t_{\mathrm{hit}}$. However, we are able to show that both the first and second moments of the optimality gap converge to zero. In particular, notice that in the familiar case where $\Theta = \mathbb{R}^p$ and $f$ is differentiable, (i) implies that $\lim_{n \to \infty} \mathbb{E}[\|\nabla f(\theta_n)\|] = 0$. Though Algorithm 1 with constant proximal regularization is the simplest of our proposed methods and has the best dependence on the constants $L$ and $t_\odot$, it appears that stronger regularization schemes, as in Cases 3.6 and 3.7, are needed to obtain asymptotic convergence guarantees. We refer the reader to Appendix A, as well as Remark E.3 in Appendix E, for a more detailed discussion of the technical difficulties of proving asymptotic convergence for Case 3.5, and of proving it in the almost sure sense for Case 3.6.

### 3.4. Sketch of proofs

In this section we provide a short sketch of our analysis in order to convey the main ideas. Let $\bar h_n = \sum_{v \in V} h^v_n\,\pi(v)$ be the average surrogate approximation error at step $n$. The key step in our analysis (Lemma D.1) is to prove
$$\sum_{n=1}^N c_n\, \mathbb{E}[\|\nabla \bar h_n(\theta_n)\|] = O\Big( \Big(\sum_{n=1}^N c_n^2\Big)^{1/2} \Big), \tag{16}$$
where $(c_n)$ is any non-increasing sequence. For simplicity, assume that the surrogate functions $g^v_n$ and the objective functions $f^v$ are differentiable and that we are in the unconstrained setting, i.e., $\Theta = \mathbb{R}^p$. If $\theta_n$ is a minimizer of $\bar g_n$, then $\nabla \bar g_n(\theta_n) = 0$. Therefore $\|\nabla \bar h_n(\theta_n)\| = \|\nabla f(\theta_n)\|$, and (16) implies
$$\sum_{n=1}^N c_n\, \mathbb{E}[\|\nabla f(\theta_n)\|] = O\Big( \Big(\sum_{n=1}^N c_n^2\Big)^{1/2} \Big). \tag{17}$$
If we take $c_n \equiv 1$, we can then conclude that $\min_{1 \le n \le N} \mathbb{E}[\|\nabla f(\theta_n)\|] = O(N^{-1/2})$. The addition of regularization introduces a complication because we are no longer directly minimizing $\bar g_n$ over the entire feasible set, and so do not have $\nabla \bar g_n(\theta_n) = 0$; however, as we argue in Section D, the added regularization is asymptotically not too strong.

The main idea is to focus on the individual error gradients, because each has the property $\nabla h^v_n(\theta_{k_v(n)-1}) = 0$. By Definition 2.1, we then have $\|\nabla h^v_n(\theta_n)\| \le L\|\theta_n - \theta_{k_v(n)-1}\|$, so to show (16) we only need to prove
$$\sum_{n=1}^N c_n\, \mathbb{E}[\|\theta_n - \theta_{k_v(n)-1}\|] = O\Big( \Big(\sum_{n=1}^N c_n^2\Big)^{1/2} \Big). \tag{18}$$
The triangle inequality and the monotonicity of $(c_n)$ imply
$$c_n \|\theta_n - \theta_{k_v(n)-1}\| \le \sum_{i=k_v(n)}^{n} c_i\, \|\theta_i - \theta_{i-1}\|, \tag{19}$$
so we can relate the error $\nabla h^v_n(\theta_n)$ to the sequence $(\|\theta_n - \theta_{n-1}\|^2)_{n \ge 1}$. The crucial role played by the regularization or strong convexity is that we can prove the following iterate stability: $\sum_{n=1}^\infty \|\theta_n - \theta_{n-1}\|^2 < \infty$ a.s. This idea was also used in (Lyu & Li, 2023; Lyu et al., 2022; Lyu, 2023). Under Assumption 3.1 one can expect $\mathbb{E}[n - k_v(n)] \le M$ for some $M$. One can then intuitively view the expectation of the right-hand side of (19) as behaving like $\sum_{i=n-M}^{n} c_i\, \mathbb{E}[\|\theta_i - \theta_{i-1}\|]$. Summing this from $n = 1$ to $N$, we conclude that for a positive constant $C$,
$$\sum_{n=1}^N c_n\, \mathbb{E}[\|\nabla h^v_n(\theta_n)\|] \le C \sum_{n=1}^N c_n\, \mathbb{E}[\|\theta_n - \theta_{n-1}\|]. \tag{20}$$
By Cauchy-Schwarz and the iterate stability obtained through regularization, the right-hand side is $O\big( (\sum_{n=1}^N c_n^2)^{1/2} \big)$.
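Making this last Cauchy-Schwarz step explicit (a standard estimate, included here for completeness, using Jensen's inequality $\mathbb{E}[X]^2 \le \mathbb{E}[X^2]$ together with the iterate stability above):

```latex
\sum_{n=1}^{N} c_n\,\mathbb{E}\big[\|\theta_n-\theta_{n-1}\|\big]
\;\le\;\Big(\sum_{n=1}^{N} c_n^{2}\Big)^{1/2}
\Big(\sum_{n=1}^{N}\mathbb{E}\big[\|\theta_n-\theta_{n-1}\|^{2}\big]\Big)^{1/2}
\;=\;O\Big(\Big(\sum_{n=1}^{N} c_n^{2}\Big)^{1/2}\Big).
```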
Full details of our analysis are given in Appendices C, D, and E.

## 4. Applications and Experiments

### 4.1. Applications

In this section we give some applications of our general framework. These include applications to matrix factorization, as well as a double-averaging version of RMISO derived by using prox-linear surrogates.

Table 1. Comparison of iteration complexity for non-convex optimization with non-i.i.d. data. The notation $\tilde O(\cdot)$ omits logarithmic factors. Here $t_{\mathrm{mix}}$ represents dependence on the mixing time of the Markov chain.

| Algorithm | Iteration complexity | Memory | Sampling | Sampling dependence |
| --- | --- | --- | --- | --- |
| AdaGrad (Alacaoglu & Lyu, 2023) | $\tilde O(\varepsilon^{-4})$ | $O(1)$ | Markovian | $O(\sqrt{t_{\mathrm{mix}}})$ |
| SGD (Sun et al., 2018; Even, 2023) | $\tilde O(\varepsilon^{-4})$ | $O(1)$ | Markovian | $O(\sqrt{t_{\mathrm{mix}}})$ |
| SGD (Mishchenko et al., 2020; Lu et al., 2022b) | $\tilde O(\varepsilon^{-3})$ | $O(1)$ | Reshuffling | $O(\sqrt{|V|})$ |
| SAG (Even, 2023) | $\tilde O(\varepsilon^{-2})$ | $O(|V|)$ | Markovian | $O(\sqrt{t_{\mathrm{hit}}})$ |
| RMISO, Case 3.5 | $\tilde O(\varepsilon^{-2})$ | $O(|V|)$ | Recurrent | $O(\sqrt{t_\odot})$ |
| RMISO, Case 3.6 | $\tilde O(\varepsilon^{-2})$ | $O(|V|)$ | Recurrent | $O(\sqrt{t_\odot + t_{\mathrm{hit}}})$ |
| RMISO, Case 3.7 | $\tilde O(\varepsilon^{-2})$ | $O(|V|)$ | Recurrent | $O(t_\odot)$ |

#### 4.1.1. Distributed Matrix Factorization

Before beginning this section, we define some additional notation. For a collection of matrices $\{A_v\}_{v \in V} \subseteq \mathbb{R}^{n \times m}$, we let $[A_v;\, v \in V]$ denote their concatenation along the horizontal axis. For a set $\Theta \subseteq \mathbb{R}^{n \times m}$ we let $\Theta^V = \{[A_v;\, v \in V] : A_v \in \Theta \text{ for all } v\}$.

We consider the matrix factorization loss
$$f(W, H) = \frac{1}{2}\|X - WH\|_F^2 + \alpha\|H\|_1,$$
where $X \in \mathbb{R}^{p \times d}$ is a given data matrix to be factored into the product of a dictionary $W \in \Theta_W \subseteq \mathbb{R}^{p \times r}$ and a code $H \in \Theta_H \subseteq \mathbb{R}^{r \times d}$, with $\alpha \ge 0$ being the $L_1$-regularization parameter for $H$. Here $\Theta_W$ and $\Theta_H$ are convex constraint sets. Suppose we have a connected graph $G = (V, E)$ where each vertex stores a matrix $X_v \in \mathbb{R}^{p \times d}$. For each $v \in V$ define the loss function
$$f^v(W) = \inf_{H \in \Theta_H}\ \frac{1}{2}\|X_v - WH\|_F^2 + \alpha\|H\|_1, \tag{21}$$
which is the minimum reconstruction error for factorizing $X_v$ using the dictionary $W$. In this context, the empirical loss to be minimized is $\frac{1}{|V|}\sum_{v \in V} f^v(W)$. Note that this problem is not convex. Indeed, letting $X = [X_v;\, v \in V]$, it is equivalent to finding $(W^*, H^*) \in \Theta_W \times \Theta_H^V$ minimizing $\frac{1}{2}\|X - WH\|_F^2 + \alpha\|H\|_1$, which is a constrained non-convex optimization problem with a bi-convex loss function.

In order to apply RMISO, let $W_{n-1} \in \Theta_W$ be the previous dictionary and denote
$$H^v_n \in \arg\min_{H \in \Theta_H}\ \frac{1}{2}\|X_v - W_{n-1}H\|_F^2 + \alpha\|H\|_1 \tag{22}$$
if $v_n = v$, and otherwise $H^v_n = H^v_{n-1}$. Then the function $g^v_n(W) := \frac{1}{2}\|X_v - W H^v_n\|_F^2 + \alpha\|H^v_n\|_1$ is a majorizing surrogate of $f^v$ at $W_{n-1}$ and belongs to $\mathcal{S}_{L'}(f^v, W_{n-1})$ for some $L' > 0$ (see Ex. G.5). Then Algorithms 1 and 2 can be used with these surrogates.
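A hedged sketch of the surrogate update (22): when a node is visited, it refreshes its code $H^v_n$ by column-wise L1-regularized non-negative least squares (taking $\Theta_H$ to be the non-negative orthant for illustration). sklearn's `Lasso` scales the quadratic term by $1/(2\,n_{\text{samples}})$, so `alpha` is rescaled accordingly; this is illustrative and not the authors' experimental code.

```python
import numpy as np
from sklearn.linear_model import Lasso

def update_code(X_v, W, alpha):
    """Solve (22) for one node: min_H 0.5||X_v - W H||_F^2 + alpha||H||_1,
    H >= 0, one column of X_v at a time."""
    p, d = X_v.shape
    r = W.shape[1]
    H = np.zeros((r, d))
    # sklearn's Lasso minimizes (1/(2p))||y - Wh||^2 + a||h||_1, so a = alpha/p
    lasso = Lasso(alpha=alpha / p, positive=True, fit_intercept=False,
                  max_iter=5000)
    for j in range(d):
        H[:, j] = lasso.fit(W, X_v[:, j]).coef_
    return H

def surrogate_value(X_v, W, H_v, alpha):
    """g_n^v(W) = 0.5||X_v - W H_v||_F^2 + alpha||H_v||_1, which majorizes
    f^v at the dictionary used to compute H_v (Section 4.1.1)."""
    return (0.5 * np.linalg.norm(X_v - W @ H_v, "fro") ** 2
            + alpha * np.abs(H_v).sum())
```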
#### 4.1.2. Prox-Linear Surrogates

Suppose each $f^v$ is differentiable and has $L$-Lipschitz continuous gradients. Then the functions
$$g^v(\theta) = f^v(\theta') + \langle \nabla f^v(\theta'),\, \theta - \theta' \rangle + \frac{L}{2}\|\theta - \theta'\|^2 \tag{23}$$
are in $\mathcal{S}_{2L,L}(f^v, \theta')$ (see Example G.2). Further suppose that $\pi$ is the uniform distribution. Using these surrogates, the update according to Algorithm 1 is
$$\bar\theta_{n-1} \leftarrow \frac{1}{|V|}\sum_{v \in V} \theta_{k_v(n)-1}, \qquad \bar g_{n-1} \leftarrow \frac{1}{|V|}\sum_{v \in V} \nabla f^v(\theta_{k_v(n)-1}),$$
$$\hat\theta_{n-1} \leftarrow \frac{\rho_n}{L + \rho_n}\,\theta_{n-1} + \frac{L}{L + \rho_n}\,\bar\theta_{n-1}, \qquad \theta_n \leftarrow \mathrm{Proj}_\Theta\Big( \hat\theta_{n-1} - \frac{1}{L + \rho_n}\,\bar g_{n-1} \Big). \tag{24}$$
Compared with MISO (obtained by setting $\rho_n = 0$ in (24)), we see that the additional proximal regularization has the effect of further averaging the iterates, putting additional weight $\frac{\rho_n}{L + \rho_n}$ on the most recent parameter $\theta_{n-1}$.

### 4.2. Experiments

#### 4.2.1. Distributed Nonnegative Matrix Factorization

In this section we compare the performance of the distributed matrix factorization version of RMISO from Sec. 4.1.1 against other well-known optimization algorithms. We consider a randomly drawn collection of 5000 images from the MNIST (Deng, 2012) dataset, where each sample $X_v$ represents a subset of images. In all experiments, we set $\alpha = 1/28$ and $r = 15$. The dictionary $W$ is constrained to be non-negative with rows of Euclidean norm at most one. The set of vertices $V$ is arranged in a cycle graph with $|V| = 55$, with each vertex restricted to contain only samples with the same label. We consider two different sampling algorithms: the standard random walk, for which $t_\odot$ and $t_{\mathrm{hit}}$ are both $O(|V|^2)$, and cyclic sampling, for which both are $O(|V|)$. Our theory suggests we should expect better performance for cyclic sampling versus the random walk. We compare all three versions of RMISO, namely constant proximal regularization (RMISO-CPR), dynamic proximal regularization (RMISO-DPR), and diminishing radius (RMISO-DR), with MISO (Mairal, 2015), the online nonnegative matrix factorization (ONMF) algorithm of (Mairal et al., 2010), and AdaGrad (Duchi et al., 2011).

It is not guaranteed that the surrogates $g^v_n(W) := \frac{1}{2}\|X_v - W H^v_n\|_F^2 + \alpha\|H^v_n\|_1$ are strongly convex. However, while running the experiments, we find that the Hessian of the averaged surrogate is positive definite after only a few iterations, and thus the results for MISO are also supported by Theorem 3.8(ii). This phenomenon is also discussed in Assumption B of (Mairal et al., 2010).

Figure 2. Plot of reconstruction error against iteration number for NMF using two sampling algorithms ((a) random walk; (b) cyclic sampling). Results show the performance of the algorithms RMISO, MISO (Algorithm 1 with $\rho_n = 0$), ONMF, and AdaGrad in factorizing a collection of MNIST (Deng, 2012) data matrices.

We ran the experiment ten times with ten different random seeds and plot the average reconstruction error versus iteration number in Figure 2. We see that RMISO outperforms ONMF and shows competitive performance against AdaGrad for both sampling schemes. As expected, there is a dramatic performance improvement under cyclic sampling versus the random walk.

#### 4.2.2. Logistic Regression with Non-convex Regularization

We consider logistic regression with the non-convex regularization term $R(\theta) = 0.01\sum_{i=1}^p \frac{\theta_i^2}{1 + \theta_i^2}$, where $\theta \in \mathbb{R}^p$ is the parameter to be optimized. We use the a9a dataset (Becker & Kohavi, 1996). Here we consider the random walk on two separate graph topologies: the complete graph and the lonely graph of Figure 1. Both graphs have $|V| = 50$, and each vertex stores only data with the same label. We compared eight different optimization algorithms: (1) the prox-linear version of Algorithm 1, given in (24), with non-zero proximal regularization (RMISO-CPR); (2) Algorithm 1
with dynamic proximal regularization (RMISO-DPR); (3) Algorithm 2 (RMISO-DR); (4) MISO (Algorithm 1 with $\rho_n = 0$); (5) AdaGrad (Duchi et al., 2011); (6) Markov chain SAG (MCSAG) (Even, 2023); (7) SGD with decaying step size; and (8) SGD-HB (SGD with momentum).

Figure 3. Plot of objective loss and standard deviation against the test dataset for a9a, for two graph topologies ((a) lonely graph; (b) complete graph) and various optimization algorithms: RMISO, MISO (Algorithm 1 with $\rho_n = 0$), AdaGrad, MCSAG, SGD, Adam, and SGD-HB.

We ran each experiment with ten different seeds. The results plotted in Figure 3 show the average loss against the test dataset for both graph topologies over these ten runs, together with a shaded region whose boundaries are given by the standard deviation. We see that RMISO-DPR and MCSAG display poorer performance on the lonely graph, as $t_{\mathrm{hit}}$ increases from $O(|V|)$ to $O(|V|^2)$. The performance of RMISO-CPR, RMISO-DR, and MISO is unchanged, since each depends only on $t_\odot$, with RMISO-DR and MISO performing best and only narrowly outperforming RMISO-CPR. Both SGD-HB and AdaGrad converge quickly in both settings but suffer from unstable trajectories compared to RMISO and MCSAG. A more stable algorithm may be advantageous in situations where the value of the objective function cannot easily be computed; see (Nesterov & Shikhman, 2015) for an example of such a situation.

## 5. Conclusion

In this paper we have established convergence and complexity results for our proposed extensions of MISO under the general assumption of recurrent data sampling. Our results show that convergence speed depends crucially on the averaged or supremized expected time to return to a given data point. In particular, the constant proximal regularization version of our algorithm depends only on the averaged target time, a potentially large improvement over the hitting time. Both our analysis and our numerical experiments display the benefit of using possibly non-i.i.d. or non-Markovian sampling schemes in order to accelerate convergence.

## Acknowledgements

This work was supported in part by NSF Awards DMS-2023239 and DMS-2206296. The authors thank Qiaomin Xie for helpful comments.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

Alacaoglu, A. and Lyu, H. Convergence of first-order methods for constrained nonconvex optimization with dependent data. In Proceedings of the 40th International Conference on Machine Learning, pp. 458-489. PMLR, 2023.

Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

Becker, B. and Kohavi, R. Adult. UCI Machine Learning Repository, 1996.

Bertsekas, D. P. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010(1-38):3, 2011.

Beznosikov, A. and Takáč, M. Random-reshuffled SARAH does not need full gradient computations. Optimization Letters, 2023.

Beznosikov, A., Samsonov, S., Sheshukova, M., Gasnikov, A., Naumov, A., and Moulines, E. First order methods with Markovian noise: from acceleration to variational inequalities. arXiv preprint arXiv:2305.15938, 2023.

Bhandari, J., Russo, D., and Singal, R. A finite time analysis of temporal difference learning with linear function approximation. In Proceedings of the 31st Conference On Learning Theory, pp. 1691-1692. PMLR, 2018.
Bottou, L. Stochastic Gradient Descent Tricks, pp. 421-436. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223-311, 2018.

Davis, D. and Drusvyatskiy, D. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207-239, 2019.

Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141-142, 2012.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121-2159, 2011.

Even, M. Stochastic gradient descent under Markovian sampling schemes. In Proceedings of the 40th International Conference on Machine Learning, pp. 9412-9439. PMLR, 2023.

Gürbüzbalaban, M., Ozdaglar, A., and Parrilo, P. A. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, 186(1):49-84, 2021.

Horst, R. and Thoai, N. V. DC programming: Overview. Journal of Optimization Theory and Applications, 103(1):1-43, 1999.

Horváth, S., Sanjabi, M., Xiao, L., Richtárik, P., and Rabbat, M. FedShuffle: Recipes for better use of local work in federated learning. Transactions on Machine Learning Research, 2022.

Huang, X., Yuan, K., Mao, X., and Yin, W. An improved analysis and rates for variance reduction under without-replacement sampling orders. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2021.

Huo, D. L., Chen, Y., and Xie, Q. Bias and extrapolation in Markovian linear stochastic approximation with constant stepsizes. ACM SIGMETRICS Performance Evaluation Review, 51(1):81-82, 2023.

Johansson, B., Rabi, M., and Johansson, M. A simple peer-to-peer algorithm for distributed optimization in sensor networks. In 46th IEEE Conference on Decision and Control, pp. 4705-4710, 2007.

Johansson, B., Rabi, M., and Johansson, M. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157-1170, 2010.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2013.

Karimi, B., Miasojedow, B., Moulines, E., and Wai, H.-T. Non-asymptotic analysis of biased stochastic approximation scheme. In Proceedings of the Thirty-Second Conference on Learning Theory, pp. 1944-1974. PMLR, 2019.

Karimi, B., Wai, H.-T., Moulines, E., and Li, P. Minimization by incremental stochastic surrogate optimization for large scale nonconvex problems. In Dasgupta, S. and Haghtalab, N. (eds.), Proceedings of The 33rd International Conference on Algorithmic Learning Theory, volume 167, pp. 606-637. PMLR, 2022.

Levin, D. and Peres, Y. Markov Chains and Mixing Times. MBK. American Mathematical Society, 2017.

Lopes, C. G. and Sayed, A. H. Incremental adaptive strategies over distributed networks. IEEE Transactions on Signal Processing, 55(8):4064-4077, 2007.

Lu, Y., Guo, W., and De Sa, C. M. GraB: Finding provably better data permutations than random reshuffling. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2022a.

Lu, Y., Meng, S. Y., and De Sa, C. A general analysis of example-selection for stochastic gradient descent. In International Conference on Learning Representations, 2022b.
Lyu, H. Stochastic regularized majorization-minimization with weakly convex and multi-convex surrogates. arXiv preprint arXiv:2201.01652, 2023.

Lyu, H. and Li, Y. Block majorization-minimization with diminishing radius for constrained nonconvex optimization. arXiv preprint arXiv:2012.03503, 2023.

Lyu, H., Needell, D., and Balzano, L. Online matrix factorization for Markovian data and applications to network dictionary learning. Journal of Machine Learning Research, 21(251):1-49, 2020.

Lyu, H., Strohmeier, C., and Needell, D. Online nonnegative CP-dictionary learning for Markovian data. The Journal of Machine Learning Research, 23(1):6630-6679, 2022.

Mairal, J. Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems, pp. 2283-2291, 2013.

Mairal, J. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829-855, 2015.

Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11(2):19-60, 2010.

Malinovsky, G., Sailanbayev, A., and Richtárik, P. Random reshuffling with variance reduction: New analysis and better rates. In Proceedings of the 39th Conference on Uncertainty in Artificial Intelligence, pp. 1347-1357. PMLR, 2023.

Mao, X., Yuan, K., Hu, Y., Gu, Y., Sayed, A. H., and Yin, W. Walkman: A communication-efficient random-walk algorithm for decentralized optimization. IEEE Transactions on Signal Processing, 68:2513-2528, 2020.

Mishchenko, K., Khaled, A., and Richtárik, P. Random reshuffling: Simple analysis with vast improvements. In Advances in Neural Information Processing Systems, volume 33, pp. 17309-17320. Curran Associates, Inc., 2020.

Mishchenko, K., Khaled, A., and Richtárik, P. Proximal and federated random reshuffling. In Proceedings of the 39th International Conference on Machine Learning, pp. 15718-15749. PMLR, 2022.

Mohtashami, A., Stich, S. U., and Jaggi, M. Characterizing & finding good data orderings for fast convergence of sequential gradient methods. arXiv preprint arXiv:2202.01838, 2022.

Nagaraj, D., Wu, X., Bresler, G., Jain, P., and Netrapalli, P. Least squares regression with Markovian data: Fundamental limits and algorithms. Advances in Neural Information Processing Systems, 33:16666-16676, 2020.

Nesterov, Y. Introductory Lectures on Convex Optimization. Applied Optimization. Springer, 2003.

Nesterov, Y. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125-161, 2013.

Nesterov, Y. and Shikhman, V. Quasi-monotone subgradient methods for nonsmooth convex minimization. Journal of Optimization Theory and Applications, 165(3):917-940, 2015.

Parikh, N. and Boyd, S. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127-239, 2014.

Qian, X., Sailanbayev, A., Mishchenko, K., and Richtárik, P. MISO is making a comeback with better proofs and rates. arXiv preprint arXiv:1906.01474, 2019.

Rajput, S., Lee, K., and Papailiopoulos, D. Permutation-based SGD: Is random optimal? In International Conference on Learning Representations, 2022.

Ram, S. S., Nedić, A., and Veeravalli, V. V. Incremental stochastic subgradient algorithms for convex optimization. SIAM Journal on Optimization, 20(2):691-717, 2009.

Roy, A. and Balasubramanian, K. Online covariance estimation for stochastic gradient descent under Markovian sampling. arXiv preprint arXiv:2308.01481, 2023.
Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1):83-112, 2017.

Sow, D., Lin, S., Wang, Z., and Liang, Y. Doubly robust instance-reweighted adversarial training. In The Twelfth International Conference on Learning Representations, 2024.

Steininger, M., Kobs, K., Davidson, P., Krause, A., and Hotho, A. Density-based weighting for imbalanced regression. Machine Learning, 110(8):2187-2211, 2021.

Sun, T., Sun, Y., and Yin, W. On Markov chain gradient descent. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Sun, T., Li, D., and Wang, B. Adaptive random walk gradient descent for decentralized optimization. In Proceedings of the 39th International Conference on Machine Learning, pp. 20790-20809. PMLR, 2022.

Wang, P., Lei, Y., Ying, Y., and Zhou, D.-X. Stability and generalization for Markov chain stochastic gradient methods. In Advances in Neural Information Processing Systems, pp. 37735-37748, 2022a.

Wang, W., Xu, H., Liu, X., Li, Y., Thuraisingham, B., and Tang, J. Imbalanced adversarial training with reweighting. In 2022 IEEE International Conference on Data Mining (ICDM), pp. 1209-1214, Los Alamitos, CA, USA, Dec 2022b. IEEE Computer Society.

Ying, B., Yuan, K., Vlaski, S., and Sayed, A. H. On the performance of random reshuffling in stochastic learning. In 2017 Information Theory and Applications Workshop, pp. 1-5, 2017.

## A. Further Remarks on Main Results

We include in this section an extended version of Theorem 3.8, including convergence rates for arbitrary regularization parameters, as well as two of its corollaries. The extended version of Theorem 3.8 is below.

**Theorem A.1** (Extended version of Theorem 3.8 in the main text). Algorithms 1 and 2 satisfy the following:

(i) Assume Case 3.5. Then
$$\min_{1 \le n \le N} \mathbb{E}\Big[ \sup_{\theta \in \Theta,\ \|\theta - \theta_n\| \le 1} -\nabla f(\theta_n, \theta - \theta_n) \Big] \le \sqrt{\frac{2\Delta_0}{N}\Big( \frac{\rho^2}{\rho + \mu} + \frac{(L t_\odot)^2}{\rho + \mu} \Big)}. \tag{25}$$
In particular, if $\rho$ is chosen so that $\rho \le L t_\odot \le \rho + \mu$, then
$$\min_{1 \le n \le N} \mathbb{E}\Big[ \sup_{\theta \in \Theta,\ \|\theta - \theta_n\| \le 1} -\nabla f(\theta_n, \theta - \theta_n) \Big] \le 2\sqrt{\frac{\Delta_0 L t_\odot}{N}}. \tag{26}$$

(ii) Assume Case 3.6. Then
$$\min_{1 \le n \le N} \mathbb{E}\Big[ \sup_{\theta \in \Theta,\ \|\theta - \theta_n\| \le 1} -\nabla f(\theta_n, \theta - \theta_n) \Big] \le \sqrt{\frac{2\Delta_0}{N}\Big( \rho + (2 t_{\mathrm{hit}} + 1)\log_2(4|V|) + \frac{(L t_\odot)^2}{\rho + \mu} \Big)}. \tag{27}$$
If $\rho$ satisfies the condition $\rho \le L t_\odot \le \rho + \mu$, then
$$\min_{1 \le n \le N} \mathbb{E}\Big[ \sup_{\theta \in \Theta,\ \|\theta - \theta_n\| \le 1} -\nabla f(\theta_n, \theta - \theta_n) \Big] \le 2\sqrt{\frac{\Delta_0 \big( L t_\odot + (2 t_{\mathrm{hit}} + 1)\log_2(4|V|) \big)}{N}}. \tag{28}$$

(iii) Assume Case 3.7. Then
$$\min_{1 \le n \le N} \mathbb{E}\Big[ \sup_{\theta \in \Theta,\ \|\theta - \theta_n\| \le 1} -\langle \nabla f(\theta_n),\, \theta - \theta_n \rangle \Big] \le \frac{2L}{\pi_{\min}} \cdot \frac{\sqrt{C_N}\,\Delta_0 + 3 + t_\odot C_N L}{\sum_{n=1}^N (1 \wedge r_{n+1})}, \tag{29}$$
where $C_N = \sum_{n=1}^N r_n^2$.

Next, Corollary A.2 specializes these results to the setting of unconstrained non-convex optimization.

**Corollary A.2.** Assume either $\Theta = \mathbb{R}^p$ or that there exists $c \in (0, 1]$ so that $\mathrm{dist}(\theta_n, \partial\Theta) \ge c$ for all $n \ge 1$.

(i) Let $(\theta_n)_{n \ge 0}$ be an output of Algorithm 1. Assume Case 3.5 or 3.6. Then for any $N \ge 1$,
$$\min_{1 \le n \le N} \mathbb{E}[\|\nabla f(\theta_n)\|] = O\big( N^{-1/2} \big). \tag{30}$$

(ii) Let $(\theta_n)_{n \ge 0}$ be an output of Algorithm 2. Assume Case 3.7. Then for any $N \ge 1$,
$$\min_{1 \le n \le N} \mathbb{E}[\|\nabla f(\theta_n)\|] = O\bigg( \frac{1 + \sum_{n=1}^N r_n^2}{\sum_{n=1}^N r_n} \bigg). \tag{31}$$

Notice that we may take $r_n = \frac{1}{\sqrt{n}\log n}$ in Algorithm 2. Then Corollary A.2 implies that
$$\min_{1 \le n \le N} \mathbb{E}[\|\nabla f(\theta_n)\|] = O\bigg( \frac{\log N}{\sqrt{N}} \bigg)$$
holds for Case 3.7. Finally, Corollary A.3 states the iteration complexity of Algorithms 1 and 2.

**Corollary A.3** (Iteration complexity). Algorithms 1 and 2 have the following worst-case iteration complexity:

(i) Let $(\theta_n)_{n \ge 0}$ be an output of Algorithm 1. Assume Case 3.5 or 3.6. Then $N_\varepsilon(\theta_0) = O(\varepsilon^{-2})$.

(ii) Let $(\theta_n)_{n \ge 0}$ be an output of Algorithm 2. Assume Case 3.7. Then $N_\varepsilon(\theta_0) = O(\varepsilon^{-2}(\log \varepsilon^{-1})^2)$.
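A sketch of how these complexity bounds follow from the rate bounds above, assuming the approximate stationarity is measured through the expected gap: under Case 3.5 or 3.6 the rate bound has the form $C/\sqrt{N}$ for an explicit constant $C$ collecting $\Delta_0$, $L$, $t_\odot$ (and $t_{\mathrm{hit}}$, $|V|$ in Case 3.6), so the bound is inverted as follows.

```latex
% Inverting a rate bound of the form C/sqrt(N):
\min_{1\le n\le N}\mathbb{E}\big[\mathcal{O}f(\theta_n)\big]\;\le\;\frac{C}{\sqrt{N}}
\quad\Longrightarrow\quad
N\;\ge\;\frac{C^{2}}{\varepsilon^{2}}\ \text{ suffices, i.e. }\;
N_{\varepsilon}(\theta_0)=O(\varepsilon^{-2}).
```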
**Remark A.4** (Comparison with the lower bound of (Even, 2023), Theorem 1). Using our notation, the lower bound given in Theorem 1 of (Even, 2023) is
$$\|\nabla f(\theta_N)\|^2 = \Omega\Big( \frac{t_{\mathrm{hit}}}{N} \Big).$$
Our convergence rates are given in terms of $\|\nabla f(\theta_n)\|$ rather than $\|\nabla f(\theta_n)\|^2$, so in our setting this reads
$$\|\nabla f(\theta_N)\| = \Omega\Big( \sqrt{\frac{t_{\mathrm{hit}}}{N}} \Big).$$
Notice that the rate of convergence in the lower bound is $O(N^{-1})$ (for the squared gradient norm) while our upper bound gives a rate of convergence of $O(N^{-1/2})$ (for the gradient norm). Thus, despite the dependence of our results on $t_\odot$, they do not contradict this lower bound.

**Remark A.5** (Optimal sampling and estimating $t_\odot$ and $t_{\mathrm{hit}}$). The dependence of the convergence rates on $t_\odot$ or $t_{\mathrm{hit}}$ in Theorem A.1 suggests that one can accelerate convergence by choosing a sampling algorithm with the smallest values of these constants appropriate for the context. In general, an optimal sampling scheme is problem dependent. The best one can hope for, in terms of dependence on $|V|$, is that both constants are $O(|V|)$, which is achieved by i.i.d. sampling. However, this may not be feasible in settings like decentralized optimization, where communication can only occur between neighboring vertices in a graph. Depending on the graph topology, it is likely that for the standard random walk $t_{\mathrm{hit}}$ and $t_\odot$ are much larger than $|V|$, especially for sparse graphs. If a cycle containing all nodes in the graph exists, our theory suggests using cyclic sampling by traversing such a spanning cycle; in this case, both $t_{\mathrm{hit}}$ and $t_\odot$ have optimal order $O(|V|)$. If no such cycle exists, a good way to minimize $t_{\mathrm{hit}}$ is to find the shortest path in the graph which contains all vertices and then sample the vertices deterministically in order by walking over this path. This idea holds in a more general setting beyond optimization on graphs: a good way to minimize $t_{\mathrm{hit}}$ is to sample data as efficiently as possible by covering the dataset with the fewest possible number of repeats.

For many specific instances, these quantities can be estimated analytically. For random walks on graphs, much about the hitting time and target time is known through classical Markov chain theory (Levin & Peres, 2017). For cyclic sampling and random reshuffling, respectively, one has $t_{\mathrm{hit}} = |V|$ and $t_{\mathrm{hit}} \le 2|V|$, since each data point is visited exactly once every epoch and no re-shuffling occurs in the cyclic case. Under cyclic sampling, for fixed $n$ and for each $1 \le k \le |V|$ there is a $v$ with $\mathbb{E}[\tau_{n,v} \mid \mathcal{F}_n] = k$. So $t_\odot$ is the largest possible value of $\sum_{v \in V} \sigma_v \pi(v)$, where $\sigma$ ranges over all permutations of $1, \ldots, |V|$. In particular, if $\pi$ is uniform, $t_\odot = \frac{|V| + 1}{2}$. For reshuffling with uniform $\pi$, $t_\odot$ is at least this large, by considering a time $n$ at the beginning of an epoch; but it still holds that $t_\odot \le 2|V|$ since $t_\odot \le t_{\mathrm{hit}}$. If these quantities cannot be easily estimated analytically, they can be approximated using Monte Carlo simulation.

**Remark A.6** (Comparison with i.i.d. sampling). If the sequence $(v_n)$ is formed by sampling vertices uniformly at random from $V$, then, as previously mentioned, the return times $\tau_{n,v}$ are i.i.d. geometric random variables with parameter $\frac{1}{|V|}$. Then $\mathbb{E}[\tau_{n,v} \mid \mathcal{F}_n] = |V|$ for each $n$ and $v$, so $t_\odot = |V|$. Substituting $|V|$ for $t_\odot$ in the optimal bound in Theorem 3.8(ii), we recover the result given for MISO in the i.i.d. setting in (Karimi et al., 2022) up to a factor of two. This is in spite of the fact that our analytical approach is necessarily different in order to handle general recurrent sampling, and it shows that our results are tight.

**Remark A.7** (Iterate stability and regularization). Here we give some remarks on the use of the diminishing radius and proximal regularization in Algorithms 1 and 2.
The diminishing radius restriction in Algorithm 2 is a hard regularization technique: it bakes the necessary iterate stability directly into the problem by enforcing the stronger condition $\|\theta_n - \theta_{n-1}\| \le r_n$. In comparison with proximal regularization, the diminishing radius bounds the one-step iterate difference by a deterministic quantity. Moreover, as is argued in the proof of Theorem 3.9(ii) and the preceding propositions, sufficiently often the iterate $\theta_n$ obtained by minimizing $\bar g_n$ over the trust region in fact minimizes $\bar g_n$ over the entire feasible set $\Theta$. This allows us to prove that limit points of the iterates produced by Algorithm 2 are stationary, even for general recurrent sampling schemes. However, the diminishing radius introduces the drawback of needing to compute a projection at each step of the optimization process.

Compared to the diminishing radius, proximal regularization is a form of soft regularization. It is less restrictive than the diminishing radius, but only allows us to derive a weaker form of iterate stability: we only have $\sum_{n=1}^\infty \|\theta_n - \theta_{n-1}\|^2 < \infty$ instead of the stronger $\|\theta_n - \theta_{n-1}\| \le r_n$, which makes some aspects of the analysis slightly more challenging. As mentioned in Section 3.2, the use of dynamic proximal regularization in Case 3.6 adapts to the sampling process and increasingly penalizes large values of $\|\theta_n - \theta_{n-1}\|$ as the amount of time any vertex is left unvisited increases. The drawback is that we are unable to prove almost sure asymptotic convergence in Theorem 3.9 for Case 3.6, as we are for Case 3.7. If we could show $\rho_n\|\theta_n - \theta_{n-1}\| \to 0$, then asymptotic convergence would follow; however, $\rho_n$ is not bounded and can take arbitrarily large values, albeit with low probability. But as we show in Lemma E.1, $\mathbb{E}[\rho_n]$ is uniformly bounded, which allows us to deduce $\mathbb{E}[\rho_n\|\theta_n - \theta_{n-1}\|] \to 0$ and leads to the $L^1$-convergence result in Theorem 3.9. For Case 3.5 it is relatively straightforward to show $\rho_n\|\theta_n - \theta_{n-1}\| \to 0$, since $\rho_n$ is constant; in this case, the difficulty lies in showing $\nabla \bar h_n(\theta_n) \to 0$. We are able to prove this for Case 3.6, however, because the use of dynamic proximal regularization allows us to show that the sequence $\max_{v \in V}(n - k_v(n))\,\|\theta_n - \theta_{n-1}\|^2$ is summable. More detail is given in the proofs and discussion in subsection E.1.

## B. Convergence analysis using mixing times

In this section, we give an overview of the standard pipeline for analyzing stochastic optimization algorithms with Markovian data and discuss the difficulties of applying these techniques to the analysis of MISO. The analysis of first-order methods such as SGD generally relies on conditionally unbiased gradient estimates. In the context of solving problem (1), this is the requirement that
$$\mathbb{E}[\nabla f^{v_n}(\theta_{n-1}) \mid \mathcal{F}_{n-1}] = \nabla f(\theta_{n-1}), \tag{35}$$
where $\mathcal{F}_n$ is the filtration of information up to time $n$. However, in the dependent data setting (35) does not hold, which complicates the analysis significantly. Previous works (e.g., (Sun et al., 2018; Bhandari et al., 2018; Nagaraj et al., 2020; Lyu et al., 2020; 2022; Lyu, 2023; Alacaoglu & Lyu, 2023)) use a "conditioning on the distant past" argument. Specifically, for SGD, one assumes that $(v_n)$ is a Markov chain on state space $V$ which mixes exponentially fast to its stationary distribution $\pi$ with parameter $\lambda$. One then considers the quantity
$$\mathbb{E}[\nabla f^{v_n}(\theta_{n-a_n}) \mid \mathcal{F}_{n-a_n}], \tag{36}$$
where $a_n$ is a slowly growing sequence satisfying $\sum_{n \ge 1} \lambda^{a_n} < \infty$. By further assuming either uniformly bounded gradients, as in (Sun et al., 2018), or that the conditional expectations $\mathbb{E}[\|\nabla f^{v_{n+1}}(\theta)\| \mid \mathcal{F}_n]$ are uniformly bounded, as in (Alacaoglu & Lyu, 2023), one can show that
$$\big\| \nabla f(\theta_{n-a_n}) - \mathbb{E}[\nabla f^{v_n}(\theta_{n-a_n}) \mid \mathcal{F}_{n-a_n}] \big\| = O(\lambda^{a_n}). \tag{37}$$
B. Convergence analysis using mixing times

In this section we give an overview of the standard pipeline for analyzing stochastic optimization algorithms with Markovian data and discuss the difficulties of applying these techniques to the analysis of MISO. The analysis of first-order methods such as SGD generally relies on conditionally unbiased gradient estimates. In the context of solving problem (1), this requirement reads
$$\mathbb{E}[\nabla f^{v_n}(\theta_{n-1})\mid\mathcal{F}_{n-1}] = \nabla f(\theta_{n-1}), \tag{35}$$
where $\mathcal{F}_n$ is the filtration of information up to time $n$. However, in the dependent data setting (35) does not hold, which complicates the analysis significantly. Previous works (e.g., (Sun et al., 2018; Bhandari et al., 2018; Nagaraj et al., 2020; Lyu et al., 2020; 2022; Lyu, 2023; Alacaoglu & Lyu, 2023)) use a "conditioning on the distant past" argument. Specifically, for SGD, one assumes that $(v_n)$ is a Markov chain on state space $V$ which mixes exponentially fast to its stationary distribution $\pi$ with parameter $\lambda$. One then considers the quantity
$$\mathbb{E}[\nabla f^{v_n}(\theta_{n-a_n})\mid\mathcal{F}_{n-a_n}], \tag{36}$$
where $a_n$ is a slowly growing sequence satisfying $\sum_{n\ge1}\lambda^{a_n} < \infty$. By further assuming either uniformly bounded gradients as in (Sun et al., 2018), or that the conditional expectations $\mathbb{E}[\|\nabla f^{v_{n+1}}(\theta)\|\mid\mathcal{F}_n]$ are uniformly bounded as in (Alacaoglu & Lyu, 2023), one can show that
$$\big\|\nabla f(\theta_{n-a_n}) - \mathbb{E}[\nabla f^{v_n}(\theta_{n-a_n})\mid\mathcal{F}_{n-a_n}]\big\| = O(\lambda^{a_n}). \tag{37}$$
Using Lipschitz continuity of gradients, it is then established that
$$\big\|\nabla f(\theta_n) - \mathbb{E}[\nabla f^{v_n}(\theta_n)\mid\mathcal{F}_{n-a_n}]\big\| = O(\lambda^{a_n}) + O(\|\theta_n - \theta_{n-a_n}\|), \tag{38}$$
allowing one to control the bias in the stochastic gradient estimate. Conventional analysis may then be used to prove
$$\sum_{n=1}^{\infty}\gamma_n\,\mathbb{E}\big[\langle\nabla f(\theta_n), \nabla f^{v_n}(\theta_n)\rangle\big] < \infty, \tag{39}$$
where $\gamma_n$ denotes the stepsize at iteration $n$. Combining this with (38), it can finally be deduced that
$$\sum_{n=1}^{\infty}\gamma_n\,\mathbb{E}[\|\nabla f(\theta_n)\|^2] < \infty \tag{40}$$
for an appropriate stepsize $\gamma_n$, which gives the desired convergence rate.

This technique was further developed in (Lyu, 2023) to analyze convergence rates of stochastic majorization-minimization (SMM) (Mairal, 2013) algorithms under Markovian sampling. In the context of problem (1), SMM proceeds by minimizing a recursively defined majorizing surrogate of the empirical loss function. The algorithm is stated concisely as follows: sample $v_n$ from the conditional distribution $\pi(\cdot\mid\mathcal{F}_{n-1})$; let $g_n \leftarrow$ a strongly convex majorizing surrogate of $f^{v_n}$; and set
$$\theta_n \in \arg\min_{\theta\in\Theta}\big(\bar g_n(\theta) := (1-w_n)\,\bar g_{n-1}(\theta) + w_n\,g_n(\theta)\big), \tag{41}$$
where $(w_n)_{n\ge1}$ is a non-increasing sequence of weights. A crucial step in the analysis of SMM is to show that
$$\sum_{n=1}^{\infty} w_n\,\mathbb{E}\big[\,|\bar g_n(\theta_n) - \bar f_n(\theta_n)|\,\big] < \infty, \tag{42}$$
where $\bar f_n$ is the empirical loss function satisfying the recursion $\bar f_n(\theta) := (1-w_n)\bar f_{n-1}(\theta) + w_n f^{v_n}(\theta)$. To do so, the problem is reduced to showing that
$$\mathbb{E}\Big[\mathbb{E}\big[f^{v_n}(\theta_{n-a_n}) - \bar f_n(\theta_{n-a_n})\mid\mathcal{F}_{n-a_n}\big]^+\Big] = O(w_{n-a_n}) + O(\lambda^{a_n}) \tag{43}$$
(see Proposition 8.1 in (Lyu, 2023)), which is similar to (37). By additionally assuming Lipschitz continuity of the individual loss functions $f^v$, it is then shown that
$$\mathbb{E}\Big[\mathbb{E}\big[f^{v_n}(\theta_n) - \bar f_n(\theta_n)\mid\mathcal{F}_{n-a_n}\big]\Big] = O(w_{n-a_n}) + O(\lambda^{a_n}) + O(\|\theta_n - \theta_{n-a_n}\|), \tag{44}$$
which may be compared with (38). Finally, (42) is proved by showing
$$\sum_{n=1}^{\infty} w_n\,\mathbb{E}\big[\,|\bar g_n(\theta_n) - \bar f_n(\theta_n)|\,\big] \le \bar g_1(\theta_1) + \sum_{n=1}^{\infty} w_n\,\mathbb{E}\big[f^{v_n}(\theta_n) - \bar f_n(\theta_n)\big] \tag{45}$$
and using the bound (44) to conclude that the latter sum is finite.

MISO and its extensions proposed in this paper are similar to SMM in their use of the majorization-minimization principle, but they contain a few key differences. Put shortly, the original implementation of MISO is: sample $v_n$ from the conditional distribution $\pi(\cdot\mid\mathcal{F}_{n-1})$; let $g_n^{v_n} \leftarrow$ a convex surrogate of $f^{v_n}$ at $\theta_{n-1}$, with $g_n^v = g_{n-1}^v$ for $v\ne v_n$; and set
$$\theta_n \in \arg\min_{\theta\in\Theta}\Big(\bar g_n(\theta) := \sum_{v\in V} g_n^v(\theta)\,\pi(v)\Big). \tag{46}$$
In contrast with SMM, each surrogate defining $\bar g_n$ is given a constant weight. Consequently, the additional control provided by the decreasing weights $(w_n)$ in SMM is not present, which makes the adaptation of the techniques used in the analysis of SMM non-trivial. A key step in the original analysis of MISO given in (Mairal, 2015) is to show
$$\sum_{n=1}^{\infty}\mathbb{E}[\bar h_n(\theta_n)] < \infty. \tag{47}$$
This is similar to (42) and shows that the averaged surrogate is an asymptotically accurate approximation of the true objective at the iterates $\theta_n$. To prove this, one needs to relate the averaged error $\bar h_n(\theta_n)$ to another quantity proven to be summable through other means; this is, in the abstract, the role that mixing rate analysis plays for SGD and SMM. Using techniques from (Mairal, 2015) one can prove
$$\sum_{n=1}^{\infty}\mathbb{E}\big[h_n^{v_{n+1}}(\theta_n)\big] < \infty. \tag{48}$$
If the $(v_n)$ are drawn i.i.d. from $\pi$, or more generally if the probability of transitioning between any two vertices is uniformly bounded below by a positive quantity, then $h_n^{v_{n+1}}(\theta_n)$ is a conditionally unbiased estimate of $\bar h_n(\theta_n)$ up to a constant, and (47) follows by conditioning on the most recent information $\mathcal{F}_n$.
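The contrast between the weighting schemes in (41) and (46) can be made concrete: unrolling the SMM recursion shows that the surrogate drawn at time $i$ enters $\bar g_n$ with effective weight $w_i\prod_{j=i+1}^{n}(1-w_j)$, while in MISO the surrogate stored at index $v$ always carries the fixed weight $\pi(v)$, no matter how stale it is. The snippet below is our own illustration of this unrolling, not code from the paper.

```python
def smm_weights(w):
    """Effective weight inside g_bar_n of the surrogate sampled at time i,
    for the SMM recursion (41): w[i] * prod_{j>i} (1 - w[j])."""
    n = len(w)
    out = []
    for i in range(n):
        eff = w[i]
        for j in range(i + 1, n):
            eff *= 1.0 - w[j]
        out.append(eff)
    return out

# with w_n = 1/n the effective weights are uniform, mirroring an empirical mean
print(smm_weights([1.0, 0.5, 1/3, 0.25]))  # -> [0.25, 0.25, 0.25, 0.25]
```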
To adapt the analysis to a more general setting, one may attempt to use Markov chain mixing to show
$$\big|\bar h_n(\theta_{n-a_n}) - \mathbb{E}\big[h_n^{v_{n+1}}(\theta_{n-a_n})\mid\mathcal{F}_{n-a_n}\big]\big| = O(\lambda^{a_n}), \tag{49}$$
similar to (37) and (43). However, an additional complication arises because the function $h_n^{v_{n+1}}$, conditional on $\mathcal{F}_{n-a_n}$, depends on both the history of data samples $v_k$ and the estimated parameters $\theta_k$ for $n-a_n \le k \le n$. In contrast, $f^{v_n}$ for SGD (in (37)) depends only on the last data sample $v_n$, and $\bar f_n$ for SMM (in (43)) depends only on the history of data samples $v_{n-a_n},\dots,v_n$. So one cannot use the Markov property to isolate the randomness in $h_n^{v_{n+1}}(\theta_{n-a_n})$ due to the Markov chain transitions over the interval $[n-a_n, n+1]$. To alleviate this problem, one may attempt to control the difference
$$\big|h_n^{v_{n+1}}(\theta_{n-a_n}) - h_{n-a_n}^{v_{n+1}}(\theta_{n-a_n})\big|, \tag{50}$$
but doing so is not straightforward without access to something resembling the weights $(w_n)$ in SMM.

C. Preliminary Lemmas

In this section we state and prove some preliminary lemmas which will be used in the proofs of both Theorems 3.8 and 3.9. We first introduce some additional notation. Throughout this section, as well as in the remainder of the paper, we let
$$t_{\mathrm{cov}} := \sup_{n\ge1}\,\Big\|\,\mathbb{E}\Big[\max_{v\in V}\tau_{n,v}\,\Big|\,\mathcal{F}_n\Big]\,\Big\|_\infty. \tag{51}$$
We recall that the $L^\infty$ norm here is taken for the conditional expectation viewed as a random variable. This quantity is a generalization of the worst-case expected cover time from Markov chain theory (Levin & Peres, 2017) and will be important in the analysis of Algorithm 1 with dynamic proximal regularization. Our first result, Proposition C.1, states that under Assumption 3.1 the return times have finite moments of all orders, and it gives an upper bound on $t_{\mathrm{cov}}$ in terms of $t_{\mathrm{hit}}$. The first item is used in Section E to prove asymptotic results, while the second is used in the proof of iteration complexity for Case 3.6.

Proposition C.1 (Recurrence implies finite exponential moments of return times). Let Assumption 3.1 hold and let $t_{\mathrm{hit}}$ and $t_{\mathrm{cov}}$ be as in (8) and (51) respectively.
(i) There exists $s_0 > 0$ so that for all $0 < s < s_0$ there is a constant $C_s > 0$ with
$$\max_{v\in V}\,\sup_{n\ge1}\,\big\|\,\mathbb{E}\big[e^{s\tau_{n,v}}\mid\mathcal{F}_n\big]\,\big\|_\infty \le C_s < \infty. \tag{52}$$
Consequently, for each $p\ge1$, $\max_{v\in V}\sup_{n\ge1}\|\mathbb{E}[\tau_{n,v}^p\mid\mathcal{F}_n]\|_\infty \le C < \infty$ for some $C>0$.
(ii) We have the following bound on $t_{\mathrm{cov}}$:
$$t_{\mathrm{cov}} \le (2t_{\mathrm{hit}}+1)\log_2(4|V|). \tag{53}$$

Proof. Let $m$ be the smallest integer satisfying $m \ge 2t_{\mathrm{hit}}$. For any $n\ge1$ and $v\in V$, notice that if $\tau_{n,v}\ge km$ then we must have $\tau_{n+jm,v}\ge m$ for each $0\le j\le k-1$. Then
$$\mathbb{P}(\tau_{n,v}\ge km\mid\mathcal{F}_n) \le \mathbb{E}\Big[\prod_{j=0}^{k-1}\mathbf{1}(\tau_{n+jm,v}\ge m)\,\Big|\,\mathcal{F}_n\Big] = \mathbb{E}\Big[\mathbb{E}\big[\mathbf{1}(\tau_{n+(k-1)m,v}\ge m)\mid\mathcal{F}_{n+(k-1)m}\big]\prod_{j=0}^{k-2}\mathbf{1}(\tau_{n+jm,v}\ge m)\,\Big|\,\mathcal{F}_n\Big],$$
where we have used that $\{\tau_{n+jm,v}\ge m\}$ is measurable with respect to $\mathcal{F}_{n+(k-1)m}$ for each $j\le k-2$. By Markov's inequality and Assumption 3.1,
$$\mathbb{E}\big[\mathbf{1}(\tau_{n+(k-1)m,v}\ge m)\mid\mathcal{F}_{n+(k-1)m}\big] = \mathbb{P}\big(\tau_{n+(k-1)m,v}\ge m\mid\mathcal{F}_{n+(k-1)m}\big) \le \frac{t_{\mathrm{hit}}}{m},$$
so that $\mathbb{P}(\tau_{n,v}\ge km\mid\mathcal{F}_n) \le \frac{t_{\mathrm{hit}}}{m}\,\mathbb{E}\big[\prod_{j=0}^{k-2}\mathbf{1}(\tau_{n+jm,v}\ge m)\mid\mathcal{F}_n\big]$. Proceeding by induction, it follows that
$$\mathbb{P}(\tau_{n,v}\ge km\mid\mathcal{F}_n) \le \Big(\frac{t_{\mathrm{hit}}}{m}\Big)^k \le 2^{-k},$$
with the second inequality using our choice of $m$. Now,
$$\mathbb{E}[e^{s\tau_{n,v}}\mid\mathcal{F}_n] = \sum_{\ell=1}^{\infty} e^{s\ell}\,\mathbb{P}(\tau_{n,v}=\ell\mid\mathcal{F}_n) \le \sum_{k=0}^{\infty} e^{s(k+1)m}\,\mathbb{P}(\tau_{n,v}\ge km\mid\mathcal{F}_n) \le \sum_{k=0}^{\infty} e^{s(k+1)m}\,2^{-k}. \tag{59}$$
The latter sum is finite if $s < \frac{\log 2}{2m}$ and does not depend on $n$ or $v$, which shows (i). With $n$ still fixed, let $\tau_{\mathrm{cov}} = \max_{v\in V}\tau_{n,v}$. We have
$$\mathbb{E}[\tau_{\mathrm{cov}}\mid\mathcal{F}_n] = \sum_{\ell=1}^{\infty}\mathbb{P}(\tau_{\mathrm{cov}}\ge\ell\mid\mathcal{F}_n) \le \sum_{k=0}^{\infty} m\,\mathbb{P}(\tau_{\mathrm{cov}}\ge km\mid\mathcal{F}_n) \le (2t_{\mathrm{hit}}+1)\sum_{k=0}^{\infty}\mathbb{P}(\tau_{\mathrm{cov}}\ge km\mid\mathcal{F}_n), \tag{60}$$
since $\mathbb{P}(\tau_{\mathrm{cov}}\ge\ell\mid\mathcal{F}_n)$ is a decreasing function of $\ell$ and $m\le 2t_{\mathrm{hit}}+1$. By a union bound,
$$\mathbb{P}(\tau_{\mathrm{cov}}\ge km\mid\mathcal{F}_n) \le 1\wedge\sum_{v\in V}\mathbb{P}(\tau_{n,v}\ge km\mid\mathcal{F}_n) \le 1\wedge|V|2^{-k}. \tag{61}$$
Summing a geometric series, we get
$$\sum_{k=0}^{\infty}\mathbb{P}(\tau_{\mathrm{cov}}\ge km\mid\mathcal{F}_n) \le \log_2|V| + |V|\sum_{k>\log_2|V|}2^{-k} \le \log_2|V| + 2. \tag{62}$$
Combining (60) and (62) shows (ii).
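The geometric tail bound at the heart of this proof is easy to observe empirically. The snippet below is our own illustration (the 8-cycle, the trial counts, and the choice of $m$ are ours): for a simple random walk on the $n$-cycle the worst-case expected hitting time from distance $d$ is $d(n-d)$, here $4\cdot4 = 16$, so $m = 32$ satisfies $m\ge 2t_{\mathrm{hit}}$ and the simulated tails stay below $2^{-k}$.

```python
import random

def return_time(n_vertices, start, target, rng):
    """Steps for a simple random walk on the n-cycle to first reach target."""
    x, t = start, 0
    while True:
        x = (x + rng.choice([-1, 1])) % n_vertices
        t += 1
        if x == target:
            return t

rng = random.Random(0)
taus = [return_time(8, 0, 4, rng) for _ in range(20000)]
m = 32  # m >= 2 * t_hit, with t_hit = 4 * (8 - 4) = 16 on the 8-cycle
for k in range(1, 4):
    frac = sum(t >= k * m for t in taus) / len(taus)
    print(f"P(tau >= {k}*m) = {frac:.4f} <= 2^-{k} = {2**-k:.4f}")
```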
The next proposition states some general properties of first-order surrogate functions.

Proposition C.2 (Properties of surrogates). Fix $\theta'\in\Theta$ and $f:\Theta\to\mathbb{R}$. Let $g\in\mathcal{S}_L(f,\theta')$, write $h := g - f$, and let $\hat\theta$ be a minimizer of $g$ over $\Theta$. Then for all $\theta\in\Theta$,
$$h(\theta) \le \frac{L}{2}\|\theta-\theta'\|^2. \tag{63}$$

Proof. This follows from the classical upper bound for functions with $L$-Lipschitz continuous gradients,
$$h(\theta) \le h(\tilde\theta) + \langle\nabla h(\tilde\theta), \theta-\tilde\theta\rangle + \frac{L}{2}\|\theta-\tilde\theta\|^2 \tag{64}$$
(see Lemma F.1), applied with $\tilde\theta = \theta'$, noting that $h(\theta')$ and $\nabla h(\theta')$ are both equal to zero according to Definition 2.1.

Next, we show that the surrogate objective value $\bar g_n(\theta_n)$ is non-increasing.

Lemma C.3 (Surrogate monotonicity). Let $(\theta_n)_{n\ge0}$ be an output of either Algorithm 1 or 2. Then $\bar g_n(\theta_{n-1}) \le \bar g_{n-1}(\theta_{n-1})$ for all $n\ge1$. Moreover, the sequence $(\bar g_n(\theta_n))_{n\ge0}$ is non-increasing. As a consequence, by Assumption 3.2 and Definition 2.1, $(\bar g_n(\theta_n))_{n\ge0}$ is bounded below with probability one, and therefore $\lim_{n\to\infty}\bar g_n(\theta_n)$ exists almost surely.

Proof. Since $g_n^{v_n}\in\mathcal{S}_L(f^{v_n},\theta_{n-1})$, Definition 2.1 implies that $g_n^{v_n}(\theta_{n-1}) = f^{v_n}(\theta_{n-1})$. Then
$$\bar g_n(\theta_{n-1}) = \bar g_{n-1}(\theta_{n-1}) + \big[g_n^{v_n}(\theta_{n-1}) - g_{n-1}^{v_n}(\theta_{n-1})\big]\pi(v_n) \tag{65}$$
$$= \bar g_{n-1}(\theta_{n-1}) + \big[f^{v_n}(\theta_{n-1}) - g_{n-1}^{v_n}(\theta_{n-1})\big]\pi(v_n) \tag{66}$$
$$\le \bar g_{n-1}(\theta_{n-1}), \tag{67}$$
where the last inequality used that $g_{n-1}^{v_n}$ is a majorizing surrogate of $f^{v_n}$. Suppose now that $(\theta_n)_{n\ge0}$ is an output of Algorithm 1. Then by the definition of $\theta_n$,
$$\bar g_n(\theta_n) \le \bar g_n(\theta_n) + \frac{\rho_n}{2}\|\theta_n-\theta_{n-1}\|^2 \le \bar g_n(\theta_{n-1}) + \frac{\rho_n}{2}\|\theta_{n-1}-\theta_{n-1}\|^2 \le \bar g_{n-1}(\theta_{n-1}).$$
If instead $(\theta_n)_{n\ge1}$ is an output of Algorithm 2, then we can directly conclude $\bar g_n(\theta_n)\le\bar g_n(\theta_{n-1})$ by the definition of $\theta_n$. The remainder of the proof is identical to the above.

The next lemma establishes the summability of the sequence $h_n^{v_{n+1}}(\theta_n)$. This was used in (Mairal, 2015) to prove asymptotic convergence of MISO under i.i.d. sampling. We use it primarily in the analysis of Algorithm 2.

Lemma C.4. Let $(\theta_n)_{n\ge0}$ be an output of either Algorithm 1 or 2. Then almost surely
$$\sum_{n=1}^{\infty} h_n^{v_{n+1}}(\theta_n) \le \frac{\Delta_0}{\pi_{\min}}. \tag{68}$$

Proof. By Definition 2.1, for each $n$ the quantity $h_n^{v_{n+1}}(\theta_n)$ is non-negative. Therefore, it suffices to show that the sequence of partial sums $\sum_{n=1}^{N} h_n^{v_{n+1}}(\theta_n)$ is uniformly bounded. Recall that
$$\bar g_{n+1}(\theta_{n+1}) \le \bar g_{n+1}(\theta_n) = \bar g_n(\theta_n) + \big(g_{n+1}^{v_{n+1}}(\theta_n) - g_n^{v_{n+1}}(\theta_n)\big)\pi(v_{n+1}) \tag{69}$$
$$= \bar g_n(\theta_n) + \big(f^{v_{n+1}}(\theta_n) - g_n^{v_{n+1}}(\theta_n)\big)\pi(v_{n+1}). \tag{70}$$
We then have
$$\sum_{n=1}^{N} h_n^{v_{n+1}}(\theta_n) = \sum_{n=1}^{N}\big(g_n^{v_{n+1}}(\theta_n) - f^{v_{n+1}}(\theta_n)\big) \le \sum_{n=1}^{N}\frac{1}{\pi(v_{n+1})}\big(\bar g_n(\theta_n) - \bar g_{n+1}(\theta_{n+1})\big) \le \frac{1}{\pi_{\min}}\sum_{n=1}^{N}\big(\bar g_n(\theta_n) - \bar g_{n+1}(\theta_{n+1})\big) \le \frac{\Delta_0}{\pi_{\min}},$$
which is what we needed to show.

Proposition C.5. Suppose $(\theta_n)_{n\ge0}$ is an output of Algorithm 2. Then for all $n\ge1$, $\|\theta_n-\theta_{n-1}\|\le r_n$.

Proof. This follows directly from the definition of $\theta_n$ in Algorithm 2.

Lemma C.6 establishes iterate stability. These results are crucially used to control the surrogate error gradient $\nabla\bar h_n(\theta_n)$ in Lemma D.1, as well as in the asymptotic analysis of RMISO in Section E.

Lemma C.6 (Finite variation of iterate differences). The following hold almost surely:
(i) For Case 3.7,
$$\sum_{n=1}^{\infty}\|\theta_n-\theta_{n-1}\|^2 \le \sum_{n=1}^{\infty} r_n^2 < \infty. \tag{71}$$
(ii) In either of the Cases 3.5 or 3.6,
$$\sum_{n=1}^{\infty}\frac{\rho_n+\mu}{2}\|\theta_n-\theta_{n-1}\|^2 \le \Delta_0 < \infty. \tag{72}$$

Proof. Item (i) follows from Proposition C.5 and Assumption 3.4. Now assume either of the Cases 3.5 or 3.6. Define $G_n(\theta) = \bar g_n(\theta) + \frac{\rho_n}{2}\|\theta-\theta_{n-1}\|^2$. Then $G_n$ is $(\rho_n+\mu)$-strongly convex. Since $\theta_n$ is a minimizer of $G_n$ over $\Theta$, we get
$$G_n(\theta_n) + \frac{\rho_n+\mu}{2}\|\theta_n-\theta_{n-1}\|^2 \le G_n(\theta_{n-1}) = \bar g_n(\theta_{n-1}) \le \bar g_{n-1}(\theta_{n-1}), \tag{73}$$
where the last inequality is due to Lemma C.3. So
$$\frac{\rho_n+\mu}{2}\|\theta_n-\theta_{n-1}\|^2 \le \bar g_{n-1}(\theta_{n-1}) - G_n(\theta_n) \le \bar g_{n-1}(\theta_{n-1}) - \bar g_n(\theta_n). \tag{74}$$
Summing over $n$,
$$\sum_{n=1}^{N}\frac{\rho_n+\mu}{2}\|\theta_n-\theta_{n-1}\|^2 \le \sum_{n=1}^{N}\big(\bar g_{n-1}(\theta_{n-1}) - \bar g_n(\theta_n)\big) = \bar g_0(\theta_0) - \bar g_N(\theta_N) \le \Delta_0. \tag{75}$$
Letting $N\to\infty$ shows that
$$\sum_{n=1}^{\infty}\frac{\rho_n+\mu}{2}\|\theta_n-\theta_{n-1}\|^2 \le \Delta_0 \tag{76}$$
as desired. This shows (ii).

The remaining results in this section concern Algorithm 2 and are used both in the convergence rate analysis of Section D and in the asymptotic analysis of Section E. Recall that in this case we assume $\nabla f^v$ is $L$-Lipschitz continuous for each $v\in V$. Proposition C.7 states that this assumption implies $\bar g_n$ is differentiable with a $2L$-Lipschitz gradient for each $n$.

Proposition C.7. Let $\{\theta^v\}_{v\in V}$ be a collection of $|V|$ points in $\Theta$. Suppose that Assumption 3.3 holds and that $g_n^v\in\mathcal{S}_L(f^v,\theta^v)$ for each $v$. Then:
(i) The gradient of the objective function $f = \sum_{v\in V} f^v\pi(v)$ is $L$-Lipschitz over $\Theta$.
(ii) For each $v$, $\nabla g_n^v$ is $2L$-Lipschitz over $\Theta$. In addition, $\nabla\bar g_n$ is $2L$-Lipschitz.

Proof. Since $\pi$ is a probability distribution, (i) follows easily from the triangle inequality. For (ii), note that $\nabla(g_n^v - f^v) = \nabla h_n^v$ is $L$-Lipschitz by Definition 2.1. Then, since $\nabla g_n^v = \nabla h_n^v + \nabla f^v$, it follows from the triangle inequality that $\nabla g_n^v$ is $2L$-Lipschitz. Recalling that $\bar g_n = \sum_{v\in V} g_n^v\pi(v)$, another application of the triangle inequality shows that $\nabla\bar g_n$ is $2L$-Lipschitz continuous.

Proposition C.8. Assume Case 3.7 and let $(\theta_n)_{n\ge0}$ be an output of Algorithm 2. Then
$$\sum_{n=1}^{\infty}\big|\langle\nabla\bar g_n(\theta_{n-1}),\,\theta_n-\theta_{n-1}\rangle\big| \le \Delta_0 + L\sum_{n=1}^{\infty} r_n^2 \tag{77}$$
almost surely.

Proof. Since $\nabla\bar g_n$ is $2L$-Lipschitz continuous by Proposition C.7, Lemma F.1 gives
$$\big|\bar g_n(\theta_n) - \bar g_n(\theta_{n-1}) - \langle\nabla\bar g_n(\theta_{n-1}),\,\theta_n-\theta_{n-1}\rangle\big| \le L\|\theta_n-\theta_{n-1}\|^2. \tag{78}$$
Under Algorithm 2 we have $\bar g_n(\theta_n)\le\bar g_n(\theta_{n-1})$ by the definition of $\theta_n$, and $\bar g_n(\theta_{n-1})\le\bar g_{n-1}(\theta_{n-1})$ by Lemma C.3. These observations together with (78) imply
$$\big|\langle\nabla\bar g_n(\theta_{n-1}),\,\theta_n-\theta_{n-1}\rangle\big| \le |\bar g_n(\theta_{n-1}) - \bar g_n(\theta_n)| + L\|\theta_n-\theta_{n-1}\|^2 \tag{79}$$
$$= \bar g_n(\theta_{n-1}) - \bar g_n(\theta_n) + L\|\theta_n-\theta_{n-1}\|^2 \tag{80}$$
$$\le \bar g_{n-1}(\theta_{n-1}) - \bar g_n(\theta_n) + L\|\theta_n-\theta_{n-1}\|^2. \tag{81}$$
Moreover,
$$\sum_{n=1}^{N}\big(\bar g_{n-1}(\theta_{n-1}) - \bar g_n(\theta_n)\big) \le \Delta_0 \tag{82}$$
almost surely. Therefore,
$$\sum_{n=1}^{\infty}\big|\langle\nabla\bar g_n(\theta_{n-1}),\,\theta_n-\theta_{n-1}\rangle\big| \le \sum_{n=1}^{\infty}\big(\bar g_{n-1}(\theta_{n-1}) - \bar g_n(\theta_n)\big) + L\sum_{n=1}^{\infty}\|\theta_n-\theta_{n-1}\|^2 \le \Delta_0 + L\sum_{n=1}^{\infty} r_n^2, \tag{83-85}$$
where the last step uses $\|\theta_n-\theta_{n-1}\|\le r_n$.
The next lemma is key to establishing the iteration complexity of Algorithm 2. A similar lemma was used to analyze block majorization-minimization with diminishing radius in (Lyu & Li, 2023).

Lemma C.9 (Approximate first-order optimality). Let $(\theta_n)_{n\ge0}$ be an output of Algorithm 2 and let $b_n = \min\{1, r_n\}$. Then
$$b_n\sup_{\theta\in\Theta,\,\|\theta-\theta_{n-1}\|\le1}\langle-\nabla\bar g_{n-1}(\theta_{n-1}),\,\theta-\theta_{n-1}\rangle \le -\langle\nabla\bar g_n(\theta_{n-1}),\,\theta_n-\theta_{n-1}\rangle + r_n\|\nabla h_{n-1}^{v_n}(\theta_{n-1})\| + 2Lr_n^2. \tag{86}$$

Proof. Fix $\theta\in\Theta$ with $\|\theta-\theta_{n-1}\|\le b_n$. By the definition of $\theta_n$ we have $\bar g_n(\theta_n)\le\bar g_n(\theta)$. Subtracting $\bar g_n(\theta_{n-1})$ from both sides and using Proposition C.7 and Lemma F.1,
$$\langle\nabla\bar g_n(\theta_{n-1}),\,\theta_n-\theta_{n-1}\rangle - L\|\theta_n-\theta_{n-1}\|^2 \le \bar g_n(\theta_n) - \bar g_n(\theta_{n-1}) \tag{87}$$
$$\le \bar g_n(\theta) - \bar g_n(\theta_{n-1}) \tag{88}$$
$$\le \langle\nabla\bar g_n(\theta_{n-1}),\,\theta-\theta_{n-1}\rangle + L\|\theta-\theta_{n-1}\|^2. \tag{89}$$
Notice that
$$\nabla\bar g_n(\theta_{n-1}) = \nabla\bar g_{n-1}(\theta_{n-1}) + \big(\nabla g_n^{v_n}(\theta_{n-1}) - \nabla g_{n-1}^{v_n}(\theta_{n-1})\big)\pi(v_n) \tag{90}$$
$$= \nabla\bar g_{n-1}(\theta_{n-1}) + \big(\nabla f^{v_n}(\theta_{n-1}) - \nabla g_{n-1}^{v_n}(\theta_{n-1})\big)\pi(v_n) \tag{91}$$
$$= \nabla\bar g_{n-1}(\theta_{n-1}) - \nabla h_{n-1}^{v_n}(\theta_{n-1})\,\pi(v_n), \tag{92}$$
where the second line used $g_n^{v_n}\in\mathcal{S}_L(f^{v_n},\theta_{n-1})$ and item (ii) of Definition 2.1, and the third line used the definition of $h_{n-1}^{v_n}$. Therefore, adding and subtracting $\langle\nabla\bar g_{n-1}(\theta_{n-1}),\,\theta-\theta_{n-1}\rangle$ on the right-hand side of (89), we get
$$\langle\nabla\bar g_n(\theta_{n-1}),\,\theta_n-\theta_{n-1}\rangle \le \langle\nabla\bar g_{n-1}(\theta_{n-1}),\,\theta-\theta_{n-1}\rangle - \pi(v_n)\langle\nabla h_{n-1}^{v_n}(\theta_{n-1}),\,\theta-\theta_{n-1}\rangle + L\|\theta-\theta_{n-1}\|^2 + L\|\theta_n-\theta_{n-1}\|^2 \tag{93-94}$$
$$\le \langle\nabla\bar g_{n-1}(\theta_{n-1}),\,\theta-\theta_{n-1}\rangle + \|\nabla h_{n-1}^{v_n}(\theta_{n-1})\|\,\|\theta-\theta_{n-1}\| + L\|\theta-\theta_{n-1}\|^2 + L\|\theta_n-\theta_{n-1}\|^2 \tag{95-96}$$
$$\le \langle\nabla\bar g_{n-1}(\theta_{n-1}),\,\theta-\theta_{n-1}\rangle + r_n\|\nabla h_{n-1}^{v_n}(\theta_{n-1})\| + 2Lr_n^2, \tag{97}$$
where the last line used $\|\theta_n-\theta_{n-1}\|\le r_n$ and $\|\theta-\theta_{n-1}\|\le b_n\le r_n$. Since the above holds for all $\theta\in\Theta$ with $\|\theta-\theta_{n-1}\|\le b_n$, we obtain
$$\langle\nabla\bar g_n(\theta_{n-1}),\,\theta_n-\theta_{n-1}\rangle \le \inf_{\theta\in\Theta,\,\|\theta-\theta_{n-1}\|\le b_n}\langle\nabla\bar g_{n-1}(\theta_{n-1}),\,\theta-\theta_{n-1}\rangle + r_n\|\nabla h_{n-1}^{v_n}(\theta_{n-1})\| + 2Lr_n^2. \tag{98}$$
Finally, notice that since $b_n\le1$, the convexity of $\Theta$ implies that $\theta_{n-1} + b_n(\theta-\theta_{n-1})\in\Theta$ for any $\theta\in\Theta$. Thus, if there exists $\theta\in\Theta$ with $\|\theta-\theta_{n-1}\|\le1$, there is $\theta'\in\Theta$ with $\|\theta'-\theta_{n-1}\|\le b_n$ such that the direction of $\theta'-\theta_{n-1}$ agrees with that of $\theta-\theta_{n-1}$. Therefore,
$$b_n\sup_{\theta\in\Theta,\,\|\theta-\theta_{n-1}\|\le1}\langle-\nabla\bar g_{n-1}(\theta_{n-1}),\,\theta-\theta_{n-1}\rangle = \sup_{\theta\in\Theta,\,\|\theta-\theta_{n-1}\|\le1}\langle-\nabla\bar g_{n-1}(\theta_{n-1}),\,b_n(\theta-\theta_{n-1})\rangle \tag{99}$$
$$\le \sup_{\theta\in\Theta,\,\|\theta-\theta_{n-1}\|\le b_n}\langle-\nabla\bar g_{n-1}(\theta_{n-1}),\,\theta-\theta_{n-1}\rangle. \tag{100}$$
Combining this with (98) completes the proof.

D. Convergence Rate Analysis

In this section we prove the convergence rate guarantees of Theorem 3.8.

D.1. The key lemma

First we state and prove Lemma D.1, which lies at the heart of our analysis. It allows us to relate the surrogate error gradient $\nabla\bar h_n(\theta_n)$ to the sequence of parameter differences $(\|\theta_n-\theta_{n-1}\|^2)$, which is known to be summable by Lemma C.6. It is important to note that we only use the recurrence of data sampling (Assumption 3.1) and the structure of the algorithm in the proof.

Lemma D.1 (Key lemma). Let $(c_n)_{n\ge1}$ be a non-increasing sequence of positive numbers. For any of the Cases 3.5-3.7,
$$\sum_{n=1}^{N} c_n\,\mathbb{E}\big\|\nabla\bar h_n(\theta_n)\big\| \le L\,t_\odot\Big(\sum_{n=1}^{N} c_n^2\Big)^{1/2}\mathbb{E}\Big[\Big(\sum_{n=1}^{N}\|\theta_n-\theta_{n-1}\|^2\Big)^{1/2}\Big]. \tag{101}$$

Proof. Fix some $v\in V$. We recall that $k_v(n)$ is the last time before $n$ that the sampling process visited $v$, and therefore the last time the surrogate $g_n^v$ was updated. We then have $g_n^v\in\mathcal{S}_L(f^v,\theta_{k_v(n)-1})$, so by the definition of first-order surrogates (Definition 2.1), $\nabla h_n^v(\theta_{k_v(n)-1}) = 0$. Combining this with the Lipschitz continuity of $\nabla h_n^v$, we get
$$c_n\|\nabla h_n^v(\theta_n)\| = c_n\big\|\nabla h_n^v(\theta_n) - \nabla h_n^v(\theta_{k_v(n)-1})\big\| \le Lc_n\|\theta_n-\theta_{k_v(n)-1}\| \le L\sum_{i=k_v(n)}^{n} c_n\|\theta_i-\theta_{i-1}\|. \tag{102}$$
Therefore, by the triangle inequality,
$$c_n\|\nabla\bar h_n(\theta_n)\| \le c_n\sum_{v\in V}\|\nabla h_n^v(\theta_n)\|\,\pi(v) \le L\sum_{v\in V}\Big(\sum_{i=k_v(n)}^{n} c_n\|\theta_i-\theta_{i-1}\|\Big)\pi(v). \tag{103}$$
For an integer $n$, let $p_v(n) = \inf\{j>n : v_j = v\}$ be the first time strictly after $n$ that the sampling algorithm visits $v$, and denote $a\wedge b := \min(a,b)$. Exchanging the order of summation (the term $i$ appears exactly for those $n\in[i,\,p_v(i)-1]$, since $k_v(n)\le i\le n$ if and only if $i\le n < p_v(i)$), and using that $(c_n)$ is non-increasing,
$$\sum_{n=1}^{N}\sum_{i=k_v(n)}^{n} c_n\|\theta_i-\theta_{i-1}\| \le \sum_{i=1}^{N}\|\theta_i-\theta_{i-1}\|\sum_{n=i}^{(p_v(i)-1)\wedge N} c_n \le \sum_{i=1}^{N} c_i\|\theta_i-\theta_{i-1}\|\,(p_v(i)-i) = \sum_{i=1}^{N} c_i\|\theta_i-\theta_{i-1}\|\,\tau_{i,v}. \tag{104-107}$$
So we get
$$\sum_{n=1}^{N} c_n\,\mathbb{E}\|\nabla\bar h_n(\theta_n)\| \le L\,\mathbb{E}\Big[\sum_{n=1}^{N} c_n\|\theta_n-\theta_{n-1}\|\sum_{v\in V}\tau_{n,v}\,\pi(v)\Big] = L\,\mathbb{E}\Big[\sum_{n=1}^{N} c_n\|\theta_n-\theta_{n-1}\|\sum_{v\in V}\mathbb{E}[\tau_{n,v}\mid\mathcal{F}_n]\,\pi(v)\Big]$$
$$\le L\,t_\odot\,\mathbb{E}\Big[\sum_{n=1}^{N} c_n\|\theta_n-\theta_{n-1}\|\Big] \le L\,t_\odot\Big(\sum_{n=1}^{N} c_n^2\Big)^{1/2}\mathbb{E}\Big[\Big(\sum_{n=1}^{N}\|\theta_n-\theta_{n-1}\|^2\Big)^{1/2}\Big],$$
with the second-to-last step using Assumption 3.1 and the last using Cauchy-Schwarz.

D.2. The constant proximal regularization case 3.5

In this section we prove Theorem A.1 for Case 3.5.

Proof of Theorem A.1 for Case 3.5. We first use the linearity of the inner product and the differentiability of the average surrogate error $\bar h_n$ to get
$$\big|\langle\nabla\bar g_n(\theta_n),\,\theta-\theta_n\rangle - \langle\nabla f(\theta_n),\,\theta-\theta_n\rangle\big| = \big|\langle\nabla\bar h_n(\theta_n),\,\theta-\theta_n\rangle\big| \le \|\nabla\bar h_n(\theta_n)\|\,\|\theta-\theta_n\| \le \|\nabla\bar h_n(\theta_n)\| \tag{113}$$
for all $\theta\in\Theta$ with $\|\theta-\theta_n\|\le1$. It then follows from the triangle inequality, taking suprema and then expectations, that
$$\mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla f(\theta_n),\,\theta-\theta_n\rangle\Big] \le \mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla\bar g_n(\theta_n),\,\theta-\theta_n\rangle\Big] + \mathbb{E}\big\|\nabla\bar h_n(\theta_n)\big\|. \tag{114}$$
Our goal is to control the sum of the right-hand side. We first address the first term on the right-hand side of (114). Recall that in this case we use constant proximal regularization, so $\rho_n\equiv\rho$ for some $\rho\ge0$. For any $\theta\in\Theta$,
$$\langle\nabla\bar g_n(\theta_n) + \rho(\theta_n-\theta_{n-1}),\,\theta-\theta_n\rangle \ge 0, \tag{115}$$
since $\theta_n$ is a minimizer of $\bar g_n(\theta) + \frac{\rho}{2}\|\theta-\theta_{n-1}\|^2$ over $\Theta$. Then
$$\langle-\nabla\bar g_n(\theta_n),\,\theta-\theta_n\rangle \le \rho\langle\theta_n-\theta_{n-1},\,\theta-\theta_n\rangle \le \rho\|\theta_n-\theta_{n-1}\|\,\|\theta-\theta_n\| \le \rho\|\theta_n-\theta_{n-1}\| \tag{116}$$
for any $\theta\in\Theta$ with $\|\theta-\theta_n\|\le1$. Therefore,
$$\sum_{n=1}^{N}\mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla\bar g_n(\theta_n),\,\theta-\theta_n\rangle\Big] \le \rho\sum_{n=1}^{N}\mathbb{E}\|\theta_n-\theta_{n-1}\| \le \rho\sqrt N\,\mathbb{E}\Big[\Big(\sum_{n=1}^{N}\|\theta_n-\theta_{n-1}\|^2\Big)^{1/2}\Big], \tag{117-118}$$
where we used the Cauchy-Schwarz inequality in the last step. By Lemma C.6,
$$\Big(\sum_{n=1}^{N}\|\theta_n-\theta_{n-1}\|^2\Big)^{1/2} \le \sqrt{\frac{2\Delta_0}{\rho+\mu}} \tag{119}$$
almost surely. Thus,
$$\sum_{n=1}^{N}\mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla\bar g_n(\theta_n),\,\theta-\theta_n\rangle\Big] \le \rho\sqrt{\frac{2\Delta_0}{\rho+\mu}}\,\sqrt N. \tag{120}$$
We now turn to the second term on the right-hand side of (114).
By Lemma D.1 with $c_n\equiv1$ and Lemma C.6,
$$\sum_{n=1}^{N}\mathbb{E}\|\nabla\bar h_n(\theta_n)\| \le L\,t_\odot\sqrt N\,\mathbb{E}\Big[\Big(\sum_{n=1}^{N}\|\theta_n-\theta_{n-1}\|^2\Big)^{1/2}\Big] \le L\,t_\odot\sqrt{\frac{2\Delta_0}{\rho+\mu}}\,\sqrt N. \tag{121}$$
Now, summing both sides of (114) and using (120) and (121),
$$\sum_{n=1}^{N}\mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla f(\theta_n),\,\theta-\theta_n\rangle\Big] \le \sqrt{2\Delta_0 N}\,\Big(\frac{\rho}{\sqrt{\rho+\mu}} + \frac{L\,t_\odot}{\sqrt{\rho+\mu}}\Big), \tag{122-123}$$
so that
$$\min_{1\le n\le N}\mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla f(\theta_n),\,\theta-\theta_n\rangle\Big] \le \sqrt{\frac{2\Delta_0}{N}}\,\Big(\frac{\rho}{\sqrt{\rho+\mu}} + \frac{L\,t_\odot}{\sqrt{\rho+\mu}}\Big). \tag{124}$$

D.3. The dynamic proximal regularization case 3.6

In this section we prove Theorem A.1 for Case 3.6. Recall the definition of $t_{\mathrm{cov}}$ from (51). Before proving the theorem, we introduce a lemma adapted from Lemma A.5 of (Even, 2023). It is used to bound the expected sum of the first $N$ dynamic regularization parameters $\rho_n$ in terms of $t_{\mathrm{cov}}$.

Lemma D.2. Let $a_n = \max_{v\in V}(n - k_v(n))$. Then
$$\sum_{n=1}^{N}\mathbb{E}[a_n] \le N\,t_{\mathrm{cov}}. \tag{125}$$

Proof. Since $a_n\le n-1$, we have
$$\sum_{n=1}^{N}\mathbb{E}[a_n] = \sum_{n=1}^{N}\sum_{i=1}^{\infty}\mathbb{P}(a_n\ge i). \tag{126}$$
Let $b_n = \max_{v\in V}\tau_{n,v}$. We note that if $a_n\ge i$ then there is $v\in V$ with $v_j\ne v$ for all $n-i<j\le n$, and so $\tau_{n-i,v}\ge i$. So we have the inclusion $\{a_n\ge i\}\subseteq\{b_{n-i}\ge i\}$. Therefore,
$$\sum_{n=1}^{N}\sum_{i=1}^{\infty}\mathbb{P}(a_n\ge i) = \sum_{i=1}^{\infty}\sum_{n=i+1}^{N}\mathbb{P}(a_n\ge i) \le \sum_{i=1}^{\infty}\sum_{n=i+1}^{N}\mathbb{P}(b_{n-i}\ge i) \le \sum_{s=1}^{N}\sum_{t=1}^{\infty}\mathbb{P}(b_s\ge t) \le N\sup_{n\ge1}\mathbb{E}[b_n]. \tag{127-131}$$
Since $\mathbb{E}[b_n] = \mathbb{E}\big[\mathbb{E}[\max_{v\in V}\tau_{n,v}\mid\mathcal{F}_n]\big] \le t_{\mathrm{cov}}$, we are done.

Proof of Theorem A.1 for Case 3.6. The proof in this case follows the same strategy as Case 3.5, but it is slightly more complicated due to the randomness of the dynamic proximal regularization parameter. Define $\delta_n := \bar g_{n-1}(\theta_{n-1}) - \bar g_n(\theta_n)$. By optimality of $\theta_n$ and Lemma C.3,
$$\bar g_n(\theta_n) + \frac{\rho_n}{2}\|\theta_n-\theta_{n-1}\|^2 \le \bar g_{n-1}(\theta_{n-1}), \tag{133}$$
so $\|\theta_n-\theta_{n-1}\| \le \sqrt{2\rho_n^{-1}\delta_n}$. Using similar reasoning as in the proof for Case 3.5,
$$\langle-\nabla\bar g_n(\theta_n),\,\theta-\theta_n\rangle \le \rho_n\|\theta_n-\theta_{n-1}\| \le \sqrt{2\rho_n\delta_n}. \tag{134}$$
Using Cauchy-Schwarz twice,
$$\sum_{n=1}^{N}\mathbb{E}\big[\sqrt{2\rho_n\delta_n}\big] \le \sqrt2\sum_{n=1}^{N}\big(\mathbb{E}[\rho_n]\big)^{1/2}\big(\mathbb{E}[\delta_n]\big)^{1/2} \le \sqrt2\Big(\sum_{n=1}^{N}\mathbb{E}[\rho_n]\Big)^{1/2}\Big(\sum_{n=1}^{N}\mathbb{E}[\delta_n]\Big)^{1/2} \le \sqrt{2N(\rho+t_{\mathrm{cov}})\,\Delta_0}. \tag{135-137}$$
The last inequality uses $\rho_n = \rho + \max_{v\in V}(n-k_v(n))$ together with Lemma D.2, as well as $\sum_{n=1}^{N}\delta_n\le\Delta_0$ a.s. It follows that
$$\sum_{n=1}^{N}\mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla\bar g_n(\theta_n),\,\theta-\theta_n\rangle\Big] \le \sqrt{2N(\rho+t_{\mathrm{cov}})\,\Delta_0}. \tag{138}$$
To handle the gradient error $\nabla\bar h_n(\theta_n)$, we first use Lemma C.6 and $\rho_n\ge\rho$ to conclude
$$\sum_{n=1}^{\infty}\frac{\rho+\mu}{2}\|\theta_n-\theta_{n-1}\|^2 \le \sum_{n=1}^{\infty}\frac{\rho_n+\mu}{2}\|\theta_n-\theta_{n-1}\|^2 \le \Delta_0 \tag{139}$$
almost surely. It then follows from Lemma D.1 that
$$\sum_{n=1}^{N}\mathbb{E}\|\nabla\bar h_n(\theta_n)\| \le L\,t_\odot\sqrt N\,\sqrt{\frac{2\Delta_0}{\rho+\mu}}. \tag{140}$$
Finally, combining (138) and (140), we get
$$\sum_{n=1}^{N}\mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla f(\theta_n),\,\theta-\theta_n\rangle\Big] \le \sqrt{2\Delta_0 N}\,\Big(\sqrt{\rho+t_{\mathrm{cov}}} + \frac{L\,t_\odot}{\sqrt{\rho+\mu}}\Big), \tag{141}$$
and so we deduce
$$\min_{1\le n\le N}\mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla f(\theta_n),\,\theta-\theta_n\rangle\Big] \le \sqrt{\frac{2\Delta_0}{N}}\,\Big(\sqrt{\rho+t_{\mathrm{cov}}} + \frac{L\,t_\odot}{\sqrt{\rho+\mu}}\Big). \tag{142}$$
We complete the proof by substituting the bound for $t_{\mathrm{cov}}$ from Proposition C.1.
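For implementation purposes, the dynamic parameter $\rho_n = \rho + \max_{v\in V}(n - k_v(n))$ only requires tracking the last visit time of each index. A minimal sketch (the data structures are our own choice, not code from the paper):

```python
def dynamic_rho(last_visit, n, rho):
    """rho_n = rho + max_v (n - k_v(n)) for Case 3.6, where last_visit[v] is
    k_v(n), the most recent time at which index v was sampled."""
    return rho + max(n - k for k in last_visit.values())

# usage at step n: compute rho_n = dynamic_rho(last_visit, n, rho), perform
# the regularized surrogate minimization, then set last_visit[v_n] = n
```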
D.4. The diminishing radius case 3.7

Here we prove Theorem A.1 for Case 3.7.

Proof of Theorem A.1 for Case 3.7. Similar to the proof in Case 3.5, we have
$$\mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla f(\theta_n),\,\theta-\theta_n\rangle\Big] \le \mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla\bar g_n(\theta_n),\,\theta-\theta_n\rangle\Big] + \mathbb{E}\|\nabla\bar h_n(\theta_n)\|. \tag{143}$$
Let $b_n = r_n\wedge1$. Then by Lemma C.9,
$$\sum_{n=1}^{N} b_{n+1}\,\mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla\bar g_n(\theta_n),\,\theta-\theta_n\rangle\Big] \le \sum_{n=1}^{N}\mathbb{E}\big[-\langle\nabla\bar g_{n+1}(\theta_n),\,\theta_{n+1}-\theta_n\rangle\big] + \sum_{n=1}^{N} r_{n+1}\,\mathbb{E}\big[\|\nabla h_n^{v_{n+1}}(\theta_n)\|\big] + \sum_{n=1}^{N} 2Lr_{n+1}^2. \tag{144-145}$$
Because $h_n^{v_{n+1}}$ is non-negative and has an $L$-Lipschitz continuous gradient, $\|\nabla h_n^{v_{n+1}}(\theta_n)\| \le \sqrt{2L\,h_n^{v_{n+1}}(\theta_n)}$ (see Lemma F.2). Then
$$\sum_{n=1}^{N} r_{n+1}\,\mathbb{E}\big[\|\nabla h_n^{v_{n+1}}(\theta_n)\|\big] \le \sum_{n=1}^{N} r_{n+1}\,\mathbb{E}\Big[\sqrt{2L\,h_n^{v_{n+1}}(\theta_n)}\Big] \le \Big(\sum_{n=1}^{N}\mathbb{E}\big[2L\,h_n^{v_{n+1}}(\theta_n)\big]\Big)^{1/2}\Big(\sum_{n=1}^{N} r_{n+1}^2\Big)^{1/2} \le \sqrt{\frac{2L\Delta_0}{\pi_{\min}}\sum_{n=1}^{N} r_{n+1}^2}. \tag{146-148}$$
Here the second step used Cauchy-Schwarz and then Jensen's inequality to move the square inside the expectation, and the last step used Lemma C.4. From Proposition C.8,
$$\sum_{n=1}^{N}\mathbb{E}\big[-\langle\nabla\bar g_{n+1}(\theta_n),\,\theta_{n+1}-\theta_n\rangle\big] \le \Delta_0 + L\sum_{n=1}^{N} r_{n+1}^2, \tag{149}$$
so from (145),
$$\sum_{n=1}^{N} b_{n+1}\,\mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla\bar g_n(\theta_n),\,\theta-\theta_n\rangle\Big] \le \Delta_0 + \sqrt{\frac{2L\Delta_0}{\pi_{\min}}\sum_{n=1}^{N} r_{n+1}^2} + 3L\sum_{n=1}^{N} r_{n+1}^2. \tag{150}$$
From Lemma D.1 with $c_n = r_{n+1}$,
$$\sum_{n=1}^{N} r_{n+1}\,\mathbb{E}\|\nabla\bar h_n(\theta_n)\| \le L\,t_\odot\Big(\sum_{n=1}^{N} r_{n+1}^2\Big)^{1/2}\mathbb{E}\Big[\Big(\sum_{n=1}^{N}\|\theta_n-\theta_{n-1}\|^2\Big)^{1/2}\Big]. \tag{151}$$
Since $\|\theta_n-\theta_{n-1}\|\le r_n$ and $(r_n)$ is non-increasing, this bound reduces to
$$\sum_{n=1}^{N} r_{n+1}\,\mathbb{E}\|\nabla\bar h_n(\theta_n)\| \le L\,t_\odot\sum_{n=1}^{N} r_n^2. \tag{152}$$
Combining (150) and (152) with (143), we have
$$\sum_{n=1}^{N} b_{n+1}\,\mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla f(\theta_n),\,\theta-\theta_n\rangle\Big] \le \Delta_0 + \sqrt{\frac{2L\Delta_0}{\pi_{\min}}\sum_{n=1}^{N} r_n^2} + (3+t_\odot)\,L\sum_{n=1}^{N} r_n^2, \tag{153}$$
so that
$$\min_{1\le n\le N}\mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla f(\theta_n),\,\theta-\theta_n\rangle\Big] \le \frac{\Delta_0 + \sqrt{\frac{2L\Delta_0}{\pi_{\min}}\sum_{n=1}^{N} r_n^2} + (3+t_\odot)\,L\sum_{n=1}^{N} r_n^2}{\sum_{n=1}^{N} b_n}. \tag{154}$$

D.5. Proofs of Corollaries

In this section, we prove Corollaries A.2 and A.3.

Proof of Corollary A.2. Fix $N$ and let $k$ be the integer achieving the minimum in (25). If $\Theta = \mathbb{R}^p$, then we may choose $\theta$ so that $\theta - \theta_k = -\frac{\nabla f(\theta_k)}{\|\nabla f(\theta_k)\|}$. Thus,
$$\min_{1\le n\le N}\mathbb{E}[\|\nabla f(\theta_n)\|] \le \mathbb{E}[\|\nabla f(\theta_k)\|] = \mathbb{E}\big[\langle-\nabla f(\theta_k),\,\theta-\theta_k\rangle\big] \le \mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_k\|\le1}\langle-\nabla f(\theta_k),\,\theta-\theta_k\rangle\Big] = O(N^{-1/2}). \tag{155-156}$$
If instead $\Theta \ne \mathbb{R}^p$ but the second condition $\mathrm{dist}(\theta_k,\partial\Theta)\ge c$ holds, we can take $\theta - \theta_k = -c\,\frac{\nabla f(\theta_k)}{\|\nabla f(\theta_k)\|}$. In doing so we obtain
$$\min_{1\le n\le N}\mathbb{E}[\|\nabla f(\theta_n)\|] \le \mathbb{E}[\|\nabla f(\theta_k)\|] = \frac1c\,\mathbb{E}\big[\langle-\nabla f(\theta_k),\,\theta-\theta_k\rangle\big] \le \frac1c\,\mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_k\|\le1}\langle-\nabla f(\theta_k),\,\theta-\theta_k\rangle\Big] = O(N^{-1/2}). \tag{157-158}$$
This shows (30). The proof of (31) is similar.

Proof of Corollary A.3. The convergence rates of Theorem 3.8 are of order $O(N^{-1/2})$ for Algorithm 1, so we can prove (i) by choosing $N$ large enough that $N^{-1/2}\le\varepsilon$ up to the implied constant. If we take $r_n = \frac{1}{\sqrt n\,\log n}$ in Algorithm 2, then its corresponding convergence rate in Theorem 3.8 is of order $O(n^{-1/2}\log n)$. Then (ii) follows by using the fact that $n\ge\varepsilon^{-2}(3\log\varepsilon^{-1})^2$ implies $n^{-1/2}\log n\le\varepsilon$ for sufficiently small $\varepsilon$. Indeed, since $\frac{\log x}{\sqrt x}$ is decreasing for sufficiently large $x$, we have for $n\ge\varepsilon^{-2}(3\log\varepsilon^{-1})^2$,
$$n^{-1/2}\log n \le \frac{\varepsilon}{3\log\varepsilon^{-1}}\big(2\log\varepsilon^{-1} + 2\log(3\log\varepsilon^{-1})\big) \le \varepsilon \tag{159}$$
for $\varepsilon$ sufficiently small.

E. Asymptotic Analysis

We use this section to prove Theorem 3.9. Recall that by Proposition C.1 there are constants $C_1$ and $C_2$ with $\sup_{n\ge1}\|\mathbb{E}[\tau_{n,v}^2\mid\mathcal{F}_n]\|_\infty \le C_1$ and $\sup_{n\ge1}\|\mathbb{E}[\tau_{n,v}^4\mid\mathcal{F}_n]\|_\infty \le C_2$ for each $v\in V$. Accordingly, we let $\mu_2 = \max_{v\in V}\sup_{n\ge1}\|\mathbb{E}[\tau_{n,v}^2\mid\mathcal{F}_n]\|_\infty$ and $\mu_4 = \max_{v\in V}\sup_{n\ge1}\|\mathbb{E}[\tau_{n,v}^4\mid\mathcal{F}_n]\|_\infty$.

The first lemma of this section states that the first and second moments of the dynamic regularization parameter $\rho_n$ in Algorithm 1 are uniformly bounded. While $\rho_n$ only appears in Algorithm 1, the random variable $\max_{v\in V}(n-k_v(n))$ is also present in the analysis of Algorithm 2; therefore, this lemma is used in the analysis of both algorithms in this section.

Lemma E.1. Assume Assumption 3.1. Then there is a constant $C>0$ such that
$$\sup_{n\ge1}\mathbb{E}[\rho_n] + \sup_{n\ge1}\mathbb{E}[\rho_n^2] \le C. \tag{160}$$

Proof. Fix $v\in V$. For a positive integer $j$ we have
$$\{n-k_v(n)\ge j\} = \{k_v(n)\le n-j\} \subseteq \{\tau_{n-j,v}\ge j\}. \tag{161}$$
Therefore, we get
$$\mathbb{E}[n-k_v(n)] = \sum_{j=1}^{\infty}\mathbb{P}(n-k_v(n)\ge j) \le \sum_{j=1}^{\infty}\mathbb{P}(\tau_{n-j,v}\ge j) \le \sum_{j=1}^{\infty}\frac{\mathbb{E}[\tau_{n-j,v}^2]}{j^2} \le \mu_2\sum_{j=1}^{\infty} j^{-2}, \tag{162-165}$$
since $\mathbb{E}[\tau_{n-j,v}^2]\le\mu_2$. Finally,
$$\mathbb{E}[\rho_n] = \rho + \mathbb{E}\Big[\max_{v\in V}(n-k_v(n))\Big] \le \rho + \sum_{v\in V}\mathbb{E}[n-k_v(n)] \le \rho + |V|\,\mu_2\sum_{j=1}^{\infty} j^{-2}. \tag{166}$$
To bound the second moment we follow the same approach:
$$\mathbb{E}\big[(n-k_v(n))^2\big] = \sum_{j=1}^{\infty}\mathbb{P}\big((n-k_v(n))^2\ge j\big) \le \sum_{j=1}^{\infty}\mathbb{P}(\tau_{n-j,v}^2\ge j) \le \sum_{j=1}^{\infty}\frac{\mathbb{E}[\tau_{n-j,v}^4]}{j^2} \le \mu_4\sum_{j=1}^{\infty} j^{-2}. \tag{167-170}$$
The proof is completed by mimicking the last line of the proof bounding the first moment.

E.1. The dynamic proximal regularization case 3.6

Here we prove Theorem 3.9 (i). Our first lemma, Lemma E.2, is similar to Lemma D.1 and is key to showing that $\nabla\bar h_n(\theta_n)\to0$. The difference is that we must deal with $\|\theta_n-\theta_{k_v(n)-1}\|^2$ instead of $\|\theta_n-\theta_{k_v(n)-1}\|$. In order to relate this to the sequence of one-step iterate differences $(\|\theta_n-\theta_{n-1}\|^2)$, we need to use the Cauchy-Schwarz inequality, which introduces a dependence on $\mu_2$ as well as $t_{\mathrm{hit}}$.

Lemma E.2. Let $(\theta_n)_{n\ge0}$ be an output of Algorithm 1. Assume Case 3.6. Then
$$\sum_{n=1}^{\infty}\mathbb{E}[\bar h_n(\theta_n)] < \infty. \tag{171}$$

Proof. Since $\bar h_n(\theta_n) = \sum_{v\in V} h_n^v(\theta_n)\,\pi(v)$ and $V$ is finite, it is sufficient to show $\sum_{n=1}^{\infty}\mathbb{E}[h_n^v(\theta_n)] < \infty$ for each $v\in V$. Before starting, recall that by Lemma C.6 and $\rho_n\ge\rho>0$,
$$\sum_{n=1}^{\infty}\frac{\rho_n+\mu}{2}\|\theta_n-\theta_{n-1}\|^2 \le \Delta_0 \qquad\text{and}\qquad \sum_{n=1}^{\infty}\|\theta_n-\theta_{n-1}\|^2 \le \frac{2\Delta_0}{\rho+\mu}$$
almost surely.
This implies
$$\mathbb{E}\Big[\sum_{n=1}^{\infty}\rho_n\|\theta_n-\theta_{n-1}\|^2\Big] \le 2\Delta_0 \qquad\text{and}\qquad \mathbb{E}\Big[\sum_{n=1}^{\infty}\|\theta_n-\theta_{n-1}\|^2\Big] \le \frac{2\Delta_0}{\rho+\mu}. \tag{172-173}$$
Fix $v\in V$. For each $n$ we have $g_n^v\in\mathcal{S}_L(f^v,\theta_{k_v(n)-1})$. Then, using Proposition C.2, the triangle inequality, and Cauchy-Schwarz,
$$|h_n^v(\theta_n)| \le \frac L2\|\theta_n-\theta_{k_v(n)-1}\|^2 \le \frac L2\,(n-k_v(n)+1)\sum_{i=k_v(n)}^{n}\|\theta_i-\theta_{i-1}\|^2. \tag{174}$$
Let $B_n = (n-k_v(n)+1)\sum_{i=k_v(n)}^{n}\|\theta_i-\theta_{i-1}\|^2$. We claim that $\mathbb{E}[\sum_{n=1}^{\infty} B_n] < \infty$. As in Lemma D.1, let $p_v(n) = \inf\{j>n : v_j = v\}$ be the next time strictly after $n$ that the sampling algorithm visits $v$. Exchanging the order of summation, we have
$$\sum_{n=1}^{\infty}(n-k_v(n)+1)\sum_{i=k_v(n)}^{n}\|\theta_i-\theta_{i-1}\|^2 \le \sum_{i=1}^{\infty}\|\theta_i-\theta_{i-1}\|^2\sum_{n=i}^{p_v(i)-1}(n-k_v(n)+1) = \sum_{i=1}^{\infty}\|\theta_i-\theta_{i-1}\|^2\sum_{n=i}^{\infty}(n-k_v(n)+1)\,\mathbf{1}(p_v(i)>n).$$
The equality $\{p_v(i)>n\} = \{\tau_{i,v}\ge n-i+1\}$ holds, as both are equal to the event that $v_j\ne v$ for $i<j\le n$. Moreover, $k_v(n) = k_v(i)$ on $\{p_v(i)>n\}$, since there is no visit to $v$ between times $i$ and $n$. Therefore, taking expectations and conditioning,
$$\mathbb{E}\Big[\sum_{n=1}^{\infty} B_n\Big] \le \mathbb{E}\Big[\sum_{i=1}^{\infty}\|\theta_i-\theta_{i-1}\|^2\sum_{n=i}^{\infty}(n-k_v(i)+1)\,\mathbb{P}(\tau_{i,v}\ge n-i+1\mid\mathcal{F}_i)\Big],$$
where we used that $\theta_i$, $\theta_{i-1}$, and $k_v(i)$ are all measurable with respect to $\mathcal{F}_i$. We have
$$\sum_{n=i}^{\infty}(n-k_v(i)+1)\,\mathbb{P}(\tau_{i,v}\ge n-i+1\mid\mathcal{F}_i) \le \sum_{n=i}^{\infty}(n-i+1)\,\mathbb{P}(\tau_{i,v}\ge n-i+1\mid\mathcal{F}_i) + (i-k_v(i))\sum_{n=i}^{\infty}\mathbb{P}(\tau_{i,v}\ge n-i+1\mid\mathcal{F}_i) \tag{180-181}$$
$$\le \mathbb{E}[\tau_{i,v}^2\mid\mathcal{F}_i] + (i-k_v(i))\,\mathbb{E}[\tau_{i,v}\mid\mathcal{F}_i] \le \mu_2 + (i-k_v(i))\,t_{\mathrm{hit}}. \tag{182-183}$$
Therefore, using $i - k_v(i) \le \rho_i$,
$$\mathbb{E}\Big[\sum_{n=1}^{\infty} B_n\Big] \le \mu_2\,\mathbb{E}\Big[\sum_{i=1}^{\infty}\|\theta_i-\theta_{i-1}\|^2\Big] + t_{\mathrm{hit}}\,\mathbb{E}\Big[\sum_{i=1}^{\infty}(i-k_v(i))\,\|\theta_i-\theta_{i-1}\|^2\Big] \le \frac{2\mu_2\Delta_0}{\rho+\mu} + t_{\mathrm{hit}}\,\mathbb{E}\Big[\sum_{i=1}^{\infty}\rho_i\|\theta_i-\theta_{i-1}\|^2\Big] < \infty.$$
This shows $\mathbb{E}[\sum_{n=1}^{\infty} B_n] < \infty$. The proof is completed by using Fubini's theorem and (174) to conclude
$$\sum_{n=1}^{\infty}\mathbb{E}[h_n^v(\theta_n)] = \mathbb{E}\Big[\sum_{n=1}^{\infty} h_n^v(\theta_n)\Big] \le \frac L2\,\mathbb{E}\Big[\sum_{n=1}^{\infty} B_n\Big] < \infty.$$
This completes the proof.

We now prove Theorem 3.9 (i).

Proof of Theorem 3.9 (i). Starting as in the proofs of Theorem 3.8,
$$\mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla f(\theta_n),\,\theta-\theta_n\rangle\Big] \le \mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla\bar g_n(\theta_n),\,\theta-\theta_n\rangle\Big] + \mathbb{E}\|\nabla\bar h_n(\theta_n)\|. \tag{188}$$
Using the same argument as in the proof of Theorem 3.8 for Case 3.6,
$$\mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla\bar g_n(\theta_n),\,\theta-\theta_n\rangle\Big] \le \mathbb{E}\big[\rho_n\|\theta_n-\theta_{n-1}\|\big] \le \mathbb{E}\big[\sqrt{2\rho_n\delta_n}\big], \tag{189-190}$$
where $\delta_n = \bar g_{n-1}(\theta_{n-1}) - \bar g_n(\theta_n)$. By Cauchy-Schwarz and Lemma E.1,
$$\mathbb{E}\big[\sqrt{2\rho_n\delta_n}\big] \le \sqrt{2\,\mathbb{E}[\rho_n]\,\mathbb{E}[\delta_n]} \le C\sqrt{\mathbb{E}[\delta_n]} \tag{191}$$
for some $C>0$ independent of $n$. By Jensen's inequality and Lemma F.2,
$$\mathbb{E}[\|\nabla\bar h_n(\theta_n)\|] \le \sqrt{\mathbb{E}[\|\nabla\bar h_n(\theta_n)\|^2]} \le \sqrt{2L\,\mathbb{E}[\bar h_n(\theta_n)]}. \tag{192}$$
Moreover,
$$\sum_{n=1}^{\infty}\mathbb{E}[\delta_n] = \sum_{n=1}^{\infty}\big(\mathbb{E}[\bar g_{n-1}(\theta_{n-1})] - \mathbb{E}[\bar g_n(\theta_n)]\big) \le \Delta_0 < \infty, \tag{193}$$
so $\mathbb{E}[\delta_n]\to0$ as $n\to\infty$. Also, $\sqrt{\mathbb{E}[\bar h_n(\theta_n)]}\to0$ by Lemma E.2. Therefore,
$$\limsup_{n\to\infty}\mathbb{E}\Big[\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla f(\theta_n),\,\theta-\theta_n\rangle\Big] \le \limsup_{n\to\infty}\Big(C\sqrt{\mathbb{E}[\delta_n]} + \sqrt{2L\,\mathbb{E}[\bar h_n(\theta_n)]}\Big) = 0. \tag{194-195}$$
We follow a similar approach to show that
$$\mathbb{E}\Big[\Big(\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla f(\theta_n),\,\theta-\theta_n\rangle\Big)^2\Big] \to 0. \tag{196}$$
Notice that the sub-optimality measure $\sup_{\theta\in\Theta,\|\theta-\theta_n\|\le1}\langle-\nabla f(\theta_n),\theta-\theta_n\rangle$ is always non-negative, since we can take $\theta=\theta_n$. Then from the inequality
$$\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla f(\theta_n),\,\theta-\theta_n\rangle \le \sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla\bar g_n(\theta_n),\,\theta-\theta_n\rangle + \|\nabla\bar h_n(\theta_n)\| \tag{197}$$
and Cauchy-Schwarz, we get
$$\Big(\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla f(\theta_n),\,\theta-\theta_n\rangle\Big)^2 \le 2\Big(\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla\bar g_n(\theta_n),\,\theta-\theta_n\rangle\Big)^2 + 2\|\nabla\bar h_n(\theta_n)\|^2. \tag{198}$$
Mimicking the proof above and using Lemma E.1,
$$\mathbb{E}\Big[\Big(\sup_{\theta\in\Theta,\,\|\theta-\theta_n\|\le1}\langle-\nabla\bar g_n(\theta_n),\,\theta-\theta_n\rangle\Big)^2\Big] \le \mathbb{E}\big[\rho_n^2\|\theta_n-\theta_{n-1}\|^2\big] \le \mathbb{E}[2\rho_n\delta_n] \le 2\sqrt{\mathbb{E}[\rho_n^2]\,\mathbb{E}[\delta_n^2]} \le C\sqrt{\mathbb{E}[\delta_n^2]}. \tag{199-201}$$
Since $\sum_{n=1}^{\infty}\delta_n\le\Delta_0$, we have $\delta_n\to0$ almost surely. Therefore, as $\delta_n$ is uniformly bounded, an application of the dominated convergence theorem shows $\mathbb{E}[\delta_n^2]\to0$. Using Lemmas E.2 and F.2 again to show $\mathbb{E}[\|\nabla\bar h_n(\theta_n)\|^2]\to0$ completes the proof.

Remark E.3. The proof of Lemma E.2 demonstrates one of the main difficulties in proving asymptotic convergence under constant proximal regularization.
In particular, our techniques require us to show that
$$\mathbb{E}\Big[\sum_{i=1}^{\infty}(i-k_v(i))\,\|\theta_i-\theta_{i-1}\|^2\Big] < \infty. \tag{202}$$
The term $i-k_v(i)$ appears as the residual when we swap $n-k_v(i)+1$ for $n-i+1$ in order to show
$$\sum_{n=i}^{\infty}(n-k_v(i)+1)\,\mathbb{P}(\tau_{i,v}\ge n-i+1\mid\mathcal{F}_i) \le \mu_2 + (i-k_v(i))\,t_{\mathrm{hit}}.$$
To avoid this, an idea is to notice that $\tau_{i,v}\ge n-i+1$ only if $\tau_{k_v(i),v}\ge n-k_v(i)+1$, and to instead compute
$$\sum_{n=i}^{\infty}(n-k_v(i)+1)\,\mathbb{P}\big(\tau_{k_v(i),v}\ge n-k_v(i)+1\mid\mathcal{F}_{k_v(i)}\big).$$
The problem here is that $k_v(i)$ is not a stopping time, so, among other things, the $\sigma$-algebra $\mathcal{F}_{k_v(i)}$ may not be well defined. Intuitively, $i-k_v(i)$ represents a gap in knowledge, since we must wait until time $i$ to know the last time $v$ was visited. The use of dynamic proximal regularization bakes (202) into the algorithm. Lemmas E.1 and C.6 suggest that it may still be true with constant proximal regularization: if $(i-k_v(i))$ and $\|\theta_i-\theta_{i-1}\|^2$ were independent, then
$$\mathbb{E}\Big[\sum_{i=1}^{\infty}(i-k_v(i))\,\|\theta_i-\theta_{i-1}\|^2\Big] = \sum_{i=1}^{\infty}\mathbb{E}[i-k_v(i)]\,\mathbb{E}[\|\theta_i-\theta_{i-1}\|^2] \le C\sum_{i=1}^{\infty}\mathbb{E}[\|\theta_i-\theta_{i-1}\|^2] < \infty.$$
However, $k_v(i)$, $\theta_i$, and $\theta_{i-1}$ are all determined by the behavior of the sampling process, so we do not have this independence. We will see in the next subsection that diminishing radius overcomes this issue by bounding the difference $\|\theta_i-\theta_{i-1}\|^2$ by a deterministic quantity.

E.2. The diminishing radius case 3.7

Here we prove Theorem 3.9 (ii). Lemma E.4 is an analogue of Lemma E.2 for the diminishing radius case. The remaining argument is similar to those used in (Lyu & Li, 2023) and (Lyu, 2023) to analyze block majorization-minimization and SMM with diminishing radius, respectively.

Lemma E.4. Let $(\theta_n)_{n\ge0}$ be an output of Algorithm 2. Assume Case 3.7. Then almost surely
$$\sum_{n=1}^{\infty}\bar h_n(\theta_n) < \infty. \tag{203}$$

Proof. The strategy here is nearly the same as in Lemma E.2, except that we use $\|\theta_n-\theta_{n-1}\|\le r_n$. Again, it is sufficient to show $\sum_{n=1}^{\infty} h_n^v(\theta_n) < \infty$ almost surely for each $v\in V$. Fixing $v$, we have $g_n^v\in\mathcal{S}_L(f^v,\theta_{k_v(n)-1})$. Then Proposition C.2, the triangle inequality, and Cauchy-Schwarz give us
$$|h_n^v(\theta_n)| \le \frac L2\|\theta_n-\theta_{k_v(n)-1}\|^2 \le \frac L2\,(n-k_v(n)+1)\sum_{i=k_v(n)}^{n}\|\theta_i-\theta_{i-1}\|^2 \tag{204}$$
$$\le \frac L2\,(n-k_v(n)+1)\sum_{i=k_v(n)}^{n} r_i^2. \tag{205}$$
Let $B_n = (n-k_v(n)+1)\sum_{i=k_v(n)}^{n} r_i^2$. We mimic the proof of Lemma E.2 with $r_i^2$ in place of $\|\theta_i-\theta_{i-1}\|^2$ to conclude
$$\mathbb{E}\Big[\sum_{n=1}^{\infty} B_n\Big] \le \mu_2\sum_{i=1}^{\infty} r_i^2 + t_{\mathrm{hit}}\,\mathbb{E}\Big[\sum_{i=1}^{\infty}(i-k_v(i))\,r_i^2\Big]. \tag{206}$$
The first term on the right-hand side is finite by Assumption 3.4. Moreover, by Lemma E.1 and Fubini's theorem,
$$\mathbb{E}\Big[\sum_{i=1}^{\infty}(i-k_v(i))\,r_i^2\Big] \le \sum_{i=1}^{\infty}\mathbb{E}[\rho_i]\,r_i^2 \le C\sum_{i=1}^{\infty} r_i^2 < \infty. \tag{207}$$
It then follows that $\sum_{n=1}^{\infty} h_n^v(\theta_n)$ is finite almost surely.

Proposition E.5. Assume Case 3.7 and let $b_n = \min\{1, r_n\}$. Suppose there exists a sequence $(n_k)_{k\ge1}$ such that almost surely either
$$\sum_{k=1}^{\infty}\|\theta_{n_k+1}-\theta_{n_k}\| = \infty \qquad\text{or}\qquad \liminf_{k\to\infty}\frac{\big|\langle\nabla\bar g_{n_k+1}(\theta_{n_k}),\,\theta_{n_k+1}-\theta_{n_k}\rangle\big|}{b_{n_k+1}} = 0. \tag{209}$$
Then there exists a further subsequence $(m_k)_{k\ge1}$ of $(n_k)_{k\ge1}$ such that $\theta_\infty := \lim_{k\to\infty}\theta_{m_k}$ exists almost surely and $\theta_\infty$ is a stationary point of $f$ over $\Theta$.

Proof. By Proposition C.8,
$$\sum_{k=1}^{\infty}\big|\langle\nabla\bar g_{n_k+1}(\theta_{n_k}),\,\theta_{n_k+1}-\theta_{n_k}\rangle\big| < \infty \quad\text{a.s.} \tag{210}$$
Therefore, the former condition in (209) implies the latter almost surely, so it suffices to show that the latter condition implies the assertion. Assume the latter condition in (209), and let $(m_k)_{k\ge1}$ be a subsequence of $(n_k)_{k\ge1}$ satisfying
$$\lim_{k\to\infty}\frac{\langle\nabla\bar g_{m_k+1}(\theta_{m_k}),\,\theta_{m_k+1}-\theta_{m_k}\rangle}{b_{m_k+1}} = 0. \tag{211}$$
Since $\|\theta_{m_k+1}-\theta_{m_k}\|\le r_{m_k}$, by passing to a further subsequence we may also assume that
$$\theta_\infty := \lim_{k\to\infty}\theta_{m_k} \quad\text{exists almost surely.} \tag{212}$$
If $\theta_\infty$ is not a stationary point of $f$ over $\Theta$, then we may find $\theta^\star\in\Theta$ with $\|\theta^\star-\theta_\infty\|\le1$ and $\varepsilon>0$ so that
$$\langle\nabla f(\theta_\infty),\,\theta^\star-\theta_\infty\rangle \le -\varepsilon < 0. \tag{213}$$
On the other hand, by the triangle inequality and Cauchy-Schwarz,
$$\big|\langle\nabla\bar g_{m_k}(\theta_{m_k}),\,\theta^\star-\theta_{m_k}\rangle - \langle\nabla f(\theta_\infty),\,\theta^\star-\theta_\infty\rangle\big| \tag{214}$$
$$= \big|\langle\nabla\bar g_{m_k}(\theta_{m_k}) - \nabla f(\theta_{m_k}),\,\theta^\star-\theta_{m_k}\rangle + \langle\nabla f(\theta_{m_k}) - \nabla f(\theta_\infty),\,\theta^\star-\theta_{m_k}\rangle + \langle\nabla f(\theta_\infty),\,\theta_\infty-\theta_{m_k}\rangle\big| \tag{215}$$
$$\le \|\nabla\bar h_{m_k}(\theta_{m_k})\|\,\|\theta^\star-\theta_{m_k}\| + \|\nabla f(\theta_{m_k}) - \nabla f(\theta_\infty)\|\,\|\theta^\star-\theta_{m_k}\| + \|\nabla f(\theta_\infty)\|\,\|\theta_\infty-\theta_{m_k}\|. \tag{216}$$
Since $(\theta_{m_k})_{k\ge1}$ converges, $\sup_k\|\theta^\star-\theta_{m_k}\|\le M$ for some $M<\infty$. Furthermore,
$$\sum_{n=1}^{\infty}\|\nabla\bar h_n(\theta_n)\|^2 \le 2L\sum_{n=1}^{\infty}\bar h_n(\theta_n) < \infty \tag{217}$$
by Lemmas E.4 and F.2, so $\|\nabla\bar h_{m_k}(\theta_{m_k})\|\to0$ almost surely as $k\to\infty$. This, together with the continuity of $\nabla f$ and $\theta_{m_k}\to\theta_\infty$, shows that the right-hand side above tends to zero as $k\to\infty$. Then we can choose $K$ sufficiently large so that
$$\langle\nabla\bar g_{m_k}(\theta_{m_k}),\,\theta^\star-\theta_{m_k}\rangle \le -\frac\varepsilon2 \tag{218}$$
for $k\ge K$. Recall that $\|\theta_n-\theta_{n-1}\|\le r_n$ and $r_n = o(1)$. Applying Lemma C.9 (dividing (86) by $b_n$, with $b_n = r_n$ for all $n$ large enough that $r_n\le1$), we get
$$\frac{1}{b_n}\langle\nabla\bar g_n(\theta_{n-1}),\,\theta_n-\theta_{n-1}\rangle \le \inf_{\theta\in\Theta,\,\|\theta-\theta_{n-1}\|\le1}\langle\nabla\bar g_{n-1}(\theta_{n-1}),\,\theta-\theta_{n-1}\rangle + \|\nabla h_{n-1}^{v_n}(\theta_{n-1})\| + 2Lr_n. \tag{219}$$
It then follows that, for sufficiently large $k$,
$$\frac{1}{b_{m_k+1}}\langle\nabla\bar g_{m_k+1}(\theta_{m_k}),\,\theta_{m_k+1}-\theta_{m_k}\rangle \le \inf_{\theta\in\Theta,\,\|\theta-\theta_{m_k}\|\le1}\langle\nabla\bar g_{m_k}(\theta_{m_k}),\,\theta-\theta_{m_k}\rangle + \|\nabla h_{m_k}^{v_{m_k+1}}(\theta_{m_k})\| + 2Lr_{m_k+1} \tag{220-221}$$
$$\le \langle\nabla\bar g_{m_k}(\theta_{m_k}),\,\theta^\star-\theta_{m_k}\rangle + \|\nabla h_{m_k}^{v_{m_k+1}}(\theta_{m_k})\| + 2Lr_{m_k+1} \tag{222}$$
$$\le -\frac\varepsilon2 + \|\nabla h_{m_k}^{v_{m_k+1}}(\theta_{m_k})\| + 2Lr_{m_k+1}. \tag{223}$$
Recall Lemma C.4, which shows
$$\sum_{n=1}^{\infty} h_n^{v_{n+1}}(\theta_n) < \infty \tag{224}$$
almost surely. Therefore, since $h_n^{v_{n+1}}$ is non-negative, $h_n^{v_{n+1}}(\theta_n)\to0$ almost surely as $n\to\infty$; moreover, by Lemma F.2, $\|\nabla h_n^{v_{n+1}}(\theta_n)\| \le \sqrt{2L\,h_n^{v_{n+1}}(\theta_n)}\to0$. So letting $k\to\infty$ shows
$$\limsup_{k\to\infty}\frac{1}{b_{m_k+1}}\langle\nabla\bar g_{m_k+1}(\theta_{m_k}),\,\theta_{m_k+1}-\theta_{m_k}\rangle \le -\frac\varepsilon2 < 0, \tag{225}$$
contradicting (211).

Recall that under Algorithm 2, the one-step parameter difference $\|\theta_n-\theta_{n-1}\|$ is at most $r_n$. For each $n\ge1$, we say that $\theta_n$ is a long point if $\|\theta_n-\theta_{n-1}\| < r_n$, and a short point if $\|\theta_n-\theta_{n-1}\| = r_n$. The next proposition shows that if $\theta_n$ is a long point, then $\theta_n$ is obtained by directly minimizing $\bar g_n$ over the full parameter space $\Theta$. It is here that we crucially use the convexity of $\bar g_n$ from Definition 2.1.

Proposition E.6. For $n\ge1$, suppose that $\theta_n\in\arg\min_{\theta\in\Theta\cap B_{r_n}(\theta_{n-1})}\bar g_n(\theta)$ and that $\|\theta_n-\theta_{n-1}\| < r_n$. Then $\theta_n\in\arg\min_{\theta\in\Theta}\bar g_n(\theta)$.

Proof. By Definition 2.1, $\bar g_n$ is convex. Thus it suffices to verify the first-order stationarity condition
$$\inf_{\theta\in\Theta}\langle\nabla\bar g_n(\theta_n),\,\theta-\theta_n\rangle \ge 0 \tag{226}$$
to conclude $\theta_n\in\arg\min_{\theta\in\Theta}\bar g_n(\theta)$. To this end, assume the conclusion is false. Then there is $\theta^\star\in\Theta$ with $\langle\nabla\bar g_n(\theta_n),\,\theta^\star-\theta_n\rangle < 0$. Moreover, since $\theta_n$ is obtained by minimizing $\bar g_n$ over $\Theta\cap B_{r_n}(\theta_{n-1})$, we must have $\|\theta^\star-\theta_{n-1}\| > r_n$. As we are assuming $\|\theta_n-\theta_{n-1}\| < r_n$, there is $\alpha\in(0,1)$ so that $\|\theta_n-\theta_{n-1}\| = \alpha r_n$. Notice that
$$\|\theta^\star-\theta_n\| \ge \|\theta^\star-\theta_{n-1}\| - \|\theta_n-\theta_{n-1}\| > (1-\alpha)r_n. \tag{227}$$
Hence, if we set $a = \frac{(1-\alpha)r_n}{\|\theta^\star-\theta_n\|}$, then $a\in(0,1)$. So the convexity of $\Theta$ implies that $\theta'' := a(\theta^\star-\theta_n)+\theta_n\in\Theta$. Furthermore,
$$\|\theta''-\theta_{n-1}\| \le a\|\theta^\star-\theta_n\| + \|\theta_n-\theta_{n-1}\| = (1-\alpha)r_n + \alpha r_n = r_n, \tag{228}$$
and
$$\langle\nabla\bar g_n(\theta_n),\,\theta''-\theta_n\rangle = a\,\langle\nabla\bar g_n(\theta_n),\,\theta^\star-\theta_n\rangle < 0. \tag{229}$$
This contradicts $\theta_n\in\arg\min_{\theta\in\Theta\cap B_{r_n}(\theta_{n-1})}\bar g_n(\theta)$ and completes the proof.

Proposition E.7. Assume Case 3.7. If $(\theta_{n_k})_{k\ge1}$ is a sequence consisting of long points such that $\theta_\infty := \lim_k\theta_{n_k}$ exists almost surely, then $\theta_\infty$ is a stationary point of $f$ over $\Theta$.

Proof. By the assumption that each $\theta_{n_k}$ is a long point and Proposition E.6, we have $\theta_{n_k}\in\arg\min_{\theta\in\Theta}\bar g_{n_k}(\theta)$. Therefore, for any $\theta\in\Theta$,
$$\langle\nabla\bar g_{n_k}(\theta_{n_k}),\,\theta-\theta_{n_k}\rangle \ge 0. \tag{230}$$
We then notice that
$$\langle\nabla f(\theta_{n_k}),\,\theta-\theta_{n_k}\rangle = \langle\nabla\bar g_{n_k}(\theta_{n_k}),\,\theta-\theta_{n_k}\rangle - \langle\nabla\bar h_{n_k}(\theta_{n_k}),\,\theta-\theta_{n_k}\rangle. \tag{231}$$
By Lemmas F.2 and E.4, $\|\nabla\bar h_{n_k}(\theta_{n_k})\|^2 \le 2L\,\bar h_{n_k}(\theta_{n_k}) \to 0$ almost surely as $k\to\infty$. Therefore, by taking limits we get
$$\langle\nabla f(\theta_\infty),\,\theta-\theta_\infty\rangle \ge 0. \tag{232}$$
Since this holds for all $\theta\in\Theta$,
$$\sup_{\theta\in\Theta,\,\|\theta-\theta_\infty\|\le1}\langle-\nabla f(\theta_\infty),\,\theta-\theta_\infty\rangle \le 0, \tag{233}$$
which means that $\theta_\infty$ is a stationary point of $f$ over $\Theta$.

Proposition E.8. Suppose there exists a subsequence $(\theta_{n_k})_{k\ge1}$ such that $\lim_k\theta_{n_k} = \theta_\infty$ exists almost surely and that $\theta_\infty$ is not a stationary point of $f$ over $\Theta$. Then there is $\varepsilon>0$ such that the $\varepsilon$-neighborhood $B_\varepsilon(\theta_\infty)$ has the following properties:
(a) $B_\varepsilon(\theta_\infty)$ does not contain any stationary points of $f$ over $\Theta$.
(b) There are infinitely many $n$ for which $\theta_n$ is outside of $B_\varepsilon(\theta_\infty)$.

Proof. We first show that there exists $\varepsilon>0$ so that $B_\varepsilon(\theta_\infty)$ does not contain any long points. Suppose for contradiction that for each $\varepsilon>0$ there is a long point in $B_\varepsilon(\theta_\infty)$. Then one may construct a sequence of long points converging to $\theta_\infty$. But then, by Proposition E.7, $\theta_\infty$ is a stationary point of $f$ over $\Theta$, a contradiction.

Next we show that there exists $\varepsilon$ so that $B_\varepsilon(\theta_\infty)$ satisfies (a). Indeed, suppose not. Then we can find a sequence of stationary points $(\theta_{\infty,k})_{k\ge1}$ converging to $\theta_\infty$. But then, by continuity of $\nabla f$,
$$\langle\nabla f(\theta_\infty),\,\theta-\theta_\infty\rangle = \lim_{k\to\infty}\langle\nabla f(\theta_{\infty,k}),\,\theta-\theta_{\infty,k}\rangle \ge 0 \tag{234}$$
for any $\theta\in\Theta$. Then $\theta_\infty$ is a stationary point of $f$ over $\Theta$, contradicting our assumptions.

Now let $\varepsilon>0$ be such that $B_\varepsilon(\theta_\infty)$ contains no long points and satisfies (a). We will show that $B_{\varepsilon/2}(\theta_\infty)$ satisfies (b), and thus $B_{\varepsilon/2}(\theta_\infty)$ satisfies both (a) and (b), as desired. Aiming for a contradiction, suppose there are only finitely many $n$ for which $\theta_n$ is outside $B_{\varepsilon/2}(\theta_\infty)$. Then there exists $N$ so that $\theta_n\in B_{\varepsilon/2}(\theta_\infty)$ for all $n\ge N$. Then $\theta_n$ is a short point for each $n\ge N$, so $\|\theta_n-\theta_{n-1}\| = r_n$ for all $n\ge N$. This, in turn, implies that $\sum_{n=1}^{\infty}\|\theta_n-\theta_{n-1}\| = \infty$. By Proposition E.5, there then exists a subsequence $(\theta_{n_k})_{k\ge1}$ such that $\theta_\infty' = \lim_k\theta_{n_k}$ exists and is stationary for $f$. But since $\theta_\infty'\in B_\varepsilon(\theta_\infty)$, this contradicts (a). The proof is complete.

We now prove Theorem 3.9 (ii).

Proof of Theorem 3.9 (ii). Suppose for contradiction that there exists a non-stationary limit point $\theta_\infty$ of $(\theta_n)_{n\ge0}$. By Proposition E.8, there is $\varepsilon>0$ so that $B_\varepsilon(\theta_\infty)$ satisfies conditions (a) and (b). Choose $N$ large enough so that $r_n\le\frac\varepsilon4$ for $n\ge N$. We call an integer interval $I := [\ell,\ell')$ a crossing if $\theta_\ell\in B_{\varepsilon/3}(\theta_\infty)$, $\theta_{\ell'}\notin B_{2\varepsilon/3}(\theta_\infty)$, and no proper subset of $I$ satisfies both of these conditions. By definition, two distinct crossings have empty intersection. Fix a crossing $I = [\ell,\ell')$. It follows from the triangle inequality that
$$\sum_{n=\ell}^{\ell'-1}\|\theta_{n+1}-\theta_n\| \ge \|\theta_{\ell'}-\theta_\ell\| \ge \varepsilon/3. \tag{235}$$
Note that since $\theta_\infty$ is a limit point of $(\theta_n)_{n\ge0}$, we have $\theta_n\in B_{\varepsilon/3}(\theta_\infty)$ infinitely often. In addition, by condition (b) of Proposition E.8, $\theta_n$ also exits $B_\varepsilon(\theta_\infty)$ infinitely often. Therefore, there must be infinitely many crossings. Let $n_k$ be the $k$-th smallest integer that appears in some crossing, noting importantly that $\theta_{n_k}\in B_{2\varepsilon/3}(\theta_\infty)$ for $k\ge1$. Then $n_k\to\infty$ as $k\to\infty$, and by (235),
$$\sum_{k=1}^{\infty}\|\theta_{n_k+1}-\theta_{n_k}\| \ge (\#\text{ of crossings})\cdot\frac\varepsilon3 = \infty. \tag{236}$$
Then by Proposition E.5 there is a further subsequence $(\theta_{m_k})_{k\ge1}$ of $(\theta_{n_k})_{k\ge1}$ so that $\theta_\infty' = \lim_k\theta_{m_k}$ exists and is stationary. However, since $\theta_{n_k}\in B_{2\varepsilon/3}(\theta_\infty)$, the stationary point $\theta_\infty'$ is in $B_\varepsilon(\theta_\infty)$. This contradicts property (a) of Proposition E.8, which shows the assertion.

E.3. Details for numerical experiments

E.3.1. DISTRIBUTED NONNEGATIVE MATRIX FACTORIZATION

The MNIST samples $X_v$ at each node were formed by concatenating a collection of images $\{X_i\}_{i=1}^{k}\subseteq\mathbb{R}_+^{28\times28}$ along the horizontal axis, so that $X_v\in\mathbb{R}_+^{28\times28k}$. We selected 5000 images from the full dataset at random and divided them into groups based on class label. New nodes were formed by adding batches of 100 images from each group until fewer than 100 images remained; then a final node was added for the remaining images.

We include here a list of hyperparameters used for the NMF experiments. For AdaGrad we used constant step-size parameter $\eta = 0.5$. For both RMISO-DPR and RMISO-CPR we set $\rho = 2500$ for the random walk and $\rho = 50$ for cyclic sampling. For the diminishing radius version RMISO-DR we set $r_n = \frac{1}{\sqrt n\,\log(n+1)}$. Figure 4 displays the results of these experiments against compute time.
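For reproducibility, the node-construction procedure described above can be sketched as follows. This is our own rendering of the stated procedure (the array layout and function names are assumptions): draw 5000 images at random, group by label, and glue runs of 100 images per class along the horizontal axis, with one final smaller node per class for the leftovers.

```python
import numpy as np

def build_nodes(images, labels, sample_size=5000, batch=100, seed=0):
    """Form per-node data matrices X_v in R^{28 x 28k} from MNIST arrays.

    images: array of shape (num_images, 28, 28); labels: array of class ids.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(images), size=sample_size, replace=False)
    imgs, labs = images[idx], labels[idx]
    nodes = []
    for c in np.unique(labs):
        class_imgs = imgs[labs == c]
        for start in range(0, len(class_imgs), batch):
            chunk = class_imgs[start:start + batch]
            nodes.append(np.hstack(list(chunk)))  # (28, 28k) concatenation
    return nodes
```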
(Figure 4: reconstruction error vs. elapsed time in seconds; panels (a) random walk and (b) cyclic sampling; curves for RMISO-CPR, RMISO-DR, RMISO-DPR, MISO, ONMF, and AdaGrad.)
Figure 4. Plot of reconstruction error against compute time for NMF using two sampling algorithms. Results show the performance of the algorithms RMISO, MISO (Algorithm 1 with $\rho_n=0$), ONMF, and AdaGrad in factorizing a collection of MNIST (Deng, 2012) data matrices.

E.3.2. LOGISTIC REGRESSION WITH NONCONVEX REGULARIZATION

The hyperparameters for the logistic regression experiments were chosen as follows. For MCSAG and RMISO/MISO we took $L = 2/5$. The random walk on the complete graph has $t_{\mathrm{hit}} = O(|V|)$, while $t_{\mathrm{hit}} = O(|V|^2)$ for the lonely graph, but $t_\odot = O(|V|)$ for both. Accordingly, for MCSAG we set the hitting-time parameter in the step size to $t_{\mathrm{hit}} = 50$ for the complete graph and $t_{\mathrm{hit}} = 2500$ for the lonely graph. For RMISO we set $\rho = 50$ for both the constant proximal regularization version and the dynamic proximal regularization version. We ran SGD with a decaying step size of the form $\alpha_n = \frac{\alpha}{n^\gamma}$, where $\alpha = 0.1$ and $\gamma = 0.5$. For SGD-HB and AdaGrad we used step sizes $\alpha = 0.05$, and for SGD-HB the momentum parameter $\beta = 0.9$. Figure 5 shows the results of our experiments plotted against compute time.

(Figure 5: objective loss vs. elapsed time in seconds; panels (a) lonely graph and (b) complete graph; curves for RMISO-CPR, RMISO-DPR, RMISO-DR, MISO, AdaGrad, MCSAG, SGD, and SGD-HB.)
Figure 5. Plot of objective loss and standard deviation vs. compute time on a9a for two graph topologies and various optimization algorithms: RMISO, MISO (Algorithm 1 with $\rho_n=0$), AdaGrad, MCSAG, SGD, Adam, and SGD-HB.

F. Auxiliary Lemmas

Lemma F.1. Let $f:\mathbb{R}^p\to\mathbb{R}$ be a continuously differentiable function with $L$-Lipschitz continuous gradient. Then for all $\theta,\theta'\in\mathbb{R}^p$,
$$\big|f(\theta') - f(\theta) - \langle\nabla f(\theta),\,\theta'-\theta\rangle\big| \le \frac L2\|\theta'-\theta\|^2. \tag{237}$$

Proof. This is a classical lemma; see Lemma 1.2.3 of (Nesterov, 2003).

Lemma F.2. Let $f:\mathbb{R}^p\to[0,\infty)$ be a continuously differentiable function with $L$-Lipschitz continuous gradient. Then for all $\theta\in\mathbb{R}^p$ it holds that $\|\nabla f(\theta)\|^2 \le 2Lf(\theta)$.

Proof. Fix $\theta\in\mathbb{R}^p$. By Lemma F.1 we have
$$\inf_{\theta'\in\mathbb{R}^p} f(\theta') \le \inf_{\theta'\in\mathbb{R}^p}\Big(f(\theta) + \langle\nabla f(\theta),\,\theta'-\theta\rangle + \frac L2\|\theta'-\theta\|^2\Big). \tag{238}$$
It is easy to compute that
$$\inf_{\theta'\in\mathbb{R}^p}\Big(f(\theta) + \langle\nabla f(\theta),\,\theta'-\theta\rangle + \frac L2\|\theta'-\theta\|^2\Big) = f(\theta) - \frac{1}{2L}\|\nabla f(\theta)\|^2. \tag{239}$$
Hence
$$\|\nabla f(\theta)\|^2 \le 2L\big(f(\theta) - \inf_{\theta'\in\mathbb{R}^p} f(\theta')\big) \le 2Lf(\theta), \tag{240}$$
since $\inf_{\theta'} f(\theta') \ge 0$.

G. Examples of Surrogate Functions

Example G.1 (Proximal surrogates for $L$-smooth functions). Suppose $f$ is continuously differentiable with $L$-Lipschitz continuous gradient. Then $f$ is $L$-weakly convex, meaning $\theta\mapsto f(\theta) + \frac L2\|\theta\|^2$ is convex (see Lemma C.2 of (Lyu, 2023)). For each $\gamma\ge L$, the following function belongs to $\mathcal{S}_{L+\gamma}(f,\theta')$:
$$g:\theta\mapsto f(\theta) + \frac\gamma2\|\theta-\theta'\|^2. \tag{241}$$
Indeed, $g\ge f$, $g(\theta') = f(\theta')$, $\nabla h(\theta') = 0$, and $\nabla h$ is $(L+\gamma)$-Lipschitz. Minimizing the above function over $\Theta$ is equivalent to applying a proximal mapping of $f$, where the resulting estimate is denoted $\mathrm{prox}_{f/\gamma}(\theta')$ (see (Parikh & Boyd, 2014; Davis & Drusvyatskiy, 2019)).

Example G.2 (Prox-linear surrogates). If $f$ is $L$-smooth, then the following quadratic function $g$ belongs to $\mathcal{S}_{2L}(f,\theta')$:
$$g:\theta\mapsto f(\theta') + \langle\nabla f(\theta'),\,\theta-\theta'\rangle + \frac L2\|\theta-\theta'\|^2. \tag{242}$$
Indeed, $g(\theta') = f(\theta')$ and $\nabla g(\theta') = \nabla f(\theta')$. Moreover,
$$\|\nabla h(\theta) - \nabla h(\tilde\theta)\| = \big\|\nabla f(\tilde\theta) - \nabla f(\theta) + L(\theta-\tilde\theta)\big\| \le 2L\|\theta-\tilde\theta\| \tag{243}$$
since $f$ is $L$-smooth.

Example G.3 (Proximal gradient surrogates). Suppose $f = f_1 + f_2$, where $f_1$ is differentiable with $L$-Lipschitz gradient and $f_2$ is convex over $\Theta$.
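As a quick numerical illustration of Example G.2 (our own sanity check on an arbitrary smooth test function, not code from the paper): $f(t) = \sin(t) + t^2$ has $|f''|\le3$, so it is $L$-smooth with $L=3$, and the quadratic surrogate $g$ majorizes $f$ while matching it at the anchor point.

```python
import numpy as np

L, theta0 = 3.0, 0.7
f = lambda t: np.sin(t) + t**2
df = lambda t: np.cos(t) + 2 * t
# prox-linear surrogate of (242), anchored at theta0
g = lambda t: f(theta0) + df(theta0) * (t - theta0) + (L / 2) * (t - theta0)**2

ts = np.linspace(-3, 3, 1001)
assert np.all(g(ts) >= f(ts) - 1e-12)      # majorization: g >= f
assert abs(g(theta0) - f(theta0)) < 1e-12  # tightness at the anchor
```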
Then the following function $g$ belongs to $\mathcal{S}_{2L}(f,\theta')$:
$$g:\theta\mapsto f_1(\theta') + \langle\nabla f_1(\theta'),\,\theta-\theta'\rangle + \frac L2\|\theta-\theta'\|^2 + f_2(\theta). \tag{244}$$
Minimizing $g$ over $\Theta$ amounts to performing a proximal gradient step (Beck & Teboulle, 2009; Nesterov, 2013).

Example G.4 (DC programming surrogates). Suppose $f = f_1 + f_2$, where $f_1$ is convex and $f_2$ is concave and differentiable with an $L_2$-Lipschitz gradient over $\Theta$. One can also write $f = f_1 - (-f_2)$, which is a difference of the convex (DC) functions $f_1$ and $-f_2$. Then the following function $g$ belongs to $\mathcal{S}_{L_2}(f,\theta')$:
$$g:\theta\mapsto f_1(\theta) + f_2(\theta') + \langle\nabla f_2(\theta'),\,\theta-\theta'\rangle. \tag{245}$$
Such surrogates are important in DC programming (Horst & Thoai, 1999).

Example G.5 (Variational surrogates). Let $f:\mathbb{R}^p\times\mathbb{R}^q\to\mathbb{R}$ be a two-block multi-convex function and let $\Theta_1\subseteq\mathbb{R}^p$ and $\Theta_2\subseteq\mathbb{R}^q$ be two convex sets. Define $f^\star:\theta\mapsto\inf_{H\in\Theta_2} f(\theta,H)$. Then for each $\theta'\in\Theta_1$, the function
$$g:\theta\mapsto f(\theta,H'), \qquad H'\in\arg\min_{H\in\Theta_2} f(\theta',H) \tag{246}$$
is convex over $\Theta_1$ and satisfies $g\ge f^\star$ and $g(\theta') = f^\star(\theta')$. Further assume that (i) $\theta\mapsto f(\theta,H)$ is differentiable for all $H\in\Theta_2$ and $\theta\mapsto\nabla_\theta f(\theta,H)$ is $L'$-Lipschitz for all $H\in\Theta_2$; and (ii) $H\mapsto\nabla_\theta f(\theta,H)$ is $L$-Lipschitz for all $\theta\in\Theta_1$. Then $g$ belongs to $\mathcal{S}_{L^\star}(f^\star,\theta')$ for some $L^\star>0$. When $f$ is jointly convex, $f^\star$ is also convex and we can choose $L^\star = L'$.

H. Matrix factorization algorithms

Here we formally state the non-negative matrix factorization algorithms derived in Section 4.1.1. They may be compared to the celebrated online nonnegative matrix factorization algorithm in (Mairal et al., 2010), which is a special case of SMM. With surrogates $g_n^v(W)$ as defined in Section 4.1.1, one can show that minimizing the averaged surrogate $\bar g_n(W) = \frac{1}{|V|}\sum_{v\in V} g_n^v(W)$ is equivalent to minimizing
$$\mathrm{tr}(WA_nW^T) - 2\,\mathrm{tr}(WB_n), \tag{247}$$
with $A_n$ and $B_n$ defined recursively as
$$A_n := A_{n-1} + \frac{1}{|V|}\Big[H_n^{v_n}(H_n^{v_n})^T - H_{n-1}^{v_n}(H_{n-1}^{v_n})^T\Big], \tag{248}$$
$$B_n := B_{n-1} + \frac{1}{|V|}\Big[H_n^{v_n}X_{v_n}^T - H_{n-1}^{v_n}X_{v_n}^T\Big]. \tag{249}$$
With this, we state the full algorithms below.

Algorithm 3 Distributed Matrix Factorization with Proximal Regularization
Input: $(X_v)_{v\in V}$ (data matrices in $\mathbb{R}^{p\times d}$); $W_0\in\Theta^W$ (initial dictionary); $N$ (number of iterations); $\rho>0$ (regularization parameter)
Option: Regularization $\in$ {Dynamic, Constant}
Compute initial codes $H_0^v\in\arg\min_{H\in\Theta^H}\frac12\|X_v - W_0H\|_F^2 + \alpha\|H\|_1$ for each $v\in V$
for $n = 1$ to $N$ do:
  sample an index $v_n$
  update $H_n^{v_n}\in\arg\min_{H\in\Theta^H}\frac12\|X_{v_n} - W_{n-1}H\|_F^2 + \alpha\|H\|_1$; $H_n^v = H_{n-1}^v$ for $v\ne v_n$
  $A_n \leftarrow A_{n-1} + \frac{1}{|V|}\big(H_n^{v_n}(H_n^{v_n})^T - H_{n-1}^{v_n}(H_{n-1}^{v_n})^T\big)$
  $B_n \leftarrow B_{n-1} + \frac{1}{|V|}\big(H_n^{v_n}(X_{v_n})^T - H_{n-1}^{v_n}(X_{v_n})^T\big)$
  if Regularization = Dynamic then $\rho_n \leftarrow \rho + \max_{v\in V}(n - k_v(n))$, else $\rho_n \leftarrow \rho$
  update dictionary:
  $$W_n \in \arg\min_{W\in\Theta^W}\Big[\mathrm{tr}(WA_nW^T) - 2\,\mathrm{tr}(WB_n) + \frac{\rho_n}{2}\|W - W_{n-1}\|_F^2\Big] \tag{250}$$
end for
output: $W_N$

Algorithm 4 Distributed Matrix Factorization with Diminishing Radius
Input: $(X_v)_{v\in V}$ (data matrices in $\mathbb{R}^{p\times d}$); $W_0\in\Theta^W$ (initial dictionary); $N$ (number of iterations); $(r_n)_{n\ge1}$ (diminishing radius search constraints)
Compute initial codes $H_0^v\in\arg\min_{H\in\Theta^H}\frac12\|X_v - W_0H\|_F^2 + \alpha\|H\|_1$ for each $v\in V$
for $n = 1$ to $N$ do:
  sample an index $v_n$
  update $H_n^{v_n}\in\arg\min_{H\in\Theta^H}\frac12\|X_{v_n} - W_{n-1}H\|_F^2 + \alpha\|H\|_1$; $H_n^v = H_{n-1}^v$ for $v\ne v_n$
  $A_n \leftarrow A_{n-1} + \frac{1}{|V|}\big(H_n^{v_n}(H_n^{v_n})^T - H_{n-1}^{v_n}(H_{n-1}^{v_n})^T\big)$
  $B_n \leftarrow B_{n-1} + \frac{1}{|V|}\big(H_n^{v_n}(X_{v_n})^T - H_{n-1}^{v_n}(X_{v_n})^T\big)$
  update dictionary:
  $$W_n \in \arg\min_{W\in\Theta^W\cap B_{r_n}(W_{n-1})}\Big[\mathrm{tr}(WA_nW^T) - 2\,\mathrm{tr}(WB_n)\Big] \tag{251}$$
end for
output: $W_N$
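A minimal sketch of the per-iteration statistics update (248)-(249) and the proximally regularized dictionary step (250) is given below. This is our own illustration, not the paper's reference implementation: the projected-gradient inner solver, its iteration count, the step size, and the choice $\Theta^W = \{W \ge 0\}$ are our assumptions.

```python
import numpy as np

def rmiso_nmf_dictionary_step(W, A, B, H_new, H_old, X_v, n_nodes, rho_n,
                              n_iters=50):
    """One dictionary update of Algorithm 3 (sketch).

    Shapes: X_v (p, d), W (p, r), H_new/H_old (r, d), A (r, r), B (r, p).
    Updates the aggregated surrogate statistics A, B via (248)-(249), then
    minimizes tr(W A W^T) - 2 tr(W B) + (rho_n/2)||W - W_prev||_F^2 over
    W >= 0 by projected gradient descent with step 1/Lipschitz.
    """
    A = A + (H_new @ H_new.T - H_old @ H_old.T) / n_nodes
    B = B + ((H_new - H_old) @ X_v.T) / n_nodes
    W_prev = W.copy()
    step = 1.0 / (2 * np.linalg.norm(A, 2) + rho_n)
    for _ in range(n_iters):
        grad = 2 * (W @ A) - 2 * B.T + rho_n * (W - W_prev)
        W = np.maximum(W - step * grad, 0.0)  # projection onto W >= 0
    return W, A, B
```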