# stochastic_gradient_descent_under_markovian_sampling_schemes__15d3c8fb.pdf

Stochastic Gradient Descent under Markovian Sampling Schemes

Mathieu Even 1

We study a variation of vanilla stochastic gradient descent where the optimizer only has access to a Markovian sampling scheme . These schemes encompass applications that range from decentralized optimization with a random walker (token algorithms), to RL and online system identification problems. We focus on obtaining rates of convergence under the least restrictive assumptions possible on the underlying Markov chain and on the functions optimized. We first unveil the theoretical lower bound for methods that sample stochastic gradients along the path of a Markov chain, making appear a dependency in the hitting time of the underlying Markov chain. We then study Markov chain SGD (MC-SGD) under much milder regularity assumptions than prior works. We finally introduce MC-SAG, an alternative to MC-SGD with variance reduction, that only depends on the hitting time of the Markov chain, therefore obtaining a communication-efficient token algorithm.

1. Introduction

In this paper, we consider a stochastic optimization problem that takes root in decentralized optimization, estimation problems, and Reinforcement Learning. Consider a function f defined as:

f(x) = Ev π [fv(x)] , x Rd , (1)

where π is a probability distribution over a set V, and fv are smooth functions on Rd for all v in V. Classicaly, this represents the loss of a model parameterized by x on data parameterized by v. If i.i.d. samples (vt)t 0 of law π and their corresponding gradient estimates ( fvt) were accessible, one could directly apply SGD-like algorithms, that have proved to be efficient in large scale machine learning

*Equal contribution 1Inria - ENS Paris. Correspondence to: Mathieu Even <mathieu.even@inria.fr>.

Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

problems (Bottou et al., 2018). We however consider in this paper a different setting: we assume the existence of a Markov chain (vt) of state space V and stationary distribution π. The optimizer may then use biased stochastic gradients along the path of this Markov chain to perform incremental updates. She may for instance use the Markov chain SGD (MC-SGD) algorithm, defined through the following recursion:

xt+1 = xt γ fvt(xt) . (2)

Being ergodically unbiased , such iterates should behave closely to those of vanilla SGD. The analysis is however notoriously difficult, since in (2), variable xt and the current state of the Markov chain vt are not independent, so that E [ fvt(xt)|xt] can be arbitrarily far from f(xt). This paper focuses on analyzing algorithms that incrementally sample stochastic gradients alongside the Markov chain (vt), motivated by the following applications.

1.1. Token algorithms

Traditional machine learning optimization algorithms require data centralization, raising scalability and privavy issues, hence the alternative of Federated Learning, where users data is held on device, and the training is orchestrated at a server level. Decentralized optimization goes further, by removing the dependency over a central entity, leading to increased scalability, privacy and robustness to node failures, broadening the range of applications to Io T (Internet of Things) networks. In decentralized optimization, users (or agents) are represented as nodes of a connected graph G = (V, E) over a finite set of users V (of cardinality n). The problem considered is then the minimization of

v V fv(x) , x Rd , (3)

where each fv is locally held by user v V, using only communications between neighboring agents in the graph. There are several known decentralized algorithmic approaches to minimize f under these constrains. The prominent one consists in alternating between communications using gossip matrices (Boyd et al., 2006; Dimakis et al., 2010) and local gradient computations, until a consensus is reached. These gossip approaches suffer from a high synchronization cost (nodes in the graph are required to perform simultaneous

SGD under Markovian Sampling Schemes

operations, or to be aware of operations at the other end of the communication graph) that can be prohibitive if we aim at removing the dependency on a centralized orchestrator. Further, a high number of communications are required to reach consensus, whether all nodes in the graph (as in synchronous gossip) or only two neighboring ones (as in randomized gossip) communicate at each iteration. To alleviate these communication burdens, based on the original works of Lopes & Sayed (2007); Johansson et al. (2007; 2010), we study algorithms based on Markov chain SGD: a variable x performs a random walk on graph G, and is incrementally updated at each step of the random walk, using the local function available at its location. This approach thus boils down to the one presented above with the function defined in (1), where V is the (finite) set of agents, π is the uniform distribution over V, and (vt) is the Markov chain consisting of the consecutive states of the random walk performed on graph G. The random walk guarantees that every communications are spent on updating the global model, as opposed to gossip-based algorithms, where communications are used to reach a running consensus while locally performing gradient steps.

These algorithms are referred to as token algorithms: a token (that represents the model estimate) randomly walks the graph and performs updates during its walk. There are two directions to design and analyze token algorithms. Johansson et al. (2007) designed and analyzed its algorithm using, based on SGD with subdifferentials and a Markov chain sampling (consisting of the random walk). Following works (Duchi et al., 2011; Sun et al., 2018) tried to improve convergence guarantees of such stochastic gradient algorithms with Markov chain sampling, under various scenarii (mirror SGD e.g.). However, all these analyses rely on overly strong assumption: bounded gradients and/or bounded domains are assumed, and the rates obtained are of the form τmix/T + p

τmix/T for a number T of steps, where τmix is the mixing time of the underlying Markov chain. More recently, Dorfman & Levy (2022) obtained similar rates under similar assumptions (bounded losses and gradients), but without requiring any prior knowledge of τmix, using adaptive stepsizes.

A more recent approach consists in deriving token algorithms from Lagrangian duality and from variants of coordinate gradient methods or ADMM algorithms with Markov chain sampling. Mao et al. (2020) introduce the Walkman algorithm, whose analysis works on any graph, and obtain rates of τ 2 mixn

T to reach approximate-stationary points, while Hendrikx (2022) introduced a more general framework, but whose analysis only works on the complete graph (and is thus equivalent to an i.i.d. sampling). Yet, Hendrikx (2022) extend their analysis to arbitrary graph, by performing gradient updates every τmix steps of the random walk, obtaining a a dependency on τmixn, making their algorithm state of

the art for these problems. Altenatively, Wang et al. (2022) studies the algorithm stability of MC-SGD in order to derive generalization upper-bounds for this algorithm, and Sun et al. (2022) provides and studies adaptive token algorithms. Recently, and concurrently to this work, Doan (2023) also studies MC-SGD without smoothness; however, their dependency on the mixing time of the random walk (in their Theorem 1) scales as exp(cτmix): this is prohibitive as soon as the mixing time becomes larger than O(1).

In summary, current token algorithms and their analyses either rely on strong noise and regularity assumptions (e.g. bounded gradients), or suffer from an overly strong dependency on Markov chain-related quantities (as in (Mao et al., 2020; Hendrikx, 2022)).

The token algorithms we consider are to be put in contrast with consensus-based decentralized algorithms, or gossip algorithms (with fixed gossip matrices (Dimakis et al., 2010) or with randomized pairwise communications (Boyd et al., 2006)). They originally were introduced to compute the global average of local vectors through peer-to-peer communication. Among the classical decentralized optimization algorithms, some alternate between gossip communications and local steps (Nedic & Ozdaglar, 2009; Koloskova et al., 2019; 2020), others use dual formulations and formulate the consensus constraint using gossip matrices to obtain decentralized dual or primal-dual algorithms (Scaman et al., 2017; Hendrikx et al., 2019; Even et al., 2021a; Kovalev et al., 2021; Alghunaim & Sayed, 2019), and benefit from natural privacy amplification mechanisms (Cyffers et al., 2022). Other approaches include non-symetric communication matrices (Assran & Rabbat, 2021) that are more scalable. We refer the reader to Nedic et al. (2018) for a broader survey on decentralized optimization. The works we relate to in this line of research are Koloskova et al. (2020), where a unified analysis of decentralized SGD is performed (the gossip equivalent of our algorithm MC-SGD), and in particular contains rates for convex-non-smooth functions, and Yu et al. (2019), that performs an analysis of decentralized SGD with momentum in the smooth-non-convex case, which is the gossip equivalent of our algorithm MC-SAG.

1.2. Reinforcement Learning problems and online system identification

In several applications (RL, time-series analysis e.g.), a statistician may have access to values (Xt)t 0 generated sequentially along the path of a Markov chain, observations from which she wishes to estimate a parameter For instance, Kowshik et al. (2021) consider a sequence of observations Xt+1 = A Xt + ξt for ξt i.i.d. centered noise, and A to estimate, and aim at finding ˆA minimizing the MSE E h P

t<T ( ˆA A )X i where X π is the stationary distribution. Studying this problem under the lens

SGD under Markovian Sampling Schemes

of stochastic optimization, this boils down to building efficient strategis for SGD under Markov chain sampling, beyond the case of linear mean-squared regressions studied in (Kowshik et al., 2021). While optimal offline policies have extensively been studied in this setting (Jedra & Proutiere, 2019; Simchowitz et al., 2018), online algorithms that take the form of SGD-like algorithms have received little attention, and only focus on the case of quadratic losses with Markov chain as described above. However, these analyses only focus on least squares and Markov chains of the form Xt+1 = A Xt + ηt. Under these specific assumptions, Nagaraj et al. (2020) prove that a dependency on τmix for MC-SGD is inevitable, while using reverse-experince replay, Kowshik et al. (2021) breaks this and obtain sample-optimal online algorithms. Their algorithm however require to store a number of iterates that grow linearly with τmix.

The convergence guarantees we prove in the sequel for SGD under Markov chain sampling fit in this online framework, and refine previous analyses by removing strong regularity assumptions such as bounded iterates or bounded gradients (Sun et al., 2018), or strong assumptions on the Markovian structure data and least-squares problems (Nagaraj et al., 2020; Kowshik et al., 2021).

Finally, note that in our setting, the iterates of the algorithms considered (denoted as (xk)k 0) and the Markov chain (vk)k 0 are dependent of each other. More precisely, (vk)k 0 is a Markov chain whose states do not depend on the iterate sequence, while xk is (vℓ)ℓ k 1-measurable. This setting is sometimes referred to as exogenous Markov noise (Rust, 1986). Another line of works, pioneered by (Benveniste et al., 1990), considers Markov transitions for (vk) where vk+1|vk is sampled using a Markov transition kernel Px0,...,xk 1 that is directly linked to the iterates. This orthogonal line of work of stochastic approximation with Markovian noise is related to sampling (through the MCMC algorithm), adaptive filtering, and other related problems that involve exploration (Brown & Rutan, 1985; Andrieu et al., 2005; Andrieu & Moulines, 2006; Fort et al., 2016; Blanke & Lelarge, 2023). Our work aims at finding precise rates of convergence as in the convex or non-convex optimization literature (Bubeck, 2015; Carmon et al., 2021), under the mildest assumptions on the exogenous Markovchain (vk)k 0.

Before unrolling our results, we start by reminding definitions and results related to Markov chain theory in Section 2, before presenting our contributions in Section 3

2. Markov chains preliminaries

We refer the interested reader to Levin et al. (2006) for a thorough introduction to Markov chain theory. In this section, we define mixing, hitting and cover times for a Markov

chain on a finite space set V of cardinality n. However, note that these definitions can be extended to the more general setting where V is infinite (either countable or not). We focus on Markov chain on finite state spaces, but note that the mixing time of a Markov chain can similarly be defined on inifinite state spaces (countable and continuous state spaces). In this paper, all results that involve only the mixing time of the Markov chain (the results from Section 5) easily generalize to infinite state spaces. Definition 2.1. Let P RV V be a stochastic matrix (i.e. Pv,w 0 for all v, w V and P

w V Pv,w = 1 for all v V). A time-homogeneous Markov chain on V of transition matrix P is a stochastic process (Xt)t 0 with values in V such that, for any t 0 and w, v0, . . . , vt 1, v V,

P (Xt+1= w|Xt =v, Xt 1 =vt 1, . . . , X0 =v0) = Pv,w .

A Markov chain of transition matrix P is irreducible if, for any v, w V, there exists t 0 such that (P t)vw > 0. A Markov chain of transition matrix P is aperiodic if there exists t0 > 0 such that for all t t0 and v, w V, (P t)v,w > 0. Any irreducible and aperiodic Markov chain on V admits a stationary distribution π, that verifies πP = π. It finally holds that, if P is reversible (πv Pv,w = πw Pw,v for all v, w V), denoting as λP = 1 maxλ Sp(P )\{1} |λ| > 0 the absolute spectral gap of P, where Sp(P) is the spectrum of P, for any stochastic vector π0 RV: 1

π0P t π π (1 λP )t π0 π π .

If the chain is not reversible, there is still a linear decay, but in terms of total variation distance rather than in the norm π (Chapter 4.3 of Levin et al. (2006)). In the sequel, (vt)t 0 is any irreducible aperiodic Markov chain of transition matrix P on V of stationary distribution π (not necessarily the uniform distribution on V).

Furthermore, we define the graph G = (V, E) over the state space V through the relation {v, w} E Pv,w > 0 for v and w two distinct states. Consequently, the Markov chain (vt)t can also be seen as a random walk on graph G, with transition probability P. In the random walk decentralized optimization case, this graph coincides with the communication graph. In the sequel, for t 0 and v V, E [ |vt = v] and P ( |vt = v) respectively denote the expectation and probability conditioned on the event vt = v. Similarly for πt a probability distribution on V, E [ |vt πt] and P ( |vt πt) refers to conditioning on the law of vt. Definition 2.2 (Mixing, hitting and cover times). For w V, let τw = inf {t 1 | vt = w} be the time the chain reaches w (or returns to w, in the case v0 = w). We define the following quantities.

1we write z 2 π = P

v V πvz2 v for z RV, and always stands for the Euclidean norm

SGD under Markovian Sampling Schemes

1. Mixing time. For ε > 0, the mixing time τmix(ε) of (vt) is defined as, where d TV is the total-variation distance:

τmix(ε) = inf t 1 | π0 , d TV(P tπ0, π) ε ,

and we define the mixing time τmix as τmix = τmix(πmin/2) 2 where πmin = minv V πv.

2. Hitting and cover times. The hitting time τhit and cover time τcov of (vt) are defined as:

τhit = max (v,w) V2 E [τw|v0 = v] ,

τcov = max v V E max w V τw|v0 = v .

The mixing time is the number of steps of the Markov chain required for the distribution of the current state to be close to the stationary probability π. Starting from any arbitrary v0, the hitting time bounds the time it takes to reach any fixed w, while the cover time bounds the number of steps required to visit all the nodes in the graph.

Note that if the chain is reversible, τmix(ε) is closely related to λP through τmix(ε) λ 1 P ln(π 1 minε 1) . Under reversibility assumptions, we defined the relaxation time of the Markov chain as τrel = 1/λP . More generally without reversibility, τmix(ε) log2(1/ε) τmix(1/4). Then, as we prove in Appendix A, τhit always satisfies τhit 2π 1 minτmix. Finally, using Matthews (1988) method (detailed in Chapter 11.4 of Levin et al. (2006)), we have τcov ln(n)τhit.

3. Contributions

In our paper, we analyze theoretically stochastic gradient methods with Markov chain sampling (such as MC-SGD in Equation (2)), and aim at deriving complexity bounds under the mildest assumptions possible. We first derive in Section 4 complexity lower bounds for such methods, making appear τhit as the Markov chain quantity that slows down such algorithms.

We then study MC-SGD under various regularity assumptions in Section 5: we remove the bounded gradient assumption of all previous analyses, obtain rates under a µ-PL assumption, and prove a linear convergence in the interpolation regime, where noise and function dissimilarities only need to be bounded at the optimum.

In the data-heterogeneous setting (functions fv that can be arbitrarily dissimilar) and in the case where V (the state

2this definition of mixing time is not standard: Levin et al. (2006) define it as τmix(1/4), Mao et al. (2020) define it as we do; however, as explained in Chapter 4.5 of Levin et al. (2006), these definitions are equivalent up to a factor ln(1/πmin)

space of the Markov chain) is finite, we introduce MCSAG in Section 6, a variance-reduced alternative to MCSGD, that is perfectly suited to decentralized optimization. Using time adaptive stepsizes, this algorithm has a rate of convergence of τhit/T and thus matches that of our lower bound, up to acceleration.

We discuss in Section 7 the implications of our results. In particular, we prove that random-walk based decentralization is more communication efficient than consensus-based approaches; prior to our analysis, this was only shown empirically (Mao et al., 2020; Johansson et al., 2010). Further, our results formally prove that using all gradients along the Markov chain trajectory leads to faster rates; as in the previous case, this was only empirically observed before (Sun et al., 2018). These two consequences are derived from the fact that MC-SAG depends only on τhit rather than the traditionally used quantity nτmix, that can be arbitrarily bigger (Table 2).

4. Oracle complexity lower bounds under Markov chain sampling

In this section, we provide oracle complexity lower bounds for finding stationary points of the function f defined in (3), for a class of algorithms that satisfy a Markov sampling scheme . For a given Markov chain (vt) on V, we consider algorithms verifying the following procedural constraints, for some fixed initialization M0 = {x0} an then for t 0,

1. A iteration t, the algorithm has access to function fvt and may extend its memory:

Mt+1 = Span({x , fvt(x) , x Mt}) .

2. Output: the algorithm specifies an output value xt Mt.

We call algorithms verifying such constraints black box procedures with Markov sampling (vt) . Such procedures as well as the result below are inspired by the distributed black-box procedures defined in Scaman et al. (2017). We use the notation a( ) = Ω(b( )) for C > 0 such that a( ) Cb( ) in the theorem below, and classically consider the limiting situation d , by assuming we are working in ℓ2 = (θk)k N RN : P

Theorem 4.1. Assume that τv (see Definition 2.2) has finite second moment for any v V. Let , B > 0, L > 0 and µ > 0, denote κ = L/µ. Let x0 be fixed.

1. Non-convex lower bound: there exist functions (fv)v V such that f = P

v V πvfv is L-smooth, and f(x0) minx f(x) and such that for any T and any Markov black-box algorithm that outputs x T after

SGD under Markovian Sampling Schemes

T steps, we have:

f(x T ) 2 = Ω L τhit

2. Convex lower bound: there exist functions (fv)v V such that f = P

v V πvfv is convex and L-smooth and

minimized at some x that verifies x0 x 2 B2, and such that for any T and any Markov black-box algorithm that outputs x T after T steps, we have:

f(x T ) f(x ) = Ω LB2 τhit

3. Strongly convex lower bound: there exist functions (fv)v V such that f = P

v V πvfv is µ-strongly convex and L-smooth and minimized at some x that verifies x0 x 2 B2, and such that for any T and any Markov black-box algorithm that outputs x T after T steps, we have:

f(x T ) f(x ) = Ω LB2 exp T κτhit

A complete proof can be found in Appendix B. The hitting time of the Markov chain bounds, starting from any point in V, the mean time it takes to reach any other state in the graph. Making no other assumptions than smoothness, having rates that depend on this hiting time is thus quite intuitive.

5. Analysis of Markov-Chain SGD

We have shown in last subsection that, in order to reach an ε-stationary point with Markov sampling, the optimizer is slowed down by the hitting time of the Markov chain; this lower bound being worst-case on the functions (fv), we here add additional similarity assumptions, that are still milder than classical ones in this setting (Sun et al., 2018).Studying the iterates generated by (2), we obtain in this section a dependency on τmix, provided bounded gradient dissimilarities (Assumptions 5.1 and 5.4).

We here assume that (vt)t 0 is a Markov chain on V of invariant probability π (not necessarily the uniform measure on V). In this section, the function f studied is defined as

f( ) = Ev π[fv( )] ,

as in (1). Consequently, for the MC-SGD algorithm for decentralized optimization over a given graph G to minimize the averaged function over all nodes (as in (3)), π needs to be the uniform probability over V.

We first derive convergence rates under smoothness assumptions with or without a µ-PL inequality that holds, before improving our results under strong convexity assumptions,

under which we prove a linear convergence rate in the interpolation regime. We finally add local noise (due to sampling, or additive gaussian noise to enforce privacy) in the final paragraph of this Section.

5.1. Analysis under bounded gradient dissimilarities

Assumption 5.1. There exists (σ2 v)v V such that for all v V and all x Rd, we have3:

fv(x) f(x) 2 σ2 v ,

and we denote σ2 = Ev π σ2 v and σ2 max = maxv V σ2 v. Assumption 5.2. Each fv is L-smooth, f is lower bounded, its minimum is attained at some x Rd. Theorem 5.3 (MC-SGD). Assume that Assumptions 5.1 and 5.2 hold, and let f(x0) f(x ) + σ2 max/L.

1. For a constant time-horizon dependent step size γ (i.e., γ is a functiion of T), the iterates generated by Equation (2) satisfy, for T 2τmix ln(τmix): 4

E f(ˆx T ) 2 = O Lτmix

T + L σ2τmix + σ2

where ˆx T is drawn uniformly at random amongst x0, . . . , x T 1.

2. If f additionally verifies a µ-PL inequality (for any x Rd, f(x) 2) 2µ(f(x) f(x ))), for a constant time-horizon dependent step size γ, the iterates generated by Equation (2) satisfy, for T 2τmix ln(τmix) a numerical constant c > 0, and κ = L/µ, with FT = E [f(x T ) f(x )]:

FT e c T κτmix ln(T ) + O τmix σ2

Theorem 5.3 is proved in Appendix C, by enforcing a delay of order τmix and relying on recent analyses of delayed SGD and SGD with biased gradients. As explained in the introduction, removing the bounded gradient assumption present in previous works (Johansson et al., 2010; Sun et al., 2018; Duchi et al., 2011) that study Markov chain SGD (in the mirror setting, or with subdifferentials), and replacing it by a much milder and classical assumption of bounded gradient dissimilarities (Karimireddy et al., 2020), we thus still managed to obtain similar rates. Further, if f verifies a µ-PL inequality (if for any x Rd, f(x) 2) 2µ(f(x) f(x ))), we have an almostlinear rate of convergence: this is the first rate under µ-PL or strong convexity assumptions for MC-SGD-like algorithms, that we even refine further in next subsection.

3this assumption could be replaced by a more relaxed noise assumption of the form fv(x) f(x) 2 M f(x) 2 + σ2 v 4 O hides logarithmic factors

SGD under Markovian Sampling Schemes

5.2. Tight rates and linear convergence in the interpolation regime

We now study MC-SGD under the following assumptions, to derive faster rates, that only depend on the sampling noise at the optimum. The interpolation regime often related to overparameterization refers to the case where there exists a model x Rd minimizing all fv for v V, leading to σ2 = 0 in Assumption 5.5, and to a linear convergence rate below. Assumption 5.4. Functions fv are L-smooth and µ-strongly convex. We denote κ = L/µ. Assumption 5.5 (Noise at the optimum). Let x be a minimizer of f. We assume that for some σ 0, we have for all v V: fv(x ) 2 σ2 .

Theorem 5.6 (Unified analysis). Under Assumptions 5.4 and 5.5, the sequence generated by (2) satisfies, if γL < 1:

E h x T x 2i 2(1 γµ)T x0 x 2

0 s T (1 γµ)T s E

s t<T fvt(x )

In the interpolation regime, fv(x ) = 0 for all v V, so that:

E h x T x 2i 2(1 γµ)T x0 x 2 .

The result in Theorem 5.6 is in fact true irrespectively of the sequence (vt) chosen: it does not require (vt) to specifically be a Markov chain. This property is used in the next Corollary, that also highlights the fact that by studying distance to the optimum, a condition number is lost in the process. This is the case in many previous analyses of other different algorithms (e.g., Bregman/Mirror-SGD (Dragomir et al., 2021) or SGD with random-resfhuffling (Mishchenko et al., 2020), which is in fact a particular instance of MC-SGD, that our analysis recovers), that study distances to the optimum (with respect to some mirror map, in the case of Mirror SGD), and therefore obtain an extra κ factor in the noise term. Theorem 5.6 is proved by generalizing the proof technique of (Mishchenko et al., 2020) to arbitrary orderings and for unbounded time horizons. Remark 5.7 (Random resfhuffling). A special case of Theorem 5.6 is SGD with random reshuffling. By analyzing SGD with random-reshuffling as SGD with a Markovian ordering (on an extended state space), Theorem 1,2 also recover rates for SGD with random reshuffling for which we have τmix = n. Moreover, since Theorem 5.6 generalizes Theorem 1 of (Mishchenko et al., 2020), we also recover their rate as a special case by bounding each term

s t<T fvt(x ) 2 .

We specify Theorem 5.6 under a Markovian sampling scheme in next corollary: the noise term at the optimum takes the form τmix/T.

Corollary 5.8 (MC-SGD, interpolation). Under Assumptions 5.4 and 5.5, for T 1 and for a well chosen stepsize γ > 0, the iterates generated by (2) satisfy:

E h x T x 2i 2e T

This result is stronger than Theorem 5.3.2, for (i) noise amplitude and gradient dissimilarities only need to be bounded at the optimum; (ii) the optimization term (the first one) is not slowed down by the mixing time. This comes at the cost of strong convexity assumptions, stronger than a µ-PL inequality for f. The term τmixσ

T cannot be removed in the general case, as next proposition shows. Hence, since the two other terms have optimal dependency in terms of Markov-chain and noise related quantities, our analysis ends up being sharp.

Corollary 5.6 together with the following proposition are an extension of Nagaraj et al. (2020), who proved similar results for MC-SGD with constant stepsize on least square problems on Markovian data of a certain form (for linear online system identification).

Proposition 5.9. For any V (such that |V| 2) and τ > 1, there exists a Markov chain on V of relaxation time τ, functions (fv)v V and x0 Rd such that given any stepsize γ, the iterates of Equation (2) output x T for any T > 0 verifying x T x0 2 = Ω(τσ2 /T), and the assumptions of Theorem 5.6 hold.

5.3. MC-SGD with local noise

In the two previous subsections, we analyzed SGD with Markovian sampling schemes, where the stochasticity only came from the Markov chain (vk)k 0. We now generalize the analysis and results to SGD with both Markovian sampling, and local noise, by studying the sequence:

xt+1 = xt γtgt . (4)

We now formulate the form stochastic gradients gt can take.

Assumption 5.10. For all v V, the function fv satisfies fv(x) = E [Fv(x, ξv)] for all x Rd, where ξv Dv. Furthermore, there exists a Markov-chain (vt)t 0 such that for all t 0, gt = x Fvt(xt, ξt) ,

where ξt Dvt|vt is independent from v0, . . . , vt 1 and ξ0, . . . , ξt 1.

SGD under Markovian Sampling Schemes

A direct consequence of Assumption 5.4 is that E [gt|xt, vt] = fvt(xt). Two main applications of Assumption 5.4 are:

1. Local sampling. If fv(x) = 1

m Pm i=1 fv,i(x) (agent v has m local samples), agent m may use only a batch B [m] of its samples, leading to stochastic gradients gt in (4) of the form:

gt = 1 |Bt|

i Bt fvt,i(xt) ,

for random batches (Bt)t 0.

2. Differential privacy. Adding local noise (e.g., additive Gaussian random noise) enforces differential privacy under suitable assumptions. A private decentralized token algorithm is then Differentially Private MC-SGD (DP-MC-SGD), with iterates (4) where gt satisfies

gt = fvt(xt) + ηt , (5)

where (vt) is the Markov chain (random walk performed by the token on the communication graph), and ηt N(0, σ2 t Id) is sampled independently from the past, to enforce differential privacy.

Under Assumption 5.4, a direct generalization of Theorem 5.6 and Corollary 5.8 is the following.

Theorem 5.11 (MC-SGD with local noise). Assume that Assumptions 5.10,5.5 holds, each Fv( , ξ) is µ-strongly convex L-smooth, and there exists σ2 local such that:

E h gt fvt(xt) 2|xt, vt i σ2 local .

Then, for a well chosen γ > 0, the iterates generated by (4) satisfy:

E h x T x 2i 2e T

µ3T σ2 local + τmix 1

Importantly, and as one would have expected, local noise is not impacted by the mixing time of the underlying random walk. While we did not pursue in this direction, this observation could easily be made under other regularity assumptions, and such a result would hold for instance under the assumptions of Theorem 1 or 2. While Differentially Private MC-SGD sounds appealing for performing decentralized and differentially private optimization, we here only provided a utility analysis, the privacy analysis being left for future works.

Algorithm 1 Markov Chain SAG (MC-SAG)

1: Input: x0 Rd, hv Rd for v V and h0 Rd, stepsizes γt > 0, v0 V 2: for t = 0, 1, . . . do 3: Compute fvt(xt) 4: ht+1 = ht + 1

n fvt(xt) hvt

5: xt+1 = xt γt ht+1 6: hvt fvt(xt) 7: Sample vt+1 Pvt, 8: end for

6. Analysis of Markov-Chain SAG

After providing convergence guarantees for the most natural algorithm (MC-SGD) under a Markov chain sampling on the set V, we prove that one can achieve a rate of order 1/T (rather than the 1/

T previously obtained) in the smooth setting, by applying the variance reduction techniques present in Schmidt et al. (2017), that first introduced the Stochastic Averaged Gradient algorithm, together with a time-adaptive stepsize policy described below. Our faster rate with variance reduction leads of a dependency on τhit instead of τmix; since we do not make any other assumption other than smoothness, this is unavoidable in light of our lower bound (Theorem 4.1).

MC-SAG The MC-SAG algorithm is described in Algorithm 1. The recursion leading to the iterate xt can then be summarized as, for stepsizes (γt)t 0, under the initialization hv = fv(x0) and h = f(x0):

xt+1 = xt γt

v V fv(xdv(t)) , (6)

where for v V, we define dv(t) = sup {s t | vs = v} as the last previous iterate at which v was the current state of the Markov chain. By convention, if the set over which the supremum is taken is empty, we set dv(t) = 0. We handle both the initialization described just above for hv, h and arbitrary initialization in our analysis below.

In the same way that MC-SGD reduces to vanilla SGD if (vt) is an i.i.d. uniform sampling over V, MC-SAG boils down to the SAG algorithm (Schmidt et al., 2017) in that case and under the initialization hv = fv(x0) and h = f(x0). In a decentralized setting, nodes keep in mind their last gradient computed (variable hv at node v). At all times, ht is an average of these hv over the graph, and is, in the same way as xt, updated along the random walk. The MCSAG algorithm is thus perfectly adapted to decentralized optimization.

Time-adaptive stepsize policy To obtain our convergence guarantees, a time-adaptive stepsize policy (γt) is used, as

SGD under Markovian Sampling Schemes

in Asynchronous SGD (Mishchenko et al., 2022) to obtain delay-independent guarantees. For t 0, let the stepsize γt be defined as:

γt = 1 2L τhit + maxv V(t dv(t)) . (7)

Denoting τt = maxv V(t dv(t)), this quantity can be tracked down during the optimization process. Indeed, if agent vt receives τt 1 together with (xt, ht), she may compute τt as:

τt = max τt 1 + 1 , t dvt(t) ,

where t dvt(t) is the number of iterations that took place since the last time the Markov chain state was vt. Hence, if agents keep track of the number of iterations, the adaptive stepsize policy (7) can be used in Algorithm 1, as long as agent vt sends (τt, t) to vt+1, yielding the following result.

We now present the convergence results for MC-SAG. (vt) is in this section assumed to be a Markov chain on V of finite hitting time τhit. Importantly, the next Theorem does not require any additional assumption on (vt) such as reversibility, or even that it has a stationary probability that is the uniform distribution: the non-symmetric but easily implementable transition probabilities Pv,w = 1/(dv + 1) for w = v or w v can be used here, as well as non-reversible random walks than can have much smaller mixing and hitting times. The function f studied is here independent of the Markov chain, and is defined as in (3), the uniformly averaged function over all states v V (or over all agents in the network).

Theorem 6.1 (MC-SAG). Assume that Assumption 5.2 holds and that the Markov chain has a finite hitting time (for an arbitrary invariant probability).

1. Under the initialization:

hv = fv(x0) , h = f(x0) ,

using the adaptive stepsize policy defined in Equation (7), the sequence generated by Algorithm 1 satisfies, for any T > 0:

E min t<T f(xt) 2 8L f(x0) f(x ) τhit ln(n)

2. Under any arbitrary initialization that satisfies h = 1 n P

v V hv, using the adaptive stepsize policy defined in Equation (7), the sequence generated by Algorithm 1 satisfies, for any T > 0:

E min t<T f(xt) 2 16L τhit ln(n)

= f(x0) f(x ) + 1

v V fv(x0) hv 2 .

Theorem 6.1 is proved in Appendix 6.1. Up to the logarithmic factor in n, the rates in Theorem 6.1 are the nonaccelerated versions of the lower-bound in Theorem 4.1.

7. Discussion of our results

7.1. Communication efficiency: comparison of our results with consensus-based approaches

We summarize the communication efficiencies in Table 1 (in terms of total number of communications required to reach an ε-stationary point), of classical gossip-based decentralized gradient methods (non-accelerated, since no accelerated method is known under our regularity assumptions). We consider the algorithm of (Yu et al., 2019) (decentralized SGD with momemtum, state of the art decentralized gossipbased algorithm for this problem) with fixed communication matrix W on the graph G together with the Walkman algorithm (Mao et al., 2020) and our algorithms, for a Markov chain with transition matrix P. For the sake of comparison, we take as gossip matrix W = P.Consequently as shown in Table 1, our algorithm (MC-SAG) always outperforms nonaccelerated gossip-based decentralized gradient descent algorithms in terms of number of communications required to reach ε-stationary points. Note that we do not claim the overall superiority of our approach over classical decentralized optimization algorithms (the latter benefit from parallelization while ours do not), but a superiority in terms of communication efficiency.

Table 1: Number of communications required (# comm. below) to obtain an ε-stationary point x (verifying f(x) 2 ε). Logarithmic/constant factors hidden.

A B Our work

# comm. ε 1|E|τmix ε 1nτ 2 mix ε 2τmix c

A: (Yu et al., 2019). B: (Mao et al., 2020). c MC-SGD under Assumption 5.1.

The dependency on the quantity τhit we obtain (under no other assumptions than smoothness) is always better than the dependency on nτmix of previous works (using gossip communications or a random walker), since τhit 2nτmix always holds. As illustrated in Table 2 on some known graphs, this inequality is rather loose when the connectivity decreases (i.e. the mixing time increases), so that the speedup our results lead to is even more effective on illconnected graphs; the difference between the two can scale

SGD under Markovian Sampling Schemes

Table 2: Hitting and mixing times of some known graphs, for the simple random walk.

Cycle d-dim. torus Complete graph

τhit O(n2) O(n1+ 1

nτmix O(n3) O(n1+ 2

up to a factor n. In fact, we prove in Appendix A that for d-regular and symetric graphs, we have:

τhit 2|E|Diam(G)

where Diam is the diameter of G. The dependency nτ 2 mix obtained in (Mao et al., 2020) (the only work that does not make bounded gradient assumptions) is prohibitive when graph connectivity decreases (n3 on the grid, n5 on the cycle). Our analysis does not rely on a reversibility assumption of the Markov chain, so that non symetric random walks can be used, therefore accelerating mixing; on the cycle for a non-symmetric random walk for instance, the hitting time decreases to O(n).

7.2. Using all gradient along the trajectory of (vt) is provably more efficient

Sun et al. (2018) empirically motivated through empirical evidence the use of all gradients fvt sampled along the trajectory of the Markov chain rather than waiting for the chain to mix before every stochastic gradient step in order to mimic the behavior of vanilla SGD. However, their rates (as well as those of (Johansson et al., 2010; Duchi et al., 2011) and ours for MC-SGD) are functions of S = T/τmix, and of order 1/S + 1/

S. These are exactly what one would obtain, by waiting for τmix steps of the chain in order to have an approximate uniform sampling before each update! Consequently, there are no theoretical ground or evidence for using all the gradients along the trajectory of the Markov chain with these results, other than by doing so, one does not do worse than waiting for the chain to mix to mimic vanilla SGD. This is exactly the approach taken by Hendrikx (2022): a gradient step is performed every τmix random walk steps. This is where MC-SAG and its guarantees that depend on T/τhit come in place. Under our assumptions, the rate of SAG for finding approximate stationary points when waiting for the chain to mix before using a stochastic gradient is of order n/S = nτmix/T where S = T/τmix is the number of stochastic gradients used. We obtain τhit/T instead: hence, in cases where τhit = o(nτmix), using all stochastic gradients along the trajectory of the Markov chain - instead of waiting for mixing before performing a stochastic gradient step - provably helps. Hence, we here provided a realistic scenario where

using all stochastic gradients proves to accelerate the rate; this was previously noticed in another setting with RERSGD (SGD with reverse-experience replay, (Kowshik et al., 2021)).

7.3. Running-time complexity and robustness to stragglers

The total time it takes to run random walk-based decentralized algorithms depends on Tv w, the time it takes to compute a gradient at v, and then the communication time to send it to w. Using ergodicity of the Markov chain, the time time MC(T) it takes to run MC-SAG or MC-SGD for T iterations verifies:

(v,w) V2 πv Pv,w Tv w ,

where the limit is a weighted sum of the local computation/communication times, with weights summing to 1. Random-walk based decentralized algorithms are therefore robust to slow edges or nodes ( stragglers ), a property that synchronous gossip algorithms do not verify (their time complexity depends on maxv,w Tv w), while studying asynchronous gossip is notoriously difficult (Even et al., 2021b).

Numerical illustration of our theory We present in Appendix G two experiments on synthetic problems, comparing MC-SAG and MC-SGD to gossip-based and token baselines. We consider two settings (a well-connected graph with homogeneous functions, an ill-connected graph with heterogeneous functions) in an effort to illustrate how these two difficulties (graph connectivity and data-heterogeneity) are both bypassed by the MC-SAG algorithm.

Conclusion Without variance reduction and under bounded data-heterogeneity assumptions, SGD under MC sampling is slowed down by a factor τmix, due to increased sampling variance. Using variance-reduction techniques, we obtain faster rates, that depend on τhit rather than nτmix, which one would have expected by directly extending known results in the i.i.d. setting to our MC sampling schemes. Leveraging such a dependency yields a fast token algorithm (MC-SAG), robust to both ill-connectivity of the graph and data-heterogeneity.

Aknowledgments I deeply thank Hadrien Hendrikx for many interesting discussions on the subject and for all his help in writing this paper, as well as Anastasiia Koloskova and Edwige Cyffers for helpful discussions.

SGD under Markovian Sampling Schemes

Alghunaim, S. A. and Sayed, A. H. Linear convergence of primal-dual gradient methods and their performance in distributed optimization, 2019.

Andrieu, C. and Moulines, É. On the ergodicity properties of some adaptive MCMC algorithms. The Annals of Applied Probability, 16(3), August 2006.

Andrieu, C., Moulines, É., and Priouret, P. Stability of stochastic approximation under verifiable conditions. SIAM Journal on Control and Optimization, 44(1):283 312, January 2005.

Assran, M. S. and Rabbat, M. G. Asynchronous gradient push. IEEE Transactions on Automatic Control, 66(1): 168 183, 2021.

Benveniste, A., Métivier, M., and Priouret, P. Adaptive Algorithms and Stochastic Approximations. Springer Berlin Heidelberg, 1990.

Blanke, M. and Lelarge, M. Flex: an adaptive exploration algorithm for nonlinear systems. In International Conference on Machine Learning, 2023.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223 311, January 2018.

Boyd, S., Ghosh, A., Prabhakar, B., and Shah, D. Randomized gossip algorithms. IEEE transactions on information theory, 52(6):2508 2530, 2006.

Brown, S. and Rutan, S. Adaptive kalman filtering. Journal of Research of the National Bureau of Standards, 90(6): 403, November 1985.

Bubeck, S. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn., 8(3 4):231 357, November 2015.

Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Lower bounds for finding stationary points ii: First-order methods. Math. Program., 185(1 2):315 355, jan 2021. ISSN 0025-5610.

Cyffers, E., Even, M., Bellet, A., and Massoulié, L. Muffliato: Peer-to-peer privacy amplification for decentralized optimization and averaging. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022.

Defazio, A., Bach, F., and Lacoste-Julien, S. Saga: A fast incremental gradient method with support for nonstrongly convex composite objectives. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.

Dimakis, A. G., Kar, S., Moura, J. M. F., Rabbat, M. G., and Scaglione, A. Gossip algorithms for distributed signal processing. Proceedings of the IEEE, 98(11):1847 1864, 2010.

Doan, T. T. Finite-time analysis of markov gradient descent. IEEE Transactions on Automatic Control, 68(4):2140 2153, 2023.

Dorfman, R. and Levy, K. Y. Adapting to mixing time in stochastic optimization with Markovian data. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 5429 5446. PMLR, 17 23 Jul 2022.

Dragomir, R. A., Even, M., and Hendrikx, H. Fast stochastic bregman gradient methods: Sharp analysis and variance reduction. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 2815 2825. PMLR, 18 24 Jul 2021.

Duchi, J. C., Agarwal, A., Johansson, M., and Jordan, M. I. Ergodic mirror descent. In 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 701 706, 2011. doi: 10.1109/Allerton. 2011.6120236.

Even, M., Berthier, R., Bach, F., Flammarion, N., Hendrikx, H., Gaillard, P., Massoulié, L., and Taylor, A. Continuized accelerations of deterministic and stochastic gradient descents, and of gossip algorithms. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 28054 28066. Curran Associates, Inc., 2021a.

Even, M., Hendrikx, H., and Massoulié, L. Decentralized optimization with heterogeneous delays: a continuous-time approach. Technical report, ar Xiv:2106.03585, 2021b.

Even, M., Massoulié, L., and Scaman, K. On sample optimality in personalized collaborative and federated learning. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022.

Fort, G., Moulines, E., Schreck, A., and Vihola, M. Convergence of markovian stochastic approximation with discontinuous dynamics. SIAM Journal on Control and Optimization, 54(2):866 893, January 2016.

Hendrikx, H. A principled framework for the design and analysis of token algorithms, 2022.

SGD under Markovian Sampling Schemes

Hendrikx, H., Bach, F., and Massoulié, L. An accelerated decentralized stochastic proximal algorithm for finite sums. In Advances in Neural Information Processing Systems, 2019.

Jedra, Y. and Proutiere, A. Sample complexity lower bounds for linear system identification. In IEEE Conference on decision and control (CDC) 2019, pp. 2676 2681, 12 2019.

Johansson, B., Rabi, M., and Johansson, M. A simple peerto-peer algorithm for distributed optimization in sensor networks. In 2007 46th IEEE Conference on Decision and Control, pp. 4705 4710, 2007. doi: 10.1109/CDC. 2007.4434888.

Johansson, B., Rabi, M., and Johansson, M. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157 1170, 2010.

Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., and Suresh, A. T. SCAFFOLD: Stochastic controlled averaging for federated learning. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 5132 5143. PMLR, 13 18 Jul 2020.

Koloskova, A., Stich, S., and Jaggi, M. Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning, volume 97, pp. 3478 3487. PMLR, 2019.

Koloskova, A., Loizou, N., Boreiri, S., Jaggi, M., and Stich, S. A unified theory of decentralized SGD with changing topology and local updates. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 5381 5393. PMLR, 13 18 Jul 2020.

Kovalev, D., Gasanov, E., Gasnikov, A., and Richtarik, P. Lower bounds and optimal algorithms for smooth and strongly convex decentralized optimization over timevarying networks. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 22325 22335. Curran Associates, Inc., 2021.

Kowshik, S., Nagaraj, D., Jain, P., and Netrapalli, P. Streaming linear system identification with reverse experience replay. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 30140 30152. Curran Associates, Inc., 2021.

Levin, D. A., Peres, Y., and Wilmer, E. L. Markov chains and mixing times. American Mathematical Society, 2006.

Lopes, C. G. and Sayed, A. H. Incremental adaptive strategies over distributed networks. IEEE Transactions on Signal Processing, 55(8):4064 4077, 2007. doi: 10.1109/TSP.2007.896034.

Mania, H., Pan, X., Papailiopoulos, D., Recht, B., Ramchandran, K., and Jordan, M. I. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM Journal on Optimization, 27(4):2202 2229, January 2017.

Mao, X., Yuan, K., Hu, Y., Gu, Y., Sayed, A. H., and Yin, W. Walkman: A communication-efficient random-walk algorithm for decentralized optimization. IEEE Transactions on Signal Processing, 68:2513 2528, 2020. doi: 10.1109/TSP.2020.2983167.

Matthews, P. Covering Problems for Markov Chains. The Annals of Probability, 16(3):1215 1228, 1988.

Mei, S., Bai, Y., and Montanari, A. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46(6A):2747 2774, 2018.

Mishchenko, K., Khaled, A., and Richtarik, P. Random reshuffling: Simple analysis with vast improvements. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 17309 17320. Curran Associates, Inc., 2020.

Mishchenko, K., Bach, F., Even, M., and Woodworth, B. Asynchronous SGD beats minibatch SGD under arbitrary delays. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022.

Nagaraj, D., Wu, X., Bresler, G., Jain, P., and Netrapalli, P. Least squares regression with markovian data: Fundamental limits and algorithms. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 16666 16676. Curran Associates, Inc., 2020.

Nedic, A. and Ozdaglar, A. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48 61, 2009. doi: 10.1109/TAC.2008.2009515.

Nedic, A., Olshevsky, A., and Rabbat, M. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5): 953 976, May 2018.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1 edition, 2014. ISBN 1461346916.

SGD under Markovian Sampling Schemes

Rao, S. Finding hitting times in various graphs, 2012.

Rust, J. Structural estimation of markov decision processes. In Engle, R. F. and Mc Fadden, D. (eds.), Handbook of Econometrics, volume 4, chapter 51, pp. 3081 3143. Elsevier, 1 edition, 1986.

Scaman, K., Bach, F., Bubeck, S., Lee, Y. T., and Massoulié, L. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3027 3036. PMLR, 06 11 Aug 2017.

Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Math. Program., 162(1 2):83 112, mar 2017. ISSN 0025-5610.

Simchowitz, M., Mania, H., Tu, S., Jordan, M. I., and Recht, B. Learning without mixing: Towards a sharp analysis of linear system identification. In Bubeck, S., Perchet, V., and Rigollet, P. (eds.), Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 439 473. PMLR, 06 09 Jul 2018.

Stich, S. U. and Karimireddy, S. P. The error-feedback framework: Better rates for sgd with delayed gradients and compressed communication, 2021.

Sun, T., Sun, Y., and Yin, W. On markov chain gradient descent. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Sun, T., Li, D., and Wang, B. Adaptive random walk gradient descent for decentralized optimization. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 20790 20809. PMLR, 17 23 Jul 2022.

Wang, P., Lei, Y., Ying, Y., and Zhou, D.-X. Stability and generalization for markov chain stochastic gradient methods. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022.

Yu, H., Jin, R., and Yang, S. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 7184 7193. PMLR, 09 15 Jun 2019.

SGD under Markovian Sampling Schemes

A. Preliminary results

A.1. Mixing time and relaxation time, mixing time and hitting time

We first begin by the two following lemmas. The first one is very classical, and bounds the mixing time in terms of 1/λP , in the case where the chain is reversible; we provide a proof for completeness. Note that if the chain is reversible, we still have a linear decay (Levin et al., 2006). The second lemma we provide bounds the hitting time of the Markov chain with the mixing time. This result is somewhat less classical, and is not present in the classical Markov chain literature surveys. Lemma A.1 (τmix and λP ). For any ε > 0, if the chain is reversible:

λP ln(ε 1π 1 min) ,

so that τmix l 1 λP ln(π 2 min/2) m .

Proof. We have:

d TV(P tπ0, π) = 1

w V |(P tπ0)w πw|

w V πw|(P tπ0)w πw|

P tπ0 π 2 π

2πmin π0 π π

so that |(P t)v,w πw| ε for t λ 1 P ln(π 1 minε 1/2).

Lemma A.2 (Mixing times and hitting times).

τhit 2π 1 minτmix ,

so that if π is the uniform distribution over V, τhit 2nτmix.

Proof. for any v, w V,

E [τw|v0 = v] = X

k 1 P (τw k|v0 = v)

ℓ 0 P (τw > ℓτmix|v0 = v) .

Then, for ℓ 0,

P (τw > (ℓ+ 1)τmix|v0 = v) = P (τw > (ℓ+ 1)τmix|τw > ℓτmix, v0 = v) P (τw > ℓτmix|v0 = v) ,

and, conditioning on vℓτmix, P (τw > (ℓ+ 1)τmix|τw > ℓτmix, vℓτmix) P v(ℓ+1)τmix = w|vℓτmix . By definition of τmix, we have that P v(ℓ+1)τmix = w|vℓτmix (1 πw/2), so that:

P (τw > (ℓ+ 1)τmix|v0 = v) (1 πw/2)P (τw > ℓτmix|v0 = v) ,

and P (τw > ℓτmix|v0 = v) (1 πw/2)ℓby recursion. Finally,

E [τw|v0 = v] τmix X

ℓ 0 (1 πw/2)ℓ

concluding the proof by taking the maximum over w.

SGD under Markovian Sampling Schemes

A.2. Matthews bound for cover times

The following result bounds the cover time of the Markov chain: it is in fact closely related to its hitting time, and the two differ with a most a factor ln(n). This surprising result is proved in a very elegant way in the survey Levin et al. (2006), using the famous Matthews method (Matthews, 1988).

Theorem A.3 (Matthews bound for cover times). The hitting and cover times of the Markov chain verify:

A.3. A bound on the hitting time of regular and symetric graphs

Using results from Rao (2012), we relate the hitting time of symmetric regular graphs (in a sense that we define below) to well-known graph-related quantities: number of edges |E|, diameter δ and degree d.

Lemma A.4 (Bounding hitting times of regular graphs). Let (vt) be the simple random walk on a d-regular graph G of diameter δ, that satisfies the following symetry property: for any {u, v}, {v, w} E, there exists a graph automorphism that maps v to w. Then, we have:

Proof. Using Theorem 2.1 of Rao (2012), for {v, w} E, we have

E [τw|v0 = v] = 2|E|

where |E| is the number of edges in the graph. Let v and w in V, at distance δ δ. There exists nodes v = v(0), v(1) . . . , v(δ 1), v(δ ) = w such that for all 0 s < δ , {v(s), v(s + 1)} E, and by using the Markov property:

E [τw|v0 = v] X

s<δ E τv(s+1)v0 = v(s) δ 2|E|

A.4. Two miscellaneous lemmas

We finally end this preliminary results section with the two following lemmas, that we help us conclude the proof of

Theorem 6.1. The first lemma will lead to a bound on P

t<T γt 1 where γt is the adaptive stepsize policy defined in Equation (7), while the second one is used to conclude the proof of Theorem 6.1 to show that a remaining term is non-positive.

Lemma A.5. For t 0 and v V, let pv(t) = inf {s > t|vs = v} and dv(t) = sup {s < t|vs = v} be the next and the last previous iterates for which vt = v (dv(t) = 0 by convention, if v has not yet been visited). Assume that (vt) has stationary distribution π. For t 0, let At = supv V t dv(t) and Bt = supv V pv(t) t . We have:

E [Bt|vt = v] τcov , v V ,

and for T 1: X

t<T E [At] Tτcov .

Proof. The first bound on Bt is obtained using the Markov property of the chain, and by definition of τcov. We have:

k=0 P (At k) .

SGD under Markovian Sampling Schemes

For t 0 fixed, we denote d = infv dv(t), so that At = t d. We then have the equality between the following events:

{At k} = {t d k} = {d t k} = {Bt k k} ,

that all coincide with the event there exists some v V such that for all t k s t, vs = v . Summing over t < T:

t<T E [At] = X

k<t P (Bt k k)

s<T ℓ P (Bℓ s)

s 0 P (Bℓ s)

Lemma A.6. Let (at)t 0, (bt)t 0 be two sequences of real-valued random variables. Let (Ft)t 0 be a filtration. Assume

that bt is positive and Ft-measurable for all t, and that E [at|Ft] 0. Then, denoting HT =

PT t=0 at PT t=0 bt , the sequence

(E [HT ])T 0 is non-increasing, so that E [HT ] 0 for all T.

Proof. For fixed T 1, we have, using the fact tha bt is FT measurable for t T

E [HT |FT ] = E [a T |FT ] + P

t<T E [at|FT ] P

t<T E [at|FT ] P

t<T E [at|FT ] P

using E [a T |FT ] 0 and b T > 0. Consequently, taking the mean, we obtain E [HT ] E [HT 1].

B. Lower bound

We prove the smooth non-convex version of Theorem 4.1; the convex cases are proved in a similar way using exactly the same arguments, and the most difficult function in the world , as defined by Nesterov (2014), rather than the one used by Carmon et al. (2021), albeit the two are closely related.

Proof of Theorem 4.1. For x ℓ2 and k N, denote by x(k) its kth coordinate. We split the function defined in Section 3.2 of Carmon et al. (2021) (inspired by the most difficult function in the world of Nesterov (2014)) between two nodes v, w V maximizing E [τw|v0 = v], by setting πvfv(x) = 1

k 1 2x(2k)2 2x(2k 1)x(2k) + 1

2αx(0)2 bx(0) + α

2 and πwfw(x) = 1

k 0 2x(2k + 1)2 2x(2k + 1)x(2k) for some b, α > 0. Then, we define T0 = τv and for t 0,

T2k+1 = inf {t T2k | vt = w} and

T2k+2 = inf {t T2k+1 | vt = w} .

The second step of the proof is somewhat classical, and consists in observing that the black-box constraints of the algorithm

SGD under Markovian Sampling Schemes

together with the construction of the functions fv and fw defined in the proof sketch of Section 4 imply that:

if vt = v and

( Mt Span(ei, i 2k 1) then Mt+1 Span(ei, i 2k) ,

Mt Span(ei, i 2k) then Mt+1 Span(ei, i 2k) ,

if vt = w and

( Mt Span(ei, i 2k) then Mt+1 Span(ei, i 2k + 1) ,

Mt Span(ei, i 2k + 1) then Mt+1 Span(ei, i 2k + 1) ,

if vt / {v, w}, then Mt = Mt+1 .

In other words, even dimensions are discovered by node v, while odd ones are discovered by node w. The dimension Re0 is discovered by node v thanks to the term be0. Using Theorem 1 of Carmon et al. (2021), for a right choice of parameters α, b > 0, f is L-smooth and satisfies f(x0) infx f(x) , together with, any k and any x Mt Span(ei, i 2k),

This lower bound proof technique is explained in a detailed and enlightening fashion in Chapter 3.5 of Bubeck (2015).

Then, the final and more technical step of the proof consists in upper bounding Ek(t). If (Tk+1 Tk)k 0 were independent from k(t), using E [Tk+1 Tk] = τhit for k even, we would directly obtain t E Tk(t) E [k(t)] 1)τhit/2. However, these random variables are not independent: since tail effects can happen, we need a finite second moment for hitting times, and the proof is a bit trickier. First, note that:

E [k(t)] = X

0 k t P (k(t) k) = X

0 k t P (Tk t) .

Let (Xℓ)ℓ 0 be i.i.d. random variables of same law as τw conditioned on v0 = v. We have E [Xℓ] = E [τw|v0 = v] = τhit, and var (Xℓ) < (by assumption). Let Sk = Pk 1 ℓ=0 Xℓ(Sk has the same law as Pk 1 ℓ=0 T2k+1 T2k ), so that, using the Markov property of (vt), Tk stochastically dominates S k/2 . Hence, P (Tk t) P S k/2 t . Then, using Chebychev inequality, for any ℓ 0 and for t such that ℓτhit t, we have:

P (Sℓ t) = P (Sℓ ℓτhit t ℓτhit)

= P (Sℓ ℓτhit)2 (t ℓτhit)2

(t ℓτhit)2 .

We then have:

E [k(t)] 2 X

0 ℓ t/2 P (Sℓ t)

0 ℓ 2t/τhit P (Sℓ t) + 2 X

2t/τhit ℓ t/2 P (Sℓ t)

2t/τhit ℓ t/2

ℓvar (X0) (t ℓτhit)2 .

We finally show that the second term stays bounded:

2t/τhit ℓ t/2

ℓ (t ℓτhit)2 = 1 τ 2 hit

0 ℓ t/2 2t/τhit

ℓ+ 2t/τhit (ℓ+ t/τhit)2

= 1 τ 2 hit

0 ℓ t/2 2t/τhit

ℓ (ℓ+ t/τhit)2 + 2 τ 2 hit

0 ℓ t/2 2t/τhit

t/τhit (ℓ+ t/τhit)2 .

First, using a comparison with a continuous sum, we have:

ℓ (ℓ+ t/τhit)2 X

1 (ℓ+ t/τhit) ln( t t/τhit ) = ln(τhit) ,

SGD under Markovian Sampling Schemes

since for a, x > 0, R ax 0 ydy (y+a)2 R ax 0 dy (y+a) = ln(x). Finally, using P

ℓ 1 1 (a+ℓ)2 R 0 dy (y+a)2 = 1

a, we bound the second sum as: X

0 ℓ t/2 2t/τhit

t/τhit (ℓ+ t/τhit)2 τhit

Wrapping our arguments together, we end up with:

E [k(t)] 4t

τhit + 2var (τv)

τ 2 hit (ln(τhit) + 1 + τhit

For t big enough, we end up with E [k(t)] 5t/τhit, so that since E f(xt) 2 L /(16E [k(t)]2) as explained in the main text, we have:

f(xt) 2 = Ω L τ 2 hit t2

C. Markov chain stochastic gradient descent: proof of Theorem 5.3

We start by proving the following bound on E h fvt(xt) 2i . Note that this bound can be used for any t τmix.

Lemma C.1. For t 0 and if vt πt for d TV(πt, π) πmin/2, we have:

E h fvt(xt) 2i 3 σ2 + 2E h f(xt) 2i .

Proof of the Lemma. We have for any v V that P (vt = v) πv + πv/2 = 3πv/2, so that

E h fvt(xt) 2i 2E h fvt(xt) f(xt) 2i + 2E h f(xt) 2i

v V P (vt = v) σ2 v + 2E h f(xt) 2i

= 3 σ2 + 2E h f(xt) 2i .

The proof borrows ideas from both the analyses of delayed SGD (Mania et al., 2017) and SGD with biased gradients (Even et al., 2022), thus refining MC-SGD initial analysis (Johansson et al., 2010). While a biased gradient analysis would not yield convergence to an ε-stationary point for arbitrary ε (at every iterations, biases are non-negligible and can be arbitrary high), by enforcing a delay τ (of order τmix) in the analysis, we manage to take advantage of the ergodicity of the biases.

C.1. Smooth non-convex case of Theorem 5.3

Proof of Theorem 5.3.1. Denoting Ft = Ef(xt) f(x ), we have using smoothness:

Ft+1 Ft γE [ fvt(xt), f(xt) ] + γ2L

2 E h fvt(xt) 2i .

For the first term on the righthandside of the inequality, assuming that t τ for some τ > 0 we explicit later in the proof:

E [ γ fvt(xt), f(xt) ] = E [ γ fvt(xt τ), f(xt τ) ] + E [ γ fvt(xt), f(xt) f(xt τ) ]

+ E [ γ fvt(xt) fvt(xt τ), f(xt τ) ] .

First, we condition the first term on the filtration up to time t τ:

E [ γ fvt(xt τ), f(xt τ) ] = E [ γ Et τ fvt(xt τ), f(xt τ) ]

2 E h Et τ fvt τ (xt) 2i + γ

2 E [ f(xt τ) Et τ fvt(xt τ) ]

2 E h f(xt τ) 2i .

SGD under Markovian Sampling Schemes

Then, for τ τmix(πminε), using the following lemma, we have, for ε < 1/2:

E [ γ Et τ fvt(xt τ), f(xt τ) ] γ

4 E h f(xt τ) 2i + γε2 σ2 .

Lemma C.2. For τ τmix(επmin) and t τ,

E h Et τ fvt(xt τ) f(xt τ) 2i 2ε2E h f(xt τ) 2i + 2ε2 σ2 .

Proof of the Lemma. We have:

E h Et τ fvt(xt τ) f(xt τ) 2i = E

v V (P (vt = v|xt τ) πv) fv(xt τ)

v V πv E h fv(xt τ) 2i ,

where we used |P (vt = v|xt τ) πv| επv and convexity of the squared Euclidean norm. For that last term, X

v V πv E h fv(xt τ) 2i X

v V 2πv E h f(xt τ) 2i + σ2 v

= 2E h f(xt τ) 2i + 2 σ2 ,

concluding the proof of the Lemma.

Using gradient Lipschitzness and writing xt xt τ = γ Pt 1 s=max(t τ,0) fvs(xs), we have:

E [ γ fvt(xt), f(xt) f(xt τ) ] γ2LE

s=max(t τ,0) fvs(xs)

2 (τE h fvt(xt) 2i +

s=max(t τ,0) E h fvs(xs) 2i .

E [ γ fvt(xt) fvt(xt τ), f(xt τ) ] γ2L

2 (τE h f(xt τ) 2i +

s=max(t τ,0) E h fvs(xs) 2i .

Wrapping things up, we obtain, for t τ and τ τmix:

4 E h f(xt τ) 2i + γε2 σ2

2 (τ + 1)E h fvt(xt) 2i + τE h f(xt τ) 2i + 2

s=max(t τ,0) E h fvs(xs) 2i

4 E h f(xt τ) 2i + γε2 σ2 + (3τ + 1)γ2L

2 (τ + 1)E h f(xt) 2i + τE h f(xt τ) 2i + 2

s=max(t τ,0) E h f(xs) 2i .

Summing for τ t < T:

τ t<T E h f(xt τ) 2i 4Fτ

τ t<T 6γLτ E h f(xt) 2i + E h f(xt τ) 2i + 2 2ε2 + γ(3τ + 1) σ2 ,

SGD under Markovian Sampling Schemes

leading to, for γ 1 12Lτ :

t<T τ E h f(xt) 2i 4Fτ

T τ t<T E h f(xt) 2i + 2 2ε2 + γ(3τ + 1) σ2 . (8)

We now prove that for any t 0, we have supt s t+τ E h f(xs) 2i 4E h f(xt) 2i + 8γ2L2τ 2σ2. Let t s < t + τ.

E h f(xs) 2i 2E h f(xt) 2i + 2E h f(xs) f(xt) 2i

2E h f(xt) 2i + 2L2γ2E

r=t fvr(xr) 2 #

2E h f(xt) 2i + 4L2γ2τ

r=t E h f(xr) 2i + σ2

2E h f(xt) 2i + 4L2γ2τ 2( sup t s t+τ E h f(xs) 2i + σ2) ,

leading to the desired result for γ 1/(8Lτ). Plugging this in (8):

t<T τ E h f(xt) 2i 4Fτ

T τ t<T E h f(xt τ) 2i

T 4L2γ2τ 2 σ2 + 2 2ε2 + γ(3τ + 1) L σ2 .

Now, for γ = min(1/(48Lτ), p

F0/(TLτ σ2)), ε = 1/

T, and so τ = τmix ln(T), we have:

t<T τ E h f(xt) 2i 196τLFτ

We now upper bound Fτ. For any t and γ < 1/(2L),

2 E h f(xt) fvt(xt) 2i γσ2 max/2 ,

where the first inequality is a simplified version of the descent lemma with biased gradient at the beggining of this proof, and the second inequality uses the initialization properties of v0. Thus, we obtain Fτ F0 + γτσ2 max/2 F0 + σ2 max/L for our choice of γ. We thus conclude by plugging this in (9) applied for T + τ instead of T, yielding the desired result.

The condition for the upper-bound we proved above to be true, namely T τ = τmix ln(T), is always satisfied for T 2τmix ln(τmix). Indeed, if T τ 2 mix, then τmix ln(T) 2τmix ln(τmix) T, and otherwise we have τmix ln(T)

T ln(T) T. This concludes the proof, and x0 in the Theorem corresponds to xτ.

C.2. Under a µ-PL inequality

Proof of Theorem 2.2. We start from:

4 E h f(xt τ) 2i + γε2 σ2 + (3τ + 1)γ2L

2 (τ + 1)E h f(xt) 2i + τE h f(xt τ) 2i + 2

s=max(t τ,0) E h f(xs) 2i .

If f satisfies a µ-PL inequality, then E h f(xt τ) 2i 2µFt τ, so that, for some α (0, 1):

Ft+1 Ft αγµ

4 Ft τ (1 α)γ

8 E h f(xt τ) 2i + γε2 σ2 + (3τ + 1)γ2L

2 (τ + 1)E h f(xt) 2i + τE h f(xt τ) 2i + 2

s=max(t τ,0) E h f(xs) 2i .

SGD under Markovian Sampling Schemes

For t 0, let Pt = (1 αγµ/4) t. We multiply the above expression by Pt+1 and sum for t < T, hoping for cancellations. For T τ:

τ t<T Pt+1 Ft Ft+1 αγµ

τ t<T Pt+1 (1 αγµ

4 )Ft Ft+1 + αγµ

4 (Ft Ft τ)

τ t<T Pt Ft X

τ+1 t T Pt Ft

τ t<T Pt+1Ft αγµ

τ t<T Pt+1Ft τ

PτFτ PT FT + αγµ

τ t<T Pt+1Ft Pταγµ

0 t<T τ Pt+1Ft

PτFτ PT FT + αγµ

T τ t<T Pt+1Ft

PτFτ PT FT + αγ

T τ t<T Pt+1E h f(xt) 2i ,

using the µ-PL inequality. For t 0, we denote Rt = E h f(xt) 2i . We now handle the Rt terms.

8 Pt+1Rt τ + X

2 (τ + 1)Pt+1Rt + τPt+1Rt τ + 2

s=t τ Pt+1Rs

τ t<T Pt+1Rt + τ X

0 t<T τ Pt+1Rt + 2τ X

t<T Rt Pt+τ

0 t<T τ Pt+1Rtγ (1 α)

2 (2τ + 1 + 2τPτ 1)

(τ + 1 + 2τPτ 1) Pt+1Rt

16 Pt+τ+1Rt

T τ t<T Pt+1Rt ,

if γ satisfies γ 1 α 8βL(5τ+1) and Pτ 2, for some β 1. Since for γµ 1, Pτ eτµγ, Pτ 2 can be ensured with γ 1 2τL. All in one, we have:

0 PτFτ PT FT + γ α

T τ t<T Pt+1Rt

16 Pt+τ+1Rt

+ γε2 σ2 + (3τ + 1)γ2L

τ t<T Pt+1 .

SGD under Markovian Sampling Schemes

Using what we proved in the previous proof, we have Rt 4Rt τ + 8γ2L2τ 2σ2 for T τ t < T, so that:

T τ t<T Pt+1Rt 4γ α

T 2τ t<T τ Pt+τ+1Rt

+ 8γ2L2τ 2σ2γ α

T 2τ t<T τ Pt+τ+1 .

Consequently, for 4γ α

16 γ, which can be ensured with α = 1/16 and β = 8, we have:

0 PτFτ PT FT + 1

8γ3L2τ 2σ2 X

T 2τ t<T τ Pt+τ+1

+ γε2 + (3τ + 1)γ2L

τ t<T Pt+1 ,

FT Fτ/PT τ + γ2 σ2L ε2

Lγ + 3τ + 1

2Fτ/PT + 2γ σ2

Lγ + 3τ + 1

2Fτ/PT + 2γ σ2

Finally, using Fτ F0 + σ2 max/L, if γ 1 64(5τ+1) where τ = τmix(ε):

FT 2(F0 + σ2 max/L)(1 γµ

8 )T + 2γ σ2

We thus choose ε = p

1/T so that τ τmix ln(T), and stepsize γ = min( 8 ln(T (F0+ σ2)/ σ2)

µT , 1 64(5τ+1)) , leading to the desired result for c = 64 6 = 384.

The same discussion than in the smooth non-convex proof regarding the condition T τ applies here.

D. Markov chain SGD: results in the interpolation regime

D.1. Proof of Theorem 5.6

We begin by proving the following lemma. Lemma D.1. For any T 1, we have:

t<T fvt(x )

t<T d TV(P t v0, , π ) + 2σ2 X

s<t<T d TV(t s) ,

where d TV(r) = sup {d TV((P r)v, , π ) , v V} for r N, so that:

t<T fvt(x )

σ2 4τmix(1/4) + T(1 + 8τmix(1/4)) .

Proof. We have:

t<T fvt(x )

t<T fvt(x )

t<T fvt(x )

t<T E h fvt(x ) 2i + 2 X

s<t<T E [ fvs(x ), fvt(x ) ] .

SGD under Markovian Sampling Schemes

Denote G = ( fv(x ))v V RV d. For the first term above,

t<T E h fvt(x ) 2i = X

t<T E h G ,vs 2i

P (vt = v) πv G ,v 2 !

t<T d TV(P t v0, , π ) .

s<t<T E [ fvs(x ), fvt(x ) ] = X

s<t<T E [ G ,vs, G ,vt ]

v,w V (P s)v0,v(P t s)v,w G ,v G ,w

v,w V (P s)v0,v (P t s)v,w 1

n G ,v G ,w

v,w V (P s)v0,v (P t s)v,w 1

v V (P s)v0,v X

(P t s)v,w 1

s<t<T d TV(t s) ,

where d TV(r) = sup {d TV((P r)v, , π ) , v V} for r N.

We finally bound P

t<T d TV(t). Using Chapter 4.5 of Levin et al. (2006), we have τmix(ε) (log2(ε 1) + 1)τmix(1/4), so that for any t 0, d TV(t) 2 t/τmix(1/4)+1. Hence,

t<T d TV(t) 2 1 2 1/τmix(1/4) 4τmix(1/4) ,

s<t<T d TV(t s) 4Tτmix(1/4) ,

concluding the proof.

Lemma D.2. For any yt Rd and t 0, denoting yt+1 = yt γ fvt(x ), we have:

xt+1 yt+1 2 (1 γµ) xt yt 2 + γL yt x 2 .

Proof. Denote ft = fvt( ). We expand:

xt+1 yt+1 2 = xt yt 2 2γ ft(xt) ft(x ), xt yt + γ2 ft(xt) ft(x ) 2 .

Using the three points equality as Mishchenko et al. (2020), we have:

2γ ft(xt) ft(x ), xt yt = 2γDft(yt, xt) 2γDft(xt, x ) + 2γDft(yt, x ) .

First, 2γDft(yt, xt) γµ xt yt 2 using strong convexity. Then, 2γDft(xt, x ) cancels the term γ2 ft(xt) ft(x ) 2 2γ2LDft(xt, x ) for γ 1/L, using smoothness of ft. And finally, using smoothness again, 2γDft(yt, x ) γL yt x 2, concluding the proof.

SGD under Markovian Sampling Schemes

Proof of Theorem 5.6. Fix some y0 Rd and let (yt) be defined with the recursion

yt+1 = yt γ fvt(x , ξt) .

Unrolling the previous Lemma, we have, for a fixed time horizon T:

x T y T 2 (1 γµ)T x0 y0 2 + γL X

t<T (1 γµ)T t yt x 2 .

This is possible, since the descent lemma is deterministic, in the sense that no expectations are taken so far. Since we want control over the distance to the optimum, we wish to have y T = x , leading to:

y0 = x + γ X

t<T fvt(x , ξt) , ys = x + γ X

s t<T fvt(x , ξt) , s < T .

We thus have:

E x T x 2 2(1 γµ)T

E x0 x 2 + γ2E

s<T fvs(x )

t<T (1 γµ)T t E

t s<T fvs(x )

Using Lemma D.1, we have for any t T:

t s<T fvs(x )

for C = 13. Hence,

E x T x 2 2(1 γµ)T E x0 x 2 + γ2TCτmixσ2

t<T (1 γµ)T t(T t)Cτmixσ2

2(1 γµ)T x0 x 2 + 3γL

µ2 Cτmixσ2 ,

t<T (1 x)tt 1/x2 and (1 z)xx 1 eu 1/(2u). Finally, for a stepsize choice of

1 L , 1 Tµ ln

E [ x T x ] 2e µT

L x0 x 2 + O Lτmixσ2 µ3T

D.2. Proof of Proposition 5.9

Proof. Consider the graph G on the set of nodes V = {0, 1}, with probability transitions p01 = p10 = p and p00 = p11 = 1 p, for some small p (0, 1). The relaxation time τmix(1/4) of this graph scales as 1/p.

Consider now f0(x) = 1

2(x 1)2 and f1(x) = 1

2(x + 1)2 for x R, so that x = 0. For (vt) a Markov chain with the given transition probabilities, started at v0 following the uniform (stationary) distribution on V, let xt be generated with MC-SGD: xt+1 = xt γ fvt(xt) and x0 = 1, i.e.,

x T = (1 γ)T γ X

t<T (1 γ)T t 1ζt ,

SGD under Markovian Sampling Schemes

where ζt { 1, 1} takes value 1 if vt = 0 and value 1 if vt = 0. We have:

E (x T x )2 = (1 γ)2T + γ2 X

s<t<T (1 γ)2T t s 2E [ζsζt] .

We compute this second term, and show that it is non-negative for p 1/2 and of order γ 4p, so that to reach a given precision ε > 0, is required γ 4pε and thus to make the first term small, T must verify T = Ω(1/(2pε)), concluding our reasonning.

For s < t, we have E [ζsζt] = 2P (vt s = v0|v0 π ) 1/ Denoting zk = P (vk = v0|v0 π ), we have zk+1 = pzk + (1 p)(1 zk) and z0 = 1, so that zk = 1

2(1 + (1 2p)k) for k 0. This leads to:

s<t<T (1 γ)2T t s 2E [ζsζt] = γ2 (1 γ)(1 2p) 1 (1 γ)(1 2p)

1 (1 γ)2 (1 2p)(1 γ)T (1 2p)T

2p γ (1 γ)T

For ε 0, in order to have E (x T x )2 ε, is required (1 γ)T ε so that γT . Under γT , we have

s<t<T (1 γ)2T t s 2E [ζsζt] γ

Finally, to reach precision ε, this quantity needs to be upper-bounded by ε(1 + o(1)), so that γ 4pε 1(1 + o(1)) is necessary. Plugging this in (1 γ)2T ε yields T = Ω(pε 1), the desired result.

E. With local noise: proof of Theorem 5.11

The proof follows the exact same steps as the proof of Theorem 5.6.

Proof. First, note that, using Lemma 15 in Stich and Karimireddy (2021), we have:

t<T x Fvt(x , ξt)

t<T fvt(x )

+ 2Tσ2 local .

We then have the following lemma, proved exactly as in the previous section.

Lemma E.1. For any yt Rd and t 0, denoting yt+1 = yt γ x Fvt(x , ξt), we have:

xt+1 yt+1 2 (1 γµ) xt yt 2 + γL yt x 2 .

This leads to: x T y T 2 (1 γµ)T x0 y0 2 + γL X

t<T (1 γµ)T t yt x 2 ,

for y0 = x + γ X

t<T x Fvt(x , ξt) , ys = x + γ X

s t<T x Fvt(x , ξt) , s < T .

We thus have:

E x T x 2 2(1 γµ)T

E x0 x 2 + γ2E

s<T x Fvs(x , ξs)

t<T (1 γµ)T t E

t s<T x Fvs(x , ξs)

To conclude, we use the first inequality of this proof, and Lemma D.1, and proceed as in the proof of Theorem 5.6 and Corollary 5.8.

SGD under Markovian Sampling Schemes

F. MC-SAG: proof of Theorem 6.1

F.1. With perfect initialization: Theorem 6.1.1

We begin classically by proving a descent lemma. This lemma is deterministic, in the sense that not means E are present, and it therefore does not use the Markovian properties of the Markov chain. MC-SAG uses biased gradients, even in the case where vt are i.i.d., since the algorithm SAG (Schmidt et al., 2017) is inherently biased (making it unbiased leads to the SAGA iterations (Defazio et al., 2014)).

Let Gt = ht+1 for t 0, so that xt+1 = xt γt Gt. We recall that for v V and t 0, pv(t) is the next time (strictly) the chain hits node v, while dv(t) is either the last time the chain was at the state v (if that happened), or 0 in v has not yet been visited.

Lemma F.1. Assume that f is L-smooth. Then, for any t 0, we have:

f(xt+1) f(xt) γt

2 f(xt) 2 γt

4 Gt 2 + γt L2

s=dv(t) γs Gs

Proof. For t 0 and v V, let pv(t) = inf {s > t|vs = v} and dv(t) = sup {s t|vs = v} be the next and the last previous iterates for which vt = v (dv(t) = 0 by convention, if v has not yet been visited). Denote Ft = E [f(xt) f(x )]. We have, using smoothness:

f(xt+1) f(xt) γt f(xt), Gt + γ2 t L

Together with f(xt), Gt = 1

2( f(xt) 2 + Gt 2 f(xt) Gt 2), we obtain:

f(xt+1) f(xt) γt

2 f(xt) 2 + Gt 2 f(xt) Gt 2 + γ2 t L

2 f(xt) 2 γt

4 Gt 2 + γt

2 f(xt) Gt 2 ,

as long as γt 1/(2L). We thus need to upperbound the bias f(xt) Gt 2. We have:

f(xt) Gt 2 =

v V fv(xdv(t)) fv(xt)

fv(xdv(t)) fv(xt) 2 .

Fix some v in V. We have fv(xt) fv(xdv(t)) 2 L2 Pt 1 s=dv(t) γs Gs 2 , leading to:

f(xt+1) f(xt) γt

2 f(xt) 2 γt

4 Gt 2 + γt L2

s=dv(t) γs Gs

We now proceed with the proof of Theorem 6.1.1.

Proof of Theorem 6.1.1. We begin with

f(xt+1) f(xt) γt

2 f(xt) 2 γt

4 Gt 2 + γt L2

s=dv(t) γs Gs

SGD under Markovian Sampling Schemes

as a starting point. For v V,

s=dv(t) L2γs Gs

s=dv(t) (t dv(t))L2γtγ2 s Gs 2

s=dv(t) Lγ2 s Gs 2 ,

since γt 1/(L(t dv(t))). Summing for t < T, we obtain:

2 f(xt) 2 F0 X

s=dv(t) Lγ2 s Gs 2 .

s=dv(t) Lγ2 s Gs 2 = X

s<T Gs 2Lγ2 s(pv(s) s) .

For γs 1/(2Lτhit), we have E h Gs 2γ2 s(pv(s) s) i 1

2E h Gs 2γs i , so that:

2 f(xt) 2 F0 + X

2 Gt 2 + Kt ,

where Kt = 1 4n P

v V Gt 2Lγ2 t (pv(t) t) verifies E [Kt|Ft] Gt 2γt/4 since γt 1/(2τhit), where Ft is the filtration up to time t:

E [Kt|Ft] = E

v V Gt 2Lγ2 t (pv(t) t)|Ft

v V Gt 2Lγ2 t E [(pv(t) t)|Ft] Gt , γt are Ft-measurable

v V Gt 2Lγ2 t τhit since E [(pv(t) t)|Ft] = E [(pv(t) t)|vt] τhit

8γt Gt 2 since γt 1/(2τhit) .

E min t<T f(xt) 2 E 2F0 P

2 Gt 2 + Kt

Using Lemma A.6 and the above bound on E [Kt|Ft], that last term is non-positive. Using Jensen inequality, we have:

E min t<T f(xt) 2 2F0

t<T E γ 1 t .

Since γ 1 t = 2L(τhit + supv V(t dv(t))), we have P

t<T E γ 1 t 2LT(τhit + τcov) using Lemma A.5, concluding the proof.

SGD under Markovian Sampling Schemes

F.2. With arbitrary initialization: proof of Theorem 6.1.2

We no longer assume that G0 = f(x0) or that hv = fv(x0): hv are arbitrary and G0 = 1

v V hv. In that case, we have the following descent lemma.

Lemma F.2. Assume that f is L-smooth. Let

τcov = inf {t 1 such that {v0, . . . , vt 1} = V} .

Then, for any t 0, we have:

f(xt+1) f(xt) γt

2 f(xt) 2 γt

4 Gt 2 + γt L2

s=dv(t) γs Gs

+ γt1t< τcov

v V fv(x0) hv 2 .

Proof. As in Lemma F.1, we have

f(xt+1) f(xt) γt

2 f(xt) 2 γt

4 Gt 2 + γt

2 f(xt) Gt 2 ,

as long as γt 1/(2L). We thus again need to upperbound the bias f(xt) Gt 2. For t τcov, we have Gt = 1 n P

v V fv(xdv(t)), and so:

f(xt) Gt 2 1

fv(xdv(t)) fv(xt) 2 ,

hence the result of Lemma F.1 holds for t τcov (and so does that of the Lemma we are proving). Then, for t τcov, f(xt) Gt = 1

v V fv(xt) ht v, where ht v = hv if v hasn t been visited yet, or ht v = fv(xdv(t)) otherwise. Hence,

f(xt) Gt 2 1

ht v fv(xt) 2

fv(xdv(t)) fv(xt) 2 + 2

v V fv(x0) hv 2

Since fv(xt) fv(xdv(t)) 2 L2 Pt 1 s=dv(t) γs Gs 2 , we have:

f(xt+1) f(xt) γt

2 f(xt) 2 γt

4 Gt 2 + γt L2

s=dv(t) γs Gs

v V fv(x0) hv 2 ,

leading to the desired result.

Proof of Theorem 6.1.2. Starting from

f(xt+1) f(xt) γt

2 f(xt) 2 γt

4 Gt 2 + γt L2

s=dv(t) γs Gs

+ γt1t< τcov

v V fv(x0) hv 2 ,

and mimicking the proof with initialization, we obtain that for γ 1 t = 4L(τhit + supv V(t dv(t))), we have

E min t<T f(xt) 2 2F0

t<T E γ 1 t + 2E [ τcov]

v V fv(x0) hv 2 ,

and we conclude the proof by noticing that E [ τcov] τcov.

SGD under Markovian Sampling Schemes

G. Numerical illustration of our theory

Setting We place ourselves in the decentralized optimization setting on a graph G with local functions fv, and build two toy problems. We compare our algorithms MC-SGD and MC-SAGA with Walkman (Mao et al., 2020) and decentralized SGD (D-SGD in Figure 1) (Koloskova et al., 2020; Yu et al., 2019) with both randomized gossip communications and fixed gossip matrix. We consider the non-convex loss function ℓ(x, a, b) = (σ(x a) b)2/2 where σ(t) = 1/(1 + exp( t)) as in Mei et al. (2018). For v V, we take fv(x) = ℓ(x, av, bv) for av and bv random variables. In Figure 1(a), we take a connected random geometric graph of n = 50 nodes in [0, 1]2 with radius parameter ρ = 0.3 (nodes are connected if their distance is less than ρ). We consider homogeneous data: av, bv are taken i.i.d., with av N(0, 1) and bv uniform in [0, 1]. In Figure 1(b), we take the cycle of n = 50 nodes and consider heterogeneous data: for two opposite nodes in the graph (v = 0 and v = 25 e.g.), av are N(0, 1) and bv uniform in [0, 1] (and the functions are re-normalized), while they are taken equal to 0 in the rest of the graph. In Figure 1(a), the graph is well connected and Assumption 5.1 is verified for a small enough σ2, so that all algorithms perform comparatively well; MC-SGD because of the homogeneity, the gossip-based ones and Walkman thanks to the connectivity. However, decreasing the connectivity and increasing data heterogeneity in a pathological way, we obtain Figure 1(b): MC-SGD fails due to function-heterogeneity ( σ2 too big), while the three others are slowed down by their communication inefficiency, illustrating how depending only on τhit, MC-SAGA outperforms the other algorithms, that rather depend on nτmix or nτ 2 mix.

(a) Geometric graph, homogeneous functions

(b) Cycle graph, heterogeneous functions

Figure 1: Comparison of MC-SAGA and MC-SGD with existing algorithms. In abscissa are the total number of communications, and in ordinate f(x) f(x ).