# Stochastic Gradient Descent under Markovian Sampling Schemes

Mathieu Even, Inria - ENS Paris. Correspondence to: Mathieu Even. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

We study a variation of vanilla stochastic gradient descent where the optimizer only has access to a Markovian sampling scheme. These schemes encompass applications that range from decentralized optimization with a random walker (token algorithms) to RL and online system identification problems. We focus on obtaining rates of convergence under the least restrictive assumptions possible on the underlying Markov chain and on the functions optimized. We first unveil the theoretical lower bound for methods that sample stochastic gradients along the path of a Markov chain, which reveals a dependency on the hitting time of the underlying Markov chain. We then study Markov chain SGD (MC-SGD) under much milder regularity assumptions than prior works. We finally introduce MC-SAG, an alternative to MC-SGD with variance reduction, that only depends on the hitting time of the Markov chain, thereby obtaining a communication-efficient token algorithm.

1. Introduction

In this paper, we consider a stochastic optimization problem that takes root in decentralized optimization, estimation problems, and Reinforcement Learning. Consider a function f defined as:

f(x) = E_{v∼π}[ f_v(x) ] ,  x ∈ R^d ,   (1)

where π is a probability distribution over a set V, and f_v are smooth functions on R^d for all v in V. Classically, this represents the loss of a model parameterized by x on data parameterized by v. If i.i.d. samples (v_t)_{t≥0} of law π and their corresponding gradient estimates (∇f_{v_t}) were accessible, one could directly apply SGD-like algorithms, which have proved to be efficient in large-scale machine learning problems (Bottou et al., 2018). We however consider in this paper a different setting: we assume the existence of a Markov chain (v_t) with state space V and stationary distribution π. The optimizer may then use biased stochastic gradients along the path of this Markov chain to perform incremental updates. She may for instance use the Markov chain SGD (MC-SGD) algorithm, defined through the following recursion:

x_{t+1} = x_t − γ ∇f_{v_t}(x_t) .   (2)

Being ergodically unbiased, such iterates should behave closely to those of vanilla SGD. The analysis is however notoriously difficult, since in (2) the variable x_t and the current state of the Markov chain v_t are not independent, so that E[∇f_{v_t}(x_t) | x_t] can be arbitrarily far from ∇f(x_t).
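A minimal sketch of the recursion (2) is given below, to make the sampling scheme concrete. The two-state chain, the quadratic losses f_v and the stepsize are illustrative assumptions, not taken from the paper; the point is only that gradients are queried along the path of (v_t) rather than i.i.d. from π.

```python
import numpy as np

# Minimal illustration of MC-SGD (2): gradients are sampled along the path of a
# Markov chain (v_t) instead of i.i.d. from pi. Chain, losses and stepsize are toy choices.
rng = np.random.default_rng(0)

P = np.array([[0.7, 0.3],          # transition matrix of (v_t); stationary pi = (0.5, 0.5)
              [0.3, 0.7]])
targets = [np.array([1.0, 0.0]),   # f_v(x) = 0.5 * ||x - target_v||^2
           np.array([0.0, 1.0])]

def grad_fv(v, x):
    return x - targets[v]

x = np.zeros(2)
v = 0
gamma = 0.02
for t in range(5000):
    x = x - gamma * grad_fv(v, x)   # x_{t+1} = x_t - gamma * grad f_{v_t}(x_t)
    v = rng.choice(2, p=P[v])       # v_{t+1} ~ P(v_t, .)
print(x)  # hovers near (0.5, 0.5), the minimizer of f = E_{v~pi}[f_v], despite the Markovian bias
```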
This paper focuses on analyzing algorithms that incrementally sample stochastic gradients along the path of the Markov chain (v_t), motivated by the following applications.

1.1. Token algorithms

Traditional machine learning optimization algorithms require data centralization, raising scalability and privacy issues, hence the alternative of Federated Learning, where users' data is held on-device and the training is orchestrated at a server level. Decentralized optimization goes further, by removing the dependency on a central entity, leading to increased scalability, privacy and robustness to node failures, and broadening the range of applications to IoT (Internet of Things) networks. In decentralized optimization, users (or agents) are represented as nodes of a connected graph G = (V, E) over a finite set of users V (of cardinality n). The problem considered is then the minimization of

f(x) = (1/n) Σ_{v∈V} f_v(x) ,  x ∈ R^d ,   (3)

where each f_v is locally held by user v ∈ V, using only communications between neighboring agents in the graph. There are several known decentralized algorithmic approaches to minimize f under these constraints. The prominent one consists in alternating between communications using gossip matrices (Boyd et al., 2006; Dimakis et al., 2010) and local gradient computations, until a consensus is reached. These gossip approaches suffer from a high synchronization cost (nodes in the graph are required to perform simultaneous operations, or to be aware of operations at the other end of the communication graph) that can be prohibitive if we aim at removing the dependency on a centralized orchestrator. Further, a high number of communications is required to reach consensus, whether all nodes in the graph (as in synchronous gossip) or only two neighboring ones (as in randomized gossip) communicate at each iteration. To alleviate these communication burdens, building on the original works of Lopes & Sayed (2007) and Johansson et al. (2007; 2010), we study algorithms based on Markov chain SGD: a variable x performs a random walk on graph G, and is incrementally updated at each step of the random walk, using the local function available at its location. This approach thus boils down to the one presented above with the function defined in (1), where V is the (finite) set of agents, π is the uniform distribution over V, and (v_t) is the Markov chain consisting of the consecutive states of the random walk performed on graph G. The random walk guarantees that all communications are spent on updating the global model, as opposed to gossip-based algorithms, where communications are used to reach a running consensus while locally performing gradient steps. These algorithms are referred to as token algorithms: a token (that represents the model estimate) randomly walks the graph and performs updates during its walk.
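For the token setting above, the only requirement on the walk is that its stationary distribution be uniform, so that (1) coincides with (3). The sketch below shows one standard construction (Metropolis-Hastings weights) achieving this on an arbitrary connected graph; it is an illustrative choice, not the specific chain used in the paper.

```python
import numpy as np

def metropolis_walk(adj):
    """Transition matrix of a random walk on the graph with uniform stationary
    distribution, built from Metropolis-Hastings weights (one standard construction;
    the paper only requires some chain with the desired stationary law)."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    P = np.zeros((n, n))
    for v in range(n):
        for w in range(n):
            if adj[v, w]:
                P[v, w] = 1.0 / max(deg[v], deg[w])  # move v -> w
        P[v, v] = 1.0 - P[v].sum()                   # stay put with the remaining mass
    return P

# Example: a path graph on 4 nodes; the uniform vector is left invariant by P.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
P = metropolis_walk(adj)
print(np.allclose(np.ones(4) / 4 @ P, np.ones(4) / 4))  # True: pi = uniform satisfies pi P = pi
```

At each step, the token only traverses one edge, so every communication directly updates the global model, in contrast with gossip schemes.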
There are two directions to design and analyze token algorithms. Johansson et al. (2007) designed and analyzed their algorithm based on SGD with subdifferentials and Markov chain sampling (given by the random walk). Following works (Duchi et al., 2011; Sun et al., 2018) tried to improve convergence guarantees of such stochastic gradient algorithms with Markov chain sampling in various scenarios (e.g., mirror SGD). However, all these analyses rely on overly strong assumptions: bounded gradients and/or bounded domains are assumed, and the rates obtained are of the form τ_mix/T + √(τ_mix/T) for a number T of steps, where τ_mix is the mixing time of the underlying Markov chain. More recently, Dorfman & Levy (2022) obtained similar rates under similar assumptions (bounded losses and gradients), but without requiring any prior knowledge of τ_mix, using adaptive stepsizes. A more recent approach consists in deriving token algorithms from Lagrangian duality and from variants of coordinate gradient methods or ADMM algorithms with Markov chain sampling. Mao et al. (2020) introduce the Walkman algorithm, whose analysis works on any graph, and obtain rates of order τ_mix² n / T to reach approximate stationary points, while Hendrikx (2022) introduced a more general framework, but whose analysis only works on the complete graph (and is thus equivalent to i.i.d. sampling). Yet, Hendrikx (2022) extends the analysis to arbitrary graphs by performing gradient updates every τ_mix steps of the random walk, obtaining a dependency on τ_mix n and making their algorithm the state of the art for these problems. Alternatively, Wang et al. (2022) study the algorithmic stability of MC-SGD in order to derive generalization upper bounds for this algorithm, and Sun et al. (2022) propose and study adaptive token algorithms. Recently, and concurrently with this work, Doan (2023) also studies MC-SGD without smoothness; however, their dependency on the mixing time of the random walk (in their Theorem 1) scales as exp(c τ_mix): this is prohibitive as soon as the mixing time becomes larger than O(1). In summary, current token algorithms and their analyses either rely on strong noise and regularity assumptions (e.g., bounded gradients), or suffer from an overly strong dependency on Markov chain-related quantities (as in (Mao et al., 2020; Hendrikx, 2022)).

The token algorithms we consider are to be put in contrast with consensus-based decentralized algorithms, or gossip algorithms (with fixed gossip matrices (Dimakis et al., 2010) or with randomized pairwise communications (Boyd et al., 2006)). They were originally introduced to compute the global average of local vectors through peer-to-peer communication. Among the classical decentralized optimization algorithms, some alternate between gossip communications and local steps (Nedic & Ozdaglar, 2009; Koloskova et al., 2019; 2020), others use dual formulations and formulate the consensus constraint using gossip matrices to obtain decentralized dual or primal-dual algorithms (Scaman et al., 2017; Hendrikx et al., 2019; Even et al., 2021a; Kovalev et al., 2021; Alghunaim & Sayed, 2019), which benefit from natural privacy amplification mechanisms (Cyffers et al., 2022). Other approaches include non-symmetric communication matrices (Assran & Rabbat, 2021) that are more scalable. We refer the reader to Nedic et al. (2018) for a broader survey on decentralized optimization. The works closest to ours in this line of research are Koloskova et al. (2020), which performs a unified analysis of decentralized SGD (the gossip equivalent of our algorithm MC-SGD) and in particular contains rates for convex non-smooth functions, and Yu et al. (2019), which analyzes decentralized SGD with momentum in the smooth non-convex case, the gossip equivalent of our algorithm MC-SAG.

1.2. Reinforcement Learning problems and online system identification

In several applications (e.g., RL, time-series analysis), a statistician may have access to values (X_t)_{t≥0} generated sequentially along the path of a Markov chain, observations from which she wishes to estimate a parameter. For instance, Kowshik et al. (2021) consider a sequence of observations X_{t+1} = A X_t + ξ_t for ξ_t i.i.d. centered noise and A a matrix to estimate, and aim at finding Â minimizing the corresponding mean squared error.

A Markov chain with transition matrix P is aperiodic if there exists t_0 > 0 such that for all t ≥ t_0 and v, w ∈ V, (P^t)_{v,w} > 0. Any irreducible and aperiodic Markov chain on V admits a stationary distribution π, which verifies πP = π. It finally holds that, if P is reversible (π_v P_{v,w} = π_w P_{w,v} for all v, w ∈ V), denoting by λ_P = 1 − max_{λ∈Sp(P)\{1}} |λ| > 0 the absolute spectral gap of P, where Sp(P) is the spectrum of P, for any stochastic vector π_0 ∈ R^V:

‖π_0 P^t − π‖_π ≤ (1 − λ_P)^t ‖π_0 − π‖_π ,

where we write ‖z‖²_π = Σ_{v∈V} π_v z_v² for z ∈ R^V, and ‖·‖ always stands for the Euclidean norm.
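The geometric decay above can be checked numerically; the short sketch below does so on a toy reversible chain (the chain and tolerance are illustrative assumptions).

```python
import numpy as np

# Numerical illustration (toy chain) of the geometric mixing bound for reversible chains:
# || pi_0 P^t - pi ||_pi <= (1 - lambda_P)^t || pi_0 - pi ||_pi.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])            # birth-death chain, hence reversible
lam_P = 1.0 - np.sort(np.abs(np.linalg.eigvals(P)))[-2]   # absolute spectral gap

w, V = np.linalg.eig(P.T)                      # stationary distribution = left eigenvector for 1
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()

pi0 = np.array([1.0, 0.0, 0.0])                # start deterministically at state 0
norm_pi = lambda z: np.sqrt(np.sum(pi * z ** 2))
for t in [1, 5, 10, 20]:
    lhs = norm_pi(pi0 @ np.linalg.matrix_power(P, t) - pi)
    rhs = (1 - lam_P) ** t * norm_pi(pi0 - pi)
    print(t, lhs <= rhs + 1e-12)               # the bound holds at every t for this chain
```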
If the chain is not reversible, there is still a linear decay, but in terms of total variation distance rather than in the norm ‖·‖_π (Chapter 4.3 of Levin et al. (2006)). In the sequel, (v_t)_{t≥0} is any irreducible aperiodic Markov chain with transition matrix P on V and stationary distribution π (not necessarily the uniform distribution on V). Furthermore, we define the graph G = (V, E) over the state space V through the relation {v, w} ∈ E ⇔ P_{v,w} > 0 for v and w two distinct states. Consequently, the Markov chain (v_t)_t can also be seen as a random walk on graph G with transition probabilities P. In the random walk decentralized optimization case, this graph coincides with the communication graph. In the sequel, for t ≥ 0 and v ∈ V, E[· | v_t = v] and P(· | v_t = v) respectively denote the expectation and probability conditioned on the event v_t = v. Similarly, for π_t a probability distribution on V, E[· | v_t ∼ π_t] and P(· | v_t ∼ π_t) refer to conditioning on the law of v_t.

Definition 2.2 (Mixing, hitting and cover times). For w ∈ V, let τ_w = inf{t ≥ 1 | v_t = w} be the time the chain reaches w (or returns to w, in the case v_0 = w). We define the following quantities.

1. Mixing time. For ε > 0, the mixing time τ_mix(ε) of (v_t) is defined as, where d_TV is the total-variation distance:

τ_mix(ε) = inf { t ≥ 1 | ∀π_0 , d_TV(P^t π_0, π) ≤ ε } ,

and we define the mixing time τ_mix as τ_mix = τ_mix(π_min/2), where π_min = min_{v∈V} π_v (this definition of the mixing time is not standard: Levin et al. (2006) define it as τ_mix(1/4), while Mao et al. (2020) define it as we do; as explained in Chapter 4.5 of Levin et al. (2006), these definitions are equivalent up to a factor ln(1/π_min)).

2. Hitting and cover times. The hitting time τ_hit and cover time τ_cov of (v_t) are defined as:

τ_hit = max_{(v,w)∈V²} E[ τ_w | v_0 = v ] ,   τ_cov = max_{v∈V} E[ max_{w∈V} τ_w | v_0 = v ] .

The mixing time is the number of steps of the Markov chain required for the distribution of the current state to be close to the stationary probability π. Starting from any arbitrary v_0, the hitting time bounds the time it takes to reach any fixed w, while the cover time bounds the number of steps required to visit all the nodes in the graph. Note that if the chain is reversible, τ_mix(ε) is closely related to λ_P through τ_mix(ε) ≤ λ_P^{-1} ln(π_min^{-1} ε^{-1}). Under reversibility assumptions, we define the relaxation time of the Markov chain as τ_rel = 1/λ_P. More generally, without reversibility, τ_mix(ε) ≤ ⌈log_2(1/ε)⌉ τ_mix(1/4). Then, as we prove in Appendix A, τ_hit always satisfies τ_hit ≤ 2 π_min^{-1} τ_mix. Finally, using Matthews' (1988) method (detailed in Chapter 11.4 of Levin et al. (2006)), we have τ_cov ≤ ln(n) τ_hit.
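The quantities of Definition 2.2 can be estimated by direct simulation on small chains; the sketch below is an illustrative estimator (not from the paper), useful to get intuition for the relation τ_hit ≤ 2 π_min^{-1} τ_mix.

```python
import numpy as np

# Illustrative estimators for Definition 2.2 on a small chain: tau_mix via the
# total-variation distance, tau_hit and tau_cov by Monte Carlo simulation of the walk.
rng = np.random.default_rng(1)

def mixing_time(P, pi, eps):
    """Smallest t such that the time-t law from every start state is within eps of pi in TV."""
    n, Pt = P.shape[0], np.eye(P.shape[0])
    for t in range(1, 10_000):
        Pt = Pt @ P
        if max(0.5 * np.abs(Pt[v] - pi).sum() for v in range(n)) <= eps:
            return t
    return None

def hit_and_cover_times(P, n_runs=500, t_max=100_000):
    """Monte Carlo estimates of tau_hit and tau_cov (maximized over the start state)."""
    n = P.shape[0]
    hit_est, cov_est = 0.0, 0.0
    for v0 in range(n):
        hits = np.zeros((n_runs, n))
        for r in range(n_runs):
            v, first_visit = v0, np.full(n, np.nan)
            for t in range(1, t_max):
                v = rng.choice(n, p=P[v])
                if np.isnan(first_visit[v]):
                    first_visit[v] = t           # tau_w = first time >= 1 that the chain is at w
                if not np.isnan(first_visit).any():
                    break
            hits[r] = first_visit
        hit_est = max(hit_est, np.nanmax(hits.mean(axis=0)))   # max_{v0,w} E[tau_w | v_0 = v0]
        cov_est = max(cov_est, hits.max(axis=1).mean())        # max_{v0} E[max_w tau_w | v_0 = v0]
    return hit_est, cov_est

# Lazy walk on a 5-cycle (uniform stationary distribution).
n = 5
P = np.zeros((n, n))
for v in range(n):
    P[v, v], P[v, (v - 1) % n], P[v, (v + 1) % n] = 0.5, 0.25, 0.25
pi = np.ones(n) / n
print(mixing_time(P, pi, eps=pi.min() / 2))    # tau_mix as defined above
print(hit_and_cover_times(P))                   # tau_hit <= 2 * tau_mix / pi_min should hold
```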
3. Contributions

In our paper, we theoretically analyze stochastic gradient methods with Markov chain sampling (such as MC-SGD in Equation (2)), and aim at deriving complexity bounds under the mildest assumptions possible. We first derive in Section 4 complexity lower bounds for such methods, identifying τ_hit as the Markov chain quantity that slows down such algorithms. We then study MC-SGD under various regularity assumptions in Section 5: we remove the bounded gradient assumption of all previous analyses, obtain rates under a µ-PL assumption, and prove a linear convergence in the interpolation regime, where noise and function dissimilarities only need to be bounded at the optimum. In the data-heterogeneous setting (functions f_v that can be arbitrarily dissimilar) and in the case where V (the state space of the Markov chain) is finite, we introduce MC-SAG in Section 6, a variance-reduced alternative to MC-SGD that is perfectly suited to decentralized optimization. Using time-adaptive stepsizes, this algorithm has a rate of convergence of τ_hit/T and thus matches that of our lower bound, up to acceleration. We discuss in Section 7 the implications of our results. In particular, we prove that random-walk based decentralization is more communication efficient than consensus-based approaches; prior to our analysis, this was only shown empirically (Mao et al., 2020; Johansson et al., 2010). Further, our results formally prove that using all gradients along the Markov chain trajectory leads to faster rates; as in the previous case, this was only empirically observed before (Sun et al., 2018). These two consequences are derived from the fact that MC-SAG depends only on τ_hit rather than the traditionally used quantity nτ_mix, which can be arbitrarily larger (Table 2).

4. Oracle complexity lower bounds under Markov chain sampling

In this section, we provide oracle complexity lower bounds for finding stationary points of the function f defined in (3), for a class of algorithms that satisfy a Markov sampling scheme. For a given Markov chain (v_t) on V, we consider algorithms verifying the following procedural constraints, for some fixed initialization M_0 = {x_0} and then for t ≥ 0:

1. At iteration t, the algorithm has access to the function f_{v_t} and may extend its memory: M_{t+1} = Span({x , ∇f_{v_t}(x) : x ∈ M_t}).

2. Output: the algorithm specifies an output value x_t ∈ M_t.

We call algorithms verifying such constraints black-box procedures with Markov sampling (v_t). Such procedures as well as the result below are inspired by the distributed black-box procedures defined in Scaman et al. (2017). We use the notation a(·) = Ω(b(·)) if there exists C > 0 such that a(·) ≥ C b(·) in the theorem below, and classically consider the limiting situation d → ∞, by assuming we are working in ℓ_2 = {(θ_k)_{k∈N} ∈ R^N : Σ_k θ_k² < ∞}.

Theorem 4.1. Assume that τ_v (see Definition 2.2) has a finite second moment for any v ∈ V. Let Δ, B > 0, L > 0 and µ > 0, and denote κ = L/µ. Let x_0 be fixed.

1. Non-convex lower bound: there exist functions (f_v)_{v∈V} such that f = Σ_{v∈V} π_v f_v is L-smooth, f(x_0) − min_x f(x) ≤ Δ, and such that for any T and any Markov black-box algorithm that outputs x_T after T steps, we have:

‖∇f(x_T)‖² = Ω( L Δ τ_hit² / T² ) .

2. Convex lower bound: there exist functions (f_v)_{v∈V} such that f = Σ_{v∈V} π_v f_v is convex and L-smooth and minimized at some x* that verifies ‖x_0 − x*‖² ≤ B², and such that for any T and any Markov black-box algorithm that outputs x_T after T steps, we have:

f(x_T) − f(x*) = Ω( L B² τ_hit² / T² ) .

3. Strongly convex lower bound: there exist functions (f_v)_{v∈V} such that f = Σ_{v∈V} π_v f_v is µ-strongly convex and L-smooth and minimized at some x* that verifies ‖x_0 − x*‖² ≤ B², and such that for any T and any Markov black-box algorithm that outputs x_T after T steps, we have:

f(x_T) − f(x*) = Ω( L B² exp( −T / (√κ τ_hit) ) ) .

A complete proof can be found in Appendix B. The hitting time of the Markov chain bounds, starting from any point in V, the mean time it takes to reach any other state in the graph. Making no other assumptions than smoothness, having rates that depend on this hitting time is thus quite intuitive.
5. Analysis of Markov-Chain SGD

We have shown in the previous section that, in order to reach an ε-stationary point with Markov sampling, the optimizer is slowed down by the hitting time of the Markov chain; this lower bound being worst-case over the functions (f_v), we here add additional similarity assumptions, which are still milder than the classical ones in this setting (Sun et al., 2018). Studying the iterates generated by (2), we obtain in this section a dependency on τ_mix, provided gradient dissimilarities are bounded (Assumptions 5.1 and 5.5). We here assume that (v_t)_{t≥0} is a Markov chain on V with invariant probability π (not necessarily the uniform measure on V). In this section, the function f studied is defined as f(·) = E_{v∼π}[f_v(·)], as in (1). Consequently, for the MC-SGD algorithm for decentralized optimization over a given graph G to minimize the averaged function over all nodes (as in (3)), π needs to be the uniform probability over V. We first derive convergence rates under smoothness assumptions, with or without a µ-PL inequality, before improving our results under strong convexity assumptions, under which we prove a linear convergence rate in the interpolation regime. We finally add local noise (due to sampling, or additive Gaussian noise to enforce privacy) in the final paragraph of this section.

5.1. Analysis under bounded gradient dissimilarities

Assumption 5.1. There exists (σ_v²)_{v∈V} such that for all v ∈ V and all x ∈ R^d, we have ‖∇f_v(x) − ∇f(x)‖² ≤ σ_v² (this assumption could be replaced by a more relaxed noise assumption of the form ‖∇f_v(x) − ∇f(x)‖² ≤ M ‖∇f(x)‖² + σ_v²), and we denote σ̄² = E_{v∼π}[σ_v²] and σ²_max = max_{v∈V} σ_v².

Assumption 5.2. Each f_v is L-smooth, f is lower bounded, and its minimum is attained at some x* ∈ R^d.

Theorem 5.3 (MC-SGD). Assume that Assumptions 5.1 and 5.2 hold, and let Δ = f(x_0) − f(x*) + σ²_max/L.

1. For a constant time-horizon dependent stepsize γ (i.e., γ is a function of T), the iterates generated by Equation (2) satisfy, for T ≥ 2 τ_mix ln(τ_mix):

E‖∇f(x̂_T)‖² = Õ( L Δ τ_mix / T + √( L Δ (σ̄² τ_mix + σ²_max) / T ) ) ,

where Õ hides logarithmic factors and x̂_T is drawn uniformly at random amongst x_0, . . . , x_{T−1}.

2. If f additionally verifies a µ-PL inequality (for any x ∈ R^d, ‖∇f(x)‖² ≥ 2µ(f(x) − f(x*))), for a constant time-horizon dependent stepsize γ, the iterates generated by Equation (2) satisfy, for T ≥ 2 τ_mix ln(τ_mix), a numerical constant c > 0, and κ = L/µ, with F_T = E[f(x_T) − f(x*)]:

F_T ≤ e^{ −cT / (κ τ_mix ln T) } Δ + Õ( τ_mix σ̄² / (µ T) ) .

Theorem 5.3 is proved in Appendix C, by enforcing a delay of order τ_mix and relying on recent analyses of delayed SGD and SGD with biased gradients. As explained in the introduction, by removing the bounded gradient assumption present in previous works (Johansson et al., 2010; Sun et al., 2018; Duchi et al., 2011) that study Markov chain SGD (in the mirror setting, or with subdifferentials), and replacing it with the much milder and classical assumption of bounded gradient dissimilarities (Karimireddy et al., 2020), we thus still manage to obtain similar rates. Further, if f verifies a µ-PL inequality (if for any x ∈ R^d, ‖∇f(x)‖² ≥ 2µ(f(x) − f(x*))), we have an almost-linear rate of convergence: this is the first rate under µ-PL or strong convexity assumptions for MC-SGD-like algorithms, which we refine further in the next subsection.
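The effect captured by Theorem 5.3 can be observed on a toy problem; the comparison below (illustrative setup and stepsize, not from the paper) contrasts MC-SGD with a slowly mixing chain against SGD with i.i.d. sampling from π on the same finite-sum objective.

```python
import numpy as np

# Toy comparison: MC-SGD with a slowly mixing chain versus i.i.d. SGD on the same objective.
rng = np.random.default_rng(0)
d, n = 5, 4
A = [rng.standard_normal((3, d)) for _ in range(n)]
b = [rng.standard_normal(3) for _ in range(n)]       # f_v(x) = 0.5 * ||A_v x - b_v||^2

def grad(v, x):
    return A[v].T @ (A[v] @ x - b[v])

def run(T, gamma, markov_p=None):
    """markov_p = probability of staying at the current state (slow mixing when close to 1);
    markov_p=None means i.i.d. uniform sampling."""
    x, v, grad_norms = np.zeros(d), 0, []
    for _ in range(T):
        x -= gamma * grad(v, x)
        full_grad = np.mean([grad(w, x) for w in range(n)], axis=0)
        grad_norms.append(np.linalg.norm(full_grad) ** 2)
        if markov_p is None:
            v = rng.integers(n)
        else:                                          # lazy uniform chain: stay w.p. p, else jump
            v = v if rng.random() < markov_p else rng.integers(n)
    return np.mean(grad_norms[-100:])

print("i.i.d. :", run(5000, 1e-2))
print("Markov :", run(5000, 1e-2, markov_p=0.99))      # slower mixing => larger stationary error
```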
5.2. Tight rates and linear convergence in the interpolation regime

We now study MC-SGD under the following assumptions, to derive faster rates that only depend on the sampling noise at the optimum. The interpolation regime (often related to overparameterization) refers to the case where there exists a model x* ∈ R^d minimizing all f_v for v ∈ V, leading to σ*² = 0 in Assumption 5.5 and to a linear convergence rate below.

Assumption 5.4. Functions f_v are L-smooth and µ-strongly convex. We denote κ = L/µ.

Assumption 5.5 (Noise at the optimum). Let x* be a minimizer of f. We assume that for some σ* ≥ 0, we have for all v ∈ V: ‖∇f_v(x*)‖² ≤ σ*².

Theorem 5.6 (Unified analysis). Under Assumptions 5.4 and 5.5, if γL < 1, the sequence generated by (2) satisfies a bound of the form

E‖x_T − x*‖² ≤ 2(1 − γµ)^T ‖x_0 − x*‖² + Σ_{0≤s<T} (1 − γµ)^{T−s} E_s ,

where each error term E_s is controlled using the ergodicity of the chain and the noise at the optimum σ*² of Assumption 5.5.

Corollary 5.8. Under the same assumptions, for a well-chosen stepsize γ > 0, the iterates generated by (2) satisfy:

E‖x_T − x*‖² ≤ 2 e^{−T/κ} ‖x_0 − x*‖² + Õ( τ_mix σ*² / (µ² T) ) .

This result is stronger than Theorem 5.3.2, since (i) noise amplitude and gradient dissimilarities only need to be bounded at the optimum, and (ii) the optimization term (the first one) is not slowed down by the mixing time. This comes at the cost of strong convexity assumptions, stronger than a µ-PL inequality for f. The τ_mix σ*²/T term cannot be removed in the general case, as the next proposition shows. Hence, since the two other terms have optimal dependency in terms of Markov-chain and noise related quantities, our analysis ends up being sharp. Corollary 5.8 together with the following proposition are an extension of Nagaraj et al. (2020), who proved similar results for MC-SGD with constant stepsize on least squares problems on Markovian data of a certain form (for linear online system identification).

Proposition 5.9. For any V (such that |V| ≥ 2) and τ > 1, there exists a Markov chain on V of relaxation time τ, functions (f_v)_{v∈V} and x_0 ∈ R^d such that, given any stepsize γ, the iterates of Equation (2) output x_T verifying, for any T > 0, ‖x_T − x_0‖² = Ω(τ σ*²/T), and the assumptions of Theorem 5.6 hold.

5.3. MC-SGD with local noise

In the two previous subsections, we analyzed SGD with Markovian sampling schemes, where the stochasticity only came from the Markov chain (v_k)_{k≥0}. We now generalize the analysis and results to SGD with both Markovian sampling and local noise, by studying the sequence:

x_{t+1} = x_t − γ_t g_t .   (4)

We now formulate the form the stochastic gradients g_t can take.

Assumption 5.10. For all v ∈ V, the function f_v satisfies f_v(x) = E[F_v(x, ξ_v)] for all x ∈ R^d, where ξ_v ∼ D_v. Furthermore, there exists a Markov chain (v_t)_{t≥0} such that for all t ≥ 0, g_t = ∇_x F_{v_t}(x_t, ξ_t), where ξ_t ∼ D_{v_t} | v_t is independent from v_0, . . . , v_{t−1} and ξ_0, . . . , ξ_{t−1}.

A direct consequence of Assumption 5.10 is that E[g_t | x_t, v_t] = ∇f_{v_t}(x_t). Two main applications of Assumption 5.10 are the following (a short sketch combining both is given after the list).

1. Local sampling. If f_v(x) = (1/m) Σ_{i=1}^m f_{v,i}(x) (agent v has m local samples), agent v may use only a batch B ⊂ [m] of its samples, leading to stochastic gradients g_t in (4) of the form: g_t = (1/|B_t|) Σ_{i∈B_t} ∇f_{v_t,i}(x_t), for random batches (B_t)_{t≥0}.

2. Differential privacy. Adding local noise (e.g., additive Gaussian random noise) enforces differential privacy under suitable assumptions. A private decentralized token algorithm is then Differentially Private MC-SGD (DP-MC-SGD), with iterates (4) where g_t satisfies

g_t = ∇f_{v_t}(x_t) + η_t ,   (5)

where (v_t) is the Markov chain (random walk performed by the token on the communication graph), and η_t ∼ N(0, σ_t² I_d) is sampled independently from the past, to enforce differential privacy.
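The sketch below builds a stochastic gradient g_t combining the two cases above (a local mini-batch at the current node plus Gaussian noise as in (5)). The local data, batch size and noise scale are placeholder assumptions, and a full privacy accounting (gradient clipping, noise calibration) is outside the scope of this sketch.

```python
import numpy as np

# Illustrative construction of the stochastic gradient g_t of Assumption 5.10:
# local mini-batch sampling at the current node plus Gaussian privacy noise as in (5).
rng = np.random.default_rng(0)

def local_stochastic_gradient(x, samples, batch_size, dp_sigma):
    """g_t = (1/|B_t|) sum_{i in B_t} grad f_{v_t,i}(x_t) + eta_t,  eta_t ~ N(0, dp_sigma^2 I)."""
    a, b = samples                                    # local least-squares data (A_v, b_v) at node v_t
    batch = rng.choice(len(b), size=batch_size, replace=False)
    grad = a[batch].T @ (a[batch] @ x - b[batch]) / batch_size
    eta = dp_sigma * rng.standard_normal(x.shape)     # privacy noise of (5); dp_sigma=0 disables it
    return grad + eta

# One MC-SGD step (4) at the token's current node v_t, with hypothetical local data.
d, m = 10, 50
A_v, b_v = rng.standard_normal((m, d)), rng.standard_normal(m)
x_t, gamma_t = np.zeros(d), 0.01
g_t = local_stochastic_gradient(x_t, (A_v, b_v), batch_size=8, dp_sigma=0.1)
x_next = x_t - gamma_t * g_t
```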
Under Assumption 5.10, a direct generalization of Theorem 5.6 and Corollary 5.8 is the following.

Theorem 5.11 (MC-SGD with local noise). Assume that Assumptions 5.10 and 5.5 hold, that each F_v(·, ξ) is µ-strongly convex and L-smooth, and that there exists σ²_local such that:

E[ ‖g_t − ∇f_{v_t}(x_t)‖² | x_t, v_t ] ≤ σ²_local .

Then, for a well-chosen γ > 0, the iterates generated by (4) satisfy:

E‖x_T − x*‖² ≤ 2 e^{−T/κ} ‖x_0 − x*‖² + Õ( (σ²_local + τ_mix σ*²) / (µ² T) ) .

Importantly, and as one would expect, the local noise is not impacted by the mixing time of the underlying random walk. While we did not pursue this direction, the same observation could easily be made under other regularity assumptions, and such a result would for instance hold under the assumptions of Theorem 5.3.1 or 5.3.2. While Differentially Private MC-SGD sounds appealing for performing decentralized and differentially private optimization, we here only provide a utility analysis, the privacy analysis being left for future work.

Algorithm 1 Markov Chain SAG (MC-SAG)
1: Input: x_0 ∈ R^d, h_v ∈ R^d for v ∈ V and h_0 ∈ R^d, stepsizes γ_t > 0, v_0 ∈ V
2: for t = 0, 1, . . . do
3:   Compute ∇f_{v_t}(x_t)
4:   h_{t+1} = h_t + (1/n)( ∇f_{v_t}(x_t) − h_{v_t} )
5:   x_{t+1} = x_t − γ_t h_{t+1}
6:   h_{v_t} ← ∇f_{v_t}(x_t)
7:   Sample v_{t+1} ∼ P_{v_t,·}
8: end for

6. Analysis of Markov-Chain SAG

After providing convergence guarantees for the most natural algorithm (MC-SGD) under a Markov chain sampling on the set V, we prove that one can achieve a rate of order 1/T (rather than the 1/√T previously obtained) in the smooth setting, by applying the variance reduction technique of Schmidt et al. (2017), who first introduced the Stochastic Averaged Gradient (SAG) algorithm, together with a time-adaptive stepsize policy described below. Our faster rate with variance reduction comes with a dependency on τ_hit instead of τ_mix; since we make no assumption other than smoothness, this is unavoidable in light of our lower bound (Theorem 4.1).

MC-SAG. The MC-SAG algorithm is described in Algorithm 1. The recursion leading to the iterate x_t can be summarized as, for stepsizes (γ_t)_{t≥0}, under the initialization h_v = ∇f_v(x_0) and h_0 = ∇f(x_0):

x_{t+1} = x_t − (γ_t/n) Σ_{v∈V} ∇f_v(x_{d_v(t)}) ,   (6)

where for v ∈ V we define d_v(t) = sup{s ≤ t | v_s = v} as the last iterate up to time t at which v was the current state of the Markov chain. By convention, if the set over which the supremum is taken is empty, we set d_v(t) = 0. We handle both the initialization described just above for h_v, h and arbitrary initializations in our analysis below. In the same way that MC-SGD reduces to vanilla SGD if (v_t) is an i.i.d. uniform sampling over V, MC-SAG boils down to the SAG algorithm (Schmidt et al., 2017) in that case, under the initialization h_v = ∇f_v(x_0) and h_0 = ∇f(x_0). In a decentralized setting, nodes keep in mind their last computed gradient (variable h_v at node v). At all times, h_t is an average of these h_v over the graph, and is, in the same way as x_t, updated along the random walk. The MC-SAG algorithm is thus perfectly adapted to decentralized optimization.

Time-adaptive stepsize policy. To obtain our convergence guarantees, a time-adaptive stepsize policy (γ_t) is used, as in Asynchronous SGD (Mishchenko et al., 2022), to obtain delay-independent guarantees. For t ≥ 0, let the stepsize γ_t be defined as:

γ_t = 1 / ( 2L ( τ_hit + max_{v∈V}(t − d_v(t)) ) ) .   (7)

Denoting τ_t = max_{v∈V}(t − d_v(t)), this quantity can be tracked during the optimization process. Indeed, if agent v_t receives τ_{t−1} together with (x_t, h_t), she may compute τ_t as: τ_t = max( τ_{t−1} + 1 , t − d_{v_t}(t) ), where t − d_{v_t}(t) is the number of iterations that took place since the last time the Markov chain state was v_t. Hence, if agents keep track of the number of iterations, the adaptive stepsize policy (7) can be used in Algorithm 1, as long as agent v_t sends (τ_t, t) to v_{t+1}.
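A compact sketch of Algorithm 1 with the adaptive stepsize (7) is given below. The quadratic local losses, the toy chain, and the value plugged in for τ_hit are illustrative assumptions; in a decentralized deployment each h_v would live at node v and (x, h, τ) would travel with the token.

```python
import numpy as np

# Sketch of Algorithm 1 (MC-SAG) with the adaptive stepsize policy (7).
rng = np.random.default_rng(0)
n, d = 8, 3
targets = rng.standard_normal((n, d))                  # f_v(x) = 0.5 * ||x - target_v||^2, so L = 1
grad_f = lambda v, x: x - targets[v]
P = np.full((n, n), 1.0 / n)                            # toy chain: i.i.d. uniform jumps

L, tau_hit = 1.0, float(n)                              # for this chain the hitting time is n
x = np.zeros(d)
h_nodes = np.array([grad_f(v, x) for v in range(n)])    # h_v = grad f_v(x_0)
h = h_nodes.mean(axis=0)                                # h_0 = grad f(x_0)
last_visit = np.zeros(n, dtype=int)                     # d_v(t), with the convention d_v(t) = 0
v = 0
for t in range(2000):
    g = grad_f(v, x)                                    # line 3
    h = h + (g - h_nodes[v]) / n                        # line 4
    tau_t = (t - last_visit).max()                      # max_v (t - d_v(t))
    gamma_t = 1.0 / (2 * L * (tau_hit + tau_t))         # adaptive stepsize (7)
    x = x - gamma_t * h                                 # line 5
    h_nodes[v] = g                                      # line 6
    last_visit[v] = t
    v = rng.choice(n, p=P[v])                           # line 7

print(np.linalg.norm(x - targets.mean(axis=0)))         # small: x_t approaches argmin f
```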
We now present the convergence results for MC-SAG. In this section, (v_t) is assumed to be a Markov chain on V with finite hitting time τ_hit. Importantly, the next theorem does not require any additional assumption on (v_t) such as reversibility, or even that its stationary probability is the uniform distribution: the non-symmetric but easily implementable transition probabilities P_{v,w} = 1/(d_v + 1) for w = v or w ∼ v can be used here, as well as non-reversible random walks that can have much smaller mixing and hitting times. The function f studied here is independent of the Markov chain, and is defined as in (3), the uniformly averaged function over all states v ∈ V (or over all agents in the network).

Theorem 6.1 (MC-SAG). Assume that Assumption 5.2 holds and that the Markov chain has a finite hitting time (for an arbitrary invariant probability).

1. Under the initialization h_v = ∇f_v(x_0), h = ∇f(x_0), using the adaptive stepsize policy defined in Equation (7), the sequence generated by Algorithm 1 satisfies, for any T > 0:

E[ min_{0≤t<T} ‖∇f(x_t)‖² ] = Õ( L ( f(x_0) − inf f ) τ_hit / T ) .

Lemma A.1. For any ε > 0, if the chain is reversible, τ_mix(ε) ≤ λ_P^{-1} ln(ε^{-1} π_min^{-1}), so that τ_mix ≤ ⌈ λ_P^{-1} ln(2 π_min^{-2}) ⌉.

Proof. We have:

d_TV(P^t π_0, π) = (1/2) Σ_{w∈V} |(P^t π_0)_w − π_w| ≤ (1/(2π_min)) Σ_{w∈V} π_w |(P^t π_0)_w − π_w| ≤ (1/(2π_min)) ‖P^t π_0 − π‖_π ≤ (1/(2π_min)) (1 − λ_P)^t ‖π_0 − π‖_π ,

so that d_TV(P^t π_0, π) ≤ ε for t ≥ λ_P^{-1} ln( π_min^{-1} ε^{-1} / 2 ).

Lemma A.2 (Mixing times and hitting times). τ_hit ≤ 2 π_min^{-1} τ_mix, so that if π is the uniform distribution over V, τ_hit ≤ 2 n τ_mix.

Proof. For any v, w ∈ V,

E[τ_w | v_0 = v] = Σ_{k≥1} P(τ_w ≥ k | v_0 = v) ≤ τ_mix Σ_{ℓ≥0} P(τ_w > ℓτ_mix | v_0 = v) .

Then, for ℓ ≥ 0,

P(τ_w > (ℓ+1)τ_mix | v_0 = v) = P(τ_w > (ℓ+1)τ_mix | τ_w > ℓτ_mix, v_0 = v) P(τ_w > ℓτ_mix | v_0 = v) ,

and, conditioning on v_{ℓτ_mix},

P(τ_w > (ℓ+1)τ_mix | τ_w > ℓτ_mix, v_{ℓτ_mix}) ≤ P( v_{(ℓ+1)τ_mix} ≠ w | v_{ℓτ_mix} ) .

By definition of τ_mix, we have P( v_{(ℓ+1)τ_mix} = w | v_{ℓτ_mix} ) ≥ π_w/2, so that:

P(τ_w > (ℓ+1)τ_mix | v_0 = v) ≤ (1 − π_w/2) P(τ_w > ℓτ_mix | v_0 = v) ,

and P(τ_w > ℓτ_mix | v_0 = v) ≤ (1 − π_w/2)^ℓ by recursion. Finally,

E[τ_w | v_0 = v] ≤ τ_mix Σ_{ℓ≥0} (1 − π_w/2)^ℓ = 2 τ_mix / π_w ≤ 2 τ_mix / π_min ,

concluding the proof by taking the maximum over w.

A.2. Matthews bound for cover times

The following result bounds the cover time of the Markov chain: it is in fact closely related to its hitting time, and the two differ by at most a factor ln(n). This surprising result is proved in a very elegant way in the survey of Levin et al. (2006), using the famous Matthews method (Matthews, 1988).

Theorem A.3 (Matthews bound for cover times). The hitting and cover times of the Markov chain verify: τ_cov ≤ ln(n) τ_hit.

A.3. A bound on the hitting time of regular and symmetric graphs

Using results from Rao (2012), we relate the hitting time of symmetric regular graphs (in a sense defined below) to well-known graph-related quantities: the number of edges |E|, the diameter δ and the degree d.

Lemma A.4 (Bounding hitting times of regular graphs). Let (v_t) be the simple random walk on a d-regular graph G of diameter δ that satisfies the following symmetry property: for any {u, v}, {v, w} ∈ E, there exists a graph automorphism that maps v to w. Then, we have: τ_hit ≤ 2 δ |E| / d.

Proof. Using Theorem 2.1 of Rao (2012), for {v, w} ∈ E, we have E[τ_w | v_0 = v] ≤ 2|E|/d, where |E| is the number of edges in the graph. Let v and w in V, at distance δ' ≤ δ. There exist nodes v = v(0), v(1), . . .
, v(δ'−1), v(δ') = w such that for all 0 ≤ s < δ', {v(s), v(s+1)} ∈ E, and by using the Markov property:

E[τ_w | v_0 = v] ≤ Σ_{s<δ'} E[ τ_{v(s+1)} | v_0 = v(s) ] ≤ 2 δ |E| / d .

A.4. Two miscellaneous lemmas

We finally end this preliminary results section with the following two lemmas, which help us conclude the proof of Theorem 6.1. The first lemma leads to a bound on Σ_t max_{v∈V}(t − d_v(t)).

Lemma A.5. For v ∈ V and t ≥ 0, let p_v(t) = inf{s ≥ t | v_s = v} and d_v(t) = sup{s < t | v_s = v} be the next and the last previous iterates for which v_s = v (d_v(t) = 0 by convention, if v has not yet been visited). Assume that (v_t) has stationary distribution π. For t ≥ 0, let A_t = sup_{v∈V}(t − d_v(t)) and B_t = sup_{v∈V}(p_v(t) − t). We have:

E[B_t | v_t = v] ≤ τ_cov ,  for all v ∈ V ,

and, for T ≥ 1, a corresponding bound of order τ_cov T on Σ_{0≤t<T} E[A_t].

Consequently, taking the mean, we obtain E[H_T] ≤ E[H_{T−1}].

B. Lower bound

We prove the smooth non-convex version of Theorem 4.1; the convex cases are proved in a similar way using exactly the same arguments, and the "most difficult function in the world", as defined by Nesterov (2014), rather than the one used by Carmon et al. (2021), albeit the two are closely related.

Proof of Theorem 4.1. For x ∈ ℓ_2 and k ∈ N, denote by x(k) its kth coordinate. We split the function defined in Section 3.2 of Carmon et al. (2021) (inspired by the "most difficult function in the world" of Nesterov (2014)) between two nodes v, w ∈ V maximizing E[τ_w | v_0 = v], by letting π_v f_v contain the quadratic couplings between coordinates x(2k−1) and x(2k) (for k ≥ 1) together with the terms (α/2) x(0)² − b x(0), and letting π_w f_w contain the quadratic couplings between coordinates x(2k) and x(2k+1) (for k ≥ 0), for some b, α > 0. Then, we define T_0 = τ_v and, for k ≥ 0,

T_{2k+1} = inf{ t ≥ T_{2k} | v_t = w }  and  T_{2k+2} = inf{ t ≥ T_{2k+1} | v_t = v } .

The second step of the proof is somewhat classical, and consists in observing that the black-box constraints of the algorithm together with the construction of the functions f_v and f_w defined in the proof sketch of Section 4 imply that:

if v_t = v and M_t ⊂ Span(e_i, i ≤ 2k−1), then M_{t+1} ⊂ Span(e_i, i ≤ 2k); if M_t ⊂ Span(e_i, i ≤ 2k), then M_{t+1} ⊂ Span(e_i, i ≤ 2k);

if v_t = w and M_t ⊂ Span(e_i, i ≤ 2k), then M_{t+1} ⊂ Span(e_i, i ≤ 2k+1); if M_t ⊂ Span(e_i, i ≤ 2k+1), then M_{t+1} ⊂ Span(e_i, i ≤ 2k+1);

if v_t ∉ {v, w}, then M_{t+1} = M_t.

In other words, even dimensions are discovered by node v, while odd ones are discovered by node w. The dimension R e_0 is discovered by node v thanks to the linear term b x(0). Using Theorem 1 of Carmon et al. (2021), for a right choice of parameters α, b > 0, f is L-smooth and satisfies f(x_0) − inf_x f(x) ≤ Δ, together with, for any k and any x ∈ M_t ⊂ Span(e_i, i ≤ 2k), a lower bound on ‖∇f(x)‖² of order L Δ / k². This lower bound proof technique is explained in a detailed and enlightening fashion in Chapter 3.5 of Bubeck (2015). Then, the final and more technical step of the proof consists in upper bounding E[k(t)]. If (T_{k+1} − T_k)_{k≥0} were independent from k(t), using E[T_{k+1} − T_k] = τ_hit for k even, we would directly obtain t ≥ E[T_{k(t)}] ≥ (E[k(t)] − 1) τ_hit / 2. However, these random variables are not independent: since tail effects can happen, we need a finite second moment for hitting times, and the proof is a bit trickier. First, note that:

E[k(t)] = Σ_{0≤k≤t} P(k(t) ≥ k) = Σ_{0≤k≤t} P(T_k ≤ t) .

Let (X_ℓ)_{ℓ≥0} be i.i.d. random variables with the same law as τ_w conditioned on v_0 = v. We have E[X_ℓ] = E[τ_w | v_0 = v] = τ_hit, and var(X_ℓ) < ∞ (by assumption). Let S_k = Σ_{ℓ=0}^{k−1} X_ℓ (S_k has the same law as Σ_{ℓ=0}^{k−1} (T_{2ℓ+1} − T_{2ℓ})), so that, using the Markov property of (v_t), T_k stochastically dominates S_{⌊k/2⌋}. Hence, P(T_k ≤ t) ≤ P(S_{⌊k/2⌋} ≤ t). Then, using Chebyshev's inequality, for any ℓ ≥ 0 and for t such that ℓτ_hit ≥ t, we have:

P(S_ℓ ≤ t) = P(S_ℓ − ℓτ_hit ≤ t − ℓτ_hit) ≤ E[(S_ℓ − ℓτ_hit)²] / (t − ℓτ_hit)² = ℓ var(X_0) / (t − ℓτ_hit)² .
We then have:

E[k(t)] ≤ 2 Σ_{0≤ℓ≤t/2} P(S_ℓ ≤ t) ≤ 2 Σ_{0≤ℓ≤2t/τ_hit} P(S_ℓ ≤ t) + 2 Σ_{2t/τ_hit ≤ ℓ ≤ t/2} P(S_ℓ ≤ t) ≤ 4t/τ_hit + 2 Σ_{2t/τ_hit ≤ ℓ ≤ t/2} ℓ var(X_0) / (t − ℓτ_hit)² .

We finally show that the second term stays bounded:

Σ_{2t/τ_hit ≤ ℓ ≤ t/2} ℓ / (t − ℓτ_hit)² = (1/τ_hit²) Σ_{0 ≤ ℓ ≤ t/2 − 2t/τ_hit} (ℓ + 2t/τ_hit) / (ℓ + t/τ_hit)² = (1/τ_hit²) Σ_{0 ≤ ℓ ≤ t/2 − 2t/τ_hit} ℓ / (ℓ + t/τ_hit)² + (2/τ_hit²) Σ_{0 ≤ ℓ ≤ t/2 − 2t/τ_hit} (t/τ_hit) / (ℓ + t/τ_hit)² .

First, using a comparison with a continuous integral, we have:

Σ_ℓ ℓ / (ℓ + t/τ_hit)² ≤ Σ_ℓ 1 / (ℓ + t/τ_hit) ≤ ln( t / (t/τ_hit) ) = ln(τ_hit) ,

since for a, x > 0, ∫_0^{ax} y dy/(y+a)² ≤ ∫_0^{ax} dy/(y+a) = ln(1 + x). Finally, using Σ_{ℓ≥1} 1/(a+ℓ)² ≤ ∫_0^∞ dy/(y+a)² = 1/a, we bound the second sum as:

Σ_{0 ≤ ℓ ≤ t/2 − 2t/τ_hit} (t/τ_hit) / (ℓ + t/τ_hit)² ≤ 1 + τ_hit/t .

Wrapping our arguments together, we end up with:

E[k(t)] ≤ 4t/τ_hit + (2 var(X_0)/τ_hit²) ( ln(τ_hit) + 1 + τ_hit/t ) .

For t big enough, we end up with E[k(t)] ≤ 5t/τ_hit, so that, since E‖∇f(x_t)‖² ≥ L Δ / (16 E[k(t)]²) as explained in the main text, we have:

‖∇f(x_t)‖² = Ω( L Δ τ_hit² / t² ) .

C. Markov chain stochastic gradient descent: proof of Theorem 5.3

We start by proving the following bound on E‖∇f_{v_t}(x_t)‖². Note that this bound can be used for any t ≥ τ_mix.

Lemma C.1. For t ≥ 0 and if v_t ∼ π_t with d_TV(π_t, π) ≤ π_min/2, we have:

E‖∇f_{v_t}(x_t)‖² ≤ 3 σ̄² + 2 E‖∇f(x_t)‖² .

Proof of the Lemma. We have for any v ∈ V that P(v_t = v) ≤ π_v + π_v/2 = 3π_v/2, so that

E‖∇f_{v_t}(x_t)‖² ≤ 2 E‖∇f_{v_t}(x_t) − ∇f(x_t)‖² + 2 E‖∇f(x_t)‖² ≤ 2 Σ_{v∈V} P(v_t = v) σ_v² + 2 E‖∇f(x_t)‖² ≤ 3 σ̄² + 2 E‖∇f(x_t)‖² .

The proof borrows ideas from both the analyses of delayed SGD (Mania et al., 2017) and SGD with biased gradients (Even et al., 2022), thus refining the initial MC-SGD analysis (Johansson et al., 2010). While a biased-gradient analysis would not yield convergence to an ε-stationary point for arbitrary ε (at every iteration, biases are non-negligible and can be arbitrarily large), by enforcing a delay τ (of order τ_mix) in the analysis, we manage to take advantage of the ergodicity of the biases.

C.1. Smooth non-convex case of Theorem 5.3

Proof of Theorem 5.3.1. Denoting F_t = E[f(x_t)] − f(x*), we have using smoothness:

F_{t+1} ≤ F_t − γ E[⟨∇f_{v_t}(x_t), ∇f(x_t)⟩] + (γ²L/2) E‖∇f_{v_t}(x_t)‖² .

For the first term on the right-hand side of the inequality, assuming that t ≥ τ for some τ > 0 that we make explicit later in the proof:

E[−γ⟨∇f_{v_t}(x_t), ∇f(x_t)⟩] = E[−γ⟨∇f_{v_t}(x_{t−τ}), ∇f(x_{t−τ})⟩] + E[−γ⟨∇f_{v_t}(x_t), ∇f(x_t) − ∇f(x_{t−τ})⟩] + E[−γ⟨∇f_{v_t}(x_t) − ∇f_{v_t}(x_{t−τ}), ∇f(x_{t−τ})⟩] .

First, we condition the first term on the filtration up to time t − τ:

E[−γ⟨∇f_{v_t}(x_{t−τ}), ∇f(x_{t−τ})⟩] = E[−γ⟨E_{t−τ}[∇f_{v_t}(x_{t−τ})], ∇f(x_{t−τ})⟩] ≤ (γ/2) E‖E_{t−τ}[∇f_{v_t}(x_{t−τ})] − ∇f(x_{t−τ})‖² − (γ/2) E‖∇f(x_{t−τ})‖² .

Then, for τ ≥ τ_mix(π_min ε), using the following lemma, we have, for ε < 1/2:

E[−γ⟨E_{t−τ}[∇f_{v_t}(x_{t−τ})], ∇f(x_{t−τ})⟩] ≤ −(γ/4) E‖∇f(x_{t−τ})‖² + γ ε² σ̄² .

Lemma C.2. For τ ≥ τ_mix(ε π_min) and t ≥ τ,

E‖E_{t−τ}[∇f_{v_t}(x_{t−τ})] − ∇f(x_{t−τ})‖² ≤ 2ε² E‖∇f(x_{t−τ})‖² + 2ε² σ̄² .

Proof of the Lemma. We have:

E‖E_{t−τ}[∇f_{v_t}(x_{t−τ})] − ∇f(x_{t−τ})‖² = E‖ Σ_{v∈V} ( P(v_t = v | x_{t−τ}) − π_v ) ∇f_v(x_{t−τ}) ‖² ≤ ε² Σ_{v∈V} π_v E‖∇f_v(x_{t−τ})‖² ,

where we used |P(v_t = v | x_{t−τ}) − π_v| ≤ ε π_v and convexity of the squared Euclidean norm. For that last term,

Σ_{v∈V} π_v E‖∇f_v(x_{t−τ})‖² ≤ Σ_{v∈V} 2 π_v ( E‖∇f(x_{t−τ})‖² + σ_v² ) = 2 E‖∇f(x_{t−τ})‖² + 2 σ̄² ,

concluding the proof of the Lemma.

Using gradient Lipschitzness and writing x_t − x_{t−τ} = −γ Σ_{s=max(t−τ,0)}^{t−1} ∇f_{v_s}(x_s), we have:

E[−γ⟨∇f_{v_t}(x_t), ∇f(x_t) − ∇f(x_{t−τ})⟩] ≤ γ²L E[ ‖∇f_{v_t}(x_t)‖ · ‖ Σ_{s=max(t−τ,0)}^{t−1} ∇f_{v_s}(x_s) ‖ ] ≤ (γ²L/2) ( τ E‖∇f_{v_t}(x_t)‖² + Σ_{s=max(t−τ,0)}^{t−1} E‖∇f_{v_s}(x_s)‖² ) .

Similarly,

E[−γ⟨∇f_{v_t}(x_t) − ∇f_{v_t}(x_{t−τ}), ∇f(x_{t−τ})⟩] ≤ (γ²L/2) ( τ E‖∇f(x_{t−τ})‖² + Σ_{s=max(t−τ,0)}^{t−1} E‖∇f_{v_s}(x_s)‖² ) .
Wrapping things up, we obtain, for t ≥ τ and τ ≥ τ_mix:

F_{t+1} − F_t ≤ −(γ/4) E‖∇f(x_{t−τ})‖² + γε²σ̄² + (γ²L/2) ( (τ+1) E‖∇f_{v_t}(x_t)‖² + τ E‖∇f(x_{t−τ})‖² + 2 Σ_{s=max(t−τ,0)}^{t−1} E‖∇f_{v_s}(x_s)‖² )

and, bounding each E‖∇f_{v_s}(x_s)‖² by 3σ̄² + 2E‖∇f(x_s)‖² with Lemma C.1,

F_{t+1} − F_t ≤ −(γ/4) E‖∇f(x_{t−τ})‖² + γε²σ̄² + (3τ+1)(3γ²Lσ̄²/2) + γ²L ( (τ+1) E‖∇f(x_t)‖² + τ E‖∇f(x_{t−τ})‖² + 2 Σ_{s=max(t−τ,0)}^{t−1} E‖∇f(x_s)‖² ) .

Summing these inequalities for τ ≤ t < T then leads to the rate of Theorem 5.3.1.

A stepsize γ ≤ 4pε is required, and thus, to make the first term small, T must verify T = Ω(1/(2pε)), concluding our reasoning. For s < t, we have E[ζ_s ζ_t] = 2 P(v_{t−s} = v_0 | v_0 ∼ π) − 1. Denoting z_k = P(v_k = v_0 | v_0 ∼ π), we have z_{k+1} = p z_k + (1 − p)(1 − z_k) and z_0 = 1, so that z_k = (1/2)(1 + (1 − 2p)^k) for k ≥ 0. This leads to:

Let p_v(t) = inf{s ≥ t | v_s = v} and d_v(t) = sup{s ≤ t | v_s = v} be the next and the last previous iterates for which v_s = v. Denote F_t = E[f(x_t) − f(x*)]. We have, using smoothness:

f(x_{t+1}) ≤ f(x_t) − γ_t ⟨∇f(x_t), G_t⟩ + (γ_t²L/2) ‖G_t‖² .

Together with ⟨∇f(x_t), G_t⟩ = (1/2)( ‖∇f(x_t)‖² + ‖G_t‖² − ‖∇f(x_t) − G_t‖² ), we obtain:

f(x_{t+1}) ≤ f(x_t) − (γ_t/2)( ‖∇f(x_t)‖² + ‖G_t‖² − ‖∇f(x_t) − G_t‖² ) + (γ_t²L/2) ‖G_t‖² ≤ f(x_t) − (γ_t/2) ‖∇f(x_t)‖² − (γ_t/4) ‖G_t‖² + (γ_t/2) ‖∇f(x_t) − G_t‖² ,

as long as γ_t ≤ 1/(2L). We thus need to upper bound the bias ‖∇f(x_t) − G_t‖². We have:

‖∇f(x_t) − G_t‖² = ‖ (1/n) Σ_{v∈V} ( ∇f_v(x_{d_v(t)}) − ∇f_v(x_t) ) ‖² ≤ (1/n) Σ_{v∈V} ‖∇f_v(x_{d_v(t)}) − ∇f_v(x_t)‖² .

Fix some v in V. We have ‖∇f_v(x_t) − ∇f_v(x_{d_v(t)})‖² ≤ L² ( Σ_{s=d_v(t)}^{t−1} γ_s ‖G_s‖ )², leading to:

f(x_{t+1}) ≤ f(x_t) − (γ_t/2) ‖∇f(x_t)‖² − (γ_t/4) ‖G_t‖² + (γ_t L²/(2n)) Σ_{v∈V} ( Σ_{s=d_v(t)}^{t−1} γ_s ‖G_s‖ )² .

We now proceed with the proof of Theorem 6.1.1.

Proof of Theorem 6.1.1. We begin with

f(x_{t+1}) ≤ f(x_t) − (γ_t/2) ‖∇f(x_t)‖² − (γ_t/4) ‖G_t‖² + (γ_t L²/(2n)) Σ_{v∈V} ( Σ_{s=d_v(t)}^{t−1} γ_s ‖G_s‖ )²

as a starting point. For v ∈ V,

γ_t L² ( Σ_{s=d_v(t)}^{t−1} γ_s ‖G_s‖ )² ≤ (t − d_v(t)) L² γ_t Σ_{s=d_v(t)}^{t−1} γ_s² ‖G_s‖² ≤ Σ_{s=d_v(t)}^{t−1} L γ_s² ‖G_s‖² ,

since γ_t ≤ 1/(L(t − d_v(t))). Summing for t < T, we obtain:

Σ_{t<T} (γ_t/2) ‖∇f(x_t)‖² ≤ F_0 − Σ_{t<T} (γ_t/4) ‖G_t‖² + (1/(2n)) Σ_{t<T} Σ_{v∈V} Σ_{s=d_v(t)}^{t−1} L γ_s² ‖G_s‖² ,

and Σ_{t<T} Σ_{v∈V} Σ_{s=d_v(t)}^{t−1} L γ_s² ‖G_s‖² = Σ_{s<T}