# demystifying_sgd_with_doubly_stochastic_gradients__3d8f1376.pdf

Demystifying SGD with Doubly Stochastic Gradients

Kyurae Kim 1 Joohwan Ko 2 Yi-An Ma 3 Jacob R. Gardner 1

Optimization objectives in the form of a sum of intractable expectations are rising in importance (e.g., diffusion models, variational autoencoders, and many more), a setting also known as finite sum with infinite data. For these problems, a popular strategy is to employ SGD with doubly stochastic gradients (doubly SGD): the expectations are estimated using the gradient estimator of each component, while the sum is estimated by subsampling over these estimators. Despite its popularity, little is known about the convergence properties of doubly SGD, except under strong assumptions such as bounded variance. In this work, we establish the convergence of doubly SGD with independent minibatching and random reshuffling under general conditions, which encompasses dependent component gradient estimators. In particular, for dependent estimators, our analysis allows fined-grained analysis of the effect correlations. As a result, under a per-iteration computational budget of 𝑏 𝑚, where 𝑏is the minibatch size and 𝑚is the number of Monte Carlo samples, our analysis suggests where one should invest most of the budget in general. Furthermore, we prove that random reshuffling (RR) improves the complexity dependence on the subsampling noise.

1. Introduction

Stochastic gradient descent (SGD; Robbins & Monro, 1951; Bottou, 1999; Nemirovski et al., 2009; Shalev Shwartz et al., 2011) is the de facto standard for solving large scale optimization problems of the form of finite sums

1Department of Computer and Information Sciences, University of Pennsylvania, Philadelphia, PA, U.S.A. 2KAIST, Daejeon, South Korea, Republic of 3Halıcıo glu Data Science Institute, University of California San Diego, San Diego, CA, U.S.A.. Correspondence to: Kyurae Kim <kyrkim@seas.upenn.edu>.

Proceedings of the 41𝑠𝑡International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

minimize 𝒙 𝒳 ℝ𝑑

𝑛 𝑛 𝑖=1𝑓𝑖(𝒙) } . (1)

When 𝑛is large, SGD quickly converges to low-accuracy solutions by subsampling over components 𝑓1, , 𝑓𝑛. The properties of SGD on the finite sum class have received an immense amount of interest (Bottou et al., 2018) as it includes empirical risk minimization (ERM; Vapnik, 1991).

Unfortunately, for an emerging large set of problems in machine learning, we may not have direct access to the components 𝑓1, , 𝑓𝑛. That is, each 𝑓𝑖may be defined as an intractable expectation, or an infinite sum

𝑓𝑖(𝒙) = 𝔼𝞰 𝜑𝑓𝑖(𝒙; 𝞰) , (2)

where we only have access to the noise distribution 𝜑 and the integrand 𝑓𝑖(𝒙; 𝞰), and 𝞰is a potentially continuous and unbounded source of stochasticity; a setting Zheng & Kwok (2018); Bietti & Mairal (2017) have previously called finite sum with infinite data. Such problems include the training of diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song & Ermon, 2019), variational autoencoders (Kingma & Welling, 2014; Rezende et al., 2014), solving ERM under differential privacy (Bassily et al., 2014; Song et al., 2013), and also classical problems such as variational inference (Ranganath et al., 2014; Titsias & L azaro-Gredilla, 2014; Kucukelbir et al., 2017), and variants of empirical risk minimization (Dai et al., 2014; Bietti & Mairal, 2017; Shi et al., 2021; Orvieto et al., 2023; Liu et al., 2021). In contrast to the finite sum setting where SGD has traditionally been applied, our problem takes the form of

minimize 𝒙 𝒳 ℝ𝑑

𝑛 𝑛 𝑖=1𝔼𝞰 𝜑𝑓𝑖(𝒙; 𝞰) } .

These optimization problems are typically solved using SGD with doubly stochastic gradients (doubly SGD; coined by Dai et al. 2014; Titsias & L azaro-Gredilla 2014), so-called because, in addition to subsampling over 𝑓𝑖, stochastic estimates of each component 𝑓𝑖are used.

Previous studies have relied on strong assumptions to analyze doubly stochastic gradients. For instance, Kulunchakov & Mairal (2020); Bietti & Mairal (2017); Zheng & Kwok (2018) have (i) assumed that the variance of each component estimator is bounded by a constant, which contradicts componentwise strong convexity (Nguyen et al.,

Demystifying SGD with Doubly Stochastic Gradients

2018) when 𝒳= ℝ𝑑, (ii) or that the integrand 𝑓𝑖(𝒙; 𝜼), is 𝐿-Lipschitz smooth uniformly over 𝜼. That is, for any fixed 𝜼and 𝑖,

𝑓𝑖(𝒙; 𝜼) 𝑓𝑖(𝒚; 𝜼) 𝐿 𝒙 𝒚 2 2 holds for all (𝒙, 𝒚) 𝒳2. Unfortunately, this only holds for additive noise and is otherwise unrealizable when 𝞰 has an unbounded support. Therefore, analyses relying on uniform smoothness obscure a lot of interesting behavior. Meanwhile, weaker assumptions such as expected smoothness (ES; Moulines & Bach, 2011; Gower et al., 2021b) have shown to be realizable even for complex gradient estimators (Domke, 2019; Kim et al., 2023). Therefore, a key question is how these ES-type assumptions propagate to doubly stochastic estimators. Among these, we focus on the expected residual (ER; Gower et al., 2019) condition.

Furthermore, in practice, certain applications of doubly SGD share the randomness 𝞰across the batch 𝘉. (See Section 2.2 for examples.) This introduces dependence between the gradient estimate for each component such that 𝑓𝑖(𝒙; 𝞰) 𝑓𝑗(𝒙; 𝞰) for 𝑖, 𝑗 𝘉. Little is known about the effect of this practice apart from some empirical results (Kingma et al., 2015). For instance, when 𝑚Monte Carlo samples of 𝞰and a minibatch of size 𝑏are used, what is the trade-off between 𝑚and 𝑏? To answer this question, we provide a theoretical analysis of doubly SGD that encompasses dependent gradient estimators.

Technical Contributions

Theorem 1: For doubly stochastic estimators, we establish a general variance bound of the form of

𝑛 𝑛 𝑖=1 𝜎2 𝑖 𝑚𝑏 + 𝜌

𝑛 𝑛 𝑖=1 𝜎𝑖 )2

where 𝜎2 𝑖is the variance of the estimator of 𝑓𝑖, 𝜌 [0, 1] is the correlation between the estimators, and 𝜏2 is the variance of subsampling.

Theorems 2 and 3: Using the general variance bound, we show that a doubly stochastic estimator subsampling over correlated estimators satisfying the ER condition and the bounded variance (BV; Definition 2; bounded only on the solution set) condition equally satisfies the ER and BV conditions as well. This is sufficient to guarantee the convergence of doubly SGD on convex, quasar convex, and nonconvex smooth objectives.

Theorem 5: Under similar assumptions, we also prove the convergence of doubly SGD with random reshuffling (doubly SGD-RR), instead of independent subsampling, on a strongly convex objective with strongly convex components.

Practical Insights

Should I invest in (increase) 𝑚or 𝑏? When dependent gradient estimators are used, increasing 𝑚or 𝑏 does not have the same impact on the gradient variance as the subsampling strategy also affects the resulting correlation between the estimators. Through Lemma 9, our analysis provides insight into this effect. In particular, we reveal that reducing subsampling variance also reduces Monte Carlo variances. Therefore, for a fixed budget 𝑚 𝑏, increasing 𝑏 should always be preferred over increasing 𝑚.

Random Reshuffling Improves Complexity. Our analysis of doubly SGD-RR reveals that, for strongly convex objectives, random reshuffling improves the iteration complexity of doubly SGD from

𝜖𝜎2 sub ) to 𝒪( 1

𝜖𝜎sub). Fur-

thermore, for dependent gradient estimators, doubly SGD-RR is super-efficient : for a batch taking Θ(𝑚𝑏) samples to compute, it achieves a 𝑛 𝑏 tighter asymptotic sample complexity compared to full-batch SGD.

2. Preliminaries Notation We denote random variables (RVs) in serif (e.g., 𝘹, 𝙭, 𝙓, 𝘉), vectors and matrices in bold (e.g., 𝒙, 𝙭, 𝑨, 𝘼). For a vector 𝒙, we denote the 𝓁2-norm as 𝒙 2

𝒙 𝒙, where 𝒙, 𝒙 = 𝒙 𝒙is the inner product. Lastly, 𝘟 𝘠denotes independence of 𝘟and 𝘠.

Table 1. Nomenclature Symb. Description Ref.

𝐹(𝒙) Objective function Eq. (1) 𝑓𝑖(𝒙) 𝑖th component of 𝐹 Eq. (1) 𝑓𝘉(𝒙) Minibatch subsampling estimator of 𝐹Eq. (4)

𝘉 Minibatch of component indices Eq. (3) 𝜋 Minibatch subsampling strategy Eq. (3) 𝑏eff Effective sample size of 𝜋 Eq. (5) 𝙜𝑖(𝒙) Unbiased stochastic estimator of 𝑓𝑖 Eq. (7) 𝒈𝑖(𝒙; 𝞰) Integrand of estimator 𝙜𝑖(𝒙) Eq. (7) 𝙜𝘉(𝒙) Doubly stochastic estimator of 𝐹 Eq. (8) ℒsub ER constant (Definition 1) of 𝜋 Assu. 5 ℒ𝑖 ER constant (Definition 1) of 𝙜𝑖 Assu. 6 𝜏2 BV constant (Definition 2) of 𝜋 Assu. 7 𝜎2 𝑖 BV constant (Definition 2) of 𝙜𝑖 Assu. 7

2.1. Stochastic Gradient Descent on Finite-Sums

Stochastic gradient descent (SGD) is an optimization algorithm that repeats the steps

𝒙𝑡+1 = Π𝒳(𝒙𝑡 𝛾𝑡𝙜(𝒙𝑡)) ,

where, Π𝒳is a projection operator onto 𝒳, (𝛾𝑡)𝑇 1 𝑖=0 is some stepsize schedule, 𝙜(𝒙) is an unbiased estimate of 𝐹(𝒙).

Demystifying SGD with Doubly Stochastic Gradients

Finite-Sum Problems. When the objective can be represented as a finite sum it is typical to approximate the gradients of the objective as

𝑛 𝑛 𝑖=1 𝑓𝑖(𝒙) = 𝔼𝘉 𝜋[1

𝑏 𝑖 𝘉 𝑓𝑖(𝒙)] , (3)

where 𝘉 𝜋is an index set of cardinality |𝘉| = 𝑏, or minibatch, formed by subsampling over the datapoint indices {1, , 𝑛}. More formally, we are approximating 𝐹 using the (minibatch) subsampling estimator

𝑖 𝘉 𝑓𝑖(𝒙) , (4)

where the performance of this estimator, or equivalently, of the subsampling strategy 𝜋, can be quantified by its variance tr𝕍[ 𝑓𝘉(𝒙)] = 1 𝑏eff

𝑛 𝑛 𝑖=1 𝑓𝑖(𝒙) 𝐹(𝒙) 2 2, (unit) subsampling variance

where we say 𝑏eff is the effective sample size of 𝜋. For instance, independent subsampling achieves 𝑏eff = 𝑏, and sampling without replacement, also known as 𝑏-nice sampling (Gower et al., 2019; Richt arik & Tak aˇc, 2016; Csiba & Richt arik, 2018), achieves 𝑏eff = (𝑛 1)𝑏 𝑛 𝑏(Lemma 2).

2.2. Doubly Stochastic Gradients

For problems where the components are defined as intractable expectations as in Eq. (2), we have to rely on an additional Monte Carlo approximation step such as

𝑖=1 𝑓𝑖(𝒙) = 𝔼𝘉 𝜋[1

𝑖 𝘉 𝔼𝞰 𝜑[ 𝑓𝑖(𝒙; 𝞰)]]

= 𝔼𝘉 𝜋, 𝞰𝑗 𝜑

𝑗=1 𝑓𝑖 (𝒙; 𝞰𝑗 ) , (6)

where 𝞰𝑗 𝜑are 𝑚independently and identically distributed (i.i.d.) Monte Carlo samples from 𝜑.

Doubly Stochastic Gradient Consider an unbiased estimator of the component gradient 𝑓𝑖such that

𝔼𝙜𝑖(𝒙) = 𝔼𝞰 𝜑𝒈𝑖(𝒙; 𝞰) = 𝑓𝑖(𝒙) , (7)

where 𝒈𝑖(𝒙; 𝞰) is the measurable integrand. Using these, we can estimate 𝐹through the doubly stochastic gradient estimator

𝑖 𝘉 𝙜𝑖(𝒙) , (8)

We separately define the integrand 𝒈(𝒙; 𝜼) since, in practice, a variety of unbiased estimators of 𝑓𝑖can be obtained by appropriately defining the integrand 𝒈𝑖. For example, one can form the 𝑚-sample naive Monte Carlo estimator by setting

𝒈𝑖(𝒙; 𝞰) = 1

𝑚 𝑚 𝑗=1 𝑓𝑖 (𝒙; 𝞰𝑗 ) ,

where 𝞰= [𝞰1, , 𝞰𝑚] 𝜑 𝑚.

Dependent Component Gradient Estimators. Notice that, in Eq. (6), the subcomponents in the batch share the Monte Carlo samples, which may occur in practice. This means 𝙜𝑖and 𝙜𝑗in the same batch are dependent and, in the worst case, positively correlated, which complicates the analysis. While it is possible to make the estimators independent by sampling 𝑚unique Monte Carlo samples for each component (𝑚𝑏Monte Carlo samples in total) as highlighted by Kingma et al. (2015), it is common to use dependent estimators for various practical reasons:

1. ERM with Randomized Smoothing: In the ERM context, recent works have studied the generalization benefits of randomly perturbing the model weights before computing the gradient (Orvieto et al., 2023; Liu et al., 2021). When subsampling is used, perturbing the weights independently for each datapoint is computationally inefficient. Therefore, the perturbation is shared across the batch, creating dependence.

2. Black-Box Variational inference (Titsias & L azaro Gredilla, 2014; Kucukelbir et al., 2017): Here, each component can be decomposed as

𝑓𝑖(𝒙; 𝞰) = 𝓁𝑖(𝒙; 𝞰) + 𝑟(𝒙; 𝞰) ,

where 𝓁𝑖is the log likelihood and 𝑟is the log-density of the prior. By sharing (𝞰𝑗)𝑚 𝑗=1, 𝑟only needs to be evaluated 𝑚times. To create independent estimators, it needs to be evaluated 𝑚𝑏times instead, but 𝑟can be expensive to compute.

3. Random feature kernel regression with doubly SGD (Dai et al., 2014): The features are shared across the batch 1. This reduces the peak memory requirement from 𝑏𝑚𝑑𝜼, where 𝑑𝜼is the size of the random features, to 𝑚𝑑𝜼.

One of the analysis goals of this work is to characterize the effect of dependence in the context of SGD.

2.3. Technical Assumptions on Gradient Estimators

To establish convergence of SGD, contemporary analyses use the variance transfer strategy (Moulines & Bach, 2011; Johnson & Zhang, 2013; Nguyen et al., 2018; Gower et al., 2019; 2021b). That is, by assuming the gradient noise satisfies some condition resembling smoothness, it is possible to bound the gradient noise on some arbitrary point 𝒙by the gradient variance on the solution set 𝒙 arg min𝒙 𝒳𝐹(𝒙).

ER Condition. In this work, we will use the expected residual (ER) condition by Gower et al. (2021a):

1See the implementation at https://github.com/ zixu1986/Doubly_Stochastic_Gradients

Demystifying SGD with Doubly Stochastic Gradients

Definition 1 (Expected Residual; ER). A gradient estimator 𝙜of 𝐹 𝒳 ℝis said to satisfy ER (ℒ) if

tr𝕍[𝙜(𝒙) 𝙜(𝒙 )] 2ℒ(𝐹(𝒙) 𝐹(𝒙 )) ,

for some 0 < ℒ< and all 𝒙 𝒳and all 𝒙 arg min𝒙 𝒳𝐹(𝒙).

When 𝑓is convex, a weaker form can be used: We will also consider the convex variant of the ER condition that uses the Bregman divergence defined as

D𝜙(𝒚, 𝒙) 𝜙(𝒚) 𝜙(𝒙) 𝜙(𝒙) , 𝒚 𝒙 ,

(𝒙, 𝒚) 𝒳2, where 𝜙 𝒳 ℝis a convex function.

Why the ER condition? A way to think about the ER condition is that it corresponds to the variance form equivalent of the expected smoothness (ES) condition by Gower et al. (2021b) defined as

𝔼 𝙜(𝒙) 𝙜(𝒙 ) 2 2 2ℒ(𝐹(𝒙) 𝐹(𝒙 )) , (ES)

but is slightly weaker, as shown by Gower et al. (2021a). The main advantage of the ER condition is that, due to the properties of the variance, it composes more easily:

Proposition 1. Let 𝙜satisfy ER (ℒ). Then, the 𝑚-sample i.i.d. average of 𝙜satisfy ER (ℒ 𝑚).

BV Condition. From the ER property, the gradient variance on any point 𝒙 𝒳can be bounded by the variance on the solution set as long as the following holds:

Definition 2 (Bounded Gradient Variance). A gradient estimator 𝙜of 𝐹 𝒳 ℝsatisfies BV (𝜎2) if

tr𝕍[𝙜(𝒙 )] 𝜎2

for some 𝜎2 < and all 𝒙 arg max𝒙 𝒳𝐹(𝒙).

2.4. Convergence Guarantees for SGD

Sufficiency of ER and BV. From the ER and BV conditions, other popular conditions such as ES (Gower et al., 2021b) and ABC (Khaled & Richt arik, 2023) can be established with minimal additional assumptions. As a result, we retrieve the previous convergence results on SGD established for various objective function classes:

strongly convex (Gower et al., 2019),

quasar convex (+PL) (Gower et al., 2021a),

smooth (+PL) (Khaled & Richt arik, 2023).

(Note: quasar convexity is strictly weaker than convexity Guminov et al., 2023; PL: Polyak-Łojasiewicz.) (See also the comprehensive treatment by Garrigos & Gower, 2023.) Therefore, ER and BV are sufficient conditions for SGD to converge on problem classes typically considered in SGD convergence analysis.

In this work, we will specifically focus on smooth and strongly convex objectives:

Assumption 1. There exists some 𝜇, 𝐿satisfying 0 < 𝜇 𝐿< suc that the objective function 𝐹 𝒳 ℝ is 𝜇-strongly convex and 𝐿-smooth as

𝐹(𝒚) 𝐹(𝒙) 𝐹(𝒙) , 𝒚 𝒙 + 𝜇

𝐹(𝒚) 𝐹(𝒙) 𝐹(𝒙) , 𝒚 𝒙 + 𝐿

2 𝒙 𝒚 2 2 hold for all (𝒙, 𝒚) 𝒳2.

Also, we will occasionally assume that 𝐹is comprised of a finite sum of convex and smooth components:

Assumption 2. The objective function 𝐹 𝒳 ℝis a finite sum as 𝐹= 1

𝑛(𝑓1 + + 𝑓𝑛), where each component is 𝐿𝑖-smooth and convex such that

𝑓𝑖(𝒙) 𝑓𝑖(𝒚) 2 2 2𝐿𝑖D𝑓𝑖(𝒙, 𝒚)

holds for all (𝒙, 𝒚) 𝒳2.

Note that Assumption 2 alone already implies that 𝐹is convex and 𝐿max-smooth with 𝐿max = max {𝐿1, , 𝐿𝑛}.

Why focus on strongly convex functions? We focus on strongly convex objectives as the effect of stochasticity is the most detrimental: in the deterministic setting, one only needs 𝒪(log (1 𝜖)) iterations to achieve an 𝜖-accurate solution. But with SGD, one actually needs 𝒪(1 𝜖) iterations due to noise. As such, we can observe a clear contrast between the effect of optimization and noise in this setting.

With that said, for completeness, we provide full proof of convergence on strongly convex-smooth objectives:

Lemma 1. Let the objective 𝐹satisfy Assumption 1 and the gradient estimator 𝙜satisfy ER (ℒ) and BV (𝜎2). Then, the last iterate of SGD is 𝜖-close to the global optimum 𝒙 = arg min𝒙 𝒳𝐹(𝒙) such that 𝔼 𝒙𝑇 𝒙 2 2 𝜖after a number of iterations at least

𝑇 2 max (𝜎2

𝜇2 1 𝜖, ℒ+ 𝐿

𝜇 ) log (2 𝒙0 𝒙 2 2 1 𝜖)

and the fixed stepsize

𝛾= min ( 𝜖𝜇

2𝜎2 , 1 2 (ℒ+ 𝐿)) .

See the full proof in page 22.

Note that our complexity guarantee is only 𝒪(1 𝜖log (1 𝜖)) due to the use of a fixed stepsize. It is also possible to establish a 𝒪(1 𝜖) guarantee using decreasing stepsize schedules proposed by Gower et al. (2019); Stich (2019). In practice, these schedules are rarely used, and the resulting complexity guarantees are less clear than with fixed stepsizes. Therefore, we will stay on fixed stepsizes.

Demystifying SGD with Doubly Stochastic Gradients

3. Main Results

3.1. Doubly Stochastic Gradients

Table 2. Rosetta Stone 3.1.1 3.1.2

𝙭𝑖 𝙜𝑖 𝙭𝘉 𝙜𝘉 𝒙𝑖 𝑓𝑖 𝒙 𝐹

First, while taming notational complexity, we will prove a general result that holds for combining unbiased but potentially correlated estimators through subsampling. All of the later results on SGD will fall out as special cases following the correspondence in Table 2.

3.1.1. GENERAL VARIANCE BOUND

Theoretical Setup. Consider the problem of estimating the population mean 𝒙= 1

𝑛 𝑛 𝑖=1 𝒙𝑖with a collection of RVs 𝙭1, , 𝙭𝑛, each an unbiased estimator of the component 𝒙𝑖= 𝔼𝙭𝑖. Then, any subsampled ensemble

𝑖 𝘉 𝙭𝑖 with 𝘉 𝜋,

where 𝜋is an unbiased subsampling strategy with an effective sample size of 𝑏eff, is also an unbiased estimator of 𝒙. The goal is to analyze how the variance of the component estimators tr𝕍𝙭𝑖for 𝑖= 1, , 𝑛and the variance of 𝜋 affect the variance of 𝙭𝘉. The following condition characterizes the correlation between the component estimators:

Assumption 3. The component estimators 𝙭1, , 𝙭𝑛 have finite variance tr𝕍𝙭𝑖< for all 𝑖= 1, , 𝑛and, there exists some 𝜌 [0, 1] for all 𝑖 𝑗such that

tr Cov (𝙭𝑖, 𝙭𝑗 ) 𝜌

Remark 1. Assumption 3 always holds with 𝜌= 1 as a basic consequence of the Cauchy-Schwarz inequality.

Remark 2. For a collection of mutually independent estimators 𝙭1, , 𝙭𝑛such that 𝙭𝑖 𝙭𝑗for all 𝑖 𝑗, Assumption 3 holds with 𝜌= 0.

Remark 3. The equality in Assumption 3 holds with 𝜌= 0 for independent estimators, while it holds with 𝜌= 1 when they are perfectly positively correlated such that, for all 𝑖 𝑗, there exists some constant 𝛼𝑖𝑗 0 such that 𝙭𝑖= 𝛼𝑖,𝑗𝙭𝑗

Theorem 1. Let the component estimators 𝙭1, , 𝙭𝑛satisfy Assumption 3. Then, the variance of the doubly stochastic estimator 𝙭𝘉is bounded as

tr𝕍[𝙭𝘉] 𝑉com + 𝑉cor + 𝑉sub, where

𝑛 𝑛 𝑖=1 tr𝕍[𝙭𝑖] ) ,

𝑉cor = 𝜌(1 1

tr𝕍[𝙭𝑖] )2 , and

𝑛 𝑛 𝑖=1 𝒙𝑖 𝒙 2 2.

Equality holds when the equality in Assumption 3 holds.

Proof. We start from the law of total (co)variance,

tr𝕍[𝙭𝘉] = 𝔼𝜋[tr𝕍[𝙭𝘉 𝘉]] Variance of ensemble

+ tr𝕍𝜋[𝔼[𝙭𝘉 𝘉]] . Variance of subsampling

This splits the variance into the variance of the specific ensemble of 𝘉and subsampling variance. The main challenge is to relate the variance of the ensemble of 𝘉with the variance of the individual estimators in the sum

𝔼𝜋[tr𝕍[𝙭𝘉 𝘉]] = 𝔼𝜋 [ tr𝕍 [ 1

𝑏 𝑖 𝘉𝙭𝑖 ]] . (9)

Since the individual estimators may not be independent, analyzing the variance of the sum can be tricky. However, the following lemma holds generally:

Lemma 9. Let 𝙭1, , 𝙭𝑏be a collection of vector-variate RVs dependent on some random variable 𝘉satisfying Assumption 3. Then, the expected variance of the sum of 𝙭1, , 𝙭𝑏conditioned on 𝘉is bounded as

𝔼 [ tr𝕍 [ 𝑏 𝑖=1 𝙭𝑖 𝘉 ]] 𝜌𝕍[𝘚] + 𝜌(𝔼𝘚)2 + (1 𝜌) 𝔼[𝘝] ,

tr𝕍[𝙭𝑖 𝘉] and 𝘝= 𝑏 𝑖=1 tr𝕍[𝙭𝑖 𝘉] .

Equality holds when the equality in Assumption 3 holds.

Here, 𝘚is the sum of conditional standard deviations, while 𝘝is the sum of conditional variances. Notice that the variance of the variances is playing a role: if we reduce the subsampling variance, then the variance of the ensemble, 𝑉com, also decreases.

The rest of the proof, along with the proof of Lemma 9, can be found in Appendix B.3 page 23.

In Theorem 1, 𝑉com is the contribution of the variance of the component estimators, while 𝑉cor is the contribution of the correlation between component estimators , and 𝑉sub is the subsampling variance.

Monte Carlo with Subampling Without Replacement. Theorem 1 is very general: it encompasses both the correlated and uncorrelated cases and matches the constants of all of the important special cases. We will demonstrate this in the following corollary along with variance reduction by Monte Carlo averaging of 𝑚i.i.d. samples. That is, we subsample over 𝙭𝑚 1 , , 𝙭𝑚 𝑛, where each estimator is an 𝑚-sample Monte Carlo estimator:

𝑚 𝑚 𝑗=1𝙭(𝑗) 𝑖 ,

where 𝙭(1) 𝑖 , , 𝙭(𝑚) 𝑖 are i.i.d replications with mean 𝒙𝑖=

𝔼𝙭(𝑗) 𝑖 . Then, the variance of the doubly stochastic estimator

𝙭𝘉of the mean 𝒙= 1

𝑛 𝑛 𝑖=1 𝒙𝑖defined as

𝑏 𝑖 𝘉𝙭𝑚 𝑖 with 𝘉 𝜋,

can be bounded as follows:

Demystifying SGD with Doubly Stochastic Gradients

Corollary 1. For each 𝑗= 1, , 𝑚, let 𝙭(𝑗) 1 , , 𝙭(𝑗) 𝑛 satisfy Assumption 3. Then, the variance of the doubly stochastic estimator 𝙭𝑚 𝘉 of the mean 𝒙= 1

𝑛 𝑛 𝑖=1 𝒙𝑖, where 𝜋is 𝑏-minibatch sampling without replacement, satisfy the following corollaries:

(i) 𝜌= 1 and 1 < 𝑏< 𝑛:

tr𝕍[𝙭𝑚 𝘉 ] 𝑛 𝑏 (𝑛 1)𝑚𝑏

𝑛 𝑛 𝑖=1 𝜎2 𝑖 )

𝑛 𝑛 𝑖=1 𝜎𝑖 )2 + 𝑛 𝑏 (𝑛 1)𝑏𝜏2

(ii) 𝜌= 1 and 𝑏= 1:

tr𝕍[𝙭𝑚 𝘉 ] 1

𝑛 𝑛 𝑖=1 𝜎2 𝑖 ) + 𝜏2

(iii) 𝜌= 1 and 𝑏= 𝑛:

tr𝕍[𝙭𝑚 𝘉 ] 1

𝑛 𝑛 𝑖=1 𝜎𝑖 )2

(iv) 𝜎𝑖= 0 for all 𝑖= 1, , 𝑛:

tr𝕍[𝙭𝑚 𝘉 ] 𝑛 𝑏 (𝑛 1)𝑏𝜏2,

tr𝕍[𝙭𝑚 𝘉 ] 1 𝑚𝑏

𝑛 𝑛 𝑖=1 𝜎2 𝑖 ) + 𝑛 𝑏 (𝑛 1)𝑏𝜏2

where, for all 𝑖= 1, , 𝑛and any 𝑗= 1, , 𝑚,

𝜎2 𝑖= tr𝕍𝙭(𝑗) 𝑖 is invidual variance and

𝑛 𝑛 𝑖=1 𝒙𝑖 𝒙 2 2 is the subsampling variance.

Remark 4 (For dependent estimators, increasing 𝑏also reduces component variance.). Notice that, for case of 𝜌= 1, Corollary 1 (i), the term with 1

𝑛 𝑛 𝑖=1 𝜎2 𝑖is reduced in a rate of 𝒪(1 𝑚𝑏). This means reducing the subsampling noise by increasing 𝑏also reduces the noise of estimating each component. Furthermore, the first term dominates the second term as ( 1

𝑛 𝑛 𝑖=1 𝜎𝑖 )2 1

𝑛 𝑛 𝑖=1 𝜎2 𝑖,

as stated by Jensen s inequality. Therefore, despite correlations, increasing 𝑏will have a more significant effect since it reduces both dominant terms 1

𝑛 𝑛 𝑖=1 𝜎2 𝑖and 𝜏2.

Remark 5. When independent estimators are used, Corollary 1 (v) shows that increasing 𝑏reduces the full variance in a 𝒪(1 𝑏) rate, but increasing 𝑚does not.

Remark 6. Corollary 1 achieves all known endpoints in the context of SGD: For 𝑏= 𝑛(full batch), doubly SGD reduces to SGD with a Monte Carlo estimator, where there is no subsampling noise (no 𝜏2). When the Monte Carlo noise is 0, then doubly SGD reduces to SGD with a subsampling estimator (no 𝜎𝑖), retrieving the result of Gower et al. (2019).

3.1.2. GRADIENT VARIANCE CONDITIONS FOR SGD

From Theorem 1, we can establish the ER and BV conditions (Section 2.3) of the doubly stochastic gradient estimators. Following the notation in Section 2.2, we will denote the doubly stochastic gradient estimator as 𝙜𝘉, which combines the estimators 𝙜1, , 𝙜𝑛according to the subsampling strategy 𝘉 𝜋, which achieves an effective sample size of 𝑏eff. We will also use the corresponding minibatch subsampling estimator 𝑓𝘉for the analysis.

Assumption 4. For all 𝒙 𝒳, the component gradient estimators 𝙜1 (𝒙) , , 𝙜𝑛(𝒙) satisfy Assumption 3 with some 𝜌 [0, 1].

Again, this assumption is always met with 𝜌= 1 and holds with 𝜌= 0 if the estimators are independent.

Assumption 5. The subsampling estimator 𝑓𝘉satisfies the ER (ℒsub) condition in Definition 1.

This is a classical assumption used to analyze SGD on finite sums and is automatically satisfied by Assumption 2. (See Lemma 10 in Appendix B.4.3 for a proof.)

Assumption 6. For all 𝑖= 1, , 𝑛and 𝒙 𝒳and global minimizers 𝒙 arg min𝒙 𝒳𝐹(𝒙 ), the component gradient estimator 𝙜𝑖satisfies at least one of the following variants of the ER condition:

(ACVX) tr𝕍[𝙜𝑖(𝒙) 𝙜𝑖(𝒚)] 2ℒ𝑖D𝑓𝑖(𝒙, 𝒚) , where 𝑓𝑖is convex.

(AITP) tr𝕍[𝙜𝑖(𝒙) 𝙜𝑖(𝒚)] 2ℒ𝑖(𝑓𝑖(𝒙) 𝑓𝑖(𝒙 )) where 𝑓𝑖(𝒙) 𝑓𝑖(𝒙 ).

(B) tr𝕍[𝙜𝑖(𝒙) 𝙜𝑖(𝒚)] 2ℒ𝑖(𝐹(𝒙) 𝐹(𝒙 )).

Each of these assumptions holds under different assumptions and problem setups. For instance, ACVX holds only under componentwise convexity, while AITP requires majorization 𝑓𝑖(𝒙) 𝑓𝑖(𝒙 ), which is essentially assuming interpolation (Vaswani et al., 2019; Ma et al., 2018; Gower et al., 2021a) in the ERM context. Among these, (B) is the strongest since it directly relates the individual components 𝑓1, , 𝑓𝑛with the full objective 𝐹.

We now state our result establishing the ER condition:

Theorem 2. Let Assumption 4 to 6 hold. Then, we have:

(i) If (ACVX) or (AITP) hold, 𝙜𝘉satisfies ER (ℒA).

(ii) If (B) holds, 𝙜𝘉satisfies ER (ℒB).

where ℒmax = max {ℒ1, , ℒ𝑛 },

𝑏) ℒmax + 𝜌(1 1

𝑛 𝑛 𝑖=1 ℒ𝑖 ) + ℒsub

Demystifying SGD with Doubly Stochastic Gradients

𝑛 𝑛 𝑖=1 ℒ𝑖 )

ℒ𝑖 )2 + ℒsub

See the full proof in page 25.

Remark 7. Assuming the conditions in Assumption 6 hold with the same value of ℒ𝑖, the inequality ( 1

𝑛 𝑛 𝑖=1 ℒ𝑖 ℒmax,

implies ℒB ℒA.

Meanwhile, The BV condition follows by assuming equivalent conditions on each component estimator:

Assumption 7. Variance is bounded for all 𝒙 arg min𝒙 𝒳𝐹(𝒙) such that the following hold:

𝑛 𝑛 𝑖=1 𝑓𝑖(𝒙 ) 2 2 𝜏2 for some 𝜏2 < and,

2. tr𝕍[𝙜𝑖(𝒙 )] 𝜎2 𝑖for some 𝜎2 𝑖< , for all 𝑖= 1, , 𝑛.

Based on these, Theorem 1 immediately yields the result:

Theorem 3. Let Assumption 4 and 7 hold. Then, 𝙜𝘉satisfies BV (𝜎2), where

𝑛 𝑛 𝑖=1 𝜎2 𝑖 )

+ 𝜌(1 1 𝑏eff ) ( 1

𝑛 𝑛 𝑖=1 𝜎𝑖 )2 + 𝜏2

Equality in Definition 2 holds if equality in Assumption 4 holds. See the full proof in page 27.

As discussed in Section 2.4, Theorems 2 and 3 are sufficient to guarantee convergence of doubly SGD. For completeness, let us state a specific result for 𝜌= 1:

Corollary 2. Let the objective 𝐹satisfy Assumption 1 and 2, the global optimum 𝒙 = arg min𝒙 𝒳𝐹(𝒙) be a stationary point of 𝐹, the component gradient estimators 𝙜1, , 𝙜𝑛satisfy Assumption 6 (B) and 7, and 𝜋be 𝑏minibatch sampling without replacement. Then the last iterate of SGD with 𝙜𝘉is 𝜖-close to 𝒙 as 𝔼 𝒙𝑇 𝒙 2 2 𝜖after a number of iterations of at least

𝑇 2 max (𝐶var 1 𝜖, 𝐶bias) log (2 𝒙0 𝒙 2 2 1 𝜖)

for some fixed stepsize where

𝜎2 𝑖 𝜇2 ) + 2 ( 1

See the full proof in page 28.

3.2. Random Reshuffling of Stochastic Gradients

We now move to our analysis of SGD with random reshuffling (SGD-RR). In the doubly stochastic setting, this corresponds to reshuffling over stochastic estimators instead of gradients, which we will denote as doubly SGD-RR. In practice, doubly SGD-RR is often observed to converge faster than doubly SGD, even when dependent estimators are used.

3.2.1. ALGORITHM

Doubly SGD-RR The algorithm is stated as follows:

❶Reshuffle and partition the gradient estimators into minibatches of size 𝑏as 𝘗 = {𝘗1, , 𝘗𝑝}, where 𝑝= 𝑛 𝑏is the number of partitions or minibatches. ❷Perform gradient descent for 𝑖= 1, , 𝑝steps as

𝒙𝑖+1 𝑘 = Π𝒳 ( 𝒙𝑖 𝑘 𝛾𝙜𝘗𝑖(𝒙𝑖 𝑘) )

❸𝑘 𝑘+ 1 and go back to step ❶.

(We assume 𝑛is an integer multiple of 𝑏for clarity.) Here, 𝑖= 1, , 𝑝denotes the step within the epoch, 𝑘= 1, , 𝐾 denotes the epoch number.

3.2.2. PROOF SKETCH

Why SGD-RR is Faster A key aspect of random reshuffling in the finite sum setting (SGD-RR) is that it uses conditionally biased gradient estimates. Because of this, on strongly convex finite sums, Mishchenko et al. (2020) show that the Lyapunov function for random reshuffling is not

the usual 𝒙𝑖 𝑘 𝒙 2

2, but some biased Lyapunov function

2, where the reference point is

𝒙𝑖 Π𝒳 ( 𝒙 𝛾 𝑖 1 𝑗=0 𝑓𝘗𝑖(𝒙 ) ) . (10)

Under this definition, the convergence rate of SGD is not determined by the gradient variance anymore; it is determined by the squared error of the Lyapunov reference

point, 𝒙𝑖 𝒙 2 2. There are two key properties of this quantity:

The peak mean-squared error decreases at a rate of 𝛾2

with respect to the stepsize 𝛾.

The squared error is 0 at the following two endpoints: beginning of the epoch and at the end of the epoch.

For some stepsize achieving a 𝒪(1 𝑇) rate on SGD, these two properties combined result in SGD-RR attaining a 𝒪(1 𝑇2) rate at exactly the end of each epoch.

Is doubly SGD-RR as Fast as SGD-RR? Unfortunately, doubly SGD-RR does not achieve the same rate as SGDRR. Since stochastic gradients are used in addition to reshuffling, doubly SGD-RR deviates from the path that minimizes the biased Lyapunov function. Still, doubly SGD-RR does have provable benefits.

Demystifying SGD with Doubly Stochastic Gradients

20 22 24 26 28 210

gradient variance

𝑚𝑏= 1024 𝑚𝑏= 128 𝑚𝑏= 16

(a) Low Hetero. (𝑠= 1)

20 22 24 26 28 210

𝑏 (b) Mid. Hetero. (𝑠= 2)

20 22 24 26 28 210

𝑏 (c) High Hetero. (𝑠= 4)

Figure 1. Trade-off between 𝑏and 𝑚on the gradient variance tr𝕍𝙜(𝒙 ) under varying budgets 𝑚 𝑏. The problem is a finite sum of 𝑑= 10, 𝑛= 1024 isotropic quadratics with smoothness constants sampled as 𝐿𝑖 Inv-Gamma(1 2, 1 2) and stationary points sampled as 𝒙 𝑖 𝒩(𝟎𝑑, 𝑠2𝐈𝑑 ), where the gradient has additive noise of 𝞰 𝒩(𝟎𝑑, 𝐈𝑑). Larger 𝑠means more heterogeneous data.

3.2.3. COMPLEXITY ANALYSIS

We provide the general complexity guarantee for doubly SGD-RR on strongly convex objectives with 𝜇-strongly convex components and fully correlated component estimators (𝜌= 1):

Theorem 4. Let the objective 𝐹satisfy Assumption 1 and 2, where each component 𝑓𝑖is additionally 𝜇strongly convex, and Assumption 6 (ACVX), 7 hold. Then, the last iterate 𝒙𝑇of doubly SGD-RR is 𝜖-close to the global optimum 𝒙 = arg max𝒙 𝒳𝐹(𝒙) such

that 𝔼 𝒙𝑇 𝒙 2 2 𝜖after a number of iterations of at least

𝑇 max (4𝐶com var 1

𝜖+ 𝐶sub var 1

𝜖, 𝐶bias) log (2 𝒙0 1 𝒙 2

for some fixed stepsize, where 𝑇= 𝐾𝑝= 𝐾𝑛 𝑏,

𝐶bias = (ℒmax + 𝐿) 𝜇

𝐶com var = 2

𝜎2 𝑖 𝜇2 ) + 2 ( 1

See the full proof in page 34.

Remark 8. When 𝜎𝑖= 0 for all 𝑖= 1, , 𝑛, the anytime convergence bound Theorem 5 in the Appendix reduces exactly to Theorem 1 of Mishchenko et al. (2020). Therefore, Theorem 5 is a strict generalization of SGD-RR to the doubly stochastic setting.

Using 𝑚-sample Monte Carlo improves the constants as follows:

Corollary 3. Let the assumptions of Theorem 4 hold. Then, for 1 < 𝑏< 𝑛and 𝑚-sample Monte Carlo, the same guarantees hold with the constant

𝐶com var = 2 𝑚𝑏( 1

𝜎2 𝑖 𝜇2 ) + 2

Remark 9. Compared to doubly SGD, doubly SGD-RR improves the dependence on the subsampling noise 𝜏2 from 𝒪(1 𝜖) to 𝒪(1

𝜖). Therefore, random reshuffling does improve the complexity of doubly SGD. Unfortunately, it also means that it does not achieve a better asymptotic complexity as in the finite sum setting. However, nonasymptotically, if the subsampling noise dominates component estimation noise, doubly SGD-RR will behave closely to an 𝒪(1

𝜖) (or equivalently, 𝒪(1 𝑇)) algorithm.

Remark 10. As was the case with independent subsampling, increasing 𝑏also reduces component estimation noise for RR-SGD. However, the impact on the complexity is more subtle. Consider that the iteration complexity is

where 𝜅𝜎 = max𝑖=1, ,𝑛𝜎𝑖 𝜇, 𝜅𝜏 = 𝜏 𝜇and 𝜅 = max𝑖=1, ,𝑛𝐿𝑖 𝜇. The 1 𝜖term decreases the fastest with 𝑚. Therefore, it might seem that increasing 𝑚is advantageous. However, the 1

𝜖term has a 𝒪 (

𝑛 ) dependence on the dataset size, which would be non-negligible for large datasets. As a result, in the large 𝑛, large 𝜖regime, increasing 𝑏over 𝑚should be more effective.

Remark 11. Eq. (11) also implies that, for dependent estimators, doubly SGD-RR achieves an asymptotic speedup of 𝑛 𝑏compared to full-batch SGD with only component estimation noise. Assume that the sample complexity of a single estimate is Θ(𝑚𝑏) (Θ(𝑚𝑛) for full-batch). Then, the sample complexity of doubly SGD-RR is 𝒪(𝑏1 𝜖) and 𝒪(𝑛1 𝜖) for full-batch SGD. However, the 𝑛 𝑏seed-up comes from correlations. Therefore, for independent estimators, the asymptotic complexity of the two is equal.

4. Simulation

Setup We evaluate the insight on the tradeoff between 𝑏 and 𝑚for correlated estimators on a synthetic problem. In particular, we set

𝑓𝑖(𝒙; 𝞰) = 𝐿𝑖

2 𝒙 𝒙 𝑖+ 𝞰 2

Demystifying SGD with Doubly Stochastic Gradients

where the smoothness constants 𝐿𝑖 Inv-Gamma(1 2, 1 2) and the stationary points 𝒙 𝑖 𝒩(𝟎𝑑, 𝑠2𝐈𝑑) are sampled randomly, where 𝟎𝑑is a vector of 𝑑zeros and 𝐈𝑑is a 𝑑 𝑑 identity matrix. Then, we compute the gradient variance on the global optimum, corresponding to computing the BV (Definition 2) constant. Note that 𝑠2 here corresponds to the heterogeneity of the data. We make the estimators dependent by sharing 𝞰1, , 𝞰𝑚across the batch.

Results The results are shown in Fig. 1. At low heterogeneity, there exists a sweet spot between 𝑚and 𝑏. However, this sweet spot moves towards large values of 𝑏, where, at high heterogeneity levels, the largest values of 𝑏 are more favorable. Especially in the low budget regime where 𝑚𝑏 𝑛, the largest 𝑏values appear to achieve the lowest variance. This confirms our theoretical results that a large 𝑏should be preferred on challenging (large number of datapoints, high heterogeneity) problems.

5. Discussions

5.1. Applications

In Appendix C, we establish Assumption 6 and 7 on the following applications:

ERM with Randomized Smoothing: In this problem, we consider ERM, where the model weights are perturbed by noise. This variant of ERM has recently gathered interest as it is believed to improve generalization performance (Orvieto et al., 2023; Liu et al., 2021). In Appendix C.1, we establish Assumption 6 (AITP) under the interpolation assumption.

Reparameterization Gradient: In certain applications, e.g., variational inference, generative modeling, and reinforcement learning (see Mohamed et al., 2020, 5), the optimization problem is over the parameters of some distribution, which is taken expectation over. Among gradient estimators for this problem, the reparameterization gradient is widely used due to lower variance (Xu et al., 2019). For this, in Appendix C.2, we establish Assumption 6 (ACVX) and (B) by assuming a convexity and smooth integrand.

5.2. Related Works

Unlike SGD in the finite sum setting, doubly SGD has received little interest. Previously, Bietti & Mairal (2017); Zheng & Kwok (2018); Kulunchakov & Mairal (2020) have studied the convergence of variance-reduced gradients (Gower et al., 2020) specific to the doubly stochastic setting under the uniform Lipchitz integrand assumption (𝒈𝑖( ; 𝜼) is 𝐿-Lipschitz for all 𝜼). Although this assumption has often been used in the stochastic optimization literature (Nemirovski et al., 2009; Moulines & Bach, 2011; Shalev-Shwartz et al., 2009; Nguyen et al., 2018), it is easily shown to be restrictive: for some 𝐿-smooth ˆ𝑓𝑖(𝒙),

𝑓𝑖(𝒙; 𝞰) = ˆ𝑓𝑖(𝒙) + 𝑥1𝞰is not 𝐿-Lipschitz unless the support of 𝜼is compact. In contrast, we established results under weaker conditions. We also provide a discussion on the relationships of different conditions in Appendix A.

Furthermore, we extended doubly SGD to the case where random reshuffling is used in place of sampling independent batches. In the finite-sum setting, the fact that SGDRR converges faster than independent subsampling (SGD) has been empirically known for a long time (Bottou, 2009). While G urb uzbalaban et al. (2021) first demonstrated that SGD-RR can be fast for quadratics, a proof under general conditions was demonstrated recently (Haochen & Sra, 2019): In the strongly convex setting, Mishchenko et al. (2020) Ahn et al. (2020); Nguyen et al. (2021) establish a 𝒪 ( 1

𝜖 ) complexity to be 𝜖-accurate, which is tight in terms of asymptotic complexity (Safran & Shamir, 2020; Cha et al., 2023; Safran & Shamir, 2021).

Lastly, Dai et al. (2014); Xie et al. (2015); Shi et al. (2021) provided convergence guarantees for doubly SGD for ERM of random feature kernel machines. However, these analyses are based on concentration arguments that doubly SGD does not deviate too much from the optimization path of finite-sum SGD. Unfortunately, concentration arguments require stronger assumptions on the noise, and their analysis is application-specific. In contrast, we provide a general analysis under the general ER assumption.

5.3. Conclusions

In this work, we analyzed the convergence of SGD with doubly stochastic and dependent gradient estimators. In particular, we showed that if the gradient estimator of each component satisfies the ER and BV conditions, the doubly stochastic estimator also satisfies both conditions; this implies convergence of doubly SGD.

Practical Recommendations An unusual conclusion of our analysis is that when Monte Carlo is used with minibatch subsampling, it is generally more beneficial to increase the minibatch size 𝑏instead of the number of Monte Carlo samples 𝑚. That is, for both SGD and SGD-RR, increasing 𝑏decreases the variance in a rate close to 1 𝑏when (i) the gradient variance of the component gradient estimators varies greatly such that ( 1

𝑛 𝑛 𝑖=1 𝜎𝑖 )2 1

𝑛 𝑛 𝑖=1 𝜎2 𝑖or when (ii) the estimators are independent as 𝜌= 0. Surprisingly, such a benefit persists even in the interpolation regime 𝜏2 = 0. On the contrary, when the estimators are both dependent and have similar variance, it is necessary to increase both 𝑚and 𝑏, where a sweet spot between the two exists. However, such a regime is unlikely to occur in practice; in statistics and machine learning applications, the variance of the gradient estimators tends to vary greatly due to the heterogeneity of data.

Demystifying SGD with Doubly Stochastic Gradients

Acknowledgements

The authors would like to thank the anonymous reviewers for their comments and Jason Altschuler (UPenn) for numerous suggestions that strengthened the work.

K. Kim was supported by a gift from AWS AI to Penn Engineering s ASSET Center for Trustworthy AI; Y.-A. Ma was funded by the NSF Grants [NSF-SCALE Mo DL-2134209] and [NSF-CCF-2112665 (TILOS)], the U.S. Department Of Energy, Office of Science, as well as the DARPA AIE program; J. R. Gardner was supported by NSF award [IIS2145644].

Impact Statement

This paper presents a theoretical analysis of stochastic gradient descent under doubly stochastic noise to broaden our understanding of the algorithm. The work itself is theoretical, and we do not expect direct societal consequences, but SGD with doubly stochastic gradients is widely used in various aspects of machine learning and statistics. Therefore, we inherit the societal impact of the downstream applications of SGD.

Ahn, K., Yun, C., and Sra, S. SGD with shuffling: Optimal rates without component convexity and large epoch requirements. In Advances in Neural Information Processing Systems, volume 33, pp. 17526 17535. Curran Associates, Inc., 2020. (page 9)

Bassily, R., Smith, A., and Thakurta, A. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proceedings of the IEEE Annual Symposium on Foundations of Computer Science, FOCS 14, pp. 464 473, USA, October 2014. IEEE Computer Society. (page 1)

Bietti, A. and Mairal, J. Stochastic optimization with variance reduction for infinite datasets with finite sum structure. In Advances in Neural Information Processing Systems, volume 30, pp. 1623 1633. Curran Associates, Inc., 2017. (pages 1, 9)

Bottou, L. On-line learning and stochastic approximations. In On-Line Learning in Neural Networks, pp. 9 42. Cambridge University Press, 1 edition, January 1999. (page 1)

Bottou, L. Curiously fast convergence of some stochastic gradient descent algorithms. Unpublished open problem at the International Symposium on Statistical Learning and Data Sciences (SLDS), 2009. (page 9)

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223 311, January 2018. (page 1)

Cha, J., Lee, J., and Yun, C. Tighter lower bounds for shuffling SGD: Random permutations and beyond. In Proceedings of the International Conference on Machine Learning, volume 202 of PMLR, pp. 3855 3912. JMLR, July 2023. (page 9)

Csiba, D. and Richt arik, P. Importance sampling for minibatches. Journal of Machine Learning Research, 19(27): 1 21, 2018. (pages 3, 37)

Dai, B., Xie, B., He, N., Liang, Y., Raj, A., Balcan, M.- F. F., and Song, L. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, volume 27, pp. 3041 3049. Curran Associates, Inc., 2014. (pages 1, 3, 9)

Domke, J. Provable gradient variance guarantees for blackbox variational inference. In Advances in Neural Information Processing Systems, volume 32, pp. 329 338. Curran Associates, Inc., 2019. (pages 2, 37)

Domke, J. Provable smoothness guarantees for black-box variational inference. In Proceedings of the International Conference on Machine Learning, volume 119 of PMLR, pp. 2587 2596. JMLR, July 2020. (page 38)

Domke, J., Gower, R., and Garrigos, G. Provable convergence guarantees for black-box variational inference. In Advances in Neural Information Processing Systems, volume 36, pp. 66289 66327. Curran Associates, Inc., 2023. (page 16)

Duchi, J. C., Bartlett, P. L., and Wainwright, M. J. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674 701, January 2012. (page 35)

Garrigos, G. and Gower, R. M. Handbook of convergence theorems for (stochastic) gradient methods. Preprint ar Xiv:2301.11235, ar Xiv, February 2023. (pages 4, 19)

Gorbunov, E., Hanzely, F., and Richtarik, P. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 108 of PMLR, pp. 680 690. JMLR, June 2020. (page 37)

Gower, R., Sebbouh, O., and Loizou, N. SGD for structured nonconvex functions: Learning rates, minibatching and interpolation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 130 of PMLR, pp. 1315 1323. JMLR, March 2021a. (pages 3, 4, 6, 15, 16, 21, 35)

Demystifying SGD with Doubly Stochastic Gradients

Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., and Richt arik, P. SGD: General analysis and improved rates. In Proceedings of the International Conference on Machine Learning, volume 97 of PMLR, pp. 5200 5209. JMLR, June 2019. (pages 2, 3, 4, 6, 16, 21, 37)

Gower, R. M., Schmidt, M., Bach, F., and Richtarik, P. Variance-reduced methods for machine learning. Proceedings of the IEEE, 108(11):1968 1983, November 2020. (page 9)

Gower, R. M., Richt arik, P., and Bach, F. Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching. Mathematical Programming, 188(1): 135 192, July 2021b. (pages 2, 3, 4)

Guminov, S., Gasnikov, A., and Kuruzov, I. Accelerated methods for weakly-quasi-convex optimization problems. Computational Management Science, 20(1): 36, December 2023. (pages 4, 16)

G urb uzbalaban, M., Ozdaglar, A., and Parrilo, P. A. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, 186(1):49 84, March 2021. (page 9)

Haochen, J. and Sra, S. Random shuffling beats SGD after finite epochs. In Proceedings of the International Conference on Machine Learning, volume 97 of PMLR, pp. 2624 2633. JMLR, May 2019. (page 9)

Hinder, O., Sidford, A., and Sohoni, N. Near-optimal methods for minimizing star-convex functions and beyond. In Proceedings of Conference on Learning Theory, volume 125 of PMLR, pp. 1894 1938. JMLR, July 2020. (page 16)

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pp. 6840 6851. Curran Associates, Inc., 2020. (page 1)

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, volume 26, pp. 315 323. Curran Associates, Inc., 2013. (page 3)

Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, pp. 795 811, Cham, 2016. Springer International Publishing. (pages 15, 38)

Khaled, A. and Richt arik, P. Better theory for SGD in the nonconvex world. Transactions of Machine Learning Research, 2023. (pages 4, 16)

Kim, K., Oh, J., Wu, K., Ma, Y., and Gardner, J. R. On the convergence of black-box variational inference. In Advances in Neural Information Processing Systems, volume 36, pp. 44615 44657, New Orleans, LA, USA, December 2023. Curran Associates Inc. (pages 2, 38)

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, April 2014. (pages 1, 37)

Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, volume 28, pp. 2575 2583. Curran Associates, Inc., 2015. (pages 2, 3)

Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14): 1 45, 2017. (pages 1, 3, 37)

Kulunchakov, A. and Mairal, J. Estimate sequences for stochastic composite optimization: Variance reduction, acceleration, and robustness to noise. Journal of Machine Learning Research, 21(155):1 52, 2020. (pages 1, 9)

Liu, T., Li, Y., Wei, S., Zhou, E., and Zhao, T. Noisy gradient descent converges to flat minima for nonconvex matrix factorization. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 130 of PMLR, pp. 1891 1899. JMLR, March 2021. (pages 1, 3, 9, 35)

Ma, S., Bassily, R., and Belkin, M. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In Proceedings of the International Conference on Machine Learning, volume 80 of PMLR, pp. 3325 3334. JMLR, July 2018. (pages 6, 35)

Mishchenko, K., Khaled, A., and Richtarik, P. Random reshuffling: Simple analysis with vast improvements. In Advances in Neural Information Processing Systems, volume 33, pp. 17309 17320. Curran Associates, Inc., 2020. (pages 7, 8, 9, 30, 33)

Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A. Monte Carlo gradient estimation in machine learning. Journal of Machine Learning Research, 21(132):1 62, 2020. (pages 9, 37)

Moulines, E. and Bach, F. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, volume 24, pp. 451 459. Curran Associates, Inc., 2011. (pages 2, 3, 9, 16)

Demystifying SGD with Doubly Stochastic Gradients

Needell, D. and Ward, R. Batched stochastic gradient descent with weighted sampling. In Approximation Theory XV: San Antonio 2016, Springer Proceedings in Mathematics & Statistics, pp. 279 306, Cham, 2017. Springer International Publishing. (page 37)

Needell, D., Srebro, N., and Ward, R. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Mathematical Programming, 155(1): 549 573, January 2016. (page 37)

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4): 1574 1609, January 2009. (pages 1, 9)

Nesterov, Y. and Polyak, B. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177 205, August 2006. (page 16)

Nesterov, Yu. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127 152, May 2005. (page 35)

Nguyen, L., Nguyen, P. H., van Dijk, M., Richtarik, P., Scheinberg, K., and Takac, M. SGD and Hogwild! Convergence without the bounded gradients assumption. In Proceedings of the International Conference on Machine Learning, volume 80 of PMLR, pp. 3750 3758. JMLR, July 2018. (pages 1, 3, 9, 15)

Nguyen, L. M., Tran-Dinh, Q., Phan, D. T., Nguyen, P. H., and Van Dijk, M. A unified convergence analysis for shuffling-type gradient methods. The Journal of Machine Learning Research, 22(1):207:9397 207:9440, January 2021. (page 9)

Orvieto, A., Raj, A., Kersting, H., and Bach, F. Explicit regularization in overparametrized models via noise injection. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 206 of PMLR, pp. 7265 7287. JMLR, April 2023. (pages 1, 3, 9, 35)

Polyak, B. T. and d Aleksandr Borisovich, T. Optimal order of accuracy of search algorithms in stochastic optimization. Problemy Peredachi Informatsii, 26(2):45 53, 1990. (page 35)

Ranganath, R., Gerrish, S., and Blei, D. Black box variational inference. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 33 of PMLR, pp. 814 822. JMLR, April 2014. (page 1)

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International

Conference on Machine Learning, volume 32 of PMLR, pp. 1278 1286. JMLR, June 2014. (pages 1, 37)

Richt arik, P. and Tak aˇc, M. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433 484, March 2016. (page 3)

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3): 400 407, September 1951. (page 1)

Safran, I. and Shamir, O. How good is SGD with random shuffling? In Proceedings of Conference on Learning Theory, volume 125 of PMLR, pp. 3250 3284. JMLR, July 2020. (page 9)

Safran, I. and Shamir, O. Random shuffling beats SGD only after many epochs on ill-conditioned problems. In Advances in Neural Information Processing Systems, volume 34, pp. 15151 15161. Curran Associates, Inc., 2021. (page 9)

Schmidt, M. and Roux, N. L. Fast convergence of stochastic gradient descent under a strong growth condition. ar Xiv Preprint ar Xiv:1308.6370, ar Xiv, August 2013. (page 16)

Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Stochastic convex optimization. In Proceedings of the Conference on Computational Learning Theory, June 2009. (page 9)

Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3 30, March 2011. (page 1)

Shi, W., Gu, B., Li, X., Deng, C., and Huang, H. Triply stochastic gradient method for large-scale nonlinear similar unlabeled classification. Machine Learning, 110(8): 2005 2033, August 2021. (pages 1, 9)

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, volume 37 of PMLR, pp. 2256 2265. JMLR, June 2015. (page 1)

Song, S., Chaudhuri, K., and Sarwate, A. D. Stochastic gradient descent with differentially private updates. In Proceedings of the IEEE Global Conference on Signal and Information Processing, pp. 245 248, Austin, TX, USA, December 2013. IEEE. (page 1)

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, volume 32, pp. 11918 11930. Curran Associates, Inc., 2019. (page 1)

Demystifying SGD with Doubly Stochastic Gradients

Stich, S. U. Unified optimal analysis of the (stochastic) gradient method. Preprint ar Xiv:1907.04232, ar Xiv, December 2019. (page 4)

Titsias, M. and L azaro-Gredilla, M. Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the International Conference on Machine Learning, volume 32 of PMLR, pp. 1971 1979. JMLR, June 2014. (pages 1, 3, 37)

Vapnik, V. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems, volume 4, pp. 831 838. Morgan-Kaufmann, 1991. (page 1)

Vaswani, S., Bach, F., and Schmidt, M. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 89 of PMLR, pp. 1195 1204. JMLR, April 2019. (pages 6, 16, 35)

Wright, S. J. and Recht, B. Optimization for Data Analysis. Cambridge University Press, New York, 2021. (page 16)

Xie, B., Liang, Y., and Song, L. Scale up nonlinear component analysis with doubly stochastic gradients. In Advances in Neural Information Processing Systems, volume 28, pp. 2341 2349. Curran Associates, Inc., 2015. (page 9)

Xu, M., Quiroz, M., Kohn, R., and Sisson, S. A. Variance reduction properties of the reparameterization trick. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 89 of PMLR, pp. 2711 2720. JMLR, April 2019. (pages 9, 37)

Zheng, S. and Kwok, J. T.-Y. Lightweight stochastic optimization for minimizing finite sums with infinite data. In Proceedings of the International Conference on Machine Learning, volume 80 of PMLR, pp. 5932 5940. JMLR, July 2018. (pages 1, 9)

Demystifying SGD with Doubly Stochastic Gradients

TABLE OF CONTENTS

1 Introduction 1

2 Preliminaries 2 2.1 Stochastic Gradient Descent on Finite-Sums . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.2 Doubly Stochastic Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.3 Technical Assumptions on Gradient Estimators . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.4 Convergence Guarantees for SGD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

3 Main Results 5 3.1 Doubly Stochastic Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.1.1 General Variance Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.1.2 Gradient Variance Conditions for SGD . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.2 Random Reshuffling of Stochastic Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.2.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.2.2 Proof Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.2.3 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

4 Simulation 8

5 Discussions 9 5.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 5.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 5.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

A Gradient Variance Conditions 15 A.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.2 Additional Gradient Variance Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.3 Establishing the ER Condition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 A.4 Proofs of Implications in Fig. 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

B Proofs 18 B.1 Auxiliary Lemmas (Lemmas 2 to 6) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 B.2 Convergence of SGD (Lemmas 1, 7 and 8) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 B.3 General Variance Bound (Theorem 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 B.4 Doubly Stochastic Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 B.4.1 Expected Residual Condition (Theorem 2) . . . . . . . . . . . . . . . . . . . . . . . . . 25 B.4.2 Bounded Variance Condition (Theorem 3) . . . . . . . . . . . . . . . . . . . . . . . . . 27 B.4.3 Complexity Analysis (Corollary 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 B.5 Random Reshuffling of Stochastic Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 B.5.1 Gradient Variance Conditions (Lemma 11, Lemma 12) . . . . . . . . . . . . . . . . . . 29 B.5.2 Convergence Analysis (Theorem 5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 B.5.3 Complexity Analysis (Theorem 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

C Applications 35 C.1 ERM with Randomized Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 C.1.1 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 C.1.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 C.1.3 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 C.2 Reparameterization Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 C.2.1 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 C.2.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 C.2.3 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

Demystifying SGD with Doubly Stochastic Gradients

SG + 𝑓is smooth

QV with 𝛽= 0

𝑓(𝒙; 𝜼) is uniformly smooth

(5) (6) (7)

(8) (9) (10)

Figure 2. Implications between general gradient variance conditions for some unbiased estimator 𝙜(𝒙) = 𝑓(𝒙; 𝞰) of 𝑓(𝒙) = 𝔼𝙜(𝒙). The dashed arrows ( ) hold if 𝑓is further assumed to be QFG; the dotted arrow ( ) holds if the integrand 𝑓(𝒙; 𝜼) is uniformly convex such that it is convex with respect to 𝒙for any fixed 𝜼. (1), (5), (9), (13) are established by Gower et al. (2021a, Theorem 3.4); (2) is proven in Proposition 3; (3) is proven in Proposition 7; (4) is proven in Proposition 6; (7) is proven in Proposition 4; (8) is proven in Proposition 5; (6) is proven by Nguyen et al. (2018, Lemma 2) but we restate the proof in Proposition 8; (11) is proven in Proposition 2; (10), (12) hold trivially if 𝒙 arg min𝒙 𝒳𝑓(𝒙) are all stationary points.

A. Gradient Variance Conditions

In this section, we will discuss some additional aspects of the ER and ES conditions introduced in Section 2.3. We will also look into alternative gradient variance conditions that have been proposed in the literature and their relationship with the ER condition.

A.1. Definitions

For this section, we will use the following additional definitions:

Definition 3 (Quadratic Functional Growth; QFG). We say 𝑓 𝒳 ℝsatisfies 𝜇-quadratic functional growth if there exists some 𝜇> 0 such that

2 𝒙 𝒙 2 2 𝑓(𝒙) 𝑓(𝒙 )

holds for all 𝒙 𝒳, where 𝒙 arg min𝒙 𝒳𝑓(𝒙).

This condition implies that 𝑓grows at least as fast as some quadratic and is weaker than the Polyak-Łojasiewicz. However, for any convex function 𝑓that satisfies this condition also means that 𝑓is 𝜇-strongly convex (Karimi et al.,

Definition 4 (Uniform Smoothness). For the unbiased estimator 𝙜(𝒙) = 𝑓(𝒙; 𝞰) of 𝐹(𝒙) = 𝔼 𝑓(𝒙; 𝞰) = 𝔼 𝑓(𝒙; 𝞰), we say the integrand 𝑓𝑖(𝒙; 𝜼) satisfies uniform 𝐿-smoothness if there exist some 𝐿< such that, for any fixed 𝜼,

𝑓(𝒙; 𝜼) 𝑓(𝒙 ; 𝜼) 2 𝐿 𝒙 𝒙 2

holds for all (𝒙, 𝒙 ) 𝒳2 simultaneously.

As discussed in Sections 1 and 5.2, this condition is rather strong: it does not hold for multiplicative noise unless the support is bounded.

Definition 5 (Uniform Convexity). For the unbiased estimator 𝙜(𝒙) = 𝑓(𝒙; 𝞰) of 𝐹(𝒙) = 𝔼 𝑓(𝒙; 𝞰), we say the integrand 𝑓(𝒙; 𝜼) is uniformly convex if it is convex for any 𝜼such that, for any fixed 𝜼,

𝑓(𝒙; 𝜼) 𝑓(𝒙 ; 𝜼) 𝑓(𝒙; 𝜼) , 𝒙 𝒙

holds for all (𝒙, 𝒙 ) 𝒳2 simultaneously.

A.2. Additional Gradient Variance Conditions

For some estimator 𝙜of 𝑓, the following conditions have been considered in the literature:

Strong growth condition (SG):

𝔼 𝙜(𝒙) 2 2 𝜌 𝑓(𝒙) 2 2

Weak growth condition (WG):

𝔼 𝙜(𝒙) 2 2 𝜌(𝑓(𝒙) 𝑓(𝒙 ))

Quadratic variance (QV):

𝔼 𝙜(𝒙) 2 2 𝛼 𝒙 𝒙 2 2 + 𝛽

Convex expected smoothness (CES):

𝔼 𝙜(𝒙) 𝙜(𝒚) 2 2 2ℒD𝑓(𝒙, 𝒚)

Convex expected residual (CER):

tr𝕍[𝙜(𝒙) 𝙜(𝒚)] 2ℒD𝑓(𝒙, 𝒚)

Quadratic expected smoothness (QES):

𝔼 𝙜(𝒙) 𝙜(𝒚) 2 2 ℒ2 𝒙 𝒚 2 2

𝔼 𝙜(𝒙) 2 2 𝐴(𝑓(𝒙) 𝑓(𝒙 )) + 𝐵 𝑓(𝒙) 2 2 + 𝐶

Here, 𝒙 arg min𝒙 𝒳𝑓(𝒙) is any stationary point of 𝑓 and the stated conditions should hold for all (𝒙, 𝒚) 𝒳2.

Demystifying SGD with Doubly Stochastic Gradients

𝑓𝑖is smooth + 𝒙 -convex

𝑓𝑖is smooth + Interp.

𝑓𝑖is smooth + convex

Figure 3. Implications of assumptions on the components 𝑓1, , 𝑓𝑛to the minibatch subsampling gradient estimator 𝑓𝘉of 𝐹= 1

𝑛(𝑓1 + + 𝑓𝑛). (1), (4) are established by Gower et al. (2021a, Theorem 3.4), while (3) trivially follows from the fact that 𝒙 -convexity is strictly weaker than (global) convexity, and (2) was established by Gower et al. (2019, Proposition 3.10).

SG was used by Schmidt & Roux (2013) to establish the linear convergence of SGD for strongly convex objectives, and 𝒪(1 𝑇) convergence for convex objectives; WG was proposed by Vaswani et al. (2019) to establish similar guarantees to SG under a verifiable condition; QV was used to establish the non-asymptotic convergence on strongly convex functions by Wright & Recht (2021), while convergence on general convex functions was established by Domke et al. (2023), including stochastic proximal gradient descent; QES was used by (Moulines & Bach, 2011) to establish one of the earliest general non-asymptotic convergence results for SGD on strongly convex objectives; ABC was used by Khaled & Richt arik (2023) to establish convergence of SGD for non-convex smooth functions. (See also Khaled & Richt arik (2023) for a comprehensive overview of these conditions.) The relationship of these conditions with the ER condition are summarized in Fig. 2.

As demonstrated in Fig. 2 and discussed by Khaled & Richt arik (2023), the ABC condition is the weakest of all. However, the convergence guarantees for problems that exclusively satisfy the ABC condition are weaker than others. (For instance, the number of iterations 𝑇has to be fixed a priori.) On the other hand, the ER condition retrieves most of the strongest known guarantees for SGD; some of which were listed in Section 2.4.

A.3. Establishing the ER Condition

For subsampling estimators, it is possible to establish some of the gradient variance conditions through general assumptions on the components. See some examples in Fig. 3. Here, we use the following definitions:

Definition 6. For the finite sum objective 𝐹 = 1

𝑛(𝑓1 + + 𝑓𝑛), we say interpolation holds if, for all 𝑖= 1, , 𝑛, 𝑓𝑖(𝒙 ) 𝑓𝑖(𝒙) ,

holds for all 𝒙 𝒳, where 𝒙 arg min𝒙 𝒳𝐹(𝒙).

Definition 7. For the finite sum objective 𝐹 = 1

𝑛(𝑓1 + + 𝑓𝑛), we say the components are 𝒙 -convex if, for all 𝑖= 1, , 𝑛,

𝑓𝑖(𝒙 ) 𝑓𝑖(𝒙) 𝑓𝑖(𝒙 ) , 𝒙 𝒙

holds for all 𝒙 𝒳, where 𝒙 arg min𝒙 𝒳𝐹(𝒙).

This assumption is a weaker version of convexity; convexity needs to hold with respect to 𝒙 only. It is closely related to star (Nesterov & Polyak, 2006) and quasar convexity (Hinder et al., 2020; Guminov et al., 2023).

A.4. Proofs of Implications in Fig. 2

We prove new implication results between some of the gradient variance conditions discussed in Appendix A.2. In particular, the relationship between the QES and QV against other conditions has not been considered before.

Proposition 2. Let 𝙜be an unbiased estimator of 𝑓. Then,

𝙜is CES 𝙜is CER

Proof. The result immediately follows from the fact that

tr𝕍[𝙜(𝒙) 𝙜(𝒙 )] 𝔼 𝙜(𝒙) 𝙜(𝒙 ) 2

holds for all 𝒙, 𝒙 𝒳.

Proposition 3. Let 𝙜be an unbiased estimator of 𝑓. Then,

𝙜is SG + 𝑓is 𝐿-smooth

𝙜is QV with 𝛽= 0

Proof. Notice that, by definition, 𝑓(𝒙 ) = 0. Then,

𝔼 𝙜(𝒙) 2 2 𝜌 𝑓(𝒙) 2 2 = 𝜌 𝑓(𝒙) 𝑓(𝒙 ) 2 2,

applying 𝐿-smoothness of 𝑓,

𝐿2𝜌 𝒙 𝒙 2 2.

Demystifying SGD with Doubly Stochastic Gradients

Proposition 4. Let 𝙜(𝒙) = 𝑓(𝒙; 𝞰) be an unbiased estimator of 𝑓(𝒙) = 𝔼 𝑓(𝒙; 𝞰). Then,

Integrand is uniformly 𝐿-smooth QES

Proof. The result immediately follows from the fact that the integrand 𝑓(𝒙; 𝜼) is 𝐿-smooth with respect to 𝒙uniformly over 𝜼as

𝔼 𝙜(𝒙) 𝙜(𝒙 ) 2

2 = 𝔼 𝑓(𝒙; 𝞰) 𝑓(𝒙 ; 𝞰) 2

𝐿2 𝒙 𝒙 2 2.

Proposition 5. Let 𝙜(𝒙) = 𝑓(𝒙; 𝞰) be an unbiased estimator of 𝑓(𝒙) = 𝔼 𝑓(𝒙; 𝞰). Then,

Integrand is uniformly 𝐿-smooth + 𝑓is uniformly convex

Proof. Since the integrand 𝑓(𝒙; 𝜼) is both uniformly smooth and convex with respect to 𝒙for a any fixed 𝜼, we have 𝑓(𝒙; 𝜼) 𝑓(𝒙 ; 𝜼) 2 2𝐿(𝑓(𝒙; 𝜼) 𝑓(𝒙 ; 𝜼) 𝑓(𝒙 ; 𝜼) , 𝒙 𝒙 ) .

𝔼 𝙜(𝒙) 𝙜(𝒙 ) 2 2 = 𝔼 𝑓(𝒙; 𝞰) 𝑓(𝒙 ; 𝞰) 2 2 2𝐿𝔼(𝑓(𝒙; 𝜼) 𝑓(𝒙 ; 𝜼) 𝑓(𝒙 ; 𝜼) , 𝒙 𝒙 )

= 2𝐿(𝑓(𝒙) 𝑓(𝒙 ) 𝑓(𝒙 ) , 𝒙 𝒙 )

= 2𝐿(𝑓(𝒙) 𝑓(𝒙 ))

holds for all 𝒙 𝒳.

Proposition 6. Let 𝙜be an unbiased estimator of 𝐹. Then,

𝙜is QV with 𝛽= 0 QES

Proof. From the classic inequality (𝑎+ 𝑏)2 2𝑎2 + 2𝑏2, we have

𝔼 𝙜(𝒙) 𝙜(𝒙 ) 2 2 2 𝔼 𝙜(𝒙) 2 2 + 2 𝔼 𝙜(𝒙 ) 2 2.

Now, since QV holds with 𝛽= 0, we have 𝔼 𝙜(𝒙 ) 2 2 = 0. Therefore,

𝔼 𝙜(𝒙) 𝙜(𝒙 ) 2 2 2 𝔼 𝙜(𝒙) 2 2 2 𝛼 𝒙 𝒙 2 2,

where we have applied QV at the last inequality.

Proposition 7. Let 𝙜be an unbiased estimator of 𝑓. Then,

𝙜is QV with 𝛽= 0 + 𝑓is 𝜇-QFG

Proof. The result immediately follows from QV as

𝔼 𝙜(𝒙) 2 2 𝛼 𝒙 𝒙 2 2,

applying 𝜇-quadratic functional growth,

𝜇(𝑓(𝒙) 𝑓(𝒙 )) .

Proposition 8. Let 𝙜be an unbiased estimator of 𝑓. Then,

𝙜is QES + 𝑓is 𝜇-QFG

Proof. From QV, we have

𝔼 𝙜(𝒙) 𝙜(𝒙 ) 2 2 ℒ2 𝒙 𝒙 2 2 and 𝜇-quadratic functional growth yields

𝜇 (𝑓(𝒙) 𝑓(𝒙 )) .

The strategy applying QFG when proving Propositions 7 and 8 establishes the stronger variant of the ER condition: Assumption 6 (B). However, the price for this is that one has to pay for an excess 𝜅= ℒ 𝜇factor, and this strategy works only works for quadratically growing objectives.

Demystifying SGD with Doubly Stochastic Gradients

B.1. Auxiliary Lemmas (Lemmas 2 to 6)

Lemma 2. Consider a finite population of 𝑛vectorvariate random variables 𝒙1, , 𝒙𝑛. Then, the variance of the average of 𝑏samples chosen without replacement is

𝑖=1 𝙭𝘉𝑖 = 𝑛 𝑏 𝑏(𝑛 1)𝜎2,

where 𝘉= {𝘉1, , 𝘉𝑏} is the collection of random indices of the samples and 𝜎2 is the variance of independently choosing a single sample.

Proof. From the variance of the sum of random variables, we have

𝑖=1 tr𝕍[𝙭𝘉𝑖 ] +

𝑖 𝑗 Cov ( 𝙭𝘉𝑖, 𝙭𝘉𝑗 ) ,

and noticing that the covariance is independent of the index in the batch,

= 𝑏tr𝕍[𝙭𝘉𝑖 ] + 𝑏(𝑏 1)𝐶, (12)

where 𝐶= Cov ( 𝙭𝘉𝑖, 𝙭𝘉𝑗 ) . Using the fact that the variance is 0 for 𝑏= 𝑛, we can solve for 𝐶such that

𝐶= 1 𝑛 1 tr𝕍[𝙭𝘉𝑖 ] ,

which is negative, and a negative correlation is always great. Plugging this expression to Eq. (12), we have

𝑖=1 𝙭𝘉𝑖 = 𝑏tr𝕍[𝙭𝘉𝑖 ] 𝑏(𝑏 1) 1 𝑛 1 tr𝕍[𝙭𝘉𝑖 ] ,

𝑛 1) tr𝕍[𝙭𝘉𝑖 ]

𝑛 1) tr𝕍[𝙭𝘉𝑖 ] .

Dividing both sides by 𝑏2 yields the result.

Lemma 3. Let 𝙭1, , 𝙭𝑛be vector-variate random variables. Then, the variance of the sum is upper-bounded as

𝑖=1 𝙭𝑖] ( 𝑛 𝑖=1

tr𝕍[𝙭𝑖] )2 (13)

𝑛 𝑛 𝑖=1 tr𝕍[𝙭𝑖]. (14)

The equality in Eq. (13) holds if and only if 𝙭𝑖and 𝙭𝑗are constant multiples such that there exists some 𝛼𝑖𝑗 0 such that

𝙭𝑖= 𝛼𝑖𝑗𝙭𝑗 for all 𝑖, 𝑗.

Proof. The variance of a sum is

𝑗=1 tr Cov (𝙭𝑖, 𝙭𝑗 ) .

From the Cauchy-Schwarz inequality for expectations,

tr Cov (𝙭𝑖, 𝙭𝑗 ) = 𝔼(𝙭𝑖 𝔼𝙭𝑖) (𝙭𝑗 𝔼𝙭𝑗 )

𝔼 𝙭𝑖 𝔼𝙭𝑖 2𝔼 𝙭𝑗 𝔼𝙭𝑗 2

This implies

𝑗=1 tr Cov (𝙭𝑖, 𝙭𝑗 )

The equality statement comes from the property of the Cauchy-Schwarz inequality. Lastly, Eq. (14) follows from additionally applying Jensen s inequality as

𝑖=1 tr𝕍[𝙭𝑖] .

An equivalent proof strategy is to expand the quadratic in Eq. (15) and apply the arithmetic mean-geometric mean inequality to the cross terms.

Demystifying SGD with Doubly Stochastic Gradients

Lemma 4 (Lemma A.2; Garrigos & Gower, 2023). For a recurrence relation given as

𝑟𝑇 (1 𝛾𝜇)𝑇𝑟0 + 𝐵𝛾,

for some constant 0 < 𝛾< 1 𝐶,

can be guaranteed by setting

𝜖, 𝐶) log (2𝑟0

where 𝜇, 𝐵> 0 and 0 < 𝐶< 𝜇are some finite constants.

Proof. First, notice that the recurrence

𝑟𝑇 (1 𝛾𝜇)𝑇𝑟0 bias

+ 𝐵𝛾 variance

is a sum of monotonically increasing (variance) and decreasing (bias) terms with respect to 𝛾. Therefore, the bound is minimized when both terms are equal. This implies that 𝑟𝑡 𝜖can be achieved by solving for

(1 𝛾𝜇)𝑇𝑟0 𝜖

First, for the variance term,

For the bias term, as long as 𝛾< 1

(1 𝛾𝜇)𝑇𝑟0 𝜖

𝑇log (1 𝛾𝜇) log 𝜖

2𝑟0 log (1 𝛾𝜇)

𝜖 log (1 (1 𝛾𝜇))

Furthermore, using the bound log 1 𝑥 1 𝑥for 0 < 𝑥< 1, we can achieve the guarantee with

Therefore, 1 𝛾determines the iteration complexity. Plugging in the minimum over the constraints on 𝛾yields the iteration complexity.

Lemma 5. For a recurrence relation given as

𝑟𝑇 (1 𝛾𝜇)𝑇𝑟0 + 𝐴𝛾2 + 𝐵𝛾,

for some constant 0 < 𝛾< 1 𝐶,

can be guaranteed by setting

𝛾= min ( 𝐵+

𝐵2 + 2𝐴𝜖 2𝐴 , 1

𝜖 , 𝐶) log (2𝑟0

where 𝜇, 𝐴, 𝐵> 0 and 0 < 𝐶< 𝜇are some finite constants.

Proof. This theorem is a generalization of Lemma A.2 by Garrigos & Gower (2023). First, notice that the recurrence

𝑟𝑇 (1 𝛾𝜇)𝑇𝑟0 bias

+ 𝐴𝛾2 + 𝐵𝛾 variance

is a sum of monotonically increasing (variance) and decreasing (bias) terms with respect to 𝛾. Therefore, the bound is minimized when both terms are equal. This implies that 𝑟𝑡 𝜖can be achieved by solving for

(1 𝛾𝜇)𝑇𝑟0 𝜖

2 and 𝐴𝛾2 + 𝐵𝛾 𝜖

First, for the variance term,

The solution to this equation is given by the positive solution of the quadratic equation as

𝐵2 + 2𝐴𝜖 2𝐴 .

For the bias term, as long as 𝛾< 1

𝜇, the solution is identical

to Lemma 4. Therefore,

can guarantee the bias term to be smaller than 𝜖 2, while 1 𝛾determines the iteration complexity. Plugging in the minimum over the constraints on 𝛾,

𝛾= min ( 𝐵+

𝐵2 + 2𝐴𝜖 2𝐴 , 1

Demystifying SGD with Doubly Stochastic Gradients

yields the iteration complexity.

Now, since the quadratic formula is not very interpretable, let us simplify the expression for 1 𝛾using the bound

which holds for any 𝑎, 𝑏> 0 and is tight for 𝜖 0. With our constants, this reads

𝐵2 + 2𝐴𝜖 𝐵+

and therefore

Therefore, for the stepsize choice of Eq. (17),

1 𝛾 min (2𝐵1

Plugging this into Eq. (16) yields the statement.

Lemma 6. Let 𝐹 𝒳 ℝbe a finite sum of convex functions as 𝐹= 1

𝑛(𝑓1 + + 𝑓𝑛), where 𝑓𝑖 𝒳 ℝ. Then, 1 𝑛

𝑖=1 D𝑓𝑖 (𝒙, 𝒙 ) = D𝐹 (𝒙, 𝒙 ) ,

for any 𝒙, 𝒙 𝒳.

Proof. The result immediately follows from the definition of Bregman divergences as

𝑖=1 D𝑓𝑖 (𝒙, 𝒙 )

(𝑓𝑖(𝒙) 𝑓𝑖 (𝒙 ) 𝑓𝑖 (𝒙 ) , 𝒙 𝒙 )

𝑖=1 𝑓𝑖(𝒙)) ( 1

𝑖=1 𝑓𝑖 (𝒙 ) )

𝑖=1 𝑓𝑖 (𝒙 ) , 𝒙 𝒙

= 𝐹(𝒙) 𝐹(𝒙 ) 𝐹(𝒙 ) , 𝒙 𝒙

= D𝐹 (𝒙, 𝒙 ) .

Demystifying SGD with Doubly Stochastic Gradients

B.2. Convergence of SGD (Lemmas 1, 7 and 8)

Lemma 7. Let 𝐹 𝒳 ℝbe 𝐿-smooth function. Then, the expected squared norm of a gradient estimator 𝙜satisfying both ER (ℒ) and BV (𝜎2) is bounded as

𝔼 𝙜(𝒙) 2 2 4 (ℒ+ 𝐿) (𝐹(𝒙) 𝐹(𝒙 )) + 2𝜎2,

for any 𝒙 𝒳and 𝒙 arg max𝒙 𝒳𝐹(𝒙).

Proof. The proof is a minor modification of Lemma 2.4 by Gower et al. (2019) and Lemma 3.2 by Gower et al. (2021a).

By applying the bound (𝑎+ 𝑏)2 2𝑎2 + 2𝑏2, we can transfer the variance on 𝒙to the variance of 𝒙 . That is,

𝔼 𝙜(𝒙) 2 2 = 𝔼 𝙜(𝒙) 𝙜(𝒙 ) + 𝙜(𝒙 ) 2 2 2 𝔼 𝙜(𝒙) 𝙜(𝒙 ) 2 2 𝑉1

+2 𝔼 𝙜(𝒙 ) 2 2 𝑉2

The key is to bound 𝑉1. It is typical to do this using expected-smoothness-type assumptions such as the ER assumption. That is,

𝑉1 = 𝔼 𝙜(𝒙) 𝙜(𝒙 ) 2 2 = tr𝕍[𝙜(𝒙) 𝙜(𝒙 )] + ( 𝐹(𝒙) 𝐹(𝒙 )) ,

from the 𝐿-smoothness of 𝐹,

tr𝕍[𝙜(𝒙) 𝙜(𝒙 )] + 2𝐿(𝐹(𝒙) 𝐹(𝒙 )) ,

and the ER condition,

2ℒ(𝐹(𝒙) 𝐹(𝒙 )) + 2𝐿(𝐹(𝒙) 𝐹(𝒙 ))

= 2 (𝐿+ ℒ) (𝐹(𝒙) 𝐹(𝒙 ) ).

Finally, 𝑉2 immediately follows from the BV condition as

𝑉2 = 𝔼 𝙜(𝒙 ) 2 2 𝜎2.

Lemma 8. Let the objective function 𝐹satisfy Assumption 1 and the gradient estimator 𝙜be unbiased and satisfy both ER (ℒ) and BV (𝜎2). Then, the last iterate of SGD guarantees

𝔼 [ 𝒙𝑇 𝒙 2 2 ] (1 𝜇𝛾)𝑇 𝒙0 𝒙 2 2 + 2𝜎2

where 𝒙 = arg min𝒙 𝒳𝐹(𝒙) is the global optimum.

Proof. Firstly, we have

𝒙𝑡+1 𝒙 2 2 = Π𝒳(𝒙𝑡 𝛾𝙜(𝒙𝑡)) Π (𝒙 ) 2 2,

and since the projection onto a convex set under a Euclidean metric is non-expansive,

𝒙𝑡 𝛾𝙜(𝒙𝑡) 𝒙 2 2 = 𝒙𝑡 𝒙 2 2 2𝛾 𝙜(𝒙𝑡) , 𝒙𝑡 𝒙 + 𝛾2 𝙜(𝒙𝑡) 2 2.

Denoting the 𝜎-algebra formed by the randomness and the iterates up to the 𝑡th iteration as ℱ𝑡such that (ℱ𝑡)𝑡 1 forms a filtration, the conditional expectation is

𝔼 [ 𝒙𝑡+1 𝒙 2 2 ℱ𝑡 ]

= 𝒙𝑡 𝒙 2 2 2𝛾 𝔼[𝙜(𝒙𝑡) ℱ𝑡] , 𝒙𝑡 𝒙

+ 𝛾2𝔼 [ 𝙜(𝒙𝑡) 2 2 ℱ𝑡 ] .

= 𝒙𝑡 𝒙 2 2 2𝛾 𝐹(𝒙𝑡) , 𝒙𝑡 𝒙

+ 𝛾2𝔼 [ 𝙜(𝒙𝑡) 2 2 ℱ𝑡 ] ,

applying the 𝜇-strong convexity of 𝐹,

𝒙𝑡 𝒙 2 2 2𝛾(𝐹(𝒙𝑡) 𝐹(𝒙 ) + 𝜇

2 𝒙𝑡 𝒙 2 2)

+ 𝛾2𝔼 [ 𝙜(𝒙𝑡) 2 2 ℱ𝑡 ]

= (1 𝛾𝜇) 𝒙𝑡 𝒙 2 2 2𝛾(𝐹(𝒙𝑡) 𝐹(𝒙 ))

+ 𝛾2𝔼 [ 𝙜(𝒙𝑡) 2 2 ℱ𝑡 ]

From Lemma 7, we have

𝔼 [ 𝙜(𝒙𝑡) 2 2 ℱ𝑡 ] (4 (ℒ+ 𝐿) (𝐹(𝒙𝑡) 𝐹(𝒙 )) + 2𝜎2) .

𝔼 [ 𝒙𝑡+1 𝒙 2 2 ℱ𝑡 ]

(1 𝛾𝜇) 𝒙𝑡 𝒙 2 2 2𝛾(𝐹(𝒙𝑡) 𝐹(𝒙 ))

+ 𝛾2 (4 (ℒ+ 𝐿) (𝐹(𝒙𝑡) 𝐹(𝒙 )) + 2𝜎2)

= (1 𝛾𝜇) 𝒙𝑡 𝒙 2 2 2𝛾(1 2𝛾(ℒ+ 𝐿)) (𝐹(𝒙𝑡) 𝐹(𝒙 )) + 2𝛾2𝜎2,

Demystifying SGD with Doubly Stochastic Gradients

and with a small-enough stepsize satisfying 𝛾< 1

can guarantee a partial contraction as

(1 𝛾𝜇) 𝒙𝑡 𝒙 2 2 + 2𝛾2𝜎2.

Note that the coefficient 1 𝛾𝜇is guaranteed to be strictly smaller than 1 since 𝜇 𝐿, which means that we indeed have a partial contraction.

Now, taking full expectation, we have

𝔼 𝒙𝑡+1 𝒙 2 2 (1 𝛾𝜇) 𝔼 𝒙𝑡 𝒙 2 2 + 2𝛾2𝜎2.

Unrolling the recursion from 0 to 𝑇 1, we have

𝔼 𝒙𝑇 𝒙 2 2 (1 𝛾𝜇)𝑇𝔼 𝒙0 𝒙 2 2

+ 2𝛾2𝜎2 𝑇 1

𝑡=0 (1 𝛾𝜇)𝑡.

(1 𝛾𝜇)𝑇𝔼 𝒙0 𝒙 2 2 + 2𝜎2

where the last inequality follows from the asymptotic bound on geometric sums.

Lemma 1. Let the objective 𝐹satisfy Assumption 1 and the gradient estimator 𝙜satisfy ER (ℒ) and BV (𝜎2). Then, the last iterate of SGD is 𝜖-close to the global optimum 𝒙 = arg min𝒙 𝒳𝐹(𝒙) such that 𝔼 𝒙𝑇 𝒙 2 2 𝜖after a number of iterations at least

𝑇 2 max (𝜎2

𝜇2 1 𝜖, ℒ+ 𝐿

𝜇 ) log (2 𝒙0 𝒙 2 2 1 𝜖)

and the fixed stepsize

𝛾= min ( 𝜖𝜇

2𝜎2 , 1 2 (ℒ+ 𝐿)) .

Proof. We can apply Lemma 4 to the result of Lemma 8 with the constants

𝑟0 = 𝒙0 𝒙 2 2, 𝐵= 2𝜎2

𝜇, and 𝐶= 2 (ℒ+ 𝐿) .

Then, we can guarantee an 𝜖-accurate solution with the stepsize

𝛾= min ( 𝜖𝜇

2𝜎2 , 1 2 (ℒ+ 𝐿))

and a number of iterations of at least

𝜇, 2 (ℒ+ 𝐿)) log (2 𝒙0 𝒙 2 2 1 𝜖)

= 2 max (𝜎2

𝜇 ) log (2 𝒙0 𝒙 2 2 1 𝜖) .

Demystifying SGD with Doubly Stochastic Gradients

B.3. General Variance Bound (Theorem 1)

Lemma 9. Let 𝙭1, , 𝙭𝑏be a collection of vector-variate RVs dependent on some random variable 𝘉satisfying Assumption 3. Then, the expected variance of the sum of 𝙭1, , 𝙭𝑏conditioned on 𝘉is bounded as

𝔼 [ tr𝕍 [ 𝑏 𝑖=1 𝙭𝑖 𝘉 ]] 𝜌𝕍[𝘚] + 𝜌(𝔼𝘚)2 + (1 𝜌) 𝔼[𝘝] ,

tr𝕍[𝙭𝑖 𝘉] and 𝘝= 𝑏 𝑖=1 tr𝕍[𝙭𝑖 𝘉] .

Equality holds when the equality in Assumption 3 holds.

Proof. From the formula for the variance of sums,

|||||||||| 𝘉

𝑖=1 tr 𝕍[𝙭𝑖 𝘉] +

𝑗 𝑖 tr Cov (𝙭𝑖, 𝙭𝑗 𝘉) .

𝑖=1 tr 𝕍[𝙭𝑖 𝘉] +

𝑖=1 tr 𝕍[𝙭𝑖 𝘉]

𝑖=1 tr 𝕍[𝙭𝑖 𝘉] + 𝜌 ( 𝑏 𝑖=1

tr𝕍[𝙭𝑖 𝘉] )2

= (1 𝜌) 𝘝+ 𝜌𝘚2.

Then, it follows that

|||||||||| 𝘉

𝜌𝔼[𝘚2] + (1 𝜌) 𝔼[𝘝]

= 𝜌𝕍[𝘚] + 𝜌(𝔼𝘚)2 + (1 𝜌) 𝔼[𝘝] ,

from the basic property of the variance:

𝕍[𝘚] = 𝔼[𝘚2] (𝔼𝘚)2.

Since Assumption 3 is the only inequality we use, the equality in the statement holds whenever the equality in Assumption 3 holds.

Theorem 1. Let the component estimators 𝙭1, , 𝙭𝑛satisfy Assumption 3. Then, the variance of the doubly stochastic estimator 𝙭𝘉is bounded as

tr𝕍[𝙭𝘉] 𝑉com + 𝑉cor + 𝑉sub, where

𝑛 𝑛 𝑖=1 tr𝕍[𝙭𝑖] ) ,

𝑉cor = 𝜌(1 1

tr𝕍[𝙭𝑖] )2 , and

𝑛 𝑛 𝑖=1 𝒙𝑖 𝒙 2 2.

Equality holds when the equality in Assumption 3 holds.

Proof. Starting from the law of total covariance, we have

𝕍[𝙭𝘉] = 𝔼𝘉 𝜋[tr𝕍[𝙭𝘉 𝘉]] Ensemble Variance

+ tr𝕍𝘉 𝜋[𝔼[𝙭𝘉 𝘉]] . Subsampling Variance

Ensemble Variance Bounding the variance of each ensemble is key. From Lemma 9, we have

𝔼[tr𝕍[𝙭𝘉 𝘉]] = 𝔼[tr𝕍[1

||||||||| 𝘉]]

𝑏𝙭𝑖) ||||||||| 𝘉]]

𝜌𝕍𝘚+ 𝜌(𝔼𝘚)2 + (1 𝜌) 𝔼𝘝, (19)

𝑖 𝘉 tr𝕍[𝙭𝑖] .

In our context, 𝘚is the batch average of the standard deviations, and 𝘝is the batch average of the variance (scaled with a factor of 1 𝑏).

Notice that 𝘚is an 𝑏-sample average of the standard deviations. Therefore, if 𝜋is an unbiased subsampling strategy, we retrieve the population average standard deviation as

𝔼𝘉 𝜋[𝘚] = 𝔼𝘉 𝜋[1

tr𝕍[𝙭𝑖]] = 1

Under a similar reasoning, the variance of the standard deviations follows as

Demystifying SGD with Doubly Stochastic Gradients

= 1 𝑏eff 𝕍𝑖 Uniform{1, ,𝑛} [

𝑖=1 tr𝕍[𝙭𝑖] ( 1

where the last identity is the well-known formula for the variance: 𝕍𝘟= 𝔼𝘟2 (𝔼𝘟)2. Likewise, the average variance follows as

𝑖 𝘉 tr𝕍[𝙭𝑖]]

𝑖 𝘉 tr𝕍[𝙭𝑖]]

𝑖=1 tr𝕍[𝙭𝑖]) (22)

Plugging Eqs. (20) to (22) into Eq. (19), we have

𝔼𝘉 𝜋[tr𝕍[𝙭𝘉 𝘉]]

𝜌𝕍𝘚+ 𝜌(𝔼𝘚)2 + (1 𝜌) 𝔼𝘝

𝑖=1 tr𝕍[𝙭𝑖] ( 1

𝑖=1 tr𝕍[𝙭𝑖])

𝑖=1 tr𝕍[𝙭𝑖])

+ 𝜌(1 1 𝑏eff ) ( 1

Subsampling Variance The subsampling noise is straightforward. For this, we will denote the minibatch subsampling estimator of the component means as

Since each component estimator 𝙭𝑖is unbiased, the expectation conditional on the minibatch 𝘉is

𝔼[ 𝒙𝘉 𝘉] = 1

tr𝕍𝘉 𝜋[𝔼[𝙭𝘉 𝘉]] = tr𝕍𝘉 𝜋[ 𝒙𝘉]

= 1 𝑏eff ( 1

𝑖=1 𝒙𝑖 𝒙 2 2) . (24)

Combining Eqs. (23) and (24) into Eq. (18) yields the result. Notice that the only inequality we used is Eq. (19), Lemma 9, in which equality holds if the equality in Assumption 3 holds.

Demystifying SGD with Doubly Stochastic Gradients

B.4. Doubly Stochastic Gradients

B.4.1. EXPECTED RESIDUAL CONDITION (THEOREM 2)

Theorem 2. Let Assumption 4 to 6 hold. Then, we have:

(i) If (ACVX) or (AITP) hold, 𝙜𝘉satisfies ER (ℒA).

(ii) If (B) holds, 𝙜𝘉satisfies ER (ℒB).

where ℒmax = max {ℒ1, , ℒ𝑛 },

𝑏) ℒmax + 𝜌(1 1

𝑛 𝑛 𝑖=1 ℒ𝑖 ) + ℒsub

𝑛 𝑛 𝑖=1 ℒ𝑖 )

ℒ𝑖 )2 + ℒsub

Proof. From Theorem 1, we have

tr𝕍[𝙜𝘉(𝒙) 𝙜𝘉(𝒙 )]

𝑖=1 tr𝕍[𝙜𝑖(𝒙) 𝙜𝑖(𝒙 )])

+ 𝜌(1 1 𝑏eff ) ( 1

tr𝕍[𝙜𝑖(𝒙) 𝙜𝑖(𝒙 )])

+ 1 𝑏eff tr𝕍[ 𝑓𝘉(𝒙) 𝐹(𝒙)] ,

where Assumption 5 yields

𝑖=1 tr𝕍[𝙜𝑖(𝒙) 𝙜𝑖(𝒙 )])

+ 𝜌(1 1 𝑏eff ) ( 1

tr𝕍[𝙜𝑖(𝒙) 𝙜𝑖(𝒙 )])

𝑏eff (𝐹(𝒙) 𝐹(𝒙 ))

𝑏 ) 𝑇var + 𝜌(1 1 𝑏eff ) 𝑇cov

𝑏eff (𝐹(𝒙) 𝐹(𝒙 )) . (25)

Proof of (i) with (ACVX) Since Assumption 6 (ACVX) requires 𝑓1, , 𝑓𝑛to be convex, 𝐹is also convex. Therefore, we can use the identity in Lemma 6 and

D𝐹(𝒙, 𝒙 ) = 𝐹(𝒙) 𝐹(𝒙 ) .

With that said, under (ACVX), we have

𝑖=1 tr𝕍[𝙜𝑖(𝒙) 𝙜𝑖(𝒙 )]

𝑖=1 2ℒ𝑖D𝑓𝑖(𝒙, 𝒙 ) ,

applying ℒmax ℒ𝑖for all 𝑖= 1, , 𝑛,

𝑖=1 D𝑓𝑖(𝒙, 𝒙 )

and Lemma 6,

= 2ℒmax D𝐹(𝒙, 𝒙 ) . (26)

For 𝑇cov, since

tr𝕍[𝙜𝑖(𝒙) 𝙜𝑖(𝒙 )])

is monotonic w.r.t. the variance, we can apply (ACVX) as

ℒ𝑖D𝑓𝑖(𝒙, 𝒙 ))

Now, applying the Cauchy-Schwarz inequality yields

𝑖=1 D𝑓𝑖(𝒙, 𝒙 ))

𝑖=1 ℒ𝑖) ( 1

𝑖=1 D𝑓𝑖(𝒙, 𝒙 ))

and by Lemma 6,

𝑖=1 ℒ𝑖) D𝐹(𝒙, 𝒙 ) . (27)

Plugging Eqs. (26) and (27) into Eq. (25), we have

tr𝕍[𝙜(𝒙) 𝙜(𝒙 )]

𝑏 ) 𝑇var + 𝜌(1 1 𝑏eff ) 𝑇cov

𝑏eff (𝐹(𝒙) 𝐹(𝒙 ))

𝑏 ) 2ℒmax D𝐹(𝒙, 𝒙 )

+ 𝜌(1 1 𝑏eff ) 2 ( 1

𝑖=1 ℒ𝑖) D𝐹(𝒙, 𝒙 )

+ 1 𝑏eff 2ℒsub D𝐹(𝒙, 𝒙 ) .

𝑏 ) ℒmax + 𝜌(1 1 𝑏eff ) ( 1

+ 1 𝑏eff ℒsub ) D𝐹(𝒙, 𝒙 )

𝑏 ) ℒmax + 𝜌(1 1 𝑏eff ) ( 1

Demystifying SGD with Doubly Stochastic Gradients

+ 1 𝑏eff ℒsub ) (𝐹(𝒙) 𝐹(𝒙 )) .

Proof of (i) with (AITP) From Assumption 6 (AITP), we have

𝑖=1 tr𝕍[𝙜𝑖(𝒙) 𝙜𝑖(𝒙 )]

𝑖=1 2ℒ𝑖(𝑓𝑖(𝒙) 𝑓𝑖(𝒙 )) ,

applying ℒmax ℒ𝑖for all 𝑖= 1, , 𝑛,

𝑖=1 (𝑓𝑖(𝒙) 𝑓𝑖(𝒙 ))

= 2ℒmax (𝐹(𝒙) 𝐹(𝒙 )) . (28)

tr𝕍[𝙜𝑖(𝒙) 𝙜𝑖(𝒙 )])

applying (AITP),

ℒ𝑖(𝑓𝑖(𝒙) 𝑓𝑖(𝒙 )))

and applying the Cauchy-Schwarz inequality,

𝑖=1 𝑓𝑖(𝒙) 𝑓𝑖(𝒙 ))

𝑖=1 ℒ𝑖) ( 1

𝑖=1 𝑓𝑖(𝒙) 𝑓𝑖(𝒙 ))

𝑖=1 ℒ𝑖) (𝐹(𝒙) 𝐹(𝒙 )) . (29)

Plugging Eqs. (28) and (29) into Eq. (25), we have

tr𝕍[𝙜(𝒙) 𝙜(𝒙 )]

𝑏 ) 𝑇var + 𝜌(1 1 𝑏eff ) 𝑇cov

𝑏eff (𝐹(𝒙) 𝐹(𝒙 ))

𝑏 ) 2ℒmax (𝐹(𝒙) 𝐹(𝒙 ))

+ 𝜌(1 1 𝑏eff ) 2 ( 1

𝑖=1 ℒ𝑖) (𝐹(𝒙) 𝐹(𝒙 ))

+ 1 𝑏eff 2ℒsub (𝐹(𝒙) 𝐹(𝒙 )) .

𝑏 ) ℒmax + 𝜌(1 1 𝑏eff ) ( 1

+ 1 𝑏eff ℒsub ) (𝐹(𝒙) 𝐹(𝒙 )) .

Proof of (ii) From Assumption 6 (B), we have

𝑖=1 tr𝕍[𝙜𝑖(𝒙) 𝙜𝑖(𝒙 )]

𝑖=1 2ℒ𝑖(𝐹(𝒙) 𝐹(𝒙 ))

𝑖=1 ℒ𝑖) (𝐹(𝒙) 𝐹(𝒙 )) . (30)

tr𝕍[𝙜𝑖(𝒙) 𝙜𝑖(𝒙 )])

ℒ𝑖(𝐹(𝒙) 𝐹(𝒙 )))

(𝐹(𝒙) 𝐹(𝒙 )) . (31)

Plugging Eqs. (30) and (31) into Eq. (25), we have

tr𝕍[𝙜(𝒙) 𝙜(𝒙 )]

𝑖=1 ℒ𝑖) (𝐹(𝒙) 𝐹(𝒙 ))

+ 𝜌(1 1 𝑏eff ) 2 ( 1

(𝐹(𝒙) 𝐹(𝒙 ))

+ 1 𝑏eff 2ℒsub (𝐹(𝒙) 𝐹(𝒙 )) ,

+ 𝜌(1 1 𝑏eff ) ( 1

+ 1 𝑏eff ℒsub) (𝐹(𝒙) 𝐹(𝒙 )) .

Demystifying SGD with Doubly Stochastic Gradients

B.4.2. BOUNDED VARIANCE CONDITION (THEOREM 3)

Theorem 3. Let Assumption 4 and 7 hold. Then, 𝙜𝘉satisfies BV (𝜎2), where

𝑛 𝑛 𝑖=1 𝜎2 𝑖 )

+ 𝜌(1 1 𝑏eff ) ( 1

𝑛 𝑛 𝑖=1 𝜎𝑖 )2 + 𝜏2

Equality in Definition 2 holds if equality in Assumption 4 holds.

Proof. For any element of the solution set 𝒙 = arg min𝒙 𝒳𝐹(𝒙), by Theorem 1, we have

tr𝕍[𝙜𝘉(𝒙 )] ( 𝜌

𝑖=1 tr𝕍[𝙜𝑖(𝒙 )])

+ 𝜌(1 1 𝑏eff ) ( 1

tr𝕍[𝙜𝑖(𝒙 )])

+ 1 𝑏eff tr𝕍[ 𝑓𝘉(𝒙 )] .

Applying Assumption 7, we have

+ (1 1 𝑏eff ) ( 1

+ 1 𝑏eff 𝜏2

= (1 1 𝑏eff ) ( 1

+ 𝜌(1 1 𝑏eff ) ( 1

+ 1 𝑏eff 𝜏2.

B.4.3. COMPLEXITY ANALYSIS (COROLLARY 2)

Lemma 10. Let the objective function 𝐹satisfy Assumption 2, 𝜋be sampling 𝑏samples without replacement, and all elements of the solution set 𝒙 arg min𝒙 𝒳𝐹(𝒙) be stationary points of 𝐹. Then, the subsampling estimator 𝑓𝘉satisfies the ER condition as

tr𝕍𝘉 𝜋[ 𝑓𝘉(𝒙) 𝑓𝘉(𝒙 )]

2 𝑛 𝑏 𝑏(𝑛 1)𝐿max (𝐹(𝒙) 𝐹(𝒙 )) ,

where 𝐿max = max {𝐿1, , 𝐿𝑛}.

Proof. Consider that, for any random vector 𝙭,

tr𝕍[𝙭2] 𝔼 𝙭 2 2

holds. Also, sampling without replacement achieves 𝑏eff = (𝑛 1)𝑏

𝑛 𝑏. Therefore, we have

tr𝕍𝘉 𝜋[ 𝑓𝘉(𝒙) 𝑓𝘉(𝒙 )]

= 𝑛 𝑏 𝑏(𝑛 1)tr𝕍[ 𝑓𝑖(𝒙) 𝑓𝑖(𝒙 )]

𝑛 𝑏 𝑏(𝑛 1) ( 1

𝑖=1 𝑓𝑖(𝒙) 𝑓𝑖(𝒙 ) 2 2) ,

and from Assumption 2,

= 𝑛 𝑏 𝑏(𝑛 1) ( 1

𝑖=1 2𝐿𝑖D𝑓𝑖(𝒙, 𝒙 )) .

Using the bound 𝐿max 𝐿𝑖for all 𝑖= 1, , 𝑛,

2𝐿max 𝑛 𝑏 𝑏(𝑛 1) ( 1

𝑖=1 D𝑓𝑖(𝒙, 𝒙 )) ,

applying Lemma 6,

= 2𝐿max 𝑛 𝑏 𝑏(𝑛 1)D𝐹(𝒙, 𝒙 ) ,

and since 𝒙 is a stationary point of 𝐹,

= 2 𝑛 𝑏 𝑏(𝑛 1)𝐿max (𝐹(𝒙) 𝐹(𝒙 )) .

Demystifying SGD with Doubly Stochastic Gradients

Corollary 2. Let the objective 𝐹satisfy Assumption 1 and 2, the global optimum 𝒙 = arg min𝒙 𝒳𝐹(𝒙) be a stationary point of 𝐹, the component gradient estimators 𝙜1, , 𝙜𝑛satisfy Assumption 6 (B) and 7, and 𝜋be 𝑏minibatch sampling without replacement. Then the last iterate of SGD with 𝙜𝘉is 𝜖-close to 𝒙 as 𝔼 𝒙𝑇 𝒙 2 2 𝜖after a number of iterations of at least

𝑇 2 max (𝐶var 1 𝜖, 𝐶bias) log (2 𝒙0 𝒙 2 2 1 𝜖)

for some fixed stepsize where

𝜎2 𝑖 𝜇2 ) + 2 ( 1

Proof. From Assumption 2 and the assumption that 𝒙 is a stationary point, Lemma 10 establishes that 𝑓𝘉satisfies the ER (ℒsub) holds with

ℒsub = 𝑛 𝑏 (𝑛 1)𝑏𝐿max.

Therefore, Assumption 5 holds. Furthermore, since the component gradient estimators satisfy Assumption 6 (B) and Assumption 3 always hold with 𝜌= 1, we can apply Theorem 2 which estblishes that 𝙜𝘉satisfies ER (ℒ) with

ℒ= 𝑛 𝑏 (𝑛 1)𝑏( 1

𝑖=1 ℒ𝑖) + 𝑛(𝑏 1)

+ 𝑛 𝑏 (𝑛 1)𝑏𝐿max.

Furthermore, under Assumption 7, Theorem 3 shows that BV (𝜎2) holds with

𝜎2 = 𝑛 𝑏 (𝑛 1)𝑏( 1

𝑖=1 𝜎2 𝑖) + 𝑛(𝑏 1)

+ 𝑛 𝑏 (𝑛 1)𝑏𝜏2.

Since both ER (ℒ) and BV (𝜎2) hold and 𝐹satisfies Assumption 1, we can now invoke Lemma 1, which guarantees that we can obtain an 𝜖-accurate solution after

𝑇 2 max ( 𝜎2

) log (2 𝒙0 𝒙 2 2 1 𝜖)

iterations and fixed stepsize of

𝛾= min ( 𝜖𝜇

2𝜎2 , 1 2 (ℒ+ 𝐿)) .

The constants in the lower bound on the number of required iterations can be made more precise as

𝐶var = 𝑛 𝑏 (𝑛 1)𝑏( 1

+ 𝑛 𝑏 (𝑛 1)𝑏 𝜏2

𝐶bias = 𝑛 𝑏 (𝑛 1)𝑏( 1

+ 𝑛 𝑏 (𝑛 1)𝑏 𝐿 𝜇.

Using the fact that (𝑛 𝑏) 𝑛 (𝑛 1) 𝑛 2 for all 𝑛 2 yields the simplified constants in the statement.

Demystifying SGD with Doubly Stochastic Gradients

B.5. Random Reshuffling of Stochastic Gradients

B.5.1. GRADIENT VARIANCE CONDITIONS (LEMMA 11, LEMMA 12)

Lemma 11. Let the objective function satisfy Assumption 2, 𝐵be any 𝑏-minibatch of indices such that 𝐵 {1, , 𝑛} and the component gradient estimators 𝙜1, , 𝙜𝑛satisfy Assumption 6 (ACVX). Then, 𝙜𝐵is convex-smooth in expectation such that

𝔼𝜑 𝙜𝐵(𝒙) 𝙜𝐵(𝒙 ) 2 2 2 (ℒmax + 𝐿max) D𝑓𝐵(𝒙, 𝒙 ) ,

for any 𝒙 𝒳, where

𝒙 = arg min 𝒙 𝒳 𝐹(𝒙) ,

ℒmax = max {ℒ1, , ℒ𝑛} ,

𝐿max = max {𝐿1, , 𝐿𝑛} .

Proof. Notice that, for this Lemma, we do not assume that the minibatch 𝐵is a random variable. Therefore, the only randomness is the stochasticity of the component gradient estimators 𝙜1, , 𝙜𝑛.

Now, from the property of the variance, we can decompose the expected squared norm as

𝔼 𝙜𝐵(𝒙) 𝙜𝐵(𝒙 ) 2 2 = tr𝕍𝜑[𝙜𝐵(𝒙) 𝙜𝐵(𝒙 )] 𝑉com

+ 𝑓𝐵(𝒙) 𝑓𝐵(𝒙 ) 2 2 𝑉sub

First, the contribution of the variances of the component gradient estimators follows as

𝑉com = tr𝕍𝜑[𝙜(𝒙) 𝙜(𝒙 )]

𝑖 𝐵 𝙜𝑖(𝒙) 𝙜𝑖(𝒙 )] ,

applying Eq. (14) of Lemma 3,

𝑖 𝐵 tr𝕍𝜑[𝙜𝑖(𝒙) 𝙜𝑖(𝒙 )] , (32)

and then Assumption 6 (ACVX),

𝑖 𝐵 2ℒ𝑖D𝑓𝑖(𝒙, 𝒙 ) .

Now, since ℒmax ℒ𝑖for all 𝑖= 1, , 𝑛,

𝑖 𝐵 D𝑓𝑖(𝒙, 𝒙 )

= 2ℒmax D𝑓𝐵(𝒙, 𝒙 ) .

On the other hand, the squared error of subsampling (it is not the variance since we do not take expectation over the

batches) follows as

𝑉sub = 𝑓𝘉(𝒙) 𝑓𝘉(𝒙 ) 2 2

𝑖 𝐵 𝑓𝑖(𝒙) 𝑓𝑖(𝒙 )

by Jensen s inequality,

𝑖 𝐵 𝑓𝑖(𝒙) 𝑓𝑖(𝒙 ) 2 2,

from Assumption 2,

𝑖 𝐵 2𝐿𝑖D𝑓𝑖(𝒙, 𝒙 )

and since 𝐿max 𝐿𝑖for all 𝑖= 1, , 𝑛,

𝑖 𝐵 D𝑓𝑖(𝒙, 𝒙 )

= 2𝐿max D𝑓𝐵(𝒙, 𝒙 ) .

Combining the bound on 𝑉com and 𝑉sub immediately yields the result.

Demystifying SGD with Doubly Stochastic Gradients

Lemma 12. For any 𝑏-minibatch reshuffling strategy, the squared error of the reference point of the Lyapunov function (Eq. (10)) under reshuffling is bounded as

𝔼 𝒙𝑖 𝒙 2 2 𝛾2𝑛

for all 𝑖= 1, , 𝑝, where 𝒙 arg min𝒙 𝒳𝐹(𝒙).

Proof. The proof is a generalization of Mishchenko et al. (2020, Proposition 1), where we sample 𝑏-minibatches instead of single datapoints. Recall that 𝘗denotes the (possibly random) partitioning of the 𝑛datapoints into 𝑏minibatches 𝘗1, , 𝘗𝑝. From the definition of the squared error of the Lyapunov function in Eq. (10), we have

𝔼[ 𝒙𝑖 𝒙 2 2]

𝑘=0 𝛾 𝑓𝘗𝑖(𝒙 )

and since the projection onto a convex set under a Euclidean metric is non-expansive,

𝑘=0 𝛾 𝑓𝘗𝑖(𝒙 ) 𝒙

𝑘=0 𝛾 𝑓𝘗𝑖(𝒙 )

introducing a factor of 𝑖in and out of the squared norm,

𝑘=0 𝛾 𝑓𝘗𝑖(𝒙 )

𝑘=0 𝑓𝘗𝑖(𝒙 )

Now notice that 1

𝑖 𝑖 1 𝑗=0 𝑓𝘗𝑖(𝒙 ) is a sample average of 𝑖𝑏samples drawn without replacement. Therefore, it is an unbiased estimate of 𝐹(𝒙 ). This implies

𝔼[ 𝒙𝑖 𝒙 2 2] = 𝛾2𝑖2

𝑘=0 𝑓𝘗𝑖(𝒙 )

𝑘=0 𝑓𝘗𝑖(𝒙 ) ,

and from Lemma 2 with a sample size of 𝑖𝑏,

2 𝑛 𝑖𝑏 (𝑛 1) 𝑖𝑏 1 𝑛

𝑖=1 𝑓𝑖(𝒙 ) 2 2

2 (𝑛 1) 𝜏2.

Notice that this is a quadratic with respect to 𝑖, where the maximum is obtained by 𝑖= 𝑛 2𝑏. Then,

𝔼[ 𝒙𝑖 𝒙 2 2] 𝛾2( 𝑛

8𝑏2 (𝑛 1)𝜏2,

and using the bound 𝑛 (𝑛 1) 2 for all 𝑛 2,

Demystifying SGD with Doubly Stochastic Gradients

B.5.2. CONVERGENCE ANALYSIS (THEOREM 5)

Theorem 5. Let the objective 𝐹satisfy Assumption 1 and 2, where, each component 𝑓𝑖is additionally 𝜇strongly convex and Assumption 6 (ACVX), 7 hold. Then, the last iterate 𝒙𝑇of doubly SGD-RR with a stepsize satisfying 𝛾< 1 (ℒmax + 𝐿max) guarantees

𝔼 𝒙0 𝐾+1 𝒙 2

2 𝑟𝐾𝑝 𝒙0 1 𝒙 2

2 + 𝐶sub var 𝛾2 + 𝐶com var 𝛾

where 𝑝 = 𝑛 𝑏is the number of epochs, 𝒙 = arg min𝒙 𝒳𝐹(𝒙), 𝑟= 1 𝛾𝜇is the contraction coefficient,

𝐶com var = 4

𝑖=1 𝜎2 𝑖) + 4

𝐶sub var = 1

𝑖=1 𝑓𝑖(𝒙 ) 2 2) .

Proof. The key element of the analysis of random reshuffling is that the Lyapunov function that achieves a fast con-

vergence is 𝒙𝑖+1 𝑘 𝒙𝑖+1 2

2 not 𝒙𝑖+1 𝑘 𝒙 2

2. This stems from the well-known fact that random reshuffling results in a conditionally biased gradient estimator.

Recall that 𝘗denotes the partitioning of the 𝑛datapoints into 𝑏-minibatches 𝘗1, , 𝘗𝑝. As usual, we first expand the Lyapunov function as

𝒙𝑖+1 𝑘 𝒙𝑖+1 2

= Π𝒳(𝒙𝑖 𝑘 𝛾𝙜𝘗𝑖(𝒙𝑖 𝑘)) Π𝒳(𝒙𝑖 𝛾 𝑓𝘗𝑖(𝒙 )) 2

2 and since the projection onto a convex set under a Euclidean metric is non-expansive,

(𝒙𝑖 𝑘 𝛾𝙜𝘗𝑖(𝒙𝑖 𝑘)) (𝒙𝑖 𝛾 𝑓𝘗𝑖(𝒙 )) 2

2 2𝛾 𝒙𝑖 𝑘 𝒙𝑖 , 𝙜𝘗𝑖(𝒙𝑖 𝑘) 𝑓𝘗𝑖(𝒙 )

+ 𝛾2 𝙜𝘗𝑖(𝒙𝑖 𝑘) 𝑓𝘗𝑖(𝒙 ) 2

Taking expectation over the Monte Carlo noise conditional on the partitioning 𝘗,

𝔼𝜑 𝒙𝑖+1 𝑘 𝒙𝑖+1 2

= 𝒙𝑖 𝑘 𝒙𝑖 2

2 2𝛾 𝒙𝑖 𝑘 𝒙𝑖 , 𝔼𝜑[𝙜𝘗𝑖(𝒙𝑖 𝑘)] 𝑓𝘗𝑖(𝒙 )

+ 𝛾2𝔼𝜑 𝙜𝘗𝑖(𝒙𝑖 𝑘) 𝑓𝘗𝑖(𝒙 )

= 𝒙𝑖 𝑘 𝒙𝑖 2

2 2𝛾 𝒙𝑖 𝑘 𝒙𝑖 , 𝑓𝘗𝑖(𝒙𝑖 𝑘) 𝑓𝘗𝑖(𝒙 )

+ 𝛾2 𝔼𝜑 𝙜𝘗𝑖(𝒙𝑖 𝑘) 𝑓𝘗𝑖(𝒙 ) 2

From the three-point identity, we can more precisely characterize the effect of the conditional bias such that 𝒙𝑖 𝑘 𝒙𝑖 , 𝑓𝘗𝑖(𝒙𝑖 𝑘) 𝑓𝘗𝑖(𝒙 )

= D𝑓𝘗𝑖(𝒙𝑖 , 𝒙𝑖 𝑘) + D𝑓𝘗𝑖(𝒙𝑖 𝑘, 𝒙 ) D𝑓𝘗𝑖(𝒙𝑖 , 𝒙 ).

For the gradient noise,

𝔼𝜑 𝙜𝘗𝑖(𝒙𝑖 𝑘) 𝑓𝘗𝑖(𝒙 ) 2

= 𝔼𝜑 𝙜𝘗𝑖(𝒙𝑖 𝑘) 𝙜𝘗𝑖(𝒙 ) + 𝙜𝘗𝑖(𝒙 ) 𝑓𝘗𝑖(𝒙 ) 2

2𝔼𝜑 𝙜𝘗𝑖(𝒙𝑖 𝑘) 𝙜𝘗𝑖(𝒙 ) 2

2 + 2𝔼𝜑 𝙜𝘗𝑖(𝒙 ) 𝑓𝘗𝑖(𝒙 ) 2 2

= 2 𝔼𝜑 𝙜𝘗𝑖(𝒙𝑖 𝑘) 𝙜𝘗𝑖(𝒙 ) 2

2 + 2 tr𝕍𝜑 [𝙜𝘗𝑖(𝒙 ) ] ,

and from Lemma 11,

4 (ℒmax + 𝐿max) D𝑓𝘗𝑖(𝒙𝑖 𝑘, 𝒙 ) + 2tr𝕍𝜑 [𝙜𝘗𝑖(𝒙 ) ]

Notice the variance term tr𝕍𝜑 [𝙜𝘗𝑖(𝒙 ) ]. This quantifies the amount of deviation from the trajectory of singly stochastic random reshuffling. As such, it quantifies how slower we will be compared to its fast rate.

Now, we will denote the 𝜎-algebra formed by the randomness and the iterates up to the 𝑖th step of the 𝑘th epoch as ℱ𝑖 𝑘such that (ℱ𝑖 𝑘)𝑘 1,𝑖 1 is a filtration. Then,

𝔼𝞰𝑖 𝑘 𝜑[ 𝒙𝑖+1 𝑘 𝒙𝑖+1 2

2 |||||| ℱ𝑖 𝑘]

2 2𝛾 ( D𝑓𝘗𝑖(𝒙𝑖 , 𝒙𝑖 𝑘) + D𝑓𝘗𝑖(𝒙𝑖 𝑘, 𝒙 ) D𝑓𝘗𝑖(𝒙𝑖 , 𝒙 ) )

+ 4𝛾2 (ℒmax + 𝐿max) D𝑓𝘗𝑖(𝒙𝑖 𝑘, 𝒙 )

+ 2𝛾2tr𝕍𝜑 [𝙜𝘗𝑖(𝒙 ) ] .

Now, the 𝜇-strong convexity of the component functions

( 𝒙𝑖 , 𝒙𝑖 𝑘 ) 𝜇

2 𝒙𝑖 𝑘 𝒙𝑖 2

2. Therefore,

2 𝒙𝑖 𝑘 𝒙𝑖 2

2 + D𝑓𝘗𝑖(𝒙𝑖 𝑘, 𝒙 ) D𝑓𝘗𝑖(𝒙𝑖 , 𝒙 ))

+ 4𝛾2 (ℒmax + 𝐿max) D𝑓𝘗𝑖(𝒙𝑖 𝑘, 𝒙 )

+ 2𝛾2tr𝕍𝜑 [𝙜𝘗𝑖(𝒙 ) ] ,

and reorganizing the terms,

= (1 𝛾𝜇) 𝒙𝑖 𝑘 𝒙𝑖 2

2 2𝛾(1 2𝛾(ℒmax + 𝐿max)) D𝑓𝘗𝑖(𝒙𝑖 𝑘, 𝒙 )

+ 2𝛾D𝑓𝘗𝑖(𝒙𝑖 , 𝒙 )

+ 𝛾22tr𝕍𝜑 [𝙜𝘗𝑖(𝒙 ) ] .

Demystifying SGD with Doubly Stochastic Gradients

Taking full expectation,

𝔼 𝒙𝑖+1 𝑘 𝒙𝑖+1 2

(1 𝛾𝜇) 𝔼 𝒙𝑖 𝑘 𝒙𝑖 2

2 2𝛾(1 2𝛾(ℒmax + 𝐿max)) 𝔼 [ D𝑓𝘗𝑖

( 𝒙𝑛 𝑘, 𝒙 )]

+ 2𝛾𝔼 [ D𝑓𝘗𝑖 (𝒙𝑖 , 𝒙 )]

+ 2𝛾2𝔼[tr𝕍𝜑 [𝙜𝘗𝑖(𝒙 ) ]] ,

and as long as 𝛾< 1 (2 (ℒmax + 𝐿max))

(1 𝛾𝜇) 𝔼 𝒙𝑖 𝑘 𝒙𝑖 2

2 + 2𝛾𝔼 [ D𝑓𝘗𝑖 (𝒙𝑖 , 𝒙 )]

𝑇err + 2𝛾2 𝔼[tr𝕍𝜑 [𝙜𝘗𝑖(𝒙 ) ]]

Bounding 𝑇err From the definition of the Bregman divergence and 𝐿-smoothness, for all 𝑗= 1, , 𝑛, notice that we have

D𝑓𝑗(𝒚, 𝒙) = 𝑓𝑗(𝒚) 𝑓𝑗(𝒙) 𝑓𝑗(𝒙) , 𝒚 𝒙

2 𝒚 𝒙 2 2. (34)

for all (𝒙, 𝒙 ) 𝒳2. Given this, the Lyapunov error term

( 𝒙𝑖 𝑘, 𝒙 )] = 𝔼

𝑗 𝘗𝑖 D𝑓𝑗 ( 𝒙𝑖 𝑘, 𝒙 ) can be bounded using 𝐿-smoothness by Eq. (34),

2 and 𝐿max 𝐿𝑖for all 𝑖= 1, , 𝑛,

𝑗 𝘗𝑖 𝒙𝑖 𝑘 𝒙 2

2 𝔼 𝒙𝑖 𝑘 𝒙 2

The squared error 𝒙𝑖 𝑘 𝒙 2

2 is bounded in Lemma 12 as

2 𝜖2 sfl 𝛾2𝑛

4𝑏2 𝜏2 < . (36)

Bounding 𝑇var Now, let s take a look at the variance term. First, notice that, by the Law of Total Expectation,

𝔼[tr𝕍𝜑 [𝙜𝘗𝑖(𝒙 ) ]] = 𝔼[𝔼[tr𝕍𝜑 [𝙜𝘗𝑖(𝒙 ) ] 𝘗]] .

Here, 𝔼[tr𝕍𝜑 [𝙜𝘗𝑖(𝒙 ) ] 𝘗]

is the variance from selecting 𝑏samples without replacement. We can thus apply Lemma 9 with 𝑏eff = (𝑛 1)𝑏

𝑛 𝑏 such that

𝔼[tr𝕍𝜑 [𝙜𝘗𝑖(𝒙 ) ] 𝘗]

𝑛 𝑏 (𝑛 1) 𝑏

which we will denote as

for clarity. Also, notice that 𝜎2 no longer depends on the partitioning.

Per-step Recurrence Equation Applying Eqs. (35) and (37) to Eq. (33), we now have the recurrence equation

𝔼 𝒙𝑖+1 𝑘 𝒙𝑖+1 2

2 (1 𝛾𝜇) 𝔼 𝒙𝑖 𝑘 𝒙𝑖 2

2 + 𝐿max𝜖2 sfl 𝛾+ 2𝜎2 𝛾2.

Now that we have a contraction of the Lyapunov function

𝔼 𝒙𝑖+1 𝑘 𝒙𝑖+1 2

2, it remains to convert this that the Lya-

punov function bounds our objective 𝔼 𝒙𝑖+1 𝑘 𝒙 2

2. This can be achieved by noticing that, at the end of each epoch, we have 𝒙𝑘+1 𝒙 = 𝒙𝑝 𝑘 𝒙𝑝 , and equivalently, we have 𝒙𝑘 𝒙 = 𝒙0 𝑘 𝒙0 at the beginning of the epoch. The fact that the relationship with the original objective is only guaranteed at the endpoints (beginning and end of the epoch) is related to the fact that the bias of random reshuffling starts increasing at the beginning of the epoch and starts decreasing near the end.

Per-Epoch Recurrence Equation Nevertheless, this implies that by simply unrolling the recursion as in the analysis of regular SGD, we obtain a per-epoch contraction of

𝔼 𝒙0 𝑘+1 𝒙 2

2 (1 𝛾𝜇)𝑝𝔼 𝒙0 𝑘 𝒙 2

+ ( 𝐿max𝜖2 sfl𝛾+ 2𝜎2𝛾2)

𝑖=0 (1 𝜇𝛾)𝑖

And after 𝐾epochs,

𝔼 𝒙0 𝐾+1 𝒙 2

2 (1 𝛾𝜇)𝑝𝐾𝔼 𝒙0 0 𝒙 2

+ ( 𝐿max𝜖2 sfl𝛾+ 2𝜎2𝛾2)

𝑖=0 (1 𝜇𝛾)𝑖

𝑗=0 (1 𝜇𝛾)𝑝𝑗

Note that 𝑇= 𝑝𝐾.

Demystifying SGD with Doubly Stochastic Gradients

As done by Mishchenko et al. (2020), the product of sums can be bounded as

𝑖=0 (1 𝜇𝛾)𝑖

𝑗=0 (1 𝜇𝛾)𝑝𝑗

𝑗=0 (1 𝜇𝛾)𝑖(1 𝜇𝛾)𝑝𝑗

𝑗=0 (1 𝜇𝛾)𝑖+𝑝𝑗

𝑖=0 (1 𝜇𝛾)𝑖

𝑖=0 (1 𝜇𝛾)𝑖

𝔼 𝒙0 𝐾+1 𝒙 2

(1 𝛾𝜇)𝑝𝐾𝔼 𝒙0 0 𝒙 2

( 𝐿max𝜖2 sfl𝛾+ 2𝜎2𝛾2)

= (1 𝛾𝜇)𝑝𝐾𝔼 𝒙0 0 𝒙 2

2 + 𝜖2 sfl 𝜇+ 2𝜎2

Plugging in the value of 𝜖2 sfl from Eq. (36), we have

𝔼 𝒙0 𝐾+1 𝒙 2

2 (1 𝛾𝜇)𝑝𝐾𝔼 𝒙0 0 𝒙 2

+ 𝐿max𝑛𝜎2 sub 4𝑏2𝜇 𝛾2 + 2𝜎2

This implies

𝔼 𝒙0 𝐾+1 𝒙 2

2 𝑟𝐾𝑛 𝑏 𝒙0 1 𝒙 2

2 + 𝐶sub var 𝛾2 + 𝐶com var 𝛾,

where 𝑟= 1 𝛾𝜇,

𝐶sub var = 1

𝑖=1 𝑓𝑖(𝒙 ) 2 2) , and

𝐶com var = 2

𝜇 𝑛 𝑏 (𝑛 1)𝑏( 1

𝑖=1 𝜎2 𝑖) + 2

𝜇 𝑛(𝑏 1) (𝑛 1) 𝑏( 1

Applying the fact that (𝑛 𝑏) 𝑛 (𝑛 1) 𝑛 2 for all 𝑛 2 yields the simplified constants in the statement.

Demystifying SGD with Doubly Stochastic Gradients

B.5.3. COMPLEXITY ANALYSIS (THEOREM 4)

Theorem 4. Let the objective 𝐹satisfy Assumption 1 and 2, where each component 𝑓𝑖is additionally 𝜇strongly convex, and Assumption 6 (ACVX), 7 hold. Then, the last iterate 𝒙𝑇of doubly SGD-RR is 𝜖-close to the global optimum 𝒙 = arg max𝒙 𝒳𝐹(𝒙) such

that 𝔼 𝒙𝑇 𝒙 2 2 𝜖after a number of iterations of at least

𝑇 max (4𝐶com var 1

𝜖+ 𝐶sub var 1

𝜖, 𝐶bias) log (2 𝒙0 1 𝒙 2

for some fixed stepsize, where 𝑇= 𝐾𝑝= 𝐾𝑛 𝑏,

𝐶bias = (ℒmax + 𝐿) 𝜇

𝐶com var = 2

𝜎2 𝑖 𝜇2 ) + 2 ( 1

Proof. From the result of Theorem 5, we can invoke Lemma 5 with

𝑛 𝑏 (𝑛 1)𝑏( 1

𝑖=1 𝜎2 𝑖) + 𝑛(𝑏 1)

𝐶= ℒmax + 𝐿max.

Then, an 𝜖accurate solution in expectation can be obtained after

𝜖 , ℒmax + 𝐿max

𝜇 ) log (2𝑟2 0 1 𝜖)

iterations with a stepsize of

𝛾= min ( 𝐵+

𝐵2 + 2𝐴𝜖 2𝐴 , 1

To make the iteration complexity more precise, the terms 𝐶1, 𝐶2 can be organized as

𝜇{ 𝑛 𝑏 (𝑛 1)𝑏( 1

𝑖=1 𝜎2 𝑖) + 𝑛(𝑏 1)

Applying the fact that (𝑛 𝑏) 𝑛 (𝑛 1) 𝑛 2 for all 𝑛 2 yields the simplified constants in the statement.

Demystifying SGD with Doubly Stochastic Gradients

C. Applications

C.1. ERM with Randomized Smoothing

C.1.1. DESCRIPTION

Randomized smoothing was originally considered by Polyak & d Aleksandr Borisovich (1990); Nesterov (2005); Duchi et al. (2012) in the nonsmooth convex optimization context, where the function is smoothed through random perturbation. This scheme has recently renewed interest in the non-convex ERM context as it has been found to improve generalization performance (Orvieto et al., 2023; Liu et al., 2021). Here, we will focus on the computational aspect of this scheme. In particular, we will see if we can obtain similar computational guarantees already established in the finite-sum ERM setting, such as those by Gower et al. (2021a, Lemma 5.2).

Consider the canonical ERM problem, where we are given a dataset 𝒟= {(𝒙𝑖, 𝑦𝑖)}𝑛 𝑖=1 (𝒳 𝒴)𝑛and solve

minimize 𝒘 𝒲 𝐿(𝒘) = 1

𝑖=1 𝓁(𝑓𝒘(𝒙𝑖) , 𝑦𝑖) + ℎ(𝒘) ,

where (𝒙𝑖, 𝑦𝑖) 𝒳 𝒴are the feature and label of the 𝑖th instance, 𝑓𝒘 𝒳 𝒴is the model, 𝓁 𝒴 𝒴 ℝ 0 is a non-negative loss function, and ℎ 𝒲 ℝis a regularizer.

For randomized smoothing, we instead minimize

𝑖=1 𝑅𝑖(𝒘) ,

where the instance risk is defined as

𝑟𝑖(𝒘) = 𝔼𝟄 𝜑𝓁(𝑓𝒘+𝟄(𝒙𝑖) , 𝑦𝑖)

for some noise distribution 𝝐 𝜑. The goal is to obtain a solution 𝒘 = arg min𝒘 𝒲𝐿(𝒘) that is robust to such perturbation.

The integrand of the gradient estimator of the instance risk is defined as

𝒈𝑖(𝒘; 𝜼) = 𝒘𝓁(𝑓𝒘+𝟄(𝒙𝑖) , 𝑦𝑖)

𝒘 𝓁 (𝑓𝒘+𝟄(𝒙𝑖) , 𝑦𝑖) ,

where it is an unbiased estimate of the instance risk such that 𝔼𝙜𝑖(𝒘) = 𝑅𝑖(𝒘) .

The key challenge in analyzing the convergence of SGD in the ERM setting is dealing with the Jacobian 𝑓𝒘+𝟄(𝒙𝑖)

𝒘 . Even for simple toy models, analyzing the Jacobian without relying on strong assumptions is hard. In this work, we will assume that it is bounded by an instance-dependent constant.

C.1.2. PRELIMINARIES

We use the following assumptions:

Assumption 8.

(a) Let the mapping 𝑦 𝓁( 𝑦, 𝑦) is convex and 𝐿smooth for any 𝑦𝑖 𝑖= 1, , 𝑛.

(b) The Jacobian of the model with respect to its parameters for all 𝑖= 1, , 𝑛is bounded almost surely as

for all 𝒘 𝒲.

(c) Interpolation holds on the solution set such that, for all 𝒘 arg min𝒘 𝒲𝐿(𝒘), the loss minimized as

𝓁(𝑓𝒘 +𝟄(𝒙𝑖) , 𝑦𝑖 ) = 𝓁 (𝑓𝒘 +𝟄(𝒙𝑖) , 𝑦𝑖 ) = 0

for all (𝒙𝑖, 𝑦𝑖) 𝒟.

(a) holds for the squared loss, (c) basically assumes that the model is overparameterized and there exists a set of optimal weights that are robust with respect to perturbation. The has recently gained popularity as it qualitatively explains some of the empirical phenomenons of non-convex SGD (Vaswani et al., 2019; Gower et al., 2021a; Ma et al., 2018). (b) is a strong assumption but is commonly used to establish convergence guarantees of ERM (Gower et al., 2021a).

Remark 12. Under Assumption 8 (c), Assumption 7 holds with arbitrarily small 𝜎2 𝑖, 𝜏2 .

Demystifying SGD with Doubly Stochastic Gradients

C.1.3. THEORETICAL ANALYSIS

Proposition 9. Let Assumption 8 hold. Then, Assumption 6 (AITP) holds.

𝔼 𝙜𝑖(𝒘) 𝙜𝑖(𝒘 ) 2 2

𝒘 𝓁 (𝑓𝒘+𝟄(𝒙𝑖) , 𝑦𝑖)

𝒘 𝓁 (𝑓𝒘 +𝟄(𝒙𝑖) , 𝑦𝑖 )

from the interpolation assumption (Assumption 8 (c)),

𝒘 𝓁 (𝑓𝒘+𝟄(𝒙𝑖) , 𝑦𝑖)

2 ||||𝓁 (𝑓𝒘+𝟄(𝒙𝑖) , 𝑦𝑖)|||| 2,

applying Assumption 8 (b),

𝐺2 𝑖𝔼||||𝓁 (𝑓𝒘+𝟄(𝒙𝑖) , 𝑦𝑖)|||| 2.

and then the interpolation assumption (Assumption 8 (c)),

= 𝐺2 𝑖𝔼||||𝓁 (𝑓𝒘+𝟄(𝒙𝑖) , 𝑦𝑖) 𝓁 (𝑓𝒘 +𝟄(𝒙𝑖) , 𝑦𝑖 )||||.

From Assumption 8 (a),

2𝐿𝐺2 𝑖𝔼 ( 𝓁(𝑓𝒘+𝟄(𝒙𝑖) , 𝑦𝑖) 𝓁(𝑓𝒘 +𝟄(𝒙𝑖) , 𝑦𝑖 )

𝓁 (𝑓𝒘 +𝟄(𝒙𝑖) , 𝑦𝑖 ) , 𝑓𝒘+𝟄(𝒙𝑖) 𝑓𝒘 +𝟄(𝒙𝑖) )

and interpolation (Assumption 8 (c)),

= 2𝐿𝐺2 𝑖 (𝔼𝓁(𝑓𝒘+𝟄(𝒙𝑖) , 𝑦𝑖) 𝔼𝓁(𝑓𝒘 +𝟄(𝒙𝑖) , 𝑦𝑖 ))

= 2𝐿𝐺2 𝑖(𝑅𝑖(𝒘) 𝑅𝑖(𝒘 )) .

Proposition 10. Let Assumption 8 hold. Then, Assumption 5 holds.

𝑛 𝑛 𝑖=1 𝑅𝑖(𝒘) 𝑅𝑖(𝒘 ) 2 2

𝑛 𝑛 𝑖=1 𝔼𝙜𝑖(𝒘) 𝔼𝙜𝑖(𝒘 ) 2 2,

and from Jensen s inequality,

𝑛 𝑛 𝑖=1𝔼 𝙜𝑖(𝒘) 𝙜𝑖(𝒘 ) 2 2.

We can now reuse Proposition 9 as

𝑖=1 2𝐿𝐺2 𝑖(𝑅𝑖(𝒘) 𝑅𝑖(𝒘 ))

and taking 𝐺max > 𝐺𝑖for all 𝑖= 1, , 𝑛as

2𝐿𝐺2 max 1 𝑛

𝑖=1 (𝑅𝑖(𝒘) 𝑅𝑖(𝒘 ))

= 2𝐿𝐺2 max (𝐿(𝒘) 𝐿(𝒘 )) .

Demystifying SGD with Doubly Stochastic Gradients

C.2. Reparameterization Gradient

C.2.1. DESCRIPTION

The reparameterization gradient estimator (Kingma & Welling, 2014; Rezende et al., 2014; Titsias & L azaro Gredilla, 2014) is a gradient estimator for problems of the form of 𝑓𝑖(𝒘) = 𝔼𝙯 𝑞𝒘𝓁𝑖(𝙯) ,

where 𝓁 ℝ𝑑𝙯 ℝis some integrand, such that the derivative is taken with respect to the parameters of the distribution𝑞𝒘we are integrating over. It was independently proposed by Kingma & Welling (2014); Rezende et al. (2014) in the context of variational expectation maximization of deep latent variable models (a setup commonly known as variational autoencoders) and by Titsias & L azaro-Gredilla (2014) for variational inference of Bayesian models.

Consider the case where the generative process of 𝑞𝒘can be represented as

𝙯 𝑞𝒘 𝙯 𝑑= 𝒯𝒘(𝙪) ; 𝙪 𝜑,

where 𝑑= is equivalence in distribution, 𝜑is some base distribution independent of 𝒘, and 𝒯𝒘is a reparameterization function measurable with respect to 𝜑and differentiable with respect to all 𝒘 𝒲. Then, the reparameterization gradient is given by the integrand

𝒈𝑖(𝒘; 𝒖) = 𝒘𝓁𝑖(𝒯𝒘(𝒖)) ,

which is unbiased, and often results in lower variance (Kucukelbir et al., 2017; Xu et al., 2019) compared to alternatives such as the score gradient estimator. (See Mohamed et al. (2020) for an overview of such estimators.)

The reparameterization gradient is primarily used to solve problems in the form of

minimize 𝒘 𝒲 𝐹(𝒘) =

𝑖=1 𝑓𝑖(𝒘) + ℎ(𝒘)

= 𝔼𝙯 𝑞𝒘𝓁𝑖(𝙯) + ℎ(𝒘) ,

where ℎis some convex regularization term.

Previously, Domke (2019, Theorem 6) established a bound on the gradient variance of the reparameterization gradient (Kingma & Welling, 2014; Rezende et al., 2014; Titsias & L azaro-Gredilla, 2014) under the doubly stochastic setting. This bound also incorporates more advanced subsampling strategies such as importance sampling Gower et al. (2019); Gorbunov et al. (2020); Csiba & Richt arik (2018); Needell et al. (2016); Needell & Ward (2017). However, he did not extend the analysis to a complexity analysis of SGD and left out the effect of correlation between components.

C.2.2. PRELIMINARIES

The properties of the reparameterization gradient for when 𝑞𝒘is in the location-scale family were studied by Domke (2019).

Assumption 9. We assume the variational family

satisfies the following:

(a) 𝒬is part of the location-scale family such that 𝒯𝒘(𝒖) = 𝑪𝒖+ 𝒎.

(b) The scale matrix is positive definite such that 𝑪 0.

(c) 𝙪= (𝘶1, , 𝘶𝑑𝒛) constitute of i.i.d. components, where each component is standardized, symmetric, and finite kurtosis such that 𝔼𝘶𝑖= 0, 𝔼𝘶2 𝑖= 1, 𝔼𝘶3 𝑖= 0, and 𝔼𝘶4 𝑖= 𝑘𝜑, where 𝑘𝜑is the kurtosis.

Under these conditions, Domke (2019) proves the following:

Lemma 13 (Domke, 2019; Theorem 3). Let Assumption 9 hold and 𝓁𝑖be 𝐿𝑖-smooth. Then, the squared norm of the reparameterization gradient is bounded:

𝔼 𝙜𝑖(𝒘) 2 2 (𝑑+ 1) 𝒎 𝒛𝑖 2 2 + (𝑑+ 𝑘𝜑 ) 𝑪 2 F

for all 𝒘= (𝒎, 𝑪) 𝒲and all stationary points of 𝓁𝑖 denoted with 𝒛𝑖.

Demystifying SGD with Doubly Stochastic Gradients

Similarly, Kim et al. (2023) establish the QES condition as part of Lemma 3 (Kim et al., 2023). We refine this into statement we need:

Lemma 14. Let Assumption 9 hold and 𝓁𝑖be 𝐿𝑖-smooth. Then, the squared norm of the reparameterization gradient is bounded:

𝔼 𝙜𝑖(𝒘) 𝙜𝑖 (𝒘 ) 2

2 𝐿2 𝑖 (𝑑+ 𝑘𝜑 ) 𝒘 𝒘𝑖 2 2

for all 𝒘, 𝒘 𝒲.

𝔼 𝙜𝑖(𝒘) 𝙜𝑖 (𝒘 ) = 𝔼 𝒘𝓁𝑖(𝒯𝒘(𝙪)) 𝒘𝓁𝑖(𝒯𝒘 (𝙪)) 2 2

𝒘 𝓁𝑖(𝒯𝒘(𝙪)) 𝒯𝒘 (𝙪)

𝒘 𝓁𝑖(𝒯𝒘 (𝙯))

= 𝔼( 𝓁𝑖(𝒯𝒘(𝙪)) 𝓁𝑖(𝒯𝒘 (𝙪))) ( 𝒯𝒘(𝙪)

( 𝓁𝑖(𝒯𝒘(𝙪)) 𝓁𝑖(𝒯𝒘 (𝙪))) . As shown by Kim et al. (2023, Lemma 6), the squared Jacobian is an identity matrix scaled with a scalar-valued function independent of 𝒘, 𝐽𝒯(𝒖) = 𝒖 2 2 + 1, such that

= 𝔼𝐽𝒯(𝙪) 𝓁𝑖(𝒯𝒘(𝙪)) 𝓁𝑖(𝒯𝒘 (𝙪)) 2 2, applying the 𝐿𝑖-smoothness of 𝓁𝑖,

= 𝐿2 𝑖𝔼𝐽𝒯(𝙪) 𝒯𝒘(𝙪) 𝒯𝒘 (𝙪) 2 2,

and Kim et al. (2023, Corollary 2) show that,

𝐿2 𝑖 (𝑑+ 𝑘𝜑 ) 𝒘 𝒘 2 2.

Lastly, the properties of 𝓁𝑖are known to transfer to the expectation 𝑓𝑖as follows:

Lemma 15. Let Assumption 9 hold. Then we have the following:

(i) Let 𝓁𝑖be 𝐿𝑖smooth. Then, 𝑓𝑖is also 𝐿𝑖-smooth

(ii) Let 𝓁𝑖be convex. Then, 𝑓𝑖and 𝐹are also convex.

(iii) Let 𝓁𝑖be 𝜇-strongly convex. Then, 𝑓𝑖and 𝐹are also 𝜇-strongly convex.

Proof. (i) is proven by Domke (2020, Theorem 1), while a more general result is provided by Kim et al. (2023, Theorem 1); (ii) and (iii) are proven by Domke (2020, Theorem 9) and follow from the fact that ℎis convex.

C.2.3. THEORETICAL ANALYSIS

We now conclude that the reparameterization gradient fits the framework of this work:

Proposition 11. Let Assumption 9 hold and 𝓁𝑖be convex and 𝐿𝑖-smooth. Then, Assumption 5 holds.

Proof. The result follows from combining Lemma 15 and Lemma 10.

From Lemma 13, we satisfy 7.

Proposition 12. Let Assumption 9 hold, 𝓁𝑖be 𝐿𝑖-smooth, the solutions 𝒘 arg min𝒘 𝒲𝐹(𝒘) and the stationary points of 𝓁𝑖, 𝒛, be bounded such that 𝒘 2 < and 𝒛 2 < . Then, Assumption 7 holds.

Proof. Lemma 13 implies that, as long as the 𝒘 and 𝒛are bounded, we satisfy the component gradient estimator part of Assumption 7, where the constant is given as

𝜎2 𝑖= 𝐿2 𝑖(𝑑+ 1) 𝒎 𝒛𝑖 2 2 + 𝐿2 𝑖 (𝑑+ 𝑘𝜑 ) 𝑪 2 F,

where 𝒘 = (𝒎 , 𝑪 ).

From Lemma 14, we can conclude that the reparameterization gradient satisfies Assumption 6:

Proposition 13. Let Assumption 9 hold and 𝓁𝑖be 𝐿𝑖-smooth and 𝜇-strongly convex. Then, Assumption 6 (ACVX) and Assumption 6 (B) hold.

Proof. Notice the following:

1. Assumption 4 always holds for 𝜌= 1.

2. From the stated conditions, Lemma 15 establishes that both 𝑓𝑖and 𝐹are 𝜇-strongly convex.

3. 𝜇-strong convexity of 𝑓and 𝐹implies that both are 𝜇-QFG (Karimi et al., 2016, Appendix A).

4. The reparameterization gradient satisfies the QES condition by Lemma 14.

Item 1, 2 and 3 combined imply the ES condition by Proposition 8, which immediately implies the ER condition with the same constant. Therefore, we satisfy both Assumption 6 (ACVX), Assumption 6 (B) where the ER constant ℒ𝑖is given as

ℒ𝑖= 𝐿2 𝑖 𝜇 (𝑑+ 𝑘𝜑 ) .