# Near-Optimality of Contrastive Divergence Algorithms

Pierre Glaser, Kevin Han Huang, Arthur Gretton
Gatsby Computational Neuroscience Unit, University College London
pierreglaser@gmail.com, han.huang.20@ucl.ac.uk, arthur.gretton@gmail.com

Abstract. We perform a non-asymptotic analysis of the contrastive divergence (CD) algorithm, a training method for unnormalized models. While prior work has established that (for exponential family distributions) the CD iterates asymptotically converge at an $O(n^{-1/3})$ rate to the true parameter of the data distribution, we show, under some regularity assumptions, that CD can achieve the parametric rate $O(n^{-1/2})$. Our analysis provides results for various data batching schemes, including the fully online and minibatch ones. We additionally show that CD can be near-optimal, in the sense that its asymptotic variance is close to the Cramér-Rao lower bound.

1 Introduction

Describing data using probability distributions is a central task in multiple scientific and industrial disciplines [1, 2, 3]. Since the true distribution of the data is generally unknown, this task requires finding, among a model class, an estimator of the true distribution that best describes the available data. An estimator can be characterized at multiple levels of granularity: at the highest level lies consistency [4], the property that, as the number of available data points increases, a given estimator converges to the one best describing the data distribution. At a lower level, a consistent estimator can be further characterized by its convergence rate, a quantity upper bounding its distance to the true distribution as a function of the number of samples. A convergence rate can be either asymptotic, i.e. hold only in the limit of an infinite sample size, or non-asymptotic, in which case the rate also holds for finite sample sizes. In their simplest form, convergence rates are stated in big-O notation, discarding finer-grained information such as asymptotically dominated quantities and multiplicative constants. These constants enter the so-called asymptotic variance of the estimator, a precise descriptor of an estimator's statistical efficiency. Convergence rates and asymptotic variances have been the subject of extensive research in the statistical literature; in particular, well-known lower bounds exist for both the best possible (asymptotic) convergence rate of an estimator and its best possible asymptotic variance. These results set a clear frame of reference for interpreting individual convergence rates, and appear routinely in the analysis of modern statistical algorithms such as noise-contrastive estimation [5, 6] or score matching [7, 8, 9]. In this work, we focus on cases where (1) the true data distribution admits a density with respect to some known base measure, and (2) the model class is parametrized by a finite-dimensional parameter. In this setting, provided that the true distribution belongs to the model class, a celebrated result in statistical estimation states that the model maximizing the average log-likelihood achieves both the best possible asymptotic convergence rate (called the parametric rate) and the best possible asymptotic variance, called the Cramér-Rao bound (see, e.g., [10]).
While this result shows that Maximum Likelihood Estimators (MLE) are asymptotically optimal, fitting them is complicated by computational hurdles when using models with intractable normalizing constants. Such unnormalized models are common in the Machine Learning literature due to their high flexibility [11, 12]; their weakness, however, lies in the fact that expectations under these models admit no unbiased approximation. For this reason, popular approximation algorithms such as unbiased gradient-based stochastic optimization of the empirical log-likelihood cannot a priori be used, as the gradient of the normalizing constant is given by an expectation under the model distribution. The Contrastive Divergence (CD) algorithm [13] is a popular approach that circumvents this issue by using a Markov Chain Monte Carlo (MCMC) algorithm to approximate the gradient of the log-likelihood. Unnormalized models trained with Contrastive Divergence have been shown to reach competitive performance in high-dimensional tasks such as image [14, 15, 16], text [17], and protein modeling [18, 19], or neuroscience [20]. A consistency analysis of the Contrastive Divergence algorithm is delicate, however: the optimization error, i.e. the difference between the estimate returned by CD and the MLE, is likely to be non-negligible compared with the statistical error, i.e. the distance between the MLE and the true distribution, and thus cannot be discarded, as is often done when analyzing estimators that minimize tractable objectives [8, 5]. Recent work [21] elegantly established asymptotic $O(n^{-1/3})$ consistency of the CD estimator for unnormalized exponential families when using only a finite number of MCMC steps. Key to their argument is the fact that the bias of the CD gradient estimate decreases as the iterates approach the data distribution. However, as noted by the authors, their work left open the question of whether, and under what conditions, CD might achieve $O(n^{-1/2})$ consistency.

Contributions. In this work, we answer this question by providing a non-asymptotic analysis of the CD algorithm for unnormalized exponential families. While existing convergence bounds [21] were derived for the full-batch setting, where the CD gradient is estimated using the full dataset at each iteration, our analysis covers both the online setting (where data points are processed one at a time without replacement) and the offline setting with multiple data reuse strategies (including full batch). In the online case (Section 3), we show, under a reduced set of assumptions compared to Jiang et al. [21], that the CD iterates can converge to the true parameter at the parametric $O(n^{-1/2})$ rate. Our analysis reveals that CD involves two sources of approximation: a bias term and a variance term. These sources are almost independent of each other, in the sense that decreasing the bias by increasing the number of MCMC steps will not decrease the variance. The impact of these two sources of approximation propagates transparently into our resulting bounds: in particular, as the bias of the CD algorithm goes to 0, our bounds recover well-known results in online stochastic optimization [22]. Finally, we study the asymptotic variance of an estimator obtained by averaging the CD iterates, a classic acceleration technique in stochastic optimization [23].
We show that, provided the number of MCMC steps $m$ is sufficiently large, the asymptotic variance of this estimator matches the Cramér-Rao bound up to a factor of 4. Next, we study the offline setting (Section 4), where the CD gradient is estimated by reusing (potentially random) subsets of a finite dataset. We show that a result similar to the online setup holds, up to an additional correlation term that arises from data reuse, and present several approaches to control this term. We improve over the results of [21] by showing a non-asymptotic and near-parametric rate of $O((\log n)^{1/2} n^{-1/2})$ under their conditions, and also illustrate how different rates can be obtained under a variety of conditions. Our results also exhibit an interesting tradeoff between the effect of initialization and the statistical error as a function of the batch size. In summary, we establish the near-optimality of a variety of Contrastive Divergence algorithms for unnormalized exponential families in the so-called long-run regime, where the number of MCMC steps is high enough to ensure that the CD gradient bias is sufficiently offset by the convexity of the negative log-likelihood.

2 Contrastive Divergence in Unnormalized Exponential Families

Unnormalized Exponential Families. Exponential families (EF) [24, 25] form a well-studied class of probability distributions, given by

$$p_\psi(dx) := e^{\psi^\top \phi(x) - \log Z(\psi)}\, c(dx), \qquad Z(\psi) := \int_{\mathcal{X}} e^{\psi^\top \phi(x)}\, c(dx). \tag{1}$$

Here, $\mathcal{X} \ni x$ is the data or sample space, which we take to be a subset of $\mathbb{R}^d$ for some $d \in \mathbb{N}$, although our results readily extend to more general measurable spaces; $c$ is a measure on $\mathcal{X}$ called the base or carrier measure. When $\mathcal{X} \subseteq \mathbb{R}^d$, $c$ is often taken to be the corresponding Lebesgue measure. $\psi \in \Psi \subseteq \mathbb{R}^p$ is a finite-dimensional parameter called the natural parameter, and $\phi : \mathbb{R}^d \to \mathbb{R}^p$ is a function called the sufficient statistics, which, together with the base measure, fully describes an exponential family. Finally, $\log Z(\psi)$, the log-normalizing (or cumulant) function, ensures that $p_\psi$ integrates to 1 over $\mathcal{X}$. Crucially, we will not assume that $\log Z(\psi)$ admits a closed-form expression for all $\psi$. The latter fact provides the practitioner with a great deal of flexibility in designing the model class: indeed, the only requirement to satisfy prior to performing statistical estimation is that $Z(\psi) < +\infty$ for all $\psi$, which can be readily verified and often holds in practice. The drawback of unnormalized EFs is that sampling (and thus approximating expectations under the model) cannot usually be performed in an unbiased manner. Instead, inference in unnormalized EFs is often performed using tools from the Bayesian inference literature, such as MCMC [26]. Unnormalized EFs belong to the larger class of unnormalized models [27, 28, 7, 6], of the form $e^{-E_\psi(x) - \log Z(\psi)}\, c(dx)$ with $Z(\psi) = \int e^{-E_\psi(x)}\, c(dx)$, for some parametrized function $E_\psi : \mathbb{R}^d \to \mathbb{R}$ referred to as the energy. Unnormalized models thus take the flexibility of unnormalized EFs one step further by allowing the (negative) unnormalized log-density to be an arbitrary function $E_\psi$ of $x$ and $\psi$, instead of requiring a linear dependence on $\psi$ as in Equation (1). We focus in this work on unnormalized EFs due to the multiple computational benefits they provide, as explained in the next section, but we believe that extending our analysis to more general unnormalized models is an interesting avenue for future work.
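As a concrete illustration (our own toy example, not one from the paper), the sketch below defines a one-dimensional unnormalized exponential family of the form of Equation (1) with sufficient statistics $\phi(x) = (x, -x^2)$; for $\psi = (1.0, 0.5)$, $\psi^\top\phi(x) = x - 0.5x^2$ is the log-density of a Gaussian $\mathcal{N}(1, 1)$ up to constants. Note that only the unnormalized log-density is ever evaluated; $\log Z(\psi)$ is never computed.

```python
import numpy as np

# A toy unnormalized exponential family on R (Equation 1) with
# sufficient statistics phi(x) = (x, -x^2). In general, log Z(psi)
# has no closed form and is never needed below.

def phi(x):
    """Sufficient statistics phi: R -> R^2 (vectorized over x)."""
    return np.stack([x, -x**2], axis=-1)

def unnorm_log_density(x, psi):
    """Unnormalized log-density psi^T phi(x); log Z(psi) is left intractable."""
    return phi(x) @ psi

psi = np.array([1.0, 0.5])  # corresponds to N(1, 1) up to normalization
print(unnorm_log_density(np.array([0.0, 1.0, 2.0]), psi))
```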
Statistical Estimation in Unnormalized Exponential Families using Contrastive Divergence. We now review the Contrastive Divergence algorithm, a method for fitting unnormalized models and our main object of study in this work. The general setting is the following: we assume access to $n$ i.i.d. samples $(X_1, \dots, X_n)$ drawn from some unknown distribution $p^*$, which we assume belongs to the model class, i.e. $p^* = p_{\psi^*}$ for some $\psi^* \in \Psi$. Given these samples, we aim to perform statistical estimation, i.e. find a parameter $\psi_n$ within $\Psi$ that approaches $\psi^*$ as $n$ grows. The starting point of the Contrastive Divergence algorithm is the unfortunate realization that Maximum Likelihood Estimation, which corresponds to minimizing the cross-entropy $L(\psi) := -E_{p_n}[\log \frac{dp_\psi}{dc}]$ between the model $p_\psi$ and the empirical data distribution $p_n := \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$, cannot be performed using exact (possibly stochastic) gradient-based optimization, as the gradient $\nabla_\psi L(\psi)$ of $L$ with respect to the parameter $\psi$ contains an expectation under the model distribution $p_\psi$. Indeed, the cross-entropy and its gradient are given by

$$L(\psi) = -\frac{1}{n}\sum_{i=1}^n \phi(X_i)^\top \psi + \log Z(\psi), \qquad \nabla_\psi L(\psi) = -\frac{1}{n}\sum_{i=1}^n \phi(X_i) + E_{p_\psi}\phi. \tag{2}$$

The second identity follows from the well-known fact that $\nabla_\psi \log Z(\psi) = E_{p_\psi}\phi$; we refer to [25, Proposition 3.1] for a proof.
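For completeness, this identity can be recovered by differentiating $\log Z$ under the integral sign, a standard computation reproduced here as a sanity check:

$$\nabla_\psi \log Z(\psi) = \frac{\nabla_\psi Z(\psi)}{Z(\psi)} = \frac{\int_{\mathcal{X}} \phi(x)\, e^{\psi^\top \phi(x)}\, c(dx)}{Z(\psi)} = \int_{\mathcal{X}} \phi(x)\, p_\psi(dx) = E_{p_\psi}\phi.$$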
The Contrastive Divergence algorithm circumvents this issue by running approximate stochastic gradient descent (SGD) on $L$, where the intractable expectation in $\nabla_\psi \log Z$ is estimated using an MCMC algorithm initialized at the empirical data distribution. In more detail, given a number of epochs $T$, a sequence of data batches $B_{t,j}$ of size $B$ (i.e. $B_{t,j} \subseteq [[1,n]]$ with $|B_{t,j}| = B$, for $1 \le t \le T$ and $1 \le j \le N := n/B$), and a family of Markov kernels $\{k_\psi, \psi \in \Psi\}$, each with invariant distribution $p_\psi$, at the $j$-th minibatch of epoch $t$, $\nabla_\psi \log Z(\psi_{t,j-1})$ is approximated by $\frac{1}{B}\sum_{i \in B_{t,j}} \phi(\tilde{X}^m_i)$, where $\tilde{X}^m_i$ is produced by running the recursion $\tilde{X}^k_i \sim k_{\psi_{t,j-1}}(\tilde{X}^{k-1}_i, \cdot)$, $\tilde{X}^0_i = X_i$, up to $k = m$. Throughout the paper, we refer to the conditional distribution of $\tilde{X}^m_i$ given $X_i$ as $k^m_\psi(X_i, \cdot)$. The resulting gradient estimate, obtained by combining this approximation with the other (tractable) sum over the data samples present in $\nabla_\psi L(\psi)$, is what we call the CD gradient and denote by $h_{t,j}$:

$$h_{t,j} = -\frac{1}{B}\sum_{i \in B_{t,j}} \phi(X_i) + \frac{1}{B}\sum_{i \in B_{t,j}} \phi(\tilde{X}^m_i) = \frac{1}{B}\sum_{i \in B_{t,j}} \big(\phi(\tilde{X}^m_i) - \phi(X_i)\big). \tag{3}$$

Key to the behavior and analysis of the CD algorithm is the strategy employed to generate the minibatches $B_{t,j}$. The case where $T = 1$, $B = 1$, and $B_{1,j} = \{j\}$ will be referred to as online CD, while the variant where $T > 1$ and each batch $B_{t,j}$ draws $B$ indices (with or without replacement) from $[[1,n]]$ will be referred to as offline CD. In online CD, each data point belongs to one and only one batch, while in offline CD, data points are reused across batches. From a statistical perspective, we will see that online CD can be analyzed in a remarkably simple way, while offline CD introduces additional correlations that require care to control. Both settings come with advantages and drawbacks, as we will see in the next section. The CD algorithms we study employ decreasing step size schedules $(\eta_t)_{t \ge 0}$ of the form $\eta_t = Ct^{-\beta}$, where $C > 0$ is the initial learning rate and $\beta \in [0,1]$. We lay out online CD and offline CD in Algorithms 1 and 2. Note that our algorithms include a projection step onto the parameter space $\Psi$ to account for the case where $\Psi$ is compact; when $\Psi = \mathbb{R}^p$, this step can be omitted. Next, we depart from the setting of [21] and begin by analyzing online CD.

Algorithm 1 (Online CD)
Input: $(X_1, \dots, X_n)$ i.i.d. $\sim p_{\psi^*}$
Parameters: model class $\{p_\psi, \psi \in \Psi\}$, Markov kernels $\{k_\psi, \psi \in \Psi\}$, number of MCMC steps $m$, learning rate schedule $\eta_t := Ct^{-\beta}$ with $\beta \in [0,1]$ and $C > 0$, initial parameter $\psi_0$
for $t = 1, \dots, n$ do
  $\tilde{X}^m_t \sim k^m_{\psi_{t-1}}(X_t, \cdot)$  // approximate sample from $p_{\psi_{t-1}}$
  $h_t := \phi(\tilde{X}^m_t) - \phi(X_t)$
  $\psi_t \leftarrow \psi_{t-1} - \eta_t h_t$
  $\psi_t \leftarrow \mathrm{Proj}_\Psi(\psi_t)$
end for
return $\psi_n$

Algorithm 2 (Offline CD)
Input: $(X_1, \dots, X_n)$ i.i.d. $\sim p_{\psi^*}$
Parameters: same as Algorithm 1, plus number of epochs $T$, batch size $B$, batching schedule $B_{t,j}$, initial parameter $\psi_{0,0}$
for $t = 1, \dots, T$ do
  for $j = 1, \dots, N = n/B$ do
    $\tilde{X}^m_{t,i,j} \sim k^m_{\psi_{t,j-1}}(X_i, \cdot)$ for $i \in B_{t,j}$
    $h_{t,j} := \frac{1}{B}\sum_{i \in B_{t,j}} \big(\phi(\tilde{X}^m_{t,i,j}) - \phi(X_i)\big)$
    $\psi_{t,j} \leftarrow \psi_{t,j-1} - \eta_t h_{t,j}$
    $\psi_{t,j} \leftarrow \mathrm{Proj}_\Psi(\psi_{t,j})$
  end for
end for
return $\psi_{T,N}$
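To make Algorithm 1 concrete, below is a minimal, self-contained sketch of online CD for the toy Gaussian-type family introduced earlier. The random-walk Metropolis kernel, the ball-shaped $\Psi$, and all constants are illustrative assumptions of ours, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # Sufficient statistics of the toy family: phi(x) = (x, -x^2).
    return np.array([x, -x**2])

def unnorm_log_density(x, psi):
    return psi @ phi(x)

def mcmc_step(x, psi, step=0.5):
    # One random-walk Metropolis step whose invariant distribution is p_psi.
    prop = x + step * rng.standard_normal()
    log_accept = unnorm_log_density(prop, psi) - unnorm_log_density(x, psi)
    return prop if np.log(rng.uniform()) < log_accept else x

def project(psi, radius=10.0):
    # Projection onto an illustrative Psi = {||psi|| <= radius}, convex and compact.
    norm = np.linalg.norm(psi)
    return psi if norm <= radius else psi * (radius / norm)

def online_cd(X, psi0, m=10, C=0.5, beta=1.0):
    psi = psi0.copy()
    for t, x in enumerate(X, start=1):
        x_tilde = x                      # the chain restarts at the data point
        for _ in range(m):               # m MCMC steps under the current psi
            x_tilde = mcmc_step(x_tilde, psi)
        h = phi(x_tilde) - phi(x)        # CD estimate of grad L(psi), Equation (3)
        psi = project(psi - C * t**(-beta) * h)
    return psi

# Data drawn from psi* = (1.0, 0.5), i.e. N(1, 1) up to normalization.
X = rng.normal(loc=1.0, scale=1.0, size=5000)
print(online_cd(X, psi0=np.zeros(2), m=10))
```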
3 Non-asymptotic analysis of Online CD

3.1 Preliminaries and Assumptions

Recall that the chi-squared divergence between two probability measures $p$ and $q$ is defined as $\chi^2(p, q) := \int \big(\frac{dp}{dq}(x) - 1\big)^2 q(dx)$ if $p \ll q$, and $+\infty$ otherwise. Here, $p \ll q$ denotes that $p$ is absolutely continuous with respect to $q$, and $dp/dq$ is the Radon-Nikodym derivative [29] of $p$ with respect to $q$. Let $L^2(p_\psi)$ be the space of square-integrable functions with respect to $p_\psi$. For a function $f \in L^2(p_\psi)$, we define

$$\alpha(f, \psi) := \frac{\Big(\int \big(\int (f - E_{p_\psi}f)(y)\, k_\psi(x, dy)\big)^2 p_\psi(dx)\Big)^{1/2}}{\Big(\int (f - E_{p_\psi}f)(x)^2\, p_\psi(dx)\Big)^{1/2}}, \tag{4}$$

which is a measure of how quickly a Markov chain with kernel $k_\psi$ mixes, relative to the function $f$ [30]. With these definitions in hand, we now state the assumptions required by our analysis of online CD. These assumptions form a strict subset of the assumptions considered in prior work [21], which required additional regularity and tail conditions on the Markov kernels $k_\psi$.

Assumption A1. $\{p_\psi\}$ is a subset of a regular and minimal [25, Section 3.2] exponential family with natural parameter domain $D \subseteq \mathbb{R}^p$, $\Psi$ is a convex and compact subset of $D$, and $\psi^*$ lies in the interior of $\Psi$.

Assumption A2. There exists a constant $C_\chi > 0$ such that $\chi^2(p_{\psi^*}, p_\psi) \le C_\chi^2 \|\psi - \psi^*\|^2$ for all $\psi \in \Psi$.

Assumption A3. $\alpha := \sup\{\alpha(f, \psi) : f \in \{\phi_i\}_{i=1}^p \cup \{\phi_i\phi_j\}_{i,j=1}^p,\ \psi \in \Psi\} < 1$, where $\phi_i$ is the $i$-th component of the function $\phi$ and $\phi_i\phi_j$ denotes the pointwise product $x \mapsto \phi_i(x)\phi_j(x)$.

A well-known property of EFs [25, Proposition 3.1] is that their cross-entropy against any other measure is $C^\infty$ and convex, strictly so if the exponential family is minimal (meaning that the sufficient statistic functions $\phi_i$ are not linearly dependent). Leaving aside the issue of intractable expectations, this convexity suggests that $L$ can be efficiently minimized using stochastic approximation algorithms [31, 22]. The compactness of $\Psi$ provided by Assumption A1 thus ensures, by the extreme value theorem [32], the existence of finite positive constants $\mu$ and $L$ defined as

$$\mu := \min_{\psi \in \Psi} \lambda_{\min}\big(\nabla^2_\psi L(\psi)\big), \qquad L := \max_{\psi \in \Psi} \lambda_{\max}\big(\nabla^2_\psi L(\psi)\big), \tag{5}$$

where $\nabla^2_\psi L$ is the Hessian of $L$ with respect to $\psi$. $\mu$ (called the strong convexity constant) and $L$ (a bound controlling the smoothness of the problem) play a critical role in the analysis of convex optimization algorithms [31]. While it is possible to obtain convergence rates in non-smooth or non-strongly-convex settings, our analysis follows the spirit of [21] by leveraging the strong convexity of the problem to compensate for the bias introduced by using CD gradients instead of unbiased stochastic gradients. Assumption A2 allows us to link variations in distribution space to variations in parameter space, and will be instrumental in controlling the bias of the CD gradient. Note that since $\chi^2(p_{\psi^*}, p_\psi) = e^{\log Z(2\psi^* - \psi) - (2\log Z(\psi^*) - \log Z(\psi))} - 1$ provided that $2\psi^* - \psi \in D$ (see [33, Lemma 1]), we expect Assumption A2 to hold in many cases of interest. On the other hand, the possible exponential scaling of $C_\chi$ with respect to $\log Z$ suggests that this constant may be large in some instances. Assumption A3 is a restricted spectral gap condition: it guarantees that the time required by the MCMC algorithm to estimate expectations of $\phi$ and $\phi^2$ under $p_\psi$ is uniformly bounded. This assumption is weaker than the (unrestricted) uniform spectral gap condition of [21], which requires that $\alpha$ control the convergence rate of all functions in $L^2(p_\psi)$. Note that standard results in stochastic analysis [34] guarantee that $\alpha \le 1$; it thus only remains to ensure that $\alpha$ is strictly less than 1. Spectral gaps depend strongly on two properties of a distribution: its tail behavior and its multimodality. While multimodality poses the risk of pushing the constant $\alpha$ close to 1, very heavy-tailed distributions may not satisfy a spectral gap condition at all.

3.2 Results

3.2.1 Parametric convergence of online CD

In this section, we show that under the assumptions stated in Section 3.1, the iterates $\psi_t$ produced by the online CD algorithm described in Algorithm 1 converge to the true parameter $\psi^*$ at the parametric rate $O(n^{-1/2})$. To do so, we follow a well-known paradigm in convex optimization [22] by deriving a recursion on the quantity $\delta_t := E\|\psi_t - \psi^*\|^2$, which, once unrolled, yields convergence rates for the iterates $\psi_t$. We aim to characterize precisely the impact of performing CD as opposed to performing online SGD on $L$, which would consist of replacing the CD gradient $h_t$ of Algorithm 1 by the unbiased (stochastic) gradient

$$g_t(\psi) := -\phi(X_t) + \nabla_\psi \log Z(\psi), \tag{6}$$

which satisfies $E[g_t(\psi)] = \nabla_\psi L(\psi)$. The only stochasticity in $g_t$ comes from the sampling of a single data point $X_t$ from the true distribution, which is unavoidable in the online setting, and we have $E\|g_t(\psi^*)\|^2 = \mathrm{Tr}(\mathrm{Cov}_{p_{\psi^*}}\phi) =: \sigma_*^2$. $\sigma_*^2$ plays a key role in the analysis of Stochastic Gradient Descent [22]. We expect that replacing $g_t$ by $h_t$ will introduce two sources of approximation: a bias term, coming from using a finite number of MCMC steps $m$, and an additional variance term, coming from using a single sample $\tilde{X}^m_t$ to estimate $\nabla_\psi \log Z(\psi_{t-1})$. With that in mind, we derive a recursion on $\delta_t$ in the following lemma.

Lemma 3.1. Let $(\psi_t)_{0 \le t \le n}$ be the iterates from Algorithm 1. Denote $\delta_t = E\|\psi_t - \psi^*\|^2$, $\sigma_* = (E_{p_{\psi^*}}\|\phi - E_{p_{\psi^*}}\phi\|^2)^{1/2}$, and $\sigma_t = (E_{p_{\psi_t}}\|\phi - E_{p_{\psi_t}}\phi\|^2)^{1/2}$. Then, under A1, A2 and A3, for all $t \ge 1$,

$$\delta_t \le \big(1 - 2\eta_t \mu_{m,t-1} + 2\eta_t^2 L^2\big)\,\delta_{t-1} + 2\eta_t^2 \sigma_{m,t-1}^2 + 4\alpha^{m/2}\eta_t^2\, \|\log Z\|_{3,\infty}\, C_\chi\, \delta_{t-1}^{1/2}, \tag{7}$$

where $\|\log Z\|_{3,\infty}$ is a constant, $\mu_{m,t} := \mu - \alpha^m \sigma_t C_\chi$, and $\sigma_{m,t} := (\sigma_*^2 + \sigma_t^2 + 2\sigma_t^2 \alpha^{2m})^{1/2}$.

Lemma 3.1 is proved in Appendix D.2, which details the form of $\|\log Z\|_{3,\infty}$, a constant that we expect to scale roughly as $dL$. Loosely speaking, this recursion suggests that as the learning rate $\eta_t$ goes to 0, the two terms scaling in $\eta_t^2$ become negligible, in which case we have $\delta_t \le (1 - 2\eta_t \mu_{m,t-1})\delta_{t-1} < \delta_{t-1}$, yielding convergence of $\delta_t$ to 0. We make these arguments formal in the next theorem. The reader familiar with the convex optimization literature will note the similarities between this recursion and the one derived in [22], which would apply as-is to online SGD on $L$ using $g_t$. The difference between the two recursions is that the roles of the strong convexity constant $\mu$ and the noise $\sigma_*$ are now played respectively by $\mu_{m,t-1} = \mu - \alpha^m \sigma_{t-1} C_\chi$ and $\sigma^2_{m,t-1} = \sigma_*^2 + \sigma_{t-1}^2 + 2\sigma_{t-1}^2 \alpha^{2m}$. These two modifications respectively characterize the impact of the bias and the additional variance introduced by the CD gradient. The last term in Equation (7), scaling in $\alpha^{m/2}\eta_t^2 \delta_{t-1}^{1/2}$, is a residual higher-order mixed term coming from relating the variance of the Markov chain sample $\tilde{X}^m_t$ to $\sigma_t^2$. This term can be easily controlled, as done next, and disappears as $m \to \infty$.
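To visualize how the recursion of Lemma 3.1 behaves, the sketch below iterates Equation (7) numerically with hypothetical constants of our choosing (the values of $\mu_{m,t}$, $L$, $\sigma_{m,t}$ and the residual coefficient are made up for illustration, and $C > 2/\mu_m$ as required by Theorem 3.2 below): with $\beta = 1$, the product $\delta_t \cdot t$ stabilizes, i.e. the squared error decays at an $O(1/t)$ rate.

```python
import numpy as np

# Iterate the deterministic recursion of Lemma 3.1 (Equation 7) with
# hypothetical constants, purely to visualize its decay.
mu_m, L, sigma_m2, res = 0.5, 1.0, 1.0, 0.01  # mu_{m,t}, L, sigma_{m,t}^2, residual coeff.
C, beta = 5.0, 1.0                            # step schedule eta_t = C * t**(-beta)
delta = 1.0                                   # delta_0 = E||psi_0 - psi*||^2
for t in range(1, 100001):
    eta = C * t**(-beta)
    delta = ((1 - 2 * eta * mu_m + 2 * eta**2 * L**2) * delta
             + 2 * eta**2 * sigma_m2
             + 4 * res * eta**2 * np.sqrt(delta))
    if t in (10, 100, 1000, 10000, 100000):
        print(t, delta, delta * t)  # delta * t stabilizes => O(1/t) decay
```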
Investigating the impact of $m$ in the recursion, we notice that as $m \to \infty$, $\mu_{m,t} \to \mu$. As we will see below, this ensures that CD converges for a sufficiently high $m$. On the other hand, in that same regime, $\sigma_{m,t}$ does not converge to $\sigma_*$, but rather to $(\sigma_*^2 + \sigma_t^2)^{1/2}$, showing the irreducible impact of the variance term. While we precisely investigate the impact of the residual variance term in the next section, we now unify $\sigma_*$ and $\sigma_t$ by introducing

$$\sigma := \sup_{\psi \in \Psi}\big(E_{p_\psi}\|\phi - E_{p_\psi}\phi\|^2\big)^{1/2}. \tag{8}$$

$\sigma$ is an upper bound on the noise induced both by the CD gradient and by the online setup, and was used in prior work [21]. Note that by the properties of $\log Z$, $\sigma^2$ also equals $\sup_{\psi \in \Psi} \mathrm{tr}(\nabla^2_\psi L(\psi))$, where $\mathrm{tr}(A)$ is the trace of $A \in \mathbb{R}^{p \times p}$, and is thus finite by the extreme value theorem. The following theorem is obtained by invoking standard unrolling arguments from the convex optimization literature. In the next result, we use the function $\varphi_\gamma(t)$, defined as $\varphi_\gamma(t) = \frac{t^\gamma - 1}{\gamma}$ if $\gamma \ne 0$, and $\log t$ if $\gamma = 0$.

Theorem 3.2. Fix $n \ge 1$. Let $(\psi_t)_{0 \le t \le n}$ be the iterates produced by Algorithm 1, and define $\delta_t := E\|\psi_t - \psi^*\|^2$. Moreover, assume that $m > \frac{\log(\sigma C_\chi/\mu)}{|\log \alpha|}$, i.e. $\mu_m := \mu - \alpha^m \sigma C_\chi > 0$. Then under Assumptions A1, A2 and A3, for $\eta_t = Ct^{-\beta}$ with $C > 0$, we have:

$$\delta_n \le \begin{cases} 2\exp\big(4\bar{L}^2C^2\varphi_{1-2\beta}(n)\big)\exp\big(-\tfrac{\mu_m C}{4}\, n^{1-\beta}\big)\Big(\delta_0 + \tfrac{\sigma_m^2}{\bar{L}^2}\Big) + \dfrac{4C\sigma_m^2}{\mu_m n^\beta}, & \text{if } 0 \le \beta < 1, \\[2mm] \dfrac{\exp(2\bar{L}^2C^2)}{n^{\mu_m C}}\Big(\delta_0 + \tfrac{\sigma_m^2}{\bar{L}^2}\Big) + 2\sigma_m^2 C^2\, \dfrac{\varphi_{\mu_m C/2 - 1}(n)}{n^{\mu_m C/2}}, & \text{if } \beta = 1, \end{cases}$$

where $\sigma_m^2 = \sigma^2(2 + 2\alpha^{2m}) + \alpha^{m/2}\|\log Z\|^2_{3,\infty} C_\chi^2$ and $\bar{L} = (L^2 + \alpha^{m/2})^{1/2}$. Consequently, taking $\beta = 1$ with an initial learning rate $C > 2\mu_m^{-1}$, we have as $n \to \infty$

$$\delta_n \le 2\sigma_m C\sqrt{\frac{\mu_m C}{\mu_m C - 2}}\cdot\frac{1}{n} + o\Big(\frac{1}{n}\Big).$$

Theorem 3.2 is proved in Appendix D.3. It shows that the iterates produced by online CD converge to the true parameter $\psi^*$ at the rate $O(n^{-1/2})$ provided that the number of steps $m$ is sufficiently large, improving over the asymptotic $O(n^{-1/3})$ rate of [21] while imposing slightly weaker conditions on the number of steps $m$ (see [21, Theorem 2.1]). This proves that online CD can be asymptotically competitive with other methods for training unnormalized models, such as Noise-Contrastive Estimation [6] or Score Matching [7]. However, the asymptotic variance of $\psi_t$ (i.e. the multiplicative factor in front of the $O(n^{-1/2})$ term) is likely to be suboptimal, i.e. much larger than the Cramér-Rao bound, given by the trace of the inverse of the Fisher information matrix [25]. Given the statistical optimality of MLE, and the fact that CD is an approximate MLE method, this motivates the further goal of obtaining a CD estimator with near-optimal statistical properties. In the next section, we achieve this goal by showing that averaging the iterates $\psi_t$ produces a near statistically-optimal estimator, in a sense that we will make precise.

3.2.2 Towards statistical optimality with averaging

Polyak-Ruppert averaging [23] is a simple yet surprisingly effective way to construct an asymptotically optimal estimator $\bar{\psi}_n := \frac{1}{n}\sum_{t=1}^n \psi_t$ from a sequence of iterates $(\psi_t)_{0 \le t \le n}$ obtained by running a standard online SGD algorithm [22]. As shown in [22], when the objective is the cross-entropy of a model, and assuming unbiased stochastic gradients are available, averaging yields an estimator $\bar{\psi}_n$ with asymptotic variance $\mathrm{tr}(I(\psi^*)^{-1})/n$, where $I(\psi) := \mathrm{Cov}_{p_\psi}\phi$ is the Fisher information matrix of the data distribution $p_{\psi^*}$. Since $I(\psi^*)^{-1}$ is the Cramér-Rao lower bound on the asymptotic variance of statistical estimators [10], this estimator $\bar{\psi}_n$ is asymptotically optimal.
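In code, Polyak-Ruppert averaging is a one-line addition to the online CD loop. The sketch below is ours, not the paper's; `update_fn` is a hypothetical callback standing in for one step of Algorithm 1 (e.g. the loop body of the `online_cd` snippet above).

```python
import numpy as np

def online_cd_averaged(X, psi0, update_fn):
    # update_fn(psi, x, t) performs one online CD step (hypothetical callback,
    # e.g. the body of the online_cd loop sketched earlier) and returns the
    # new iterate.
    psi, psi_bar = psi0.copy(), np.zeros_like(psi0)
    for t, x in enumerate(X, start=1):
        psi = update_fn(psi, x, t)
        psi_bar += (psi - psi_bar) / t  # running mean of the CD iterates
    return psi_bar                      # Polyak-Ruppert averaged estimator
```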
The following theorem shows conditions under which averaging CD iterates can give rise to a near-optimal estimator.

Theorem 3.3 (Contrastive Divergence with Polyak-Ruppert averaging). Let $(\psi_t)_{t \ge 0}$ be the sequence of iterates obtained by running the online CD algorithm with a learning rate $\eta_t = Ct^{-\beta}$ for $\beta \in (\frac{1}{2}, 1)$. Define $\bar{\psi}_n := \frac{1}{n}\sum_{i=1}^n \psi_i$. Then, under the same assumptions as Theorem 3.2, and assuming additionally that $m := m(n) > \frac{(1-\beta)\log n}{2|\log \alpha|}$, we have, for all $n \ge 1$,

$$\big(E\|\bar{\psi}_n - \psi^*\|^2\big)^{1/2} \le 2\sqrt{\frac{\mathrm{tr}(I(\psi^*)^{-1})}{n}} + o(n^{-1/2}).$$

Consequently, we have that $\limsup_{n \to \infty} n\, E\|\bar{\psi}_n - \psi^*\|^2 \le 4\,\mathrm{tr}(I(\psi^*)^{-1})$.

Theorem 3.3, alongside a statement that includes the asymptotic order of the residual term, is proved in Appendix D.4. It shows that at the cost of increasing the computational complexity of the entire algorithm from $O(n)$ to $O(n \log n)$, $\bar{\psi}_n$ is a near-optimal statistical estimator of $\psi^*$. While this increase in complexity emerges from the bias of CD, the additional variance of CD results in an asymptotic variance inflated by a factor of 4 compared to the Cramér-Rao bound. Theorem 3.3 concludes our analysis of online CD. Despite their asymptotic near-optimality, the bounds provided for online CD and its averaged version have weaknesses: the online CD iterates are not robust to the choice of $C$, and, as shown in Appendix D.4, the bound for the averaged iterates contains higher-order terms that can be large in intermediate sample regimes. Next, we show that offline CD, which processes data points multiple times, can alleviate these issues.

4 Non-asymptotic analysis of offline CD

In practice, CD gradient approximation schemes are commonly used within an offline stochastic gradient descent (SGD) algorithm, where one is given the full size-$n$ dataset upfront and each update uses some stochastic subset of the data. We study CD under offline SGD with replacement (SGDw), i.e. Algorithm 2 with batches $B_{t,j}$ drawn as i.i.d. uniform size-$B$ subsets of $[[1,n]]$, and treat SGD without replacement in Appendix B.2. To do so, we follow the setting of prior work on offline CD [21], which established its asymptotic $O(n^{-1/3})$ consistency. We show that by slightly strengthening a moment assumption used in [21], the offline CD iterates converge to the true parameter at a near-parametric $O((\log n)^{1/2} n^{-1/2})$ rate. Our proof proceeds by controlling a tail probability term specific to the offline setting, which characterizes the strength of the correlations between the offline CD iterates and the training data. While, as we show, the assumptions of [21] provide tail control sufficient to obtain a near-parametric rate, other strategies are possible for obtaining convergence guarantees. In particular, we show that non-asymptotic convergence can be obtained by either (1) relaxing the assumptions on the Markov kernel required by prior work, or (2) making a specific mixing assumption on the Markov chain.

4.1 Background: Asymptotic consistency of offline CD in subexponential settings

Prior work [21] has established asymptotic $O(n^{-1/3})$ consistency of the (averaged) offline CD iterates in the full-batch case. We summarize their results and assumptions below. In what follows, $K^m_\psi(x)$ denotes a random variable with distribution $k^m_\psi(x, \cdot)$.

Assumption A4. There exists $\nu \ge 2$ such that for all $m \in \mathbb{N}$, there is $\kappa_{\nu;m} < \infty$ satisfying $\sup_{x \in \mathcal{X}}\sup_{\psi \in \Psi}\big(E\|\phi(K^m_\psi(x)) - E[\phi(K^m_\psi(x))]\|^\nu\big)^{1/\nu} \le \kappa_{\nu;m}$.

Assumption A5. There exists some $C_m > 0$ such that, for all $\psi_1, \psi_2 \in \Psi$, $\sup_{x \in \mathcal{X}}\big\|E[\phi(K^m_{\psi_1}(x))] - E[\phi(K^m_{\psi_2}(x))]\big\| \le C_m \|\psi_1 - \psi_2\|$.
Assumption A6. There exist some $\sigma_m, \zeta_m > 0$ such that, for any $z \in \mathbb{R}^p$ with $\|z\| \le \zeta_m$, $E\big[e^{z^\top(\phi(K^m_\psi(X_1)) - E[\phi(K^m_\psi(X_1))])}\big] \le e^{\sigma_m^2\|z\|^2/2}$.

Theorem 4.1 (Theorem 2.1 of [21]). Assume Assumptions A1, A2, A3, A4 (for $\nu = 2$), A5 and A6. Let $\psi^{SGDw}_{t,1}$ be the $t$-th iterate of offline CD with full-batch gradient descent and constant step size $\eta_t = C$, i.e. the iterates produced by Algorithm 2 using $B_{t,1} = [[1,n]]$. Then for any learning rate $C$ and number of Markov kernel steps $m$ satisfying $\mu - \alpha^m\sigma C_\chi - \frac{C}{2}(L + \alpha^m\sigma C_\chi)^2 > 0$, we have, for some $A_m > 0$,

$$P\Big(\Big\|\frac{1}{T}\sum_{t=1}^T \psi^{SGDw}_{t,1} - \psi^*\Big\| > A_m n^{-1/3}\Big) \longrightarrow 0 \quad \text{as } n, T \to \infty.$$

This result shows convergence of the averaged full-batch CD iterates to the true parameter in the large-$n$ and large-$T$ limit. As discussed, this result is asymptotic both in $n$ and $T$: the probability of the error exceeding $A_m n^{-1/3}$ goes to 0 as $n \to \infty$ and $T \to \infty$, but at an unknown rate. Moreover, the $O(n^{-1/3})$ rate does not match the optimal $O(n^{-1/2})$ parametric rate.

4.2 Sharpening offline CD bounds in subexponential settings

4.2.1 Non-asymptotic $O(n^{-1/2})$-consistency

As a first result, we show that under the assumptions of [21] (except for a slightly stronger $\nu > 2$ moment assumption in A4), $\psi^{SGDw}_{T,N}$ in fact achieves a near-parametric rate. The most general version of our result holds for any learning rate schedule of the form $Ct^{-\beta}$, $\beta \in [0,1]$, and for offline SGD with arbitrary batch sizes $B$, with data drawn either with or without replacement across batches. For simplicity, we first present our result assuming full-batch ($B = n$, $N = 1$, $\psi^{SGDw}_{t,j} = \psi^{SGDw}_{t,1}$ for $t \ge 1$) SGD with constant step sizes $\eta_t = C$, which is the setting of [21]. Analogous bounds for the other batching and step size schedules can be found in Appendix B.

Theorem 4.2. Assume the setup of Theorem 4.1, except that Assumption A4 holds for some $\nu > 2$, and that $\mu_m = \mu - \alpha^m\sigma C_\chi > 4CL^2$. Let $\delta^{SGDw}_{t,j} := E\|\psi^{SGDw}_{t,j} - \psi^*\|^2$. Then we have:

$$\sqrt{\delta^{SGDw}_{T,1}} \le E^{T,1}_1\sqrt{\delta^{SGDw}_{0,0}} + \frac{C\, C'(p,\nu,m,\Psi)\Big(\sqrt{\tfrac{\log n}{n}} + \tfrac{1}{\sqrt{n}}\Big)}{\mu_m C} + E^{T,1}_2 L^2C^2, \tag{9}$$

where $E^{T,1}_1$ and $E^{T,1}_2$ are functions decreasing exponentially in $T$, and $C'(p,\nu,m,\Psi)$ is constant in $n$ and $T$. Consequently,

$$\limsup_{T \to \infty} \delta^{SGDw}_{T,1} \le \frac{e}{\mu_m C}\, C''(p,\nu,m,\Psi,\beta)\Big(\frac{\log n}{n} + \frac{1}{n}\Big).$$

The precise values of all the constants can be found in Theorem B.1 (for $E^{T,1}_1$, $E^{T,1}_2$) and Lemma B.3 (for $C''(p,\nu,m,\Psi,\beta)$), including their expressions for $N > 1$ and $\beta \in [0,1]$. We comment on the main differences between our result and that of [21]. First, our bound holds for any number of epochs $T$ and number of samples $n$. Second, fixing $n$ but taking $T \to \infty$, the final bound matches the parametric $O(n^{-1/2})$ rate up to a $\sqrt{\log n}$ factor, a significant improvement over the $O(n^{-1/3})$ rate of [21]. Finally, we control an $L^2$ error, which is a stronger control than a high-probability bound by Markov's inequality; we hypothesize this is the reason why a slightly stronger moment assumption is required for our setup, compared to the one used for the high-probability bound in [21]. Inspecting Equation (9), we notice the presence of two transient terms and a stationary term, reminiscent of the structure of the upper bound in Theorem 3.2. The transient terms (i.e. the ones containing $E^{T,1}_1$ and $E^{T,1}_2$) vanish exponentially fast in the total number of CD updates $T$. However, unlike in online CD, where the number of updates and the number of samples are tied (i.e. $T = n$), these two values are now decoupled, and these terms can be made arbitrarily small by increasing the number of gradient steps $T$ without having to collect more samples $n$.
The stationary term, which is the only one remaining in the limit $T \to \infty$, decreases with $n$ at a rate that is independent of hyperparameters like the step size $C$ or the learning rate exponent $\beta$ (see Lemma B.3). In that sense, offline CD compares favorably to online CD, whose rate is sensitive to $\beta$ and $C$, and to averaged online CD, whose bound contains higher-order (in $n$) terms that can be large in the moderate-$n$ regime. On the other hand, the stationary term in offline CD is asymptotically suboptimal: its rate is larger (albeit only by a log factor) than the best-case $O(n^{-1/2})$ one achieved by online CD algorithms, and the leading constant does not match the optimal one.

4.2.2 Proof of Theorem 4.2

The high-level proof of Theorem 4.2 follows a similar strategy to the online one: first, derive a recursion for the quantity $\delta^{SGDw}_{t,1} := E\|\psi^{SGDw}_{t,1} - \psi^*\|^2$, then unroll it explicitly to obtain a final bound on $\delta^{SGDw}_{T,1}$. The main difference from online CD is the presence of an additional, offline-specific correlation between the iterates and the data. We thus break down the proof into three steps: (1) deriving a controllable, uniform-in-time upper bound on the data-iterate correlations, (2) deriving and unrolling a recursion on $\delta^{SGDw}_{t,1}$ containing this new term, and (3) controlling that term to obtain a final bound on $\delta^{SGDw}_{T,1}$.

Step 1: Characterizing the data-iterate correlations in offline CD. In offline CD, at each epoch $t \ge 1$, the iterate $\psi^{SGDw}_{t-1,1}$ and the data samples $X_i$ are correlated: these samples may have been used in previous epochs $t' < t$ to obtain the iterates $\psi^{SGDw}_{t',1}$, which themselves influenced $\psi^{SGDw}_{t-1,1}$. With such correlations, we now have $P(X_i \mid \psi^{SGDw}_{t-1,1}) \ne P(X_i)$, preventing us from obtaining an unrollable recursion on $\delta^{SGDw}_{t,1}$ by first marginalizing $X_i$ out to obtain an upper bound of $E\big[\|\psi^{SGDw}_{t,1} - \psi^*\|^2 \mid \psi^{SGDw}_{t-1,1}\big]$ that only depends on $\|\psi^{SGDw}_{t-1,1} - \psi^*\|$, and then marginalizing over $\psi^{SGDw}_{t-1,1}$ to obtain a recursion as in Lemma 3.1. As this problem would not have occurred had we used fresh samples (i.e. i.i.d. copies of the $X_i$ not present in the training data) to perform our update, the core of the proof lies in controlling the following quantity:

$$\Delta(\psi^{SGDw}_{t,1}) := \Big\|\frac{1}{n}\sum_{i \le n} E\Big[\phi\big(K^m_{\psi^{SGDw}_{t,1}}(X_i)\big)\,\Big|\,\psi^{SGDw}_{t,1}, X_i\Big] - E\Big[\phi\big(K^m_{\psi^{SGDw}_{t,1}}(X_1')\big)\,\Big|\,\psi^{SGDw}_{t,1}\Big]\Big\|,$$

where $X_1'$ is an i.i.d. copy of $X_1$. $\Delta(\psi^{SGDw}_{t,1})$ measures the error between a quantity that would yield a recursion (the rightmost term) and the one actually used by offline CD (the leftmost term). To control it, we upper bound it using a tail decomposition:

$$E\big[\Delta(\psi^{SGDw}_{t,1})^2\big] \le \epsilon^2 + \Big(\sup_t E\big[\Delta(\psi^{SGDw}_{t,1})^\nu\big]\Big)^{2/\nu}\Big(\sup_t P\big(\Delta(\psi^{SGDw}_{t,1}) > \epsilon\big)\Big)^{\frac{\nu-2}{\nu}} =: \varepsilon^{SGDw}_{n,m,T;\nu}(\epsilon)^2. \tag{10}$$

We invoke an additional assumption to ensure that $\big(E[\Delta(\psi^{SGDw}_{t,1})^\nu]\big)^{2/\nu}$ is finite; in the results of [21], this is automatically implied by Assumptions A4 and A6. For simplicity, we assume the same bounding constant $\kappa_{\nu;m}$.

Assumption A7. There exists $\nu \ge 2$ such that for all $m \in \mathbb{N}$, $\kappa_{\nu;m}$ from A4 moreover verifies $\sup_{\psi \in \Psi}\big(E\|\phi(K^m_\psi(X_1)) - E[\phi(K^m_\psi(X_1))]\|^\nu\big)^{1/\nu} \le \kappa_{\nu;m}$.

Note the similarity of this assumption with Assumption A4: the only difference is that $X_1$ is now a random training point instead of a deterministic (arbitrary) one. Ensuring Assumption A7 in addition to Assumption A4 thus only requires controlling a $\nu$-th order moment, instead of all moments as implied by Assumption A6.
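As an aside, the batching schemes whose correlations we are controlling are easy to state in code. The sketch below is an illustrative assumption of ours about how the index sets $B_{t,j}$ might be generated for SGDw (i.i.d. uniform size-$B$ subsets, as analyzed here) and for SGD with reshuffling (SGDo, Appendix B.2); it is not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def batches_sgdw(n, B, T):
    """SGDw: every batch is an independent uniform draw of a size-B subset of [n]."""
    # assumes B divides n, so each epoch has N = n // B batches
    return [[rng.choice(n, size=B, replace=False) for _ in range(n // B)]
            for _ in range(T)]

def batches_sgdo(n, B, T):
    """SGDo: each epoch is a fresh uniform permutation of [n], cut into batches."""
    return [np.array_split(rng.permutation(n), n // B) for _ in range(T)]

print(batches_sgdw(10, 5, 2))
print(batches_sgdo(10, 5, 2))
```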
Step 2: Deriving and unrolling the recursion on $\delta^{SGDw}_{t,1}$. The right-hand side of Equation (10) does not depend on $t$, which allows for the derivation of an unrollable recursion on $\delta^{SGDw}_{t,1}$ and its subsequent unrolling, performed in the following theorem. For simplicity, we again assume $\beta = 0$ and $N = 1$, and defer the general case to Theorem B.1 in the appendix.

Theorem 4.3 (Convergence up to a tail control). Assume A1, A2, A3, A4 and A7. Let $\eta_t = C$ for some $C > 0$, and assume that $\mu_m = \mu - \alpha^m\sigma C_\chi > 4CL^2$. Then for any $\epsilon > 0$,

$$\sqrt{\delta^{SGDw}_{T,1}} \le E^{T,1}_1\sqrt{\delta^{SGDw}_{0,0}} + \frac{C\Big(\varepsilon^{SGDw}_{n,m,T;\nu}(\epsilon) + \frac{5\sigma + 5\kappa_{\nu,m}}{\sqrt{n}}\Big)}{\mu_m C} + E^{T,1}_2 L^2C^2,$$

where $\varepsilon^{SGDw}_{n,m,T;\nu}(\epsilon)$ is defined in Equation (10). Note that in the general, non-full-batch case $B \le n$, the term $\frac{5\sigma + 5\kappa_{\nu,m}}{\sqrt{n}}$ is replaced by $\frac{5\sigma + 5\kappa_{\nu,m}}{\sqrt{B}}$ (see Theorem B.1). Under our bounds, obtaining consistency thus requires the batch size $B \equiv B(n)$ to grow to infinity with $n$.

Step 3: Controlling the tail probability term. Theorem 4.3 is just one step away from the final bound of Theorem 4.2: it remains to control the tail term $\varepsilon^{SGDw}_{n,m,T;\nu}(\epsilon)$. Under the assumptions of [21], minimizing $\varepsilon^{SGDw}_{n,m,T;\nu}(\epsilon)$ over $\epsilon$ yields the following result:

Lemma 4.4. Assume the setup of Theorem 4.2. Let $n \in \mathbb{N}$ be sufficiently large that $\frac{\log n}{n} < \frac{\sigma_m^2\zeta_m^2}{p + \nu - 2}$. Denote by $r_\Psi$ the radius of the smallest sphere in $\mathbb{R}^p$ that contains $\Psi$, which is finite under A1. Then

$$\inf_{\epsilon > 0}\varepsilon^{SGDw}_{\nu;n,m,T}(\epsilon) \le \Bigg(3\sigma_m\sqrt{\frac{p\big((\nu-2)p + 2\nu\big)}{\nu-2}} + \kappa_{\nu;m}\, 2^{\frac{\nu-2}{2\nu}}\Big(1 + \frac{2C_m(\nu-2)^{1/2}}{\sigma_m p^{1/2}\big((\nu-2)p + 2\nu\big)^{1/2}}\Big)^{\frac{(\nu-2)p}{2\nu}}\Bigg)\sqrt{\frac{\log n}{n}}.$$

To obtain this result, we control the moment term $\sup_t E[\Delta(\psi^{SGDw}_{t,1})^\nu]$ using A7, and we control the tail probability term $\sup_t P(\Delta(\psi^{SGDw}_{t,1}) > \epsilon)$ as in [21, Lemma 3.1], using a union bound, a covering argument and A6. Theorem 4.2 then follows by plugging Lemma 4.4 into Theorem 4.3.

4.3 Consistency of offline CD: beyond subexponential tails

As discussed above, the general unrolling result of Theorem 4.3 holds without the subexponentiality assumption A6; this assumption was only used in Lemma 4.4 to control $\varepsilon^{SGDw}_{n,m,T;\nu}(\epsilon)$. We now discuss two alternative ways to control this quantity without requiring subexponential tails. The first generalizes the idea of Jiang et al. [21], while the second exploits mixing of the Markov chain $K^m_\psi(x)$ as $m \to \infty$. As before, we only state partial results (full batch, $\beta = 0$) and defer the full explicit bounds to Appendix B.3.

Control via Markov's inequality. Our first alternative uses Markov's inequality to yield the following.

Theorem 4.5. Assume the setup of Theorem 4.3 and additionally that A5 holds. Then

$$\inf_{\epsilon > 0}\varepsilon^{SGDw}_{\nu;n,m,T}(\epsilon) \le C(p,\nu,m,\Psi)\, n^{-\frac{(\nu-2)\nu}{2(\nu^2 + (\nu-2)p)}}, \qquad \delta^{SGDw}_{T,1} \le C'(p,\nu,m,\Psi)\Big(n^{-\frac{(\nu-2)\nu}{2(\nu^2 + (\nu-2)p)}} + \frac{1}{\sqrt{n}}\Big),$$

where $C$ and $C'$ are functions whose explicit expressions are given in Lemma B.4 in the appendix. In the case $p = 1$ and $\nu = 3$, the suboptimal error from Theorem 4.5 reads $O(n^{-3/20})$. Theorems 4.2 and 4.5 reveal that the convergence rate of offline CD varies with the tail condition imposed on the noise introduced by the Markov kernel: a subexponential tail, as assumed in prior work, in fact leads to a near-parametric rate, while consistency can be obtained without assuming subexponentiality, albeit at a suboptimal rate.

Control via Markov chain mixing. Alternatively, notice that $E[\Delta(\psi^{SGDw}_{t,1})^2]$ involves an average of differences $E\big[\phi\big(K^m_{\psi^{SGDw}_{t,1}}(X_i)\big) \mid X_i, \psi^{SGDw}_{t,1}\big] - E\big[\phi\big(K^m_{\psi^{SGDw}_{t,1}}(X_1')\big) \mid \psi^{SGDw}_{t,1}\big]$. When $m \to \infty$, the effect of initialization vanishes, and one may expect the difference to converge to zero.
We show in Lemma B.5 in the appendix that, under a $\phi$-discrepancy mixing condition [35] with a mixing coefficient $\bar{\alpha} \in [0,1)$,

$$\inf_{\epsilon > 0}\varepsilon^{SGDw}_{\nu;n,m,T}(\epsilon) = O\Big(\kappa_{\nu;m}\,\bar{\alpha}^{\frac{m(\nu-2)}{3\nu-2}}\Big) \quad\text{and}\quad \lim_{T \to \infty}\delta^{SGDw}_{T,1} = O\Big(\kappa_{\nu;m}\,\bar{\alpha}^{\frac{m(\nu-2)}{3\nu-2}} + \frac{\sigma + \kappa_{\nu;m}}{\sqrt{n}}\Big).$$

As $m \to \infty$, this recovers the parametric rate $O(n^{-1/2})$. This alternative convergence guarantee comes at the cost of requiring $m$, the number of Markov chain steps, to grow with the sample size $n$.

Remark (Examples). In our main results (Theorems 3.2, 3.3 and 4.3) and the tail condition for offline SGD (Theorem 4.2), we employed a weaker set of assumptions than those in [21] (except for the mild $\nu > 2$ moment assumption in A4). Consequently, our results apply to all three examples studied in [21]: a bivariate Gaussian model with unknown mean and a random-scan Gibbs sampler, a fully visible Boltzmann machine with a random-scan Gibbs sampler, and an exponential-family random graph model with a Metropolis-Hastings sampler.

5 Related Work

Central to this paper is the prior work of Jiang et al. [21], which provided a rigorous theoretical foundation for analyzing the convergence of full-batch CD, and which we refine. The study of optimization with biased gradient estimates has attracted a lot of attention in recent years [36, 37, 38, 39]. These works, while closely connected to ours, analyze algorithms with implementation choices different from the CD algorithm: an i.i.d. noise setup [36], or setups where a persistent Markov chain is maintained through the iterations [36, 37, 38, 39]. The latter is akin to a variant of the CD algorithm called persistent CD [40]. In contrast, our analysis focuses on the CD algorithm, which restarts a batch of Markov chains from the data distribution at every iteration. Finally, there is a rich body of work on convergence guarantees for offline multi-pass SGD [41, 42, 43, 44, 45, 46]. A notable difference of our analysis is that we are primarily concerned with the statistical error associated with convergence to the true parameter $\psi^*$ in the number of samples $n$, and not the commonly studied convergence rate in the number of epochs $T$. Consequently, most of our work for the offline setup goes into handling the correlations that accumulate from reusing data across epochs.

6 Discussion

In this work, we provide a non-asymptotic analysis of Contrastive Divergence algorithms, showing, in the online setting, their potential to converge at the parametric rate and to have near-optimal asymptotic variance, and proving a near-parametric rate in the offline setting, significantly extending prior results. Our results apply to unnormalized exponential families: despite their flexibility, these models only cover log-densities that depend linearly on the model parameters. We believe that extending our results to more general forms of unnormalized models is an important direction for future work.

Acknowledgments and Disclosure of Funding. All authors acknowledge support from the Gatsby Charitable Foundation.

References

[1] Kyle Cranmer, Johann Brehmer, and Gilles Louppe. The frontier of simulation-based inference. Proceedings of the National Academy of Sciences, 117(48):30055-30062, 2020.
[2] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748-8763. PMLR, 2021.
[3] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.
Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[4] Aad W. van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.
[5] Holden Lee, Chirag Pabbaraju, Anish Prasad Sevekari, and Andrej Risteski. Pitfalls of Gaussians as a noise distribution in NCE. In The Eleventh International Conference on Learning Representations, 2023.
[6] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
[7] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 2005.
[8] Frederic Koehler, Alexander Heckett, and Andrej Risteski. Statistical efficiency of score matching: The view from isoperimetry. In The Eleventh International Conference on Learning Representations, 2022.
[9] Chirag Pabbaraju, Dhruv Rohatgi, Anish Prasad Sevekari, Holden Lee, Ankur Moitra, and Andrej Risteski. Provable benefits of score matching. Advances in Neural Information Processing Systems, 36, 2024.
[10] George Casella and Roger L. Berger. Statistical Inference. Cengage Learning, 2021.
[11] Nicolas Le Roux and Yoshua Bengio. Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6):1631-1649, 2008.
[12] Yilun Du and Igor Mordatch. Implicit generation and modeling with energy based models. Advances in Neural Information Processing Systems, 32, 2019.
[13] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 2002.
[14] Erik Nijkamp, Mitch Hill, Tian Han, Song-Chun Zhu, and Ying Nian Wu. On the anatomy of MCMC-based maximum likelihood learning of energy-based models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[15] Erik Nijkamp, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Learning non-convergent non-persistent short-run MCMC toward energy-based model. Advances in Neural Information Processing Systems, 2019.
[16] Yilun Du, Shuang Li, Joshua Tenenbaum, and Igor Mordatch. Improved contrastive divergence training of energy-based models. arXiv preprint arXiv:2021.01316, 2021.
[17] Lianhui Qin, Sean Welleck, Daniel Khashabi, and Yejin Choi. COLD decoding: Energy-based constrained text generation with Langevin dynamics. Advances in Neural Information Processing Systems, 35:9538-9551, 2022.
[18] Deqian Kong, Bo Pang, Tian Han, and Ying Nian Wu. Molecule design by latent space energy-based modeling and gradual distribution shifting. In Uncertainty in Artificial Intelligence, pages 1109-1120. PMLR, 2023.
[19] Csilla Varnai, Nikolas S. Burkoff, and David L. Wild. Efficient parameter estimation of generalizable coarse-grained protein force fields using contrastive divergence: a maximum likelihood approach. Journal of Chemical Theory and Computation, 9(12):5718-5733, 2013.
[20] Pierre Glaser, Michael Arbel, Arnaud Doucet, and Arthur Gretton. Maximum likelihood learning of energy-based models for simulation-based inference. 2022.
[21] Bai Jiang, Tung-Yu Wu, Yifan Jin, and Wing H. Wong. Convergence of contrastive divergence algorithm in exponential family. The Annals of Statistics, 46(6A):3067-3098, 2018.
[22] Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in Neural Information Processing Systems, 24, 2011.
[23] Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.
[24] Lawrence D. Brown. Fundamentals of Statistical Exponential Families: with Applications in Statistical Decision Theory. IMS, 1986.
[25] Martin J. Wainwright, Michael I. Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.
[26] Charles J. Geyer. Introduction to Markov chain Monte Carlo. Handbook of Markov Chain Monte Carlo, 20116022:45, 2011.
[27] Yann LeCun, Sumit Chopra, Raia Hadsell, M. Ranzato, and Fujie Huang. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.
[28] Yang Song and Diederik P. Kingma. How to train your energy-based models. arXiv preprint arXiv:2101.03288, 2021.
[29] Joseph L. Doob. Measure Theory, volume 143. Springer Science & Business Media, 2012.
[30] David A. Levin and Yuval Peres. Markov Chains and Mixing Times, volume 107. American Mathematical Society, 2017.
[31] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[32] Gary Harris and Clyde Martin. Shorter notes: The roots of a polynomial vary continuously as a function of the coefficients. Proceedings of the American Mathematical Society, pages 390-392, 1987.
[33] Frank Nielsen and Richard Nock. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Processing Letters, 21(1):10-13, 2013.
[34] Dominique Bakry, Ivan Gentil, Michel Ledoux, et al. Analysis and Geometry of Markov Diffusion Operators, volume 103. Springer, 2014.
[35] Maxim Rabinovich, Aaditya Ramdas, Michael I. Jordan, and Martin J. Wainwright. Function-specific mixing times and concentration away from equilibrium. 2020.
[36] Belhal Karimi, Blazej Miasojedow, Eric Moulines, and Hoi-To Wai. Non-asymptotic analysis of biased stochastic approximation scheme. In Conference on Learning Theory, pages 1944-1974. PMLR, 2019.
[37] Tao Sun, Yuejiao Sun, and Wotao Yin. On Markov chain gradient descent. Advances in Neural Information Processing Systems, 31, 2018.
[38] Thinh T. Doan. Finite-time analysis of Markov gradient descent. IEEE Transactions on Automatic Control, 68(4):2140-2153, 2022.
[39] Yves F. Atchadé, Gersende Fort, and Eric Moulines. On perturbed proximal gradient algorithms. Journal of Machine Learning Research, 18(10):1-33, 2017.
[40] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[41] Andre Elisseeff, Theodoros Evgeniou, Massimiliano Pontil, and Leslie Pack Kaelbling. Stability of randomized learning algorithms. Journal of Machine Learning Research, 6(1), 2005.
[42] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225-1234. PMLR, 2016.
[43] Junhong Lin and Lorenzo Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(97):1-47, 2017.
[44] Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. Advances in Neural Information Processing Systems, 31, 2018.
[45] Nicole Mücke, Gergely Neu, and Lorenzo Rosasco.
Beating SGD saturation with tail-averaging and minibatching. Advances in Neural Information Processing Systems, 32, 2019.
[46] Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, and Sham Kakade. Risk bounds of multi-pass SGD for least squares in the interpolation regime. Advances in Neural Information Processing Systems, 35:12909-12920, 2022.
[47] Dheeraj Nagaraj, Prateek Jain, and Praneeth Netrapalli. SGD without replacement: Sharper rates for general smooth convex functions. In International Conference on Machine Learning, pages 4703-4711. PMLR, 2019.
[48] Shashank Rajput, Anant Gupta, and Dimitris Papailiopoulos. Closing the convergence gap of SGD without replacement. In International Conference on Machine Learning, pages 7964-7973. PMLR, 2020.
[49] Dimitri Bertsekas. Convex Optimization Theory, volume 1. Athena Scientific, 2009.
[50] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
[51] S. W. Dharmadhikari, V. Fabian, and K. Jogdeo. Bounds on the moments of martingales. Annals of Mathematical Statistics, pages 1719-1723, 1968.

Supplementary Material for "Near-Optimality of Contrastive Divergence Algorithms"

The supplementary material provides the proofs of the main results of the paper. Section B states full explicit bounds for the offline CD algorithm. Section C collects a list of useful tools for our proofs; these include the properties of $\varphi_\gamma$ introduced before Theorem 3.2 in the main text, as well as several contraction and integrability results. Section D provides the proofs for the online CD algorithm. Sections E, F and G contain the proofs about the offline CD algorithm and the tail control.

A Notations

Throughout the proofs, we denote by $P^m_\psi$ the following operator from $L^2(p_\psi)$ to itself:

$$P^m_\psi f(x) := \int k^m(x, x')\, f(x')\, p_\psi(x')\, dx'. \tag{11}$$

Here, $k^m(x, x')$ is the $m$-iterated version of some Markov transition kernel $k_\psi$, i.e.

$$k^m_\psi(x, x') := \int k_\psi(x, x_1)\, k_\psi(x_1, x_2)\cdots k_\psi(x_{m-1}, x')\, dx_1 \cdots dx_{m-1}. \tag{12}$$

$\mathrm{Proj}_\Psi : \mathbb{R}^p \to \Psi$ denotes the projection operator onto the convex set $\Psi$, i.e. $\mathrm{Proj}_\Psi(\psi) := \operatorname{argmin}_{\psi' \in \Psi}\|\psi - \psi'\|$. We also frequently use the following function, which appears in standard convex optimization results [22]:

$$\varphi_\gamma(t) := \begin{cases} \frac{t^\gamma - 1}{\gamma} & \text{if } \gamma \ne 0, \\ \log t & \text{if } \gamma = 0, \end{cases}$$

defined on $\mathbb{R}_+ \setminus \{0\}$.

B Additional results for offline SGD

In this section, we provide the full statements of the error bounds for SGD with replacement (SGDw) and SGD with reshuffling (SGDo), together with the tail moment bounds, which complement the results in Section 4. Proofs are deferred to Appendix F, and make use of an $L^2$ approximation by auxiliary gradient updates derived in Appendix E.

Notations. Denote the SGDw iterates by $(\psi^{SGDw}_{t,j})_{t \in \mathbb{N}, j \in [N]}$ and let $X_1'$ be an i.i.d. copy of $X_1$. Throughout the remainder of the appendix, given $\epsilon > 0$ and $n \in \mathbb{N}$, we define

$$\vartheta^{SGDw}_{n,m,T}(\epsilon) := \sup_{t \in [T],\, j \in [N]} P\Bigg(\bigg\|\frac{1}{n}\sum_{i=1}^n E\Big[\phi\big(K^m_{\psi^{SGDw}_{t-1,j}}(X_i)\big)\,\Big|\,X_i, \psi^{SGDw}_{t-1,j}\Big] - E\Big[\phi\big(K^m_{\psi^{SGDw}_{t-1,j}}(X_1')\big)\,\Big|\,\psi^{SGDw}_{t-1,j}\Big]\bigg\| > \epsilon\Bigg) = \sup_{t,j} P\big(\Delta(\psi^{SGDw}_{t-1,j}) > \epsilon\big),$$

and $\vartheta^{SGDo}_{n,m,T}(\epsilon)$ analogously. Using these notations, we can rewrite the quantity $\varepsilon^{SGDw}_{n,m,T;\nu}(\epsilon)$ from the main text as

$$\varepsilon^{SGDw}_{n,m,T;\nu}(\epsilon) := \sqrt{\epsilon^2 + \kappa^2_{\nu;m}\, \vartheta^{SGDw}_{n,m,T}(\epsilon)^{\frac{\nu-2}{\nu}}}.$$

B.1 An explicit finite-sample bound for SGDw

In the result below, we write $\delta^{SGDw}_{t,j} := E\|\psi^{SGDw}_{t,j} - \psi^*\|^2$ and, for a fixed $\epsilon > 0$,

$$\sigma^{SGDw}_{n,T} := \varepsilon^{SGDw}_{n,m,T;\nu}(\epsilon) + \frac{5\sigma + 5\kappa_{\nu;m}}{\sqrt{B}} = \sqrt{\epsilon^2 + \kappa^2_{\nu;m}\,\vartheta^{SGDw}_{n,m,T}(\epsilon)^{\frac{\nu-2}{\nu}}} + \frac{5\sigma + 5\kappa_{\nu;m}}{\sqrt{B}}.$$

Theorem B.1. Assume A1 (where $\Psi$ may be non-compact), A2, A3, A4 and A7. Let $\eta_t = Ct^{-\beta}$ for some $\beta \in [0,1]$ and $C > 0$, and assume that $m > \frac{\log(\sigma C_\chi/\mu)}{|\log \alpha|}$, so that $\mu_m = \mu - \alpha^m\sigma C_\chi > 0$ as in Theorem 3.2.
Then for any $\epsilon > 0$, $\sqrt{\delta^{SGDw}_{T,N}}$ is upper bounded by

$$\begin{cases} E^{T,N}_1\sqrt{\delta^{SGDw}_{0,0}} + C\sigma^{SGDw}_{n,T}\Big(\dfrac{4e}{\mu_m C} + 2N(1+\mu_m C)^{N-1}\varphi_{\frac{1}{2}}\big(L^2C^2N(T+1)\big)\, E^{T,N}_2\Big) & \text{for } \beta = 0, \\[2mm] E^{T,N}_1\sqrt{\delta^{SGDw}_{0,0}} + C\sigma^{SGDw}_{n,T}\Big(\dfrac{4}{\mu_m C} + 3N\big(1 + \tfrac{L^2C^2}{2}\big)^{N-1} e^{2L^2C^2N}\log(T+1)\Big)(T+1)^{-\mu_m CN/2} & \text{for } \beta = 1, \\[2mm] E^{T,N}_1\sqrt{\delta^{SGDw}_{0,0}} + C\sigma^{SGDw}_{n,T}\Big(\dfrac{2^{2\beta+1}}{\mu_m C}\exp\big(\tfrac{\mu_m CN}{2(1-\beta)}(T+1)^{-\beta}\big) + 3^\beta(1+\mu_m C)^{N-1}(T+2)^\beta L^2C^2\, E^{T,N}_2\Big) & \text{otherwise}, \end{cases}$$

where $E^{T,N}_1$ and $E^{T,N}_2$ are two decreasing functions of $T$ defined by

$$E^{T,N}_1 := \exp\Big(-N\mu_m C\,\varphi_{1-\beta}(T+1) + \frac{NL^2C^2}{2}\varphi_{1-2\beta}(T+1)\Big), \qquad E^{T,N}_2 := \exp\Big(-\frac{N\mu_m C}{2}\varphi_{1-\beta}(T+1) + 2NL^2C^2\varphi_{1-2\beta}(T+1)\Big).$$

We emphasize that the full result above holds for any $\beta \in [0,1]$, which in particular includes the constant step size regime $\beta = 0$ considered by [21]. When $\beta = 0$, for $E^{T,N}_1$ and $E^{T,N}_2$ to decay to zero as $T \to \infty$, we additionally need the condition $\mu_m = \mu - \alpha^m\sigma C_\chi > 4CL^2$. This is almost identical to the condition used in [21, Equation 2.5, Theorem 2.1], except that $4L^2$ is replaced by $\frac{1}{2}(L + \alpha^m\sigma C_\chi)^2$. Notably, this says that an additional step size condition is needed for our results to hold in the constant step size regime, but is not necessary for a decreasing step size.

B.2 Results for SGDo

SGD with reshuffling (SGDo, also called SGD without replacement) is an optimization scheme that, like SGDw and online SGD, is widely used in practice. In the context of CD, it corresponds to Algorithm 2 with batches chosen as $(B_{t,1}, \dots, B_{t,N}) = \pi(\{1, \dots, n\})$, where $\pi$ is a uniform draw from the permutation group on $n$ elements. We denote the iterates of SGDo by $(\psi^{SGDo}_{t,j})_{t \in \mathbb{N}, j \in [N]}$. Analogously to $\vartheta^{SGDw}_{n,m,T}$, we define, for $X_1'$ an i.i.d. copy of $X_1$, $\epsilon > 0$ and $n \in \mathbb{N}$, the tail probability term

$$\vartheta^{SGDo}_{n,m,T}(\epsilon) := \sup_{t \in [T],\, j \in [N]} P\Bigg(\bigg\|\frac{1}{n}\sum_{i=1}^n E\Big[\phi\big(K^m_{\psi^{SGDo}_{t-1,j}}(X_i)\big)\,\Big|\,X_i, \psi^{SGDo}_{t-1,j}\Big] - E\Big[\phi\big(K^m_{\psi^{SGDo}_{t-1,j}}(X_1')\big)\,\Big|\,\psi^{SGDo}_{t-1,j}\Big]\bigg\| > \epsilon\Bigg).$$

We also denote $\varepsilon^{SGDo}_{n,m,T;\nu}(\epsilon) := \sqrt{\epsilon^2 + \kappa^2_{\nu;m}\,\vartheta^{SGDo}_{n,m,T}(\epsilon)^{\frac{\nu-2}{\nu}}}$ and $\sigma^{SGDo}_{n,T} := \varepsilon^{SGDo}_{n,m,T;\nu}(\epsilon) + \frac{5\sigma + 5\kappa_{\nu;m}}{\sqrt{B}}$. The following result says that $\psi^{SGDo}_{t,j}$ enjoys exactly the same convergence guarantee as $\psi^{SGDw}_{t,j}$ in Theorem B.1. The statement is identical to that of Theorem B.1 and is stated in full for completeness; see Appendix F.2 for the proof, which is a slight adaptation of the proof of Theorem B.1. As before, we write $\delta^{SGDo}_{t,j} := E\|\psi^{SGDo}_{t,j} - \psi^*\|^2$.

Theorem B.2 (Convergence of CD-SGDo). Assume A1 (where $\Psi$ may be non-compact), A2, A3, A4 and A7. Let $\eta_t = Ct^{-\beta}$ for some $\beta \in [0,1]$ and $C > 0$, and assume that $m > \frac{\log(\sigma C_\chi/\mu)}{|\log \alpha|}$, so that $\mu_m = \mu - \alpha^m\sigma C_\chi > 0$ as in Theorem 3.2. Then for any $\epsilon > 0$, $\sqrt{\delta^{SGDo}_{T,N}}$ is upper bounded by

$$\begin{cases} E^{T,N}_1\sqrt{\delta^{SGDo}_{0,0}} + C\sigma^{SGDo}_{n,T}\Big(\dfrac{4e}{\mu_m C} + 2N(1+\mu_m C)^{N-1}\varphi_{\frac{1}{2}}\big(L^2C^2N(T+1)\big)\, E^{T,N}_2\Big) & \text{for } \beta = 0, \\[2mm] E^{T,N}_1\sqrt{\delta^{SGDo}_{0,0}} + C\sigma^{SGDo}_{n,T}\Big(\dfrac{4}{\mu_m C} + 3N\big(1 + \tfrac{L^2C^2}{2}\big)^{N-1} e^{2L^2C^2N}\log(T+1)\Big)(T+1)^{-\mu_m CN/2} & \text{for } \beta = 1, \\[2mm] E^{T,N}_1\sqrt{\delta^{SGDo}_{0,0}} + C\sigma^{SGDo}_{n,T}\Big(\dfrac{2^{2\beta+1}}{\mu_m C}\exp\big(\tfrac{\mu_m CN}{2(1-\beta)}(T+1)^{-\beta}\big) + 3^\beta(1+\mu_m C)^{N-1}(T+2)^\beta L^2C^2\, E^{T,N}_2\Big) & \text{otherwise}, \end{cases}$$

where $E^{T,N}_1$ and $E^{T,N}_2$ are as in Theorem B.1.

Remark. We also remark that existing works [47, 48] show that standard SGDo typically yields a faster convergence rate in $T$ than SGDw. An analogous result for the CD setup would involve the additional technical hurdle of jointly controlling the correlations across minibatches together with those arising from reusing data samples, and we defer this to future work.

B.3 Explicit tail control

We now provide the full explicit tail control bounds. All results in this section hold verbatim for $\varepsilon^{SGDo}_{\nu;n,m,T}(\epsilon)$ and $\delta^{SGDo}_{T,N}$, and we omit the corresponding statements here.
In the results below, we denote by $r_\Psi$ the radius of the smallest sphere in $\mathbb{R}^p$ that contains $\Psi$, which is finite under A1.

Lemma B.3. Assume A5 and A6. Let $n \in \mathbb{N}$ be sufficiently large that $\frac{\log n}{n} < \frac{\sigma_m^2\zeta_m^2}{p + \nu - 2}$. Then

$$\inf_{\epsilon > 0}\varepsilon^{SGDw}_{\nu;n,m,T}(\epsilon) \le \Bigg(3\sigma_m\sqrt{\frac{p\big((\nu-2)p + 2\nu\big)}{\nu-2}} + \kappa_{\nu;m}\, 2^{\frac{\nu-2}{2\nu}}\Big(1 + \frac{2C_m(\nu-2)^{1/2}}{\sigma_m p^{1/2}\big((\nu-2)p + 2\nu\big)^{1/2}}\Big)^{\frac{(\nu-2)p}{2\nu}}\Bigg)\sqrt{\frac{\log n}{n}}.$$

In particular, if we additionally assume the conditions of Theorem B.1, we have

$$\limsup_{T \to \infty}\delta^{SGDw}_{T,N} \le C''(p,\nu,m,\Psi,\beta)\Big(\frac{\log n}{n} + \frac{1}{n}\Big), \quad\text{where}$$

$$C''(p,\nu,m,\Psi,\beta) := \frac{8(1 + 5\sigma + 5\kappa_{\nu;m})}{\mu_m}\Bigg(3\sigma_m\sqrt{\frac{p((\nu-2)p + 2\nu)}{\nu-2}} + \kappa_{\nu;m} 2^{\frac{\nu-2}{2\nu}}\Big(1 + \frac{2C_m(\nu-2)^{1/2}}{\sigma_m p^{1/2}((\nu-2)p + 2\nu)^{1/2}}\Big)^{\frac{(\nu-2)p}{2\nu}}\Bigg).$$

Lemma B.4. Assume the conditions of Theorem 4.3 and additionally that A5 holds. Then

$$\inf_{\epsilon > 0}\varepsilon^{SGDw}_{\nu;n,m,T}(\epsilon) \le \big(3C_m + \kappa^{\nu/2}_{\nu;m}\big)(r_\Psi)^{\frac{(\nu-2)p}{2\nu}}\, n^{-\frac{(\nu-2)\nu}{2(\nu^2 + (\nu-2)p)}}, \qquad \limsup_{T \to \infty}\big(\delta^{SGDw}_{T,N}\big)^{1/2} \le C'(p,\nu,m,\Psi)\Big(n^{-\frac{(\nu-2)\nu}{2(\nu^2 + (\nu-2)p)}} + B^{-1/2}\Big),$$

$$C'(p,\nu,m,\Psi) := \frac{8(1 + 5\sigma + 5\kappa_{\nu;m})}{\mu_m}\big(3C_m + \kappa^{\nu/2}_{\nu;m}\big)(r_\Psi)^{\frac{(\nu-2)p}{2\nu}}.$$

The next result considers a $\phi$-discrepancy mixing condition [35], which is a mixing assumption on $K^m_\psi$ with respect to the specific test function $\phi$, and we impose it uniformly over $\psi \in \Psi$. We also recall that $X^\psi_1 \sim p_\psi$.

Lemma B.5. Assume that there exist $\bar{\alpha} \in [0,1)$ and $\bar{C}_K > 0$ such that, for all $\psi \in \Psi$ and $x \in \mathcal{X}$, $\|E[\phi(K^m_\psi(x))] - E[\phi(X^\psi_1)]\| \le \bar{C}_K\bar{\alpha}^m$. Then

$$\inf_{\epsilon > 0}\varepsilon^{SGDw}_{\nu;n,m,T}(\epsilon) \le \big(1 + 2^{\frac{\nu-2}{2\nu}}\big)\kappa_{\nu;m}\big(\bar{C}_K\bar{\alpha}^m\big)^{\frac{\nu-2}{3\nu-2}}.$$

In particular, if we additionally assume the conditions of Theorem 4.3, we have

$$\limsup_{T \to \infty}\delta^{SGDw}_{T,N} \le \frac{8}{\mu - \alpha^m\sigma C_\chi}\Big(\big(1 + 2^{\frac{\nu-2}{2\nu}}\big)\kappa_{\nu;m}\big(\bar{C}_K\bar{\alpha}^m\big)^{\frac{\nu-2}{3\nu-2}} + \frac{5\sigma + 5\kappa_{\nu;m}}{\sqrt{B}}\Big).$$

C Auxiliary Tools

C.1 Properties of $\varphi_\gamma$

The following lemma collects some identities used in [22].

Lemma C.1. $\varphi_\gamma$ satisfies the following properties: (i) $\varphi_\gamma$ is increasing on $\mathbb{R}_+$ for all $\gamma$; (ii) $\varphi_\gamma(t) \le \frac{t^\gamma}{\gamma}$ for $\gamma > 0$, and $\varphi_\gamma(t) \le \frac{1}{-\gamma}$ for $\gamma < 0$; (iii) $\varphi_{1-\beta}(t) \ge t^{1-\beta} - 1$ for $\beta \in (0,1]$; (iv) $\varphi_\gamma(t) - \varphi_\gamma(x) \le \frac{t - x}{x^{1-\gamma}}$ for $\gamma \in (0,1]$ and $1 \le x \le t$.

The next lemma provides some additional results on $\varphi_\gamma$.

Lemma C.2. $\varphi_\gamma$ satisfies the following properties:

(i) $\varphi_\gamma$ is positive on $t > 1$ and increasing for every $\gamma \in \mathbb{R}$;

(ii) For $1 \le t_1 \le t_2$ and $\beta \ge 0$, we have

$$\varphi_{1-\beta}(t_2+1) - \varphi_{1-\beta}(t_1) \le \sum_{t=t_1}^{t_2} t^{-\beta} \le 2\big(\varphi_{1-\beta}(t_2+1) - \varphi_{1-\beta}(t_1)\big).$$

If instead $\beta < 0$, we have

$$\frac{1}{2}\big(\varphi_{1-\beta}(t_2+1) - \varphi_{1-\beta}(t_1)\big) \le \sum_{t=t_1}^{t_2} t^{-\beta} \le \varphi_{1-\beta}(t_2+1) - \varphi_{1-\beta}(t_1);$$

(iii) For $1 \le t_1 \le t_2$ and $\gamma \ne 0$, we have

$$t_1^{\gamma-1} \le \frac{\varphi_\gamma(t_2) - \varphi_\gamma(t_1)}{t_2 - t_1} \le t_2^{\gamma-1} \ \text{ if } \gamma \ge 1, \qquad t_2^{-(1-\gamma)} \le \frac{\varphi_\gamma(t_2) - \varphi_\gamma(t_1)}{t_2 - t_1} \le t_1^{-(1-\gamma)} \ \text{ if } \gamma \le 1;$$

(iv) Let $1 \le t_1 < t_2$ and $\kappa, \beta \ge 0$. If $\kappa \ne 1$ and $a > 0$, we have

$$\sum_{t=t_1}^{t_2}(t+1)^{-\beta}\exp\big(a\,\varphi_{1-\kappa}(t)\big) \le \frac{(t_2+1)^{\max\{\kappa-\beta,0\}}}{a}\exp\big(a\,\varphi_{1-\kappa}(t_2+1)\big),$$

and if $\kappa \ne 1$ and $a < 0$, we have

$$\sum_{t=t_1}^{t_2}(t+1)^{-\beta}\exp\big(a\,\varphi_{1-\kappa}(t)\big) \le \frac{(t_2+1)^{\max\{\kappa-\beta,0\}}}{-a}\exp\big(a\,\varphi_{1-\kappa}(t_1)\big).$$

Proof of Lemma C.2. (i) follows from checking the cases $\gamma > 0$, $\gamma = 0$ and $\gamma < 0$ respectively. The first set of bounds in (ii) follows by noting that $t \mapsto t^{-\beta}$ is decreasing for $\beta \ge 0$:

$$\sum_{t=t_1}^{t_2} t^{-\beta} \ge \int_{t_1}^{t_2+1} t^{-\beta}\, dt = \varphi_{1-\beta}(t_2+1) - \varphi_{1-\beta}(t_1), \qquad \sum_{t=t_1}^{t_2} t^{-\beta} \le 2\sum_{t=t_1}^{t_2}(t+1)^{-\beta} \le 2\int_{t_1-1}^{t_2}(t+1)^{-\beta}\, dt = 2\big(\varphi_{1-\beta}(t_2+1) - \varphi_{1-\beta}(t_1)\big).$$

The second set of bounds follows from noting that $t \mapsto t^{-\beta}$ is increasing for $\beta < 0$:

$$\sum_{t=t_1}^{t_2} t^{-\beta} \ge \frac{1}{2}\sum_{t=t_1}^{t_2}(t+1)^{-\beta} \ge \frac{1}{2}\int_{t_1-1}^{t_2}(t+1)^{-\beta}\, dt = \frac{1}{2}\big(\varphi_{1-\beta}(t_2+1) - \varphi_{1-\beta}(t_1)\big), \qquad \sum_{t=t_1}^{t_2} t^{-\beta} \le \int_{t_1}^{t_2+1} t^{-\beta}\, dt = \varphi_{1-\beta}(t_2+1) - \varphi_{1-\beta}(t_1).$$

For (iii), we note that for $\gamma \ne 0$, $\varphi_\gamma(t_2) - \varphi_\gamma(t_1) = \frac{t_2^\gamma - t_1^\gamma}{\gamma}$, so by the mean value theorem, $\inf_{t_1 \le t \le t_2} t^{\gamma-1} \le \frac{\varphi_\gamma(t_2) - \varphi_\gamma(t_1)}{t_2 - t_1} \le \sup_{t_1 \le t \le t_2} t^{\gamma-1}$. The desired bounds then follow from an explicit computation of the infimum and the supremum in each of the two cases $\gamma \ge 1$ and $\gamma \le 1$. For (iv), we first consider the case $\kappa \ne 1$ and $a > 0$.
Then Xt2 t=t1(t + 1) β exp a φ1 κ(t 1) = e a 1 κ Xt2 t=t1(t + 1) β exp at1 κ e a 1 κ maxt1 t t2 (t + 1)κ (t + 1)β Xt2 t=t1(t + 1) κ exp at1 κ e a 1 κ (t2 + 1)max{κ β,0} Xt2 t=t1(t + 1) κ exp at1 κ Since, for x 0, x 7 (x + 1) κ is decreasing and x 7 exp(ax1 κ/(1 κ)) is increasing, we have that for t1 t t2 and x [t, t + 1], (t + 1) κ x κ and exp(at1 κ/(1 κ)) exp(ax1 κ/(1 κ)) . This implies that Xt2 t=t1(t + 1) β exp a φ1 κ(t 1) (t2 + 1)max{κ β,0}e a 1 κ Xt2 t=t1 t x κ exp ax1 κ = (t2 + 1)max{κ β,0}e a 1 κ Z t2+1 t1 x κ exp ax1 κ (t2 + 1)max{κ β,0} a e a 1 κ e = (t2 + 1)max{κ β,0} a exp a φ1 κ(t2 + 1) . The main difference in the case κ = 1 and a < 0 is that we now use x 7 exp(a(x + 1)1 κ/(1 κ)) is decreasing to obtain, for t1 t t2 and x [t, t + 1], exp(a(t + 1)1 κ/(1 κ)) exp(ax1 κ/(1 κ)) . A similar argument then yields Xt2 t=t1(t + 1) β exp a φ1 κ(t) (t2 + 1)max{κ β,0}e a 1 κ Xt2 t=t1(t + 1) κ exp a(t + 1)1 κ (t2 + 1)max{κ β,0}e a 1 κ Z t2+1 t1 x κ exp ax1 κ = (t2 + 1)max{κ β,0} a exp a φ1 κ(t2 + 1) exp a φ1 κ(t1) (t2 + 1)max{κ β,0} ( a) exp a φ1 κ(t1) . We also need the following lemma, which is useful for controlling the accumulation of errors from the noise terms over iterations. Lemma C.3. For any a, b 0, T, N N and κ, β 0 such that bt β at κ 1 for all 1 t T, we have that t=1(1 bt β + at κ)N exp b N φ1 β(T + 1) + a N φ1 κ(T + 1) . Moreover, for any ζ 0, we have that j=1(1 bt β + at κ) YT s=t+1(1 bs β + as κ)N Q1 + exp b N 2 φ1 β(T + 1) + 4a N φ1 κ(T + 1) Q2 , 22ζ+1(T + 3)max{β ζ,0} b exp b N 2(1 β)(T + 1)β if β = 1 , b > 0 , 2Nφ1 ζ+b N/2(T + 1) exp b N 2 φ1 β(T + 1) if β = 1 or b = 0 , 3ζ(1 + a)N 1 2a (T + 2)max{κ ζ,0} if κ = 1 and a > 0 , 2N(1 + a)N 1 φ1 ζ 2a N(T + 1) if κ = 1 or a = 0 . In the special case ζ = β = 1 < κ , we have j=1(1 bt β + at κ)j 1 YT s=t+1(1 bs β + as κ)N 4 b + 3N(1 + a)N 1 e 4a N κ 1 log(T + 1) (T + 1) b N Proof of Lemma C.3. By assumption, bt β at κ 1 for all 1 t T. Since 0 1 x e x for all x 1, we have that for any 1 t1 t2 T, Yt2 t=t1(1 bt β + at κ)N exp b N Xt2 t=t1 t β + a N Xt2 t=t1 t κ . (13) Applying this to the first quantity of interest followed by noting that a, b 0 and using Lemma C.2(ii), we obtain the first bound that t=1(1 bt β + at κ)N exp b N XT t=1 t β + a N XT exp b N φ1 β(T + 1) + a N φ1 κ(T + 1) . For the second bound, we define t0 := sup n t T b 2 at (κ β)o . Then by noting that 1 bt β + at κ 0 for all 1 t T again,we can bound the quantity of interest as j=1(1 bt β + at κ)j 1 YT s=t+1(1 bs β + as κ)N t=t0+1 t ζ XN j=1(1 bt β + at κ)j 1 YT s=t+1(1 bs β + as κ)N s=t0+1(1 bs β + as κ) Xt0 t=1 t ζ XN j=1(1 bt β + at κ)j 1 Yt0 s=t+1(1 bs β + as κ)N t=t0+1 t ζ XN 2 t β j 1 YT 2 s β N Xt0 t=1 t ζ XN j=1(1 + at κ)j 1 Yt0 s=t+1(1 + as κ)N t=t0+1 t ζ YT | {z } =:S1 + N(1 + a)N 1 YT | {z } =:S3 Xt0 t=1 t ζ Yt0 s=t+1(1 + as κ)N | {z } =:S2 In the last line, we have used that 0 1 b 2t β 1 for t t0 + 1 and 1 + at κ 1 + a. To control the three quantities, we first note that by (13) , we have s=1 s β exp b N Xt0 s=1 s β (a) exp b N s=1 s β exp a N Xt0 s=1 s κ (b) exp b N 2 φ1 β(T + 1) + 2a N φ1 κ(T + 1) . In (a) above, we have noted that b 2 as (κ β) for s t0; in (b), we have used t0 T and Lemma C.2(ii) with a, b 0. In the special case β = 1 < κ, the above yields S3 (T + 1) b N 2 exp 2a N 1 (T + 1) (κ 1) (T + 1) b N 2 e 2a N κ 1 . We now control S2. By (13) again, we have S2 Xt0 t=1 t ζ exp a N Xt0 s=t+1 s κ t=1 t ζ exp a N XT (c) exp 2a Nφ1 κ(T + 1) XT t=1 t ζ exp 2a Nφ1 κ(t + 1) (d) 3ζ exp 2a Nφ1 κ(T + 1) XT t=1(t + 2) ζ exp 2a Nφ1 κ(t + 1) . 
In (c) above, we have applied Lemma C.2(ii); in (d), we have noted that supt N(t + 2)β/tβ = 3β. If κ = 1 and a > 0, we can apply Lemma C.2(iv) to get that 2a N (T + 2)max{κ ζ,0} exp 2a Nφ1 κ(T + 1) = Q2 N(1 + a + c)N 1 exp 2a Nφ1 κ(T + 1) . If κ = 1 or a = 0, the bound from (c) above reads S2 exp 2a Nφ1 κ(T + 1) XT t=1 t ζ(t + 1) 2a N exp 2a Nφ1 κ(T + 1) XT t=1 t ζ 2a N 2 φ1 ζ 2a N(T + 1) exp 2a Nφ1 κ(T + 1) = Q2 N(1 + a)N 1 exp 2a Nφ1 κ(T + 1) , where we have used Lemma C.2(ii) in the last line. Now consider the special case with ζ = 1 < κ, the bound from (d) becomes S2 3 exp 2a Nφ1 κ(T + 1) XT t=1(t + 2) 1 exp 2a Nφ1 κ(t + 1) 3 exp 2a N 1 (T + 1) (κ 1) t=1(t + 2) 1 3e 2a N κ 1 log(T + 1) . We are left with controlling S1, which follows from a similar strategy as controlling S2: t=t0+1 t ζ exp b N t=1 t ζ exp b N (a) exp b N 2 φ1 β(T + 1) XT t=1 t ζ exp b N 2 φ1 β(t + 1) (14) 2 φ1 β(T + 1) XT t=1(t + 3) ζ exp b N 2 φ1 β(t + 1) . In (a) above, we used Lemma C.2(ii). For β = 1 and b = 0, we can apply Lemma C.2(iv) with b 2 > 0 to obtain S1 4ζ exp b N 2 φ1 β(T + 1) (T + 3)max{β ζ,0} b N/2 exp b N 2 φ1 β(T + 3) = 22ζ+1(T + 3)max{β ζ,0} b N exp b N 2(1 β) (T + 3)1 β (T + 1)1 β (b) 22ζ+1(T + 3)max{β ζ,0} b N exp b N 2(1 β)(T + 1)β = Q1 In (b), we have used Lemma C.2(iii) with 1 β 1. Meanwhile, if β = 1 or b = 0, we have 2 φ1 β(T + 1) XT t=1 t ζ(t + 1)b N/2 2 φ1 β(T + 1) XT t=1 t ζ+b N/2 2φ1 ζ+b N/2(T + 1) exp b N 2 φ1 β(T + 1) = Q1 For the special case with ζ = β = 1, the bound in (14) becomes 2 φ0(T + 1) XT t=1 t 1 exp b N 2 φ0(t + 1) = (T + 1) b N t=1 t 1(t + 1) b N (T + 1) b N t=1 t (1 b N (c) 2(T + 1) b N 2 (T + 1) = 2(T + 1) b N 2 (T + 1)b N/2 1 b N/2 4 b N . In (c), we have used Lemma C.2(ii) for both the case 1 b N 2 0 and 1 b N Combining the bounds for the general cases, we obtain the first desired inequality that XT j=1(1 bt β + at κ)j 1 YT s=t+1(1 bs β + as κ)N 2 φ1 β(T + 1) + uφ1 ξ(T + 3) + 4a φ1 κ(T + 1) Q2 . For the special case ζ = β = 1 < κ, γ, combining the earlier bounds gives j=1(1 bt β + at κ)j 1 YT s=t+1(1 bs β + as κ)N 4 b + 3N(1 + a)N 1 e 4a N κ 1 log(T + 1) (T + 1) b N C.2 Contraction and integrability results The next result is a standard result in convex analysis, needed to handle projections performed in Algorithms 1 and 2. Lemma C.4. Let Ψ a be convex subset of Rp. Let ψ Ψ. Then, for all ψ Rp, we have: ProjΨ(ψ) ψ ψ ψ Proof. We have: ψ ψ 2 = ψ ProjΨ(ψ) + ProjΨ(ψ) ψ 2 = ψ ProjΨ(ψ) 2 + 2 ψ ProjΨ(ψ), ProjΨ(ψ) ψ + ProjΨ(ψ) ψ 2 Since by [49, Proposition 1.1.9], we have: ψ ProjΨ(ψ), ψ ProjΨ(ψ) 0 for all ψ Ψ, we can use this inequality at ψ = ψ Ψ to obtain: ψ ψ 2 ψ ProjΨ(ψ) 2 and the result follows by taking the square root. We now state two lemmas that guarantee an amount of integrability sufficient to our analysis. Lemma C.5. Let p, q P(X) such that dp dq exists, and such that χ2(p; q) < + . Assume that f L2(q) Then |Epf| < + . Proof. By assumption, f L2(q). Moreover, χ2(p, q) < + , and thus we have dp dq 1 L2(q). Thus, the inner product is finite, and Z f dp dq 1 dq = Z fdp Z fdq = |Epf Eqf| := M < + = M |Eqf| < Epf < M + |Eqf| Lemma C.6. For all ψ Ψ, for all m 1, and for all k 1, we have: Epψ P m ψ ϕ k < + Proof. By analycity of the log partition function ψ 7 log Z(ψ), we have R ϕ k dpψ(x) < + for all ψ Ψ, and thus, the function x 7 ϕ k (x) L2(pψ) for all ψ. Consequently, P m ψ ϕ k L2(pψ). We can apply Lemma C.5 to P m ψ ϕ k to obtain Eψ P m ϕ k < + for all k 1 and for all m 0. As a by-product, we obtain P m ψ ϕ k L2(pψ ), and thus so P mϕ k. 
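The projection step of Algorithms 1 and 2 is controlled by Lemma C.4 above. As a quick numerical sanity check (ours, not from the paper; $\Psi$ is taken to be a Euclidean ball, one convenient convex choice), the following snippet verifies on random draws that projecting onto $\Psi$ never increases the distance to a point of $\Psi$:

```python
import numpy as np

def proj_ball(psi, radius):
    """Euclidean projection onto Psi = {x : ||x|| <= radius}, a convex set."""
    nrm = np.linalg.norm(psi)
    return psi if nrm <= radius else psi * (radius / nrm)

rng = np.random.default_rng(1)
radius = 2.0
for _ in range(1000):
    psi = 5.0 * rng.normal(size=5)                            # arbitrary point, possibly outside Psi
    psi_prime = proj_ball(5.0 * rng.normal(size=5), radius)   # an arbitrary point of Psi
    lhs = np.linalg.norm(proj_ball(psi, radius) - psi_prime)  # ||Proj(psi) - psi'||
    rhs = np.linalg.norm(psi - psi_prime)                     # ||psi - psi'||
    assert lhs <= rhs + 1e-12                                 # Lemma C.4
print("projection never increases the distance to points of Psi")
```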
The following lemma is used multiple times in our analysis.

Lemma C.7. Assume A3. Let $q$ be a positive integer and let $f := (f_1,\ldots,f_q)$ be such that $f_k \in \{\phi_i\}_{i=1}^p \cup \{\phi_i\phi_j\}_{i,j=1}^p$ for all $k\in[q]$. Then, for all $\psi\in\Psi$, we have
$$
\big\| \mathbb E_{p_{\psi_\star}}\big[ P^m_\psi\big(f - \mathbb E_{p_\psi} f\big) \big] \big\| \;\le\; \alpha^m C_\chi\, \mathbb E_{p_\psi}\big[ \| f - \mathbb E_{p_\psi} f \|^2 \big]^{1/2}\, \| \psi - \psi_\star \|\,.
$$

Proof. We first note that
$$
\big\| \mathbb E_{p_{\psi_\star}} P^m_\psi (f - \mathbb E_{p_\psi} f) \big\|^2 \overset{(a)}{=} \sum_{i} \Big( \int P^m_\psi\big(f_i - \mathbb E_{p_\psi} f_i\big)(x)\, \big(p_{\psi_\star}(x) - p_\psi(x)\big)\,dx \Big)^2
$$
$$
\overset{(b)}{\le} \int \Big( \frac{dp_{\psi_\star}}{dp_\psi}(x) - 1 \Big)^2 p_\psi(x)\,dx\; \sum_i \int P^m_\psi\big(f_i - \mathbb E_{p_\psi} f_i\big)(x)^2\, p_\psi(x)\,dx
\;=\; \chi^2(p_{\psi_\star}; p_\psi)\, \sum_i \big\| P^m_\psi(f_i - \mathbb E_{p_\psi} f_i) \big\|^2_{L^2(p_\psi)}
$$
$$
\overset{(c)}{\le} \alpha^{2m}\, \chi^2(p_{\psi_\star}; p_\psi)\, \sum_i \big\| f_i - \mathbb E_{p_\psi} f_i \big\|^2_{L^2(p_\psi)}
\overset{(d)}{\le} \alpha^{2m}\, C^2_\chi\, \|\psi-\psi_\star\|^2\; \mathbb E_{p_\psi}\big\| f - \mathbb E_{p_\psi} f \big\|^2\,.
$$
Here, in (a) we used the fact that $P^m_\psi$ admits $p_\psi$ as an invariant measure [34, Eq. (1.2.2)], so that $\int P^m_\psi(f_i - \mathbb E_{p_\psi}f_i)\,dp_\psi = 0$; in (b), the Cauchy–Schwarz inequality; in (c), the restricted spectral gap assumption A3; and in (d), assumption A2.

C.3 Miscellaneous

Lemma C.8. Let $f:\Psi\to\mathbb R^p$ be a differentiable function in the interior of $\Psi\subseteq\mathbb R^p$. For $\psi\in\Psi$, define $\sigma_{\min}(\psi) := \inf_{\theta\in\Psi,\,\|\theta\|=1} \theta^\top \nabla f(\psi)\,\theta$ and $\sigma_{\max}(\psi) := \sup_{\theta\in\Psi,\,\|\theta\|=1} \theta^\top \nabla f(\psi)\,\theta$ with respect to the Jacobian matrix $\nabla f(\psi)$. Then for any $\psi_1,\psi_2\in\Psi$, we have
$$
\inf_{\psi\in\Psi}\sigma_{\min}(\psi)\, \|\psi_1-\psi_2\|^2 \;\le\; (\psi_1-\psi_2)^\top\big(f(\psi_1)-f(\psi_2)\big) \;\le\; \sup_{\psi\in\Psi}\sigma_{\max}(\psi)\, \|\psi_1-\psi_2\|^2\,.
$$

Proof of Lemma C.8. By the mean value theorem applied to $a\mapsto(\psi_1-\psi_2)^\top f(a\psi_1+(1-a)\psi_2)$, there exists some $a\in(0,1)$ such that
$$
(\psi_1-\psi_2)^\top\big(f(\psi_1)-f(\psi_2)\big) = (\psi_1-\psi_2)^\top \nabla f\big(a\psi_1+(1-a)\psi_2\big)(\psi_1-\psi_2)
= \|\psi_1-\psi_2\|^2\, \frac{(\psi_1-\psi_2)^\top}{\|\psi_1-\psi_2\|}\, \nabla f\big(a\psi_1+(1-a)\psi_2\big)\, \frac{\psi_1-\psi_2}{\|\psi_1-\psi_2\|}\,.
$$
Plugging in the definition of $\sigma_{\max}$ gives the desired upper bound, and similarly $\sigma_{\min}$ implies the lower bound.

D Proofs for Online CD

D.1 Auxiliary Lemmas for Online CD

We recall the following notations used in the next lemmas, namely $\sigma_\psi := \big(\mathbb E_{p_\psi}\|\phi - \mathbb E_{p_\psi}\phi\|^2\big)^{1/2}$, $\sigma_\star := \sigma_{\psi_\star}$ and $\bar\sigma := \sup_{\psi\in\Psi}\sigma_\psi$. We now provide two intermediary lemmas necessary to analyze the impact of variance in the CD gradient. The strategy is similar in both: we change the integration measure from $p_{\psi_\star}$ to $p_\psi$ to obtain contraction, at the cost of an additional term scaling with $C_\chi\|\psi-\psi_\star\|$. We obtain exact constants, which we choose to express in terms of the smoothness parameters of the problem, e.g. the $k$th derivatives of the log partition function $\log Z$, which, for $k\ge2$, equal the $k$th derivatives of the negative cross-entropy of the model with respect to $p_{\psi_\star}$.

Second moment convergence. The following lemma guarantees that the second moment of a sample drawn from $m$ steps of the kernel $k^m_\psi$ started at $p_{\psi_\star}$ approaches the second moment of a sample from the target distribution $p_\psi$.

Lemma D.1. Under A1, A2 and A3, for all $\psi\in\Psi$, we have
$$
\big| \mathbb E_{p_{\psi_\star}}\big[ P^m_\psi \|\phi\|^2 \big] - \mathbb E_{p_\psi}\big[\|\phi\|^2\big] \big| \;\le\; \alpha^m C_\chi\, \|\psi-\psi_\star\|\, \|\log Z\|_{1,\infty}\,,
$$
$$
\|\log Z\|_{1,\infty} := \sup_{\psi\in\Psi} \sum_{i=1}^p \Big( 4\,\partial_i^1\log Z(\psi)^2\, \partial_i^2\log Z(\psi) + 2\,\partial_i^2\log Z(\psi)^2 + 4\,\partial_i^1\log Z(\psi)\,\partial_i^3\log Z(\psi) + \partial_i^4\log Z(\psi) \Big)^{1/2} < +\infty\,.
$$

Proof. Applying Lemma C.7 to each $f_i := \phi_i^2$, we have
$$
\big| \mathbb E_{p_{\psi_\star}} P^m_\psi \phi_i^2 - \mathbb E_{p_\psi}\phi_i^2 \big| \;\le\; \alpha^m C_\chi\, \|\psi-\psi_\star\|\, \Big( \mathbb E_{p_\psi}\big[(\phi_i^2 - \mathbb E_{p_\psi}\phi_i^2)^2\big] \Big)^{1/2} = \alpha^m C_\chi\, \|\psi-\psi_\star\|\, \Big( \mathbb E_{p_\psi}\phi_i^4 - (\mathbb E_{p_\psi}\phi_i^2)^2 \Big)^{1/2}.
$$
We map the two moments to derivatives of $\log Z(\psi)$, since the $k$th derivative of $\log Z(\psi)$ is the $k$th cumulant. It can be shown, using the multivariate moment-to-cumulant mapping, that
$$
\mathbb E\,\phi_i^4 = \partial_i^1\log Z(\psi)^4 + 6\,\partial_i^1\log Z(\psi)^2\,\partial_i^2\log Z(\psi) + 3\,\partial_i^2\log Z(\psi)^2 + 4\,\partial_i^1\log Z(\psi)\,\partial_i^3\log Z(\psi) + \partial_i^4\log Z(\psi)\,,
$$
where $\partial_i^k\log Z(\psi)$ denotes the $k$th derivative of $\log Z$ with respect to $\psi_i$. On the other hand,
$$
\mathbb E\,\phi_i^2 = \partial_i^1\log Z(\psi)^2 + \partial_i^2\log Z(\psi)
\;\Longrightarrow\;
(\mathbb E\,\phi_i^2)^2 = \partial_i^1\log Z(\psi)^4 + 2\,\partial_i^1\log Z(\psi)^2\,\partial_i^2\log Z(\psi) + \partial_i^2\log Z(\psi)^2\,,
$$
so that
$$
\mathbb E_{p_\psi}\phi_i^4 - (\mathbb E_{p_\psi}\phi_i^2)^2 = 4\,\partial_i^1\log Z(\psi)^2\,\partial_i^2\log Z(\psi) + 2\,\partial_i^2\log Z(\psi)^2 + 4\,\partial_i^1\log Z(\psi)\,\partial_i^3\log Z(\psi) + \partial_i^4\log Z(\psi)\,.
$$
The result follows by summing over $i$, since
$$
\big| \mathbb E_{p_{\psi_\star}} P^m_\psi\|\phi\|^2 - \mathbb E_{p_\psi}\|\phi\|^2 \big| \;\le\; \sum_{i=1}^p \big| \mathbb E_{p_{\psi_\star}} P^m_\psi \phi_i^2 - \mathbb E_{p_\psi}\phi_i^2 \big|\,.
$$
Note that $\|\log Z\|_{1,\infty}$ is finite since $\Psi$ is compact and $\log Z$ is analytic.
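The moment-to-cumulant computations in the proof above can be spot-checked numerically. The sketch below (ours, not from the paper) uses a Poisson exponential family with sufficient statistic $\phi(x)=x$, for which $\log Z(\psi) = e^\psi$, so that every derivative $\partial^k\log Z(\psi)$ equals $\lambda = e^\psi$:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 1.7
x = rng.poisson(lam, size=2_000_000).astype(float)

# For a Poisson family, all cumulants of the sufficient statistic equal lambda,
# i.e. the k-th derivative of log Z is lambda for every k.
k1 = k2 = k3 = k4 = lam
lhs = np.mean(x**4) - np.mean(x**2)**2          # E[phi^4] - (E[phi^2])^2
rhs = 4*k1**2*k2 + 2*k2**2 + 4*k1*k3 + k4       # cumulant expression from Lemma D.1
print(lhs, rhs)                                 # agree up to Monte Carlo error
assert abs(lhs - rhs) / rhs < 0.05
```

The asserted identity is exactly the per-coordinate quantity $\mathbb E_{p_\psi}\phi_i^4 - (\mathbb E_{p_\psi}\phi_i^2)^2$ entering $\|\log Z\|_{1,\infty}$.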
Squared first moment convergence. The next lemma provides convergence (in squared norm) of the first moment of the $m$-times iterated Markov kernel $k^m_\psi$.

Lemma D.2. Under A1, A2 and A3, for all $\psi\in\Psi$, we have
$$
\Big| \big\| \mathbb E_{p_{\psi_\star}}\big[P^m_\psi \phi\big] \big\|^2 - \big\|\mathbb E_{p_\psi}\phi\big\|^2 \Big| \;\le\; \alpha^{2m}\sigma_\psi^2 + C_\chi\,\alpha^{m/2}\,\|\log Z\|_{2,\infty}\,\|\psi-\psi_\star\|\,,
$$
$$
\|\log Z\|_{2,\infty} := \sup_{\psi\in\Psi}\sum_{i=1}^p \Big( \big(F(\psi)\,\partial_i^2\log Z(\psi)\big)^{1/4} + 2\,\big|\partial_i^1\log Z(\psi)\big|\,\partial_i^2\log Z(\psi)^{1/2} \Big)
$$
and
$$
F(\psi) := 15\,\partial_i^2\log Z(\psi)^3 + 10\,\partial_i^3\log Z(\psi)^2 + 15\,\partial_i^2\log Z(\psi)\,\partial_i^4\log Z(\psi) + \partial_i^6\log Z(\psi)\,.
$$

Proof. Writing $\Delta := \mathbb E_{p_{\psi_\star}} P^m_\psi\phi_i - \mathbb E_{p_\psi}\phi_i$, we have
$$
\big(\mathbb E_{p_{\psi_\star}} P^m_\psi \phi_i\big)^2 = \big(\Delta + \mathbb E_{p_\psi}\phi_i\big)^2 = \Delta^2 + 2\,\mathbb E_{p_\psi}\phi_i\,\Delta + \big(\mathbb E_{p_\psi}\phi_i\big)^2\,,
$$
so that
$$
\Big| \big(\mathbb E_{p_{\psi_\star}} P^m_\psi\phi_i\big)^2 - \big(\mathbb E_{p_\psi}\phi_i\big)^2 \Big| \;\le\; \underbrace{\Delta^2}_{\Delta_1} \;+\; \underbrace{2\,\big|\mathbb E_{p_\psi}\phi_i\big|\,|\Delta|}_{\Delta_2}\,.
$$
For $\Delta_1$, by Jensen's inequality,
$$
\Delta_1 = \Big( \mathbb E_{p_{\psi_\star}}\big[P^m_\psi(\phi_i - \mathbb E_{p_\psi}\phi_i)\big] \Big)^2 \le \int \big(P^m_\psi(\phi_i - \mathbb E_{p_\psi}\phi_i)\big)^2\, dp_{\psi_\star}
$$
$$
\overset{(a)}{\le} \alpha^{2m}\,\mathbb E_{p_\psi}\big(\phi_i - \mathbb E_{p_\psi}\phi_i\big)^2 + C_\chi\,\|\psi-\psi_\star\|\,\Big(\mathbb E_{p_\psi}\big(P^m_\psi(\phi_i-\mathbb E_{p_\psi}\phi_i)\big)^4\Big)^{1/2}
$$
$$
\overset{(b)}{\le} \alpha^{2m}\,\mathbb E_{p_\psi}\big(\phi_i-\mathbb E_{p_\psi}\phi_i\big)^2 + C_\chi\,\|\psi-\psi_\star\|\,\Big(\mathbb E_{p_\psi}\big(P^m_\psi(\phi_i-\mathbb E_{p_\psi}\phi_i)\big)^2\Big)^{1/4}\Big(\mathbb E_{p_\psi}\big(P^m_\psi(\phi_i-\mathbb E_{p_\psi}\phi_i)\big)^6\Big)^{1/4}
$$
$$
\overset{(c)}{\le} \alpha^{2m}\,\mathbb E_{p_\psi}\big(\phi_i-\mathbb E_{p_\psi}\phi_i\big)^2 + \alpha^{m/2}\, C_\chi\,\|\psi-\psi_\star\|\,\Big(\mathbb E_{p_\psi}\big(\phi_i-\mathbb E_{p_\psi}\phi_i\big)^2\Big)^{1/4}\Big(\mathbb E_{p_\psi}\big(\phi_i-\mathbb E_{p_\psi}\phi_i\big)^6\Big)^{1/4}.
$$
In (a), we split the integral against $p_{\psi_\star}$ into an integral against $p_\psi$ (the term $\Delta_{1,1}$) and one against $p_{\psi_\star}-p_\psi$ (the term $\Delta_{1,2}$), using the restricted spectral gap assumption A3 for $\Delta_{1,1}$, and the Cauchy–Schwarz inequality combined with assumption A2 for $\Delta_{1,2}$. In (b), we used Cauchy–Schwarz once again, and in (c) we used the fact that $P^m_\psi$ is a contraction in $L^6(p_\psi)$ together with another invocation of the spectral gap assumption A3.

As an aside, note that a simpler result can be obtained by making a regularity assumption on the mapping $\psi\mapsto P^m_\psi$. Assuming for instance that $\psi\mapsto P^m_\psi(x)$ is uniformly $L_m$-Lipschitz across $x\in\mathcal X$ (as done in [21, Assumption 5]), the second term $\Delta_{1,2}$ of $\Delta_1$ could have been handled using
$$
\Delta_{1,2} \;\le\; 2\,\mathbb E_{p_{\psi_\star}}\big\|P^m_\psi\phi - \mathbb E_{p_\psi}\phi\big\|^2 + 2\,\big\|\mathbb E_{p_\psi}\phi - \mathbb E_{p_{\psi_\star}}\phi\big\|^2 \;\le\; 4\big(L_m\|\psi-\psi_\star\| + \sigma^2\alpha^{2m} + 2L^2\|\psi-\psi_\star\|^2\big)\,.
$$
Although this result does not require possibly large constants related to sixth-order moments, it is less tight in the sense that it does not go to $0$ as $m\to\infty$.

Back to the main proof, and $\Delta_2$ in particular. Applying Lemma C.7 to $f:=\phi$, we have
$$
\Delta_2 \;\le\; 2\,\alpha^m\,\big|\mathbb E_{p_\psi}\phi_i\big|\, C_\chi\,\|\psi-\psi_\star\|\,\Big(\mathbb E_{p_\psi}(\phi_i-\mathbb E_{p_\psi}\phi_i)^2\Big)^{1/2} = 2\,\alpha^m C_\chi\,\big|\partial_i^1\log Z(\psi)\big|\,\partial_i^2\log Z(\psi)^{1/2}\,\|\psi-\psi_\star\|\,.
$$
Putting everything together and using $\alpha^m\le\alpha^{m/2}$, then summing over $i$, we obtain
$$
\Big| \big\|\mathbb E_{p_{\psi_\star}} P^m_\psi\phi\big\|^2 - \big\|\mathbb E_{p_\psi}\phi\big\|^2 \Big| \le \sum_{i=1}^p \Big| \big(\mathbb E_{p_{\psi_\star}} P^m_\psi\phi_i\big)^2 - \big(\mathbb E_{p_\psi}\phi_i\big)^2 \Big| \le \alpha^{2m}\sigma_\psi^2 + C_\chi\,\alpha^{m/2}\,\|\log Z\|_{2,\infty}\,\|\psi-\psi_\star\|\,,
$$
with
$$
\|\log Z\|_{2,\infty} = \sup_{\psi\in\Psi}\sum_{i=1}^p \Big( \big(\mathbb E_{p_\psi}(\phi_i-\mathbb E_{p_\psi}\phi_i)^6\big)^{1/4}\big(\mathbb E_{p_\psi}(\phi_i-\mathbb E_{p_\psi}\phi_i)^2\big)^{1/4} + 2\,\big|\partial_i^1\log Z(\psi)\big|\,\partial_i^2\log Z(\psi)^{1/2} \Big)\,.
$$
Similarly to the previous lemma, one can express $\mathbb E(\phi_i-\mathbb E_{p_\psi}\phi_i)^6$ using the centered moment-to-cumulant formula:
$$
\mathbb E_{p_\psi}(\phi_i-\mathbb E\phi_i)^6 = 15\,\partial_i^2\log Z(\psi)^3 + 10\,\partial_i^3\log Z(\psi)^2 + 15\,\partial_i^2\log Z(\psi)\,\partial_i^4\log Z(\psi) + \partial_i^6\log Z(\psi) = F(\psi)\,,
$$
which gives the stated description of $\|\log Z\|_{2,\infty}$.

Lemma D.3. Under A1, A2 and A3, for all $\psi\in\Psi$, we have
$$
\Big| \mathbb E_{p_{\psi_\star}}\big\| P^m_\psi \phi \big\|^2 - \big\|\mathbb E_{p_\psi}\phi\big\|^2 \Big| \;\le\; \alpha^{2m}\sigma_\psi^2 + C_\chi\,\alpha^{m/2}\,\|\log Z\|_{2,\infty}\,\|\psi-\psi_\star\|\,,
$$
with $\|\log Z\|_{2,\infty}$ and $F$ as in Lemma D.2.

Proof. The argument is identical to that of Lemma D.2, except that the Jensen step is no longer needed: expanding
$$
\mathbb E_{p_{\psi_\star}}\big(P^m_\psi\phi_i\big)^2 = \mathbb E_{p_{\psi_\star}}\big(P^m_\psi(\phi_i - \mathbb E_{p_\psi}\phi_i)\big)^2 + 2\,\mathbb E_{p_\psi}\phi_i\;\mathbb E_{p_{\psi_\star}} P^m_\psi(\phi_i - \mathbb E_{p_\psi}\phi_i) + \big(\mathbb E_{p_\psi}\phi_i\big)^2\,,
$$
the first term is exactly the quantity bounded in the chain (a)–(c) above, the cross term is $\Delta_2$, and the same bounds follow.
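The $\alpha^m$ decay in Lemmas D.1–D.3 can be observed exactly on a toy model. In the sketch below (our construction, not from the paper), we take $p_\psi = \mathcal N(\psi, 1)$ with $\phi(x) = x$, and an AR(1) kernel that leaves $p_\psi$ invariant and whose spectral gap is exactly $\alpha$; the first-moment bias $\mathbb E_{p_{\psi_\star}}[P^m_\psi\phi] - \mathbb E_{p_\psi}\phi$ is then $\alpha^m(\psi_\star-\psi)$ in closed form:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, m = 0.8, 5
psi_star, psi = 0.0, 2.0
n = 1_000_000

# X ~ p_{psi_star} = N(psi_star, 1); the AR(1) kernel below leaves
# p_psi = N(psi, 1) invariant and has spectral gap exactly alpha.
x = rng.normal(psi_star, 1.0, size=n)
for _ in range(m):
    x = alpha * x + (1 - alpha) * psi + np.sqrt(1 - alpha**2) * rng.normal(size=n)

bias = x.mean() - psi                        # E_{p_psi*}[P^m_psi phi] - E_{p_psi}[phi]
print(bias, alpha**m * (psi_star - psi))     # both close to alpha^m (psi_star - psi)
```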
We can now use the previous lemmas to obtain a bound on the second moment of the contrastive divergence gradient estimator, relating it to that of the stochastic log-likelihood gradient estimator.

Lemma D.4. Under A1, A2 and A3, we have
$$
\mathbb E\,\|h_t(\psi, X_t)\|^2 \;\le\; 2\sigma_\star^2 + 2\sigma_\psi^2 + 2L^2\|\psi-\psi_\star\|^2 + 4\big(\sigma_\psi^2\,\alpha^{2m} + \alpha^{m/2}\,\|\log Z\|_{3,\infty}\, C_\chi\,\|\psi-\psi_\star\|\big)\,,
$$
where $\|\log Z\|_{3,\infty} := 2\max\big(\|\log Z\|_{1,\infty},\, \|\log Z\|_{2,\infty}\big)$.

Proof. We rely on the following decomposition:
$$
h_t(\psi, X_t) = \underbrace{-\big(\phi(X_t) - \mathbb E_{p_{\psi_\star}}\phi\big)}_{\Delta_{1,1}} \underbrace{-\big(\mathbb E_{p_{\psi_\star}}\phi - \mathbb E_{p_\psi}\phi\big)}_{\Delta_{1,2}} + \underbrace{\big(\phi(k^m_\psi(X_t,\cdot)) - \mathbb E_{k^m_\psi(X_t,\cdot)}\phi\big)}_{\Delta_2} + \underbrace{\big(\mathbb E_{k^m_\psi(X_t,\cdot)}\phi - \mathbb E_{p_\psi}\phi\big)}_{\Delta_3}\,,
$$
where $\mathbb E_{k^m_\psi(X_t,\cdot)}$ denotes the expectation over the Markov chain sample conditionally on $X_t$. The sum $\Delta_{1,1}+\Delta_{1,2}$ forms the differentiable stochastic gradient $g_t$ of Equation 6. Note that $\Delta_{1,1}$ is mean-zero, and $\Delta_2$ is mean-zero conditionally on $X_t$. Consequently, $\mathbb E\langle\Delta_2,\Delta_3\rangle = \mathbb E\langle\Delta_2,\Delta_{1,1}\rangle = \mathbb E\langle\Delta_{1,1},\Delta_{1,2}\rangle = \mathbb E\langle\Delta_{1,1},\Delta_2\rangle = 0$, and the only mixed terms that remain to be controlled are $\mathbb E\langle\Delta_{1,1},\Delta_3\rangle$ and $\mathbb E\langle\Delta_{1,2},\Delta_3\rangle$.

We first control the unmixed terms, and the simple ones first: we have $\mathbb E\|\Delta_{1,1}\|^2 = \sigma_\star^2$, as well as $\mathbb E\|\Delta_{1,2}\|^2 \le L^2\|\psi-\psi_\star\|^2$. For $\Delta_2$, we have
$$
\mathbb E_{k^m_\psi(x,\cdot)}\|\Delta_2\|^2 = \mathbb E_{k^m_\psi(x,\cdot)}\big\|\phi(k^m_\psi(x,\cdot))\big\|^2 - \big\|\mathbb E_{k^m_\psi(x,\cdot)}\phi(k^m_\psi(x,\cdot))\big\|^2 = P^m_\psi\|\phi\|^2(x) - \big\|P^m_\psi\phi(x)\big\|^2\,.
$$
We can invoke Lemmas D.3 and D.1, which guarantee
$$
\mathbb E_{p_{\psi_\star}}\big\|P^m_\psi\phi\big\|^2 \;\ge\; \big\|\mathbb E_{p_\psi}\phi\big\|^2 - \alpha^{2m}\sigma_\psi^2 - C_\chi\alpha^{m/2}\,\|\log Z\|_{2,\infty}\,\|\psi-\psi_\star\|\,,\qquad
\mathbb E_{p_{\psi_\star}}\big[P^m_\psi\|\phi\|^2\big] \;\le\; \mathbb E_{p_\psi}\|\phi\|^2 + \alpha^m C_\chi\,\|\log Z\|_{1,\infty}\,\|\psi-\psi_\star\|\,,
$$
so that
$$
\mathbb E_{p_{\psi_\star}}\mathbb E_{k^m_\psi(x,\cdot)}\|\Delta_2\|^2 \;\le\; \mathbb E_{p_\psi}\big\|\phi - \mathbb E_{p_\psi}\phi\big\|^2 + \alpha^{2m}\sigma_\psi^2 + \alpha^{m/2}\,\|\log Z\|_{3,\infty}\, C_\chi\,\|\psi-\psi_\star\|\,,
$$
where $\|\log Z\|_{3,\infty} := 2\max(\|\log Z\|_{1,\infty}, \|\log Z\|_{2,\infty})$ and we used $\alpha^m\le\alpha^{m/2}$. For $\Delta_3$, notice that $\|\Delta_3\|^2$ integrates to precisely the term $\Delta_1$ in Lemma D.3, and we can thus bound it by
$$
\mathbb E_{p_{\psi_\star}}\|\Delta_3\|^2 \;\le\; \sigma_\psi^2\,\alpha^{2m} + \alpha^{m/2}\,\|\log Z\|_{2,\infty}\, C_\chi\,\|\psi-\psi_\star\|\,.
$$
Finally, we simply bound $2\,\mathbb E\langle\Delta_{1,1},\Delta_3\rangle$ by $\mathbb E\|\Delta_{1,1}\|^2 + \mathbb E\|\Delta_3\|^2$, and $2\,\mathbb E\langle\Delta_{1,2},\Delta_3\rangle$ by $\mathbb E\|\Delta_{1,2}\|^2 + \mathbb E\|\Delta_3\|^2$.
Putting everything together, we have: E ht(ψ) 2 = E 1,1 2 + E 1,2 2 + E 2 2 + E 3 2 + 2E 1,1, 3 + 2E 1,2, 3 2E 1,1 2 + 2E 1,2 2 + E 2 2 + 3E 3 2 2σ2 + 2L2 ψ ψ 2 + σ2 ψ + 4(σ2 ψα2m + αm/2 log Z 3, Cχ ψ ψ ) 2σ2 + 2σ2 ψ + 2L2 ψ ψ 2 + 4(σ2 ψα2m + αm/2 log Z 3, Cχ ψ ψ ) D.2 Proof of the SGD recursion (Lemma 3.1) We are now ready to provide an SGD-style recursion for the expected squared distance to the optimum E h ψt ψ 2i . Lemma (Restatement of Lemma 3.1). Let ψt be the iterates produced by Algorithm 1. Denote δt = E ψt ψ 2 , σ = (Epψ ϕ Epψ ϕ 2)1/2, and σt = (Epψt ϕ Epψtϕ 2)1/2. Then, under Assumptions A1, A2 and A3, for all t 1, we have: δt (1 2ηt µm,t 1 + 2η2 t L2)δt 1 + η2 t σ2 m,t 1 + 4αm/2η2 t log Z 2, Cχδ1/2 t 1 where log Z 3, is a constant, µm,t := µ αmσt Cχ, and σm,t := (σ2 + σ2 t + 2σ2 t α2m)1/2. Proof. In this proof, we note (Ft)t 0, the increasing family of σ-algebras generated by the random variables (Xt)t 0 pψ and the Markov chain samples Xm t |Xt, ψt km ψt(Xt, ). We decompose the integrand of δt as follows: ψt ψ n 2 = ProjΨ(ψt 1 ηtht(ψt 1)) ψ n 2 ψt 1 ηtht(ψt 1) ψ n 2 (By Lemma C.4) = ψt 1 ψ 2 2ηt ht(ψt 1), ψt 1 ψ n + η2 t ht(ψt 1) 2 The first term is (up to an averaging operation) the previous iterate. The middle term will ensure (provided m is large enough) contraction of the expected distance to the optimum. Finally, the third term can be described by Lemma D.4, and essentially behaves like the second moment of a log-likelihood stochastic gradient. Indeed, noting g(ψ) := Epψ ϕ + Epψϕ the expectation of gt w.r.t xt (which is the gradient of the negative cross-entropy between pψ and pψ ), we have: ht(ψn,t 1), ψt 1 ψ n = g(ψn,t 1), ψt 1 ψ n + (ht(ψn,t 1) g(ψn,t 1)), ψt 1 ψ n | {z } and applying Lemma C.7, we get that ht(ψt 1) g(ψt 1) = ϕ(km(Xt, )) Epψϕ = Epψ Ekm ψ (Xt, ) [ht(ψt 1) g(ψt 1)|Ft 1] = P m ψ ϕ Epψϕ , meaning |E [ |Ft 1]| Epψ P m ψ (ϕ Epψϕ) ψt 1 ψ αm Cχ(Epψt 1 f Epψt 1f 2 )1/2 ψ ψ 2 αmσt 1Cχ ψ ψ 2 On the other hand, by applying Lemma C.8 to g, we have: g(ψt 1), ψt 1 ψ = g(ψt 1) g(ψ ), ψt 1 ψ µ ψt 1 ψ 2 Combining the above results, we obtain: Epψ Ekm ψ (x, ) h ψt ψ 2 |Ft 1 i (1 2ηt(µ αmσt 1Cχ)) ψt 1 ψ 2 + η2 t (2σ2 + 2σ2 t 1 + 2L2 ψt 1 ψ 2 + 4(σ2 t 1α2m + αm/2 log Z 3, Cχ ψt 1 ψ )) And the result follows by integrating over Ft 1. D.3 Proof of Online CD convergence We now prove Theorem 3.2. The recursion of Lemma 3.1 is almost identifiable, up to a cross-term of second order, with the one of an SGD algorithm as presented in the setting of [22, Theorem 1]. To make the identification exact, we use the bound 4αm/2η2 t log Z 3, Cχδ1/2 t 1 2αm/2η2 t δt + 2αm/2η2 t log Z 2 3, C2 χ, yielding the following recursion: δt (1 2ηt(µ αmσCχ) + 2η2 t (L2+αm/2))δt 1 + (σ2(2 + 2α2m)+αm/2 log Z 2 3, C2 χ)η2 t where we used the fact that σm,t σ. This recursion is of the same form as the one studied in [22, Equation 6, Theorem 1] given by: δt 1 2µγt + 2L2γ2 t δt 1 + 2σ2γ2 t by identifying: σ2 σ2(2 + 2α2m) + αm/2 log Z 2 3, =: σ2 m L2 (L2 + αm/2) =: L2 µ µ αmσCχ =: µm γt ηt We can use the same unrolling strategy as theirs (the only condition required to proceed is that µm > L, which automatically holds since µ < L), and we obtain 2 exp 4 LC2φ1 2β(n) exp µm C 4 n1 β δ0 + σ2 m L2 + 4C σ2 m µmnβ , if 0 β < 1 exp(2 L2C2) n µm C δ0 + σ2 m L2 + 2 σ2 m C2 φ µm C/2 1(n) n µm C/2 , if β = 1. D.4 Proof of online CD with averaging (Theorem 3.3) We first restate the theorem in its complete form. Theorem (Contrastive Divergence with Polyak-Ruppert averaging). 
Let (ψt)t 0 the sequence of iterates obtained by running the CD algorithm with a learning rate ηt = Ct β for β ( 1 2, 1). Define ψn := 1 n Pn i=1 ψi. Then, under the same assumptions as Theorem 3.2 we have, for all n 1, n + O nmax( ( 1 2 +m |log α| Where I(ψ ) := Covpψ [ϕ] is the Fisher information matrix of the data distribution. Additionally, if m > (1 β) log n 2| log α| , we have q n + o n 1/2 . Throughout the proof, we will denote by hn the standard online CD gradient defined in Equation 3: hn(ψn 1) = ϕ(Xn) + ϕ(km ψn 1( , Xn)), Xn pψ , n N \ {0}, h(ψn 1) := E [h(ψn 1) | ψn 1] = Epψ ϕ(Xn) + Epψ Ekm ψn 1ϕ(km ψn 1(Xn, )). We start by establishing some intermediate lemmas. Lemma D.5. Under Assumptions A1, A2, A3, the online CD iterates produced by Algorithm 1 using ηt = Ct β for β ( 1 2, 1) verify E h ψi ψ 2i = O(n 1 Proof. Let us note δn = E h ψn ψ 2i . Summing the r.h.s of Theorem 3.2, we have i=1 4C σ2 m µmiβ + 2 δ0 + σ2 m L2 i=1 e4 LC2φ1 2β(i)e µm C 4C σ2 m µm φ1 β(n) + 1 r 2(δ0 + σ2 m L2 )A3 = O n 1 where A3 is finite if β < 1, and A3 = O(n) otherwise [22]. Lemma D.6. Under Assumptions A1, A2, A3, the online CD iterates produced by Algorithm 1 using ηt = Ct β for β ( 1 2, 1) verify E h ψi ψ 2i 1/2 = O(n 1 Proof. Let us note δn = E h ψn ψ 2i . Applying x + y x + y to the r.h.s of Theorem 3.2, we have i=1 δ1/2 i 2C1/2 σm i=1 i β/2 + 2 δ0 + σ2 m L2 i=1 e2 L2C2φ1 2β(i)e µm C i=1 δ1/2 i 1 µ1/2 m φ1 β/2(n) + 1 where A4 is finite if β < 1, and A4 = O(n) otherwise [22]. Lemma D.7. Under Assumptions A1, A2, A3, the online CD iterates produced by Algorithm 1 using ηt = Ct β for β ( 1 2, 1) verify s i=1 hi(ψi 1) 2 = O n β 2 1 . Proof. The result follows from the fact that ht verifies ht(ψt 1) = 1 ηt (ψt 1 ψt) . A similar quantity was handled in the case of standard SGD [22, Theorem 3], and the only condition needed to reuse their steps is that (ψt)t n satisfies an upper bound of the same form as the one [22, Theorem 1] derived. This is precisely the nature of our bound of ψt estblished in Theorem 3.2, with µm, σm, L. Borrowing on their result, we have: s i=1 hi(ψi 1) 2 4 σmβ C1/2n µm φβ/2(n) + 4β δ0 + σ2 m L2 C + 2 L δ1/2 0 + 2 L µ1/2 m φ1 β(n)1/2 δ0 + σ2 m L2 where µm, σm and L are defined in 3.2, and 16 k1 β+16 L4 1C4φ1 2β(k) Lemma D.8. Under Assumptions A1, A2, A3, the online CD iterates produced by Algorithm 1 using ηt = Ct β for β ( 1 2, 1) verify E h ψi ψ 4i = O(n β). Proof. We proceed as in the proof of [22, Theorem 3], first establishing a recurrence for E h ψi ψ 4i , and then unrolling it. We have E h ψn ψ 4 | Fn 1 i ψn 1 ψ 4 + 6η2 n ψn 1 ψ 2 E h hn (ψn 1) 2 | Fn 1 i + η4 n E h hn (ψn 1) 4 | Fn 1 i 4ηn ψn 1 ψ 2 ψn 1 ψ , E [hn] + 4η3 n ψn 1 ψ E h hn (ψn 1) 3 | Fn 1 i . The second and fourth terms will be controlled using results from our previous sections. For simplicity, we don t attempt to relate the moments of hn 4 as precisely as before. Instead we use: Epψ h hn ki 2k 1 Epψ Ekm pψn 1 h ϕ(km ψn 1(Xn, )) ki + Epψ ϕ(Xn) k And let us note τ = supψ Ψ Epψ ϕ 8 1/8 . We have, for k 4, Epψ Ekm pψn 1 ϕ(km ψn 1(Xn, )) k Epψn 1 ϕ k + Cχ(Epψn 1( ϕ k Epψn 1 ϕ k)2)1/2 ψn 1 ψ τ k + 2Cχτ k ψn 1 ψ Where we used the fact that P m ψn 1 is a contraction, and (E ϕ k)1/k is an increasing function of k. On the other hand, we simply have Epψ ϕ(Xn, ) k τ k. 
Plugging this into the previous equation, we obtain E h ψn ψ 4 | Fn 1 i ψn 1 ψ 4 + 6η2 n ψn 1 ψ 2 4τ 2 + 4Cχτ 2 ψn 1 ψ + η4 n(16τ 4 + 16Cχτ 4 ψn 1 ψ ) 4ηn ψn 1 ψ 2 ψn 1 ψ , E [hn|Fn 1] + 4η3 n ψn 1 ψ 8τ 3 + 8Cχτ 3 ψn 1 ψ To simplify the recursion, we use the four following inequalities: τ 2η2 n ψn 1 ψ 3 1 2(τ 2η2 n( ψn 1 ψ 2 + τ 2η2 n ψn 1 ψ 4) τ 4η4 n ψn 1 ψ (τ 4η4 n + 1 4τ 4η4 n ψn 1 ψ 4)) η3 nτ 3 ψn 1 ψ 1 2(η2 nτ 2 ψn 1 ψ 2 + 16η4 nτ 4 τ 3η3 n ψn 1 ψ 2 1 2(η4 nτ 4 + ψn 1 ψ 2 η2 nτ 2) Injecting them in our recursion, we obtain: E h ψn ψ 4 | Fn 1 i ψn 1 ψ 4 + 12η2 n ψn 1 ψ 2 τ 2 + 12Cχτ 2η2 n ψn 1 ψ 2 + 12Cχτ 2η2 n ψn 1 ψ 4 + 16η4 nτ 4 + 16Cχη4 nτ 4 + 4Cχη4 nτ 4 ψn 1 ψ 4 4ηn µm ψn 1 ψ 4 + 16η2 nτ 2 ψn 1 ψ 2 + 16η4 nτ 4 + 16η4 n Cχτ 4 + 16η2 n Cχτ 2 ψn 1 ψ 2 , which, after further simplifications, yields E ψn ψ 4 | Fn 1 ψn 1 ψ 4 (1 4ηn µm + 12Cχη2 nτ 2 + 4Cχτ 4η4 n) + η2 n ψn 1 ψ 2 (28(1 + Cχ)τ 2) + 32η4 n(τ 4(1 + Cχ)) ψn 1 ψ 4 (1 4ηn µm + 12(1 + Cχ)η2 nτ 2 + 4(1 + Cχ)τ 4η4 n) + 28η2 n ψn 1 ψ 2 ((1 + Cχ)τ)2 + 32η4 n(1 + Cχ)τ)4 ψn 1 ψ 4 (1 4ηn µm + 12η2 n((1 + Cχ)τ)2 + 4((1 + Cχ)τ)4η4 n) + 28η2 n ψn 1 ψ 2 ((1 + Cχ)τ)2 + 32η4 n(1 + Cχ)τ)4 ψn 1 ψ 4 (1 4ηn µm + 12η2 n(2(1 + Cχ)τ)2 + 16η2 n(2(1 + Cχ)τ)3 + 4(2(1 + Cχ)τ)4η4 n) + 20η2 n ψn 1 ψ 2 (2(1 + Cχ)τ)2 + 16η4 n(2(1 + Cχ)τ)4 ψn 1 ψ 4 (1 4ηn µm + 12η2 n(2(1 + Cχ)τ + L)2 + 16η2 n(2(1 + Cχ)τ + L)3 + 4(2(1 + Cχ)τ + L)4η4 n) + 20η2 n ψn 1 ψ 2 (2(1 + Cχ)τ)2 + 16η4 n(2(1 + Cχ)τ)4 ψn 1 ψ 4 (1 4ηn µm + 12η2 n L2 1 + 16η2 n L3 1 + 4 L4 1η4 n) + 20η2 n ψn 1 ψ 2 τ 2 1 + 16η4 n τ 4 2 where we defined τ1 := 2(1 + Cχ)τ and L1 := 2(1 + Cχ)τ + L. This recursion is of the form of the one studied in [22, Equation 32] (note that by design, L µm.) The steps performed to bound E h ψi ψ 4i thus follow from their derivations, and we obtain: i=1 E ψi 1 ψ 4 C1/2φ1 3β/2(n) + µ 1/2 m φ1 β(n) 2n A1 exp 24 L4 1C4 δ0 + µm E θ0 θ 4 20C τ 2 1 + 2 τ 2 1 C3 µm + 8 τ 2 1 C2 1/2 where A1 = Xn k=1 e µC 16 k1 β+16 L4 1C4φ1 2β(k) and we have A(1) < + if β < 1, and A(1) = O(n) otherwise. Lemma D.9. For all ψ Ψ, we have: Cov ϕ(km ψ (Xn, )) Covpψ [ϕ(Xn)] F αm Cχ( τ 1/2 + 2 log Z 4, σ) ψ ψ + α2m C2 χσ2 ψ ψ 2 . where τ := supψ Ψ Epψ h ϕϕ Epψ[ϕϕ ] 2 F i < + and log Z 4, := supψ Ψ Eψϕ i=1 2 i log Z(ψ)2. Proof. We have Cov ϕ(km ψ (Xn, )) = Epψ P m ψ ϕϕ Epψ [P m ψ ϕ] Epψ [P m ψ ϕ] Looking at the second moment first, we have Epψ P m ψ ϕϕ Epψϕϕ = Epψ P m ψ (ϕϕ Epψϕϕ ) | {z } 1 Applying lemma C.7 to the Rd2-valued function f given by fij := ϕiϕj Epψϕiϕj, 1 F ψ ψ αm Cχ Epψ h ϕϕ Epψϕϕ 2 F i τ 1/2αm Cχ ψ ψ . We now investigate the first moment. We have Epψ P m ψ ϕ = Epψ P m ψ ϕ Epψ [ϕ] | {z } 2,1 = (Epψ P m ψ ϕ)(Epψ P m ψ ϕ) EpψϕEpψϕ | {z } 2 = 2,1 T 2,1 + 2,1Epψϕ + Epψϕ 2,1 and thus, applying Lemma C.7 on 2,1, we have: 2 F 2,1 2,1 + 2,1Epψ ϕ + Epψ [ϕ] 2,1 F 2,1 2,1 F + 2 2,1 Epψ [ϕ] α2mσ2C2 χ ψ ψ 2 + 2 log Z 4, αm Cχσ ψ ψ , where we used that 2,1 2,1 F = 2,1 2. We can now combine our two matrix moment bounds to obtain Cov ϕ(km ψ (Xn, )) Covpψ [ϕ(Xn)] F = 1 + 2 F αm Cχ( τ 1/2 + 2 log Z 4, σ) ψ ψ + α2m C2 χσ2 ψ ψ 2 . Lemma D.10. Under Assumptions A1, A2, A3 it holds that i=1 f (ψ ) 1(h(ψi 1) hi(ψi 1)) 2 2 tr(I(ψ ) 1) I(ψ ) 2 1/2 F n (M + αm Cχ( τ 1/2 + 2 log Z 4, σ))1/2 Xn i=1 δ1/2 i 1 1/2 + α2m C2 χσ2 Xn i=1 δi 1 1/2 where M := supψ Ψ 3 log Z(ψ) op( F , F ) < + . Proof. 
f (ψ ) 1(h(ψn 1) hn(ψn 1)) = f (ψ ) 1 (ϕ(Xn) Epψ ϕ) | {z } 1,n +f (ψ ) 1 (ϕ(km ψn 1(Xn)) Epψ Ekm ψn 1ϕ(km ψn 1(Xn, ))) | {z } 2,n = f (ψ ) 1 1,n + f (ψ ) 1 2,n Noting the l.h.s of Lemma D.10, we have, summing over [n], and using Minkowski s inequality, i=1 f (ψ ) 1 1,i 2 + 1 i=1 f (ψ ) 1 2,i 2 Note that this step was made possible because we are looking at the square-root of the variance, which is unlike the recursion in Lemma 3.1. This allows to separate the terms and use fewer intermediaries than in the proof of Lemma 3.1. Since both 1,n and 2,n are martingale differences with respect to the filtration Fn 1, the covariance terms vanish, and we have i=1 E h f (ψ ) 1 1,i 2i + 1 i=1 E h f (ψ ) 1 2,i 2i i=1 tr(f (ψ ) 1E 1,i 1,if (ψ ) 1) i=1 tr(f (ψ ) 1E 2,i 2,if (ψ ) 1) tr(I(ψ ) 1) i=1 tr(f (ψ ) 1Cov h ϕ(km ψi 1(xi)) i f (ψ ) 1) tr(I(ψ ) 1) i=1 tr(f (ψ ) 1(Covpψ ϕ + (Covpψi 1ϕ Covpψ ϕ) + (Cov ϕ(km ψi 1(xi)) Covpψi 1 ϕ))f (ψ ) 1) 1/2 tr(I(ψ ) 1) tr(I(ψ ) 2) i=1 (Covpψn 1ϕ Covpψ ϕ F + Cov ϕ(km ψ (xi)) Covpψ ϕ) F 1/2 tr(I(ψ ) 1) tr(I(ψ ) 2) (M + αm Cχ( τ 1/2 + 2 log Z 4, σ)1/2) q Xn i=1 ψi 1 ψ + αm Cχσ q Xn i=1 ψi 1 ψ 2 tr(I(ψ ) 1) tr(I(ψ ) 2) (M + αm Cχ( τ 1/2 + 2 log z 4, σ))1/2 q Xn +αm Cχσ q Xn In (a) we used Lemma D.10, the cyclicity of the trace, tr(A B) AB F A F B F, and the fact that since Covpψi 1 [ϕ(xi)] = 2 ψL(ψi 1), by analycity of L, there exists a constant M := supψ Ψ 3 log Z(ψ) op( , F ) such that 2 ψL(ψi 1) 2 ψL(ψ ) F M ψi 1 ψ . In (b) we used Jensen s inequality to get E [ ψi 1 ψ ] r E h ψi 1 ψ 2i = δ1/2 i 1. We are now ready to prove Theorem 3.3. Proof of Theorem 3.3. It holds that: f (ψ )(ψn 1 ψ ) = f (ψn 1) f (ψ ) + (f (ψ )(ψn 1 ψ ) f (ψn 1) + f (ψ )) = hn(ψn 1) f (ψ ) + (f (ψ )(ψn 1 ψ ) f (ψn 1) + f (ψ )) + (f (ψn 1) h(ψn 1)) + (h(ψn 1) hn(ψn 1)). Applying on both sides: (a) a summation over i [n], (b) a multiplication by f (ψ ) 1, (c) p E[ 2], and using Minkowski s inequality on the r.h.s, we obtain s i=1 f (ψ ) 1hi(ψi 1) 2 i=1 f (ψ ) 1(f (ψ )(ψi 1 ψ ) f (ψi 1) 2 | {z } (ii) i=1 f (ψ ) 1(f (ψi 1) h(ψi 1)) 2 | {z } (iii) i=1 f (ψ ) 1(h(ψi 1) hi(ψi 1)) 2 | {z } (iv) (i) and (ii) have direct analogues in the proofs of prior work [22] on the convergence of (unbiased) SGD with Polyak-Ruppert averaging, and will be bounded similarly. (iii) captures the bias of the CD algorithm, while (iv) captures the variance. Bounding (i) Using Lemma D.7, we have (i) = O(n β 2 1). Bounding (ii) Since log Z(ψ) is analytic, there exists some constant M such that f (ψ )(ψi 1 ψ ) f (ψi 1) M ψi 1 ψ 2. Thus, we have: i=1 ψi ψ 2 2 M E h ψi ψ 4i = O(n β) where the second-to-last inequality used Minkowski s inequality, and the last applied Lemma D.8. Bounding (iii) By Minkowski s inequality, we have: E h f (ψn 1) h(ψn 1) 2i Moreover, using lemma C.7, we have: f (ψn 1) h(ψn 1) = Epψn 1ϕ Epψ Ekm ψn 1ϕ ϕ Epψn 1ϕ 2 Cχ ψn 1 ψ αmσCχ ψn 1 ψ . We thus obtain (iii) αm Cχ i=1(E ψi 1 ψ 2)1/2 = αm Cχ i=1 δ1/2 i 1. Recalling that δi satisfies Theorem 3.2, we have that δ 1 2n = O(n β/2), and we thus have Pn i=1 δ1/2 i = O(n1 β 2 ). By squaring the result of Lemma D.6, we have Pn i=1 δ1/2 i = O(n 1 β 2 ), and thus, we obtain that (iii) = O(αmn β/2) = O(n ( β 2 +m |log α| Bounding (iv) We have I(ψ ) 2 1/2 F n (M + αm Cχ( τ + 2 log Z 4, σ))1/2 q Xn i=1 δ1/2 i 1 +αm Cχσ q Xn Where in (a), we used Lemma D.10 and in (b), we used Lemma D.6 and D.5. 
Final bound Putting everything together, we have that: q n + O(nmax( ( 1 2 +m |log α| If, furthermore, m > (1 β) log n 2| log α| , we have 2 + m|log α| which concludes the proof. E L2 approximation by auxiliary gradient updates In this section, we consider different gradient update schemes starting from some random initialization θinit, and control the L2 distance between the different updates and the deterministic target ψ Ψ. Notation. Recall the notation that X1, . . . , Xn i.i.d. pψ , B = n/N and for ψ Ψ, Kψ(x) km ψ (x, ). We also write, for m N { }, Km 1;ψ(x) , . . . , Km n;ψ(x) i.i.d. km ψ (x, ) Let θinit be some Ψ-valued random initialization that is possibly correlated with X1, . . . , Xn. We capture the effect of correlation through the following quantities: For ϵ > 0 and ν > 2, let ϑinit n,m(ϵ) := P Pn i=1 E ϕ Km θinit(Xi) Xi, θinit E ϕ Km θinit(X i) θinit and εinit n,m;ν(ϵ) := q ϵ2 + κ2ν;m ϑinit n,m(ϵ) 2 ν 2 . We also consider the i.i.d. samples, drawn independently of X1, . . . , Xn and on a given ψ Ψ, as Xψ 1 , . . . , Xψ n i.i.d. pψ . For notational clarity, we shall use θm,B to denote parameters arising from an one-step update, where the subscripts m, B represent performing the one-step update with length-m Markov chains and with batch size B. This is to be distinguished from ψt elsewhere in the text, which denotes the parameter from the actual multi-step CD algorithm and the subscript t denotes the t-th CD iterate. Gradient update schemes. We consider five different updates. Let X 1 be an i.i.d. copy of X1 drawn independently of all other random variables. The SGD-with-replacement update is given by θSGDw m,B := F SGDw m,B (θinit) , where F SGDw m,B (ψ) := ψ η i Sw ϕ(Xi) ϕ Km i;ψ(Xi) and Sw is a uniformly drawn size-B subset of [n]. The SGD-without-replacement update, after renormalizing the learning rate, is given by the N-fold function composition θSGDo m,B := F SGDo m,B;N . . . F SGDo m,B;1(θinit) , where F SGDo m,B;j(ψ) := ψ η NB ϕ(Xi) ϕ Km i;ψ(Xi) for each j [N] , and So j s are disjoint size-B random subsets of [n], defined by (So 1, . . . , So N) = π([n]) for a uniformly drawn element π of the permutation group on n objects. The full-batch gradient update is given by θGD m := F GD m (θinit) , where F GD m (ψ) := ψ η ϕ(Xi) ϕ Km i;ψ(Xi) . The full-batch gradient update with an infinite-length Markov chain is given by θGD := F GD (θinit) , where F GD (ψ) := ψ η ϕ(Xi) ϕ(Xψ 1 ) . The population gradient update with an infinite-length Markov chain is given by θpop := f pop(θinit) , where f pop(ψ) := ψ η E ϕ(X1) ϕ(Xψ 1 ) , where we use the lowercase f to emphasize that f pop is a deterministic function. The forthcoming results are summarized below: θSGDw m,B Lemma E.1 θGD m Lemma E.2 θGD Lemma E.3 θpop Lemma E.4 ψ . Lemma E.1. Let Fn be the sigma algebra generated by {Xi, Km i;θinit(Xi) | 1 i n}. Then E θSGDw m,B θGD m θinit, Fn = 0 almost surely . Moreover, under A1 and A7, we have E θSGDw m,B θGD m 2 4η2(σ2 + κ2 ν;m) Proof of Lemma E.1. Write A := (A1, . . . , An), where Ai := ϕ(Xi) ϕ Km i;θinit(Xi) E ϕ(X1) ϕ Km i;θinit(X1) . Since Sw is uniformly drawn from all size-B subsets of [n] and independently of all other variables, we have that almost surely E θSGDw m,B θGD m θinit, Fn = E h η i n Ai A i = 0 . To prove the remaining bound, we note that the above relation implies θSGDw m,B θGD m is zero-mean. 
By the law of total variance, we have E θSGDw m,B θGD m 2 = Tr Cov θSGDw m,B θGD m = Tr E Cov θSGDw m,B θGD m θinit, Fn = η2 Tr E E h 1 i Sw Ai A i 1 To compute the covariance, recall that Sw is a uniformly drawn size-B subset of [n] with B = n/N. Let PN([n]) be the collection of all partitions of [n] into N size-B subsets. We can generate Sw by the following two-step process: (i) Uniformly draw a partition P = (P 1, . . . , P N) from PN([n]); (ii) Uniformly sample an index K from [N] and set Sw = P K. Then we have, almost surely i Sw Ai A i 1 = 1 |PN([n])| i,j P k Ai A j 1 i,j n Ai A j = 1 |PN([n])| i,j P k Ai A j 1 N2B2 X i P k,j P l Ai A j = 1 |PN([n])| i,j P k Ai A j 1 N2B2 X i P k,j P l Ai A j . By noting that Ai s are exchangeable, we obtain E θSGDw m,B θGD m 2 = η2Tr E N 1 NB A1A 1 + (N 1)(B 1) NB A1A 2 N 1 = η2Tr E N 1 NB A1A 1 N 1 NB E A1 2 E A1, A2 B E ϕ(X1) E[ϕ(X1)] ϕ Km i;θinit(X1) E h ϕ Km i;θinit(X1) i 2 Tr Cov[ϕ(X1)] + Tr Cov h ϕ Km i;θinit(X1) i (b) 4η2(σ2 + κ2 ν;m) In (a), we have used a Cauchy-Schwarz inequality; in (b), we have used A1 and A7. Finally we note that if B = n, θSGDw m,B = θGD m almost surely, which implies the desired bound. Lemma E.2. Denote An as the sigma algebra generated by θinit, X1, . . . , Xn. Under A1, A2, A3 and A7, we have that for any ϵ > 0 and ν > 2, E E θGD m θGD An 2 η2 αmσCχ p E θinit ψ 2 + εinit n,m;ν(ϵ) 2 , E θGD m θGD 2 η2 αmσCχ p E θinit ψ 2 + εinit n,m;ν(ϵ) 2 + κ2 ν;m + σ2 Proof of Lemma E.2. The main challenge arises from the possible correlation between θinit and X1, . . . , Xn. First note that for any ϵ > 0, ν > 2 and a real-valued random variable Y , by Hölder s inequality, we have E[Y 2] = E Y 2 I{Y ϵ} + Y 2 I{Y >ϵ} ϵ2 + E[Y 2I{Y >ϵ}] ϵ2 + (E[Y ν])2/νP(Y > ϵ)(ν 2)/ν . (15) Also note the useful inequality that for two real-valued random vectors (possibly correlated) V1, V2, we have E[ V1 + V2 2] E[( V1 + V2 )2] E V1 2 + 2 p (E V1 2)(E V2 2) + E V2 2 E V2 2 2 . (16) Now to control the first quantity of interest, by using a triangle inequality, we have E E θGD m θGD An 2 = E E h η i n ϕ Km i;θinit(Xi) ϕ Xθinit i An i 2 E[ 2 1] + q E[ 2 2] 2 , i n ϕ Km i;θinit(Xi) E ϕ Km i;θinit(X 1) θinit An i i n E ϕ Km i;θinit(Xi) θinit, Xi E ϕ Km i;θinit(X 1) θinit , i n E ϕ Km i;θinit(X 1) θinit ϕ Xθinit i An i = E ϕ Km 1;θinit(X 1) ϕ Xθinit 1 θinit , and X 1 is an i.i.d. copy of X1 and in particular independent of θinit. 1 is controlled via (15): E[ 2 1] ϵ2 + (E[ ν 1])2/ν P( 1 > ϵ)ν/(ν 2) (a) ϵ2 + E ϕ(Km 1;θinit(X1)) E h ϕ(Km 1;θinit(X 1)) θiniti ν 2/ν P P i n E ϕ Km i;θinit(Xi) θinit, Xi E ϕ Km i;θinit(X 1) θinit (b) ϵ2 + κ2 ν;m ϑinit n,m(ϵ) ν 2 = εinit n,m;ν(ϵ) 2 . (17) In (a), we have plugged in the definition of 1 and applied a Jensen s inequality with respect to the empirical average; in (b), we have used A7 to bound the ν-th moment term as E ϕ(Km 1;θinit(X1)) E h ϕ(Km 1;θinit(X 1)) θiniti ν 1/ν = E h E h ϕ(Km 1;θinit(X1)) E h ϕ(Km 1;θinit(X 1)) θiniti ν θiniti i 1/ν supψ Ψ E h ϕ(Km 1;ψ(X1)) E ϕ(Km 1;ψ(X 1)) ν i 1/ν = supψ Ψ E h ϕ(Km 1;ψ(X1)) E ϕ(Km 1;ψ(X1)) ν i 1/ν κν;m and recalled the definitions of ϑinit n,m and εinit n,m;ν. 
On the other hand, E[ 2 2] = E Z Rd ϕ(x)(Km 1;θinitpψ )(x)dx E ϕ(Xθinit 1 ) θinit 2 Rd(Km 1;θinitϕ)(x) pψ (x)dx Epθinit[ ϕ ] 2 Rd Km 1;θinit ϕ Epθinit[ ϕ ] (x) pψ (x)dx 2 Rd Km 1;θinit ϕ Epθinit[ ϕ ] (x) (pψ (x) pθinit(x))dx 2 Rd Km t1;θinit ϕ Epθinit[ ϕ ] (x) el pθinit(x) pψ (x) pθinit(x) pθinit(x) dx 2 Rd Km t1;θinit ϕ Epθinit[ ϕ ] (x) el 2pθinit(x)dx pψ (x) pθinit(x) 2 pθinit(x)dx 2i l=1 E α2m Tr Cov ϕ(Xθinit 1 ) θinit χ2(pψ , pθinit) 2 (g) α2mσ2C2 χ E θinit ψ 2 . In (a), we have used that (Kf)(x) = R K(x, y)f(y)dy; in (b), we have used that the Markov operator leaves the constant function invariant; in (c), we used that K1;θinit leaves pθinit invariant; in (d), we denoted (el)l d as the standard basis vectors of Rd and multiplied and divided by pθinit(x); in (e), we have used a Cauchy-Schwarz inequality; in (f), we have used the definition of the spectral gap α in A3; in (g), we have used A1 and A2. Combining the bounds gives the first inequality that E E θGD m θGD An 2 η2 αmσCχ p E θinit ψ 2 + εinit n,m;ν(ϵ) 2 . Now we handle the second quantity by conditioning on θinit and perform a bias-variance decomposition: E θGD m θGD 2 = η2 E h E h 1 n i n ϕ Km i;θinit(Xi) ϕ(Xθinit i ) 2 An i i = η2(QB + QV ) , QB := E E h 1 i n ϕ Km i;θinit(Xi) ϕ(Xθinit i ) An i 2 , QV := E h Tr Cov h 1 i n ϕ Km i;θinit(Xi) An i + Cov h 1 i n ϕ(Xθinit i ) θiniti i . Note that the covariance terms separate because Xθinit i is independent of Km i;θinit(Xi) conditioning on θinit. η2QB is exactly the quantity controlled above, so it suffices to bound the variance term QV . By explicitly computing the second covariance term while noting that Xθinit 1 , . . . , Xθinit n are conditionally i.i.d. given θinit and Km i;θinit(Xi) s are conditionally independent across 1 i n given An, we have P i n E[Tr Cov[ϕ(Km i;θinit(Xi)) | θinit, Xi]] n2 + E Tr Cov[ϕ(Xθinit 1 )|θinit] (a) = E[Tr Cov[ϕ(Km 1;θinit(X1)) | θinit, X1]] n + E Tr Cov[ϕ(Xθinit 1 )|θinit] κ2 ν;m + σ2 where we have used A4, A7 and A1 in the last line. Combining the bounds, we obtain that E θGD m θGD 2 = η2(QB + QV ) E θinit ψ 2 + εinit n,m;ν(ϵ) 2 + κ2 ν;m + σ2 Lemma E.3. Under A1, E θGD θpop 2 4η2σ2 Proof of Lemma E.3. Since both F GD and f pop involve infinite-length Markov chains, the initializations do not matter and we can decouple the stochasticity of Xi and Ki;θinit. In particular, E θGD θpop 2 = η2 E 1 n i n ϕ(Xi) ϕ Xθinit i E ϕ(X1) ϕ Xθinit 1 2 η2 Q 1 + 2 p Q 1Q 2 + Q 2 , Q 1 := E 1 n i n ϕ(Xi) E[ϕ(X1)] 2 = Tr Cov[ϕ(X1)] n = Tr 2 θ log Z(ψ ) Q 2 := E 1 n i n ϕ(Xθinit i ) E ϕ Xθinit 1 2 = E Tr Cov[ϕ(Xθinit 1 )|θinit] n = E Tr 2 θ log Z(θinit) In the computations above, we have used the relation 2 θ log Z(θ) = Cov X pθ[ϕ(X)] and the assumption supθ Ψ tr( 2 θ log Z(θ)) = σ2 from A1. This implies the desired bound. Lemma E.4. Under A1, E θpop ψ 2 1 2µη + L2η2 E θinit ψ 2 . Proof of Lemma E.4. Recall that f pop(θ ) = θ η E ϕ(X1) ϕ(Xθ 1 ) = θ η ψ log Z(ψ ) ψ log Z(θ ) . By construction, f pop is deterministic and f pop(ψ ) = ψ . By plugging in the recursions and expanding the square, we get that E θpop ψ 2 = E f pop(θinit) f pop(ψ ) 2 = E (θinit ψ ) η( ψ log Z(θinit) ψ log Z(ψ )) 2 = E θinit ψ 2 2η E θinit ψ , ψ log Z(θinit) ψ log Z(ψ ) + η2 E ψ log Z(θinit) ψ log Z(ψ ) 2 E θinit ψ 2 2µη E θinit ψ 2 + L2η2 E θinit ψ 2 . In the last line, we have recalled infψ Ψ λmin( 2 ψ log Z(ψ)) = µ and supθ Ψ λmax( 2 ψ log Z(ψ)) = L by A1 and applied Lemma C.8. Combining the coefficients gives the desired statement. 
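The chain of auxiliary updates $\theta^{\mathrm{SGDw}}_{m,B}\to\theta^{\mathrm{GD}}_m\to\theta^{\mathrm{GD}}_\infty\to\theta^{\mathrm{pop}}\to\psi_\star$ can be instantiated end-to-end on a toy model. The sketch below is our illustration under simplifying assumptions ($\phi(x)=x$, $p_\psi=\mathcal N(\psi,1)$ so that $\mu=L=1$, and an AR(1) kernel in place of a generic MCMC kernel); the sign convention follows the CD update $\psi\mapsto\psi+\eta\,\frac1B\sum_{i}\big(\phi(X_i)-\phi(K^m_{i;\psi}(X_i))\big)$:

```python
import numpy as np

rng = np.random.default_rng(4)
psi_star, alpha, eta, m = 1.5, 0.7, 0.05, 4
n, N = 2000, 4
B = n // N
X = rng.normal(psi_star, 1.0, size=n)   # data from p_{psi_star} = N(psi_star, 1)
theta_init = 0.3

def K_m(x, psi):
    """m steps of the AR(1) kernel with invariant law p_psi = N(psi, 1)."""
    x = np.asarray(x, dtype=float)
    for _ in range(m):
        x = alpha * x + (1 - alpha) * psi + np.sqrt(1 - alpha**2) * rng.normal(size=x.shape)
    return x

# One-step updates of this appendix, with phi(x) = x:
S = rng.choice(n, size=B, replace=False)
theta_sgdw = theta_init + eta * np.mean(X[S] - K_m(X[S], theta_init))  # theta^SGDw_{m,B}
theta_gd_m = theta_init + eta * np.mean(X - K_m(X, theta_init))        # theta^GD_m
exact = rng.normal(theta_init, 1.0, size=n)                            # exact samples from p_theta
theta_gd_inf = theta_init + eta * np.mean(X - exact)                   # theta^GD_infty
theta_pop = theta_init + eta * (psi_star - theta_init)                 # theta^pop (deterministic)

print(theta_sgdw, theta_gd_m, theta_gd_inf, theta_pop)
# Lemma E.4 contraction: for this model 1 - 2*mu*eta + L^2*eta^2 = (1 - eta)^2.
assert abs(abs(theta_pop - psi_star) - (1 - eta) * abs(theta_init - psi_star)) < 1e-12
```

Each of the four printed quantities approximates the next one in the chain, mirroring the roles of Lemmas E.1 through E.4.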
F Proofs for offline SGD We prove Theorem B.1 (which directly implies Theorem 4.3) and Theorem B.2 in this section. The key ingredient of both proofs is Lemma F.1 below, which provides an iterative error bound for the SGD-with-replacement scheme by combining different approximation bounds in Appendix E. Throughout this section, we denote δSGDw t,j := E ψSGDw t,j ψ 2. Lemma F.1. Under A1, A2, A3, A4 and A7, we have that for 1 j N 1, δSGDw t,j 1 ηt µ αmσCχ L2 δSGDw t,j 1 + ηt εSGDw n,m,t;ν(ϵ) + 5σ + 5κν;m where 1 ηt µ αmσCχ L2 Proof of Lemma F.1. We first remark that in view of Lemma C.4, the projection step in Algorithm 2 does not increase δSGDw t,j , so it suffices to bound δSGDw t,j as if projection is not performed on ψs GDw t,j . To apply the results from Appendix E, we identify θinit = ψSGDw t,j 1 and η = ηt, which allows us to write ψSGDw t,j = θSGDw m,B . This also implies E θinit ψ 2 = δSGDw t,j 1 and εinit n,m;ν(ϵ) εSGDw ν;n,m,t(ϵ). By adding and subtracting the auxiliary gradient updates followed by expanding the square, we obtain δSGDw t,j = E θSGDw m,B θGD m + θGD m θGD + θGD θpop + θpop ψ 2 (a) = E θSGDw m,B θGD m 2 + E θGD m θGD 2 + E θGD θpop 2 + E θpop ψ 2 + 2 E θSGDw m,B θGD m , θGD m θGD + 2 E θSGDw m,B θGD m , θGD θpop + 2 E E θSGDw m,B θGD m θinit, Fn , θpop ψ + 2 E θGD m θGD , θGD θpop + 2 E E θGD m θGD An , θpop ψ + 2 E θGD θpop , θpop ψ (b) 4η2 t (σ2 + κ2 ν;m) B I{B 0 , which finishes the proof. F.1 Proof of Theorem B.1 Under A1, A2, A3, A4 and A7, Lemma F.1 implies q δSGDw t,j 1 µm Ct β + L2C2 δSGDw t,j 1 + Ct β σSGDw n,T for 1 t T and 1 j N, where we have used µm = µ αmσCχ, ηt = Ct β and εSGDw n,m,t;ν(ϵ) + 5σ + 5κν;m B εSGDw n,m,T ;ν(ϵ) + 5σ + 5κν;m B = σSGDw n,T . By an induction on j = 1, . . . , N, we have q δSGDw t,N 1 µm Ct β + L2C2 + CσSGDw n,T t β XN 1 µm Ct β + L2C2 2 t 2β j 1 . By noting that δSGDw t,0 = δSGDw t 1,N almost surely for t 2 and using another induction on t = 1, . . . , T, q δSGDw T,N Q0 q δSGDw 0,0 + CσSGDw n,T XT t=1 Qt At t β , (18) 1 µm Cs β + L2C2 2 s 2β N and At := XN 1 µm Ct β + L2C2 2 t 2β j 1 . Note that by Lemma F.1, we also have that 1 µm Ct β + L2C2 2 t 2β > 0. This allows us to apply Lemma C.3 to obtain κ0 exp 1 N µm Cφ1 β(T + 1) + NL2C2 2 φ1 2β(T + 1) = ET,N 1 . To control PT t=1 Qt Att β, recall that β [0, 1], µm > 0, L2C2 2 > 0, and that ET,N 2 = exp N µm C 2 φ1 β(T + 1) + 2NL2C2φ1 2β(T + 1) . We can now apply Lemma C.3: If β { 1 2, 1}, then t=1 Qt At t β 22β+1 µm C 2(1 β) N (T +1)β + 3β(1 + µm C)N 1(T + 2)β L2C2 ET,N 2 . 2, i.e. 2β = 1, we have t=1 Qt At t β 4 µm C e (T +1)1/2 + 2N(1 + µm C)N 1φ 1 2 L2C2N(T + 1) ET,N 2 . If β = 1, we get that t=1 Qt At t β 4 µm C + 3N 1 + L2C2 2 N 1 e2L2C2N log(T + 1) (T + 1)( µm CN)/2 Substituting the bounds into (18), we get the desired bound that q δSGDw T,N is upper bounded by δSGDw 0,0 + CσSGDw n,T 4e µm C + 2N(1 + µm C)N 1φ 1 2 L2C2N(T + 1) ET,N 2 δSGDw 0,0 + CσSGDw n,T 4 µm C + 3N 1 + L2C2 2 N 1 e2L2C2N log(T + 1) (T + 1)( µm CN)/2 for β = 1 , δSGDw 0,0 + CσSGDw n,T 22β+1 µm C 2(1 β) N (T +1)β + 3β(1 + µm C)N 1(T + 2)β L2C2 ET,N 2 otherwise . F.2 Proof of Theorem B.2 Recall that δSGDo t,j := E ψSGDo t,j ψ 2. To apply the results from Appendix E to ψSGDo t,j , we condition on So t , which in particular fixes So t,j, the last size-B subset of [n] chosen. We then identify θinit = ψSGDo t,j 1 , η = ηt and the dataset used as Do t,j := (Xi : i So t,j), which allows us to identify ψSGDo t,j = θGD m , the full-batch gradient descent update using Do t,j. 
Meanwhile, we note that almost surely θGD m = θSGDw m,B , the SGD-with-replacement iterate that uses the full dataset Do t,j. Observe that the proof of Lemma F.1 holds with δSGDo t,j 1 replaced by any random initialization θinit possibly correlated with X1, . . . , Xn, which allows us to obtain q δSGDo t,j 1 ηt µ αmσCχ L2 δSGDo t,j 1 + ηt εSGDw n,m,t;ν(ϵ) + 5σ + 5κν;m Since the error recursion for δSGDo t,j is identical to that of δSGDw t,j in Lemma F.1, the proof of Theorem 4.3 follows directly, thereby yielding an identical result for δSGDo T,N as with δSGDw T,N in Theorem 4.3. G Proofs for tail probability bounds in offline SGD We present the proofs for results in Appendix B.3 that control the tail probability terms ϑSGDw ν;n,m,T and εSGDw ν;n,m,T . Proof of Lemma B.3. For δ > 0, let Nδ be the δ-covering number of Ψ, which satisfies Nδ rΨ(1 + 2/δ) p (Example 5.8, [50]). Note also that by the Jensen s inequality applied to E[ |X1] and Assumption A6, there exist some σm, ζm > 0 such that, for any z Rp with z ζm, E[ez (E[ϕ(Km ψ (X1))|X1] E[ϕ(Km ψ (X1))])] E[ez (ϕ(Km ψ (X1)) E[ϕ(Km ψ (X1))])] eσ2 m z 2/2 . Meanwhile under A5, recycling the proof of Lemma 3.1 of [21] shows that if Cmδ/ p < σ2 mζm, i=1 E ϕ Km ψ (Xi) Xi E ϕ(Km ψ (Xi)) > 3Cmδ 2Nδp exp n C2 mδ2 2(rΨ)p exp p log(1 + 2δ 1) n C2 mδ2 Since the probability above is an upper bound to ϑSGDw n,m,T (3Cmδ), we get that if Cmδ/ p < σ2 mζm, εSGDw ν;n,m,T (3Cmδ) 2 = 9C2 mδ2 + κ2 ν;m ϑSGDw n,m,T Cmδ 9C2 mδ2 + κ2 ν;m2 ν exp (ν 2)p ν log(1 + 2δ 1) n(ν 2)C2 mδ2 Recall that by assumption, log n n < σ2 mζ2 m p+ν 2. We now choose p p + 2ν ν 2 p (ν 2)p + 2ν which implies εSGDw ν;n,m,T (ϵ) 2 εSGDw ν;n,m,T (3Cmδ) 2 9σ2 mp((ν 2)p + 2ν) log n (ν 2)n + κ2 ν;m2 ν 2 ν (1 + 2δ 1) ν exp (ν 2)p + 2ν 9σ2 mp((ν 2)p + 2ν) log n + κ2 ν;m2 ν 2 ν 1 + 2Cm(ν 2)1/2 σmp1/2((ν 2)p + 2ν)1/2 ν n (ν 2)p+2ν 9σ2 mp((ν 2)p + 2ν) log n + κ2 ν;m2 ν 2 ν 1 + 2Cm(ν 2)1/2 σmp1/2((ν 2)p + 2ν)1/2 (ν 2)p 2ν (ν 2)p+2ν 9σ2 mp((ν 2)p + 2ν) ν 2 + κ2 ν;m2 ν 1 + 2Cm(ν 2)1/2 σmp1/2((ν 2)p + 2ν)1/2 (ν 2)p Taking a squareroot across and using b for a, b > 0 gives the desired bound. The limiting result follows by substituting this bound into Theorem B.1. Proof of Lemma B.4. Let δ > 0 and Nδ be defined as in the proof of Lemma B.3, with Nδ rΨ(1 + 2/δ) p. Let (ψl)Nδ l=1 be the centers of the covering δ-balls. The covering-ball argument of the proof of Lemma 3.1 of [21] shows that ϑSGDw n,m,T (3Cmδ) P sup ψ Ψ i=1 E ϕ Km ψ (Xi) Xi E ϕ(Km ψ (Xi)) > 3Cmδ XNδ l=1 P 1 n i=1 E ϕ Km ψl(Xi) Xi E ϕ(Km ψl(Xi)) Cmδ . By a Markov s inequality followed by the Burkholder s inequality applied to an average of i.i.d. summands (see e.g. [51] for ν > 2), there exists a constant Cν > 0 depending only on ν such that ϑSGDw n,m,T (3Cmδ) XNδ l=1 E Xn i=1 E ϕ Km ψl(Xi) Xi E ϕ(Km ψl(Xi)) ν E E ϕ Km ψl(X1) Xi E ϕ(Km ψl(X1)) ν Nδκν ν;m nν/2Cνmδν (rΨ)p(1 + 2δ 1)pκν ν;m nν/2Cνmδν , where we have used assumption A4 in the last line. This implies εSGDw ν;n,m,T (3Cmδ) 2 = 9C2 mδ2 + κ2 ν;m ϑSGDw n,m,T (3Cmδ) (ν 2)/ν 9C2 mδ2 + κν ν;m(rΨ) ν C (ν 2) m (1 + 2δ 1) (ν 2)p Choosing δ = n (ν 2)ν 2(ν2+(ν 2)p) 1, we get that εSGDw ν;n,m,T (ϵ) 2 εSGDw ν;n,m,T (3Cmδ) 2 9C2 mn (ν 2)ν ν2+(ν 2)p + κν ν;m(rΨ) ν C (ν 2) m 3 ν n (ν 2)ν ν2+(ν 2)p = 9C2 m + κν ν;m(rΨ) ν C (ν 2) m 3 ν n (ν 2)ν ν2+(ν 2)p . Taking a squareroot across and using b for a, b > 0 gives the desired bound. The limiting result follows by substituting this bound into Theorem B.1. Proof of Lemma B.5. 
By a Markov inequality and a Jensen inequality with respect to the empirical average, we have
$$
\vartheta^{\mathrm{SGDw}}_{n,m,T}(\epsilon) \;\le\; \sup_{t\in[T],\,j\in[N]}\, \frac{1}{n\epsilon}\sum_{i=1}^n \underbrace{\mathbb E\Big\| \mathbb E\Big[\phi\big(K^m_{\psi^{\mathrm{SGDw}}_{t-1,j}}(X_i)\big) \,\Big|\, X_i,\psi^{\mathrm{SGDw}}_{t-1,j}\Big] - \mathbb E\Big[\phi\big(K^m_{\psi^{\mathrm{SGDw}}_{t-1,j}}(X'_1)\big) \,\Big|\, \psi^{\mathrm{SGDw}}_{t-1,j}\Big] \Big\|}_{=:A_{tji}}\,.
$$
Meanwhile, by a triangle inequality and the ergodicity assumption, we have
$$
A_{tji} \;\le\; \mathbb E\Big\| \mathbb E\Big[\phi\big(K^m_{\psi^{\mathrm{SGDw}}_{t-1,j}}(X_i)\big)\,\Big|\,X_i,\psi^{\mathrm{SGDw}}_{t-1,j}\Big] - \mathbb E\Big[\phi\big(X^{\psi^{\mathrm{SGDw}}_{t-1,j}}_1\big)\,\Big|\,\psi^{\mathrm{SGDw}}_{t-1,j}\Big] \Big\| + \mathbb E\Big\| \mathbb E\Big[\phi\big(X^{\psi^{\mathrm{SGDw}}_{t-1,j}}_1\big)\,\Big|\,\psi^{\mathrm{SGDw}}_{t-1,j}\Big] - \mathbb E\Big[\phi\big(K^m_{\psi^{\mathrm{SGDw}}_{t-1,j}}(X'_1)\big)\,\Big|\,\psi^{\mathrm{SGDw}}_{t-1,j}\Big] \Big\| \;\le\; 2\bar C_K\,\alpha^m\,.
$$
This implies $\vartheta^{\mathrm{SGDw}}_{n,m,T}(\epsilon) \le 2\bar C_K\alpha^m\epsilon^{-1}$, and therefore
$$
\varepsilon^{\mathrm{SGDw}}_{\nu;n,m,T}(\epsilon)^2 \;\le\; \epsilon^2 + 2^{\frac{\nu-2}{\nu}}\,\kappa^2_{\nu;m}\,\bar C_K^{\frac{\nu-2}{\nu}}\,\alpha^{\frac{m(\nu-2)}{\nu}}\,\epsilon^{-\frac{\nu-2}{\nu}}\,.
$$
Choosing $\epsilon = \alpha^{\frac{(\nu-2)m}{3\nu-2}}$ gives
$$
\inf_{\epsilon>0}\ \varepsilon^{\mathrm{SGDw}}_{\nu;n,m,T}(\epsilon)^2 \;\le\; \Big(1 + 2^{\frac{\nu-2}{\nu}}\,\kappa^2_{\nu;m}\,\bar C_K^{\frac{\nu-2}{\nu}}\Big)\,\alpha^{\frac{2(\nu-2)m}{3\nu-2}}\,.
$$
Taking a square root across and using $\sqrt{a+b}\le\sqrt a+\sqrt b$ for $a,b>0$ gives the desired bound. The limiting result follows by substituting this bound into Theorem B.1.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: Our abstract summarizes the theoretical results of our paper. We clearly mention that while CD can achieve these rates, assumptions are needed to guarantee this. We also mention the model class studied in this paper, namely (unnormalized) exponential family distributions.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We mention limitations multiple times in the main body of our work: after introducing the assumptions in Section 3.1, we mention that spectral gaps may not be verified for heavy-tailed distributions, or may act numerically unfavorably for multimodal distributions, and that constants may be large in practice. In the discussion (Section 6), we reflect back on these limitations, and we mention the limitations of our considered model class (unnormalized exponential families).
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: All of our theoretical results are formally proved in the supplementary material. In the main paper, we provide some proof sketches (discussion of the tail term in Section 4, paragraphs after theorems in Section 3) to give high-level intuitions behind the proofs. All of the theoretical results provided by our work are proved in the supplementary material, usually by invoking (and referencing) more atomic lemmas, also proved in the supplementary material. All of our theorems and lemmas are stated with the assumptions they require.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: The paper does not include experiments.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [NA]
Justification: The paper does not include experiments, and thus no code or data.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: The paper does not include experiments.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: The paper does not include experiments.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
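The guidelines above distinguish several quantities that are easy to conflate (standard deviation vs. standard error of the mean, 1-sigma vs. 2-sigma bars). As a purely illustrative sketch of an unambiguous report, assuming hypothetical per-seed results stored in a placeholder array `scores`:

```python
import numpy as np

# Hypothetical per-seed results of an experiment (placeholder data, 10 seeds).
rng = np.random.default_rng(0)
scores = rng.normal(loc=0.85, scale=0.02, size=10)

mean = scores.mean()
std = scores.std(ddof=1)           # standard deviation across runs
sem = std / np.sqrt(len(scores))   # standard error of the mean

# The report states which quantity the bar shows (2-sigma SEM) and the sample size.
print(f"accuracy: {mean:.3f} +/- {2 * sem:.3f} (2-sigma standard error, n={len(scores)})")
```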
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: The paper does not include experiments.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: As a theory paper, we believe that our work does not breach the Code of Ethics. With this work, we aim to advance the understanding of statistically efficient algorithms, a domain which has the potential to improve the overall computational efficiency of machine learning algorithms.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: We believe that our work will not have the negative impacts mentioned (e.g., disinformation, surveillance, fairness, privacy, security considerations). We believe that advancing the domain of statistical efficiency has the potential to improve the overall computational efficiency of machine learning algorithms, which, all else being equal, could constitute a positive societal impact.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: Since our work does not include experiments nor data, and only analyzes existing models, we do not believe that our work requires any specific safeguards.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: The paper does not use existing assets.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: Our work does not release any assets.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.