# Double-Loop Unadjusted Langevin Algorithm

Paul Rolland 1, Armin Eftekhari 2, Ali Kavis 1, Volkan Cevher 1

A well-known first-order method for sampling from log-concave probability distributions is the Unadjusted Langevin Algorithm (ULA). This work proposes a new annealing step-size schedule for ULA, which allows us to prove new convergence guarantees for sampling from a smooth log-concave distribution that are not covered by existing state-of-the-art guarantees. To establish this result, we derive a new theoretical bound, relating the Wasserstein distance to the total variation distance between any two log-concave distributions, which complements the reach of Talagrand's T2 inequality. Moreover, applying this new step-size schedule to an existing constrained sampling algorithm, we show state-of-the-art convergence rates for sampling from a constrained log-concave distribution, as well as improved dimension dependence.

1. Introduction

Let dµ*(x) ∝ e^{−f(x)} dx be a probability measure over ℝ^d, where f : ℝ^d → ℝ is a convex function with Lipschitz continuous gradient. In order to sample from such distributions, first-order sampling schemes based on the discretization of Langevin dynamics, and in particular the Unadjusted Langevin Algorithm (ULA), have found widespread success in various applications (Welling & Teh, 2011; Li et al., 2016b; Patterson & Teh, 2013; Li et al., 2016a). An ever-growing body of literature has been devoted solely to the study of ULA and its variations (Ahn et al., 2012; Chen et al., 2015; Cheng & Bartlett, 2017; Cheng et al., 2017; Dalalyan & Karagulyan, 2017; Durmus et al., 2017; 2018a; Dwivedi et al., 2018; Luu et al., 2017; Welling & Teh, 2011; Ma et al., 2015).

1 LIONS, École Polytechnique Fédérale de Lausanne, Switzerland. 2 Department of Mathematics and Mathematical Statistics, Umeå University, Sweden. Correspondence to: Paul Rolland.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

The ULA iterates are given as

x_{k+1} = x_k − γ_{k+1} ∇f(x_k) + √(2γ_{k+1}) g_k,    (1)

where ∇f is the gradient of f, {γ_k}_{k≥0} is a non-increasing sequence of positive step sizes, and the entries of g_k ∈ ℝ^d are zero-mean and unit-variance Gaussian random variables, independent from each other and everything else. In its standard form (1), ULA can provably sample from any log-concave and smooth probability measure (Durmus et al., 2017; 2018a).

The recent analysis of (Durmus et al., 2018a) studies ULA through the lens of convex optimization. Their analysis shows a strong resemblance to the convergence analysis of the stochastic gradient descent (SGD) algorithm for minimizing a convex, continuously differentiable function f : ℝ^d → ℝ. Starting from x_0 ∈ ℝ^d, SGD iterates similarly to (1):

x_{k+1} = x_k − γ_{k+1} ∇f(x_k) + γ_{k+1} Θ(x_k),

where {γ_k}_{k≥0} is a non-increasing sequence of positive step sizes, and Θ : ℝ^d → ℝ^d is a stochastic perturbation of ∇f. One way of proving convergence guarantees for this method is to show the following inequality:

2γ_{k+1} (E[f(x_{k+1})] − f(x*)) ≤ E‖x_k − x*‖₂² − E‖x_{k+1} − x*‖₂² + Cγ²_{k+1}    (2)

for some constant C ≥ 0, for all k ≥ 0, and x* ∈ arg min_{x∈ℝ^d} f(x).
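The ULA update (1) can be stated concretely in a few lines of NumPy. This is our own illustration (the function and argument names are ours, not from the paper); it is a sketch of the recursion, not a tuned implementation:

```python
import numpy as np

def ula_iterates(grad_f, x0, step_sizes, seed=0):
    """Unadjusted Langevin Algorithm, eq. (1):
    x_{k+1} = x_k - gamma_{k+1} * grad f(x_k) + sqrt(2 gamma_{k+1}) * g_k."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    out = [x.copy()]
    for gamma in step_sizes:
        g = rng.standard_normal(x.shape)  # g_k ~ N(0, I_d)
        x = x - gamma * grad_f(x) + np.sqrt(2.0 * gamma) * g
        out.append(x.copy())
    return out

# Example: f(x) = ||x||^2 / 2, i.e. the target is a standard Gaussian.
iterates = ula_iterates(grad_f=lambda x: x, x0=np.zeros(3), step_sizes=[0.1] * 100)
```

With a constant step size, the chain equilibrates near (but not exactly at) the target; this bias is precisely why the step size must eventually shrink.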
From this inequality, and using step sizes γ_k ∝ 1/√k, it is then possible to show convergence, in expectation, of the average iterate x̄_T = (1/T) Σ_{t=0}^{T−1} x_t to the optimal value, i.e., E[f(x̄_T)] − f(x*) = O(1/√T).

In their paper, (Durmus et al., 2018a) showed a descent lemma similar to (2) for the sequence of measures {µ_k}_{k≥0} denoting the distributions of the iterates {x_k}_{k≥0} in (1), in which the objective gap E[f(x_k)] − f(x*) is replaced with the Kullback-Leibler divergence KL(µ_k; µ*), and the Euclidean distance ‖x_k − x*‖₂ is replaced with the 2-Wasserstein distance W₂(µ_k, µ*):

2γ_{k+1} KL(µ_k; µ*) ≤ W₂²(µ_k, µ*) − W₂²(µ_{k+1}, µ*) + 2Ldγ²_{k+1},    (3)

where L is the Lipschitz constant of the gradient of f. Then again, using γ_k ∝ 1/√k, it is possible to show convergence of the average sample distribution µ̄_T = (1/T) Σ_{t=1}^{T} µ_t to µ* in KL divergence, with an iteration complexity of O(d³ε⁻²). In this work, we improve this complexity to O(d³ε⁻³/²).

To this end, we first establish a new bound relating the W₂ distance and the KL divergence between any two log-concave distributions. When applied to inequality (3), this new bound can be exploited to design a new step-size sequence {γ_k}_{k≥0} that allows us to derive new convergence rates for ULA. We introduce a new multistage decaying step-size schedule, which proceeds in a double-loop fashion by geometrically decreasing the step size after a certain number of iterations, and which we call Double-Loop ULA (DL-ULA). To the best of our knowledge, all existing convergence proofs for ULA use either constant or polynomially decaying step sizes, i.e., of the form γ_k = k^{−α} for some α ≥ 0, and this is the first work introducing a multistage decaying step size for a sampling algorithm. Interestingly, there is precedent supporting our approach, in that such step-decay schedules can improve the convergence of optimization algorithms (Hazan & Kale, 2014; Ge et al., 2019; Aybat et al., 2019; Yousefian et al., 2012).
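The multistage (step-decay) idea can be made concrete: keep the step size constant within each stage and shrink it geometrically between stages. A minimal sketch under our own naming (the actual stage lengths and decay factors used by DL-ULA are those prescribed later in Theorem 10; the values below are placeholders):

```python
def multistage_schedule(gamma0, decay, stage_lengths):
    """Step-decay schedule: the step size is constant within each stage
    and multiplied by `decay` (< 1) between consecutive stages."""
    schedule = []
    gamma = gamma0
    for n_k in stage_lengths:
        schedule.extend([gamma] * n_k)  # n_k inner iterations at step gamma
        gamma *= decay
    return schedule

steps = multistage_schedule(gamma0=0.1, decay=0.5, stage_lengths=[2, 3])
# steps == [0.1, 0.1, 0.05, 0.05, 0.05]
```

Feeding such a schedule to a ULA loop reproduces the constant-step inner loops of the double-loop scheme.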
Our new inequality relating the KL divergence and the W₂ distance serves as an alternative to the powerful T2 inequality (Gozlan & Léonard, 2010), the latter requiring stronger assumptions on the distributions. The literature on Langevin dynamics commonly proves the convergence of an algorithm in KL divergence and then extends it to the total variation (TV) distance using the famous Pinsker's inequality (Pinsker, 1960; Cheng & Bartlett, 2017; Durmus et al., 2018a). Our new inequality enables us to do the same for extending convergence results to the W₂ distance in the case of general log-concave distributions, and hence might be of independent interest. Note, however, that applying this inequality alone to extend the result of (Durmus et al., 2018a) to the W₂ distance yields a suboptimal convergence rate; modifying the step-size schedule and the analysis appears to be crucial for improving the rate.

Finally, we apply this multistage strategy to the constrained sampling algorithm MYULA (Brosse et al., 2017), which allows us to obtain improved convergence guarantees, both in terms of rate and dimension dependence. This approach provides state-of-the-art convergence guarantees for sampling from a log-concave distribution over a general convex set.

We summarize our contributions as follows:

- We introduce a variant of the Unadjusted Langevin Algorithm, using a new multistage decaying step-size schedule as well as a clipping step. Our new approach, called DL-ULA, yields new convergence guarantees that are not covered by existing convergence results (i.e., either a better convergence rate or a better dimension dependence compared to state-of-the-art results).
- We apply our new step-size schedule to an existing Langevin-based constrained sampling algorithm, called MYULA (Brosse et al., 2017), and improve its convergence both in terms of iteration and dimension dependence.
- We introduce a new bound relating the 2-Wasserstein and the TV distance between any two log-concave distributions.
A summary of our convergence rates can be found in Tables 1 and 2.

Road map. In Section 3, we define the metrics on probability measures that we will use, recall some properties of log-concave distributions that we will exploit, as well as some results on the convergence of ULA. In Section 4, we present our new extension of ULA for unconstrained sampling, introducing a new multistage step-size schedule. We then prove convergence guarantees by making use of a new bound relating the KL divergence and the W₂ distance. Finally, in Section 5, we apply this procedure to the existing algorithm MYULA for constrained sampling, and show that it yields improved convergence guarantees, both in terms of convergence rate and dimension dependence.

2. Related work

Unconstrained sampling. Sampling algorithms based on Langevin dynamics have been widely studied (Ahn et al., 2012; Chen et al., 2015; Cheng & Bartlett, 2017; Cheng et al., 2017; Dalalyan & Karagulyan, 2017; Durmus et al., 2018a; Dwivedi et al., 2018; Durmus et al., 2017; Luu et al., 2017; Welling & Teh, 2011). Although most convergence rates have been established in the strongly log-concave setting, rates have also been shown for general log-concave distributions; these in particular exhibit larger dimension dependences (see Table 1).

Convergence guarantees for ULA applied to a general unconstrained log-concave distribution have been successively improved over the years. To the best of our knowledge, the best existing convergence results are those obtained by (Durmus et al., 2018a) and (Durmus et al., 2017), which respectively show O(d³ε⁻⁴) and O(d⁵ε⁻²) convergence guarantees in TV distance. In this paper, we improve upon the former by showing an O(d³ε⁻³) convergence rate. This result does not strictly dominate that of (Durmus et al., 2017), but enjoys a better dimension dependence.
Until recently, convergence rates in Wasserstein distance had not been proven in the general log-concave setting. Only recently, (Zou et al., 2018) presented a method based on underdamped Langevin dynamics that provably converges in W₂ distance for a general log-concave distribution; the authors show an O(d^{5.5}ε⁻⁶) convergence rate in W₂ distance. However, they assume that E_{X∼µ*}‖X‖₂⁴ ≤ Ud² for some scalar U. Consider, however, dµ(x) ∝ e^{−‖x‖₂} dx, which is a log-concave distribution. Then E_{X∼µ}‖X‖₂⁴ = Ω(d⁴), and their assumption does not hold. For comparison purposes, if we replace it with our weaker Assumption 4, their rate becomes O(d^{10.5}ε⁻⁶).

Constrained sampling. Extensions of ULA have been designed in order to sample from constrained distributions (Bubeck et al., 2018; Brosse et al., 2017; Hsieh et al., 2018; Patterson & Teh, 2013). In (Bubeck et al., 2018), the authors propose to apply ULA and project the sample onto the constraint set at each iteration. They show a convergence rate of O(d¹²ε⁻¹²) in TV distance for log-concave distributions (i.e., O(d¹²ε⁻¹²) iterations of the algorithm suffice to obtain an error smaller than ε in TV distance). In (Brosse et al., 2017), the authors propose to smooth the constraint using its Moreau-Yosida envelope, and obtain a convergence rate of O(d⁵ε⁻⁶) in TV distance when the target distribution is log-concave. To do so, they penalize the domain outside the constraint set directly inside the target distribution via its Moreau-Yosida envelope. The analysis of MYULA in (Brosse et al., 2017) only holds when the penalty parameter is fixed and chosen in advance, leading to a natural saturation after a certain number of iterations. In this work, we extend this procedure using our double-loop approach.
This allows us to obtain improved convergence both in terms of rate and dimension dependence, i.e., O(d^{3.5}ε⁻⁵) in TV distance, and to ensure asymptotic convergence of the algorithm, since the penalty is allowed to vary along the iterations.

The special case of sampling from simplices was solved in (Hsieh et al., 2018), introducing Mirrored Langevin Dynamics (MLD). Their work relies on finding a mirror map for the given constraint domain, and then performing ULA in the dual space. However, this method requires strong log-concavity of the distribution in the dual space. Moreover, finding a suitable mirror map for a general convex set is not an easy task.

3. Preliminaries

3.1. Various measures between distributions

Let us recall the distances/divergences between probability measures which are used frequently throughout the paper. The Kullback-Leibler (KL) divergence between two probability measures µ, ν on ℝ^d is defined as

KL(µ; ν) = E_µ[log(dµ/dν)],    (4)

assuming that µ is dominated by ν. Their total variation (TV) distance is defined as

‖µ − ν‖_TV = sup_S |µ(S) − ν(S)|,    (5)

where the supremum is over all measurable sets S of ℝ^d. Finally, the 2-Wasserstein (or W₂ for short) distance between µ and ν is defined as

W₂²(µ, ν) = inf_{φ ∈ Φ(µ,ν)} ∫_{ℝ^d × ℝ^d} ‖x − y‖₂² dφ(x, y),    (6)

where Φ(µ, ν) denotes the set of all joint probability measures φ on ℝ^{2d} that marginalize to µ and ν, namely, for all measurable sets A, B ⊆ ℝ^d, φ(A × ℝ^d) = µ(A) and φ(ℝ^d × B) = ν(B).

The main difference between the W₂ and TV distances is that W₂ assigns a higher cost when the difference between the distributions occurs at points that are farther apart (in terms of Euclidean distance). Due to this property, errors occurring at the tail of the distributions (i.e., as ‖x‖₂ → ∞) can have a small impact in terms of TV distance, but a major impact in terms of W₂ distance.

3.2. Log-concave distributions and tail properties

We start by recalling the basic property that we will assume on the probability measure.
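This tail sensitivity can be seen in a toy example of our own (not from the paper): let µ = δ₀ and ν = (1−p)δ₀ + p δ_R on the real line, i.e., ν moves a small mass p to a distant point R. Then ‖µ − ν‖_TV = p is independent of R, while W₂²(µ, ν) = pR² grows quadratically with R:

```python
def tv_and_w2_squared(p, R):
    """Closed-form distances between mu = delta_0 and
    nu = (1 - p) delta_0 + p delta_R on the real line."""
    tv = p                    # sup_S |mu(S) - nu(S)|, attained at S = {R}
    w2_squared = p * R ** 2   # the optimal coupling moves mass p over distance R
    return tv, w2_squared

tv, w2_sq = tv_and_w2_squared(p=0.01, R=100.0)
# tv stays 0.01 no matter how large R is, while w2_sq scales as R^2
```

This is exactly the regime where tail errors are invisible to TV but dominate W₂.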
We will then present some known results about this class of measures which will be exploited in the convergence analysis of our algorithm.

Definition 1. We say that a function f : ℝ^d → ℝ has L-Lipschitz continuous gradient, for L ≥ 0, if for all x, y ∈ ℝ^d, ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂.

Definition 2. We say that a function f : ℝ^d → ℝ is convex if for all 0 ≤ t ≤ 1 and x, y ∈ ℝ^d, f(tx + (1−t)y) ≤ tf(x) + (1−t)f(y).

Definition 3. We say that a probability measure µ ∝ e^{−f(x)} dx is log-concave if f is convex. Moreover, we say that µ is L-smooth if f has an L-Lipschitz continuous gradient.

As mentioned previously, bounding the Wasserstein distance between two probability measures requires controlling the error at the tail of the distributions. In order to deal with such a distance without injecting a large dependence on the dimension, we make the following assumption on the tail of the target distribution, which is quite standard when working with unconstrained non-strongly log-concave distributions (Durmus et al., 2018a; 2017):

Assumption 4. There exist η > 0 and M_η > 0 such that, for all x ∈ ℝ^d with ‖x‖₂ ≥ M_η,

f(x) − f(x*) ≥ η‖x − x*‖₂,

where x* = arg min_{x∈ℝ^d} f(x). Without loss of generality, we will also assume x* = 0 and f(x*) = 0.

Note that in the case of a distribution constrained to a set Ω ⊆ ℝ^d, this assumption is naturally satisfied with η arbitrary and M_η = diam(Ω), where diam(Ω) is the diameter of Ω.

In order to see how this assumption translates into a constraint on the tail of the distribution, we recall the two following results, shown in (Durmus et al., 2018a) and (Lovász & Vempala, 2007) respectively.

Lemma 5. Let X ∈ ℝ^d be a random vector from a log-concave distribution µ satisfying Assumption 4. Then E_{X∼µ}‖X‖₂² ≤ 2d(d + 1)/η² + M_η².

Lemma 6. Let X ∈ ℝ^d be a random vector from a log-concave distribution µ such that E‖X‖₂² ≤ C².
Then, for any R > 1, we have Pr(‖X‖₂ > RC) < e^{−R+1}.

It is thus possible to combine both lemmas to show that any distribution satisfying Assumption 4 necessarily has a sub-exponential tail. This property will allow us to control the Wasserstein distance in terms of the total variation distance.

Lemma 7. Let X be a random vector from a log-concave distribution µ satisfying Assumption 4. Then, for all R > 1, Pr(‖X‖₂ > RM) < e^{−R+1}, where M² denotes the second-moment bound of Lemma 5.

3.3. Unadjusted Langevin Algorithm

Finally, we recall the standard Unadjusted Langevin Algorithm, as well as a very useful inequality bounding the KL divergence between the target distribution and the k-th sample distribution. Consider the probability space (ℝ^d, B, µ*), where B is the Borel sigma algebra and µ* is the target distribution. Suppose that µ* is log-concave and dominated by the Lebesgue measure on ℝ^d, namely,

dµ*(x) = C e^{−f(x)} dx,  x ∈ ℝ^d,    (7)

where C is an unknown normalizing constant and the function f : ℝ^d → ℝ is convex with ∇f L-Lipschitz continuous. We wish to sample from µ* without calculating the normalizing constant C. A well-known scheme for sampling from such a distribution is ULA. Initialized at x_0 ∈ ℝ^d, the iterates of ULA are

x_{k+1} = x_k − γ ∇f(x_k) + √(2γ) g_k,    (8)

for all k ≥ 0, where γ > 0 is the step size and the entries of g_k ∈ ℝ^d are zero-mean and unit-variance Gaussian random variables, independent from each other and everything else.

Let µ_k be the probability measure associated to iterate x_k, k ≥ 0. It is well known that ULA converges to the target measure in KL divergence. More specifically, for n ≥ n_ε = O(d³Lε⁻²) iterations, we reach KL(µ̄_n; µ*) ≤ ε, where µ̄_n = (1/n) Σ_{k=1}^n µ_k is the average of the probability measures associated to the iterates {x_k}_{k=0}^n (Durmus et al., 2018a). The averaging sum (1/n) Σ_{k=1}^n µ_k is to be understood in the sense of measures, i.e., sampling from µ̄_n is equivalent to choosing an index k uniformly at random among {1, ..., n}, and then sampling from µ_k.
To prove this result, the authors showed the following useful inequality, which we will exploit in our analysis:

Lemma 8. Suppose that we apply the ULA iterations (8) for sampling from a smooth log-concave probability measure µ* ∝ e^{−f(x)} dx with constant step size γ > 0, starting from x_0 ∼ µ_0. Then, for all n > 0,

KL(µ̄_n; µ*) ≤ W₂²(µ_0, µ*)/(2γn) + Ldγ.    (9)

4. DL-ULA for unconstrained sampling

In this section, we present a modified version of the standard ULA for sampling from an unconstrained distribution and provide convergence guarantees. This modified version of ULA involves a new step-size schedule as well as a projection step. We will show that it allows us to obtain an improved convergence rate, as well as the first convergence rate in W₂ distance for overdamped Langevin dynamics.

Algorithm 1: Double-Loop Unadjusted Langevin Algorithm (DL-ULA)

  Input: smooth unconstrained probability measure µ*, step sizes {γ_k}_{k≥1}, numbers of (inner) iterations {n_k}_{k≥1}, initial probability measure µ_0 on ℝ^d, and thresholds {τ_k}_{k≥1}.
  Initialization: draw a sample x̂_0 from the probability measure µ_0.
  for k = 1, 2, ... do
    x_{k,0} ← x̂_{k−1}
    for n = 0, ..., n_k − 1 do
      x_{k,n+1} ← x_{k,n} − γ_k ∇f(x_{k,n}) + √(2γ_k) g_{k,n}, where g_{k,n} ∼ N(0, I_d)
    end for
    x̂_k ← x_{k,i}, where i is drawn from the uniform distribution on {1, ..., n_k}
    if ‖x̂_k‖₂ > τ_k then
      x̂_k ← τ_k x̂_k / ‖x̂_k‖₂
    end if
  end for

4.1. DL-ULA algorithm

We consider the problem of sampling from a smooth and unconstrained probability measure µ* ∝ e^{−f(x)} dx, where f : ℝ^d → ℝ is differentiable. To this end, we apply the standard ULA in a double-loop fashion, and decrease the step size only between the inner loops. Moreover, each inner loop is followed by a projection step onto a Euclidean ball. The procedure is summarized in Algorithm 1. The projection step turns out to be crucial in our analysis in order to control the tail of the sample distribution, which is necessary for bounding its Wasserstein distance to the target distribution.
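Algorithm 1 can be sketched directly in NumPy. This is our own illustrative rendering (names and toy parameter values are ours; concrete schedules {γ_k}, {n_k}, {τ_k} would be chosen as prescribed later in Theorem 10):

```python
import numpy as np

def dl_ula(grad_f, x0, gammas, inner_iters, thresholds, seed=0):
    """Double-loop ULA: constant-step ULA inner loops, a uniformly
    chosen inner iterate, then clipping onto a Euclidean ball."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    outer_samples = []
    for gamma, n_k, tau in zip(gammas, inner_iters, thresholds):
        inner = [x.copy()]
        for _ in range(n_k):  # inner loop: ULA with constant step gamma_k
            g = rng.standard_normal(x.shape)
            x = x - gamma * grad_f(x) + np.sqrt(2.0 * gamma) * g
            inner.append(x.copy())
        x = inner[rng.integers(1, n_k + 1)]  # sample from the averaged measure
        norm = np.linalg.norm(x)
        if norm > tau:  # projection (clipping) step onto the ball of radius tau_k
            x = tau * x / norm
        outer_samples.append(x.copy())
    return outer_samples

# Toy run: standard Gaussian target, 3 outer stages with decreasing steps.
samples = dl_ula(lambda x: x, np.zeros(2),
                 gammas=[0.1, 0.05, 0.025],
                 inner_iters=[10, 20, 40],
                 thresholds=[2.0, 4.0, 8.0])
```

Picking a uniformly random inner iterate is exactly sampling from the averaged measure µ̄_k described above.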
In the following sections, we derive the convergence rate of Algorithm 1. The overall idea is to use inequality (9) recursively between successive outer loops. We denote by µ̄_k the average distribution associated to the iterates of outer iteration k just before the projection step. Similarly, we denote by µ̂_k the same distribution after the projection step. Each outer iteration k uses as a starting point a sample from the previous outer iteration, x_{k,0} ∼ µ̂_{k−1}. Therefore, we can apply inequality (9) to outer iteration k to obtain

KL(µ̄_k; µ*) ≤ W₂²(µ̂_{k−1}, µ*)/(2γ_k n_k) + Ldγ_k.    (10)

In order to unfold the recursion, we must have a bound on W₂²(µ̂_{k−1}, µ*) in terms of KL(µ̄_{k−1}; µ*). Using the light-tail property of log-concave distributions, it is easy to obtain a bound between W₂²(µ̂_{k−1}, µ*) and W₂²(µ̄_{k−1}, µ*). However, it is not clear how to bound W₂²(µ̄_{k−1}, µ*) by KL(µ̄_{k−1}; µ*). As an intermediate step in the convergence analysis, we derive in the next section a bound between the W₂ distance and the TV distance of two general log-concave probability measures, which can then be extended to a W₂-KL bound using Pinsker's inequality.

4.2. Relation between W₂ and TV distances

When µ and ν are both compactly supported on a Euclidean ball of diameter D, it is well known that W₂(µ, ν) ≤ D √‖µ − ν‖_TV (Gibbs & Su, 2002). Otherwise, if µ and ν are not compactly supported, their fast-decaying tails (Lemma 7) allow us to derive a similar bound, as summarized next and proved in Appendix A.

Lemma 9 (W₂-TV inequality). Let µ, ν be log-concave probability measures on ℝ^d, both satisfying Assumption 4 with (η, M_η). Then, for some scalar c ∈ ℝ,

W₂(µ, ν) ≤ c d max(log(1/‖µ − ν‖_TV), 1) √‖µ − ν‖_TV.    (11)

In a sense, (11) is an alternative to the powerful T2 inequality, which does not apply in general in our setting (Gozlan & Léonard, 2010).
Indeed, for C_µ > 0, recall that a probability measure µ satisfies Talagrand's T2(C_µ) transportation inequality if

W₂(µ, ν) ≤ C_µ √(KL(µ; ν)),    (12)

for any probability measure ν. Above, C_µ depends only on µ and, in particular, if µ is κ-strongly log-concave,¹ then (12) holds with C_µ = O(1/√κ) (Gozlan & Léonard, 2010). In this work, the target measures that we consider are not necessarily strongly log-concave, leaving us in need of a replacement for (12). In our analysis, (11) serves as that replacement. Indeed, using Pinsker's inequality (Pinsker, 1960), an immediate consequence of (11) is that

W₂(µ, ν) = Õ(KL(µ; ν)^{1/4}).    (13)

In fact, (11) might also be of interest in its own right, especially when working with non-strongly log-concave measures. For example, it is easy to use (13) to extend the well-known O(ε⁻²) convergence rate of ULA in KL divergence to an Õ(ε⁻⁸) convergence rate in W₂ distance in the non-strongly log-concave setting. To the best of our knowledge, such a result does not exist in the literature.

¹ If dµ ∝ e^{−f} dx, then we say that µ is κ-strongly log-concave if f is κ-strongly convex.

Table 1. Complexity of sampling from a smooth, log-concave probability distribution. For each metric, the entry is the total number of iterations needed to reach an ε accuracy in that metric. (* For comparison purposes, we extended the proof of (Zou et al., 2018) to the case where the distribution satisfies the weaker Assumption 4; the dimension dependence thus differs from (Zou et al., 2018).)

| Literature | W₂ | TV | KL |
|---|---|---|---|
| (Durmus et al., 2018a) | – | Õ(Ld³ε⁻⁴) | Õ(Ld³ε⁻²) |
| (Durmus et al., 2017) | – | Õ(L²d⁵ε⁻²) | – |
| (Zou et al., 2018)* | Õ(L²d^{10.5}ε⁻⁶) | – | – |
| Our work | Õ(Ld⁹ε⁻⁶) | Õ(Ld³ε⁻³) | Õ(Ld³ε⁻³/²) |

4.3. Convergence analysis of DL-ULA

Having covered the necessary technical tools above, we now turn our attention to the convergence rate of Algorithm 1.
The final step is to choose the sequences {γ_k}_{k≥1} and {n_k}_{k≥1} so as to obtain the best possible convergence guarantees. We summarize our result in Theorem 10.

Theorem 10 (iteration complexity of DL-ULA). Let µ* be an L-smooth log-concave distribution satisfying Assumption 4. Suppose that µ_0 also satisfies Assumption 4. For every k ≥ 1, let

n_k = ⌈L M² d k² e^{3k}⌉,    (14)
γ_k = e^{−2k}/(Ld),    (15)
τ_k = Mk,    (16)

where M = √(2d(d+1)/η² + M_η²) = O(d). Let µ̄_k, µ̂_k be the average distributions associated with the iterates of outer iteration k of DL-ULA using the parameters above, just before and after the projection step respectively. Then, for all ε > 0, we have:

- After N_KL = O(Ld³ε⁻³/²) total iterations, we obtain KL(µ̄_k; µ*) ≤ ε.
- After N_TV = O(Ld³ε⁻³) total iterations, we obtain ‖µ̂_k − µ*‖_TV ≤ ε.
- After N_W₂ = O(Ld⁹ε⁻⁶) total iterations, we obtain W₂(µ̂_k, µ*) ≤ ε.

A few remarks about Theorem 10 are in order.

Geometric sequences. Theorem 10 prescribes geometric sequences for {γ_k}_k and {n_k}_k. As the outer iteration counter k increases, more and more ULA (inner) iterations are performed with the constant step size γ_k. Asymptotically, the step size decreases at a rate n^{−2/3}, where n is the total number of ULA iterations. This decay is faster than the standard n^{−1/2} rate for ULA (Durmus et al., 2018a). In contrast to convex optimization, where a global optimum can provably be reached with a constant step size, ULA cannot converge to the target distribution µ* with a constant step size, since the stationary distribution of the ULA iterates (8) under a constant step size differs from the target distribution. Asymptotically, it is thus desirable to use as small a step size as possible.

Projection step. Although the initial and target distributions are both log-concave, and thus have sub-exponential tails, the sample distributions µ̄_k are not generally log-concave, and it is not clear whether they also share the sub-exponential tail property.
The projection step at the end of each outer iteration provides a way to enforce the light-tail property, so that we can still apply a bound similar to (11). This procedure is made clearer in the proof of the theorem. It also provides more stability in the early outer iterations, where the step size is largest. Moreover, since lim_{k→∞} τ_k = ∞, the projection step asymptotically never applies in practice.

Convergence rate comparison. Table 1 summarizes the convergence rates of various Langevin-dynamics-based methods applied to general log-concave distributions. We observe that DL-ULA achieves improved convergence guarantees either in terms of rate or dimension dependence. Compared to (Durmus et al., 2017), our convergence rate in TV distance is worse in terms of the accuracy ε, but enjoys a much better dimension dependence, and is also better in terms of the dependence on the Lipschitz constant.

5. DL-MYULA for constrained sampling

We now apply the same multistage idea to an existing constrained sampling algorithm, and show that it allows us both to obtain asymptotic convergence and improved convergence guarantees.

5.1. DL-MYULA algorithm

Consider sampling from a log-concave distribution over a convex set Ω ⊆ ℝ^d, i.e.,

dµ*(x) ∝ e^{−f(x)} / ∫_Ω e^{−f(x')} dx'  if x ∈ Ω,  and  0  if x ∉ Ω.    (17)

In (Durmus et al., 2018b; Brosse et al., 2017), the authors propose to reduce this problem to an unconstrained sampling problem by penalizing the domain outside Ω directly inside the probability measure using its Moreau-Yosida envelope. More precisely, they propose to sample from the unconstrained probability measure dµ_λ(x) ∝ e^{−f_λ(x)} dx, where f_λ : ℝ^d → ℝ is defined as

f_λ(x) = f(x) + (1/(2λ)) ‖x − proj_Ω(x)‖₂²,  x ∈ ℝ^d,    (18)

where proj_Ω : ℝ^d → Ω is the standard projection operator onto Ω, defined as proj_Ω(x) = arg min_{y∈Ω} ‖x − y‖₂. Note that this penalty is easily differentiable as soon as the projection onto Ω can be computed, since ∇f_λ(x) = ∇f(x) + (1/λ)(x − proj_Ω(x)).
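The penalized gradient has a direct implementation whenever proj_Ω is available in closed form. A sketch under our own naming, with Ω taken to be a Euclidean ball as an example (the helper names are ours, not from the paper):

```python
import numpy as np

def grad_f_lambda(grad_f, proj_omega, lam):
    """Gradient of the Moreau-Yosida penalized potential
    f_lambda(x) = f(x) + ||x - proj_Omega(x)||_2^2 / (2 * lam)."""
    def grad(x):
        return grad_f(x) + (x - proj_omega(x)) / lam
    return grad

def proj_ball(x, r=1.0):
    """Projection onto the Euclidean ball B(0, r) (closed form)."""
    n = np.linalg.norm(x)
    return x if n <= r else r * x / n

# Inside Omega the penalty term vanishes; outside, it pushes back toward Omega.
g = grad_f_lambda(grad_f=lambda x: np.zeros_like(x),
                  proj_omega=lambda x: proj_ball(x, r=1.0),
                  lam=0.5)
```

Plugging such a gradient into the ULA update (8) gives one MYULA inner step; the double-loop variant below additionally shrinks λ_k and γ_k between outer loops.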
By bounding the TV distance between µ_λ and µ*, they showed that, by sampling from µ_λ with λ small enough, it is possible to sample from µ* with arbitrary precision. This algorithm is called Moreau-Yosida ULA (MYULA). Building on this approach, we can apply our double-loop algorithm, modifying both the step size and the penalty parameter λ between inner loops (Algorithm 2). In addition to providing an improved rate, as we will show later, our algorithm also has the advantage of using a decreasing penalty parameter λ, so as to guarantee asymptotic convergence to the target distribution. MYULA, on the other hand, uses a constant penalty λ, and thus saturates after a certain number of iterations. Although this looks like a trivial extension, using a varying penalty parameter makes the analysis more challenging, since the target distribution of the algorithm keeps changing.

5.2. Convergence analysis of DL-MYULA

We now analyze the convergence of DL-MYULA. In Algorithm 2, both the step size γ and the penalty parameter λ are decreased after each outer iteration. Therefore, at each outer iteration k, we aim to sample from the unconstrained penalized distribution dµ_{λ_k} ∝ e^{−f_{λ_k}(x)} dx, where f_{λ_k} is defined in equation (18). As for DL-ULA, we will use Lemma 9 after each outer iteration. However, since the target distribution of outer iteration k is µ_{λ_k} instead of µ*, the inequality reads as

Algorithm 2: DL-MYULA

  Input: smooth constrained probability measure µ*, step sizes {γ_k}_{k≥1}, penalty parameters {λ_k}_{k≥1}, numbers of (inner) iterations {n_k}_{k≥1}, initial probability measure µ_0 on ℝ^d, and thresholds {τ_k}_{k≥1}.
  Initialization: draw a sample x̂_0 from the probability measure µ_0.
  for k = 1, 2, ... do
    x_{k,0} ← x̂_{k−1}
    for n = 0, ..., n_k − 1 do
      x_{k,n+1} ← x_{k,n} − γ_k (∇f(x_{k,n}) + (1/λ_k)(x_{k,n} − proj_Ω(x_{k,n}))) + √(2γ_k) g_{k,n}, where g_{k,n} ∼ N(0, I_d)
    end for
    x̂_k ← x_{k,i}, where i is drawn from the uniform distribution on {1, ..., n_k}
    if ‖x̂_k‖₂ > τ_k then
      x̂_k ← τ_k x̂_k / ‖x̂_k‖₂
    end if
  end for

KL(µ̄_k; µ_{λ_k}) ≤ W₂²(µ̂_{k−1}, µ_{λ_k})/(2γ_k n_k) + Ldγ_k,

where we recall that µ̄_k is the average iterate distribution of outer iteration k just before the projection step, and µ̂_k the one just after. In order to use a recursion argument similar to the one above, we must thus bound W₂(µ̂_{k−1}, µ_{λ_k}) by W₂(µ̂_{k−1}, µ_{λ_{k−1}}). Using the triangle inequality for W₂, we have

W₂(µ̂_{k−1}, µ_{λ_k}) ≤ W₂(µ̂_{k−1}, µ_{λ_{k−1}}) + W₂(µ_{λ_{k−1}}, µ*) + W₂(µ_{λ_k}, µ*).

In (Brosse et al., 2017), the authors showed a bound on ‖µ_λ − µ*‖_TV in terms of λ > 0, and it is easy to extend their proof to obtain a bound on W₂(µ_λ, µ*) (see Lemma 12 and its proof in Appendix C). In order to prove our result, we make the same assumptions on the constraint set Ω as in (Brosse et al., 2017):

Assumption 11. There exist r, D, Δ₁ > 0 such that
1. B(0, r) ⊆ Ω ⊆ B(0, D), where B(0, r₀) = {y ∈ ℝ^d : ‖y‖₂ ≤ r₀} for r₀ > 0;
2. e^{inf_{Ω^c} f − max_Ω f} ≥ Δ₁, where Ω^c = ℝ^d \ Ω.

Lemma 12. Let Ω ⊆ ℝ^d satisfy Assumption 11. Then, for all λ < r²/(8d²),

W₂²(µ_λ, µ*) ≤ C_Ω² dλ,

for some scalar C_Ω > 0 depending on D, r, and Δ₁. The proof of this lemma is given in Appendix C.

Using these results, the convergence proof is then very similar to that of DL-ULA, and is summarized in Theorem 13, whose proof can be found in Appendix D.

Theorem 13 (iteration complexity of DL-MYULA). Let Ω ⊆ ℝ^d be a convex set satisfying Assumption 11, and let µ* be the log-concave distribution given by (17), where f has L-Lipschitz continuous gradient. For every k ≥ 1, let

λ_k = r²/(8(d² + d e^{2k})),    (20)
n_k = ⌈L d k² e^{5k}⌉,    (21)
γ_k = e^{−4k}/(Ld),    (22)
τ_k = Dk,    (23)

for every k ≥ 1. Then, for all ε > 0, we have:

- After N_TV = O(d^{3.5}ε⁻⁵) total iterations, we obtain ‖µ̂_K − µ*‖_TV ≤ ε.
- After N_W₂ = O(d^{3.5}ε⁻¹⁰) total iterations, we obtain W₂(µ̂_K, µ*) ≤ ε.

We make a few comments about this convergence result.

Smoothness of µ_{λ_k}. One can notice that outer iterations in DL-MYULA are longer than in DL-ULA.
In order to explain this choice, first observe that the Lipschitz constant associated with the penalized distribution µ_λ grows as O(1/λ) as λ goes to 0. As k increases and λ_k decreases, µ_{λ_k} becomes less and less smooth. Thus, for ULA to succeed in approximating µ_{λ_k}, the step size γ_k of the ULA iterations must shrink accordingly, and the number of iterations increases. The choice of λ_k ensures that λ_k < r²/(8d²), as required for Lemma 12 to be applicable.

Convergence rate comparison. Table 2 summarizes the convergence rates in TV distance of various first-order constrained sampling algorithms. We can see that DL-MYULA outperforms existing approaches, both in terms of rate and dimension dependence.

Table 2. Upper bounds on the number of iterations required to guarantee an error smaller than ε in TV distance for various constrained sampling algorithms.

| Algorithm | TV | Literature |
|---|---|---|
| PLMC | Õ(d¹²ε⁻¹²) | (Bubeck et al., 2018) |
| MYULA | Õ(d⁵ε⁻⁶) | (Brosse et al., 2017) |
| DL-MYULA | Õ(d^{3.5}ε⁻⁵) | Our work |

6. Conclusion

In this work, we proposed and analyzed a new step-size schedule for the well-known Unadjusted Langevin Algorithm. Our approach works by applying ULA successively with constant step size, geometrically decreasing it after a certain number of iterations. Exploiting a new result on the relation between the 2-Wasserstein distance and the TV distance of two log-concave distributions, we were able to prove new convergence guarantees for this procedure. We also applied our approach to an existing first-order constrained sampling algorithm, and showed improved convergence guarantees, both in terms of rate and dimension dependence.

7. Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 725594, time-data). This work was supported by the Swiss National Science Foundation (SNSF) under grant number 407540_167319.
This project was sponsored by the Department of the Navy, Office of Naval Research (ONR), under grant number N62909-17-1-2111. This work was supported by the Hasler Foundation Program: Cyber Human Systems (project number 16066).

References

Ahn, S., Korattikara, A., and Welling, M. Bayesian posterior sampling via stochastic gradient Fisher scoring. arXiv preprint arXiv:1206.6380, 2012.
Aybat, N. S., Fallah, A., Gurbuzbalaban, M., and Ozdaglar, A. A universally optimal multistage accelerated stochastic gradient method. arXiv preprint arXiv:1901.08022, 2019.
Brosse, N., Durmus, A., Moulines, É., and Pereyra, M. Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo. arXiv preprint arXiv:1705.08964, 2017.
Bubeck, S., Eldan, R., and Lehec, J. Sampling from a log-concave distribution with projected Langevin Monte Carlo. Discrete & Computational Geometry, 59(4):757–783, 2018.
Chen, C., Ding, N., and Carin, L. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pp. 2278–2286, 2015.
Cheng, X. and Bartlett, P. Convergence of Langevin MCMC in KL-divergence. arXiv preprint arXiv:1705.09048, 2017.
Cheng, X., Chatterji, N. S., Bartlett, P. L., and Jordan, M. I. Underdamped Langevin MCMC: A non-asymptotic analysis. arXiv preprint arXiv:1707.03663, 2017.
Dalalyan, A. S. and Karagulyan, A. G. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv preprint arXiv:1710.00095, 2017.
Durmus, A., Moulines, E., et al. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.
Durmus, A., Majewski, S., and Miasojedow, B. Analysis of Langevin Monte Carlo via convex optimization. arXiv preprint arXiv:1802.09188, 2018a.
Durmus, A., Moulines, E., and Pereyra, M.
Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473–506, 2018b.
Dwivedi, R., Chen, Y., Wainwright, M. J., and Yu, B. Log-concave sampling: Metropolis-Hastings algorithms are fast! arXiv preprint arXiv:1801.02309, 2018.
Ge, R., Kakade, S. M., Kidambi, R., and Netrapalli, P. The step decay schedule: A near optimal, geometrically decaying learning rate procedure. arXiv preprint arXiv:1904.12838, 2019.
Gibbs, A. L. and Su, F. E. On choosing and bounding probability metrics. International Statistical Review, 70(3):419–435, 2002.
Gozlan, N. and Léonard, C. Transport inequalities. A survey. arXiv preprint arXiv:1003.3852, 2010.
Hazan, E. and Kale, S. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.
Hsieh, Y.-P., Kavis, A., Rolland, P., and Cevher, V. Mirrored Langevin dynamics. In Advances in Neural Information Processing Systems, pp. 2883–2892, 2018.
Li, C., Chen, C., Carlson, D., and Carin, L. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In Thirtieth AAAI Conference on Artificial Intelligence, 2016a.
Li, W., Ahn, S., and Welling, M. Scalable MCMC for mixed membership stochastic blockmodels. In Artificial Intelligence and Statistics, pp. 723–731, 2016b.
Lovász, L. and Vempala, S. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007.
Luu, T., Fadili, J., and Chesneau, C. Sampling from non-smooth distribution through Langevin diffusion. 2017.
Ma, Y.-A., Chen, T., and Fox, E. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pp. 2917–2925, 2015.
Patterson, S. and Teh, Y. W. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pp.
3102–3110, 2013.
Pinsker, M. S. Information and Information Stability of Random Variables and Processes. 1960.
Villani, C. Optimal Transport: Old and New, volume 338 of A Series of Comprehensive Studies in Mathematics. 2009.
Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.
Yousefian, F., Nedić, A., and Shanbhag, U. V. On stochastic gradient and subgradient methods with adaptive steplength sequences. Automatica, 48(1):56–67, 2012.
Zou, D., Xu, P., and Gu, Q. Stochastic variance-reduced Hamilton Monte Carlo methods. arXiv preprint arXiv:1802.04791, 2018.