# Sampling from Log-Concave Distributions with Infinity-Distance Guarantees

Oren Mangoubi (Worcester Polytechnic Institute) and Nisheeth K. Vishnoi (Yale University)

For a d-dimensional log-concave distribution π(θ) ∝ e^{−f(θ)} constrained to a convex body K, the problem of outputting samples from a distribution ν which is ε-close in infinity-distance sup_{θ∈K} |log(ν(θ)/π(θ))| to π arises in differentially private optimization. While sampling within total-variation distance ε of π can be done by algorithms whose runtime depends polylogarithmically on 1/ε, prior algorithms for sampling within ε infinity-distance have runtime bounds that depend polynomially on 1/ε. We bridge this gap by presenting an algorithm that outputs a point ε-close to π in infinity-distance and requires at most poly(log(1/ε), d) calls to a membership oracle for K and an evaluation oracle for f, when f is Lipschitz. Our approach departs from prior works, which construct Markov chains on a (1/ε^2)-discretization of K to achieve a sample with ε infinity-distance error; instead, we present a method to directly convert continuous samples from K with total-variation bounds into samples with infinity-distance bounds. This approach also allows us to improve the dependence on the dimension d in the running time for sampling from a log-concave distribution on a polytope K with infinity-distance error ε, by plugging in TV-distance running-time bounds for the Dikin walk Markov chain.

1 Introduction

The problem of sampling from a log-concave distribution is as follows: for a convex body K ⊆ R^d and a convex function f : K → R, output a sample from the distribution π(θ) ∝ e^{−f(θ)}. This is a basic problem in computer science, statistics, and machine learning, with applications to optimization and integration [1, 29], Bayesian statistics [39], reinforcement learning [6], and differential privacy [32, 20, 2, 24]. Sampling exactly from π is known to be computationally hard for most interesting cases of K and f [16]; hence, the goal is to output samples from a distribution that is at a small (specified) distance to π. For applications such as computing the integral of π, bounds in the total variation (TV) distance [1] or KL divergence (which implies a TV bound) are sufficient. In applications such as computing the expectation of a Lipschitz function with respect to π, Wasserstein distance may also be sufficient. In differentially private optimization [32, 20, 2, 19, 24], one requires bounds on the stronger infinity-distance d_∞(ν, π) := sup_{θ∈K} |log(ν(θ)/π(θ))| to guarantee pure differential privacy, and TV, KL, or Wasserstein bounds are insufficient; see [14]. Pure differential privacy (DP) is the strongest notion of DP and has been extensively studied (see e.g. the survey [14]). It has advantages over weaker notions of differential privacy: for example, when privacy of groups of individuals (rather than just single individuals) must be preserved, any mechanism which is (pure) ε-DP with respect to single individuals is also kε-DP with respect to subsets of k individuals. Motivated by applications to differential privacy, we study the problem of designing efficient algorithms to output samples from a distribution which is ε-close in d_∞ to π.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Related works. Several lines of work have designed Markov chains that generate samples from distributions that are close to a given log-concave distribution.
These results differ both in their assumptions on the log-density and its support, and in the distance used to measure closeness. One line of work gives bounds for sampling from a log-concave distribution on a compactly supported convex body within TV distance O(δ), including results with running time that is polylogarithmic in 1/δ [1, 30, 29, 34] (as well as other results whose running-time bounds are polynomial in 1/δ [18, 17, 5, 4]). In addition to assuming access to a value oracle for f, some Markov chains only need access to a membership oracle for K [1, 30, 29], while others assume that K is a given polytope {θ ∈ R^d : Aθ ≤ b} [23, 33, 35, 34, 26]. They often also assume that K is contained in a ball of radius R and contains a ball of smaller radius r. Many of these results assume that the target log-concave distribution satisfies a well-roundedness condition, which says that the variance of the target distribution is Θ(d) [30, 29], or that it is in isotropic position (all eigenvalues of its covariance matrix are Θ(1)) [25]; when applied to log-concave distributions that are not well-rounded or isotropic, these results require a rounding pre-processing procedure to find a linear transformation which makes the target distribution well-rounded or isotropic. Finally, it is often assumed that f is L-Lipschitz or β-smooth [34], including works handling the widely studied special case where π is uniform on K, for which L = β = 0 (see e.g. [28, 23, 33, 35, 26, 8, 21]).

Another line of work gives sampling algorithms with bounds on the distance to the target density π in terms of Wasserstein distance [13, 11], KL divergence [40, 12], and Rényi divergence [36]. In contrast to works which assume access to an oracle for the value of f, many of these results instead assume access to an oracle for the gradient of f and require the log-density to be L-Lipschitz or β-smooth on all of R^d (or on, e.g., a cube containing K) for some L, β > 0. However, as noted earlier, bounds in the Wasserstein distance, KL divergence, and α-Rényi divergence (for α < ∞) do not imply bounds on the infinity-distance, and the running-time bounds provided by these works are polynomial in 1/ε. (See also Appendix A for additional discussion and challenges.) Among prior works that give algorithms with bounds on d_∞, [20] applies the grid walk Markov chain of [1] to sample from a uniform distribution on a convex body, and [2] extends the approach of [20] to log-Lipschitz log-concave distributions. Unlike the TV-distance setting, where algorithms whose running time depends logarithmically on the error are known (e.g., [1, 29, 34]), the best available bounds for sampling within O(ε) infinity-distance [20, 2] have runtime that is polynomial in 1/ε and a relatively large polynomial in d.

Our contributions. We present a new approach to output samples, which come with d_∞ bounds, from a log-concave and log-Lipschitz distribution constrained to a convex body. Specifically, when K := {θ ∈ R^d : Aθ ≤ b} is a polytope (where one is given A and b), our main result (Theorem 2.1) guarantees samples from a distribution that is within O(ε) error in d_∞ and whose runtime depends logarithmically on 1/ε, compared to the polynomial dependence of [2].
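Before stating the main result, the following small numerical sketch (illustrative only, not from the paper; it mirrors the counterexample in footnote 2 below) shows why TV-closeness is too weak for pure differential privacy: two densities on [0, 1] can be arbitrarily close in TV while their infinity-distance, and even their KL divergence, stays bounded away from zero.

```python
import numpy as np

# Illustrative example: pi is the uniform density on [0,1]; nu equals e^M on a
# tiny interval [0, w] and a normalizing constant c on (w, 1].
def tv_kl_dinf(M, mass):
    w = mass * np.exp(-M)            # interval length so that nu carries `mass` on [0, w]
    c = (1.0 - mass) / (1.0 - w)     # value of nu on (w, 1] so that nu integrates to 1
    tv = mass - w                                        # total-variation distance to uniform
    kl = mass * M + (1.0 - mass) * np.log(c)             # KL(nu || pi)
    d_inf = max(M, abs(np.log(c)))                       # sup over [0,1] of |log(nu/pi)|
    return tv, kl, d_inf

for delta in [1e-1, 1e-2, 1e-3]:
    tv, kl, d_inf = tv_kl_dinf(M=1.0 / delta, mass=delta)
    print(f"delta={delta:.0e}: TV={tv:.4f}  KL={kl:.3f}  d_inf={d_inf:.1f}")
    # TV shrinks like delta, while KL stays near 1 and d_inf grows like 1/delta.
```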
Our approach departs from prior works that construct Markov chains on a (1/ε^2)-discretization of K to achieve a sample with ε infinity-distance error: we present a method (Algorithm 1) to directly convert continuous samples from K with total-variation bounds into samples with infinity-distance bounds (Theorem 2.2). This continuous-space approach also allows us to improve the dependence on the dimension d in the running time when K is a polytope, by plugging in TV-distance running-time bounds for the Dikin walk Markov chain of [34]. As immediate applications, we obtain faster algorithms for differentially private empirical risk minimization (Corollary 2.4) and low-rank approximation (Corollary 2.5). Let B(v, s) := {z ∈ R^d : ‖z − v‖_2 ≤ s}, and let ω denote the matrix-multiplication constant.

Theorem 2.1 (Main result). There exists an algorithm which, given ε, L, r, R > 0, A ∈ R^{m×d}, b ∈ R^m that define a polytope K := {θ ∈ R^d : Aθ ≤ b} contained in a ball of radius R, a point a ∈ R^d such that K contains a ball B(a, r) of smaller radius r, and an oracle for the value of a convex L-Lipschitz function f : K → R, and defining π to be the distribution π ∝ e^{−f}, outputs a point from a distribution ν such that d_∞(ν, π) < ε. Moreover, with very high probability¹, this algorithm takes O(T) function evaluations and O(T·m·d^{ω−1}) arithmetic operations, where T = O((m^2 d^3 + m^2 d L^2 R^2) · [LR + d log((Rd + LRd)/(rε))]).

¹The number of steps is O(τT), where E[τ] ≤ 3, P(τ ≥ t) ≤ 2^{−t} for t ≥ 0, and τ ≤ O(d log(R/r) + LR) with probability 1.

In comparison to the polynomial-in-1/ε runtime bounds of [2], Theorem 2.1 guarantees a runtime that is logarithmic in 1/ε, and also improves the dependence on the dimension d, in the setting where K is a polytope. Specifically, [2] show that the number of steps of the grid walk needed to sample from π with infinity-distance error at most ε is (1/ε^2)(d^{10} + d^6 L^4 R^4) · polylog(1/r, R, L, d) (Lemma 6.5 in the arXiv version of [2]). When applying their algorithm in the setting where f is constrained to a polytope K = {x ∈ R^d : Ax ≤ b}, each step of their grid-walk Markov chain requires computing a membership oracle for K and the value of the function f. The membership oracle can be computed in O(md) arithmetic operations, so each step of their grid walk takes O(md) arithmetic operations (provided that each function evaluation takes at most O(md) arithmetic operations). Thus the bound on the number of arithmetic operations to obtain a sample from π is O((1/ε^2)(m d^{11} + m d^7 L^4 R^4) · polylog(1/r, R, L, d)). Theorem 2.1 therefore improves on this bound by a factor of roughly (1/ε^2) · d^{8−ω}/m^2; for example, when m = O(d), as may be the case in differentially private applications, the improvement is roughly (1/ε^2) · d^{6−ω}.

We note that the bounds of [2] also apply in the more general setting where K is a convex body given by a membership oracle. One can extend our bounds to achieve a runtime that is logarithmic in 1/ε (and polynomial in d, L, R) in this more general setting as well; we omit the details (see Remark 2.3). Moreover, while there are several results which achieve O(δ) TV bounds in time logarithmic in 1/δ, an O(δ) TV bound does not in general imply an O(ε) bound on the KL or Rényi divergence, or on the infinity-distance, for any δ > 0.² On the other hand, an ε infinity-distance bound immediately implies a bound of ε on the KL divergence D_KL and on the α-Rényi divergence D_α, since D_KL(µ, π) ≤ d_∞(µ, π) and D_α(µ, π) ≤ d_∞(µ, π) for any α > 0 and any pair of distributions µ, π.

²For instance, if π(θ) = 1 is the uniform distribution on [0, 1], then for every δ > 0 there is a distribution ν with ‖ν − π‖_TV ≤ 2δ and yet d_∞(ν, π) ≥ D_KL(ν, π) ≥ 1/2 (for example, a ν that takes the value e^{1/(2δ)} on an interval of length 2δe^{−1/(2δ)} and a suitable normalizing constant elsewhere).
Thus, under the same assumptions on K and f, Theorem 2.1 also yields a method of sampling from a Lipschitz concave log-density on K with ε error in KL divergence and Rényi divergence, with the same bound on the number of arithmetic operations; in particular, the number of arithmetic operations is logarithmic in 1/ε.

The polynomial dependence on 1/ε in [2] is due to the fact that they rely on a discrete-space Markov chain [1], run on a grid with cells of width w = O(ε/(L√d)), to sample from π within O(ε) infinity-distance. Since their Markov chain's runtime bound is polynomial in w^{−1}, they get a runtime bound for sampling within O(ε) infinity-distance that is polynomial in 1/ε. The proof of Theorem 2.1 bypasses the use of discrete grid-based Markov chains by introducing Algorithm 1, which transforms any sample within δ = O(ε (r/R)^d e^{−LR}) TV distance of the distribution π ∝ e^{−f} on the continuous set K (as opposed to a discretization of K) into a sample within O(ε) infinity-distance from π. This allows us to make use of a continuous-space Markov chain, whose step size is not restricted to a grid of width O(ε/(L√d)) and is instead independent of ε, to obtain a sample within O(ε) infinity-distance from π in time that is logarithmic in 1/ε.

Theorem 2.2 (Main technical contribution: Converting TV bounds to infinity-distance bounds). There exists an algorithm (Algorithm 1) which, given ε, r, R, L > 0, a membership oracle for a convex body K contained in a ball of radius R and containing a ball B(0, r), and an oracle which outputs a point from a distribution µ which is within TV distance δ of a distribution π ∝ e^{−f}, where f : K → R is an L-Lipschitz function and δ is of order ε (r/R)^d e^{−LR} / (3R(d log(R/r) + LR)^2) (see Appendix B for the exact values of δ and the related hyperparameters), outputs a point θ̂ ∈ K such that the distribution ρ of θ̂ satisfies d_∞(ρ, π) ≤ ε. Moreover, with very high probability³, this algorithm finishes in O(1) calls to the sampling and membership oracles, plus O(d) arithmetic operations.

³Algorithm 1 finishes in τ calls to the sampling and membership oracles, plus O(τd) arithmetic operations, where E[τ] ≤ 3, P(τ ≥ t) ≤ 2^{−t} for all t ≥ 0, and τ ≤ 5d log(R/r) + 5LR + log(2/ε) with probability 1.

To the best of our knowledge, Theorem 2.2 is the first result which, for any ε, L, r, R > 0, when provided as input a sample from a continuous-space distribution on a convex body K within some TV distance δ = δ(ε, r, R, L) > 0 of a given L-log-Lipschitz distribution π on K, where K is contained in a ball of radius R and contains a ball of smaller radius r, outputs a sample whose distribution is within infinity-distance O(ε) of π. This is in contrast to previous works [2] (see also [20], which applies only to the special case where π is the uniform distribution on K), which instead require as input a sample with bounded TV distance from the restriction of π to a discrete grid on K, and then convert this discrete-space sample into a sample within infinity-distance O(ε) of the continuous-space distribution π : K → R.

Algorithm 1: Interior-point TV to infinity-distance converter
Input: d ∈ N
Input: A membership oracle for a convex body K ⊆ R^d and an r > 0 such that B(0, r) ⊆ K.
Input: A sampling oracle which outputs a point from a distribution µ : K → R.
Output: A point θ̂ ∈ K.
1  Hyperparameters: α > 0, i_max ∈ N (set in Appendix B)
2  for i = 1, . . . , i_max do
3    Sample a point θ ∼ µ
4    Sample a point ξ ∼ Unif(B(0, 1))
5    Set Z ← θ + αrξ
6    Set θ̂ ← (1/(1 − α)) Z
7    If θ̂ ∈ K, output θ̂ with probability 1 − α/2 and halt. Otherwise, continue.
8  end for
9  Sample a point θ̂ ∼ Unif(B(0, r))
10 Output θ̂
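The following Python sketch illustrates the structure of Algorithm 1. It is a minimal illustration, not the authors' implementation: `sample_mu` (the TV-accurate sampling oracle), `membership` (the membership oracle for K), and the hyperparameters `alpha` and `i_max` are assumed to be supplied by the caller, with their values set as in Appendix B.

```python
import numpy as np

def sample_unit_ball(d, rng):
    """Sample uniformly from the d-dimensional unit ball B(0, 1)."""
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)
    return x * rng.random() ** (1.0 / d)

def tv_to_infinity_converter(d, r, alpha, i_max, sample_mu, membership, rng=None):
    """Sketch of Algorithm 1: convert a TV-accurate sampler into a sample whose
    distribution is close to pi in infinity-distance.

    sample_mu():    returns a point of K drawn from mu, with ||mu - pi||_TV <= delta
    membership(x):  returns True iff x lies in the convex body K
    alpha, i_max:   hyperparameters (set as in Appendix B of the paper)
    """
    rng = rng or np.random.default_rng()
    for _ in range(i_max):
        theta = sample_mu()                          # step 3: draw from the TV-accurate sampler
        xi = sample_unit_ball(d, rng)                # step 4: uniform noise direction
        z = theta + alpha * r * xi                   # step 5: smooth by a ball of radius alpha*r
        theta_hat = z / (1.0 - alpha)                # step 6: stretch by a factor 1/(1 - alpha)
        if membership(theta_hat) and rng.random() < 1.0 - alpha / 2:
            return theta_hat                         # step 7: accept (probability 1 - alpha/2)
    # steps 9-10: fallback, reached only with very small probability
    return r * sample_unit_ball(d, rng)
```

The fallback in the last line corresponds to steps 9–10 of the pseudocode: after i_max failed iterations the algorithm outputs a uniform point from the ball B(0, r) ⊆ K, which (as argued in Section 3.1) perturbs the infinity-distance guarantee by at most ε.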
Remark 2.3 (Extension to convex bodies with membership oracles). We note that Theorem 2.1 can be extended to the general setting where K is an arbitrary convex body contained in a ball of radius R and containing a ball of smaller radius r, and we only have membership-oracle access to K. Namely, one can plug the results of [29] into our Theorem 2.2 to generate a sample from an L-Lipschitz concave log-density on an arbitrary convex body K in a number of operations that is poly-logarithmic in 1/ε and 1/r, and polynomial in d, L, R. We omit the details.

Applications to differentially private optimization. Sampling from distributions with O(ε) infinity-distance error has many applications to differential privacy. Here, the goal is to find a randomized mechanism h which, given a dataset x ∈ D^n consisting of n datapoints, outputs model parameters θ̂ in some parameter space, which (approximately) minimize a given (negative) utility function f(θ, x), under the constraint that the output θ̂ preserves the pure ε-differential privacy of the datapoints x. A randomized mechanism h is said to be ε-differentially private if, for any datasets x, x′ ∈ D^n which differ in a single datapoint and any set S of parameters, P(h(x) ∈ S) ≤ e^ε P(h(x′) ∈ S).

As one application of Theorem 2.1, we consider the problem of finding an (approximate) minimizer θ̂ of an empirical risk function f : K × D^n → R under the constraint that the output θ̂ is ε-differentially private, where f(θ, x) := Σ_{i=1}^n ℓ_i(θ, x_i). Following [2], we assume that the ℓ_i(·, x) are L-Lipschitz for all x ∈ D^n and all i, for some given L > 0. In this setting, [2] show that the minimum ERM utility bound under the constraint that θ̂ is pure ε-differentially private, E_{θ̂}[f(θ̂, x)] − min_{θ∈K} f(θ, x) = Θ(dLR/ε), is achieved if one samples θ̂ from the exponential mechanism π ∝ e^{−(ε/(2LR)) f} with infinity-distance error at most O(ε). Plugging Theorem 2.1 into the framework of the exponential mechanism, we obtain a pure ε-differentially private mechanism which achieves the minimum expected risk (Corollary 2.4; see Section C.1 for a proof).

Corollary 2.4 (Differentially private empirical risk minimization). There exists an algorithm which, given ε, L, r, R > 0, A ∈ R^{m×d}, b ∈ R^m that define a polytope K := {θ ∈ R^d : Aθ ≤ b} contained in a ball of radius R and containing a ball B(0, r) of smaller radius r, and a convex function f(θ, x) := Σ_{i=1}^n ℓ_i(θ, x_i), where each ℓ_i(·, x_i) : K → R is L-Lipschitz, outputs a random point θ̂ ∈ K which is pure ε-differentially private and satisfies E_{θ̂}[f(θ̂, x)] − min_{θ∈K} f(θ, x) ≤ O(dLR/ε). Moreover, this algorithm takes at most T·m·d^{ω−1} arithmetic operations plus T evaluations of the function f, where T = O((m^2 d^3 + m^2 d n^2 ε^2) · [εn + d log(nRd/(rε))]).

Corollary 2.4 improves on the previous bound of [2], namely O(((1/ε^2)(m + n) d^{11} + ε^2 n^4 (m + n) d^7) · polylog(nRd/(rε))) arithmetic operations, by a factor of roughly max{d^{8−ω}/(ε^2 m^2), n d^{7−ω}/(ε m^2)}, in the setting where the ℓ_i are L-Lipschitz on a polytope K and each ℓ_i can be evaluated in O(d) operations. See Appendix C.1 for a proof of this corollary.
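As a usage sketch (not the authors' code), the exponential-mechanism recipe behind Corollary 2.4 can be phrased as follows: rescale the empirical risk so that its negative exponential is the target density, draw one sample with small infinity-distance error, and return it. The function `sample_inf_close`, which stands in for the sampler of Theorem 2.1, and the loss interface are assumptions made for illustration only; the exact constants in the privacy accounting are those of [2] and Appendix C.1.

```python
def dp_erm_exponential_mechanism(losses, data, L, R, eps, sample_inf_close):
    """Hypothetical sketch of the exponential mechanism for DP-ERM (Corollary 2.4).

    losses:            per-datapoint loss functions l_i(theta, x_i), each L-Lipschitz in theta
    data:              the dataset x = (x_1, ..., x_n)
    L, R:              Lipschitz constant of each l_i and radius of a ball containing K
    eps:               pure differential-privacy parameter
    sample_inf_close:  oracle taking a log-density g and an accuracy, and returning a sample
                       within that infinity-distance of pi proportional to exp(-g); this
                       stands in for the sampler of Theorem 2.1.
    """
    def empirical_risk(theta):
        return sum(l(theta, x) for l, x in zip(losses, data))

    # Target density of the exponential mechanism: pi(theta) ~ exp(-(eps / (2 L R)) * f(theta, x)).
    def scaled_risk(theta):
        return (eps / (2.0 * L * R)) * empirical_risk(theta)

    # Sampling with infinity-distance error O(eps) keeps the overall mechanism pure O(eps)-DP.
    return sample_inf_close(scaled_risk, eps)
```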
As another application, we consider the problem of finding a low-rank approximation of a sample covariance matrix Σ = Σ_{i=1}^n u_i u_i^⊤, where the datapoints u_i ∈ R^d satisfy ‖u_i‖ ≤ 1, in a differentially private manner. Given any integer k > 0, the goal is to find a (random) rank-k projection matrix P which maximizes the expected captured variance E_P[⟨Σ, P⟩] (also referred to as the utility of P), under the constraint that the mechanism which outputs P is ε-differentially private. This problem has many applications to statistics and machine learning, including differentially private principal component analysis (PCA) [7, 3, 15, 24]. When privacy is not a concern, the solution P which maximizes the variance is simply the projection matrix onto the subspace spanned by the top-k eigenvectors of Σ, and the maximum variance satisfies ⟨Σ, P⟩ = Σ_{i=1}^k λ_i, where λ_1 ≥ ··· ≥ λ_d > 0 denote the eigenvalues of Σ. However, when privacy is a concern, there is a tradeoff between the desired privacy level ε and the utility E_P[⟨Σ, P⟩], and the maximum achievable utility decreases with the privacy parameter ε. The best current utility bound for an ε-differentially private low-rank approximation algorithm was achieved by [24], who show that one can find a pure ε-differentially private random rank-k projection matrix P such that E_P[⟨Σ, P⟩] ≥ (1 − γ) Σ_{i=1}^k λ_i whenever Σ_{i=1}^k λ_i ≥ C dk/(γε), for any γ > 0. To generate the matrix P, their algorithm generates a sample, with infinity-distance error O(ε), from a Lipschitz concave log-density on a polytope, and transforms this sample into a projection matrix P. The sampling algorithm used in [24] has a bound of poly(1/ε, d, 1/λ_d) arithmetic operations, and they leave as an open problem whether the polynomial dependence on 1/ε can be improved to a logarithmic one. Corollary 2.5 shows that a direct application of Theorem 2.1 resolves this problem. (See Section C.2 for a proof.)

Corollary 2.5 (Differentially private low-rank approximation). There exists an algorithm which, given a sample covariance matrix Σ = Σ_{i=1}^n u_i u_i^⊤ for datapoints u_i ∈ R^d satisfying ‖u_i‖ ≤ 1, its eigenvalues λ_1 ≥ ··· ≥ λ_d > 0, an integer k, and ε, γ > 0, outputs a random rank-k symmetric projection matrix P such that P is ε-differentially private and satisfies the utility bound E_P[⟨Σ, P⟩] ≥ (1 − γ) Σ_{i=1}^k λ_i whenever Σ_{i=1}^k λ_i(Σ) ≥ C dk/(γε) for some universal constant C > 0. Moreover, the number of arithmetic operations is logarithmic in 1/ε and polynomial in d and 1/λ_d.

3 Proof Overviews

Given any ε > 0 and a function f : R^d → R, the goal is to sample from the distribution π(θ) ∝ e^{−f(θ)} constrained to a d-dimensional convex body K, with infinity-distance error at most O(ε), in a number of arithmetic operations that is logarithmic in 1/ε. We assume that K is contained in a ball of some radius R > 0 and contains a ball of some radius r > 0, and that f is L-Lipschitz. In addition, we would also like our bounds to be polylogarithmic in 1/r and polynomial in d, L, R, with a lower-order dependence on the dimension d than currently available bounds for sampling from Lipschitz concave log-densities on a polytope in infinity-distance [2]. We note that whenever K is contained in a ball of radius R and contains a ball B(0, r) of smaller radius r, we also have B(0, r) ⊆ K ⊆ B(0, 2R); hence, without loss of generality, we may assume that B(0, r) ⊆ K ⊆ B(0, R), as this changes the bounds in our main theorems by at most a constant factor. The main ingredient in the proof of Theorem 2.1 is Theorem 2.2, which uses Algorithm 1 to transform a TV-bounded sample into a sample from π with error bounded in d_∞.
Subsequently, to prove Theorem 2.1 when K is given as a polytope K := {x ∈ R^d : Ax ≤ b}, we plug in the Dikin walk Markov chain of [34], which generates independent samples from π with bounded TV error. We first present an overview of the proof of Theorem 2.2. (The full proof is omitted due to space restrictions and presented in Appendix B.) The proof of Theorem 2.1 is presented in Section 3.2.

3.1 Converting samples with TV bounds to infinity-distance bounds; proof of Theorem 2.2

Impossibility of obtaining log-dependence on infinity-distance via the grid walk. One approach is to observe that if e^{−f} has support on a discrete space S with at most |S| points, then any ν such that ‖ν − π‖_TV ≤ ε also satisfies d_∞(ν, π) ≤ 2|S| · (max_{z∈S} e^{−f(z)} / min_{z∈S} e^{−f(z)}) · ε, for any ε ≤ min_{z∈S} π(z). This suggests forming a grid G over K, using a discrete Markov chain to generate a sample within O(ε) TV distance of the discrete distribution π_G ∝ e^{−f} with support on the grid G, and then designing an algorithm which takes this sample as input and outputs a point with bounded infinity-distance to the continuous distribution π. This approach was used in [20] in the special case where π is uniform on K, and then extended by [2] to log-Lipschitz log-concave distributions. In their approach, [2] first run a grid-walk Markov chain on a discrete grid in a cube containing K. They then apply the bound from [1], which says that the grid walk obtains a sample Z within TV distance O(δ) of the distribution ∝ e^{−f} (restricted to the grid) in time that is polylogarithmic in 1/δ and quadratic in a^{−1}, where a is the distance between neighboring grid points. Since their grid has size |S| = Θ((R/a)^d), a TV distance of O(δ) automatically implies an infinity-distance of O(δ c |S|), where c is the ratio of the maximum to the minimum probability mass and satisfies c ≤ e^{LR} since f is L-Lipschitz on K ⊆ B(0, R). Thus, by using the grid walk to sample within TV distance δ = O(ε e^{−LR} (a/R)^d) of the discrete distribution π_G, they obtain a sample Z which also has infinity-distance O(ε) from π_G. Finally, to obtain a sample from the distribution π ∝ e^{−f} on the continuous space K, they sample a point uniformly from the grid cell [Z − a, Z + a]^d centered at Z. Since f is L-Lipschitz, the ratio e^{−f(Z)}/e^{−f(w)} is within e^{±O(ε)} for all w in the grid cell [Z − a, Z + a]^d as long as a = O(ε/(L√d)), implying that the resulting sample is within infinity-distance O(ε) of π ∝ e^{−f}. However, since the running-time bound of the grid walk is quadratic in a^{−1}, the grid coarseness a = O(ε/(L√d)) needed to achieve O(ε) infinity-distance from π leads to a running-time bound which is quadratic in 1/ε.

To get around this problem, rather than relying on a discrete-space Markov chain such as the grid walk to sample within O(ε) infinity-distance from π, we introduce an algorithm (Algorithm 1) which transforms any sample within δ = O(ε (r/R)^d e^{−LR}) TV distance of the distribution π ∝ e^{−f} on the continuous space K (as opposed to a grid-based discretization of K) into a sample within O(ε) infinity-distance from π. This allows us to make use of a continuous-space Markov chain, such as the Dikin walk of [34], whose step size is not restricted by a grid of coarseness O(ε/(L√d)) and is instead independent of ε, in order to generate a sample within O(ε) infinity-distance from π in runtime that is logarithmic in 1/ε.
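To spell out the scaling just described (a back-of-the-envelope restatement of the discussion above, not a new bound): a grid cell has diameter a√d, so keeping the variation of the log-density within a cell of order ε forces the grid spacing a to shrink with ε, and a runtime quadratic in 1/a then picks up a 1/ε² factor.

```latex
% Grid-walk scaling sketch (restating the discussion above):
\[
  L \cdot a\sqrt{d} = O(\varepsilon)
  \;\Longrightarrow\; a = O\!\Big(\tfrac{\varepsilon}{L\sqrt{d}}\Big),
  \qquad
  |S| = \Theta\!\Big(\big(\tfrac{R}{a}\big)^{d}\Big),
  \qquad
  \text{runtime} = \Omega\!\big(a^{-2}\big) = \Omega\!\Big(\tfrac{L^{2} d}{\varepsilon^{2}}\Big).
\]
```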
Converting continuous-space TV-bounded samples to infinity-distance-bounded samples. As discussed in Section 1, there are many Markov chain results which allow one to sample from a log-concave distribution on K with error bounded in weaker metrics such as total variation, Wasserstein distance, or KL divergence. However, when sampling from a continuous distribution, bounds in these metrics do not directly imply bounds in infinity-distance, and the techniques used to prove bounds in weaker metrics do not easily extend to methods for bounding the infinity-distance; see Section A.1.

Convolving with continuous noise. As a first attempt, we consider the following simple algorithm: sample a point θ_µ from a distribution µ with total-variation error ‖µ − π‖_TV ≤ O(ε). Since f is L-Lipschitz, for any η < ε/L and any ball B(z, η) in the η-interior of K (denoted by int_η(K); see Definition B.1), we can obtain a sample from a distribution ν such that |log(ν(z)/π(z))| ≤ ε for all z ∈ int_η(K) by convolving µ with the uniform distribution on the ball B(0, η). Sampling from this distribution can be achieved by first sampling θ_µ ∼ µ and then adding noise ξ ∼ Unif(B(0, η)) to the sample θ_µ. Unfortunately, this simple algorithm does not allow us to guarantee that |log(ν(z)/π(z))| ≤ ε at points z ∉ int_η(K), which are at distance less than η from the boundary of K. To see why, suppose that K = [0, 1]^d is the unit cube, that f is constant on K, and consider the point w = (1, . . . , 1) at a corner of the cube K. In this case we could have ν(z) ≤ 2^{−d} π(z) for all z in some ball containing w, and hence d_∞(ν, π) = sup_z |log(ν(z)/π(z))| ≥ d log 2, no matter how small we make η.

Stretching the convex body to handle points close to the boundary. To get around this problem, we would like to design an algorithm which samples from some distribution ν such that |log(ν(z)/π(z))| ≤ ε for all z ∈ K, including at points z near the corners of K. Towards this end, we first consider the special case where K is itself contained in the η-interior of another convex body K′, the function f : K → R extends to an L-Lipschitz function on K′ (also referred to, with slight abuse of notation, as f), and we are able to sample from the distribution ∝ e^{−f} on K′ with O(ε) total-variation error. If we sample from ∝ e^{−f} on K′ with total-variation error O(δ) for small enough δ (roughly, δ ≤ ε e^{−d log(R/r) − LR}), add noise ξ ∼ Unif(B(0, η)) to the sample, and then reject the perturbed point only if it is not in K, we obtain a sample whose distribution is within O(ε) infinity-distance of the distribution ∝ e^{−f} on K. However, we would still need to define and sample from such a convex body K′, and to make sure that K′ is not too large compared to K; otherwise the samples from the distribution ∝ e^{−f} on K′ may be rejected with high probability. Moreover, another issue we need to deal with is that f may not even be defined outside of K.

Figure 1: The construction used in the proof of Lemma B.1.

To get around these two problems, in Algorithm 1 we begin by taking as input a point θ_µ sampled from some distribution µ supported on K with ‖µ − π‖_TV ≤ δ, for some δ of order ε (r/R)^d e^{−LR} (see Appendix B), and add noise ξ uniformly distributed on B(0, αr) in order to sample from a distribution µ̂ which satisfies |log(µ̂(z)/π(z))| ≤ ε for all z ∈ int_{αr}(K). Here r is the radius of the small ball contained in K; the choice of radius αr for the noise distribution is made because, as we show in the following paragraphs, to sample from the distribution π on K with infinity-distance error ε it suffices to sample a point in the αr-interior of K and to then apply a stretching operation to K.
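The corner effect described above is easy to see numerically. The following sketch (illustrative only, not from the paper) estimates, by Monte Carlo, the density at a corner of the unit cube after convolving the uniform distribution on the cube with uniform noise on a small ball: only about a 2^{−d} fraction of the noise ball lies inside the cube, so the convolved density at the corner is roughly 2^{−d} times the uniform density.

```python
import numpy as np

def convolved_density_ratio_at_corner(d, eta=0.05, n_samples=200_000, seed=0):
    """Estimate nu(w)/pi(w) at the corner w = (1,...,1) of K = [0,1]^d, where nu is
    the uniform density on K convolved with Unif(B(0, eta)).

    nu(w) equals pi(w) times the fraction of the ball B(w, eta) that lies inside K,
    i.e. the fraction of offsets with all coordinates <= 0, which is 2^(-d)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    radii = rng.random(n_samples) ** (1.0 / d)
    offsets = eta * x * radii[:, None]               # uniform samples from B(0, eta)
    inside = np.all(offsets <= 0.0, axis=1)          # offsets keeping w + offset inside the cube
    return inside.mean()

for d in [2, 4, 8]:
    ratio = convolved_density_ratio_at_corner(d)
    print(f"d={d}: estimated nu(w)/pi(w) ~ {ratio:.4f}  (2^-d = {2.0 ** (-d):.4f})")
```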
We still need a method of sampling within O(ε) infinity-distance error of π on all of K, including in the region K \ int_{αr}(K) near the boundary of K. Towards this end, after Algorithm 1 generates a point Z = θ_µ + ξ from the above-mentioned distribution µ̂, it multiplies Z by 1/(1 − α) and returns the resulting point θ̂ := Z/(1 − α) if it lies in K, in other words, if Z ∈ (1 − α)K. If we can show that (1 − α)K ⊆ int_{αr}(K), then this would imply that |log(µ̂(z)/π(z))| ≤ ε for all z ∈ (1 − α)K, and hence that the distribution ρ of the returned point θ̂ satisfies

|log(ρ(z)/π((1 − α)z))| ≤ |log(µ̂((1 − α)z)/π((1 − α)z))| + log(1/(1 − α)^d) ≤ O(ε) for all z ∈ K.

Since f is L-Lipschitz we have |log(π(θ)/π((1 − α)θ))| = O(ε) for all θ ∈ K, and hence we would then have that the distribution ρ of the point θ̂ returned by Algorithm 1 satisfies

|log(ρ(z)/π(z))| ≤ |log(ρ(z)/π((1 − α)z))| + O(ε) ≤ O(ε) for all z ∈ K.   (1)

However, for (1) to hold, we still need to show that (1 − α)K ⊆ int_{αr}(K) (proved in Lemma B.1). In other words, we would like to show that for any point Z ∈ (1 − α)K, the ball B(Z, αr) of radius αr centered at Z is contained in K. To show this fact, it suffices to consider the convex hull C of B(0, r) ∪ {θ̂}, which is contained in K, and to show that it contains the ball B(Z, αr). Towards this end, we make the following geometric construction: we let p be a point such that the line through p and θ̂ is tangent to B(0, r) at p, and q the point on this line which minimizes the distance ‖q − Z‖_2 (see Figure 1). Since ∠0pθ̂ and ∠Zqθ̂ are both right angles, the triangles 0pθ̂ and Zqθ̂ are similar, and hence ‖Z − q‖_2 / r = ‖Z − θ̂‖_2 / ‖θ̂ − 0‖_2. In other words, ‖Z − q‖_2 = r · ‖θ̂ − (1 − α)θ̂‖_2 / ‖θ̂‖_2 = αr, implying that a ball of radius αr centered at Z is contained in C ⊆ K, and hence that (1 − α)K ⊆ int_{αr}(K).

Bounding the infinity-distance error. To complete the bound on the infinity-distance of the distribution ρ of the point returned by Algorithm 1 to the target distribution π, we must show both a lower bound (Lemma B.5) and an upper bound (Lemma B.6) on the ratio ρ(θ)/π(θ) at every point θ ∈ K. Both the upper and lower bounds are necessary to bound the infinity-distance d_∞(ρ, π) = sup_{θ∈K} |log(ρ(θ)/π(θ))|. Both Lemmas B.5 and B.6 require the input point to have TV error δ < ε (r/R)^d e^{−LR}. The term (r/R)^d reflects the ratio of the volume of the smoothing ball B(0, αr) to the volume of K; this bound holds since K is contained in a ball of radius R. The term e^{−LR} is a lower bound on the ratio min_{w∈K} π(w) / max_{w∈K} π(w) of the minimum to the maximum value of the density π at any two points of K; this bound holds since f is L-Lipschitz. The above choice of δ ensures that, in any ball B(z, αr) with center z in the αr-interior of K, the distribution µ of the input point, which satisfies ‖µ − π‖_TV ≤ δ, has between e^{−ε} and e^{ε} times the probability mass which the target distribution π has inside the ball B(z, αr). Thus, when the distribution µ is smoothed by adding noise uniformly distributed on a ball of radius αr, the smoothed density µ̂(·) is within a factor between e^{−ε} and e^{ε} of the target probability density π(·) at any point in the αr-interior of K, allowing us to bound the infinity-distance error of the smoothed distribution at any point in the αr-interior of K.
We then apply this fact, together with Lemma B.1 (which says that for any point θ ∈ K the point (1 − α)θ is in the αr-interior of K), to bound the distribution ρ of the output point (after the stretching operation) as follows:

ρ(θ) ≥ (1 − α)^d µ̂((1 − α)θ) ≥ (1 − α)^d π((1 − α)θ) e^{−ε} ≥ π(θ) e^{−O(ε)}.   (2)

Here our choice of hyperparameter α ≤ ε/max(d, LR) ensures that (1 − α)^d = Ω(1) and, since f is L-Lipschitz, that π((1 − α)θ) ≥ e^{−ε} π(θ). This proves the lower bound (Lemma B.5). The proof of the upper bound (Lemma B.6) follows in a similar way to equation (2), but with the inequalities going in the opposite direction.

Bounding the number of iterations and concluding the proof of Theorem 2.2. We still need to deal with the possibility that the point θ̂ is not accepted. If this occurs, roughly speaking, we repeat the above procedure until a point θ̂ with distribution ρ is accepted. To bound the number of iterations, we show that θ̂ is in K with high probability. Towards this end, we first use the facts that f is L-Lipschitz and K ⊆ B(0, R) to show that the probability that a point sampled from π ∝ e^{−f} lies inside (1 − α)K is at least (1 − α)^d e^{−αLR} ≥ 9/10 (Lemma B.2). Lemma B.2 says that if you stretch the polytope by a factor of 1/(1 − α), then most of the volume of the stretched polytope (1/(1 − α))K remains inside the original polytope K. The term (1 − α)^d is just the ratio of the volume of (1 − α)K to the volume of K; and, since f is L-Lipschitz, the term e^{−αLR} lower-bounds the ratio π(θ)/π(θ/(1 − α)) of the target density at any point θ ∈ K to its value at the point θ/(1 − α) to which the stretching operation transports θ, whenever θ/(1 − α) ∈ K. The choice of hyperparameter α ≤ ε/max(d, LR) ensures that the acceptance probability (1 − α)^d e^{−αLR} guaranteed by Lemma B.2 is at least 9/10. Since the convex body (1 − α)K contains the ball B(0, r/2), applying Lemma B.1 a second time (this time to the convex body (1 − α)K), we get that (1 − 3α)K ⊆ int_{αr/2}((1 − α)K). Thus, by Lemma B.2, θ_µ lies inside int_{αr/2}((1 − α)K) with probability at least 9/10 − δ ≥ 8/10 (as θ_µ is sampled from a distribution within TV distance δ of π). Therefore, since ξ ∈ B(0, αr), the probability that the point θ̂ = (θ_µ + ξ)/(1 − α) is in K (and is therefore not rejected) is greater than 3/4.⁴ This implies that the number of iterations until our algorithm returns a point θ̂ is at most k with probability at least 1 − 2^{−k} for every k > 0, and that the expected number of iterations is at most 2 (proved in Corollary B.4). Since each iteration requires one random sample from the distribution µ and one call to a membership oracle for K (to determine whether Z/(1 − α) ∈ K), the number of sampling-oracle and membership-oracle calls required by Algorithm 1 is O(1) with very high probability. Therefore, with high probability, Algorithm 1 returns a point θ̂ from a distribution with infinity-distance at most ε from π after O(1) calls to the sampling and membership oracles.

Since Algorithm 1 succeeds with probability at least 1 − 2^{−k} within k iterations, after i_max = 5d log(R/r) + 5LR + log(2/ε) iterations Algorithm 1 will have succeeded with probability roughly 1 − ε (r/R)^{5d} e^{−5LR}. In the very unlikely event that Algorithm 1 still has not succeeded after i_max iterations, it simply outputs a point sampled from the uniform distribution on the ball B(0, r) of radius r contained in K. The probability mass of the target distribution π inside this ball is at least (r/R)^d e^{−LR}; thus, since f is L-Lipschitz, we show in Corollary B.4 that outputting a sample from the uniform distribution on this ball with probability roughly ε (r/R)^{5d} e^{−5LR} does not change the d_∞ error of the sample returned by the algorithm by more than ε.

⁴In Algorithm 1 we reject θ̂ with a slightly higher probability than necessary to ensure that, in differential privacy applications, in addition to the privacy of the point returned by the algorithm, the runtime is also ε-differentially private.
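For concreteness, the following helper sketches how the hyperparameters of Algorithm 1 could be set from the quantities discussed above. The expressions mirror the values quoted in this overview (α ≤ ε/max(d, LR), a TV budget of order ε(r/R)^d e^{−LR} divided by 3R(d log(R/r) + LR)^2, and i_max ≈ 5d log(R/r) + 5LR + log(2/ε)); the exact constants used in the paper are those of Appendix B, so treat these as indicative rather than definitive.

```python
import math

def algorithm1_hyperparameters(eps, d, L, R, r):
    """Indicative hyperparameter choices for Algorithm 1; see Appendix B of the paper
    for the exact values. The formulas below follow the proof overview."""
    alpha = eps / max(d, L * R)                       # stretch / noise parameter
    # TV accuracy required of the input sampler (up to the constants of Appendix B):
    delta = (eps * (r / R) ** d * math.exp(-L * R)
             / (3.0 * R * (d * math.log(R / r) + L * R) ** 2))
    i_max = math.ceil(5 * d * math.log(R / r) + 5 * L * R + math.log(2.0 / eps))
    return alpha, delta, i_max

alpha, delta, i_max = algorithm1_hyperparameters(eps=0.1, d=5, L=1.0, R=10.0, r=1.0)
print(alpha, delta, i_max)
```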
3.2 Completing the proof of Theorem 2.1

Proof: By Theorem 2.2, the output of Algorithm 1 has infinity-distance error bounded by ε as long as the input samples have TV error bounded by δ, where δ is of order ε (r/R)^d e^{−LR} / (3R(d log(R/r) + LR)^2), and, with high probability, Algorithm 1 requires O(1) such independent samples. To generate a sample from π with TV error O(δ) when K = {θ ∈ R^d : Aθ ≤ b} is a polytope defined by m inequalities, we use the Dikin walk Markov chain of [34]. This Markov chain requires an initial point from some distribution µ_0 which is w-warm with respect to the stationary distribution π, that is, sup_{z∈K} µ_0(z)/π(z) ≤ w. To obtain a warm start, we let µ_0 be the uniform distribution on the ball of radius r contained in K and sample from µ_0. Since f is L-Lipschitz and K is contained in a ball of radius R, µ_0 is w-warm with

w ≤ (Vol(B(0, R)) / Vol(B(0, r))) · 3 · (max_{z∈K} π(z) / min_{z∈K} π(z)).

From [34], we have that from this w-warm start the Dikin walk Markov chain requires at most O((m^2 d^3 + m^2 d L^2 R^2) log(w/δ)) steps to generate a sample with TV distance at most δ from π, where each step makes one function evaluation and O(m d^{ω−1}) arithmetic operations. Plugging in the above values of δ and w, the number of Markov chain steps needed to generate each independent sample with the required TV error O(δ) is T = O((m^2 d^3 + m^2 d L^2 R^2) · [LR + d log((Rd + LRd)/(rε))]). Since the number of independent samples required as input by Algorithm 1 is O(1) w.h.p., the number of arithmetic operations for Algorithm 1 to output a point with at most ε infinity-distance error is O(T m d^{ω−1}). Finally, we note that in the more general setting where K is a convex body with a membership oracle (but not necessarily a polytope), we can instead use, for instance, the hit-and-run Markov chain of [29] to generate samples from π with TV error O(δ) in a number of membership- and function-evaluation-oracle calls that is polynomial in d and poly-logarithmic in 1/δ, R, and 1/r. We can then plug such a sample into Algorithm 1 to obtain a sample from π with infinity-distance error O(ε) in a number of oracle calls that is poly-logarithmic in 1/ε and 1/r, and polynomial in d, L, R (see Remark 2.3).
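For readers who want to see what a step of the Markov chain being plugged in looks like, here is a schematic, simplified Metropolis-filtered Dikin walk step for a polytope {x : Ax ≤ b} and target ∝ e^{−f}. It follows the standard Dikin walk template rather than the exact chain of [34] (which uses additional modifications and carefully chosen step sizes), so the step size and acceptance rule below are illustrative only.

```python
import numpy as np

def dikin_step(x, A, b, f, step=0.1, rng=None):
    """One schematic Dikin-walk step for the density proportional to exp(-f) on the
    polytope {y : A y <= b}. Illustrative; not the exact chain of [34]."""
    rng = rng or np.random.default_rng()
    d = x.shape[0]

    def hessian(y):
        s = b - A @ y                                  # slacks, positive inside the polytope
        return (A / s[:, None] ** 2).T @ A             # log-barrier Hessian: sum_i a_i a_i^T / s_i^2

    def proposal_logdensity(y, center, H):
        diff = y - center
        _, logdet = np.linalg.slogdet(H)
        return 0.5 * logdet - 0.5 * d / step**2 * (diff @ H @ diff)

    Hx = hessian(x)
    # Propose from a Gaussian shaped by the Dikin ellipsoid at x.
    z = x + (step / np.sqrt(d)) * np.linalg.solve(np.linalg.cholesky(Hx).T, rng.standard_normal(d))
    if np.any(A @ z >= b):
        return x                                       # proposal left the polytope: reject
    Hz = hessian(z)
    # Metropolis filter for the target exp(-f), correcting for the asymmetric proposal.
    log_accept = (-f(z) + f(x)
                  + proposal_logdensity(x, z, Hz)
                  - proposal_logdensity(z, x, Hx))
    return z if np.log(rng.random()) < min(0.0, log_accept) else x
```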
4 Conclusion, Limitations, and Future Work

To the best of our knowledge, this is the first work that presents an algorithm for sampling from log-concave distributions on convex bodies that comes with infinity-distance bounds and whose running time depends logarithmically on 1/ε. Towards this, the main technical contribution is Algorithm 1 (and Theorem 2.2), which achieves this improved dependence on ε by taking as input continuous samples from a convex body with TV bounds and converting them to samples with infinity-distance bounds. On the other hand, our bounds are polynomial in LR, whereas there are algorithms for sampling from log-concave distributions π ∝ e^{−f} on a convex body in total variation distance that are polylogarithmic in R and do not assume f to be Lipschitz [29]. Thus, the main open problem that remains is whether one can also obtain running-time bounds for sampling in the infinity-distance which are poly-logarithmic in R and do not require f to be Lipschitz. Our main result also has direct applications to differentially private optimization (Corollaries 2.4 and 2.5). Differential privacy is a notion which has been embraced in many technologies in societal contexts where privacy of individuals is a concern. Hence, we see our work as having potential for positive societal impact and do not foresee any potential negative societal impacts.

Acknowledgments and Disclosure of Funding

This research was supported in part by NSF CCF-2104528, CCF-1908347, and CCF-2112665 awards.

References

[1] David Applegate and Ravi Kannan. Sampling and integration of near log-concave functions. In Proceedings of the Twenty-Third Annual ACM Symposium on Theory of Computing, pages 156–163, 1991.
[2] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464–473. IEEE, 2014.
[3] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the SuLQ framework. In Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 128–138, 2005.
[4] Nicolas Brosse, Alain Durmus, Éric Moulines, and Marcelo Pereyra. Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo. In Conference on Learning Theory, pages 319–342. PMLR, 2017.
[5] Sébastien Bubeck, Ronen Eldan, and Joseph Lehec. Finite-time analysis of projected Langevin Monte Carlo. In Advances in Neural Information Processing Systems, pages 1243–1251, 2015.
[6] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems, 24:2249–2257, 2011.
[7] Kamalika Chaudhuri, Anand Sarwate, and Kaushik Sinha. Near-optimal differentially private principal components. Advances in Neural Information Processing Systems, 25:989–997, 2012.
[8] Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, and Bin Yu. Vaidya walk: A sampling algorithm based on the volumetric barrier. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1220–1227. IEEE, 2017.
[9] Xiang Cheng and Peter Bartlett. Convergence of Langevin MCMC in KL-divergence. In Algorithmic Learning Theory, pages 186–211. PMLR, 2018.
[10] Arnak S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
[11] Arnak S. Dalalyan and Lionel Riou-Durand. On sampling from a log-concave density using kinetic Langevin diffusions. Bernoulli, 26(3):1956–1988, 2020.
[12] Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.
[13] Alain Durmus and Éric Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Annals of Applied Probability, 27(3):1551–1587, 2017.
[14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
[15] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Analyze Gauss: optimal bounds for privacy-preserving principal component analysis.
In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 11–20, 2014.
[16] M. E. Dyer and A. M. Frieze. On the complexity of computing the volume of a polyhedron. SIAM Journal on Computing, 17(5):967–974, 1988.
[17] Alan Frieze and Ravi Kannan. Log-Sobolev inequalities and sampling from log-concave distributions. The Annals of Applied Probability, 9(1):14–26, 1999.
[18] Alan Frieze, Ravi Kannan, and Nick Polson. Sampling from log-concave distributions. The Annals of Applied Probability, pages 812–837, 1994.
[19] Arun Ganesh and Kunal Talwar. Faster differentially private samplers via Rényi divergence analysis of discretized Langevin MCMC. Advances in Neural Information Processing Systems, 33, 2020.
[20] Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, pages 705–714, 2010.
[21] He Jia, Aditi Laddha, Yin Tat Lee, and Santosh S. Vempala. Reducing isotropy and volume to KLS: an O*(n³ψ²) volume algorithm. In STOC '21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 961–974. ACM, 2021.
[22] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.
[23] Ravi Kannan and Hariharan Narayanan. Random walks on polytopes and an affine interior point method for linear programming. Mathematics of Operations Research, 37(1):1–20, 2012.
[24] Jonathan Leake, Colin S. McSwiggen, and Nisheeth K. Vishnoi. A polynomial-time algorithm and applications for matrix sampling from Harish-Chandra-Itzykson-Zuber densities. In ACM Symposium on Theory of Computing (STOC), 2021.
[25] Yin Tat Lee and Santosh Vempala. Eldan's stochastic localization and the KLS hyperplane conjecture: An improved lower bound for expansion. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 998–1007. IEEE, 2017.
[26] Yin Tat Lee and Santosh S. Vempala. Convergence rate of Riemannian Hamiltonian Monte Carlo and faster polytope volume computation. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1115–1121, 2018.
[27] László Lovász and Miklós Simonovits. Random walks in a convex body and an improved volume algorithm. Random Structures & Algorithms, 4(4):359–412, 1993.
[28] László Lovász and Santosh Vempala. Hit-and-run is fast and fun. Preprint, Microsoft Research, 2003.
[29] László Lovász and Santosh Vempala. Fast algorithms for logconcave functions: Sampling, rounding, integration and optimization. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS '06), pages 57–68. IEEE, 2006.
[30] László Lovász and Santosh Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007.
[31] Oren Mangoubi and Aaron Smith. Mixing of Hamiltonian Monte Carlo on strongly log-concave distributions: Continuous dynamics. The Annals of Applied Probability, in press.
[32] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS '07), pages 94–103. IEEE, 2007.
[33] Hariharan Narayanan. Randomized interior point methods for sampling and optimization. The Annals of Applied Probability, 26(1):597–641, 2016.
[34] Hariharan Narayanan and Alexander Rakhlin. Efficient sampling from time-varying log-concave distributions. The Journal of Machine Learning Research, 18(1):4017–4045, 2017.
[35] Sushant Sachdeva and Nisheeth K. Vishnoi. The mixing time of the Dikin walk in a polytope - a simple proof. Operations Research Letters, 44(5):630–634, 2016.
[36] Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. In Advances in Neural Information Processing Systems, 2019.
[37] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer, 2009.
[38] Nisheeth K. Vishnoi. An introduction to Hamiltonian Monte Carlo method for sampling. CoRR, abs/2108.12107, 2021.
[39] Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
[40] Andre Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Conference on Learning Theory, pages 2093–3027. PMLR, 2018.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] (See Section 4)
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] (See Section 3.2, Appendix B and C)
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See supplementary material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]