# computational_separations_between_sampling_and_optimization__96d42d36.pdf

Computational Separations between Sampling and Optimization

Kunal Talwar

Google Brain Mountain View, CA kunal@google.com

Two commonly arising computational tasks in Bayesian learning are Optimization (Maximum A Posteriori estimation) and Sampling (from the posterior distribution). In the convex case these two problems are efﬁciently reducible to each other. Recent work [Ma et al., 2019] shows that in the non-convex case, sampling can sometimes be provably faster. We present a simpler and stronger separation. We then compare sampling and optimization in more detail and show that they are provably incomparable: there are families of continuous functions for which optimization is easy but sampling is NP-hard, and vice versa. Further, we show function families that exhibit a sharp phase transition in the computational complexity of sampling, as one varies the natural temperature parameter. Our results draw on a connection to analogous separations in the discrete setting which are well-studied.

1 Introduction

Given a a compact set X Rd and function f : X R, one can deﬁne two natural problems:

Optimize(f, X, ε) : Find x X such that f(x) f(x ) + ε for all x X.

Sample(f, X, η) : Sample from a distribution on X that is η-close to µ (x) exp( f(x)).

These problems arise naturally in machine learning settings. When f is the negative log likelihood function of a posterior distribution, the optimization problem corresponds to the Maximum A Posteriori (MAP) estimate, whereas the Sampling problem gives us a sample from the posterior. In this work we are interested in the computational complexities of these tasks for speciﬁc families of functions.

When f and X are both convex, these two problems have a deep connection (see e.g. Lovasz and Vempala [2006]) and are efﬁciently reducible to each other in a very general setting. There has been considerable interest in both these problems in the non-convex setting. Given that in practice, we are often able to practically optimize certain non-convex loss functions, it would be appealing to extend this equivalence beyond the convex case. If sampling could be reduced to optimization for our function of interest (e.g. differentiable Lipschitz functions), that might allow us to design sampling algorithms for the function that are usually efﬁcient in practice. Ma et al. [2019] recently showed that in the case when f is not necessarily convex (and X = Rd), these problems are not equivalent. They exhibit a family of continuous, differentiable functions for which approximate sampling can be done efﬁciently, but where approximate optimization requires exponential time (in an oracle model à la Nemirovsky and Yudin [1983]). In this work, we study the relationship of these two problems in more detail.

33rd Conference on Neural Information Processing Systems (Neur IPS 2019), Vancouver, Canada.

To aid the discussion, it will be convenient to consider a more general sampling problem where we want to sample with probability proportional to exp( λf(x)) for a parameter λ > 0. Such a scaling has no effect on the optimization problem, up to scaling of ε. However changing λ can signifcantly change the distribution for the sampling problem. In statistical physics literature, this parameter is the inverse temperature. For families F that are invariant to multiplication by a positive scalar (such as the family of convex functions), this λ parameter has no impact on the complexity of sampling from the family. We will however be looking at families of functions that are controlled in some way (e.g. bounded, Lipschitz, or Smooth) and do not enjoy such an invariance to scale. E.g. in some Bayesian settings, each sample may give us a 1-smooth negative log likelihood function, so we may want to consider the family Fsmooth of 1-smooth functions. Given n i.i.d. samples, the posterior log likelihood would be nf, where f = 1

i fi is in Fsmooth. The parameter λ then corresponds naturally to the number of samples n.

This phenomenon of sampling being easier than optimization is primarily a high temperature or low signal phenomenon. As λ approaches inﬁnity, the distribution exp( λf) approaches a point mass at the minimizer of f. This connection goes back to at least Kirkpatrick et al. [1983] and one can easily derive a quantitative ﬁnite-λ version of this statement for many function families. Ma et al. [2019] reconcile this with their separation by pointing out that their sampling algorithm becomes inefﬁcient as λ increases.

We ﬁrst show a more elementary and stronger separation. We give a simple family of continuous Lipschitz functions which are efﬁciently samplable but hard even to approximately optimize. This improves on the separation in Ma et al. [2019] since our sampler is exact (modulo representation issues), and much simpler. The hardness of optimization here is in the oracle model, where the complexity is measured in terms of number of point evaluations of the function or its gradients.

While these oracle model separations rule out black-box optimization, they leave open the possibility of efﬁcient algorithms that access the function in a different way. We next show that this hardness can be strengthened to an NP-hardness for an efﬁciently computable f. This allows for the implementation of any oracle for f or its derivatives. Thus assuming the Exponential Time Hypothesis [Impagliazzo and Paturi, 2001], our result implies the oracle model lower bounds. Additionally, it rules out efﬁcient non-blackbox algorithms that could examine the implementation of f beyond point evaluations. We leave open the question of whether other oracle lower bounds [Nemirovsky and Yudin, 1983, Nesterov, 2014, Bubeck, 2015, Hazan, 2016] in optimization can be strengthened to NP-hardness results.

We next look at the large λ case. As discussed above, for large enough λ sampling must be hard if optimization is. Is hardness of optimization the only obstruction to efﬁcient sampling? We answer this question in the negative. We exhibit a family of functions for which optimization is easy, but where sampling is NP-hard for large enough λ. We draw on extensive work on the discrete analogs of these questions, where f is simple (e.g. linear) but X is a discrete set.

Our upper bound on optimization for this family can be strengthened to work under weaker models of access to the function, where we only have blackbox access to the function. In other words, there are functions that can be optimized via gradient descent for which sampling is NP-hard. Conceptually, this separation is a result of the fact that ﬁnding one minimizer sufﬁces for optimization whereas sampling intuitively requires ﬁnding all minima.

Both the separation result of Ma et al. [2019], and our small-λ result have the property that the sampling algorithm s complexity increases exponentially in poly(λ). Thus as we increase λ, the problem gradually becomes harder. Is there always a smooth transition in the complexity of sampling?

Our ﬁnal result gives a surprising negative answer. We exhibit a family of easy-to-optimize functions for which there is a sharp threshold: there is a λc such that for λ < λc, sampling can be done efﬁciently, whereas for λ > λc, sampling becomes NP-hard. In the process, this demonstrates that for some families of functions, efﬁcient sampling algorithms can be very structure-dependent, and do not fall into the usual Langevin-dynamics, or rejection-sampling categories.

Our results show that once we go beyond the convex setting, the problems of sampling and optimization exhibit a rich set of computational behaviors. The connection to the discrete case should help further understand the complexities of these problems.

The rest of the paper is organized as follows. We start with some preliminaries in Section 2. We give a simple separation between optimization and sampling in Section 3 and derive a computational version of this separation. Section 4 relates the discrete sampling/optimization problems on the hypercube to their continuous counterparts, and uses this connection to derive NP-hardness results for sampling for function families where optimization is easy. In Section 5, we prove the sharp threshold for λ. We describe additional related work in Section 6. Some missing proofs and strengthenings of our results are deferred to the supplementary material.

2 Preliminaries

We consider real-valued functions f : Rd R. We will be restricting ourselves to functions that are continuous and bounded. We say a function f is L-Lipschitz if f(x) f(x ) L x x for all x, x Rd. In this work, will denote the Euclidean norm.

We will look at speciﬁc families of functions which have compact representations, and ask questions about efﬁciency of optimization and sampling. We will think of d as a parameter, and look at function families such that at any function in the family can be computed in poly(d) time and space.

We will look at constrained optimization in this work and our constraint set X will be a Euclidean ball. Our hardness results however do not stem from the constraint set, and nearly all of our results can be extended easily to the unconstrained case.

Given λ > 0 and a function f, we let Dλ,X f denote the distribution on X with Pr[x] exp( λf(x)).

When X is obvious from context, we will usually omit it and write Dλ f . We will Zλ,X f for the partition function R

X exp( λf(x))dx.

We will also look at real-valued functions h : Hd R, where Hd = { 1, 1}d is the d-dimensional hypercube. We will often think of a y Hd as being contained in Rd. Analagous to the Euclidean space case, we deﬁne Dλ,Hd h as the distribution over the hypercube with Pr[y] exp( λh(y)), and deﬁne Zλ,Hd h to be P

y Hd exp( λh(y)).

We say that an algorithm η-samples from Dλ f if it samples from a distribution that is η-close to Dλ f in statistical distance. We will also use the Wasserstein distance between distributions on Rd, deﬁned as W(P, Q) eq= infπ E(x,x ) π[ x x 2], where the inf is taken over all couplings π between P and Q. We remark that our results are not sensitive to the choice of distance between distributions and extend in a straightforward way to other distances. As is common in literature on sampling from continuous distributions, we will for the most part assume that we can sample from a real-valued distribution such as a Gaussian and ignore bit-precision issues. Statements such as our NP-hardness results usually require ﬁnite precision arithmetic. This issue is discussed at length by Tosh and Dasgupta [2019] and following them, we will discuss using Wasserstein distance in those settings. An ε-optimizer of f is a point x X such that f(x) f(x ) + ε for any x X.

In the supplementary material, we quantify the folklore results showing that sampling for high λ implies approximate optimization. Quantitatively, they say that for L-Lipschitz functions over a ball of radius R, sampling implies ε-approximate optimization if λ Ω(d ln LR

ε /ε). Similarly, for β-smooth functions over a ball of radius R, λ Ω(d ln βR

ε /ε) sufﬁces to get ε-close to an optimum.

3 A Simple separation

We consider the case when X = Bd(0, 1) is the unit norm Euclidean ball in d dimensions. We let FLip be the family of all 1-lipschitz functions from X to R. We show that for any f FLip, exact sampling can be done in time exp(O(λ)). On the other hand, for any algorithm, there is an f FLip that forces the algorithm to use Ω(λ/ε)d queries to f to get an ε-approximate optimal solution. Thus, e.g., for constant λ, sampling can be done in poly(d) time, whereas optimization requires time exponential in the dimension. Moreover, for any λ d, there is an exponential gap between the complexities of these two problems. Our lower bound proof is similar to the analagous claim in Ma et al. [2019], but has better parameters due to the simpler setting. Our upper bound proof is signifantly simpler and gives an exact sampler.

Theorem 1 (Fast Sampling). There is an algorithm that for any f FLip, outputs a sample from Dλ f and makes an expected O(exp(2λ)) oracle calls to computing f.

Proof. The algorithm is based on rejection sampling. We ﬁrst compute f(0) and let M = f(0) 1. By the Lipschitzness assumption, f(x) [M, M + 2] for all x in the unit ball. The algorithm now repeatedly samples a random point x from the unit ball. With probability exp(λ(M f(x))), this point is accepted and we output it. Otherwise we continue.

Since exp(λ(M f(x))) [exp( 2λ), 1], this is a rejection sampling algorithm, and it outputs an exact sample from Dλ f . Each step accepts with probability at least exp( λ). Thus the algorithm terminates in an expected O(exp(2λ)) many steps, each of which requires one evaluation of f.

Remark 1. The above algorithm assumes access to an oracle to sample from Bd(0, 1) to arbitrary precision. This can be replaced by sampling from a grid of ﬁnite precision points in the ball. This creates two sources of error. Firstly, the function is not constant in the grid cell. This error is easily bounded since f is Lipschitz. Secondly, some grid cells may cross the boundary of Bd(0, 1). This is a probability d2 b event when sampling a grid point with b bits of precision. Taking these errors into account gives us a sample within Wasserstein distance at most O((d + λ)2 b). Remark 2. The above is a Las Vegas algorithm. One can similarly derive a Monte Carlo algorithm by aborting and outputting a random x after exp(2λ) log 1

Remark 3. Under the assumptions in Ma et al. [2019] (β-smooth f, (0) = 0, κβ-strong convexity outside a ball of radius R), a direct reduction to our setting will be lossy and a rejection-samplingbased approach will not be efﬁcient. The Langevin dynamics based sampler in that work is more efﬁcient under their assumptions.

Theorem 2 (No Fast Optimization). For any algorithm A that queries f or any of its derivatives at less than (1/4ε)d points, there is an f FLip for which A fails to output an ε-optimizer of f except with negligible probability.

Proof. Consider the function fx that is zero everywhere, except for a small ball of radius 2ε around x, where it is f(y) = y x 2ε. i.e. the function1 is fx(y) = min(0, y x 2ε). This function has optimum 2ε. Let g be the zero function.

Let A be an algorithm (possibly randomized) that queries f or its derivatives at T (1/4ε)d points. Consider running A on a function f chosen randomly as:

f = g with probability 1

2 fx for a x chosen u.a.r. from B(0, 1) otherwise.

Until A has queried a point in B(x, 2ε), the behavior of the algorithm on fx and g is identical, since the functions and all their derivatives agree outside this ball. Since an ε-approximation must distinguish these two cases, for A to succeed, it must query this ball. The probability that A queries in this ball in any given step is at most vol(B(x,2ε))

vol(B(0,1)) = (2ε)d. As A makes only (1/4ε)d queries in total, the expected number of queries A makes to the ball B(x, 2ε) is at most 2 d. Thus with probability at least 1 1

2d , the algorithm fails to distinguish g from fx, and hence cannot succeed.

3.1 Making the Separation Computational

The oracle-hardness of Theorem 2 stems from possibly hiding a minimizer x of f. The computational version of this hardness result will instead possibly hide an appropriate encoding of a satisfying assignment to a 3SAT formula.

Theorem 3 (No Fast Optimization: Computational). There is a constant ε > 0 such that it is NP-hard to ε-optimize an efﬁciently computable Lipschitz function over the unit ball.

Proof. Let φ : {0, 1}d Bd(0, 1) be a map such that φ is efﬁciently computable, φ(y) φ(y ) 4ε and such that given x Bd(φ(y), 2ε), we can efﬁciently recover y. For a small enough absolute

1As deﬁned, this function is Lipschitz but not smooth. It can be easily modiﬁed to a 2-Lipschitz, 2/ε-smooth function by replacing its linear behavior in the ball by an appropriately Huberized version.

Figure 1: (Left) An example of a function h for d = 2, with the 2-d hypercube shown in gray and the values of h denoted by blue points. (Right) The corresponding function f that results from the transformation, for M = 4, R = 2.

constant ε, such maps can be easily constructed using error correcting codes and we defer details to supplementary material.

We start with an an instance I of 3SAT on d variables, and deﬁne f as follows. Given a point x, we ﬁrst ﬁnd a y {0, 1}d, if any, such that x Bd(φ(y), 2ε). If no such y exists, f(x) is set to 0. If such a y exists, we interpret it as an assignment to the variables in the 3SAT instance I and set f(x) to be min(0, φ(y) x 2ε) if y is a satisfying assignment to instance I, and to 0 otherwise.

It is clear from deﬁnition that f as deﬁned is efﬁciently computable. Moreover, the minimum attained value for f is 2ε if I is satsiﬁable, and 0 otherwise. Since distinguishing between these two cases is NP-hard, so is ε-optimization of f.

We note that assuming the exponential time hypothesis, this implies the exp(Ω(d)) oracle complexity lower bound of Theorem 2.

4 Relating Discrete and Continuous Settings

For any function h on the hypercube, we can construct a function on f on Rd such that optimization of f and h are reducible to each other, and similarly sampling from f and h are reducible to each other. This would allow us to use separation results for the hypercube case to establish analagous separation results for the continuous case.

Theorem 4. Let h : Hd R have range [0, d]. Fix M 2d, R 2

d. Then there is a funtion f : Bd(0, R) R satisfying the following properties:

Efﬁciency: Given x Bd(0, R) and oracle access to h, f can computed in polynomial time.

Lipschitzness: f is continuous and L-Lipschitz for L = 2M.

Sampler Equivalence: Fix λ 4d ln 24R

M . Given access to an η-sampler for Dλ,Hd h , there is an efﬁcient η -sampler for Dλ f , for η = η + exp( Ω(d)). Conversely, given access to an η-sampler for Dλ f , there is an efﬁcient η -sampler for Dλ,Hd h for η = η + exp( Ω(d)).

Proof. The function f is fairly natural: it takes a large value M 2d at most points, except in small balls around the hypercube vertices. At each hypercube vertex, f is equal to the h value at the vertex, and we interpolate linearly in a small ball. See Figure 2 for an illustration.

Formally, let round : R { 1, 1} be the function that takes the value 1 for x 0 and 1 otherwise, and let round : Rd Hd be its natural vectorized form that applies the function coordinate-wise. Let g(x) = x round(x) denote the Euclidean distance from x to round(x). Let s = 2M. The

function f is deﬁned as follows:

( h(round(x)) + s g(x) if g(x) M h(round(x))

s M if g(x) M h(round(x))

s It is easy to verify that f is continuous. Moreover, since M 2d, and h has range [0, d], the value M h(round(x))

s is in the range [ 1

2]. It follows that f takes the value M outside balls of radius 1

2 around the hypercube vertices, and is strictly smaller than M in balls for radius 1

Since round(x) is easy to compute, this implies that f can be computed in polynomial time, using a single oracle call to h. Moreover it is immediate from the deﬁnition that the function f has Lipschitz constant s.

Note that f(y) = h(y) for y Hd and f(x) h(round(x)) for all x Bd(0, R). This implies that the minimum value of f is the same as the minimum value of h, and indeed any (ε-)minimizer y of h also (ε-)minimizes f. Conversely, let x be an ε-minimizer of f. Since h(round(x)) f(x), it follows that round(x) is an ε-minimizer of h.

Finally we prove the equivalence of approximate sampling. Towards that goal, we deﬁne an intermediate distribution on Bd(0, R). Let c Dλ f be the distribution Dλ f conditioned on being in y Hd Bd(y, 1

We ﬁrst argue that η-samplability of c Dλ f is equivalent to η-samplability of Dλ,Hd h . Suppose that X is

a sample from c Dλ f . Then for any y Hd,

Pr[X Bd(y , 1

4 ) exp( λf(x)) dx P

4 ) exp( λf(x)) dx

= exp( λh(y )) R

4 ) exp( sλg(x)) dx P

y Hd exp( λh(y)) R

4 ) exp( sλg(x)) dx

= exp( λh(y )) P

y Hd exp( λh(y))

Thus round(X) is a sample from Dλ,Hd h . Conversely, the same calculation implies that given a sample Y from Dλ,Hd h , and a vector Z Bd(0, 1

4) sampled proportional to exp( sλ z ), Y + Z is

a sample from c Dλ f . Noting that Z is a sample from a efﬁciently sample-able log-concave distribution completes the equivalence.

We next argue that Dλ f and c Dλ f are exp( Ω(d))-close as distributions. Since c Dλ f is a conditioning of Dλ f , this is equivalent to showing that nearly all of the mass of Dλ f lies in y Hd Bd(y, 1

4). We write

Bd(0,R) exp( λf(x)) dx

4 ) exp( λf(x)) dx

y Hd exp( λh(y)) Z

4 ) exp( sλg(x)) dx

Let c Zf λ denote this last expression. We will argue that Zf λ (1 + exp( Ω(d)))c Zf λ. We write Zf λ as Z

Bd(0,R) exp( λf(x)) dx X

2 ) exp( λf(x)) dx + Z

Bd(0,R) exp( λM) dx

y Hd exp( λh(y)) Z

2 ) exp( sλg(x)) dx

Bd(0,R) exp( λM) dx

A simple calculation, formalized as Lemma 16 in supplementary material shows that the integral R

2 ) exp( sλg(x)) dx is within (1 + 2 exp( d)) of R

4 ) exp( sλg(x)) dx for sλ > 16d.

This implies that the term (A) above is at most (1 + 2 exp( d))c Zf λ. To bound the second term (B) above, we argue that a ball of radius 1

8 around any single vertex y of the hypercube contributes signiﬁcantly more than than the term (B). Indeed Z

8 ) exp( λf(x)) dx Z

8 ) exp( λ(h(y) + s

exp( λ(d + M

Bd(0,1) dx exp( λ(3M

Bd(0,R) exp( λM) dx = exp( λM) Rd Z

Bd(0,1) dx.

Thus (B)/c Zf λ is at most exp( λM/4+d ln 8R). Under the assumptions on λ, it follows that (B) is at

most exp( Ω(d)) times c Zh λ. In other words, we have shown that Zf λ is at most (1+exp( Ω(d)))c Zh λ.

Remark 4. The equivalence of sampling extends immediately to Wasserstein distance. Indeed given a sampler for Dλ,Hd h , one gets a Wasserstein sampler for c Dλ f by sampling from a simple isotropic logconcave distribution. A Wasserstein sampler for a ball sufﬁces for this. Since W(P, Q) is bounded by the diameter times the statistical distance, this gives a η + exp( Ω(d)) Wasserstein sampler for

Dλ f . Similarly, a η Wasserstein sampler for Dλ f conditioned on the support of c Dλ f immediately yields an O(η)-sampler for Dλ,Hd h . Moreover, it is easy to check that this conditioning is on a constant probability event as long as η < 1 16.

4.1 Optimization can be Easier than Sampling

Given the reduction from the previous section, there are many options for a starting discrete problem to apply the reduction. We will start from one of the most celebrated NP-hard problems. The NP-hardness of Hamiltonian Cycle dates back to Karp [1972]. Theorem 5 (Hardness of HAMCYCLE). Given a constant-degree graph G = (V, E), it is NP-hard to distinguish the following two cases.

YES Case: G has a Hamiltonian Cycle.

NO Case: G has no Hamiltonian Cycle.

We can then amplify the gap between the number of long cycles in the two cases. Theorem 6 (#CYCLE hardness). Given a constant-degree graph G = (V, E) and for L = |V |/2, it is NP-hard to distinguish the following two cases.

YES Case: G has at least 1 + 2L simple cycles of length L.

NO Case: G has exactly one simple cycle C(planted) of length L, and no longer simple cycle.

Moreover, C(planted) can be efﬁciently found in polynomial time.

The proof of the above uses a simple extension of a relatively standard reduction (see e.g. Vadhan [2002]) from Hamiltonian Cycle. Starting with an instance G1 of Hamiltonian Cycle, we replace each edge by a two-connected path of length t, for some integer t. For L = nt, this gives us 2L cycles of length L for every Hamiltonian cycle in G1. Moreover, any simple cycle of length L must correspond to a Hamiltonian Cycle in G1. We add to G a simple cycle of length L on a separate set of L vertices. This planted cycle is easy to ﬁnd, since it forms its own connected component of size L. A full proof is deferred to supplementary material.

Armed with these, we form a function on the hypercube in d = |E | dimensions such that optimizing it is easy, but sampling is hard.

Theorem 7. There exists a function h : Hd [0, d] satisfying the following properties.

Efﬁciency: h can be computed efﬁciently on Hd.

Easy Optimization: One can efﬁciently ﬁnd a particular minimizer y(planted) of h on Hd.

Sampling is hard: Let λ 2d. It is NP-hard to distinguish the following two cases, for L = Ω(d):

YES Case: Pry D λ,Hd h [y = y(planted)] 1 2L

NO Case: Pry D λ,Hd h [y = y(planted)] 1 1 2L

In particular this implies that 1 1 2L 2 -sampling from Dλ,Hd h is NP-hard.

Proof. Let G = (V, E) a graph produced by Theorem 6 and let d = |E|. A vertex y of the hypercube Hd is easily identiﬁed with a set Ey E consisting of the edges {e E : ye = 1}. The function h1(y) is equal to zero if Ey does not deﬁne a simple cycle and is equal to the length of the cycle otherwise. To convert this into a minimization problem, we deﬁne h(y) = d h1(y). It is immediate that a minimizer of h corresponds to a longest simple cycle in G.

Given a vertex y, testing whether Ey is a simple cycle can done efﬁciently, and the length of the cycle is simply |Ey|. This implies that h can be efﬁciently computed on Hd. Further, since we can ﬁnd the planted cycle in G efﬁciently, we can efﬁciently construct a minimizer y(planted) of h.

Suppose that G has at least (2L + 1) cycles of length L. In this case, the distribution Dλ,Hd h restricted to the minimizers is uniform, and thus the probability mass on a speciﬁc minimizer y(planted) is at most 1 2L+1. This also therefore upper bounds the probability mass on y(planted) in the Dλ,Hd h .

On the other hand, suppose that planted cycle is the unique longest simple cycle in G. Thus the probability mass on y(planted) is at least exp(λ(d L))/Zλ,Hd h . Since every other cycle is of length

at most L 1, and there are at most 2d cycles, it follows that Z λ,Hd h exp(λ(d L)) 1 + 2d exp(λ). For λ 2d, this ratio is (1 + exp( d)) (1 1

2d ) 1. The claim follows.

We can now apply Theorem 4 to derive the following result.

Theorem 8. There is a family F of functions f : Bd(0, R) R such that the following hold.

Efﬁciency: Every f F is computable in time poly(d).

Easy Optimization: An optimizer of f can be found in time poly(d)

Sampling is NP-hard: For λ 2d and η < 1 exp( Ω(d)), there is no efﬁcient η-sampler for Dλ f unless NP = RP.

Remark 5. In the statement above, efﬁciently computable means that given a t-bit representation of x, one can compute f(x) to t bits of accuracy in time poly(d, t). Remark 6. The easy optimization result above can be considerably strengthened. We can ensure that 0 is the optimizer of f and that except with negligible probability, gradient descent starting at a random point will converge to this minimizer. Further, one can ensure that all local minima are global and that f is strict-saddle. Thus not only is the function easy to optimize given the representation, black box oracle access to f and its gradients sufﬁces to optimize f. We defer details to the supplementary material. Remark 7. The hardness of sampling holds also for Wasserstein distance 1 16, given Remark 4.

5 A Sharp Threshold for λ

We start with the following threshold result for sampling from the Gibb s distribution on independent sets due to Weitz [2006], Sly and Sun [2012].

Theorem 9. For any 6, there is a threshold λc( ) > 0 such that the following are true.

FPRAS for small λ: For any λ < λc( ), the problem of sampling independent sets with Pr[I] exp(λ|I|) on -regular graphs has a fully polynomial time approximation scheme.

NP-hard for large λ: For any λ > λc( ), unless NP = RP, there is no fully polynomial time randomized approximation scheme for the problem of sampling independent sets with Pr[I] exp(λ|I|) on -regular graphs.

In the supplementarly material, we show that this implies the following result.

Theorem 10. There is a family F of efﬁciently computable functions f : Bd(0, R) R such that the sampling problem has a sharp computational threshold. There is a constant λc > 0 such that for any 1

d < λ < λc, there is a poly(d/η)-time η-sampler for the distribution Dλ f . On the other hand, for any λ > λc, there is a constant η > 0 such that no polynomial time algorithm η -samples from Dλ f unless NP = RP.

6 Related Work

The problems of counting solutions and sampling solutions are intimately related, and well-studied in the discrete case. The class #P was deﬁned in Valiant [1979], where he showed that the problem of computing the permanenet of matrix was complete for this class. This class has been wellstudied, and Toda [1991] showed efﬁcient exact algorithms for any #P-complete problem would imply a collapse of the polynomial hierarchy. Many common problems in #P however admit efﬁcient approximation schemes, that for any ε > 0, allow for a randomized (1 + ε)-approximation in time polynomial in n/ε. Such Fully Polynomial Randomized Approximation Schemes (FPRASes) are known for many problems in #P, perhaps the most celebrated of them being that for the permananet of a non-negative matrix [Jerrum et al., 2004].

These FPRASes are nearly always based on Markov Chain methods, and their Metropolis Hastings [Metropolis et al., 1953, Hastings, 1970] variants. These techniques have been used both in the discrete case (e.g. Jerrum et al. [2004]) and the continuous case (e.g. Lovasz and Vempala [2006]). The closely related technique of Langevin dynamics [Rossky et al., 1978, O. Roberts and Stramer, 2002, Durmus and Moulines, 2017] and its Metropolis-adjusted variant are often faster in practice and have only recently been analyzed.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Sébastien Bubeck. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn., 8(3-4):231 357, November 2015. ISSN 1935-8237. doi: 10.1561/2200000050. URL http: //dx.doi.org/10.1561/2200000050.

Alain Durmus and Éric Moulines. Nonasymptotic convergence analysis for the unadjusted langevin algorithm. Ann. Appl. Probab., 27(3):1551 1587, 06 2017. doi: 10.1214/16-AAP1238. URL https://doi.org/10.1214/16-AAP1238.

W. K. Hastings. Monte carlo sampling methods using markov chains and their applications. Biometrika, 57(1):97 109, 1970. ISSN 00063444. URL http://www.jstor.org/stable/ 2334940.

Elad Hazan. Introduction to online convex optimization. Foundations and Trends R in Optimization, 2(3-4):157 325, 2016. ISSN 2167-3888. doi: 10.1561/2400000013. URL http://dx.doi.org/ 10.1561/2400000013.

Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-SAT. Journal of Computer and System Sciences, 62(2):367 375, 2001. ISSN 0022-0000. doi: https://doi.org/ 10.1006/jcss.2000.1727. URL http://www.sciencedirect.com/science/article/pii/ S0022000000917276.

Mark Jerrum, Alistair Sinclair, and Eric Vigoda. A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries. J. ACM, 51(4):671 697, July 2004. ISSN 0004-5411. doi: 10.1145/1008731.1008738. URL http://doi.acm.org/10.1145/1008731. 1008738.

Jørn Justesen. Class of constructive asymptotically good algebraic codes. IEEE Trans. Information Theory, 18:652 656, 1972.

R. Karp. Reducibility among combinatorial problems. In R. Miller and J. Thatcher, editors, Complexity of Computer Computations, pages 85 103. Plenum Press, 1972.

S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671 680, 1983. ISSN 0036-8075. doi: 10.1126/science.220.4598.671. URL https: //science.sciencemag.org/content/220/4598/671.

L. Lovasz and S. Vempala. Fast algorithms for logconcave functions: Sampling, rounding, integration and optimization. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 06), pages 57 68, Oct 2006. doi: 10.1109/FOCS.2006.28.

Yi-An Ma, Yuansi Chen, Chi Jin, Nicolas Flammarion, and Michael I. Jordan. Sampling can be faster than optimization. Proceedings of the National Academy of Sciences, 116(42):20881 20885, 2019. ISSN 0027-8424. doi: 10.1073/pnas.1820003116. URL https://www.pnas.org/content/ 116/42/20881.

Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087 1092, 1953. doi: 10.1063/1.1699114. URL https://doi.org/ 10.1063/1.1699114.

Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efﬁciency in optimization. John Wiley & Sons, 1983.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1 edition, 2014. ISBN 1461346916, 9781461346913.

G O. Roberts and Osnat Stramer. Langevin diffusions and metropolis-hastings algorithms. Methodology And Computing In Applied Probability, 4:337 357, 01 2002. doi: 10.1023/A:1023562417138.

P. J. Rossky, J. D. Doll, and H. L. Friedman. Brownian dynamics as smart monte carlo simulation. The Journal of Chemical Physics, 69(10):4628 4633, 1978. doi: 10.1063/1.436415. URL https://doi.org/10.1063/1.436415.

Allan Sly and Nike Sun. The computational hardness of counting in two-spin models on d-regular graphs. In Proceedings of the 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, FOCS 12, pages 361 369, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-0-7695-4874-6. doi: 10.1109/FOCS.2012.56. URL https://doi.org/10.1109/FOCS. 2012.56.

S. Toda. PP is as hard as the polynomial-time hierarchy. SIAM Journal on Computing, 20(5):865 877, 1991. doi: 10.1137/0220053. URL https://doi.org/10.1137/0220053.

Christopher Tosh and Sanjoy Dasgupta. The relative complexity of maximum likelihood estimation, map estimation, and sampling. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 2993 3035, Phoenix, USA, 25 28 Jun 2019. PMLR. URL http: //proceedings.mlr.press/v99/tosh19a.html.

Salil Vadhan. Computational complexity lecture notes. 2002. URL https://people.seas. harvard.edu/~salil/cs221/fall02/scribenotes/nov22.ps.

Leslie G. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8(2): 189 201, 1979. ISSN 0304-3975. doi: https://doi.org/10.1016/0304-3975(79)90044-6. URL http://www.sciencedirect.com/science/article/pii/0304397579900446.

Dror Weitz. Counting independent sets up to the tree threshold. In Proceedings of the Thirtyeighth Annual ACM Symposium on Theory of Computing, STOC 06, pages 140 149, New York, NY, USA, 2006. ACM. ISBN 1-59593-134-1. doi: 10.1145/1132516.1132538. URL http://doi.acm.org/10.1145/1132516.1132538.