# Structured Voronoi Sampling

Afra Amini¹, Li Du², Ryan Cotterell¹
¹ETH Zürich ²Johns Hopkins University
{afra.amini, ryan.cotterell}@inf.ethz.ch leodu@cs.jhu.edu

Abstract

Gradient-based sampling algorithms have demonstrated their effectiveness in text generation, especially in the context of controlled text generation. However, there exists a lack of theoretically grounded and principled approaches for this task. In this paper, we take an important step toward building a principled approach for sampling from language models with gradient-based methods. We use discrete distributions given by language models to define densities and develop an algorithm based on Hamiltonian Monte Carlo to sample from them. We name our gradient-based technique Structured Voronoi Sampling (SVS). In an experimental setup where the reference distribution is known, we show that the empirical distribution of SVS samples is closer to the reference distribution compared to alternative sampling schemes. Furthermore, in a controlled generation task, SVS is able to generate fluent and diverse samples while following the control targets significantly better than other methods. https://github.com/AfraAmini/svs

## 1 Introduction

Gradient-based sampling algorithms such as Hamiltonian Monte Carlo [HMC; 32] and Langevin dynamics [46] are widely used in Bayesian inference due to their efficiency in drawing samples from high-dimensional spaces [4]. Such algorithms construct a Markov chain that has the desired distribution as its stationary distribution and use the gradient information of this distribution to efficiently navigate the state space. Additionally, gradient-based sampling schemes have recently been deployed in computer vision to generate high-quality images from state-of-the-art models [13, 43], and are a popular choice for tasks such as image synthesis [5] and image-to-image translation [40].

In natural language processing, there have been several attempts to apply gradient-based sampling techniques to sampling text from neural language models [21, 23, 38]. The motivation behind this approach to text generation is to sample from energy-based probabilistic models, where the normalization factor is not tractable. One such case is controlled text generation, where energy functions are usually defined as a linear combination of LM probabilities and the probability of satisfying a set of predefined constraints [21, 23, 38]. In contrast to computer vision, however, applying gradient-based sampling schemes to text generation is nuanced because text, unlike images, is discrete. Upon closer inspection, none of the proposed algorithms actually defines a valid Markov chain Monte Carlo [MCMC; 12, 29] scheme that will draw samples from the model in the limit. For instance, Qin et al. [38] relax the language model from a distribution over strings to a distribution over logits. While the relaxation does transform the language model into a continuous distribution, it introduces bias. Kumar et al. [MUCOLA; 21] take a different approach. They derive a constrained gradient-based sampler where the constraint is enforced through a projection. However, the projection invalidates the MCMC procedure, leading to an algorithm without guarantees.¹

We derive a principled Hamiltonian Monte Carlo scheme for generating text from language models, which we call a structured Voronoi sampler. Our scheme consists of two steps.
First, we give a recipe for encoding discrete distributions as densities over $\mathbb{R}^d$; we term the resulting encoding Voronoi measures.² Second, we derive a refractive Hamiltonian Monte Carlo algorithm [30] for sampling from an arbitrary Voronoi measure. In our theoretical analysis, we show that, despite the presence of discontinuities, we are able to prove that our sampler satisfies the detailed balance condition and is thus a correct MCMC scheme.

To empirically evaluate the performance of structured Voronoi sampling, we begin by applying it to a toy example where the exact reference probability distribution is known. We compare the empirical distribution of drawn samples with the reference distribution and show that the Voronoi sampler's distribution is closer to the reference distribution than that of MUCOLA or unconstrained HMC. Furthermore, we use our sampling scheme for controlled generation, where the goal is to use GPT-2 to generate restaurant reviews for a target type of food, e.g., Italian, Fast food, or Japanese, and separately to generate text with a positive sentiment. We find that structured Voronoi sampling outperforms FUDGE [48], MUCOLA, and Langevin dynamics algorithms in terms of adhering to the control target. Additionally, the samples generated by structured Voronoi sampling are comparably fluent and diverse to those produced by the other methods.

## 2 Language Models

Let $\Sigma$ be an alphabet, a finite, non-empty set. By $\Sigma^* \stackrel{\text{def}}{=} \bigcup_{n=0}^{\infty} \Sigma^n$, we denote the Kleene closure of $\Sigma$.³ A probability distribution over $\Sigma^*$ is called a language model (LM). The elements of $\Sigma$ may be characters, subword pieces, or words; the choice lies with the modeler. Language models can be factored autoregressively by means of the chain rule of probability, i.e., for any string $\boldsymbol{w} = w_1 \cdots w_N \in \Sigma^*$, we can write

$$ p(\boldsymbol{w}) = p(\text{EOS} \mid \boldsymbol{w}) \prod_{n=1}^{N} p(w_n \mid \boldsymbol{w}_{<n}) $$

[...]

If $\|r_\perp\|_2^2 > 2\Delta U$, where $\Delta U$ is the potential energy difference across the barrier, then the particle has enough kinetic energy to overcome the barrier, and its momentum's normal component $r_\perp$ will instantaneously become $r'_\perp = \sqrt{\|r_\perp\|_2^2 - 2\Delta U}\,\frac{r_\perp}{\|r_\perp\|_2}$ after being transmitted through the barrier (refraction). Otherwise, if $\|r_\perp\|_2^2 \le 2\Delta U$, the particle will be reflected from the barrier and the normal component will instantaneously become $r'_\perp = -r_\perp$. We show in App. E.1 that the Hamiltonian is conserved in either case. The reflect/refract process is summarized in Algorithm 5.

### 6.1 A Sampling Algorithm

Noting that the discontinuity surfaces of $p_V$ are all piecewise smooth, we can build on the above and develop a sampling algorithm for $p_V$ that handles discontinuities in a principled and effective way. In fact, we only need to make one change to the generic HMC, namely how it updates $(x, r)$, according to the mechanical description given above. Concretely, we need to replace step 2 in the HMC outline in §5.1: when a single leapfrog step encounters no discontinuity, we may advance to the next point as in HMC; however, when a full leapfrog step would cross a discontinuity, we proceed by repeatedly computing where the discontinuity is encountered, taking a smaller step up to the discontinuity, and refracting or reflecting based on the potential energy difference. This process is continued until we have exhausted the step size. Since refracting and reflecting conserve the Hamiltonian (App. E.1), this process yields a better acceptance rate in the presence of a discontinuity. See Algorithm 4 for the details of this sampling procedure, and App. F.1 for how to efficiently find discontinuities. We supply a proof of correctness in App. E.2.
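To make the mechanical picture above concrete, here is a minimal NumPy sketch of the momentum adjustment at a discontinuity. The function name `refract_reflect` and the conventions that `normal` is a unit normal of the boundary and `delta_U` is the increase in potential across it are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def refract_reflect(r, normal, delta_U):
    """Adjust momentum r at a discontinuity whose potential jump is delta_U.

    normal: unit vector orthogonal to the discontinuity surface.
    Refract when the normal kinetic energy exceeds the jump; otherwise reflect.
    (Illustrative sketch, not the reference implementation.)
    """
    r_perp = np.dot(r, normal) * normal        # component normal to the surface
    r_par = r - r_perp                         # tangential component is unchanged
    perp_sq = float(np.dot(r_perp, r_perp))    # ||r_perp||_2^2
    if perp_sq > 2.0 * delta_U:
        # Refraction: exchange exactly 2*delta_U of kinetic energy with potential energy.
        r_perp = np.sqrt(perp_sq - 2.0 * delta_U) * r_perp / np.sqrt(perp_sq)
    else:
        # Reflection: flip the normal component, magnitude unchanged.
        r_perp = -r_perp
    return r_perp + r_par
```

Either branch leaves the Hamiltonian $U + \tfrac{1}{2}\|r\|_2^2$ unchanged, which is the property established in App. E.1.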
**A note on calculating the base measure.** To adjust the momentum, one needs to compute $\Delta U$, which implies computing the difference between two base measures, as defined in Eq. (12). However, computing such an integral in a high-dimensional space is not practical. Therefore, we make the assumption that the base measures of the Voronoi cells are equal and thus have no effect on $\Delta U$. However, such an assumption might not hold; see App. A for limitations.

Figure 2: Left: JS divergence between the reference probability and the empirical probability distribution. Voronoi sampling clearly outperforms the others at low temperatures. Right: the reference probability distribution annealed with 3 temperatures: 0.25 (peaked), 1, and 2 (close to uniform).

Figure 3: Perplexity of 100 samples taken with different gradient-based algorithms, compared to 1000 samples taken with ancestral sampling (in green). While LANGEVIN, SVS, and MUCOLA are comparably close to the distribution of ancestral samples, SVS models the tail of the distribution better.

## 7 Experiments

We empirically assess the performance of Voronoi sampling in a series of experiments. In each experiment, we perform a grid search to find the best set of hyperparameters; the experimental details can be found in App. H. Our open-source implementation will be available upon publication.

### 7.1 Toy Example

We first apply our Voronoi sampling⁸ method to the toy example discussed earlier in Example 1, where the reference probability distribution $p(x)$ is tractable and known. The potential energy is then set to $U(x) = -\log p_V(x)$. Importantly, the toy experiment is intentionally designed such that the base measure of all of the Voronoi cells is equal; therefore, we can safely ignore calculating the base measure and arrive at an exact sampling method. We compare Voronoi sampling to MUCOLA and HMC. To make a fair comparison, we add the Metropolis criterion⁹ to the MUCOLA algorithm and only do one leapfrog step in all algorithms. Furthermore, to see the effect of the reference distribution on the performance of the sampling algorithms, we anneal this distribution with 6 temperatures, where lower temperatures lead to peaked distributions and higher temperatures to near-uniform distributions.

We take 200 samples after 500 burn-in iterations and compare the empirical distribution of samples to the reference by measuring the Jensen-Shannon (JS) divergence between the two. As the results in Fig. 2 show, the JS divergence between the reference distribution and the empirical distribution of Voronoi samples is the smallest. The difference between the methods is more pronounced at lower temperatures, as the change in potential energy is greater, resulting in more errors in the leapfrog integrator. In App. I we provide more empirical support that Voronoi sampling converges faster to the reference distribution, especially when we increase the dimensionality of the sampling space.

### 7.2 Sampling from Language Models

Next, we apply our method to sample from a language model. The underlying LM is a fine-tuned GPT-2¹⁰ on the E2E dataset [34]; see App. G for dataset statistics.

⁸Note that in the case of the toy experiment, we are not sampling a sequence of text, but rather a single embedding. To highlight this point, we use Voronoi sampling instead of structured Voronoi sampling to refer to our algorithm in this section.

⁹We accept transitioning from $x_t$ to $x_{t+1}$ with probability $e^{H_t - H_{t+1}}$.
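For concreteness, the evaluation used in the toy experiment above (Fig. 2) can be sketched as follows: anneal a discrete reference distribution with a temperature and compare it to the empirical histogram of drawn samples via the Jensen-Shannon divergence. The helper names and the example reference distribution below are our own; they are not taken from the paper's code.

```python
import numpy as np

def anneal(p, temperature):
    """Anneal a discrete distribution: p^(1/T), renormalized."""
    q = p ** (1.0 / temperature)
    return q / q.sum()

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (in nats)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(0)
reference = anneal(np.array([0.5, 0.25, 0.15, 0.10]), temperature=0.25)  # peaked reference
draws = rng.choice(len(reference), size=200, p=reference)                # stand-in for sampler output
empirical = np.bincount(draws, minlength=len(reference)) / len(draws)
print(js_divergence(empirical, reference))
```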
¹⁰We use the gpt2 checkpoint from the Huggingface library [47].

As opposed to the previous experiment, the reference distribution of the LM is not tractable. Therefore, we use the empirical distribution of ancestral samples as an unbiased estimate of the reference distribution. Note that ancestral sampling incrementally draws samples from a language model, where at each step of the generation a token $w_n$ is sampled with the probability given by the LM: $p(w_n \mid \boldsymbol{w}_{<n})$.

[...]

When a refraction or reflection step takes $(x, r)$ to $(x', r')$, the new momentum is

$$ r' = \begin{cases} \sqrt{\|r_\perp\|_2^2 - 2\Delta U}\,\dfrac{r_\perp}{\|r_\perp\|_2} + r_\parallel & \text{(refraction, when } \|r_\perp\|_2^2 > 2\Delta U\text{)} \\[6pt] -r_\perp + r_\parallel & \text{(reflection, when } \|r_\perp\|_2^2 \le 2\Delta U\text{)} \end{cases} \tag{27a} $$

where $r = r_\perp + r_\parallel$ is the normal decomposition of $r$ and $r' = r'_\perp + r_\parallel$. Then,

$$\begin{aligned}
H(x', r') &= U(x') + \tfrac{1}{2}\|r'\|_2^2 && \text{(27b)} \\
&= U(x') + \tfrac{1}{2}\|r'_\perp + r_\parallel\|_2^2 && \text{(27c)} \\
&= U(x') + \tfrac{1}{2}\|r'_\perp\|_2^2 + \langle r'_\perp, r_\parallel\rangle + \tfrac{1}{2}\|r_\parallel\|_2^2 && \text{(27d)} \\
&= U(x') + \tfrac{1}{2}\|r'_\perp\|_2^2 + \tfrac{1}{2}\|r_\parallel\|_2^2 && (\langle r'_\perp, r_\parallel\rangle = 0) \quad \text{(27e)} \\
&= \begin{cases} U(x) + \Delta U + \tfrac{1}{2}\big(\|r_\perp\|_2^2 - 2\Delta U\big) + \tfrac{1}{2}\|r_\parallel\|_2^2 & \text{(refraction)} \\ U(x) + \tfrac{1}{2}\|r_\perp\|_2^2 + \tfrac{1}{2}\|r_\parallel\|_2^2 & \text{(reflection)} \end{cases} && \text{(27f)} \\
&= \begin{cases} U(x) + \Delta U + \tfrac{1}{2}\|r_\perp\|_2^2 - \Delta U + \tfrac{1}{2}\|r_\parallel\|_2^2 & \text{(refraction)} \\ U(x) + \tfrac{1}{2}\|r_\perp\|_2^2 + \tfrac{1}{2}\|r_\parallel\|_2^2 & \text{(reflection)} \end{cases} && \text{(27g)} \\
&= U(x) + \tfrac{1}{2}\|r\|_2^2 && \text{(27h)} \\
&= H(x, r). && \text{(27i)}
\end{aligned}$$

¹⁵This result is known as Liouville's theorem in classical mechanics [2, 16]. Upon noticing that the Hamiltonian dynamics is divergenceless, one can interpret it as a simple application of the (higher-dimensional) divergence theorem, which itself is a consequence of the generalized Stokes theorem.

¹⁶We say a transformation taking $(x, r)$ to $(x', r')$ conserves the Hamiltonian if $H(x, r) = H(x', r')$.

¹⁷This function is called the Hamiltonian phase flow, or simply the Hamiltonian flow [2, 16]. Flow is a general and important notion in the study of differential equations and related subjects, in which the additive group $(\mathbb{R}, +)$ acts on the state space. In the case of a smooth Hamiltonian $H$, $\{g^t\}_{t \in \mathbb{R}}$ can be seen as a one-parameter group of diffeomorphisms over $\mathbb{R}^{2d}$.

### E.2 Proof of Correctness

In this section, we give a proof of correctness of our sampler by establishing its detailed balance. We first prove several useful properties of the sampler in App. E.2.1 to App. E.2.3 and then combine them to prove detailed balance in App. E.2.4.

#### E.2.1 Measure Preservation in Leapfrog

First of all, we note that the two leapfrog steps are measure-preserving, since they are composed of a series of shear transformations.

**Proposition 6.** The leapfrog steps are measure-preserving.

**Proof.** Recall that the two leapfrog steps used in HMC (Algorithm 1) are $(x, r) \mapsto (x, r - \tfrac{\varepsilon}{2}\nabla U(x))$ and $(x, r) \mapsto (x + \varepsilon r, r)$. By the change-of-variable formula, to show that a transformation is measure-preserving, it suffices to show that its Jacobian has determinant 1. The leapfrog step $(x, r) \mapsto (x, r - \tfrac{\varepsilon}{2}\nabla U(x))$ as a function has the Jacobian

$$ \frac{\partial\,(x, r - \tfrac{\varepsilon}{2}\nabla U(x))}{\partial\,(x, r)} = \begin{pmatrix} I & 0 \\ -\tfrac{\varepsilon}{2}\nabla^2 U(x) & I \end{pmatrix}, \tag{28} $$

where $\nabla^2 U(x)$ denotes the Hessian of $U$. This Jacobian (28) has determinant 1, and hence the leapfrog transformation from $(x, r)$ to $(x, r - \tfrac{\varepsilon}{2}\nabla U(x))$ is measure-preserving. Similarly, the other leapfrog step $(x, r) \mapsto (x + \varepsilon r, r)$ has Jacobian

$$ \frac{\partial\,(x + \varepsilon r, r)}{\partial\,(x, r)} = \begin{pmatrix} I & \varepsilon I \\ 0 & I \end{pmatrix}, \tag{29} $$

which also has determinant 1.

#### E.2.2 Measure Preservation in Refraction/Reflection

The computation for showing that the refraction and reflection steps are measure-preserving is slightly more involved than the leapfrog case. Our proof below largely follows the one found in Mohasel Afshar and Domke [30], where we simplified and corrected some details.
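Before turning to the refraction case, a quick numerical sanity check of Proposition 6: the two leapfrog maps are shear transformations whose Jacobians, Eq. (28) and Eq. (29), have determinant 1. The quadratic toy potential below is an arbitrary illustrative choice, not part of the paper's setup.

```python
import numpy as np

d, eps = 3, 0.1
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d))
A = A @ A.T                                   # Hessian of the toy potential U(x) = 0.5 * x^T A x

# Jacobian of (x, r) -> (x, r - (eps/2) * grad U(x)), cf. Eq. (28).
J1 = np.block([[np.eye(d), np.zeros((d, d))],
               [-(eps / 2) * A, np.eye(d)]])
# Jacobian of (x, r) -> (x + eps * r, r), cf. Eq. (29).
J2 = np.block([[np.eye(d), eps * np.eye(d)],
               [np.zeros((d, d)), np.eye(d)]])

print(np.linalg.det(J1), np.linalg.det(J2))   # both 1, up to floating-point error
```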
**Proposition 7.** The refraction/reflection step is measure-preserving.

**Proof.** We consider a single step of refraction/reflection as a function $\gamma$ that takes $(x, r)$ to $(x', r')$ with total step size $\varepsilon$ and with only one discontinuity. Note that, without loss of generality, we may take the discontinuity surface to be the hyperplane $x_1 = 0$, since any tangent hyperplane of the (piecewise smooth) discontinuity surface can be mapped to it by an affine transformation. Then, a normal vector to the hyperplane $x_1 = 0$ is $e_1 = (1, 0, \ldots, 0)$, and

$$ r_\perp = (r_1, 0, \ldots, 0) \quad \text{and} \quad r_\parallel = (0, r_2, \ldots, r_d). \tag{30} $$

In this case,

$$ \forall i \ge 2,\quad r'_i = r_i \quad \text{and} \quad x'_i = x_i + \varepsilon r_i, \tag{31} $$

since $r_\parallel$ is not affected by refraction/reflection. Therefore,

$$ \forall i \ge 2,\quad \frac{\partial x'_i}{\partial x_i} = \frac{\partial r'_i}{\partial r_i} = 1, \tag{32} $$

$$ \forall i, j \ge 2 \text{ with } j \ne i,\quad \frac{\partial x'_i}{\partial x_j} = \frac{\partial x'_i}{\partial r_j} = \frac{\partial r'_i}{\partial x_j} = \frac{\partial r'_i}{\partial r_j} = 0. \qquad \text{(by Eq. (31))} \tag{33} $$

Eq. (32) and Eq. (33) show that the absolute determinant of the Jacobian of $\gamma$ is the same as

$$ \left| \det \begin{pmatrix} \dfrac{\partial x'_1}{\partial x_1} & \dfrac{\partial x'_1}{\partial r_1} \\[8pt] \dfrac{\partial r'_1}{\partial x_1} & \dfrac{\partial r'_1}{\partial r_1} \end{pmatrix} \right|. \tag{34} $$

We analyze all four quantities in Eq. (34) individually below. Again, without loss of generality, suppose $x_1 \le 0$ and $x'_1 \ge 0$. Let $\Delta U(y)$ be the signed potential difference as $x_1$ changes from negative to positive, defined only when $x_1 = 0$. Let $y = (0, y_2, \ldots, y_d)$ be the point where the discontinuity is encountered. In other words, the particle begins at $x$, moves past $y$, and arrives at $x'$.

**Refraction.** In the case of refraction,

$$ r_1^2 > 2\Delta U(y) \quad \text{and} \quad r'_1 = \sqrt{r_1^2 - 2\Delta U(y)}. \tag{35} $$

Let $\delta$ be the step size it takes to reach $y$, i.e.,

$$ y = x + \delta r \quad \text{with} \quad \delta = -\frac{x_1}{r_1}. \tag{36} $$

Then

$$ x'_1 = (\varepsilon - \delta)\, r'_1 = \Big(\varepsilon + \frac{x_1}{r_1}\Big) r'_1. \tag{37} $$

Moreover, since $y_i = x_i - \frac{x_1}{r_1} r_i$ for $i \ge 2$, the chain rule gives

$$ \frac{\partial \Delta U(y)}{\partial r_1} = -\frac{x_1}{r_1}\,\frac{\partial \Delta U(y)}{\partial x_1}. \tag{38} $$

Using the product rule on Eq. (37) and the chain rule on Eq. (35), we obtain

$$ \frac{\partial x'_1}{\partial x_1} = \frac{r'_1}{r_1} + \Big(\varepsilon + \frac{x_1}{r_1}\Big) \frac{\partial r'_1}{\partial x_1}, \tag{39} $$

$$ \frac{\partial x'_1}{\partial r_1} = -\frac{x_1 r'_1}{r_1^2} + \Big(\varepsilon + \frac{x_1}{r_1}\Big) \frac{\partial r'_1}{\partial r_1}, \tag{40} $$

$$ \frac{\partial r'_1}{\partial x_1} = -\frac{1}{r'_1}\,\frac{\partial \Delta U(y)}{\partial x_1}, \tag{41} $$

$$ \frac{\partial r'_1}{\partial r_1} = \frac{1}{r'_1}\Big(r_1 - \frac{\partial \Delta U(y)}{\partial r_1}\Big). \tag{42} $$

Putting these together,

$$ \begin{aligned} \det \gamma &= \frac{\partial x'_1}{\partial x_1}\frac{\partial r'_1}{\partial r_1} - \frac{\partial x'_1}{\partial r_1}\frac{\partial r'_1}{\partial x_1} && \text{(43)} \\ &= \frac{r'_1}{r_1}\frac{\partial r'_1}{\partial r_1} + \frac{x_1 r'_1}{r_1^2}\frac{\partial r'_1}{\partial x_1} && \text{(by Eq. (39) and (40); the terms carrying the factor } \varepsilon + x_1/r_1 \text{ cancel)} \\ &= \frac{1}{r_1}\Big(r_1 - \frac{\partial \Delta U(y)}{\partial r_1}\Big) - \frac{x_1}{r_1^2}\frac{\partial \Delta U(y)}{\partial x_1} && \text{(by Eq. (41) and (42))} \\ &= 1. && \text{(by Eq. (38))} \quad \text{(44)} \end{aligned} $$

Hence,

$$ |\det \gamma| = 1, \tag{45} $$

so $\gamma$ is measure-preserving in the case of refraction.

**Reflection.** In the case of reflection,

$$ r_1^2 \le 2\Delta U(y) \quad \text{and} \quad r'_1 = -r_1. \tag{46} $$

As in the case of refraction, let $\delta = -x_1/r_1$ be the step size it takes to reach $y$. Then

$$ x'_1 = (\varepsilon - \delta)\, r'_1 = \Big(\varepsilon + \frac{x_1}{r_1}\Big)(-r_1) = -\varepsilon r_1 - x_1. \tag{48} $$

We can then directly calculate

$$ \frac{\partial x'_1}{\partial x_1} = -1, \qquad \frac{\partial x'_1}{\partial r_1} = -\varepsilon \quad \text{(by Eq. (48))}, \qquad \frac{\partial r'_1}{\partial x_1} = 0, \qquad \frac{\partial r'_1}{\partial r_1} = -1 \quad \text{(by Eq. (46))}, \tag{49} $$

so

$$ |\det \gamma| = |(-1)(-1) - 0 \cdot (-\varepsilon)| = 1. \tag{50} $$

So $\gamma$ is measure-preserving in the case of reflection as well.

#### E.2.3 Time Reversibility

**Proposition 8.** The refraction/reflection step is time-reversible.

**Proof.** That a refraction/reflection step is time-reversible means that, if a single step of the refraction/reflection procedure takes $(x, r)$ to $(x', r')$, then it also takes $(x', -r')$ to $(x, -r)$. We can show this by considering each case:

- Reflection: When reflection happens, $\|r_\perp\|_2^2 \le 2\Delta U$, and hence $\|r'_\perp\|_2^2 = \|-r_\perp\|_2^2 = \|r_\perp\|_2^2 \le 2\Delta U$, meaning that the reverse trajectory is reflected as well.
- Refraction: If $\Delta U > 0$, then the reverse trajectory crosses the boundary from the other side, sees a sudden decrease in potential, and hence is refracted, regaining the magnitude of momentum lost in the forward step. If $\Delta U < 0$, then $\|r'_\perp\|_2^2 > -2\Delta U$ and hence the reverse trajectory is also refracted, ending in the original momentum with the sign reversed.

Proposition 8, combined with the time reversibility of the leapfrog step as in basic HMC, shows that Voronoi sampling is time-reversible. Hence we only need to use the Hamiltonian difference in the Metropolis-Hastings acceptance probability.
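The two properties established above can also be exercised numerically in one dimension: the momentum adjustment at a boundary conserves the Hamiltonian, and reversing a refracted crossing (negated momentum, negated potential jump) recovers the original momentum. The scalar setup and the helper `adjust_1d` below are illustrative assumptions, not the paper's code.

```python
import numpy as np

def adjust_1d(r_perp, delta_U):
    """Refract or reflect the normal momentum component at a 1-D boundary."""
    if r_perp ** 2 > 2.0 * delta_U:
        return np.sign(r_perp) * np.sqrt(r_perp ** 2 - 2.0 * delta_U)  # refraction
    return -r_perp                                                      # reflection

def H(u, r):                                   # Hamiltonian with unit mass
    return u + 0.5 * r ** 2

U_left, U_right = 1.0, 2.5                     # piecewise-constant potential
delta_U = U_right - U_left

for r in (2.0, 1.2):                           # 2.0 refracts (4 > 3), 1.2 reflects (1.44 <= 3)
    r_new = adjust_1d(r, delta_U)
    crossed = r ** 2 > 2.0 * delta_U
    u_after = U_right if crossed else U_left
    print(np.isclose(H(U_left, r), H(u_after, r_new)))   # Hamiltonian is conserved

# Momentum part of time reversibility: the reverse crossing sees a jump of -delta_U.
r = 2.0
r_fwd = adjust_1d(r, delta_U)                  # forward refraction
r_back = adjust_1d(-r_fwd, -delta_U)           # reverse crossing
print(np.isclose(r_back, -r))                  # original momentum, sign reversed
```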
#### E.2.4 Detailed Balance

**Theorem 1.** A step of Structured Voronoi Sampling (Algorithm 4) satisfies detailed balance.

**Proof Sketch.** From measure preservation (App. E.2.1 and App. E.2.2), the change of variables introduced by integrating the Hamiltonian equations has a Jacobian with absolute determinant 1 and hence can be omitted. From time reversibility (App. E.2.3), we only need to use the Hamiltonian difference in the Metropolis-Hastings acceptance probability. Finally, by using a Metropolis-Hastings acceptance step (lines 8 to 13), we ensure detailed balance.

### E.3 Related Work

Such an integrator scheme appears to be well known in the computational physics community, dating back to at least Jin and Wen [18], and quite possibly even earlier. Integration algorithms based on this idea continue to be an active area of research [44]. The first usage of such an integrator with discrete steps in the context of MCMC appeared in Mohasel Afshar and Domke [30].¹⁸ Mohasel Afshar and Domke [30] primarily based their motivation on the optical reflection/refraction analogy. We note that, in fact, Hamiltonian mechanics originated from Hamilton's formulation of geometrical optics (or Hamiltonian optics) [22, VIII.7]. This connection, important to modern physics, should be interpreted with care in our context, since momentum isn't a natural quantity in optical rays.

¹⁸In an earlier work, Pakman and Paninski [36] used the same idea but solved the trajectory exactly for simpler systems instead of using discrete steps. Later, Nishimura et al. [33] provided a derivation of this integrator from the first principles of the calculus of variations [1]. Importantly, it should be noted that, due to the potential function being discontinuous, Eq. (16) loses its well-posedness as a system of differential equations, hence the necessity of a justification based on non-smooth analysis.

## F Algorithms

Algorithm 2: Langevin Dynamics
Input: $x_t$: current sample, $U$: potential energy function, $\varepsilon$: step size
Output: next sample $x_{t+1}$
1: $r \sim \mathcal{N}(0, I)$
2: $r \leftarrow r - \frac{\varepsilon}{2}\nabla U(x_t)$
3: $x_{t+1} \leftarrow x_t + \varepsilon r$
4: return $x_{t+1}$

Algorithm 3: MUCOLA
Input: $V_t$: current sample, $U$: potential energy function, $\varepsilon$: step size
Output: next sample $V_{t+1}$
1: $r \sim \mathcal{N}(0, I)$
2: $r \leftarrow r - \frac{\varepsilon}{2}\nabla U(V_t)$
3: $V_{t+1} \leftarrow \mathrm{Proj}_B(V_t + \varepsilon r)$
4: return $V_{t+1}$

Algorithm 4: Structured Voronoi Sampling
Input: $x_t$: current embeddings, $U$: potential function, $\varepsilon$: step size
Output: next sample $x_{t+1}$
1: $r \sim \mathcal{N}(0, \varepsilon I)$
2: $H_t \leftarrow U(V_m(x_t)) + K(r)$
3: $r \leftarrow r - \frac{\varepsilon}{2}\nabla U(V_m(x_t))$
4: $x_{t+1} \leftarrow \mathrm{FINDDISC}(x_t, r)$
5: $r \leftarrow r - \frac{\varepsilon}{2}\nabla U(V_m(x_{t+1}))$
6: $H_{t+1} \leftarrow U(x_{t+1}) + K(r)$
7: $\Delta H \leftarrow H_{t+1} - H_t$
8: if $s \sim \mathcal{U}(0, 1) < e^{-\Delta H}$:   ▷ Metropolis criterion
9:   return $x_{t+1}$
10: else
11:   return $x_{t+1} \leftarrow x_t$

Algorithm 5: REFRACTREFLECT
Input: $r_t$: current momentum, $b$: normal vector of the boundary, $\Delta U$: difference in potential energy
Output: next momentum $r_{t+1}$
1: $r_\perp \leftarrow (b^\top r_t)\, b$
2: $r_\parallel \leftarrow r_t - r_\perp$
3: if $\|r_\perp\|_2^2 > 2\Delta U$:
4:   $r_{t+1} \leftarrow \sqrt{\|r_\perp\|_2^2 - 2\Delta U}\; \frac{r_\perp}{\|r_\perp\|_2}$   ▷ refract
5: else
6:   $r_{t+1} \leftarrow -r_\perp$   ▷ reflect
7: $r_{t+1} \leftarrow r_{t+1} + r_\parallel$
8: return $r_{t+1}$

| Split | Chinese | English | Fast food | French | Indian | Italian | Japanese | Total |
|-------|---------|---------|-----------|--------|--------|---------|----------|-------|
| train | 2929 | 4100 | 5891 | 5889 | 4412 | 5909 | 5996 | 42061 |
| valid | 1489 | 1780 | - | - | - | - | - | 4672 |
| test | 492 | 613 | 632 | 639 | 497 | 608 | 638 | 4693 |

Table 2: Number of restaurant reviews in each split and food type. Note that some reviews do not represent any specific food type; therefore, the total count is larger than the sum of the counts across the food categories.

### F.1 Efficiently Finding Discontinuities

As explained in §6, the key insight in SVS is to adjust the momentum when facing a discontinuity in the potential function. Therefore, we first need to find discontinuities efficiently and then reflect/refract on the boundary of the discontinuity.
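The sketch below illustrates, on a one-dimensional piecewise-constant potential, the fractional-step search that Algorithm 6 (below) makes precise: advance the position in fractions of α, re-evaluate the potential, and refract or reflect whenever it jumps. The boundary locations, the cell potentials, and the helper names are our own toy choices; in SVS the cells are Voronoi cells of token embeddings and the boundary is a hyperplane between two Voronoi centers.

```python
import numpy as np

BOUNDARIES = np.array([0.0, 1.0])              # toy cell boundaries on the real line
LEVELS = [0.5, 2.0, 1.0]                       # potential value inside each cell

def U(x):
    return LEVELS[int(np.digitize(x, BOUNDARIES))]

def adjust(r, delta_U):
    """Refract or reflect a 1-D momentum at a boundary with potential jump delta_U."""
    if r ** 2 > 2.0 * delta_U:
        return np.sign(r) * np.sqrt(r ** 2 - 2.0 * delta_U)   # refract
    return -r                                                  # reflect

def position_step(x, r, eps=1.0, alpha=0.1):
    """One leapfrog position update taken in fractions of alpha (cf. Algorithm 6)."""
    tau, u_cur = 0.0, U(x)
    while tau < 1.0:
        x_try = x + alpha * eps * r
        delta_U = U(x_try) - u_cur
        if delta_U == 0.0:                     # no discontinuity within this fraction
            x, tau = x_try, tau + alpha
            continue
        # Move exactly to the crossed boundary, then adjust the momentum there.
        b = BOUNDARIES[BOUNDARIES > x].min() if r > 0 else BOUNDARIES[BOUNDARIES < x].max()
        tau += (b - x) / (eps * r)
        x = b
        r_new = adjust(r, delta_U)
        if r_new * r > 0:                      # refraction: we actually enter the new cell
            u_cur = U(x_try)
        r = r_new
    return x, r

print(position_step(x=-0.3, r=2.0))            # refracts at 0.0 and continues with reduced speed
print(position_step(x=-0.3, r=1.2))            # reflects at 0.0 and moves back
```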
Remember that at each leapfrog step, we move from $x_t$ to $x_{t+1} = x_t + \varepsilon r$. To find discontinuities along this trajectory, we divide each leapfrog step into fractions of a step and check for a discontinuity. Concretely, we only advance the embedding by a fraction $\alpha$, i.e., $x' = x + \alpha \varepsilon r$ (line 4 in Algorithm 6). Then, we check for a discontinuity by looking at the difference between the potential energies. If we do not observe any change in the potential function, we take the next fraction of a step. Otherwise, we find the boundary and adjust the momentum (line 10 in Algorithm 6). Note that in SVS, finding the boundary is straightforward: it is a hyperplane characterized by its normal vector, namely the line that goes through the two Voronoi centers on either side of the hyperplane.

Algorithm 6: Find Discontinuity
Input: $x_t$: current embeddings, $r$: current momentum, $U$: potential function, $\varepsilon$: step size, $\alpha$: discontinuity step size
Output: next sample $x_{t+1}$
1: $\tau \leftarrow 0$
2: $x \leftarrow x_t$
3: while $\tau < 1$:
4:   $x' \leftarrow x + \alpha \varepsilon r$
5:   $\Delta U \leftarrow U(V_m(x')) - U(V_m(x))$
6:   if $\Delta U = 0$:   ▷ there is no discontinuity
7:     $x \leftarrow x'$
8:     $\tau \leftarrow \tau + \alpha$
9:   else
10:     $b, \alpha' \leftarrow \mathrm{FINDB}(x, x')$   ▷ returns the intersection $\alpha'$ and the normal vector of the boundary $b$
11:     $x \leftarrow x + \alpha' \varepsilon r$
12:     $\tau \leftarrow \tau + \alpha'$
13:     $r \leftarrow \mathrm{REFRACTREFLECT}(r, b, \Delta U)$
14: return $x_{t+1} \leftarrow x$

## G Dataset Statistics

This dataset is made available under the CC BY-SA 4.0 license. Our use of this dataset is for (i) fine-tuning GPT-2 and (ii) training classifiers for topic control, which clearly matches the intended use of the dataset stated by its creators, namely end-to-end training and generating text with content selection. A summary of dataset statistics can be found in Table 2.

## H Experimental Details

**Sampling algorithms.** Hyperparameters for each experiment are reported in Table 4. Following prior work [46], in algorithms based on Langevin dynamics, we apply an exponential decay to the step size, decreasing it to 0.05 after 500 steps. In all settings, we take 500 burn-in steps.

| | Toy Example ε | Toy Example α | LM Sampling ε | LM Sampling α | Controlled Generation ε | Controlled Generation α | Controlled Generation γ |
|---|---|---|---|---|---|---|---|
| VS | 0.1 | 0.1 | - | - | - | - | - |
| SVS | 0.1 | 0.1 | 1. | 0.4 | 1.5 | 0.3 | 1.5 |
| HMC | 0.1 | 0.1 | - | - | - | - | - |
| LANGEVIN | - | - | 1. | - | 1.5 | - | 1.5 |
| MUCOLA | 0.1 | 0.1 | 1. | - | 1. | - | 2. |

Table 4: Hyperparameters used in the sampling algorithms.

| method | sec/batch |
|---|---|
| FUDGE | 10 |
| MUCOLA | 30 |
| LANGEVIN (ours) | 31 |
| SVS (ours) | 84 |

Table 3: Inference time for different methods on the controlled generation experiments. All experiments are done on a single A100-40GB GPU.

**Food classifiers.** We train 3 classifiers. First, for experiments with FUDGE, we follow the experimental details of the original paper [48] and train a 3-layer BILSTM classifier with 0.5 dropout. The hidden dimension is set to 300, and the embedding layer is trained from scratch. Second, in experiments with MUCOLA, LANGEVIN, and SVS, to make the setup as close as possible to FUDGE, we train a 3-layer BILSTM classifier with 0.5 dropout. However, this time the BILSTM is trained on top of frozen GPT-2 representations, thus sharing the embedding layer that is necessary for these methods to work. Finally, to evaluate the success of the methods in following the control targets, we fine-tune a ROBERTA [25] base model. The performance of all the classifiers and their numbers of trained parameters are reported in Table 5. We train all the models on a single gtx_1080_ti GPU with approximately 2 hours of total computational budget.
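As an illustration of the probe classifier described above (a BiLSTM trained on top of frozen GPT-2 representations, so that it shares the LM's embedding layer), here is a hedged PyTorch sketch. The class name, the use of the final position for prediction, and the exact hyperparameters are our assumptions, echoing the FUDGE-style settings stated above rather than reproducing the paper's code.

```python
import torch
import torch.nn as nn
from transformers import GPT2Model

class FoodTypeProbe(nn.Module):
    """BiLSTM food-type classifier over frozen GPT-2 hidden states (illustrative sketch)."""

    def __init__(self, num_classes=7, hidden_size=300):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        self.gpt2.requires_grad_(False)        # frozen: the probe shares GPT-2's representations
        self.lstm = nn.LSTM(
            input_size=self.gpt2.config.hidden_size,
            hidden_size=hidden_size,
            num_layers=3,
            dropout=0.5,
            bidirectional=True,
            batch_first=True,
        )
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, input_ids, attention_mask=None):
        with torch.no_grad():                  # GPT-2 acts only as a feature extractor here
            states = self.gpt2(input_ids, attention_mask=attention_mask).last_hidden_state
        features, _ = self.lstm(states)
        return self.classifier(features[:, -1])  # logits over the 7 food types
```

At sampling time it is the gradient of such a classifier with respect to the shared embedding layer that provides the control signal; the `no_grad` guard above applies only to training the probe itself.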
**Inference times.** We report inference times in seconds per batch (sec/batch). The number of decoded sentences per batch depends on the GPU memory and the size of the model. As depicted in Table 3, using Voronoi measures does not increase the inference time (compare MUCOLA and LANGEVIN). We observe that the inference time of SVS is longer because of the extra momentum-adjustment steps. However, one can reduce the inference time by increasing α.

| model | f1-score | precision | recall | # params |
|---|---|---|---|---|
| BILSTM | 0.87 | 0.87 | 0.87 | 17M |
| BILSTMPROBE | 0.84 | 0.84 | 0.84 | 37M |
| ROBERTA | 0.90 | 0.91 | 0.90 | 124M |
| BILSTMPROBE | 0.90 | 0.90 | 0.90 | 878M |

Table 5: Performance of the food classifiers, and their numbers of learnable parameters, used in the controlled generation experiment. All classifiers are trained and tested on the E2E dataset.

## I More Experiments on the Toy Model

To better understand the advantages of Voronoi sampling over HMC or MUCOLA, we further look at the JS divergence between the reference distribution and the distribution of samples as the number of iterations increases. As depicted in Fig. 4, while the divergence decreases with more iterations across all sampling methods, Voronoi sampling converges to the reference distribution with fewer iterations. We then look at the distribution of sampled elements in Fig. 5. We observe that with 100 iterations, MUCOLA undersamples the element with the maximum probability while oversampling other elements. Finally, we extend the toy model from 4 square cells in $\mathbb{R}^2$ to $2^k$ hypercube cells in $\mathbb{R}^k$ (Fig. 6). As the dimensionality increases, the divergence between the samples' distribution and the reference distribution also increases for all sampling methods. Importantly, Voronoi sampling consistently converges faster across different values of $k$.

Figure 4 (three panels: (a) 500 iterations, (b) 1000 iterations, (c) 100 iterations; x-axis: temperature, y-axis: JS divergence): Comparing the JS divergence of the methods using different numbers of iterations. In general, Voronoi sampling converges to the true distribution faster than HMC and MUCOLA. As the number of iterations increases, the divergence between the samples' distribution and the true distribution decreases across all sampling methods.

## J Controlled Generation Results

Table 6 and Fig. 7 show the performance of the 4 sampling methods on different metrics. We show a sample of generations for each control target in Table 7.
| Target | Method | Success | PPL | Dist-1 | Dist-2 | Dist-3 |
|---|---|---|---|---|---|---|
| Chinese | FUDGE | 0.30 | 5.41 | 0.38 | 0.53 | 0.63 |
| Chinese | MUCOLA | 0.30 | 10.90 | 0.21 | 0.35 | 0.47 |
| Chinese | LANGEVIN | 1.00 | 16.21 | 0.22 | 0.36 | 0.48 |
| Chinese | SVS | 0.90 | 12.42 | 0.22 | 0.36 | 0.49 |
| English | FUDGE | 0.15 | 5.82 | 0.41 | 0.57 | 0.67 |
| English | MUCOLA | 0.45 | 17.76 | 0.26 | 0.39 | 0.49 |
| English | LANGEVIN | 0.70 | 15.82 | 0.27 | 0.43 | 0.55 |
| English | SVS | 0.85 | 15.00 | 0.25 | 0.41 | 0.54 |
| Fast food | FUDGE | 0.25 | 6.44 | 0.41 | 0.57 | 0.68 |
| Fast food | MUCOLA | 0.95 | 5.39 | 0.26 | 0.38 | 0.47 |
| Fast food | LANGEVIN | 1.00 | 10.09 | 0.23 | 0.37 | 0.49 |
| Fast food | SVS | 1.00 | 11.43 | 0.23 | 0.38 | 0.50 |
| French | FUDGE | 0.25 | 6.02 | 0.39 | 0.55 | 0.66 |
| French | MUCOLA | 0.75 | 88.19 | 0.29 | 0.43 | 0.54 |
| French | LANGEVIN | 0.80 | 12.87 | 0.21 | 0.36 | 0.49 |
| French | SVS | 0.95 | 17.23 | 0.21 | 0.35 | 0.47 |
| Indian | FUDGE | 0.55 | 4.52 | 0.41 | 0.55 | 0.64 |
| Indian | MUCOLA | 0.40 | 7.87 | 0.25 | 0.40 | 0.51 |
| Indian | LANGEVIN | 0.95 | 17.67 | 0.26 | 0.41 | 0.54 |
| Indian | SVS | 0.95 | 12.89 | 0.19 | 0.35 | 0.47 |
| Italian | FUDGE | 0.35 | 5.49 | 0.37 | 0.53 | 0.64 |
| Italian | MUCOLA | 0.50 | 18.19 | 0.27 | 0.42 | 0.53 |
| Italian | LANGEVIN | 0.95 | 14.29 | 0.26 | 0.41 | 0.53 |
| Italian | SVS | 0.90 | 15.47 | 0.26 | 0.41 | 0.52 |
| Japanese | FUDGE | 0.30 | 5.44 | 0.42 | 0.55 | 0.65 |
| Japanese | MUCOLA | 0.75 | 83.37 | 0.27 | 0.42 | 0.54 |
| Japanese | LANGEVIN | 1.00 | 12.90 | 0.21 | 0.36 | 0.47 |
| Japanese | SVS | 0.95 | 12.89 | 0.18 | 0.33 | 0.45 |

Table 6: Evaluation of different sampling methods on controlled generation, using three criteria: success in following the control target (measured by the evaluator classifier), fluency (measured by perplexity), and diversity.

Figure 5 (three panels: (a) 500 iterations, (b) 1000 iterations, (c) 100 iterations; bars: MUCOLA, HMC sampling, Voronoi sampling, reference probability): Comparing the distribution of sampled elements at temperature 0.25. With 100 iterations, MUCOLA undersamples the element with the highest probability while oversampling other elements.

Figure 6 (x-axis: $k$ from 2 to 8; y-axis: JS divergence): Comparing the distribution of sampled elements with the true distribution after 100 iterations, at temperature 0.5. There are $2^k$ Voronoi cells with equal base measures in $\mathbb{R}^k$, where the elements to sample from are located at the centers of the Voronoi cells. Voronoi sampling converges faster to the true distribution across all values of $k$. As the dimensionality of the sample space increases, the divergence of all methods increases.

Figure 7: Evaluation of different sampling methods on restaurant review generation, along 6 axes: the mean and standard deviation of negative perplexity¹⁹ (-ppl-mean, -ppl-std), the percentage of generated sentences adhering to the control target (success), and diversity metrics (dist-1, dist-2, dist-3). For most control targets, SVS achieves the highest success rate, with relatively low perplexity.

| Target | Method | Sample |
|---|---|---|
| Chinese | FUDGE | In the city centre near Yippee Noodle Bar Chinese, is Alimentum. It has moderate prices and |
| Chinese | MUCOLA | and has a 1 out of 5. It has food and high customer rating. The Rice Boat is |
| Chinese | LANGEVIN | It serves Chinese food with a low customer rating. The fast food and restaurant The Golden Curry is a |
| Chinese | SVS | It has a low customer rating and a price. The highly rated Chinese restaurant The Phoenix has a high |
| English | FUDGE | It has an average customer Rating. Bibimbap House has English food in the riverside area near |
| English | MUCOLA | and has a low customer rating. The Golden Curry is a children friendly, serving English food, with |
| English | LANGEVIN | It has low rating and is located near the to the city centre. The Phoenix is a English food |
| English | SVS | Alimentum in the city centre near the a moderate price range. It serves English food, is |
| Fast food | FUDGE | A fast food, coffee shop, Strada has a low customer rating, has a price range of over 30. It is |
| Fast food | MUCOLA | and is family friendly and serves fast food. The Wrestlers is a fast food coffee shop in the |
| Fast food | LANGEVIN | It is located near the riverside, is a cheap family friendly fast food restaurant, and is called |
| Fast food | SVS | It is located near the river. The Mill is a cheap, fast food and coffee shop near the |
| French | FUDGE | It has a low-priced Inn French food. It is near Café Rouge.The Alimentum is a kid friendly fast food |
| French | MUCOLA | The French restaurant The Waterman is located in the city centre. The price range is less than |
| French | LANGEVIN | It is a restaurant located in the riverside, the restaurant, offers French food with a price |
| French | SVS | It is a family restaurant that serves French food with a price range and has a low customer rating. |
| Indian | FUDGE | The Phoenix Indian restaurant has moderate prices with a 3 out of 5 rating. Located on the |
| Indian | MUCOLA | It is in the city and has a low customer rating. The Waterman is a low priced |
| Indian | LANGEVIN | It is not child friendly and it is near the river. It serves Indian food and a customer rating |
| Indian | SVS | It is located in the city centre near The Portland Arms Indian food and has a low customer rating. |
| Italian | FUDGE | It has family Italian food and has a low a moderate price range. The Rice Boat has an average |
| Italian | MUCOLA | is a high priced Italian food restaurant with a customer rating of average. The Phoenix is a high |
| Italian | LANGEVIN | It is located in the city centre, it is not family friendly and is a coffee shop serving Italian |
| Italian | SVS | It is located in the the city centre near The Portland Arms.The Eagle is an Italian restaurant. |
| Japanese | FUDGE | Japanese food. Its customer rating is 3 out of 5.The Phoenix is Japanese in the city centre |
| Japanese | MUCOLA | for Japanese food is located in the city centre. It has a low customer rating. The Golden |
| Japanese | LANGEVIN | It is located in the riverside. It is a Japanese food. It is a pub restaurant |
| Japanese | SVS | It is located in the riverside. It is a low rated Japanese restaurant, and coffee shop. |

Table 7: Examples of sampled sentences for the different food control targets.