# Structured Voronoi Sampling

Afra Amini¹, Li Du², Ryan Cotterell¹
¹ETH Zürich ²Johns Hopkins University
{afra.amini, ryan.cotterell}@inf.ethz.ch leodu@cs.jhu.edu

Abstract

Gradient-based sampling algorithms have demonstrated their effectiveness in text generation, especially in the context of controlled text generation. However, there exists a lack of theoretically grounded and principled approaches for this task. In this paper, we take an important step toward building a principled approach for sampling from language models with gradient-based methods. We use discrete distributions given by language models to define densities and develop an algorithm based on Hamiltonian Monte Carlo to sample from them. We name our gradient-based technique Structured Voronoi Sampling (SVS). In an experimental setup where the reference distribution is known, we show that the empirical distribution of SVS samples is closer to the reference distribution compared to alternative sampling schemes. Furthermore, in a controlled generation task, SVS is able to generate fluent and diverse samples while following the control targets significantly better than other methods. https://github.com/AfraAmini/svs

## 1 Introduction

Gradient-based sampling algorithms such as Hamiltonian Monte Carlo [HMC; 32] and Langevin dynamics [46] are widely used in Bayesian inference due to their efficiency in drawing samples from high-dimensional spaces [4]. Such algorithms construct a Markov chain that has the desired distribution as its stationary distribution and use the gradient information of this distribution to efficiently navigate the state space. Additionally, gradient-based sampling schemes have recently been deployed in computer vision to generate high-quality images from state-of-the-art models [13, 43], and are a popular choice for tasks such as image synthesis [5] and image-to-image translation [40].

In natural language processing, there have been several attempts to apply gradient-based sampling techniques to sampling text from neural language models [21, 23, 38]. The motivation behind this approach to text generation is to sample from energy-based probabilistic models, where the normalization factor is not tractable. One such case is controlled text generation, where energy functions are usually defined as a linear combination of LM probabilities and the probability of satisfying a set of predefined constraints [21, 23, 38]. In contrast to computer vision, however, applying gradient-based sampling schemes to text generation is nuanced because text, unlike images, is discrete. Upon closer inspection, none of the proposed algorithms actually defines a valid Markov chain Monte Carlo [MCMC; 12, 29] scheme that will draw samples from the model in the limit. For instance, Qin et al. [38] relax the language model from a distribution over strings to a distribution over logits. While the relaxation does transform the language model into a continuous distribution, it introduces bias. Kumar et al. [MUCOLA; 21] take a different approach. They derive a constrained gradient-based sampler where the constraint is enforced through a projection. However, the projection invalidates the MCMC procedure, leading to an algorithm without guarantees.¹

We derive a principled Hamiltonian Monte Carlo scheme for generating text from language models, which we call a structured Voronoi sampler. Our scheme consists of two steps.
First, we give a recipe for encoding discrete distributions as densities over $\mathbb{R}^d$; we term the resulting encoding Voronoi measures.² Second, we derive a refractive Hamiltonian Monte Carlo algorithm [30] for sampling from an arbitrary Voronoi measure. In our theoretical analysis, we show that, despite the presence of discontinuities, we are able to prove that our sampler satisfies the detailed balance condition and is thus a correct MCMC scheme.

To empirically evaluate the performance of structured Voronoi sampling, we begin by applying it to a toy example where the exact reference probability distribution is known. We compare the empirical distribution of drawn samples with the reference distribution and show that the Voronoi sampler's distribution is closer to the reference distribution than that of MUCOLA or unconstrained HMC. Furthermore, we use our sampling scheme for controlled generation, where the goal is to use GPT-2 to generate restaurant reviews for a target type of food, e.g., Italian, Fast food, or Japanese, and separately to generate text with a positive sentiment. We find that structured Voronoi sampling outperforms FUDGE [48], MUCOLA, and Langevin dynamics algorithms in terms of adhering to the control target. Additionally, the samples generated by structured Voronoi sampling are comparably fluent and diverse to those produced by the other methods.

## 2 Language Models

Let $\Sigma$ be an alphabet, a finite, non-empty set. By $\Sigma^* \stackrel{\text{def}}{=} \bigcup_{n=0}^{\infty} \Sigma^n$, we denote the Kleene closure of $\Sigma$.³ A probability distribution over $\Sigma^*$ is called a language model (LM). The elements of $\Sigma$ may be characters, subword pieces, or words; the choice lies with the modeler. Language models can be factored autoregressively by means of the chain rule of probability, i.e., for any string $\boldsymbol{w} = w_1 \cdots w_N \in \Sigma^*$, we can write

$$ p(\boldsymbol{w}) = p(\text{EOS} \mid \boldsymbol{w}) \prod_{n=1}^{N} p(w_n \mid \boldsymbol{w}_{<n}) $$

[...]

If $\|r_\perp\|_2^2 > 2\Delta U$, where $\Delta U$ is the potential energy difference across the barrier, then the particle has enough kinetic energy to overcome the barrier, and its momentum's normal component $r_\perp$ will instantaneously become $r'_\perp = \sqrt{\|r_\perp\|_2^2 - 2\Delta U}\,\frac{r_\perp}{\|r_\perp\|_2}$ after being transmitted through the barrier (refraction). Otherwise, if $\|r_\perp\|_2^2 \le 2\Delta U$, the particle will be reflected from the barrier and the normal component will instantaneously become $r'_\perp = -r_\perp$. We show in App. E.1 that the Hamiltonian is conserved in either case. The reflect/refract process is summarized in Algorithm 5.

### 6.1 A Sampling Algorithm

Noting that the discontinuity surfaces of $p_V$ are all piecewise smooth, we can build on the above and develop a sampling algorithm for $p_V$ that handles discontinuities in a principled and effective way. In fact, we only need to make one change to the generic HMC, namely how it updates $(x, r)$, according to the mechanical description given above. Concretely, we need to replace step 2 in the HMC outline in §5.1: when a single leapfrog step encounters no discontinuity, we may advance to the next point as in HMC; however, when a full leapfrog step would cross a discontinuity, we proceed by repeatedly computing where the discontinuity is encountered, taking a smaller step up to the discontinuity, and refracting or reflecting based on the potential energy difference. This process is continued until we have exhausted the step size. Since refracting and reflecting conserve the Hamiltonian (App. E.1), this process yields a better acceptance rate in the presence of a discontinuity. See Algorithm 4 for the details of this sampling procedure, and App. F.1 for how to efficiently find discontinuities. We supply a proof of correctness in App. E.2.
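To make the mechanical picture above concrete, here is a minimal NumPy sketch of the momentum adjustment at a discontinuity. The function name `refract_reflect` and the conventions that `normal` is a unit normal of the boundary and `delta_U` is the increase in potential across it are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def refract_reflect(r, normal, delta_U):
    """Adjust momentum r at a discontinuity whose potential jump is delta_U.

    normal: unit vector orthogonal to the discontinuity surface.
    Refract when the normal kinetic energy exceeds the jump; otherwise reflect.
    (Illustrative sketch, not the reference implementation.)
    """
    r_perp = np.dot(r, normal) * normal        # component normal to the surface
    r_par = r - r_perp                         # tangential component is unchanged
    perp_sq = float(np.dot(r_perp, r_perp))    # ||r_perp||_2^2
    if perp_sq > 2.0 * delta_U:
        # Refraction: exchange exactly 2*delta_U of kinetic energy with potential energy.
        r_perp = np.sqrt(perp_sq - 2.0 * delta_U) * r_perp / np.sqrt(perp_sq)
    else:
        # Reflection: flip the normal component, magnitude unchanged.
        r_perp = -r_perp
    return r_perp + r_par
```

Either branch leaves the Hamiltonian $U + \tfrac{1}{2}\|r\|_2^2$ unchanged, which is the property established in App. E.1.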
**A note on calculating the base measure.** To adjust the momentum, one needs to compute $\Delta U$, which implies computing the difference between two base measures, as defined in Eq. (12). However, computing such an integral in a high-dimensional space is not practical. Therefore, we make the assumption that the base measures of the Voronoi cells are equal and thus have no effect on $\Delta U$. However, such an assumption might not hold; see App. A for limitations.

Figure 2: Left: JS divergence between the reference probability and the empirical probability distribution. Voronoi sampling clearly outperforms the others at low temperatures. Right: the reference probability distribution annealed with 3 temperatures: 0.25 (peaked), 1, and 2 (close to uniform).

Figure 3: Perplexity of 100 samples taken with different gradient-based algorithms, compared to 1000 samples taken with ancestral sampling (in green). While LANGEVIN, SVS, and MUCOLA are comparably close to the distribution of ancestral samples, SVS models the tail of the distribution better.

## 7 Experiments

We empirically assess the performance of Voronoi sampling in a series of experiments. In each experiment, we perform a grid search to find the best set of hyperparameters; the experimental details can be found in App. H. Our open-source implementation will be available upon publication.

### 7.1 Toy Example

We first apply our Voronoi sampling⁸ method to the toy example discussed earlier in Example 1, where the reference probability distribution $p(x)$ is tractable and known. The potential energy is then set to $U(x) = -\log p_V(x)$. Importantly, the toy experiment is intentionally designed such that the base measure of all of the Voronoi cells is equal; therefore, we can safely ignore calculating the base measure and arrive at an exact sampling method. We compare Voronoi sampling to MUCOLA and HMC. To make a fair comparison, we add the Metropolis criterion⁹ to the MUCOLA algorithm and only do one leapfrog step in all algorithms. Furthermore, to see the effect of the reference distribution on the performance of the sampling algorithms, we anneal this distribution with 6 temperatures, where lower temperatures lead to peaked distributions and higher temperatures to near-uniform distributions.

We take 200 samples after 500 burn-in iterations and compare the empirical distribution of samples to the reference by measuring the Jensen-Shannon (JS) divergence between the two. As the results in Fig. 2 show, the JS divergence between the reference distribution and the empirical distribution of Voronoi samples is the smallest. The difference between the methods is more pronounced at lower temperatures, as the change in potential energy is greater, resulting in more errors in the leapfrog integrator. In App. I we provide more empirical support that Voronoi sampling converges faster to the reference distribution, especially when we increase the dimensionality of the sampling space.

### 7.2 Sampling from Language Models

Next, we apply our method to sample from a language model. The underlying LM is a fine-tuned GPT-2¹⁰ on the E2E dataset [34]; see App. G for dataset statistics.

⁸Note that in the case of the toy experiment, we are not sampling a sequence of text, but rather a single embedding. To highlight this point, we use Voronoi sampling instead of structured Voronoi sampling to refer to our algorithm in this section.

⁹We accept transitioning from $x_t$ to $x_{t+1}$ with probability $e^{H_t - H_{t+1}}$.
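For concreteness, the evaluation used in the toy experiment above (Fig. 2) can be sketched as follows: anneal a discrete reference distribution with a temperature and compare it to the empirical histogram of drawn samples via the Jensen-Shannon divergence. The helper names and the example reference distribution below are our own; they are not taken from the paper's code.

```python
import numpy as np

def anneal(p, temperature):
    """Anneal a discrete distribution: p^(1/T), renormalized."""
    q = p ** (1.0 / temperature)
    return q / q.sum()

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (in nats)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(0)
reference = anneal(np.array([0.5, 0.25, 0.15, 0.10]), temperature=0.25)  # peaked reference
draws = rng.choice(len(reference), size=200, p=reference)                # stand-in for sampler output
empirical = np.bincount(draws, minlength=len(reference)) / len(draws)
print(js_divergence(empirical, reference))
```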
¹⁰We use the gpt2 checkpoint from the Huggingface library [47].

As opposed to the previous experiment, the reference distribution of the LM is not tractable. Therefore, we use the empirical distribution of ancestral samples as an unbiased estimate of the reference distribution. Note that ancestral sampling incrementally draws samples from a language model, where at each step of the generation a token $w_n$ is sampled with the probability given by the LM: $p(w_n \mid \boldsymbol{w}_{<n})$.

[...]

When a refraction or reflection step takes $(x, r)$ to $(x', r')$, the new momentum is

$$ r' = \begin{cases} \sqrt{\|r_\perp\|_2^2 - 2\Delta U}\,\dfrac{r_\perp}{\|r_\perp\|_2} + r_\parallel & \text{(refraction, when } \|r_\perp\|_2^2 > 2\Delta U\text{)} \\[6pt] -r_\perp + r_\parallel & \text{(reflection, when } \|r_\perp\|_2^2 \le 2\Delta U\text{)} \end{cases} \tag{27a} $$

where $r = r_\perp + r_\parallel$ is the normal decomposition of $r$ and $r' = r'_\perp + r_\parallel$. Then,

$$\begin{aligned}
H(x', r') &= U(x') + \tfrac{1}{2}\|r'\|_2^2 && \text{(27b)} \\
&= U(x') + \tfrac{1}{2}\|r'_\perp + r_\parallel\|_2^2 && \text{(27c)} \\
&= U(x') + \tfrac{1}{2}\|r'_\perp\|_2^2 + \langle r'_\perp, r_\parallel\rangle + \tfrac{1}{2}\|r_\parallel\|_2^2 && \text{(27d)} \\
&= U(x') + \tfrac{1}{2}\|r'_\perp\|_2^2 + \tfrac{1}{2}\|r_\parallel\|_2^2 && (\langle r'_\perp, r_\parallel\rangle = 0) \quad \text{(27e)} \\
&= \begin{cases} U(x) + \Delta U + \tfrac{1}{2}\big(\|r_\perp\|_2^2 - 2\Delta U\big) + \tfrac{1}{2}\|r_\parallel\|_2^2 & \text{(refraction)} \\ U(x) + \tfrac{1}{2}\|r_\perp\|_2^2 + \tfrac{1}{2}\|r_\parallel\|_2^2 & \text{(reflection)} \end{cases} && \text{(27f)} \\
&= \begin{cases} U(x) + \Delta U + \tfrac{1}{2}\|r_\perp\|_2^2 - \Delta U + \tfrac{1}{2}\|r_\parallel\|_2^2 & \text{(refraction)} \\ U(x) + \tfrac{1}{2}\|r_\perp\|_2^2 + \tfrac{1}{2}\|r_\parallel\|_2^2 & \text{(reflection)} \end{cases} && \text{(27g)} \\
&= U(x) + \tfrac{1}{2}\|r\|_2^2 && \text{(27h)} \\
&= H(x, r). && \text{(27i)}
\end{aligned}$$

¹⁵This result is known as Liouville's theorem in classical mechanics [2, 16]. Upon noticing that the Hamiltonian dynamics is divergenceless, one can interpret it as a simple application of the (higher-dimensional) divergence theorem, which itself is a consequence of the generalized Stokes theorem.

¹⁶We say a transformation taking $(x, r)$ to $(x', r')$ conserves the Hamiltonian if $H(x, r) = H(x', r')$.

¹⁷This function is called the Hamiltonian phase flow, or simply the Hamiltonian flow [2, 16]. Flow is a general and important notion in the study of differential equations and related subjects, in which the additive group $(\mathbb{R}, +)$ acts on the state space. In the case of a smooth Hamiltonian $H$, $\{g^t\}_{t \in \mathbb{R}}$ can be seen as a one-parameter group of diffeomorphisms over $\mathbb{R}^{2d}$.

### E.2 Proof of Correctness

In this section, we give a proof of correctness of our sampler by establishing its detailed balance. We first prove several useful properties of the sampler in App. E.2.1 to App. E.2.3 and then combine them to prove detailed balance in App. E.2.4.

#### E.2.1 Measure Preservation in Leapfrog

First of all, we note that the two leapfrog steps are measure-preserving, since they are composed of a series of shear transformations.

**Proposition 6.** The leapfrog steps are measure-preserving.

**Proof.** Recall that the two leapfrog steps used in HMC (Algorithm 1) are $(x, r) \mapsto (x, r - \tfrac{\varepsilon}{2}\nabla U(x))$ and $(x, r) \mapsto (x + \varepsilon r, r)$. By the change-of-variable formula, to show that a transformation is measure-preserving, it suffices to show that its Jacobian has determinant 1. The leapfrog step $(x, r) \mapsto (x, r - \tfrac{\varepsilon}{2}\nabla U(x))$ as a function has the Jacobian

$$ \frac{\partial\,(x, r - \tfrac{\varepsilon}{2}\nabla U(x))}{\partial\,(x, r)} = \begin{pmatrix} I & 0 \\ -\tfrac{\varepsilon}{2}\nabla^2 U(x) & I \end{pmatrix}, \tag{28} $$

where $\nabla^2 U(x)$ denotes the Hessian of $U$. This Jacobian (28) has determinant 1, and hence the leapfrog transformation from $(x, r)$ to $(x, r - \tfrac{\varepsilon}{2}\nabla U(x))$ is measure-preserving. Similarly, the other leapfrog step $(x, r) \mapsto (x + \varepsilon r, r)$ has Jacobian

$$ \frac{\partial\,(x + \varepsilon r, r)}{\partial\,(x, r)} = \begin{pmatrix} I & \varepsilon I \\ 0 & I \end{pmatrix}, \tag{29} $$

which also has determinant 1.

#### E.2.2 Measure Preservation in Refraction/Reflection

The computation for showing that the refraction and reflection steps are measure-preserving is slightly more involved than the leapfrog case. Our proof below largely follows the one found in Mohasel Afshar and Domke [30], where we simplified and corrected some details.
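Before turning to the refraction case, a quick numerical sanity check of Proposition 6: the two leapfrog maps are shear transformations whose Jacobians, Eq. (28) and Eq. (29), have determinant 1. The quadratic toy potential below is an arbitrary illustrative choice, not part of the paper's setup.

```python
import numpy as np

d, eps = 3, 0.1
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d))
A = A @ A.T                                   # Hessian of the toy potential U(x) = 0.5 * x^T A x

# Jacobian of (x, r) -> (x, r - (eps/2) * grad U(x)), cf. Eq. (28).
J1 = np.block([[np.eye(d), np.zeros((d, d))],
               [-(eps / 2) * A, np.eye(d)]])
# Jacobian of (x, r) -> (x + eps * r, r), cf. Eq. (29).
J2 = np.block([[np.eye(d), eps * np.eye(d)],
               [np.zeros((d, d)), np.eye(d)]])

print(np.linalg.det(J1), np.linalg.det(J2))   # both 1, up to floating-point error
```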
**Proposition 7.** The refraction/reflection step is measure-preserving.

**Proof.** We consider a single step of refraction/reflection as a function $\gamma$ that takes $(x, r)$ to $(x', r')$ with total step size $\varepsilon$ and with only one discontinuity. Note that, without loss of generality, we may take the discontinuity surface to be the hyperplane $x_1 = 0$, since any tangent hyperplane of the (piecewise smooth) discontinuity surface can be mapped to it by an affine transformation. Then, a normal vector to the hyperplane $x_1 = 0$ is $e_1 = (1, 0, \ldots, 0)$, and

$$ r_\perp = (r_1, 0, \ldots, 0) \quad \text{and} \quad r_\parallel = (0, r_2, \ldots, r_d). \tag{30} $$

In this case,

$$ \forall i \ge 2,\quad r'_i = r_i \quad \text{and} \quad x'_i = x_i + \varepsilon r_i, \tag{31} $$

since $r_\parallel$ is not affected by refraction/reflection. Therefore,

$$ \forall i \ge 2,\quad \frac{\partial x'_i}{\partial x_i} = \frac{\partial r'_i}{\partial r_i} = 1, \tag{32} $$

$$ \forall i, j \ge 2 \text{ with } j \ne i,\quad \frac{\partial x'_i}{\partial x_j} = \frac{\partial x'_i}{\partial r_j} = \frac{\partial r'_i}{\partial x_j} = \frac{\partial r'_i}{\partial r_j} = 0. \qquad \text{(by Eq. (31))} \tag{33} $$

Eq. (32) and Eq. (33) show that the absolute determinant of the Jacobian of $\gamma$ is the same as

$$ \left| \det \begin{pmatrix} \dfrac{\partial x'_1}{\partial x_1} & \dfrac{\partial x'_1}{\partial r_1} \\[8pt] \dfrac{\partial r'_1}{\partial x_1} & \dfrac{\partial r'_1}{\partial r_1} \end{pmatrix} \right|. \tag{34} $$

We analyze all four quantities in Eq. (34) individually below. Again, without loss of generality, suppose $x_1 \le 0$ and $x'_1 \ge 0$. Let $\Delta U(y)$ be the signed potential difference as $x_1$ changes from negative to positive, defined only when $x_1 = 0$. Let $y = (0, y_2, \ldots, y_d)$ be the point where the discontinuity is encountered. In other words, the particle begins at $x$, moves past $y$, and arrives at $x'$.

**Refraction.** In the case of refraction,

$$ r_1^2 > 2\Delta U(y) \quad \text{and} \quad r'_1 = \sqrt{r_1^2 - 2\Delta U(y)}. \tag{35} $$

Let $\delta$ be the step size it takes to reach $y$, i.e.,

$$ y = x + \delta r \quad \text{with} \quad \delta = -\frac{x_1}{r_1}. \tag{36} $$

Then

$$ x'_1 = (\varepsilon - \delta)\, r'_1 = \Big(\varepsilon + \frac{x_1}{r_1}\Big) r'_1. \tag{37} $$

Moreover, since $y_i = x_i - \frac{x_1}{r_1} r_i$ for $i \ge 2$, the chain rule gives

$$ \frac{\partial \Delta U(y)}{\partial r_1} = -\frac{x_1}{r_1}\,\frac{\partial \Delta U(y)}{\partial x_1}. \tag{38} $$

Using the product rule on Eq. (37) and the chain rule on Eq. (35), we obtain

$$ \frac{\partial x'_1}{\partial x_1} = \frac{r'_1}{r_1} + \Big(\varepsilon + \frac{x_1}{r_1}\Big) \frac{\partial r'_1}{\partial x_1}, \tag{39} $$

$$ \frac{\partial x'_1}{\partial r_1} = -\frac{x_1 r'_1}{r_1^2} + \Big(\varepsilon + \frac{x_1}{r_1}\Big) \frac{\partial r'_1}{\partial r_1}, \tag{40} $$

$$ \frac{\partial r'_1}{\partial x_1} = -\frac{1}{r'_1}\,\frac{\partial \Delta U(y)}{\partial x_1}, \tag{41} $$

$$ \frac{\partial r'_1}{\partial r_1} = \frac{1}{r'_1}\Big(r_1 - \frac{\partial \Delta U(y)}{\partial r_1}\Big). \tag{42} $$

Putting these together,

$$ \begin{aligned} \det \gamma &= \frac{\partial x'_1}{\partial x_1}\frac{\partial r'_1}{\partial r_1} - \frac{\partial x'_1}{\partial r_1}\frac{\partial r'_1}{\partial x_1} && \text{(43)} \\ &= \frac{r'_1}{r_1}\frac{\partial r'_1}{\partial r_1} + \frac{x_1 r'_1}{r_1^2}\frac{\partial r'_1}{\partial x_1} && \text{(by Eq. (39) and (40); the terms carrying the factor } \varepsilon + x_1/r_1 \text{ cancel)} \\ &= \frac{1}{r_1}\Big(r_1 - \frac{\partial \Delta U(y)}{\partial r_1}\Big) - \frac{x_1}{r_1^2}\frac{\partial \Delta U(y)}{\partial x_1} && \text{(by Eq. (41) and (42))} \\ &= 1. && \text{(by Eq. (38))} \quad \text{(44)} \end{aligned} $$

Hence,

$$ |\det \gamma| = 1, \tag{45} $$

so $\gamma$ is measure-preserving in the case of refraction.

**Reflection.** In the case of reflection,

$$ r_1^2 \le 2\Delta U(y) \quad \text{and} \quad r'_1 = -r_1. \tag{46} $$

As in the case of refraction, let $\delta = -x_1/r_1$ be the step size it takes to reach $y$. Then

$$ x'_1 = (\varepsilon - \delta)\, r'_1 = \Big(\varepsilon + \frac{x_1}{r_1}\Big)(-r_1) = -\varepsilon r_1 - x_1. \tag{48} $$

We can then directly calculate

$$ \frac{\partial x'_1}{\partial x_1} = -1, \qquad \frac{\partial x'_1}{\partial r_1} = -\varepsilon \quad \text{(by Eq. (48))}, \qquad \frac{\partial r'_1}{\partial x_1} = 0, \qquad \frac{\partial r'_1}{\partial r_1} = -1 \quad \text{(by Eq. (46))}, \tag{49} $$

so

$$ |\det \gamma| = |(-1)(-1) - 0 \cdot (-\varepsilon)| = 1. \tag{50} $$

So $\gamma$ is measure-preserving in the case of reflection as well.

#### E.2.3 Time Reversibility

**Proposition 8.** The refraction/reflection step is time-reversible.

**Proof.** That a refraction/reflection step is time-reversible means that, if a single step of the refraction/reflection procedure takes $(x, r)$ to $(x', r')$, then it also takes $(x', -r')$ to $(x, -r)$. We can show this by considering each case:

- Reflection: When reflection happens, $\|r_\perp\|_2^2 \le 2\Delta U$, and hence $\|r'_\perp\|_2^2 = \|-r_\perp\|_2^2 = \|r_\perp\|_2^2 \le 2\Delta U$, meaning that the reverse trajectory is reflected as well.
- Refraction: If $\Delta U > 0$, then the reverse trajectory crosses the boundary from the other side, sees a sudden decrease in potential, and hence is refracted, regaining the magnitude of momentum lost in the forward step. If $\Delta U < 0$, then $\|r'_\perp\|_2^2 > -2\Delta U$ and hence the reverse trajectory is also refracted, ending in the original momentum with the sign reversed.

Proposition 8, combined with the time reversibility of the leapfrog step as in basic HMC, shows that Voronoi sampling is time-reversible. Hence we only need to use the Hamiltonian difference in the Metropolis-Hastings acceptance probability.
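The two properties established above can also be exercised numerically in one dimension: the momentum adjustment at a boundary conserves the Hamiltonian, and reversing a refracted crossing (negated momentum, negated potential jump) recovers the original momentum. The scalar setup and the helper `adjust_1d` below are illustrative assumptions, not the paper's code.

```python
import numpy as np

def adjust_1d(r_perp, delta_U):
    """Refract or reflect the normal momentum component at a 1-D boundary."""
    if r_perp ** 2 > 2.0 * delta_U:
        return np.sign(r_perp) * np.sqrt(r_perp ** 2 - 2.0 * delta_U)  # refraction
    return -r_perp                                                      # reflection

def H(u, r):                                   # Hamiltonian with unit mass
    return u + 0.5 * r ** 2

U_left, U_right = 1.0, 2.5                     # piecewise-constant potential
delta_U = U_right - U_left

for r in (2.0, 1.2):                           # 2.0 refracts (4 > 3), 1.2 reflects (1.44 <= 3)
    r_new = adjust_1d(r, delta_U)
    crossed = r ** 2 > 2.0 * delta_U
    u_after = U_right if crossed else U_left
    print(np.isclose(H(U_left, r), H(u_after, r_new)))   # Hamiltonian is conserved

# Momentum part of time reversibility: the reverse crossing sees a jump of -delta_U.
r = 2.0
r_fwd = adjust_1d(r, delta_U)                  # forward refraction
r_back = adjust_1d(-r_fwd, -delta_U)           # reverse crossing
print(np.isclose(r_back, -r))                  # original momentum, sign reversed
```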
#### E.2.4 Detailed Balance

**Theorem 1.** A step of Structured Voronoi Sampling (Algorithm 4) satisfies detailed balance.

**Proof Sketch.** From measure preservation (App. E.2.1 and App. E.2.2), the change of variables introduced by integrating the Hamiltonian equations has a Jacobian with absolute determinant 1 and hence can be omitted. From time reversibility (App. E.2.3), we only need to use the Hamiltonian difference in the Metropolis-Hastings acceptance probability. Finally, by using a Metropolis-Hastings acceptance step (lines 8 to 13), we ensure detailed balance.

### E.3 Related Work

Such an integrator scheme appears to be well known in the computational physics community, dating back to at least Jin and Wen [18], and quite possibly even earlier. Integration algorithms based on this idea continue to be an active area of research [44]. The first usage of such an integrator with discrete steps in the context of MCMC appeared in Mohasel Afshar and Domke [30].¹⁸ Mohasel Afshar and Domke [30] primarily based their motivation on the optical reflection/refraction analogy. We note that, in fact, Hamiltonian mechanics originated from Hamilton's formulation of geometrical optics (or Hamiltonian optics) [22, VIII.7]. This connection, important to modern physics, should be interpreted with care in our context, since momentum isn't a natural quantity in optical rays.

¹⁸In an earlier work, Pakman and Paninski [36] used the same idea but solved the trajectory exactly for simpler systems instead of using discrete steps. Later, Nishimura et al. [33] provided a derivation of this integrator from the first principles of the calculus of variations [1]. Importantly, it should be noted that, due to the potential function being discontinuous, Eq. (16) loses its well-posedness as a system of differential equations, hence the necessity of a justification based on non-smooth analysis.

## F Algorithms

Algorithm 2: Langevin Dynamics
Input: $x_t$: current sample, $U$: potential energy function, $\varepsilon$: step size
Output: next sample $x_{t+1}$
1: $r \sim \mathcal{N}(0, I)$
2: $r \leftarrow r - \frac{\varepsilon}{2}\nabla U(x_t)$
3: $x_{t+1} \leftarrow x_t + \varepsilon r$
4: return $x_{t+1}$

Algorithm 3: MUCOLA
Input: $V_t$: current sample, $U$: potential energy function, $\varepsilon$: step size
Output: next sample $V_{t+1}$
1: $r \sim \mathcal{N}(0, I)$
2: $r \leftarrow r - \frac{\varepsilon}{2}\nabla U(V_t)$
3: $V_{t+1} \leftarrow \mathrm{Proj}_B(V_t + \varepsilon r)$
4: return $V_{t+1}$

Algorithm 4: Structured Voronoi Sampling
Input: $x_t$: current embeddings, $U$: potential function, $\varepsilon$: step size
Output: next sample $x_{t+1}$
1: $r \sim \mathcal{N}(0, \varepsilon I)$
2: $H_t \leftarrow U(V_m(x_t)) + K(r)$
3: $r \leftarrow r - \frac{\varepsilon}{2}\nabla U(V_m(x_t))$
4: $x_{t+1} \leftarrow \mathrm{FINDDISC}(x_t, r)$
5: $r \leftarrow r - \frac{\varepsilon}{2}\nabla U(V_m(x_{t+1}))$
6: $H_{t+1} \leftarrow U(x_{t+1}) + K(r)$
7: $\Delta H \leftarrow H_{t+1} - H_t$
8: if $s \sim \mathcal{U}(0, 1) < e^{-\Delta H}$:   ▷ Metropolis criterion
9:   return $x_{t+1}$
10: else
11:   return $x_{t+1} \leftarrow x_t$

Algorithm 5: REFRACTREFLECT
Input: $r_t$: current momentum, $b$: normal vector of the boundary, $\Delta U$: difference in potential energy
Output: next momentum $r_{t+1}$
1: $r_\perp \leftarrow (b^\top r_t)\, b$
2: $r_\parallel \leftarrow r_t - r_\perp$
3: if $\|r_\perp\|_2^2 > 2\Delta U$:
4:   $r_{t+1} \leftarrow \sqrt{\|r_\perp\|_2^2 - 2\Delta U}\; \frac{r_\perp}{\|r_\perp\|_2}$   ▷ refract
5: else
6:   $r_{t+1} \leftarrow -r_\perp$   ▷ reflect
7: $r_{t+1} \leftarrow r_{t+1} + r_\parallel$
8: return $r_{t+1}$

| Split | Chinese | English | Fast food | French | Indian | Italian | Japanese | Total |
|-------|---------|---------|-----------|--------|--------|---------|----------|-------|
| train | 2929 | 4100 | 5891 | 5889 | 4412 | 5909 | 5996 | 42061 |
| valid | 1489 | 1780 | - | - | - | - | - | 4672 |
| test | 492 | 613 | 632 | 639 | 497 | 608 | 638 | 4693 |

Table 2: Number of restaurant reviews in each split and food type. Note that some reviews do not represent any specific food type; therefore, the total count is larger than the sum of the counts across the food categories.

### F.1 Efficiently Finding Discontinuities

As explained in §6, the key insight in SVS is to adjust the momentum when facing a discontinuity in the potential function. Therefore, we first need to find discontinuities efficiently and then reflect/refract on the boundary of the discontinuity.
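The sketch below illustrates, on a one-dimensional piecewise-constant potential, the fractional-step search that Algorithm 6 (below) makes precise: advance the position in fractions of α, re-evaluate the potential, and refract or reflect whenever it jumps. The boundary locations, the cell potentials, and the helper names are our own toy choices; in SVS the cells are Voronoi cells of token embeddings and the boundary is a hyperplane between two Voronoi centers.

```python
import numpy as np

BOUNDARIES = np.array([0.0, 1.0])              # toy cell boundaries on the real line
LEVELS = [0.5, 2.0, 1.0]                       # potential value inside each cell

def U(x):
    return LEVELS[int(np.digitize(x, BOUNDARIES))]

def adjust(r, delta_U):
    """Refract or reflect a 1-D momentum at a boundary with potential jump delta_U."""
    if r ** 2 > 2.0 * delta_U:
        return np.sign(r) * np.sqrt(r ** 2 - 2.0 * delta_U)   # refract
    return -r                                                  # reflect

def position_step(x, r, eps=1.0, alpha=0.1):
    """One leapfrog position update taken in fractions of alpha (cf. Algorithm 6)."""
    tau, u_cur = 0.0, U(x)
    while tau < 1.0:
        x_try = x + alpha * eps * r
        delta_U = U(x_try) - u_cur
        if delta_U == 0.0:                     # no discontinuity within this fraction
            x, tau = x_try, tau + alpha
            continue
        # Move exactly to the crossed boundary, then adjust the momentum there.
        b = BOUNDARIES[BOUNDARIES > x].min() if r > 0 else BOUNDARIES[BOUNDARIES < x].max()
        tau += (b - x) / (eps * r)
        x = b
        r_new = adjust(r, delta_U)
        if r_new * r > 0:                      # refraction: we actually enter the new cell
            u_cur = U(x_try)
        r = r_new
    return x, r

print(position_step(x=-0.3, r=2.0))            # refracts at 0.0 and continues with reduced speed
print(position_step(x=-0.3, r=1.2))            # reflects at 0.0 and moves back
```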
Remember that at each leapfrog step, we move from $x_t$ to $x_{t+1} = x_t + \varepsilon r$. To find discontinuities along this trajectory, we divide each leapfrog step into fractions of a step and check for a discontinuity. Concretely, we only advance the embedding by a fraction $\alpha$, i.e., $x' = x + \alpha \varepsilon r$ (line 4 in Algorithm 6). Then, we check for a discontinuity by looking at the difference between the potential energies. If we do not observe any change in the potential function, we take the next fraction of a step. Otherwise, we find the boundary and adjust the momentum (line 10 in Algorithm 6). Note that in SVS, finding the boundary is straightforward: it is a hyperplane characterized by its normal vector, namely the line that goes through the two Voronoi centers on either side of the hyperplane.

Algorithm 6: Find Discontinuity
Input: $x_t$: current embeddings, $r$: current momentum, $U$: potential function, $\varepsilon$: step size, $\alpha$: discontinuity step size
Output: next sample $x_{t+1}$
1: $\tau \leftarrow 0$
2: $x \leftarrow x_t$
3: while $\tau < 1$:
4:   $x' \leftarrow x + \alpha \varepsilon r$
5:   $\Delta U \leftarrow U(V_m(x')) - U(V_m(x))$
6:   if $\Delta U = 0$:   ▷ there is no discontinuity
7:     $x \leftarrow x'$
8:     $\tau \leftarrow \tau + \alpha$
9:   else
10:     $b, \alpha' \leftarrow \mathrm{FINDB}(x, x')$   ▷ returns the intersection $\alpha'$ and the normal vector of the boundary $b$
11:     $x \leftarrow x + \alpha' \varepsilon r$
12:     $\tau \leftarrow \tau + \alpha'$
13:     $r \leftarrow \mathrm{REFRACTREFLECT}(r, b, \Delta U)$
14: return $x_{t+1} \leftarrow x$

## G Dataset Statistics

This dataset is made available under the CC BY-SA 4.0 license. Our use of this dataset is for (i) fine-tuning GPT-2 and (ii) training classifiers for topic control, which clearly matches the intended use of the dataset stated by its creators, namely end-to-end training and generating text with content selection. A summary of dataset statistics can be found in Table 2.

## H Experimental Details

**Sampling algorithms.** Hyperparameters for each experiment are reported in Table 4. Following prior work [46], in algorithms based on Langevin dynamics, we apply an exponential decay to the step size, decreasing it to 0.05 after 500 steps. In all settings, we take 500 burn-in steps.

| | Toy Example ε | Toy Example α | LM Sampling ε | LM Sampling α | Controlled Generation ε | Controlled Generation α | Controlled Generation γ |
|---|---|---|---|---|---|---|---|
| VS | 0.1 | 0.1 | - | - | - | - | - |
| SVS | 0.1 | 0.1 | 1. | 0.4 | 1.5 | 0.3 | 1.5 |
| HMC | 0.1 | 0.1 | - | - | - | - | - |
| LANGEVIN | - | - | 1. | - | 1.5 | - | 1.5 |
| MUCOLA | 0.1 | 0.1 | 1. | - | 1. | - | 2. |

Table 4: Hyperparameters used in the sampling algorithms.

| method | sec/batch |
|---|---|
| FUDGE | 10 |
| MUCOLA | 30 |
| LANGEVIN (ours) | 31 |
| SVS (ours) | 84 |

Table 3: Inference time for different methods on the controlled generation experiments. All experiments are done on a single A100-40GB GPU.

**Food classifiers.** We train 3 classifiers. First, for experiments with FUDGE, we follow the experimental details of the original paper [48] and train a 3-layer BILSTM classifier with 0.5 dropout. The hidden dimension is set to 300, and the embedding layer is trained from scratch. Second, in experiments with MUCOLA, LANGEVIN, and SVS, to make the setup as close as possible to FUDGE, we train a 3-layer BILSTM classifier with 0.5 dropout. However, this time the BILSTM is trained on top of frozen GPT-2 representations, thus sharing the embedding layer that is necessary for these methods to work. Finally, to evaluate the success of the methods in following the control targets, we fine-tune a ROBERTA [25] base model. The performance of all the classifiers and their numbers of trained parameters are reported in Table 5. We train all the models on a single gtx_1080_ti GPU with approximately 2 hours of total computational budget.
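As an illustration of the probe classifier described above (a BiLSTM trained on top of frozen GPT-2 representations, so that it shares the LM's embedding layer), here is a hedged PyTorch sketch. The class name, the use of the final position for prediction, and the exact hyperparameters are our assumptions, echoing the FUDGE-style settings stated above rather than reproducing the paper's code.

```python
import torch
import torch.nn as nn
from transformers import GPT2Model

class FoodTypeProbe(nn.Module):
    """BiLSTM food-type classifier over frozen GPT-2 hidden states (illustrative sketch)."""

    def __init__(self, num_classes=7, hidden_size=300):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        self.gpt2.requires_grad_(False)        # frozen: the probe shares GPT-2's representations
        self.lstm = nn.LSTM(
            input_size=self.gpt2.config.hidden_size,
            hidden_size=hidden_size,
            num_layers=3,
            dropout=0.5,
            bidirectional=True,
            batch_first=True,
        )
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, input_ids, attention_mask=None):
        with torch.no_grad():                  # GPT-2 acts only as a feature extractor here
            states = self.gpt2(input_ids, attention_mask=attention_mask).last_hidden_state
        features, _ = self.lstm(states)
        return self.classifier(features[:, -1])  # logits over the 7 food types
```

At sampling time it is the gradient of such a classifier with respect to the shared embedding layer that provides the control signal; the `no_grad` guard above applies only to training the probe itself.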
**Inference times.** We report inference times in seconds per batch (sec/batch). The number of decoded sentences per batch depends on the GPU memory and the size of the model. As depicted in Table 3, using Voronoi measures does not increase the inference time (compare MUCOLA and LANGEVIN). We observe that the inference time of SVS is longer because of the extra momentum-adjustment steps. However, one can reduce the inference time by increasing α.

| model | f1-score | precision | recall | # params |
|---|---|---|---|---|
| BILSTM | 0.87 | 0.87 | 0.87 | 17M |
| BILSTMPROBE | 0.84 | 0.84 | 0.84 | 37M |
| ROBERTA | 0.90 | 0.91 | 0.90 | 124M |
| BILSTMPROBE | 0.90 | 0.90 | 0.90 | 878M |

Table 5: Performance of the food classifiers, and their numbers of learnable parameters, used in the controlled generation experiment. All classifiers are trained and tested on the E2E dataset.

## I More Experiments on the Toy Model

To better understand the advantages of Voronoi sampling over HMC or MUCOLA, we further look at the JS divergence between the reference distribution and the distribution of samples as the number of iterations increases. As depicted in Fig. 4, while the divergence decreases with more iterations across all sampling methods, Voronoi sampling converges to the reference distribution with fewer iterations. We then look at the distribution of sampled elements in Fig. 5. We observe that with 100 iterations, MUCOLA undersamples the element with the maximum probability while oversampling other elements. Finally, we extend the toy model from 4 square cells in $\mathbb{R}^2$ to $2^k$ hypercube cells in $\mathbb{R}^k$ (Fig. 6). As the dimensionality increases, the divergence between the samples' distribution and the reference distribution also increases for all sampling methods. Importantly, Voronoi sampling consistently converges faster across different values of $k$.

Figure 4 (three panels: (a) 500 iterations, (b) 1000 iterations, (c) 100 iterations; x-axis: temperature, y-axis: JS divergence): Comparing the JS divergence of the methods using different numbers of iterations. In general, Voronoi sampling converges to the true distribution faster than HMC and MUCOLA. As the number of iterations increases, the divergence between the samples' distribution and the true distribution decreases across all sampling methods.

## J Controlled Generation Results

Table 6 and Fig. 7 show the performance of the 4 sampling methods on different metrics. We show a sample of generations for each control target in Table 7.
| Target | Method | Success | PPL | Dist-1 | Dist-2 | Dist-3 |
|---|---|---|---|---|---|---|
| Chinese | FUDGE | 0.30 | 5.41 | 0.38 | 0.53 | 0.63 |
| Chinese | MUCOLA | 0.30 | 10.90 | 0.21 | 0.35 | 0.47 |
| Chinese | LANGEVIN | 1.00 | 16.21 | 0.22 | 0.36 | 0.48 |
| Chinese | SVS | 0.90 | 12.42 | 0.22 | 0.36 | 0.49 |
| English | FUDGE | 0.15 | 5.82 | 0.41 | 0.57 | 0.67 |
| English | MUCOLA | 0.45 | 17.76 | 0.26 | 0.39 | 0.49 |
| English | LANGEVIN | 0.70 | 15.82 | 0.27 | 0.43 | 0.55 |
| English | SVS | 0.85 | 15.00 | 0.25 | 0.41 | 0.54 |
| Fast food | FUDGE | 0.25 | 6.44 | 0.41 | 0.57 | 0.68 |
| Fast food | MUCOLA | 0.95 | 5.39 | 0.26 | 0.38 | 0.47 |
| Fast food | LANGEVIN | 1.00 | 10.09 | 0.23 | 0.37 | 0.49 |
| Fast food | SVS | 1.00 | 11.43 | 0.23 | 0.38 | 0.50 |
| French | FUDGE | 0.25 | 6.02 | 0.39 | 0.55 | 0.66 |
| French | MUCOLA | 0.75 | 88.19 | 0.29 | 0.43 | 0.54 |
| French | LANGEVIN | 0.80 | 12.87 | 0.21 | 0.36 | 0.49 |
| French | SVS | 0.95 | 17.23 | 0.21 | 0.35 | 0.47 |
| Indian | FUDGE | 0.55 | 4.52 | 0.41 | 0.55 | 0.64 |
| Indian | MUCOLA | 0.40 | 7.87 | 0.25 | 0.40 | 0.51 |
| Indian | LANGEVIN | 0.95 | 17.67 | 0.26 | 0.41 | 0.54 |
| Indian | SVS | 0.95 | 12.89 | 0.19 | 0.35 | 0.47 |
| Italian | FUDGE | 0.35 | 5.49 | 0.37 | 0.53 | 0.64 |
| Italian | MUCOLA | 0.50 | 18.19 | 0.27 | 0.42 | 0.53 |
| Italian | LANGEVIN | 0.95 | 14.29 | 0.26 | 0.41 | 0.53 |
| Italian | SVS | 0.90 | 15.47 | 0.26 | 0.41 | 0.52 |
| Japanese | FUDGE | 0.30 | 5.44 | 0.42 | 0.55 | 0.65 |
| Japanese | MUCOLA | 0.75 | 83.37 | 0.27 | 0.42 | 0.54 |
| Japanese | LANGEVIN | 1.00 | 12.90 | 0.21 | 0.36 | 0.47 |
| Japanese | SVS | 0.95 | 12.89 | 0.18 | 0.33 | 0.45 |

Table 6: Evaluation of different sampling methods on controlled generation, using three criteria: success in following the control target (measured by the evaluator classifier), fluency (measured by perplexity), and diversity.

Figure 5 (three panels: (a) 500 iterations, (b) 1000 iterations, (c) 100 iterations; bars: MUCOLA, HMC sampling, Voronoi sampling, reference probability): Comparing the distribution of sampled elements at temperature 0.25. With 100 iterations, MUCOLA undersamples the element with the highest probability while oversampling other elements.

Figure 6 (x-axis: $k$ from 2 to 8; y-axis: JS divergence): Comparing the distribution of sampled elements with the true distribution after 100 iterations, at temperature 0.5. There are $2^k$ Voronoi cells with equal base measures in $\mathbb{R}^k$, where the elements to sample from are located at the centers of the Voronoi cells. Voronoi sampling converges faster to the true distribution across all values of $k$. As the dimensionality of the sample space increases, the divergence of all methods increases.

Figure 7: Evaluation of different sampling methods on restaurant review generation, along 6 axes: the mean and standard deviation of negative perplexity¹⁹ (-ppl-mean, -ppl-std), the percentage of generated sentences adhering to the control target (success), and diversity metrics (dist-1, dist-2, dist-3). For most control targets, SVS achieves the highest success rate, with relatively low perplexity.

| Target | Method | Sample |
|---|---|---|
| Chinese | FUDGE | In the city centre near Yippee Noodle Bar Chinese, is Alimentum. It has moderate prices and |
| Chinese | MUCOLA | and has a 1 out of 5. It has food and high customer rating. The Rice Boat is |
| Chinese | LANGEVIN | It serves Chinese food with a low customer rating. The fast food and restaurant The Golden Curry is a |
| Chinese | SVS | It has a low customer rating and a price. The highly rated Chinese restaurant The Phoenix has a high |
| English | FUDGE | It has an average customer Rating. Bibimbap House has English food in the riverside area near |
| English | MUCOLA | and has a low customer rating. The Golden Curry is a children friendly, serving English food, with |
| English | LANGEVIN | It has low rating and is located near the to the city centre. The Phoenix is a English food |
| English | SVS | Alimentum in the city centre near the a moderate price range. It serves English food, is |
| Fast food | FUDGE | A fast food, coffee shop, Strada has a low customer rating, has a price range of over 30. It is |
| Fast food | MUCOLA | and is family friendly and serves fast food. The Wrestlers is a fast food coffee shop in the |
| Fast food | LANGEVIN | It is located near the riverside, is a cheap family friendly fast food restaurant, and is called |
| Fast food | SVS | It is located near the river. The Mill is a cheap, fast food and coffee shop near the |
| French | FUDGE | It has a low-priced Inn French food. It is near Café Rouge.The Alimentum is a kid friendly fast food |
| French | MUCOLA | The French restaurant The Waterman is located in the city centre. The price range is less than |
| French | LANGEVIN | It is a restaurant located in the riverside, the restaurant, offers French food with a price |
| French | SVS | It is a family restaurant that serves French food with a price range and has a low customer rating. |
| Indian | FUDGE | The Phoenix Indian restaurant has moderate prices with a 3 out of 5 rating. Located on the |
| Indian | MUCOLA | It is in the city and has a low customer rating. The Waterman is a low priced |
| Indian | LANGEVIN | It is not child friendly and it is near the river. It serves Indian food and a customer rating |
| Indian | SVS | It is located in the city centre near The Portland Arms Indian food and has a low customer rating. |
| Italian | FUDGE | It has family Italian food and has a low a moderate price range. The Rice Boat has an average |
| Italian | MUCOLA | is a high priced Italian food restaurant with a customer rating of average. The Phoenix is a high |
| Italian | LANGEVIN | It is located in the city centre, it is not family friendly and is a coffee shop serving Italian |
| Italian | SVS | It is located in the the city centre near The Portland Arms.The Eagle is an Italian restaurant. |
| Japanese | FUDGE | Japanese food. Its customer rating is 3 out of 5.The Phoenix is Japanese in the city centre |
| Japanese | MUCOLA | for Japanese food is located in the city centre. It has a low customer rating. The Golden |
| Japanese | LANGEVIN | It is located in the riverside. It is a Japanese food. It is a pub restaurant |
| Japanese | SVS | It is located in the riverside. It is a low rated Japanese restaurant, and coffee shop. |

Table 7: Examples of sampled sentences for the different food control targets.