# A Langevin-like Sampler for Discrete Distributions

Ruqi Zhang¹, Xingchao Liu¹, Qiang Liu¹

¹The University of Texas at Austin. Correspondence to: Ruqi Zhang.

**Abstract.** We propose discrete Langevin proposal (DLP), a simple and scalable gradient-based proposal for sampling complex high-dimensional discrete distributions. In contrast to Gibbs sampling-based methods, DLP is able to update all coordinates in parallel in a single step, and the magnitude of changes is controlled by a stepsize. This allows cheap and efficient exploration in the space of high-dimensional and strongly correlated variables. We prove the efficiency of DLP by showing that the asymptotic bias of its stationary distribution is zero for log-quadratic distributions, and is small for distributions that are close to being log-quadratic. With DLP, we develop several variants of sampling algorithms, including unadjusted, Metropolis-adjusted, stochastic, and preconditioned versions. DLP outperforms many popular alternatives on a wide variety of tasks, including Ising models, restricted Boltzmann machines, deep energy-based models, binary neural networks, and language generation.

## 1. Introduction

Discrete variables are ubiquitous in machine learning problems, ranging from discrete data such as text (Wang & Cho, 2019; Gu et al., 2018) and genomes (Wang et al., 2010) to discrete models such as low-precision neural networks (Courbariaux et al., 2016; Peters & Welling, 2018). As data and models become large-scale and complicated, there is an urgent need for efficient sampling algorithms on complex high-dimensional discrete distributions. Markov chain Monte Carlo (MCMC) methods are typically used to perform sampling, and their efficiency is largely determined by the proposal distribution (Brooks et al., 2011). For general discrete distributions, Gibbs sampling is broadly applied, which resamples a variable from its
conditional distribution with the remaining variables fixed.

*Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).*

Recently, gradient information has been incorporated into the proposal of Gibbs sampling, leading to a substantial boost in convergence speed in discrete spaces (Grathwohl et al., 2021). However, Gibbs-like proposals often struggle on high-dimensional and highly correlated distributions because they make only a small update per step. In contrast, proposals in continuous spaces that leverage gradients can usually make large effective moves. One of the most popular such methods is the Langevin algorithm (Grenander & Miller, 1994; Roberts & Tweedie, 1996; Roberts & Stramer, 2002), which drives the sampler towards high-probability regions by following a Langevin diffusion. Due to its simplicity and efficiency, the Langevin algorithm has been widely used for sampling from complicated high-dimensional continuous distributions in machine learning and deep learning tasks (Welling & Teh, 2011; Li et al., 2016; Grathwohl et al., 2019; Song & Ermon, 2019). Its great success leads us to ask: what is the simplest and most natural analogue of the Langevin algorithm in discrete domains?

In this paper, we develop such a Langevin-like proposal for discrete distributions, which can update many coordinates of the variable based on a single gradient computation. By rewriting the proposal of the standard Langevin algorithm, we find that it can be easily adapted to discrete spaces and computed cheaply in parallel thanks to its coordinatewise factorization. We call this proposal the discrete Langevin proposal (DLP). Inheriting from the Langevin algorithm, DLP updates all coordinates in a single step in parallel, with the magnitude of changes controlled by a stepsize. Using this proposal, we obtain high-quality samples conveniently on a variety of tasks.
We summarize our contributions as follows:

- We propose the discrete Langevin proposal (DLP), a gradient-based proposal for sampling discrete distributions. DLP can update many coordinates in a single step with only one gradient computation.
- We theoretically prove the efficiency of DLP by showing that, without a Metropolis-Hastings correction, its asymptotic bias is zero for log-quadratic distributions and small for distributions that are close to being log-quadratic.
- With DLP, we develop several variants of sampling algorithms, including unadjusted, Metropolis-adjusted, stochastic, and preconditioned versions, indicating the general applicability of DLP in different scenarios.
- We provide extensive experimental results, including Ising models, restricted Boltzmann machines, deep energy-based models, binary Bayesian neural networks, and text generation, to demonstrate the superiority of DLP in general settings.

## 2. Related Work

**Gibbs Sampling-based Methods** Gibbs sampling is perhaps the de facto method for sampling from general discrete distributions. In each step, it updates one variable while leaving the others unchanged. Updating a block of variables is possible, but the cost typically grows with the block size. To speed up the convergence of Gibbs sampling in high dimensions, Grathwohl et al. (2021) use gradient information to choose which coordinate to update, and Titsias & Yau (2017) introduce auxiliary variables to trade off the number of variables updated in a block against computation. However, inheriting from Gibbs sampling, these methods still incur a large overhead to make significant changes (e.g., > 5 coordinates) to the configuration in one step.

**Locally-Balanced Proposals** Based on information from a local neighborhood of the current position, locally-balanced proposals have been developed for sampling from discrete distributions (Zanella, 2020).
They have since been extended to continuous-time Markov processes (Power & Goldman, 2019) and tuned via mutual information (Sansone, 2021). As with Gibbs sampling-based methods, this type of proposal is very expensive to construct when the local neighborhood is large, preventing large moves in discrete spaces. A concurrent work (Sun et al., 2022) explores a larger neighborhood by making a sequence of small movements. However, it still updates only one coordinate per gradient computation, and the updates must be done in sequence, whereas our method updates many coordinates in parallel based on one gradient computation.

**Continuous Relaxation** Incorporating gradients into the proposal has been a great success in continuous spaces, as in the Langevin algorithm, Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal et al., 2011), and their variants. To exploit this success, continuous relaxation performs sampling in a continuous space with gradient-based methods and then maps the collected samples back to the original discrete space (Pakman & Paninski, 2013; Nishimura et al., 2020; Han et al., 2020; Zhou, 2020; Jaini et al., 2021; Zhang et al., 2022). The efficiency of continuous relaxation depends heavily on the properties of the extended continuous distribution, which may itself be difficult to sample from. As shown in previous work, this type of method usually does not scale to high-dimensional discrete distributions (Grathwohl et al., 2021).

## 3. Preliminaries

We consider sampling from a target distribution $\pi(\theta) = \exp(U(\theta))/Z$, $\theta \in \Theta$, where $\theta$ is a $d$-dimensional variable, $\Theta$ is a finite¹ variable domain, $U$ is the energy function, and $Z$ is the normalizing constant that makes $\pi$ a distribution. In this paper, we restrict ourselves to a factorized domain, that is, $\Theta = \prod_{i=1}^d \Theta_i$, and mainly consider $\Theta$ to be $\{0,1\}^d$ or $\{0,1,\dots,S-1\}^d$. Additionally, we assume that $U$ can be extended to a differentiable function on $\mathbb{R}^d$.
Many popular models have such natural extensions, including Ising models, Potts models, restricted Boltzmann machines, and (deep) energy-based models.

**Langevin Algorithm** In continuous spaces, one of the most powerful sampling methods is the Langevin algorithm, which follows a Langevin diffusion to update variables:

$$\theta' = \theta + \frac{\alpha}{2}\nabla U(\theta) + \sqrt{\alpha}\,\xi, \quad \xi \sim \mathcal{N}(0, I_{d\times d}),$$

where $\alpha$ is a stepsize. The gradient helps the sampler explore high-probability regions efficiently. Computing the gradient and sampling a Gaussian variable can generally be done cheaply in parallel on CPUs and GPUs. As a result, the Langevin algorithm is especially compelling for complex high-dimensional distributions and has been used extensively in machine learning and deep learning.

## 4. Discrete Langevin Proposal

In this section, we propose the discrete Langevin proposal (DLP), a simple counterpart of the Langevin algorithm in discrete domains. At the current position $\theta$, the proposal distribution $q(\cdot|\theta)$ produces the next position to move to. As introduced in Section 3, $q(\cdot|\theta)$ of the Langevin algorithm in continuous spaces can be viewed as a Gaussian distribution with mean $\theta + (\alpha/2)\nabla U(\theta)$ and covariance $\alpha I_{d\times d}$. Obviously, we cannot use this Gaussian proposal in discrete spaces directly. However, we notice that by explicitly indicating the space over which the normalizing constant is computed, this proposal becomes applicable to any kind of space. Specifically, we write out the variable domain $\Theta$ explicitly in the proposal distribution,

$$q(\theta'|\theta) = \frac{1}{Z_\Theta(\theta)} \exp\left(-\frac{1}{2\alpha}\left\|\theta' - \theta - \frac{\alpha}{2}\nabla U(\theta)\right\|_2^2\right), \quad (1)$$

where the normalizing constant $Z_\Theta(\theta)$ is integrated (continuous) or summed (discrete) over $\Theta$ (we use sums below). Here, $\Theta$ can be any space without affecting $q$ being a valid proposal.

¹We consider finite discrete distributions in this paper. However, our algorithms can be easily extended to infinite distributions; see Appendix C for a discussion.
As a special case, when $\Theta = \mathbb{R}^d$, it follows that $Z_{\mathbb{R}^d}(\theta) = (2\pi\alpha)^{d/2}$, recovering the Gaussian proposal of the standard Langevin algorithm. When $\Theta$ is a discrete space, we naturally obtain a gradient-based proposal for discrete variables. Computing the sum over the full space in $Z_\Theta(\theta)$ is generally very expensive; for example, the cost is $O(S^d)$ for $\Theta = \{0,1,\dots,S-1\}^d$. This is why previous methods often restrict their proposals to a small neighborhood. A key feature of the proposal in Equation (1) is that it factorizes coordinatewise. To see this, we write Equation (1) as $q(\theta'|\theta) = \prod_{i=1}^d q_i(\theta_i'|\theta)$, where $q_i(\theta_i'|\theta)$ is a simple categorical distribution of the form

$$q_i(\theta_i'|\theta) = \mathrm{Categorical}\left(\mathrm{Softmax}\left(\frac{1}{2}\nabla U(\theta)_i\,(\theta_i' - \theta_i) - \frac{(\theta_i' - \theta_i)^2}{2\alpha}\right)\right), \quad \theta_i' \in \Theta_i, \quad (2)$$

(note that the exponent does not contain the term $(\alpha/2\,\nabla U(\theta)_i)^2$ because it is independent of $\theta_i'$ and does not affect the softmax). Combined with the coordinatewise factorized domain $\Theta$, the above proposal lets us update every coordinate in parallel after computing the gradient $\nabla U(\theta)$. The cost of the gradient computation is also $O(d)$; therefore, the overall cost of constructing this proposal is linear rather than exponential in $d$. This allows the sampler to explore the full space with gradient information without paying a prohibitive cost. We call the proposal in Equation (2) the discrete Langevin proposal (DLP). DLP can be used with or without a Metropolis-Hastings (MH) step (Metropolis et al., 1953; Hastings, 1970), which is commonly combined with proposals to make the Markov chain reversible. Specifically, after generating the next position $\theta'$ from $q(\cdot|\theta)$, the MH step accepts it with probability

$$\min\left\{1,\; \exp\left(U(\theta') - U(\theta)\right)\frac{q(\theta|\theta')}{q(\theta'|\theta)}\right\}. \quad (3)$$

By rejecting some of the proposed positions, the Markov chain is guaranteed to converge asymptotically to the target distribution. We outline the sampling algorithms using DLP in Algorithm 1.
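To make the construction concrete, here is a minimal NumPy sketch of DLP with the optional MH correction, for a binary domain and an illustrative log-quadratic target; the dimension, target parameters, and all function names are our own choices for illustration, not the authors' released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative log-quadratic target pi(theta) ∝ exp(U(theta)) on {0,1}^d.
d = 8
W = rng.normal(scale=0.1, size=(d, d)); W = (W + W.T) / 2
b = rng.normal(scale=0.1, size=d)
U     = lambda th: th @ W @ th + b @ th
gradU = lambda th: 2 * W @ th + b          # gradient of the continuous extension

VALUES = np.array([0.0, 1.0])              # admissible values of each coordinate

def dlp_probs(theta, alpha):
    """Coordinatewise categorical probabilities of DLP (Eq. 2):
    q_i(v|theta) ∝ exp(0.5*gradU_i*(v - theta_i) - (v - theta_i)^2 / (2*alpha))."""
    diff = VALUES[None, :] - theta[:, None]                  # shape (d, |Theta_i|)
    logits = 0.5 * gradU(theta)[:, None] * diff - diff**2 / (2 * alpha)
    p = np.exp(logits - logits.max(axis=1, keepdims=True))   # stable softmax
    return p / p.sum(axis=1, keepdims=True)

def propose(theta, alpha):
    """Sample all d coordinates independently (one gradient evaluation total)."""
    p = dlp_probs(theta, alpha)
    idx = (rng.random((d, 1)) > p.cumsum(axis=1)).sum(axis=1)
    log_fwd = np.log(p[np.arange(d), idx]).sum()             # log q(theta'|theta)
    return VALUES[idx], log_fwd

def logq(theta_from, theta_to, alpha):
    """log q(theta_to | theta_from), needed for the reverse proposal."""
    p = dlp_probs(theta_from, alpha)
    idx = np.searchsorted(VALUES, theta_to)
    return np.log(p[np.arange(d), idx]).sum()

def dmala_step(theta, alpha):
    """One DMALA step: DLP proposal plus the MH acceptance test of Eq. (3)."""
    prop, log_fwd = propose(theta, alpha)
    log_acc = U(prop) - U(theta) + logq(prop, theta, alpha) - log_fwd
    return prop if np.log(rng.random()) < log_acc else theta

theta = rng.integers(0, 2, size=d).astype(float)
for _ in range(100):
    theta = dmala_step(theta, alpha=0.5)   # DULA would skip the MH test
```

Dropping the acceptance test in `dmala_step` (always returning `prop`) gives the unadjusted variant at half the energy-evaluation cost.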
We call DLP without the MH step the discrete unadjusted Langevin algorithm (DULA) and DLP with the MH step the discrete Metropolis-adjusted Langevin algorithm (DMALA). Similar to MALA and ULA in continuous spaces (Grenander & Miller, 1994; Roberts & Stramer, 2002), DMALA requires two gradient computations and two function evaluations per step and is guaranteed to converge to the target distribution, while DULA may have asymptotic bias but requires only one gradient computation, which is especially valuable when performing the MH step is expensive, such as in large-scale Bayesian inference (Welling & Teh, 2011; Durmus & Moulines, 2019).

**Connection to Locally-Balanced Proposals** Zanella (2020) has developed a class of locally-balanced proposals that can be used in both discrete and continuous spaces. One of them is defined as

$$r(\theta'|\theta) \propto \exp\left(\frac{U(\theta') - U(\theta)}{2} - \frac{\|\theta' - \theta\|_2^2}{2\alpha}\right),$$

where $\theta' \in \Theta$. DLP can be viewed as a first-order Taylor approximation of $r(\theta'|\theta)$ using $U(x) - U(\theta) \approx \nabla U(\theta)^\top (x - \theta)$, $x \in \Theta$. Zanella (2020) discussed the connection between these proposals and the Metropolis-adjusted Langevin algorithm (MALA) in continuous spaces but did not explore it in discrete spaces. Grathwohl et al. (2021) use a similar Taylor approximation for another locally-balanced proposal,

$$r'(\theta'|\theta) \propto \exp\left(\frac{U(\theta') - U(\theta)}{2}\right), \quad (4)$$

where $\theta'$ belongs to a Hamming ball centered at $\theta$ with window size 1. Like Gibbs sampling, their proposal updates only one coordinate per step. They also propose an extension that updates $X$ coordinates per step, but at the cost of $X$ gradient computations. See Appendix D for a discussion of Taylor approximations of Equation (4) without window sizes. Beyond previous works, we carefully investigate the properties of the Langevin-like proposal in Equation (2) in discrete spaces, providing both convergence analysis and extensive empirical demonstration.
We find that this simple approach explores the discrete structure surprisingly well, leading to substantial improvements on a range of tasks.

**Algorithm 1** Samplers with the discrete Langevin proposal (DULA and DMALA).
- **given:** stepsize $\alpha$.
- **loop:**
  - for $i = 1:d$ (can be done in parallel): construct $q_i(\cdot|\theta)$ as in Equation (2); sample $\theta_i' \sim q_i(\cdot|\theta)$.
  - optionally, do the MH step: compute $q(\theta'|\theta) = \prod_i q_i(\theta_i'|\theta)$ and $q(\theta|\theta') = \prod_i q_i(\theta_i|\theta')$; set $\theta \leftarrow \theta'$ with the probability in Equation (3).
- **output:** samples $\{\theta^k\}$.

## 5. Convergence Analysis for DULA

In the previous section, we showed that DLP is a convenient gradient-based proposal for discrete distributions. However, the effectiveness of a proposal also depends on how close its underlying stationary distribution is to the target distribution: if it is far, then even with an MH step to correct the bias, the acceptance probability will be very low. In this section, we provide an asymptotic convergence analysis for DULA (i.e., the sampler using DLP without the MH step). Specifically, we first prove in Section 5.1 that as the stepsize $\alpha \to 0$, the asymptotic bias of DULA is zero for log-quadratic distributions, which are defined as

$$\pi(\theta) \propto \exp\left(\theta^\top W \theta + b^\top \theta\right), \quad \theta \in \Theta, \quad (5)$$

with constants $W \in \mathbb{R}^{d\times d}$ and $b \in \mathbb{R}^d$. Without loss of generality, we assume $W$ is symmetric (otherwise we can replace $W$ with $(W + W^\top)/2$ without changing $\pi$). In Section 5.2, we extend the result to general distributions, showing that the asymptotic bias of DULA is small for distributions that are close to being log-quadratic.

### 5.1. Convergence on Log-Quadratic Distributions

We consider a log-quadratic distribution $\pi(\theta)$ as defined in Equation (5). This type of distribution appears in common tasks such as Ising models. The following theorem summarizes DULA's asymptotic accuracy for such $\pi$.

**Theorem 5.1.** Suppose the target distribution $\pi$ is log-quadratic as defined in Equation (5).
Then the Markov chain following the transition $q(\cdot|\theta)$ in Equation (2) (i.e., DULA) is reversible with respect to some distribution $\pi_\alpha$, and $\pi_\alpha$ converges weakly to $\pi$ as $\alpha \to 0$. In particular, let $\lambda_{\min}$ be the smallest eigenvalue of $W$; then for any $\alpha > 0$,

$$\|\pi_\alpha - \pi\|_1 \le \frac{c(\lambda_{\min}, d)}{Z}\exp\left(-\frac{1}{2\alpha}\right),$$

where $Z$ is the normalizing constant of $\pi$ and $c(\lambda_{\min}, d)$ is an explicit constant depending on $\lambda_{\min}$ and $d$.

Theorem 5.1 shows that the asymptotic bias of DULA decreases at an $O(\exp(-1/(2\alpha)))$ rate, vanishing as the stepsize $\alpha \to 0$. This parallels the Langevin algorithm in continuous spaces, which converges asymptotically as the stepsize goes to zero. We empirically verify this theorem in Figure 1a: we run DULA with varying stepsizes on a 2-by-2 Ising model, running each chain long enough to ensure convergence. The results clearly show that the distance between the stationary distribution of DULA and the target distribution decreases as the stepsize decreases. Moreover, the decrease roughly aligns with a function containing $\exp(-1/(2\alpha))$, matching the convergence rate in $\alpha$ given by Theorem 5.1.

### 5.2. Convergence on General Distributions

To generalize the convergence result from log-quadratic distributions to general distributions, we first note that the stationary distribution of DULA always exists and is unique, since its transition matrix is irreducible (Levin & Peres, 2017). We then want a bound that is tight in the log-quadratic case and that depends on how much the distribution differs from being log-quadratic. A natural measure of this difference is the distance of $\nabla U$ to a linear function. Specifically, we assume that there exist $W \in \mathbb{R}^{d\times d}$, $b \in \mathbb{R}^d$, and $\epsilon \in \mathbb{R}_+$ such that

$$\left\|\nabla U(\theta) - (2W\theta + b)\right\|_1 \le \epsilon, \quad \forall\,\theta \in \Theta. \quad (6)$$

Then we have the following theorem on the asymptotic bias of DULA for general distributions.

**Theorem 5.2.**
Let $\pi$ be the target distribution and $\pi^*(\theta) = \exp(\theta^\top W\theta + b^\top\theta)/Z^*$ be the log-quadratic distribution satisfying the assumption in Equation (6). Then the stationary distribution of DULA satisfies

$$\|\pi_\alpha - \pi\|_1 \le 2c_1\left(\exp(c_2\epsilon) - 1\right) + \frac{c}{Z^*}\exp\left(-\frac{1}{2\alpha}\right),$$

where $c_1$ is a constant depending on $\pi$ and $\alpha$, $c_2$ is a constant depending on $\Theta$ and $\max_{\theta,\theta'\in\Theta}\|\theta - \theta'\|_1$, and $c$ is the explicit constant from Theorem 5.1 (depending on $\lambda_{\min}$ and $d$).

The first term in the bound captures the bias induced by the deviation of $\pi$ from being log-quadratic; it vanishes at rate $O(\exp(c_2\epsilon) - 1) = O(\epsilon)$ as $\epsilon \to 0$. The second term captures the bias from using a non-zero stepsize and follows directly from Theorem 5.1. To ensure satisfactory convergence, Theorem 5.2 suggests choosing a continuous extension of $U$ whose gradient is close to a linear function.

**Example.** To empirically verify Theorem 5.2, we consider a 1-d distribution

$$\pi(\theta) \propto \exp\left(a\theta^2 + b\theta + 2\epsilon\sin\left(\frac{\theta\pi}{2}\right)\right),$$

where $\theta \in \{-1, 1\}$ and $a, b, \epsilon \in \mathbb{R}$. The gradient is $\nabla U(\theta) = 2a\theta + b + \epsilon\pi\cos(\theta\pi/2)$, whose closeness to a linear function on $\mathbb{R}$ is controlled by $\epsilon$. From Figure 1b, we can see that as $\epsilon$ decreases, the asymptotic bias of DULA decreases, aligning with Theorem 5.2. In fact, the example $\pi(\theta) \propto \exp(2\epsilon\sin(\theta\pi/2))$ with $\theta \in \{-1, 1\}$ can be considered a counterexample for any existing gradient-based proposal in discrete domains: the gradient on $\{-1, 1\}$ is always zero regardless of the value of $\epsilon$, whereas the target distribution clearly depends on $\epsilon$. This suggests that we should be careful with the choice of the continuous extension of $U$, since some extensions provide no useful gradient information to guide exploration. Theorem 5.2 provides guidance on how to choose such an extension.

Figure 1. Empirical verification of the theorems. Left: DULA with varying stepsizes $\alpha$ on an Ising model; the L1 distance decays roughly like $0.6\exp(-\frac{1}{2\alpha})$. Right: a 1-d distribution with varying closeness $\epsilon$ to being log-quadratic.
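The 1-d counterexample above is easy to check numerically: under this continuous extension the gradient carries no information about $\epsilon$ at the two support points, while the target ratio $\pi(1)/\pi(-1) = \exp(4\epsilon)$ clearly does. A small self-contained sketch (the values of $\epsilon$ are our own choices):

```python
import numpy as np

# pi(theta) ∝ exp(2*eps*sin(pi*theta/2)) on {-1, +1}, with the natural
# continuous extension U(theta) = 2*eps*sin(pi*theta/2).
for eps in (0.1, 0.5, 2.0):
    U     = lambda t: 2 * eps * np.sin(np.pi * t / 2)
    gradU = lambda t: eps * np.pi * np.cos(np.pi * t / 2)

    # The target clearly depends on eps: pi(+1)/pi(-1) = exp(4*eps) ...
    ratio = np.exp(U(1.0) - U(-1.0))
    assert np.isclose(ratio, np.exp(4 * eps))

    # ... yet the gradient vanishes at both support points (cos(±pi/2) = 0),
    # so any gradient-based proposal receives no information about eps.
    assert abs(gradU(1.0)) < 1e-12 and abs(gradU(-1.0)) < 1e-12
```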
## 6. Other Variants

Thanks to its similarity with the standard Langevin algorithm, the discrete Langevin proposal can be extended to different usage scenarios by following the rich literature on the standard Langevin algorithm. We briefly discuss two such variants in this section.

**With Stochastic Gradients** Similar to stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011), we can replace the full-batch gradient in DLP with an unbiased stochastic estimate $\hat\nabla U$, which further reduces the cost of our method on large-scale problems. To show the influence of the stochastic estimate, we consider a binary domain for simplicity. We further assume that the stochastic gradient has bounded variance and that the norms of the true and stochastic gradients are bounded. Then we have the following theorem.

**Theorem 6.1.** Let $\Theta = \{0,1\}^d$. Assume the true gradient $\nabla U$ and the stochastic gradient $\hat\nabla U$ satisfy $\mathbb{E}[\hat\nabla U_i] = \nabla U_i$ and $\mathrm{Var}[\hat\nabla U_i] \le \sigma^2$ for some constant $\sigma$, and $|\hat\nabla U_i|, |\nabla U_i| \le L$ for some constant $L$. Let $q_i$ and $\hat q_i$ be the discrete Langevin proposals for coordinate $i$ using the full-batch gradient and the stochastic gradient, respectively. Then

$$\left\|\mathbb{E}[\hat q_i] - q_i\right\|_1 \le 2\sigma\, c(\alpha, L),$$

where $c(\alpha, L)$ is an explicit constant that decreases as the stepsize $\alpha$ decreases.

This suggests that as the variance of the stochastic gradient or the stepsize decreases, the stochastic DLP is in expectation closer to the full-batch DLP. We test DULA with stochastic gradients on a binary Bayesian neural network and empirically verify that it works well in practice in Appendix I.5.

**With Preconditioners** When $\pi$ varies more quickly in some coordinates than in others, a single stepsize may result in slow mixing. In this situation, a preconditioner that adapts the stepsize across coordinates can help. We show that DLP easily incorporates diagonal preconditioners, as long as the coordinatewise factorization still holds.
For example, when the preconditioner is constant, that is, when we scale each coordinate by a number $g_i$, the discrete Langevin proposal becomes

$$q_i(\theta_i'|\theta) \propto \exp\left(\frac{1}{2}\nabla U(\theta)_i\,(\theta_i' - \theta_i) - \frac{(\theta_i' - \theta_i)^2}{2\alpha g_i}\right).$$

Similar to preconditioners in continuous spaces, $g_i$ adjusts the stepsize of each coordinate according to its speed of variation. The above proposal is obtained by applying a coordinate transformation to $\theta$ and then mapping the DLP update back to the original space; in this way, the theoretical results in Section 5 apply directly. We give more details in Appendix H.

## 7. Experiments

We conduct a thorough empirical evaluation of the discrete Langevin proposal (DLP) against a range of popular baselines: Gibbs sampling, Gibbs with Gradient (GWG) (Grathwohl et al., 2021), and the Hamming ball sampler (HB) (Titsias & Yau, 2017), three Gibbs-based approaches; discrete Stein variational gradient descent (DSVGD) (Han et al., 2020) and relaxed MALA (RMALA) (Grathwohl et al., 2021), two continuous-relaxation methods; and a locally balanced sampler (LB-1) (Zanella, 2020), which uses the locally-balanced proposal in Equation (4) with window size 1. We denote by Gibbs-X Gibbs sampling with block size X, by GWG-X GWG with X indices modified per step (see D.2 in Grathwohl et al. (2021) for more details on GWG-X), and by HB-X-Y HB with block size X and Hamming ball size Y. All methods are implemented in PyTorch, and we use the official code releases from previous papers when possible. In our implementation, DMALA (i.e., the discrete Langevin proposal with an MH step) has a similar cost per step to GWG-1 (the main costs for both are one gradient computation and two function evaluations), which is roughly 2.5x that of Gibbs-1. DULA (i.e., the discrete Langevin proposal without an MH step) has a similar cost per step to Gibbs-1 (the main cost for DULA is one gradient computation, and for Gibbs-1 it is $d$ function evaluations).
We release the code at https://github.com/ruqizhang/discrete-langevin.

### 7.1. Sampling from Ising Models

We consider a 5-by-5 lattice Ising model with random variable $\theta \in \{-1,1\}^d$, $d = 5\times5 = 25$. The energy function is $U(\theta) = a\,\theta^\top W\theta + b\sum_i\theta_i$, where $W$ is the binary adjacency matrix of the lattice, $a = 0.1$ is the connectivity strength, and $b = 0.2$ is the bias. We first show that DLP can change many coordinates in one iteration while maintaining a high acceptance rate (Figure 2, left). With stepsize $\alpha = 0.6$, DMALA changes 6 coordinates per iteration on average with a 52% acceptance rate in the MH step. In comparison, GWG-6 (which changes at most 6 coordinates) has only a 43% acceptance rate, and moreover requires 6x the cost of DMALA. This demonstrates that DLP can make large and effective moves in discrete spaces. We compare the root-mean-square error (RMSE) between the estimated mean and the true mean in Figure 3. DMALA is the fastest to converge in terms of both runtime and iterations. This demonstrates the importance of (1) using gradient information to explore the space, compared to Gibbs and HB; (2) sampling in the original discrete space, compared to DSVGD and RMALA; and (3) changing many coordinates in one step, compared to LB-1 and GWG-1. GWG-4 underperforms because of a lower acceptance rate than DMALA. DULA achieves results similar to LB-1 and GWG-1 but worse than DMALA, indicating that the MH step accelerates convergence on this task. In Figure 2 (right), we compare the effective sample size (ESS) per second for exact samplers (i.e., samplers having the target distribution as their stationary distribution). DMALA significantly outperforms the other methods, indicating low correlation among its samples due to the significant updates made in each step. We additionally present results on Ising models with different connectivity strengths $a$ in Appendix I.2.
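As a side note on why the gradient is so informative for this model: since the energy is quadratic and the lattice adjacency matrix has zero diagonal, the first-order term that DLP uses is exact for single-spin flips. A short sketch (our own construction of the lattice adjacency; not the paper's experiment code):

```python
import numpy as np

# 5x5 lattice Ising model: U(theta) = a * theta^T W theta + b * sum(theta),
# theta in {-1,+1}^25, W the (zero-diagonal) lattice adjacency matrix.
n, a, b = 5, 0.1, 0.2
d = n * n
W = np.zeros((d, d))
for i in range(n):
    for j in range(n):
        k = i * n + j
        if i + 1 < n: W[k, k + n] = W[k + n, k] = 1.0  # vertical edge
        if j + 1 < n: W[k, k + 1] = W[k + 1, k] = 1.0  # horizontal edge

U     = lambda th: a * th @ W @ th + b * th.sum()
gradU = lambda th: 2 * a * W @ th + b

rng = np.random.default_rng(1)
theta = rng.choice([-1.0, 1.0], size=d)

# Flipping spin k changes U by exactly gradU(theta)_k * (theta'_k - theta_k),
# because W has zero diagonal -- so DLP's linearization is exact here.
k = 7
flipped = theta.copy(); flipped[k] = -flipped[k]
exact  = U(flipped) - U(theta)
linear = gradU(theta)[k] * (flipped[k] - theta[k])
assert np.isclose(exact, linear)
```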
In what follows, we mainly compare our method with GWG-1 and Gibbs-1, as the other methods either could not give reasonable results or are too costly to run.

Figure 2. Ising model sampling results. Left: DLP is able to change many coordinates (shown against the stepsize $\alpha$) while keeping a high acceptance rate. Right: DMALA yields the largest effective sample size (ESS) per second among all the methods compared (Gibbs-4, HB-10-1, GWG-4, LB-1, DMALA).

Figure 3. Ising model sampling results. DMALA outperforms the baselines (Gibbs-1, Gibbs-4, HB-10-1, GWG-1, GWG-4, LB-1, DSVGD, RMALA, DULA) in both number of iterations and running time.

### 7.2. Sampling from Restricted Boltzmann Machines

Restricted Boltzmann machines (RBMs) are generative artificial neural networks that learn an unnormalized probability over inputs,

$$U(\theta) = \sum_i \mathrm{Softplus}(W\theta + a)_i + b^\top\theta,$$

where $\{W, a, b\}$ are parameters and $\theta \in \{0,1\}^d$. Following Grathwohl et al. (2021), we train $\{W, a, b\}$ with contrastive divergence (Hinton, 2002) on the MNIST dataset for one epoch. We measure the maximum mean discrepancy (MMD) between the obtained samples and those from block-Gibbs sampling, which exploits the known structure.

**Results** The results are shown in Figure 4. Since DULA and DMALA can update multiple coordinates in a single iteration, they converge remarkably faster than the baselines in both iterations and runtime. Furthermore, DMALA reaches the lowest MMD (-6.5) among all the methods after 5,000 iterations, demonstrating the importance of the MH step. We leave the sampled images to Appendix I.3.

Figure 4. RBM sampling results. DULA and DMALA converge to the true distribution faster than the other methods (Gibbs-1, HB-10-1, GWG-1).
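A minimal sketch of this marginal RBM log-probability and its gradient, with randomly initialized parameters standing in for the trained MNIST model (the sizes and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
nv, nh = 20, 10                     # visible / hidden sizes (illustrative)
W = rng.normal(scale=0.1, size=(nh, nv))
a = rng.normal(scale=0.1, size=nh)
b = rng.normal(scale=0.1, size=nv)

def U(theta):
    """Unnormalized log-probability with hidden units marginalized out:
    U(theta) = sum_i Softplus(W theta + a)_i + b^T theta."""
    z = W @ theta + a
    return np.logaddexp(0.0, z).sum() + b @ theta   # numerically stable softplus

def gradU(theta):
    """d/dz softplus(z) = sigmoid(z), so grad U = W^T sigmoid(W theta + a) + b."""
    z = W @ theta + a
    return W.T @ (1.0 / (1.0 + np.exp(-z))) + b

# Check the analytic gradient of the continuous extension by finite differences.
theta = rng.integers(0, 2, size=nv).astype(float)
eps, i = 1e-6, 3
e = np.zeros(nv); e[i] = eps
fd = (U(theta + e) - U(theta - e)) / (2 * eps)
assert np.isclose(fd, gradU(theta)[i], atol=1e-5)
```

This `gradU` is the single vector that DLP needs per step to build all $d$ coordinatewise categorical proposals.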
### 7.3. Learning Energy-Based Models

Energy-based models (EBMs) have achieved great success in various areas of machine learning (LeCun et al., 2006). Generally, the density of an energy-based model is defined as $p_\theta(x) = \exp(-E_\theta(x))/Z_\theta$, where $E_\theta$ is a function parameterized by $\theta$ and $Z_\theta$ is the normalizing constant. Training an EBM usually involves maximizing the log-likelihood $\mathcal{L}(\theta) = \mathbb{E}_{x\sim p_{\mathrm{data}}}[\log p_\theta(x)]$. However, direct optimization requires computing $Z_\theta$, which is intractable in most scenarios. Instead, one usually estimates the gradient of the log-likelihood,

$$\nabla\mathcal{L}(\theta) = \mathbb{E}_{x\sim p_\theta}\left[\nabla_\theta E_\theta(x)\right] - \mathbb{E}_{x\sim p_{\mathrm{data}}}\left[\nabla_\theta E_\theta(x)\right].$$

While the second term is easy to estimate from the data, the first term requires samples from $p_\theta$. Better samplers can improve the training of $E_\theta$, leading to EBMs with higher performance.

Figure 5. Ising model learning results. DULA and DMALA find $W$ in a shorter time than the baselines (Gibbs-1, GWG-1), and the improvement grows as the number of sampling steps shrinks. Panels: (a) RMSE versus iterations; (b) RMSE versus runtime; (c) RMSE versus sampling steps.

Table 1. EBM learning results. We report test-set log-likelihoods for different models trained on discrete image datasets.

| Dataset | VAE (Conv) | EBM (Gibbs) | EBM (GWG) | EBM (DULA) | EBM (DMALA) |
| --- | --- | --- | --- | --- | --- |
| Static MNIST | -82.41 | -117.17 | -80.01 | -80.71 | -79.46 |
| Dynamic MNIST | -80.40 | -121.19 | -80.51 | -81.29 | -79.54 |
| Omniglot | -97.65 | -142.06 | -94.72 | -145.68 | -91.11 |
| Caltech Silhouettes | -106.35 | -163.50 | -96.20 | -100.52 | -87.82 |

#### 7.3.1. Ising Models

As in Grathwohl et al. (2021), we construct a 25-by-25 Ising model and generate training data by running a Gibbs sampler. In this experiment, $E_\theta$ is an Ising model with a learnable parameter $\hat W$. We evaluate the samplers by computing the RMSE between the estimated $\hat W$ and the true $W$.

**Results** Our results are summarized in Figure 5.
In (a), DMALA and DULA always have smaller RMSE than the baselines for the same number of iterations. In (b), DMALA and DULA reach a log-RMSE of -5.0 in 800s, while the baseline methods fail to reach -5.0 in 1,400s. In (c), we vary the number of sampling steps per iteration from 5 to 100 (we omit the results of Gibbs-1 since it diverges with fewer than 100 steps) and report the RMSE after 10,000 iterations. DMALA and DULA outperform GWG consistently, and the improvement grows as the number of sampling steps shrinks, demonstrating the fast mixing of the discrete Langevin proposal.

#### 7.3.2. Deep EBMs

We train deep EBMs in which $E_\theta$ is a ResNet (He et al., 2016), using persistent contrastive divergence (Tieleman, 2008; Tieleman & Hinton, 2009) and a replay buffer (Du & Mordatch, 2019), following Grathwohl et al. (2021). We run DMALA and DULA for 40 steps per iteration. After training, we use annealed importance sampling (Neal, 2001) to estimate the likelihood. The results for GWG and Gibbs are taken from Grathwohl et al. (2021), and those for the VAE from Tomczak & Welling (2018).

**Results** In Table 1, DMALA yields the highest log-likelihood among all methods, and the images it generates (Figure 9 in Appendix I.4) are very close to the true images. DULA runs the same number of steps as DMALA and GWG but at only half the cost. We hypothesize that running DULA for more steps or with an adaptive stepsize schedule (Song & Ermon, 2019) would further improve its performance.

### 7.4. Binary Bayesian Neural Networks

Bayesian neural networks have been shown to provide strong predictions and uncertainty estimates in deep learning (Hernández-Lobato & Adams, 2015; Zhang et al., 2020; Liu et al., 2021a). Meanwhile, binary neural networks (Courbariaux et al., 2016; Rastegari et al., 2016; Liu et al., 2021b), i.e., networks with weights in $\{-1,1\}$, accelerate learning and significantly reduce computational and memory costs.
To combine the benefits of both worlds, we consider training a binary Bayesian neural network with discrete sampling. We conduct regression on four UCI datasets (Dua & Graff, 2017).

Table 2. Results for binary Bayesian neural networks on different datasets.

Training log-likelihood (↑):

| Dataset | Gibbs | GWG | DULA | DMALA |
| --- | --- | --- | --- | --- |
| COMPAS | -0.4063 ± 0.0016 | -0.3729 ± 0.0019 | -0.3590 ± 0.0003 | -0.3394 ± 0.0016 |
| News | -0.2374 ± 0.0003 | -0.2368 ± 0.0003 | -0.2339 ± 0.0003 | -0.2344 ± 0.0005 |
| Adult | -0.4587 ± 0.0029 | -0.4444 ± 0.0041 | -0.3287 ± 0.0027 | -0.3190 ± 0.0019 |
| Blog | -0.3829 ± 0.0036 | -0.3414 ± 0.0028 | -0.2694 ± 0.0025 | -0.2670 ± 0.0031 |

Test RMSE (↓):

| Dataset | Gibbs | GWG | DULA | DMALA |
| --- | --- | --- | --- | --- |
| COMPAS | 0.4718 ± 0.0022 | 0.4757 ± 0.0008 | 0.4879 ± 0.0024 | 0.4879 ± 0.0003 |
| News | 0.1048 ± 0.0025 | 0.1028 ± 0.0023 | 0.0971 ± 0.0012 | 0.0943 ± 0.0009 |
| Adult | 0.4826 ± 0.0027 | 0.4709 ± 0.0034 | 0.3971 ± 0.0014 | 0.3931 ± 0.0019 |
| Blog | 0.4191 ± 0.0061 | 0.3728 ± 0.0034 | 0.3193 ± 0.0021 | 0.3198 ± 0.0043 |

Table 3. Quantitative results on text infilling. The reference text for computing the corpus BLEU is the combination of WT103 and TBC. DULA and DMALA generate sentences with a similar level of quality as Gibbs and GWG (measured by corpus BLEU) while attaining higher diversity (measured by self-BLEU and unique n-grams). The six middle columns report unique n-grams (%) (↑) against the generated set itself (Self), WT103, and TBC.

| Model | Method | Self-BLEU (↓) | Self, n=2 | Self, n=3 | WT103, n=2 | WT103, n=3 | TBC, n=2 | TBC, n=3 | Corpus BLEU (↑) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bert-Base | Gibbs | 86.84 | 10.98 | 16.08 | 18.57 | 32.21 | 21.22 | 33.05 | 23.82 |
| Bert-Base | GWG | 81.97 | 15.12 | 21.79 | 22.76 | 37.59 | 24.72 | 37.98 | 22.84 |
| Bert-Base | DULA | 72.37 | 23.33 | 32.88 | 27.74 | 45.85 | 30.02 | 46.75 | 21.82 |
| Bert-Base | DMALA | 72.59 | 23.26 | 32.64 | 27.99 | 45.77 | 30.32 | 46.49 | 21.85 |
| Bert-Large | Gibbs | 88.78 | 9.31 | 13.74 | 17.78 | 30.50 | 20.48 | 31.23 | 22.57 |
| Bert-Large | GWG | 86.50 | 11.03 | 16.13 | 19.25 | 33.20 | 21.42 | 33.54 | 23.08 |
| Bert-Large | DULA | 77.96 | 17.97 | 26.64 | 23.69 | 41.30 | 26.18 | 42.14 | 21.28 |
| Bert-Large | DMALA | 76.27 | 19.83 | 28.48 | 25.38 | 42.94 | 27.87 | 43.77 | 21.73 |

The energy function is defined as

$$U(\theta) = -\sum_{i=1}^N \left\|f_\theta(x_i) - y_i\right\|^2,$$

where $D = \{x_i, y_i\}_{i=1}^N$ is the training dataset and $f_\theta$ represents a two-layer neural network with Tanh activation and 500 hidden neurons.
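A sketch of this posterior energy on synthetic data, with binary weights drawn at random; the layer sizes, input dimension, and all names are our own illustrative choices (no bias terms or likelihood scaling included):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, h = 64, 10, 500            # data points, input features, hidden neurons
X = rng.normal(size=(n, p))
y = rng.normal(size=(n, 1))

def unpack(theta):
    """Split a flat {-1,+1}^d weight vector into the two weight matrices."""
    W1 = theta[: p * h].reshape(p, h)
    W2 = theta[p * h :].reshape(h, 1)
    return W1, W2

def f(theta, x):
    """Two-layer network with Tanh activation and binary weights."""
    W1, W2 = unpack(theta)
    return np.tanh(x @ W1) @ W2

def U(theta):
    """Energy of the unnormalized posterior: U = -sum_i ||f(x_i) - y_i||^2,
    so pi(theta) ∝ exp(U(theta)) favors low training error."""
    return -np.sum((f(theta, X) - y) ** 2)

d = p * h + h                    # dimension of the discrete weight vector
theta = rng.choice([-1.0, 1.0], size=d)
assert U(theta) <= 0.0           # energy is a negated squared error
```

Replacing `X`, `y` with a minibatch in `U` gives the stochastic-gradient variant of Section 6 for this task.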
The dimension of the weight, d, varies across datasets, ranging from 7,500 to 45,000. We report the log-likelihood on the training set together with the root mean-square error (RMSE) on the test set.

Results From Table 2, we observe that DMALA and DULA outperform the other methods significantly on all datasets except for test RMSE on COMPAS, which we hypothesize is due to overfitting (the training set only has 4,900 data points). These results demonstrate that our methods converge fast on high-dimensional distributions, owing to their ability to make large moves per iteration, and suggest that they are compelling for training low-precision Bayesian neural networks whose weights are discrete.

7.5. Text Infilling

Text infilling is an important and intriguing task whose goal is to fill in the blanks given the context (Zhu et al., 2019; Donahue et al., 2020). Prior work has approached it by sampling from a categorical distribution produced by BERT (Devlin et al., 2019; Wang & Cho, 2019). However, the huge number of possible word combinations makes sampling from this categorical distribution difficult, so an efficient sampler is needed to produce high-quality text. We randomly sample 20 sentences from TBC (Zhu et al., 2015) and WikiText-103 (Merity et al., 2017), mask 25% of the words in each sentence (Zhu et al., 2019; Donahue et al., 2020), and sample 25 sentences from the probability distribution given by BERT. We run all samplers for 50 steps on two models, BERT-base and BERT-large. As is common practice in non-autoregressive text generation, we select the top-5 sentences with the highest likelihood out of the 25 to avoid low-quality generations (Gu et al., 2018; Zhou et al., 2019). We evaluate the methods from two perspectives, diversity and quality. For diversity, we use self-BLEU (Zhu et al., 2018) and the number of unique n-grams (Wang & Cho, 2019) to measure the difference between the generated sentences.
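For concreteness, a plain sketch of the unique n-gram diversity measure (the fraction of distinct n-grams among all n-grams produced) is below; this is an illustrative implementation, not the exact evaluation script used in the paper.

```python
def ngrams(tokens, n):
    # all contiguous n-grams of a token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unique_ngram_pct(sentences, n):
    # percentage of distinct n-grams among all n-grams across a set of
    # generated sentences (higher = more diverse)
    all_ngrams = []
    for s in sentences:
        all_ngrams.extend(ngrams(s.split(), n))
    if not all_ngrams:
        return 0.0
    return 100.0 * len(set(all_ngrams)) / len(all_ngrams)

gens = ["all of which he started", "two of which he led"]
print(unique_ngram_pct(gens, 2))  # -> 75.0 (6 distinct of 8 bigrams)
```

When all samplers produce the same sentence repeatedly, this percentage collapses, which is exactly the failure mode the metric is meant to expose.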
For quality, we measure the BLEU score (Papineni et al., 2002) between the generated texts and the original datasets (TBC + WikiText-103) (Wang & Cho, 2019; Yu et al., 2017). Note that the BLEU score can only approximately represent the quality of the generation, since it cannot handle out-of-distribution generations. We use one-hot vectors to represent categorical variables; we discuss the practical use of DLP with categorical variables in Appendix B.

Results The quantitative and qualitative results are shown in Table 3 and Figure 6. We find that DULA and DMALA can fill in the blanks with similar quality as Gibbs and GWG but with much higher diversity. Due to the nature of languages,

Figure 6 (first example):
Infilling Task: "it also marked the first time the dodgers had won six straight opening day games , [MASK] of which he [MASK] ."
- Gibbs: "all of which he started" (x4); "four of which he started"
- GWG: "five of which he started" (x2); "four of which he started"; "four of which he pitched"; "two of which he started"
- DMALA: "two of which he started"; "one of which he started"; "three of which he led"; "all of which he selected"; "six of which he missed"

Figure 6 (second example):
Infilling Task: "he had not , after all , [MASK] me the chance but [MASK] abandoned me [MASK] ."
- Gibbs: "given me the chance but had abandoned me instead" (x3); "given me the chance but had abandoned me completely"; "given me the chance but had abandoned me anyway"
- GWG: "given me the chance but had abandoned me instead"; "given me the chance but had abandoned me himself"; "offered me the chance but had abandoned me completely"; "gave me the chance but had abandoned me anyway"; "given me the chance but he abandoned me instead"
- DMALA: "shown me the chance but had abandoned me anyway"; "shown me the chance but not abandoned me immediately"; "gives me the chance but also abandoned me perhaps"; "grants me the chance but really abandoned me entirely"; "offered me the chance but yet abandoned me instead"

Figure 6. Examples of the generated sentences on text infilling. Blue and red words are generated, where red indicates repetitive generation. DMALA generates semantically meaningful sentences with much higher diversity.

there exist strong correlations among words. It is generally difficult to change one word while keeping the others fixed and still fit the context; however, there may well be another combination of words, all different from the current ones, that satisfies the infilling. Because of this, the ability to update all coordinates in one step makes our methods especially suitable for this task, as reflected in both the evaluation metrics and the generated sentences.

8. Conclusion

We propose a Langevin-like proposal for efficiently sampling from complex high-dimensional discrete distributions. Our method, the discrete Langevin proposal (DLP), explores discrete structures effectively based on gradient information. For different usage scenarios, we have developed several variants of DLP, including unadjusted, Metropolis-adjusted, stochastic, and preconditioned versions. We prove the asymptotic convergence of DLP without the MH step for log-quadratic and general distributions. Empirical results on many different problems demonstrate the superiority of our method over baselines in general settings.
While the Langevin algorithm has achieved great success in continuous spaces, discrete spaces have long lacked a comparably simple, effective, and general-purpose sampler. We hope our method sheds light on building practical and accurate samplers for discrete distributions.

Acknowledgements

We would like to thank Yingzhen Li and the anonymous reviewers for their thoughtful comments on the manuscript. This research is supported by NSF CAREER-1846421, SenSE-2037267, EAGER-2041327, the Office of Naval Research, and the NSF AI Institute for Foundations of Machine Learning (IFML).

References

Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. Handbook of Markov Chain Monte Carlo. CRC Press, 2011.

Buza, K. Feedback prediction for blogs. In Data Analysis, Machine Learning and Knowledge Discovery, pp. 145–152. Springer, 2014.

Cho, G. E. and Meyer, C. D. Comparison of perturbation bounds for the stationary distribution of a Markov chain. Linear Algebra and its Applications, 335(1-3):137–150, 2001.

Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and Bengio, Y. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1. arXiv preprint arXiv:1602.02830, 2016.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. Association for Computational Linguistics, 2019.

Donahue, C., Lee, M., and Liang, P. Enabling language models to fill in the blanks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2492–2501, 2020.

Du, Y. and Mordatch, I. Implicit generation and generalization in energy-based models. Advances in Neural Information Processing Systems, 2019.

Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

Durmus, A. and Moulines, E.
High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.

Grathwohl, W., Wang, K.-C., Jacobsen, J.-H., Duvenaud, D., Norouzi, M., and Swersky, K. Your classifier is secretly an energy based model and you should treat it like one. In International Conference on Learning Representations, 2019.

Grathwohl, W., Swersky, K., Hashemi, M., Duvenaud, D., and Maddison, C. J. Oops I took a gradient: Scalable sampling for discrete distributions. International Conference on Machine Learning, 2021.

Grenander, U. and Miller, M. I. Representations of knowledge in complex systems. Journal of the Royal Statistical Society: Series B (Methodological), 56(4):549–581, 1994.

Gu, J., Bradbury, J., Xiong, C., Li, V. O., and Socher, R. Non-autoregressive neural machine translation. In International Conference on Learning Representations, 2018.

Han, J., Ding, F., Liu, X., Torresani, L., Peng, J., and Liu, Q. Stein variational inference for discrete distributions. In International Conference on Artificial Intelligence and Statistics, pp. 4563–4572. PMLR, 2020.

Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications. 1970.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hernández-Lobato, J. M. and Adams, R. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pp. 1861–1869. PMLR, 2015.

Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Angwin, J., Larson, J., Mattu, S., and Kirchner, L. Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks. ProPublica, 2016.

Jaini, P., Nielsen, D., and Welling, M.
Sampling in combinatorial spaces with SurVAE flow augmented MCMC. In International Conference on Artificial Intelligence and Statistics, pp. 3349–3357. PMLR, 2021.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.

LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.

Levin, D. A. and Peres, Y. Markov Chains and Mixing Times, volume 107. American Mathematical Society, 2017.

Li, C., Chen, C., Carlson, D., and Carin, L. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Liu, X., Tong, X., and Liu, Q. Sampling with trustworthy constraints: A variational gradient framework. Advances in Neural Information Processing Systems, 34, 2021a.

Liu, X., Ye, M., Zhou, D., and Liu, Q. Post-training quantization with multiple points: Mixed precision without mixed precision. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 8697–8705, 2021b.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. International Conference on Learning Representations, 2017.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

Neal, R. M. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.

Neal, R. M. et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

Nishimura, A., Dunson, D. B., and Lu, J. Discontinuous Hamiltonian Monte Carlo for discrete parameters and discontinuous likelihoods. Biometrika, 107(2):365–380, 2020.

Pakman, A. and Paninski, L. Auxiliary-variable exact Hamiltonian Monte Carlo samplers for binary distributions. Advances in Neural Information Processing Systems, 2013.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, 2002.

Peters, J. W. and Welling, M. Probabilistic binary neural networks. arXiv preprint arXiv:1809.03368, 2018.

Power, S. and Goldman, J. V. Accelerated sampling on discrete spaces with non-reversible Markov processes. arXiv preprint arXiv:1912.04681, 2019.

Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Springer, 2016.

Roberts, G. O. and Stramer, O. Langevin diffusions and Metropolis-Hastings algorithms. Methodology and Computing in Applied Probability, 4(4):337–357, 2002.

Roberts, G. O. and Tweedie, R. L. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pp. 341–363, 1996.

Sansone, E. LSB: Local self-balancing MCMC in discrete spaces. arXiv preprint arXiv:2109.03867, 2021.

Schweitzer, P. J. Perturbation theory and finite Markov chains. Journal of Applied Probability, 5(2):401–413, 1968.

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 2019.

Sun, H., Dai, H., Xia, W., and Ramamurthy, A. Path auxiliary proposal for MCMC in discrete space. In International Conference on Learning Representations, 2022.

Tieleman, T. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, pp. 1064–1071, 2008.

Tieleman, T. and Hinton, G. Using fast weights to improve persistent contrastive divergence.
In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1033–1040, 2009.

Titsias, M. K. and Yau, C. The Hamming ball sampler. Journal of the American Statistical Association, 112(520):1598–1611, 2017.

Tomczak, J. and Welling, M. VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics, pp. 1214–1223. PMLR, 2018.

Wang, A. and Cho, K. BERT has a mouth, and it must speak: BERT as a Markov random field language model. arXiv preprint arXiv:1902.04094, 2019.

Wang, J., Huda, A., Lunyak, V. V., and Jordan, I. K. A Gibbs sampling strategy applied to the mapping of ambiguous short-sequence tags. Bioinformatics, 26(20):2501–2508, 2010.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, pp. 681–688. Citeseer, 2011.

Yu, L., Zhang, W., Wang, J., and Yu, Y. SeqGAN: Sequence generative adversarial nets with policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.

Zanella, G. Informed proposals for local MCMC in discrete spaces. Journal of the American Statistical Association, 115(530):852–865, 2020.

Zhang, D., Malkin, N., Liu, Z., Volokhova, A., Courville, A., and Bengio, Y. Generative flow networks for discrete probabilistic modeling. arXiv preprint arXiv:2202.01361, 2022.

Zhang, R., Li, C., Zhang, J., Chen, C., and Wilson, A. G. Cyclical stochastic gradient MCMC for Bayesian deep learning. In International Conference on Learning Representations, 2020.

Zhou, C., Gu, J., and Neubig, G. Understanding knowledge distillation in non-autoregressive machine translation. In International Conference on Learning Representations, 2019.

Zhou, G. Mixed Hamiltonian Monte Carlo for mixed discrete and continuous variables. Advances in Neural Information Processing Systems, 33:17094–17104, 2020.

Zhu, W., Hu, Z., and Xing, E. Text infilling.
arXiv preprint arXiv:1901.00158, 2019.

Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pp. 19–27, 2015.

Zhu, Y., Lu, S., Zheng, L., Guo, J., Zhang, W., Wang, J., and Yu, Y. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1097–1100, 2018.

A. Algorithm with Binary Variables

When the variable domain Θ is binary, {0, 1}^d, we can simplify Algorithm 1 further and obtain Algorithm 2, which shows clearly that our method can be computed cheaply in parallel on CPUs and GPUs.

Algorithm 2 DULA and DMALA with Binary Variables.
given: stepsize α
loop
  compute P(θ) = exp(−½∇U(θ) ⊙ (2θ − 1) − 1/(2α)) / (exp(−½∇U(θ) ⊙ (2θ − 1) − 1/(2α)) + 1)  (elementwise)
  sample µ ∼ Unif(0, 1)^d
  I ← {i : µ_i < P(θ)_i}
  θ′ ← θ with the coordinates in I flipped
  Optionally, do the MH step:
    compute q(θ′|θ) = ∏_i q_i(θ′_i|θ) = ∏_{i∈I} P(θ)_i ∏_{i∉I} (1 − P(θ)_i)
    compute P(θ′) = exp(−½∇U(θ′) ⊙ (2θ′ − 1) − 1/(2α)) / (exp(−½∇U(θ′) ⊙ (2θ′ − 1) − 1/(2α)) + 1)
    compute q(θ|θ′) = ∏_i q_i(θ_i|θ′) = ∏_{i∈I} P(θ′)_i ∏_{i∉I} (1 − P(θ′)_i)
    set θ ← θ′ with probability min{1, exp(U(θ′) − U(θ)) q(θ|θ′)/q(θ′|θ)}
end loop
output: samples {θ^k}

B. Algorithm with Categorical Variables

When using one-hot vectors to represent categorical variables, our discrete Langevin proposal becomes

q(θ′_i | θ) = Categorical(Softmax(½∇U(θ)_iᵀ(θ′_i − θ_i) − ‖θ′_i − θ_i‖²₂/(2α))),

where θ_i, θ′_i are one-hot vectors. However, if the variables are ordinal with clear ordering information, we can also use the integer representation θ ∈ {0, 1, ..., S − 1}^d and compute DLP as in Equation (2).

C. Extension to Infinite Domains

We only consider finite discrete distributions in the paper. However, our algorithms can easily be extended to infinite distributions.
One way to do so is to add a window size to the proposal, so that it only considers a local region in each step. For example, if the domain is the infinite integers Θ = {0, 1, 2, ...}^d, then DLP can use a proposal window of θ′_i ∈ {θ_i − 1, θ_i, θ_i + 1}. Other window sizes are also possible. We leave a comprehensive study of our method on infinite discrete distributions for future work.

D. Importance of the Term ‖θ′ − θ‖²/(2α)

Similar to DLP, we can also apply a first-order Taylor series approximation to Equation (4) and get the following proposal

q(θ′_i | θ) = Categorical(Softmax(½∇U(θ)_i(θ′_i − θ_i))).

The above proposal does not contain the stepsize term as in DLP, which results in a very low acceptance probability in practice. For example, in the RBM experiment (Section 7.2), its acceptance probability is 0.005% while DMALA's acceptance probability is 51% on average. This demonstrates the importance of the term ‖θ′ − θ‖²/(2α) in DLP, where the stepsize α controls the closeness of θ′ to the current θ. The effect of α is similar to the stepsize in the standard Langevin algorithm; by tuning it, our method can achieve a desirable acceptance probability, leading to efficient sampling.

E. Proof of Theorem 5.1

Proof. We divide the proof into two parts: in the first part we prove the weak convergence, and in the second we prove the convergence rate with respect to the stepsize α.

Weak Convergence. The main idea of the proof is to replace the gradient term in the proposal by the energy difference U(θ′) − U(θ) using a Taylor series approximation, and then show the reversibility of the chain based on the proof of Theorem 1 in Zanella (2020). Recall that the target distribution is π(θ) = exp(θᵀWθ + bᵀθ)/Z. We have

∇U(θ) = 2Wθ + b,  ∇²U(θ) = 2W.
Since ∇²U is constant, we can rewrite the proposal distribution as

qα(θ′|θ) = exp(½∇U(θ)ᵀ(θ′ − θ) − ‖θ′ − θ‖²/(2α)) / Σ_x exp(½∇U(θ)ᵀ(x − θ) − ‖x − θ‖²/(2α))
= exp(½∇U(θ)ᵀ(θ′ − θ) + ½(θ′ − θ)ᵀW(θ′ − θ) − (θ′ − θ)ᵀ((1/(2α))I + ½W)(θ′ − θ)) / Σ_x exp(½∇U(θ)ᵀ(x − θ) + ½(x − θ)ᵀW(x − θ) − (x − θ)ᵀ((1/(2α))I + ½W)(x − θ))
= exp(½(U(θ′) − U(θ)) − (θ′ − θ)ᵀ((1/(2α))I + ½W)(θ′ − θ)) / Σ_x exp(½(U(x) − U(θ)) − (x − θ)ᵀ((1/(2α))I + ½W)(x − θ)),

where the last equality holds because U(θ′) − U(θ) = ∇U(θ)ᵀ(θ′ − θ) + ½(θ′ − θ)ᵀ2W(θ′ − θ) by the Taylor series expansion (which is exact here, since U is quadratic).

Let Zα(θ) = Σ_x exp(½(U(x) − U(θ)) − (x − θ)ᵀ((1/(2α))I + ½W)(x − θ)) and πα(θ) = Zα(θ)π(θ) / Σ_x Zα(x)π(x); we now show that qα is reversible w.r.t. πα. We have

πα(θ)qα(θ′|θ) = [Zα(θ)π(θ) / Σ_x Zα(x)π(x)] · exp(½(U(θ′) − U(θ)) − (θ′ − θ)ᵀ((1/(2α))I + ½W)(θ′ − θ)) / Zα(θ)
= exp(½(U(θ′) + U(θ)) − (θ′ − θ)ᵀ((1/(2α))I + ½W)(θ′ − θ)) / (Z Σ_x Zα(x)π(x)).

This expression is clearly symmetric in θ and θ′. Therefore qα is reversible and its stationary distribution is πα.

Now we prove that πα converges weakly to π as α → 0. Notice that, for any θ,

exp(½(U(x) − U(θ)) − (x − θ)ᵀ((1/(2α))I + ½W)(x − θ)) → δ_θ(x) as α → 0,

where δ_θ(x) is a Dirac delta. It follows that πα converges pointwise to π(θ). By Scheffé's lemma, πα converges weakly to π.

Convergence Rate w.r.t. Stepsize. We consider the convergence rate in terms of the L1-norm

‖πα − π‖₁ = Σ_θ | Zα(θ)π(θ)/Σ_x Zα(x)π(x) − π(θ) |.
We write out each absolute-value term:

| Zα(θ)π(θ)/Σ_x Zα(x)π(x) − π(θ) | = π(θ) | Zα(θ)/Σ_x Zα(x)π(x) − 1 |,

where

Zα(θ) = 1 + Σ_{x≠θ} exp(½(U(x) − U(θ)) − (x − θ)ᵀ((1/(2α))I + ½W)(x − θ)),
Σ_x Zα(x)π(x) = 1 + Σ_y (1/Z)exp(U(y)) Σ_{x≠y} exp(½(U(x) − U(y)) − (x − y)ᵀ((1/(2α))I + ½W)(x − y)).

Since λmin(W)‖x‖² ≤ xᵀWx for all x, it follows that

(x − θ)ᵀ((1/(2α))I + ½W)(x − θ) ≥ ((1 + αλmin(W))/(2α)) ‖x − θ‖².

We also notice that min_{x≠θ} ‖x − θ‖² = 1, so every term with x ≠ θ in the sums above is at most exp(½(U(x) − U(θ))) exp(−(1 + αλmin(W))/(2α)). Thus, when Zα(θ)/Σ_x Zα(x)π(x) − 1 > 0, bounding the numerator sum in this way and dropping the nonnegative sum in the denominator gives

| Zα(θ)π(θ)/Σ_x Zα(x)π(x) − π(θ) | ≤ π(θ) exp(−(1 + αλmin(W))/(2α)) Σ_x exp(U(x)) = π(θ) Z exp(−(1 + αλmin(W))/(2α)).

A similar argument handles the case Zα(θ)/Σ_x Zα(x)π(x) − 1 < 0 and yields the same bound. Therefore, the difference between πα and π can be bounded as

‖πα − π‖₁ ≤ Σ_θ π(θ) Z exp(−(1 + αλmin(W))/(2α)) = Z exp(−(1 + αλmin(W))/(2α)). □

F. Proof of Theorem 5.2

Proof. We use a log-quadratic distribution that is close to π as an intermediate term to bound the bias of DULA. Recall that π is the target distribution, π′ is the log-quadratic distribution close to π, and πα is the stationary distribution of DULA. Let π′α be the stationary distribution of DULA targeting π′; then, by the triangle inequality,

‖πα − π‖₁ ≤ ‖πα − π′α‖₁ + ‖π′α − π′‖₁ + ‖π′ − π‖₁.

Bounding ‖π′ − π‖₁. Let the energy function of π′ be V(θ) = θᵀWθ + bᵀθ. Since Θ is a discrete space, there exists a bounded subset Ω ⊂ ℝ^d containing Θ. By the Poincaré inequality, we get

|U(θ) − V(θ)| ≤ C₁‖∇U(θ) − (2Wθ + b)‖₁ ≤ C₁ε,  ∀θ ∈ Θ,

where the constant C₁ depends on Ω. Recall that π(θ) = exp(U(θ))/Z.
Let π′(θ) = exp(V(θ))/Z′, where Z′ is the normalizing constant making π′ a distribution. We notice that, for all θ,

exp(U(θ))/exp(V(θ)) = exp(U(θ) − V(θ)) ≤ exp(|U(θ) − V(θ)|) ≤ exp(C₁ε),

and similarly exp(V(θ))/exp(U(θ)) ≤ exp(C₁ε). Therefore, for all θ,

| exp(U(θ))/Z − exp(V(θ))/Z′ | ≤ (exp(2C₁ε) − 1) max{exp(U(θ))/Z, exp(V(θ))/Z′} ≤ (exp(2C₁ε) − 1)(exp(U(θ))/Z + exp(V(θ))/Z′).

Plugging the above into Equation (7) and summing over θ, we obtain

‖π − π′‖₁ ≤ 2(exp(2C₁ε) − 1).

Bounding ‖π′α − π′‖₁. By Theorem 5.1, we know

‖π′α − π′‖₁ ≤ Z′ exp(−(1 + αλmin(W))/(2α)).

Bounding ‖πα − π′α‖₁. We now bound ‖πα − π′α‖₁ by perturbation bounds. Let Tα, T′α be the transition matrices of DULA on π and π′ respectively, and consider the difference between these two matrices:

‖Tα − T′α‖ = max_θ Σ_{θ′} | exp(½∇U(θ)ᵀ(θ′ − θ) − ‖θ′ − θ‖²/(2α))/Zα(θ) − exp(½∇V(θ)ᵀ(θ′ − θ) − ‖θ′ − θ‖²/(2α))/Z′α(θ) |,

where Zα(θ) = Σ_x exp(½∇U(θ)ᵀ(x − θ) − ‖x − θ‖²/(2α)) and Z′α(θ) = Σ_x exp(½∇V(θ)ᵀ(x − θ) − ‖x − θ‖²/(2α)).

We denote D = max_{θ,θ′∈Θ} ‖θ − θ′‖∞. By the assumption ‖∇U(θ) − ∇V(θ)‖₁ ≤ ε, similar to Equation (8), we have

| exp(½∇U(θ)ᵀ(θ′ − θ) − ‖θ′ − θ‖²/(2α))/Zα(θ) − exp(½∇V(θ)ᵀ(θ′ − θ) − ‖θ′ − θ‖²/(2α))/Z′α(θ) |
≤ (exp(Dε) − 1) ( exp(½∇U(θ)ᵀ(θ′ − θ) − ‖θ′ − θ‖²/(2α))/Zα(θ) + exp(½∇V(θ)ᵀ(θ′ − θ) − ‖θ′ − θ‖²/(2α))/Z′α(θ) ).

Substituting this into ‖Tα − T′α‖ yields

‖Tα − T′α‖ ≤ 2(exp(Dε) − 1).

By the perturbation bound in Schweitzer (1968),

‖πα − π′α‖₁ ≤ C₂‖Tα − T′α‖ = 2C₂(exp(Dε) − 1),

where C₂ is a constant depending on π′ and α. Please note that it is also possible to use other perturbation bounds (Cho & Meyer, 2001).

Combining these three bounds, we get

‖πα − π‖₁ ≤ 2C₂(exp(Dε) − 1) + Z′ exp(−(1 + αλmin(W))/(2α)) + 2(exp(2C₁ε) − 1).

Defining c₁ := max(2, 2C₂) and c₂ := max(2C₁, D), we reach the final result

‖πα − π‖₁ ≤ 2c₁(exp(c₂ε) − 1) + Z′ exp(−(1 + αλmin(W))/(2α)). □

G. Proof of Theorem 6.1

Proof. The proposal distribution for coordinate i with the stochastic gradient is

q̂_i(θ′_i|θ) = Categorical(Softmax(½∇Û(θ)_i(θ′_i − θ_i) − (θ′_i − θ_i)²/(2α))).

We consider the binary case, i.e. Θ_i = {0, 1}.
When the current position is θ_i = 0, the probability of θ′_i is

q̂_i(θ′_i = 0|θ) = 1/(exp(½∇Û(θ)_i − 1/(2α)) + 1),  q̂_i(θ′_i = 1|θ) = 1/(exp(−½∇Û(θ)_i + 1/(2α)) + 1).

The difference between E[q̂_i] and q_i is

‖E[q̂_i] − q_i‖₁ = Σ_{θ′_i ∈ {0,1}} | E[q̂_i(θ′_i|θ)] − q_i(θ′_i|θ) |.

We consider each absolute-value term. When θ′_i = 0,

| E[q̂_i(θ′_i = 0|θ)] − q_i(θ′_i = 0|θ)| = | E[ 1/(exp(½∇Û(θ)_i − 1/(2α)) + 1) ] − 1/(exp(½∇U(θ)_i − 1/(2α)) + 1) |.

The absolute value of the derivative of f(x) = 1/(exp(x) + 1) is exp(x)/(exp(x) + 1)², which is monotonically increasing for x ∈ (−∞, 0] and monotonically decreasing for x ∈ (0, ∞). Then, by the assumptions |∇Û(θ)_i| ≤ L and |∇U(θ)_i| ≤ L and by Lipschitz continuity,

| E[q̂_i(θ′_i = 0|θ)] − q_i(θ′_i = 0|θ)| ≤ max{ exp(−1/(2α) + L)/(exp(−1/(2α) + L) + 1)², exp(−1/(2α) − L)/(exp(−1/(2α) − L) + 1)² } · E|∇Û(θ)_i − ∇U(θ)_i| ≤ exp(−1/(2α) + L) · E|∇Û(θ)_i − ∇U(θ)_i| ≤ σ exp(−1/(2α) + L),

where the last inequality follows from our assumption on the variance of the stochastic gradient, Var[∇Û(θ)_i] ≤ σ², which gives

(E|∇Û(θ)_i − ∇U(θ)_i|)² ≤ E(∇Û(θ)_i − ∇U(θ)_i)² = Var[∇Û(θ)_i] ≤ σ².

Similarly, when θ′_i = 1,

| E[q̂_i(θ′_i = 1|θ)] − q_i(θ′_i = 1|θ)| ≤ max{ exp(−1/(2α) − L)/(exp(−1/(2α) − L) + 1)², exp(−1/(2α) + L)/(exp(−1/(2α) + L) + 1)² } · E|∇Û(θ)_i − ∇U(θ)_i| ≤ σ exp(−1/(2α) + L).

Substituting the two bounds into the L1 distance gives

‖E[q̂_i] − q_i‖₁ ≤ 2σ exp(−1/(2α) + L).

The same analysis applies when the current position is θ_i = 1. To see this, we write out the proposal distribution in that case:

q̂_i(θ′_i = 0|θ) = 1/(exp(½∇Û(θ)_i + 1/(2α)) + 1),  q̂_i(θ′_i = 1|θ) = 1/(exp(−½∇Û(θ)_i − 1/(2α)) + 1).

The remainder follows exactly as in the case θ_i = 0. □

H. Preconditioned Discrete Langevin Proposal

Our discrete Langevin proposal can easily be combined with diagonal preconditioners, as in the standard Langevin proposal. We consider the case where the preconditioner is constant. Assume the constant diagonal preconditioner is G = diag(g) with g ∈ ℝ^d; we can then derive the preconditioned discrete Langevin proposal by a simple coordinate transformation.
Let the new coordinates be η_i := g_i⁻¹θ_i, so the domain for η_i is H_i = {g_i⁻¹θ_i : θ_i ∈ Θ_i}. The discrete Langevin proposal for η_i is

q(η′_i|η) = exp((g_i/2)∇U(Gη)_i(η′_i − η_i) − (η′_i − η_i)²/(2α)) / Σ_{x_i∈H_i} exp((g_i/2)∇U(Gη)_i(x_i − η_i) − (x_i − η_i)²/(2α)).

Letting η′_i = g_i⁻¹θ′_i and x_i = g_i⁻¹y_i, the above proposal distribution is equivalent to

q(θ′_i|θ) = exp((g_i/2)∇U(θ)_i(g_i⁻¹θ′_i − g_i⁻¹θ_i) − (g_i⁻¹θ′_i − g_i⁻¹θ_i)²/(2α)) / Σ_{y_i∈Θ_i} exp((g_i/2)∇U(θ)_i(g_i⁻¹y_i − g_i⁻¹θ_i) − (g_i⁻¹y_i − g_i⁻¹θ_i)²/(2α))
= exp(½∇U(θ)_i(θ′_i − θ_i) − (θ′_i − θ_i)²/(2αg_i²)) / Σ_{y_i∈Θ_i} exp(½∇U(θ)_i(y_i − θ_i) − (y_i − θ_i)²/(2αg_i²)),

i.e. DLP with a per-coordinate stepsize αg_i².

Example. To demonstrate the use of the preconditioned discrete Langevin proposal, consider an Ising model U(θ) = θᵀWθ with W = diag(0.001, 1000) and θ ∈ {−1, 1}². Due to the drastically different scales of the derivatives of the two coordinates, no single stepsize works for both. However, if we use the preconditioned proposal above with α = 1, g₁ = 1000 and g₂ = 0.001, the Markov chain mixes quickly. For example, in terms of the log RMSE between the estimated mean and the true mean, preconditioned DLP achieves −6.2, whereas standard DLP with any α only achieves around −0.34.

I. Additional Experimental Results and Setting Details

I.1. Ising Model for Theory Verification

To verify Theorem 5.1, we use a 2 by 2 Ising model U(θ) = cθᵀWθ + bᵀθ, where W is the binary adjacency matrix, with c = 0.1 and b = 0.2. To verify Theorem 5.2, we set c = 1 and b = 0.1.

I.2. Sampling from Ising Models

We adopt the experiment setting in Section F.1 of Grathwohl et al. (2021). The energy of the Ising model is defined as

log p(x) = axᵀJx + bᵀx,

where a controls the connectivity strength and J is an adjacency matrix whose elements are either 0 or 1. In our experiments, J is the adjacency matrix of a 2D lattice graph. DMALA and DULA use stepsizes α of 0.4 and 0.2 respectively. We show the results with varying connectivity strength a.
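As an illustration of how DLP applies to an Ising energy of this form, here is a minimal sketch of one DULA step, using the binary-case flip probabilities of Algorithm 2 (Appendix A) with spins s ∈ {−1, 1} encoded as θ ∈ {0, 1}. The lattice size, coefficients, and stepsize below are illustrative choices, not the experiment's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
# hypothetical 1-D lattice adjacency J and coefficients, for illustration
J = np.zeros((d, d))
for i in range(d - 1):
    J[i, i + 1] = J[i + 1, i] = 1.0
a, b, alpha = 0.1, 0.2, 0.2

def grad_U(theta):
    # U(theta) = a s^T J s + b 1^T s with s = 2*theta - 1,
    # so dU/dtheta = 2 * (2a J s + b)
    s = 2.0 * theta - 1.0
    return 2.0 * (2.0 * a * (J @ s) + b)

def dula_step(theta):
    # per-coordinate flip probability (Algorithm 2):
    # sigmoid(-1/2 * grad_i * (2 theta_i - 1) - 1/(2 alpha))
    z = -0.5 * grad_U(theta) * (2.0 * theta - 1.0) - 1.0 / (2.0 * alpha)
    p_flip = 1.0 / (1.0 + np.exp(-z))
    flips = rng.uniform(size=d) < p_flip
    return np.where(flips, 1.0 - theta, theta)

theta = rng.integers(0, 2, size=d).astype(float)
for _ in range(100):
    theta = dula_step(theta)
print(theta)  # a sample in {0, 1}^d
```

Note that all d coordinates are proposed in parallel from one gradient evaluation; adding the MH correction of Algorithm 2 on top of this step would give DMALA.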
We run all methods for the same amount of time (GWG-1 and DMALA for 50,000 iterations; Gibbs-1 and DULA for 125,000 iterations). The results are shown in Figure 7. We can clearly observe that DMALA outperforms the other methods for all connectivity strengths.

I.3. Sampling from RBMs

The RBM used in our experiments has 500 hidden units and 784 visible units. The model is trained using contrastive divergence (Hinton, 2002) with a batch size of 100, optimized by Adam with a learning rate of 0.001. The stepsize α is set to 0.1 for DULA and 0.2 for DMALA, respectively. The Block-Gibbs sampler is described in detail by Grathwohl et al. (2021); here, we give a brief introduction. An RBM defines a distribution over data x and hidden variables h by

log p(x, h) = hᵀWx + bᵀx + cᵀh − log Z.

[Figure 7. Ising models with varying connectivity strength a; curves shown for Gibbs-1, GWG-1, DMALA and DULA.]

By marginalizing out h, we get

log p(x) = Σ_i Softplus(Wx + c)_i + bᵀx − log Z.

The joint and marginal distributions are both unnormalized and hard to sample from. However, the special structure of the problem leads to easy conditional distributions:

p(x|h) = Bernoulli(σ(Wᵀh + b)),  p(h|x) = Bernoulli(σ(Wx + c)).

Therefore, we can perform Block-Gibbs sampling by updating x and h alternately. This Block-Gibbs sampler can update all 784 coordinates of x at the same time and thus has much higher efficiency; however, this relies on the special structure of the problem. We provide the generated images in Figure 8, showing that the results of DULA and DMALA are closest to those of Block-Gibbs.

Figure 8. Generated samples from an RBM trained on MNIST ((a) Gibbs-1, (b) HB-10-1, (c) GWG-1, (d) DULA, (e) DMALA, (f) Block-Gibbs). Our methods (DULA and DMALA) generate images that are closer to the Block-Gibbs samples.

I.4.
Learning EBMs For all the experiments in this section, we set the stepsize α to 0.1 for DULA and 0.15 for DMALA.

Ising Model We construct a training dataset of 2,000 instances by running 1,000,000 steps of Gibbs sampling for each instance. The model is trained by Persistent Contrastive Divergence (Tieleman, 2008) with a buffer size of 256 samples. We use the Adam optimizer with a learning rate of 0.001 and a batch size of 256. We train all models with an ℓ1 penalty with a coefficient of 0.01 to encourage sparsity. The experiment setting is essentially the same as Section F.2 in Grathwohl et al. (2021).

Deep EBMs We adopt the same ResNet structure and experiment protocol as in Grathwohl et al. (2021): the network has 8 residual blocks with 64 feature maps, each residual block has 2 convolutional layers, and the network uses the Swish activation function (Ramachandran et al., 2017). For static/dynamic MNIST and Omniglot, we use a replay buffer with 10,000 samples; for Caltech, a replay buffer with 1,000 samples. We evaluate the models every 5,000 iterations by running AIS for 10,000 steps. The reported results are from the model that performs best on the validation set. The final reported numbers are generated by running 300,000 iterations of AIS. All models are trained with Adam (Kingma & Ba, 2015) with a learning rate of 0.0001 for 50,000 iterations. We show the images generated with DULA and DMALA in Figure 9. The generated images from DMALA are very close to the true images on all four datasets. DULA generates high-quality images on static and dynamic MNIST, but mediocre images on Omniglot and Caltech Silhouettes. We hypothesize that DULA needs to run for more steps per iteration to achieve better results.

Figure 9. Left to right: Static MNIST, Dynamic MNIST, Omniglot, Caltech Silhouettes. Top: DMALA. Bottom: DULA.

I.5.
Binary Bayesian Neural Networks

Details of the Datasets. (1) COMPAS: COMPAS (J. Angwin & Kirchner, 2016) is a dataset containing the criminal records of 6,172 individuals arrested in Florida. The task is to predict whether an individual will commit a crime again within 2 years. We use 13 attributes for prediction. (2) News: the Online News Popularity Data Set 2 contains 39,797 instances of statistics on articles published by a website. The task is to predict the number of clicks in the hours after an article is published. (3) Adult: the Adult Income Dataset 3 contains information on 44,920 US individuals from the 1994 census. The task is to predict whether an individual makes more than 50K dollars per year. (4) Blog: Blog Feedback (Buza, 2014) is a dataset containing 54,270 data points from blog posts; the raw HTML documents of the blog posts were crawled and processed. The prediction task associated with the data is the prediction of the number of comments in the upcoming 24 hours. Each data point has 276 features.

More Details on Training. We run 50 chains in parallel and collect the samples at the end of training. All datasets are randomly partitioned into 80% for training and 20% for testing. The features and the predictive targets are normalized to (0, 1). We set α = 0.1 for all datasets and use a uniform prior over the weights. We train the Bayesian neural network for 1,000 steps. We use full-batch training for the results in Section 7 so that Gibbs and GWG are also applicable.

Experimental Results of Stochastic DLP. As mentioned in Section 6, we empirically evaluate the performance of stochastic DLP here. We use batch sizes of 10, 100, and 500, and report the performance of the obtained binary Bayesian neural networks in Table 4. Note that Gibbs-based sampling techniques are not suitable for mini-batches.
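The mini-batch experiments below use the unadjusted sampler (DULA) with a stochastic gradient. The following is a minimal NumPy sketch of one DULA step for {0,1}-valued weights, not the authors' implementation: each coordinate's flip probability is the normalized categorical weight exp(½ ∇U(θ)_i (θ'_i − θ_i) − (θ'_i − θ_i)²/(2α)) against the "stay" option, so all coordinates move in parallel from one gradient evaluation. The gradient in the usage snippet is a random placeholder; in the stochastic variant it would be a mini-batch estimate of the log-posterior gradient.

```python
import numpy as np

def dula_step(theta, grad, alpha, rng):
    """One unadjusted discrete-Langevin (DULA) step for binary theta in {0,1}^d.

    Each coordinate i independently proposes a flip with probability proportional to
    exp(0.5 * grad[i] * (theta'_i - theta_i) - (theta'_i - theta_i)^2 / (2 * alpha)),
    normalized against staying (log-weight 0), so all d coordinates are updated
    in parallel from a single gradient evaluation.
    """
    diff = 1.0 - 2.0 * theta                              # theta'_i - theta_i if i flips
    log_w_flip = 0.5 * grad * diff - 1.0 / (2.0 * alpha)  # log-weight of flip vs. stay
    p_flip = 1.0 / (1.0 + np.exp(-log_w_flip))            # two-way categorical
    flips = rng.random(theta.shape) < p_flip
    return np.where(flips, 1.0 - theta, theta)

rng = np.random.default_rng(0)
theta = (rng.random(100) < 0.5).astype(float)
grad = rng.normal(size=100)  # placeholder for the (mini-batch) log-posterior gradient
theta = dula_step(theta, grad, alpha=0.1, rng=rng)
```

Note how the stepsize α controls the magnitude of change: as α → 0 the flip weight vanishes and the chain stays put, while a larger α lets many coordinates flip per step.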
We find that our method works quite well with mini-batches, and the performance generally increases as the batch size becomes larger.

Training Log-likelihood (↑):
Dataset | batchsize=10 | batchsize=100 | batchsize=500 | full-batch (DULA)
COMPAS | -0.4288 ± 0.0052 | -0.3689 ± 0.0027 | -0.3603 ± 0.0007 | -0.3590 ± 0.0019
News | -0.2355 ± 0.0002 | -0.2342 ± 0.0001 | -0.2344 ± 0.0006 | -0.2339 ± 0.0003
Adult | -0.3738 ± 0.0039 | -0.3275 ± 0.0006 | -0.3233 ± 0.0009 | -0.3287 ± 0.0027
Blog | -0.3199 ± 0.0072 | -0.2710 ± 0.0009 | -0.2689 ± 0.0024 | -0.2694 ± 0.0025

Test RMSE (↓):
Dataset | batchsize=10 | batchsize=100 | batchsize=500 | full-batch (DULA)
COMPAS | 0.4929 ± 0.0052 | 0.4864 ± 0.0037 | 0.4848 ± 0.0015 | 0.4879 ± 0.0024
News | 0.1072 ± 0.0013 | 0.0971 ± 0.0026 | 0.0976 ± 0.0009 | 0.0971 ± 0.0012
Adult | 0.4309 ± 0.0027 | 0.3963 ± 0.0016 | 0.3954 ± 0.0015 | 0.3971 ± 0.0014
Blog | 0.3533 ± 0.0043 | 0.3202 ± 0.0018 | 0.3198 ± 0.0026 | 0.3193 ± 0.0021

Table 4. Experimental results of binary Bayesian neural networks on different datasets. We examine the influence of batch size on the stochastic version of DLP.

2 https://archive.ics.uci.edu/ml/datasets/online+news+popularity
3 https://archive.ics.uci.edu/ml/datasets/adult

I.6. Text Infilling

We use the BERT model pre-trained in PyTorch-Pretrained-BERT 4. We use the same TBC and WikiText-103 corpora as in Wang & Cho (2019), randomly sampling 5,000 sentences from the complete TBC and WikiText-103 corpora. BERT is a masked language model: given a sentence with MASK tokens and context, the model provides a probability distribution over all candidate words, indicating which word each MASK position is likely to be. Suppose we have a sentence X = (x_1, x_2, . . . , x_n) with n words, m of which are MASKs. For each MASK[i], the BERT model outputs a vector f(· | X_MASK[i]) ∈ R^N, which is a normalized probability distribution over the candidate words, representing the probability that a certain word appears at MASK[i] given the context. The BERT model we use has N = 30,522 candidate words. Hence, θ ∈ {1, 2, . . .
, N}^m. Given the user-defined context and MASK positions, we set MASK[i] to θ_i, and the energy function is defined as

Σ_{i=1}^{m} log f(θ_i | X_MASK[i]).

There are N^m combinations in total, so directly sampling from this categorical distribution is intractable for m ≥ 2. More generated texts are displayed in Figure 10.

beginning in december 1934 , training exercises were conducted for the tetrarchs and their crews using hamilcar gliders .
beginning in march 1946 , training exercises were conducted by the tetrarchs and their crews with hamilcar gliders .
beginning in may 1926 , training exercises were conducted between the tetrarchs and their crews using hamilcar gliders .
beginning in late 1942 , training exercises were conducted with the tetrarchs and their crews onboard hamilcar gliders .
beginning in september 1961 , training exercises were conducted between the tetrarchs and their crews in hamilcar gliders .
beginning in february 1943 , training exercises were conducted with the tetrarchs and their crews in hamilcar gliders .
beginning in march 1942 , training exercises were conducted for the tetrarchs and their crews in hamilcar gliders .
beginning in august 1943 , training exercises were conducted between the tetrarchs and their crews in hamilcar gliders .
beginning in april 1917 , training exercises were conducted between the tetrarchs and their crews in hamilcar gliders .
beginning in march 1918 , training exercises were conducted between the tetrarchs and their crews in hamilcar gliders .
beginning in february 1943 , training exercises were conducted with the tetrarchs and their crews in hamilcar gliders .
beginning in june 1943 , training exercises were conducted with the tetrarchs and their crews in hamilcar gliders .
beginning in may 1945 , training exercises were conducted with the tetrarchs and their crews in hamilcar gliders .
beginning in july 1916 , training exercises were conducted with the tetrarchs and their crews in hamilcar gliders .
beginning in april 1941 , training exercises were conducted by the tetrarchs and their crews in hamilcar gliders .
beginning in october 1943 , training exercises were conducted between the tetrarchs and their crews in hamilcar gliders .
beginning in september 1940 , training exercises were conducted by the tetrarchs and their crews on hamilcar gliders .
beginning in september 1918 , training exercises were conducted with the tetrarchs and their crews flying hamilcar gliders .
beginning in august 1941 , training exercises were conducted among the tetrarchs and their crews with hamilcar gliders .
beginning in september 1951 , training exercises were conducted for the tetrarchs and their crews in hamilcar gliders .

Figure 10. Examples of the generated sentences in the text infilling task. Blue and Red words are generated. Red indicates repetitive generation.

4 https://github.com/maknotavailable/pytorch-pretrained-BERT
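The per-position factorization of the infilling objective is what keeps each evaluation cheap, while joint enumeration scales as N^m. A small sketch illustrating both points, assuming the per-position log-probabilities are already available (the toy probability tables below are made-up values, not BERT outputs, and `infill_logprob` is a hypothetical helper name):

```python
import numpy as np

def infill_logprob(theta, log_f):
    # theta[i] is the token index chosen for MASK[i]; log_f[i] is a length-N
    # vector of per-position log-probabilities from a masked language model.
    # The sampler targets a distribution proportional to exp of this sum.
    return sum(log_f[i][theta[i]] for i in range(len(theta)))

# Toy setting: m = 2 masked positions, N = 4 candidate words.
log_f = np.log(np.array([[0.70, 0.10, 0.10, 0.10],
                         [0.25, 0.25, 0.25, 0.25]]))
# Exhaustive search over all N^m = 16 joint fill-ins -- feasible only for
# tiny N and m, which is why direct joint sampling is intractable.
best = max(((i, j) for i in range(4) for j in range(4)),
           key=lambda t: infill_logprob(t, log_f))
```

With N = 30,522 and even m = 2 masks, the joint table already has roughly 9.3 × 10^8 entries, which is the intractability the section refers to.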