LSB: Local Self-Balancing MCMC in Discrete Spaces

Emanuele Sansone 1

Abstract

We present the Local Self-Balancing sampler (LSB), a local Markov Chain Monte Carlo (MCMC) method for sampling in purely discrete domains, which is able to autonomously adapt to the target distribution and to reduce the number of target evaluations required to converge. LSB is based on (i) a parametrization of locally balanced proposals, (ii) a newly proposed objective function based on mutual information and (iii) a self-balancing learning procedure, which minimises the proposed objective to update the proposal parameters. Experiments on energy-based models and Markov networks show that LSB converges using a smaller number of queries to the oracle distribution compared to recent local MCMC samplers.

1. Introduction

Sampling from complex and intractable probability distributions is of fundamental importance for learning and inference (MacKay, 2003). In this regard, MCMC methods are promising solutions to tackle the intractability of sampling in high dimensions. They have been successfully applied in several domains, including Bayesian statistics and statistical physics (Neal, 1993; Robert & Casella, 2013), bioinformatics and computational biology (Altekar et al., 2004; Alexeev et al., 2020) as well as machine learning (Andrieu et al., 2003; Koller & Friedman, 2009; Nijkamp et al., 2020). While MCMC is a general method, which can be used to sample from any target distribution, its efficiency strongly depends on the choice of the proposal. Indeed, the proposal must be adapted to the distribution we want to sample from (Andrieu & Thoms, 2008; Hoffman & Gelman, 2014), and machine learning can help to automate this process (Zhang et al., 2012; Pakman & Paninski, 2013; Afshar et al., 2015; Dinh et al., 2017; Nishimura et al., 2020). However, less effort has been devoted to devising strategies for adapting the proposal distribution to a discrete target. Most common solutions embed the discrete problem into a continuous one, thus allowing the reuse of sampling strategies designed for the continuous domain. However, these strategies are not always optimal, as they either require specific analytic forms for the target (Zhang et al., 2012), or simply because the embeddings do not capture the topological properties of the original domain (Pakman & Paninski, 2013; Afshar et al., 2015; Dinh et al., 2017; Nishimura et al., 2020). We can avoid these issues by sampling directly in the discrete domain and using local MCMC strategies (Zanella, 2020).

In this work, we propose an adaptation strategy for local MCMC. Specifically, we introduce (i) a new parametrization for locally balanced proposals, (ii) a new objective function based on mutual information to compute the statistical dependence between consecutive samples, and (iii) a learning procedure to adapt the parameters of the proposal to the target distribution using our proposed objective. The resulting procedure, called the Local Self-Balancing sampler (LSB), automatically discovers a locally balanced proposal with the advantage of reducing the number of target evaluations required to converge.

1 Department of Computer Science, KU Leuven, Belgium. Correspondence to: Emanuele Sansone.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).
Furthermore, we show that our framework generalizes two recent works, (Zanella, 2020) and (Grathwohl et al., 2021). We conduct experiments on energy-based models and Markov networks, and show that LSB queries the oracle distribution more efficiently than recent local MCMC samplers. In summary, the key contributions of the work are:

- A parametrization of locally balanced proposals, which enables converting the problem of proposal selection into a learning problem.
- The use of mutual information as a criterion to learn the proposal distribution and accelerate MCMC.
- An estimator to efficiently compute the mutual information objective and to enable efficient learning.

We start by providing some background on locally balanced proposal distributions (Section 2), we introduce LSB by describing the parametrizations, the objective and the learning procedure (Section 3), we discuss the related work (Section 4) and the experiments (Section 5), and finally we conclude by highlighting the main limitations of LSB and the possible future directions (Section 6).

2. Background

We consider the problem of sampling from a distribution p with support defined over a large finite discrete sample space X, i.e. $p(x) = \tilde{p}(x) / \sum_{x' \in X} \tilde{p}(x')$, where the normalization term cannot be tractably computed and only $\tilde{p}$ can be evaluated. One solution to the problem consists of sampling using MCMC (Neal, 1993). The main idea of MCMC is to sequentially sample from a tractable surrogate distribution, alternatively called the proposal, and to use an acceptance criterion to ensure that generated samples are distributed according to the original distribution. More formally, MCMC is a Markov chain with a transition probability of the form:1

$$T(x'|x) = A(x', x)\, Q(x'|x) \qquad (1)$$

where $Q(x'|x)$ is the probability of sampling $x'$ given a previously sampled $x$, namely the proposal distribution, and $A(x', x)$ is the probability of accepting sample $x'$ given $x$, e.g. $A(x', x) = \min\big\{1, \frac{p(x')Q(x|x')}{p(x)Q(x'|x)}\big\}$.2 In this work, we consider the family of locally informed proposals (Zanella, 2020), which are characterized by the following expression:

$$Q(x'|x) = \frac{g\big(\tfrac{p(x')}{p(x)}\big)\, \mathbb{1}[x' \in N(x)]}{Z(x)} \qquad (2)$$

where $N(x)$ is the neighborhood of $x$ based on the Hamming metric3 and $Z(x)$ is the corresponding normalizer. Note that the choice of $g$ has a dramatic impact on the performance of the Markov chain, as investigated in (Zanella, 2020). In fact, there is a family of functions called balancing functions, satisfying the relation $g(t) = t\, g(1/t)$ for all $t > 0$, which have extremely desirable properties, briefly recalled hereunder.

Acceptance rate. The balancing property allows rewriting the acceptance function in a form that depends only on the ratio $Z(x)/Z(x')$, specifically $A(x', x) = \min\{1, Z(x)/Z(x')\}$. Therefore, if $p(x)$ is smooth enough, the ratio $Z(x)/Z(x')$ will be close to 1, as $x$ and $x'$ are close to each other. However, a proper choice of $g$ can additionally influence the value of this ratio.

Detailed balance. Note that for $x' = x$, detailed balance trivially holds, viz. $p(x)T(x'|x) = p(x')T(x|x')$. In all other cases, detailed balance can be proved by exploiting the fact that $T(x'|x) = A(x', x)Q(x'|x)$ and by using the balancing property (see Appendix A for more details). Detailed balance is a sufficient condition for invariance.

1 For all $x, x' \in X$ such that $x \neq x'$; see Eq. (7) in Appendix B.
2 Other choices are available (Neal, 1993) as well.
3 In this work, $N(x)$ is the set of points having Hamming distance 1 from $x$.
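To make the ingredients above concrete, the following sketch implements one step of a locally informed sampler on $X = \{0,1\}^d$ with the Hamming-1 neighborhood of Eq. (2). It is a minimal sketch that only assumes access to an unnormalized log-probability, uses $g(t) = \sqrt{t}$ purely as an example balancing function, and all names are illustrative rather than taken from the released code. Each step costs $O(d)$ target evaluations for the forward normalizer and $O(d)$ more for the reverse one, matching the LB row of Table 1.

```python
import numpy as np

def flip(x, i):
    """Return a copy of x with bit i flipped."""
    y = x.copy()
    y[i] = 1 - y[i]
    return y

def sqrt_balancing(t):
    # g(t) = sqrt(t) satisfies the balancing property g(t) = t * g(1/t).
    return np.sqrt(t)

def locally_balanced_step(x, log_p, g=sqrt_balancing, rng=None):
    """One MH step with a locally informed proposal (Eq. (2)) on {0,1}^d."""
    rng = rng or np.random.default_rng()
    d = x.size
    # g(p(x')/p(x)) for every single-bit flip, i.e. the Hamming-1 neighborhood N(x).
    weights = g(np.exp([log_p(flip(x, i)) - log_p(x) for i in range(d)]))
    Zx = weights.sum()
    i = rng.choice(d, p=weights / Zx)
    x_new = flip(x, i)
    # Normalizer of the reverse proposal; the balancing property gives A = min{1, Z(x)/Z(x')}.
    weights_rev = g(np.exp([log_p(flip(x_new, j)) - log_p(x_new) for j in range(d)]))
    Z_new = weights_rev.sum()
    return x_new if rng.random() < min(1.0, Zx / Z_new) else x

# Toy unnormalized target favouring states with many ones.
log_p = lambda x: 2.0 * x.sum()
x = np.zeros(10)
for _ in range(100):
    x = locally_balanced_step(x, log_p)
print(x)
```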
Consequently, the target p is a fixed point of the Markov chain.

Ergodicity. Under mild assumptions, we also have ergodicity (we leave a more detailed discussion to Appendix B). In other words, the Markov chain converges to the fixed-point distribution p independently of its initialization.

Efficiency. The efficiency of MCMC is generally measured in terms of the resulting asymptotic variance for sample mean estimators. This is indeed a proxy to quantify the level of correlation between samples generated through MCMC. Higher levels of asymptotic variance correspond to higher levels of correlation, meaning that the Markov chain produces more dependent samples and is therefore less efficient. Balancing functions are asymptotically optimal according to Peskun ordering (Zanella, 2020).

The work in (Zanella, 2020) proposes a pool of balancing functions with closed-form expression, together with some general guidelines to choose one. However, this pool is only a subset of the whole family of balancing functions, and several cases do not even have an analytical expression. Consequently, it is not clear which function to use in order to sample efficiently from the target distribution. Indeed, we will see in the experimental section that (i) the optimality of the balancing function depends on the target distribution and that (ii) in some cases the optimal balancing function may be different from the ones proposed in (Zanella, 2020). In the next sections, we propose a strategy to automatically learn the balancing function from the target distribution, thus reducing the number of target evaluations required to achieve convergence compared to recent discrete MCMC samplers.

3. LSB: Local Self-Balancing Strategy

We start by introducing two different parametrizations for the family of balancing functions, in increasing order of functional expressiveness. Then, we propose an objective criterion based on mutual information to learn the parametrization and to reduce the number of steps required to converge. Finally, we introduce an approximate strategy to further reduce the computational overhead of each step.

3.1. Parametrizations

We state the following proposition and then use it to devise the first parametrization.

Proposition 3.1. Given n balancing functions $\mathbf{g}(t) = [g_1(t), \ldots, g_n(t)]^T$ and any vector of scalar weights $\mathbf{w} = [w_1, \ldots, w_n]^T$, the linear combination $g(t) \doteq \mathbf{w}^T \mathbf{g}(t)$ satisfies the balancing property.

Proof. $g(t) = \mathbf{w}^T \mathbf{g}(t) = \sum_{i=1}^n w_i g_i(t) = t \sum_{i=1}^n w_i g_i(1/t) = t\, \mathbf{w}^T \mathbf{g}(1/t) = t\, g(1/t)$.

Despite its simplicity, the proposition has important implications. First of all, it allows converting the problem of choosing the optimal balancing function into a learning problem. Secondly, the linear combination introduces only a few parameters (in the experiments we consider n = 4), and therefore the learning problem can be solved in an efficient way. Additionally, we consider only positive weights to avoid negative values in Eq. 2, thus having a valid proposal distribution. The first parametrization (LSB 1) consists of the relations $w_i = e^{\theta_i} / \sum_{j=1}^n e^{\theta_j}$ for all $i = 1, \ldots, n$, where $\theta = [\theta_1, \ldots, \theta_n] \in \mathbb{R}^n$. Note that the softmax is used to smoothly select one among the n balancing functions. Therefore, we refer to this parametrization as learning to select among existing balancing functions (a minimal code sketch is shown below).
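As an illustration of Proposition 3.1, the sketch below builds the LSB 1 balancing function as a softmax-weighted combination of the four closed-form balancing functions used in the experiments. The weights are fixed here for brevity; in the actual sampler they would be learned with the procedure of Section 3.2, and the function names are illustrative.

```python
import numpy as np

# The four closed-form balancing functions from (Zanella, 2020) used in LSB 1.
BALANCING_FUNCTIONS = [
    lambda t: t / (1.0 + t),       # Barker
    lambda t: np.sqrt(t),
    lambda t: np.minimum(1.0, t),
    lambda t: np.maximum(1.0, t),
]

def lsb1_balancing(t, theta):
    """Softmax-weighted combination of balancing functions (LSB 1).

    Any convex combination of balancing functions is itself balancing
    (Proposition 3.1), so the result can be plugged into Eq. (2) directly.
    """
    w = np.exp(theta - theta.max())
    w = w / w.sum()                # softmax weights, all positive
    return sum(wi * g(t) for wi, g in zip(w, BALANCING_FUNCTIONS))

# Example: with theta = 0 the sampler starts from a uniform mixture of the four functions.
theta = np.zeros(4)
print(lsb1_balancing(np.array([0.5, 1.0, 2.0]), theta))
```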
The second parametrization (LSB 2) is obtained from the following proposition.

Proposition 3.2. Given $g_\theta(t) \doteq \frac{h_\theta(t)}{2} + \frac{t\, h_\theta(1/t)}{2}$, where $h_\theta$ is any real-valued function parameterized by a vector $\theta \in \mathbb{R}^k$ (e.g. a neural network), $g_\theta(t)$ satisfies the balancing property. Furthermore, for any balancing function $\ell$, there always exists $\theta^\star \in \mathbb{R}^k$ such that $g_{\theta^\star}(t) = \ell(t)$ for all $t > 0$. In other words, LSB 2 parameterizes all possible balancing functions.

Proof. The first part follows directly from the definition of the balancing property. Given any balancing function $\ell$, we can always find a $\theta^\star$ such that $h_{\theta^\star}(t) = \ell(t)$ for all $t > 0$ (because $h_\theta$ is a universal function approximator (Hornik et al., 1989)). This implies that $h_{\theta^\star}$ satisfies the balancing property, i.e. $h_{\theta^\star}(t) = t\, h_{\theta^\star}(1/t)$ for all $t > 0$. Consequently, by the first part of the proposition, we have that $g_{\theta^\star}(t) = h_{\theta^\star}(t)$. Finally, we can conclude that $g_{\theta^\star}(t) = \ell(t)$ for all $t > 0$.

The proposition enables parameterizing the whole family of balancing functions, thus generalizing the result obtained for LSB 1. Note that the increased level of expressiveness of LSB 2 comes at the cost of higher computation. However, in practice we can always trade off expressiveness and computation by specifying the capacity of the function approximator (e.g. the hyperparameters of the neural network). We leave this discussion to the experimental section.
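As a concrete illustration of Proposition 3.2, the sketch below parameterizes $h_\theta$ with a small MLP and symmetrizes it as in the proposition, so the balancing property holds by construction. Feeding $\log t$ to the network and using a softplus output to keep the proposal weights positive are assumptions of this sketch, not details prescribed by the paper.

```python
import torch
import torch.nn as nn

class LSB2Balancing(nn.Module):
    """Balancing function g_theta(t) = h_theta(t)/2 + t*h_theta(1/t)/2 (Proposition 3.2)."""

    def __init__(self, hidden=10):
        super().__init__()
        # h_theta: a small MLP taking log(t) as input and producing a positive output.
        self.h = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(),
                               nn.Linear(hidden, 1), nn.Softplus())

    def forward(self, t):
        # t: 1-D tensor of probability ratios p(x')/p(x).
        t = t.clamp_min(1e-12).unsqueeze(-1)
        h_t = self.h(torch.log(t))
        h_inv = self.h(-torch.log(t))          # h evaluated at log(1/t)
        return 0.5 * (h_t + t * h_inv).squeeze(-1)

# Sanity check of the balancing property g(t) = t * g(1/t) on random ratios.
g = LSB2Balancing()
t = torch.rand(5) * 3 + 0.1
print(torch.allclose(g(t), t * g(1.0 / t)))
```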
In the next paragraphs, we propose an objective and a learning strategy to train the parameters of LSB 1 and LSB 2.

3.2. Objective and Learning Algorithm

The goal here is to devise a criterion to find the balancing function reducing the number of target likelihood evaluations and increasing the convergence/mixing rate to the distribution p. As already mentioned in the previous section, MCMC based on locally informed proposals converges to the true target distribution independently of its initialization. Furthermore, the rate of convergence and mixing is controlled by the balancing function. Importantly, we can speed up convergence and mixing by minimizing the amount of statistical dependence between consecutive samples. To this purpose, we introduce the following mutual information-based criterion:

$$I_\theta = KL\big(p(x)\, T_\theta(x'|x)\, \big\|\, p(x)\, p(x')\big) \qquad (3)$$

where KL is the Kullback-Leibler divergence and $T_\theta$ is the transition probability with explicit dependence on the parameters $\theta$. Now, we are ready to highlight some properties of Eq. 3 with the following theorem (see Appendix C for the proof).

Theorem 3.3. Consider $X = \{0,1\}^d$ and $p(x) = \tilde{p}(x)/\Gamma$, where $\Gamma$ is a normalizing constant. Define two auxiliary distributions $Q_1$ and $Q_2$, such that for all $x \in X$, $Q_1(x) > 0$ whenever $p(x) > 0$, and for all $x' \in N(x)$, $Q_2(x') > 0$ whenever $Q(x'|x) > 0$. Therefore,

(a) if $M(x) \doteq 1 - \sum_{x' \in N(x)} A(x', x)\, Q(x'|x)$, then

$$L_\theta = \mathbb{E}_{x \sim Q_1,\, x' \sim Q_2}\!\left[\frac{\tilde{p}(x)\, Q(x'|x)}{Q_1(x)\, Q_2(x')}\, A(x', x) \log\frac{A(x', x)\, Q(x'|x)}{\tilde{p}(x')}\right] + \mathbb{E}_{x \sim Q_1}\!\left[\frac{\tilde{p}(x)}{Q_1(x)}\, M(x) \log\frac{M(x)}{\tilde{p}(x)}\right] \qquad (4)$$

and $I_\theta$ equals $L_\theta$ up to constants independent of $\theta$;

(b) if $M(x) \doteq 1 - A(\bar{x}, x)\, Q(\bar{x}|x)$, where $\bar{x}$ is randomly sampled according to a uniform distribution over $N(x)$, then

$$L_\theta = \mathbb{E}_{x \sim Q_1,\, x' \sim Q_2}\!\left[\frac{\tilde{p}(x)\, Q(x'|x)}{Q_1(x)\, Q_2(x')}\, A(x', x) \log\frac{A(x', x)\, Q(x'|x)}{\tilde{p}(x')}\right] + \mathbb{E}_{x \sim Q_1}\!\left[\frac{\tilde{p}(x)}{Q_1(x)}\left(\eta\, M(x) - \frac{\tilde{p}(x)(\log\eta + 1)}{\Gamma}\right)\right] \qquad (5)$$

for $\eta \in \mathbb{R}^+$.

The theorem tells us that for case (a), $L_\theta$ is equal to $I_\theta$ up to a constant, whereas for case (b), $L_\theta$ is an upper bound of $I_\theta$. In both cases, we can use $L_\theta$ as a surrogate objective to minimize Eq. 3. However, note that computing the two expectations in Eq. 4 or Eq. 5 is generally intractable, but one can estimate these two quantities by using Monte Carlo, for instance by sampling first from $Q_1$ and then from $Q_2$. Also, note that the conditions on $Q_1$ and $Q_2$ in the theorem require that $Q_1$ and $Q_2$ are positive on the support of $p(x)$ and $Q(x'|x)$, respectively. These conditions can be met by choosing $Q_1(x) = \pi\, U_X(x) + (1 - \pi)\, \delta(x - \hat{x})$, where $U_X$ is a uniform distribution over the whole space X, $\delta$ is the Dirac delta function, $\hat{x}$ is the last accepted sample and $\pi$ defines the proportions of the mixture, and $Q_2(x') = Q_{\theta_0}(x'|x)$, where $Q_{\theta_0}(x'|x)$ is the proposal distribution with parameter vector $\theta_0$. Importantly, while the choice of $Q_2$ is quite natural, we choose to use a mixture of a deterministic and a uniform distribution for $Q_1$: the deterministic component remembers and exploits the previous iterate in the current sampling step, whereas the uniform component discards this information and acts as an uninformative prior on the support of $p(x)$, thus enabling the exploration of the whole input domain. Therefore, the mixing proportion $\pi$ balances between exploration and exploitation. In order to learn a balancing function, there is no real need for exploration ($\pi$ is set to 1e-8 in all our experiments). However, the mixing parameter and the exploration term might play a more important role in other learning settings, for instance when learning global proposal distributions.

Once $L_\theta$ is estimated, we can update the parameters of the two parametrizations by using an off-the-shelf gradient-based optimizer. In our training algorithm, we choose case (b) as our surrogate objective so as to minimize the number of target likelihood evaluations. Indeed, note that the estimate for case (a) requires $O(d^2)$ evaluations, because we need to compute $Q(x'|x)$ for all $x' \in N(x)$, whereas the one for case (b) requires only $O(d)$ evaluations. Finally, we learn the additional parameter $\eta$ in case (b) using standard gradient descent. The whole learning procedure is shown in Algorithm 1; its faster variant, based on the approximation introduced in Section 3.3, is shown in Algorithm 2.

Algorithm 1: Local Self-Balancing (LSB)
Input: learning rate γ = 1e-2, π = 1e-8, initial parameter θ0, burn-in iterations K and batch of N samples
{x̂(i)}_{i=1}^N ~ U_X
for k = 1 to K do
    {x(i)}_{i=1}^N ~ Q1
    {x'(i)}_{i=1}^N ~ Q2
    L̂θ ← estimate of Lθ using {x(i)}_{i=1}^N and {x'(i)}_{i=1}^N
    θ ← θ − (γ/N) ∇θ L̂θ
    Update η
    Accept/reject samples
    Update {x̂(i)}_{i=1}^N with the accepted samples
    θ0 ← θ
end for

Algorithm 2: Fast Local Self-Balancing (FLSB)
Input: learning rate γ = 1e-2, π = 1e-8, initial parameter θ0, burn-in iterations K and batch of N samples
{x̂(i)}_{i=1}^N ~ U_X
for k = 1 to K do
    {x(i)}_{i=1}^N ~ Q1
    {x'(i)}_{i=1}^N ~ Q2 (using the approximation)
    L̂θ ← estimate of Lθ using {x(i)}_{i=1}^N and {x'(i)}_{i=1}^N
    θ ← θ − (γ/N) ∇θ L̂θ
    Update η
    Accept/reject samples (exact acceptance score)
    Update {x̂(i)}_{i=1}^N with the accepted samples
    θ0 ← θ
end for
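The sketch below shows the overall structure of Algorithm 1 for the LSB 1 parametrization on a toy target. It is deliberately schematic: π is taken to be exactly zero (so Q1 degenerates to the last accepted samples and Q2 coincides with the current proposal), the surrogate loss implements only the first expectation of Eq. (5) and omits the η term, and none of the names come from the released code.

```python
import torch

torch.manual_seed(0)
d, N, K, gamma = 8, 16, 200, 1e-2

def log_p_tilde(x):
    # Toy unnormalized target on {0,1}^d favouring all-zeros or all-ones states.
    return 0.1 * (x.sum(-1) - d / 2.0) ** 2

def flips(x):
    # All Hamming-1 neighbours of each state in the batch, shape (N, d, d).
    return (x.unsqueeze(1) + torch.eye(d).unsqueeze(0)) % 2

balancers = [lambda t: t / (1 + t), torch.sqrt,
             lambda t: torch.minimum(torch.ones_like(t), t),
             lambda t: torch.maximum(torch.ones_like(t), t)]
theta = torch.zeros(4, requires_grad=True)            # LSB 1 parameters

def g(t):
    w = torch.softmax(theta, 0)
    return sum(wi * b(t) for wi, b in zip(w, balancers))

x_hat = torch.randint(0, 2, (N, d)).float()           # initial batch drawn from U_X
idx = torch.arange(N)
for k in range(K):
    x = x_hat                                         # Q1 with pi ~ 0: last accepted samples
    ratios = torch.exp(log_p_tilde(flips(x)) - log_p_tilde(x).unsqueeze(1))
    q = g(ratios)
    q = q / q.sum(-1, keepdim=True)                   # Q(i|x) over single-bit flips
    i = torch.multinomial(q.detach(), 1).squeeze(-1)  # Q2: current proposal
    x_new = x.clone()
    x_new[idx, i] = 1 - x_new[idx, i]
    ratios_new = torch.exp(log_p_tilde(flips(x_new)) - log_p_tilde(x_new).unsqueeze(1))
    q_new = g(ratios_new)
    q_new = q_new / q_new.sum(-1, keepdim=True)
    log_A = torch.clamp(log_p_tilde(x_new) - log_p_tilde(x)
                        + torch.log(q_new[idx, i]) - torch.log(q[idx, i]), max=0.0)
    A = torch.exp(log_A)                              # exact acceptance probabilities
    # Simplified surrogate: first expectation of Eq. (5) with the choices above.
    loss = (torch.exp(log_p_tilde(x)) * A
            * (log_A + torch.log(q[idx, i]) - log_p_tilde(x_new))).mean()
    loss.backward()
    with torch.no_grad():
        theta -= gamma * theta.grad
        theta.grad.zero_()
    accept = (torch.rand(N) < A).float().unsqueeze(-1)
    x_hat = (accept * x_new + (1 - accept) * x).detach()

print(torch.softmax(theta, 0))                        # learned weights over the four balancing functions
```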
3.3. Constant Target Evaluation

Without loss of generality, we can express $p(x) = e^{f(x)}$ for some $f : \mathbb{R}^d \to \mathbb{R}$ and consider $X = \{0,1\}^d$.4 By definition of $N(x)$, the proposal distribution in Eq. 2 can be written in the equivalent form $Q(x'|x) = \sum_{i=1}^d Q(x'|x, i)\, Q(i|x)$, where $Q(x'|x, i) = \delta(x' - x_{\neg i})$, with $x_{\neg i}$ obtained by flipping the $i$-th bit of $x$, and

$$Q(i|x) = \mathrm{Cat}\Big(\mathrm{Norm}\big[g\big(e^{df_1(x)}\big), \ldots, g\big(e^{df_d(x)}\big)\big]\Big)$$

where Cat stands for a categorical distribution, Norm is a normalization operator acting on a d-dimensional vector and $df_i(x) = f(x_{\neg i}) - f(x)$. It is clear that computing the previous equation requires evaluating $f$ $O(d)$ times. However, if we assume that $f$ is known and differentiable (which is true for instance in energy-based models), we can use a Taylor expansion to approximate the difference $df_i(x)$. Indeed, we have that

$$f(x_{\neg i}) - f(x) \approx \nabla_{\bar{x}} f(\bar{x})\big|_{\bar{x}=x}^T\, (x_{\neg i} - x) \doteq \widetilde{df}_i(x)$$

which allows evaluating the target only $O(1)$ times. Therefore, our new proposal distribution is defined as follows:

$$Q(i|x) = \mathrm{Cat}\Big(\mathrm{Norm}\big[g\big(e^{\widetilde{df}_1(x)}\big), \ldots, g\big(e^{\widetilde{df}_d(x)}\big)\big]\Big)$$

Interestingly, if we choose $g(t) = \sqrt{t}$, we recover exactly the proposal distribution of a recent sampler called Gibbs-With-Gradients (Grathwohl et al., 2021). Thanks to this approximation, we can propose a more efficient version of Algorithm 1, called the Fast Local Self-Balancing procedure (FLSB), shown in Algorithm 2. Finally, we summarize the main differences among the locally balanced proposals of (Zanella, 2020), Gibbs-With-Gradients (Grathwohl et al., 2021) and our samplers in Table 1.

Table 1. Summary comparing locally balanced proposals (LB) of (Zanella, 2020), Gibbs-With-Gradients (GWG) (Grathwohl et al., 2021) and our samplers (LSB and FLSB).

Name | g(t) | df_i | Eval./step
LB | Fixed | Exact | O(d)
GWG | Fixed (g(t) = √t) | Approx.* | O(1)
LSB | Learnt | Exact | O(d)
FLSB | Learnt | Approx.* | O(1)

* Assumption: f is known and differentiable.

4 p(x) is obviously defined over a discrete sample space. However, we can always identify a real-valued function which coincides with log p(x) on the discrete support.
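The following sketch illustrates the gradient-based approximation: all d first-order differences $\widetilde{df}_i(x)$ are obtained from a single evaluation of f and its gradient, and the categorical proposal Q(i|x) is then built with a balancing function (here $\sqrt{t}$, which corresponds to the Gibbs-With-Gradients proposal). The energy f and all function names are illustrative assumptions of this sketch.

```python
import torch

def f(x):
    # Example differentiable energy: a quadratic pseudo-Boolean function.
    W = torch.ones(x.shape[-1], x.shape[-1]) * 0.1
    return (x @ W * x).sum(-1)

def approx_proposal_probs(x, g=torch.sqrt):
    """Q(i|x) built from the Taylor approximation df_i(x) ~ grad f(x)^T (x_i' - x)."""
    x = x.clone().requires_grad_(True)
    grad = torch.autograd.grad(f(x).sum(), x)[0]   # one call to f and its gradient
    delta = 1.0 - 2.0 * x                          # x_i' - x for a bit flip on {0,1}^d
    df_approx = grad * delta                       # all d approximate differences at once
    weights = g(torch.exp(df_approx))
    return weights / weights.sum(-1, keepdim=True)

x = torch.randint(0, 2, (4, 10)).float()           # a batch of 4 states in {0,1}^10
probs = approx_proposal_probs(x)
i = torch.multinomial(probs, 1)                    # indices of the bits proposed for flipping
print(probs.shape, i.squeeze(-1))
```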
4. Related Work

It is important to devise strategies which enable the automatic adaptation of proposals to target distributions, not only to reduce user intervention, but also to increase the efficiency of MCMC samplers (Andrieu & Thoms, 2008; Hoffman & Gelman, 2014). Recently, there has been a surge of interest in using machine learning, and in particular deep learning, to learn proposals directly from data, especially in the continuous domain. Here, we provide a brief overview of recent integrations of machine learning and MCMC samplers according to their parametrizations and training objectives.

Parametrizations and objectives in the continuous domain. The work in (Wang et al., 2018) proposes a strategy based on block Gibbs sampling, where blocks are large motifs of the underlying probabilistic graphical structure. It parameterizes the conditional distributions of each block using mixture density networks and trains them using meta-learning on a log-likelihood-based objective. The work in (Song et al., 2017) considers a global sampling strategy, where the proposal is parameterized by a deep generative model. The model is learnt through adversarial training, where a neural discriminator is used to detect whether or not generated samples are distributed according to the target distribution. The authors of (Habib & Barber, 2018) propose a global sampling strategy based on MCMC with auxiliary variables (Higdon, 1998). The proposals are modelled as Gaussian distributions parameterized by neural networks and are trained on a variational bound of a log-likelihood-based objective. The works in (Levy et al., 2018; Gong et al., 2019) propose gradient-based MCMC (Duane et al., 1987; Grenander & Miller, 1994), where neural models are used to learn the hyperparameters of the equations governing the dynamics of the sampler. Different objectives are used during training. In particular, the work in (Gong et al., 2019) uses a log-likelihood-based objective, whereas the work in (Levy et al., 2018) considers the expected squared jump distance, namely a tractable proxy for the lag-1 autocorrelation function (Pasarica & Gelman, 2010). The work in (Zhu, 2019) proposes a global two-stage strategy, which consists of (i) sampling according to a Gaussian proposal and (ii) updating its parameters using the first- and second-order statistics computed from a properly maintained pool of samples. The parameter update can be equivalently seen as finding the solution maximizing a log-likelihood function defined over the pool of samples. Finally, the work in (Pompe et al., 2020) extends this last strategy to the case of Gaussian mixture proposals.

All these works differ from the current one in at least two aspects. Firstly, it is not clear how these parametrizations can be applied to sampling in the discrete domain. Secondly, the proposed objectives compute either a distance between the proposal distribution and the target one, namely using an adversarial objective or a variational bound on the log-likelihood, or a proxy for the correlation between consecutive generated samples, namely the expected squared jump distance. Instead, our proposed objective is more general in the sense that it reduces the statistical dependence between consecutive samples, being closely related to mutual information.

Sampling in the discrete domain. Less effort has been devoted to devising sampling strategies for a purely discrete domain. Most of the works consider problem relaxations by embedding the discrete domain into a continuous one, applying existing strategies like Hamiltonian Monte Carlo (Zhang et al., 2012; Pakman & Paninski, 2013; Afshar et al., 2015; Dinh et al., 2017; Nishimura et al., 2020) on it and then moving back to the original domain. These strategies are suboptimal, either because they consider limited settings, where the target distribution has specific analytic forms (Zhang et al., 2012), or because they make strong assumptions on the properties of the embeddings, thus not preserving the topological properties of the discrete domain (Pakman & Paninski, 2013; Afshar et al., 2015; Dinh et al., 2017; Nishimura et al., 2020).5 The work in (Zanella, 2020) provides an extensive experimental comparison between several discrete sampling strategies, including the ones based on embeddings, those based on stochastic local search (Hans et al., 2007) and the Hamming ball sampler (Titsias & Yau, 2017), which can be regarded as a more efficient version of block Gibbs sampling. Notably, the sampling strategy based on locally informed proposals and balancing functions proposed in (Zanella, 2020) can be considered the current state of the art for discrete MCMC. Our work builds and extends upon this sampler by integrating it with a machine learning strategy.

It is important to mention that there are a couple of neural approaches applied to the discrete domain. The first one (Jaini et al., 2021) proposes to use normalizing flows to (i) learn a continuous relaxation of the discrete domain by dequantizing input data, and to (ii) learn a latent embedding more amenable to MCMC sampling. Learning the input-latent transformation is performed by maximizing the log-likelihood computed on data sampled from the latent space.

5 For example by considering transformations that are bijective and/or by proposing transformations which allow to tractably compute the marginal distribution on the continuous domain.

Figure 1. Examples of α in different settings of the Ising model (30×30), i.e. noisy (µ = 1, σ = 3) and clean (µ = 3, σ = 3).
The second one (Dai et al., 2020) proposes a strategy to learn an initializing distribution for a fixed local discrete MCMC sampler in the context of energy-based models. This is achieved by forcing the distribution to be close enough (in terms of KL divergence) to the fixed point of the MCMC kernel. Both works differ from ours in at least three aspects. Firstly, our work parameterizes the kernel of a local MCMC sampler, while the others consider a more global approach. Secondly, we learn a one-dimensional real-valued function through a simple neural network, instead of learning a more complex deep latent variable model which must transform input data. Finally, we provide a mutual information objective, which directly tackles the problem of reducing the statistical dependence, and therefore also the correlation, of consecutive samples.

5. Experiments

Firstly, we analyze the samplers' performance on energy-based models, including the 2D Ising model and Restricted Boltzmann Machines. Then, we perform experiments on additional UAI benchmarks. Code to replicate the experiments is available at https://github.com/emsansone/LSB.git.

5.1. 2D Ising Model

We consider the Ising model applied to image segmentation, to identify an object from its background. Consider a binary state space $X = \{-1, 1\}^V$, where $(V, E)$ defines a square lattice graph of the same size as the analyzed image, namely $n \times n$. For each state configuration $x = (x_i)_{i \in V} \in X$, define a prior distribution

$$p_{prior}(x) \propto \exp\Big(\lambda \sum_{(i,j) \in E} x_i x_j\Big)$$

where $\lambda$ is a non-negative scalar used to weight the dependence among neighboring variables in the lattice. Then, consider that each pixel $y_i$ is influenced only by the corresponding hidden variable $x_i$ and generated according to a Gaussian density with mean $\mu x_i$ and variance $\sigma^2$. Note that each variable in the lattice tells whether the corresponding pixel belongs to the object or to the background (1 or -1, respectively). The corresponding posterior distribution of a hidden state x given an observed image is defined as follows:

$$p(x|y) \propto \exp\Big(\sum_{i \in V} \alpha_i x_i + \lambda \sum_{(i,j) \in E} x_i x_j\Big) \qquad (6)$$

where $\alpha_i = y_i \mu / \sigma^2$ is a coefficient biasing $x_i$ towards either 1 or -1. Therefore, $\alpha = (\alpha_i)_{i \in V}$ contains information about the observed image. Figure 1 shows two synthetically generated examples of α. We evaluate the sampling performance on the distribution defined in Eq. 6. Importantly, the topological graph structure of the lattice and the exponential form of the posterior distribution allow computing a locally balanced proposal in O(1) target evaluations, without the need for a gradient-based approximation (cf. Section 3.3). Therefore, we consider only comparisons between LB and LSB.

Learning the balancing function. We consider the balancing functions proposed in (Zanella, 2020), namely $g(t) = t/(1+t)$ (a.k.a. the Barker function), $\sqrt{t}$, $\min\{1, t\}$ and $\max\{1, t\}$.6 We compare these four balancing functions with our two parametrizations on four different settings of the Ising model, namely the independent and noisy $(\lambda, \mu, \sigma) = (0, 1, 3)$, independent and clean $(\lambda, \mu, \sigma) = (0, 3, 3)$, dependent and noisy $(\lambda, \mu, \sigma) = (1, 1, 3)$ and dependent and clean $(\lambda, \mu, \sigma) = (1, 3, 3)$ cases, and show the corresponding performance in Figure 2 and Table 2. We leave additional details and results to Appendices D and E.
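As an illustration of this setup, the sketch below generates a synthetic observation y from a square-shaped ground-truth segmentation, computes the coefficients $\alpha_i = y_i \mu / \sigma^2$, and evaluates the unnormalized log-posterior of Eq. (6) on the 4-connected lattice; the square object and all names are illustrative choices of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, mu, sigma = 30, 1.0, 3.0, 3.0            # "dependent and clean" setting

# Ground-truth segmentation: a square object (+1) on a background (-1).
x_true = -np.ones((n, n))
x_true[8:22, 8:22] = 1.0

# Observed image and resulting coefficients alpha_i = y_i * mu / sigma^2.
y = mu * x_true + sigma * rng.standard_normal((n, n))
alpha = y * mu / sigma ** 2

def log_p_tilde(x, alpha, lam):
    """Unnormalized log-posterior of Eq. (6) on the n x n 4-connected lattice."""
    unary = np.sum(alpha * x)
    pairwise = np.sum(x[:, :-1] * x[:, 1:]) + np.sum(x[:-1, :] * x[1:, :])
    return unary + lam * pairwise

print(log_p_tilde(x_true, alpha, lam), log_p_tilde(-x_true, alpha, lam))
```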
From Figure 2, we can see that our first parametrization LSB 1 is able to always select an unbounded balancing function during burn-in, while, when approaching convergence, it is able to adapt so as to preserve fast mixing, as measured by the effective sample size (ESS) in Table 2. It is also interesting to mention that the softmax nonlinearity used in LSB 1 can sometimes slow down the adaptation due to vanishing gradients. This can be observed by looking at case 4 of Figure 2, where for a large part of the burn-in period the strategy prefers max{1, t} over √t. Nevertheless, it is still able to recover a solution different from max{1, t} at the end of burn-in, as confirmed by the larger effective sample size in Table 2 compared to the one achieved by max{1, t}. Furthermore, we observe that our second parametrization LSB 2, which is functionally more expressive than LSB 1, outperforms all previous cases in terms of the number of target evaluations required to converge, as shown in Figure 2 and Table 2. This provides evidence that the optimality of the balancing function depends on the target distribution and that exploiting information about the target can lead to significant improvements (e.g. in case 3 of Figure 2, LSB 2 converges twice as fast as the best balancing function √t). Figure 3 provides some realizations obtained by the samplers for the cases with dependent variables (λ = 1). We clearly see from these pictures that convergence for LSB 2 occurs at an earlier stage than for the other balancing functions, and therefore the latent variables in the Ising model converge faster to their ground-truth configuration.

6 As discussed in Section 3.3, the case g(t) = √t corresponds to the exact version of GWG. Here there is no need for using the approximation, as we can exploit the structure of the lattice and the exponential form of the distribution to achieve constant target evaluation.

Figure 2. Samplers' performance on four cases of the Ising model (30×30) for the burn-in phase. (a) Case 1: independent-noisy, (b) case 2: independent-clean, (c) case 3: dependent-noisy, (d) case 4: dependent-clean.

Figure 3. Realizations obtained after 300 burn-in iterations on the Ising model.

Table 2. Quantitative performance for mixing measured by effective sample size on the four cases of the Ising model (30×30). max{1, t} performs significantly worse in statistical terms than the other functions.

Setting | t/(1+t) | √t | min{1, t} | max{1, t} | LSB 1 | LSB 2
Case 1 | 2.55 ± 0.30 | 2.29 ± 0.23 | 2.43 ± 0.22 | 1.71 ± 0.13 | 2.32 ± 0.27 | 2.34 ± 0.23
Case 2 | 3.30 ± 0.36 | 2.89 ± 0.28 | 2.96 ± 0.30 | 1.68 ± 0.11 | 2.85 ± 0.28 | 2.30 ± 0.29
Case 3 | 2.39 ± 0.98 | 1.83 ± 0.56 | 2.31 ± 0.92 | 1.20 ± 0.10 | 2.01 ± 0.70 | 2.44 ± 0.96
Case 4 | 2.10 ± 0.91 | 7.08 ± 4.07 | 1.74 ± 0.26 | 1.74 ± 0.51 | 2.52 ± 1.11 | 20.11 ± 15.58

5.2. Sampling in Restricted Boltzmann Machines

We also evaluate the performance on the challenging task of sampling from Restricted Boltzmann Machines, using the experimental setup of (Grathwohl et al., 2021). In particular, we train an RBM with 250 hidden units on the binary MNIST dataset using contrastive divergence and use this model as our base distribution for evaluating the samplers.
Notably, since the distribution has an analytic differentiable form, we can exploit the gradient-based approximation explained in Section 3.3. Therefore, we compare our sampler (FLSB) against Gibbs-With-Gradients (GWG) (Grathwohl et al., 2021), block Gibbs sampling (Gibbs 2) and the Hamming Ball sampler (HB-10-1) (Titsias & Yau, 2017).7 We use two scores to measure the performance. The first one is the maximum mean discrepancy (MMD) between the generated samples and the ground-truth ones, obtained through the block-Gibbs sampling procedure available in RBMs. The second one is the effective sample size (ESS) of a summary statistic computed on the sampling trajectories (more details about the experiments are available in Appendix D). We also provide the effective sample size normalized over time (ESS/sec).

In Figure 4, we see that FLSB 1 recovers the performance of GWG. FLSB 2 is able to adapt to the target distribution and to converge using a much smaller number of target likelihood evaluations. Furthermore, we observe that FLSB 2 performs well also in terms of ESS. These experiments provide further evidence that there is a clear advantage in learning the balancing function. However, it is important to mention that the improved performance comes with a computational overhead. Indeed, when comparing the samplers based on ESS/sec (Figure 4), we observe that, while FLSB 1 and GWG achieve comparable results, the performance of FLSB 2 is inferior on average. This is explained by the fact that each sampling iteration requires evaluating a more complex balancing function (i.e. a multilayer perceptron with one 10-neuron hidden layer, corresponding to 31 parameters, vs. 4 parameters for FLSB 1). Clearly, there is a distinction between the number of evaluations of the likelihood and the number of evaluations of the balancing function. Our proposed strategy reduces the former, and the introduced computational overhead affects only the latter. This can be mitigated, for example, by looking at more efficient block-wise implementations. However, this is left to future work. Additional comparisons of FLSB 2 against block-wise sampling strategies, together with a summary of its benefits, are provided in Appendix F.

7 Gibbs-X refers to Gibbs sampling with a block size of X, whereas HB-X-Y refers to the Hamming Ball sampler with a block size of X and a Hamming ball of size Y.

Figure 4. Samplers' performance on RBMs. On the top, burn-in performance computed using MMD (in logarithmic scale). On the bottom, mixing performance computed after burn-in using log(ESS) (on the left) and ESS per second (on the right).

Figure 5. Samplers' performance on Markov networks from the UAI competition (100 variables). (a)-(b) are the traceplots for the burn-in phase.

5.3. Markov Networks: UAI data

Lastly, we evaluate how our strategy generalizes to graph topologies different from the one of the Ising model. In particular, we consider two Markov networks, with 100 discrete variables each and near-deterministic dependencies, from the 2006 UAI competition.8 In this setting, we cannot leverage the gradient-based approximation of Section 3.3, as the distribution is specified in tabular form. Therefore, we compare our sampler LSB with the four balancing functions proposed in (Zanella, 2020). Similarly to previous experiments, we analyze the convergence of the burn-in phase (using traceplots) and the mixing performance according to ESS.
Further details about the simulations are available in Appendices D and G. Also in this case, we observe that the proposed strategy is able to adapt to the target distribution and to reduce the number of target likelihood evaluations required to converge (Figure 5). Moreover, we observe similar performance in terms of ESS (cf. Table 3).

Table 3. Quantitative performance for mixing measured by effective sample size on two Markov networks from the UAI competition.

Dataset | t/(1+t) | √t | min{1, t} | max{1, t} | LSB 1 | LSB 2
MN 1 | 2.90 ± 0.76 | 3.41 ± 0.77 | 2.54 ± 0.32 | 2.70 ± 0.63 | 3.19 ± 0.46 | 3.22 ± 0.38
MN 2 | 3.43 ± 0.75 | 3.92 ± 0.94 | 3.78 ± 0.50 | 3.63 ± 0.67 | 3.52 ± 0.42 | 3.44 ± 0.44

6. Conclusions and Future Work

We have presented a strategy to learn locally balanced proposals for MCMC in discrete spaces. The strategy consists of (i) a new parametrization of balancing functions and (ii) a learning procedure adapting the proposal to the target distribution. This allows reducing the number of target likelihood evaluations required to converge. We believe that the proposed strategy can play an important role in applications where querying an oracle distribution is very expensive (like in deep energy-based models). We will investigate this in the future.

Note that the LSB sampler belongs to the family of local sampling strategies, thus inheriting their limitations. The locality assumption can be quite restrictive, for example when sampling from discrete distributions with deterministic dependencies among variables. In such situations, local sampling might fail to correctly sample from the target in a finite amount of time, as it is required to cross regions with zero probability mass. The locality assumption can be relaxed, for instance, by leveraging a recent extension of the works in (Zanella, 2020; Grathwohl et al., 2021), called Path Auxiliary MCMC (Sun et al., 2022), which uses a trajectory of samples generated by the repeated application of the same base locally balanced proposal, and by adapting our proposed strategy to this new setting. Alternatively, we can consider block-wise implementations of locally balanced proposals, as also suggested in (Zanella, 2020). However, we leave all these extensions to future work.

To the best of our knowledge, we are the first to consider using mutual information to accelerate MCMC. We have focused on a local MCMC sampler for discrete domains. In the future, it will be interesting to see the application of the mutual information criterion to more global settings as well as its extension to the continuous case. Lastly, further theoretical investigation is required to establish the local convergence of our training algorithm. Indeed, standard results of gradient-based optimization do not apply in this setting, as the iterates are correlated. This remains an open challenge for future work.

Acknowledgements

This research was partially supported by TAILOR, a project funded by the EU Horizon 2020 research and innovation programme under GA No. 952215. The author would like to thank Luc De Raedt for the financial support.

8 http://sli.ics.uci.edu/~ihler/uai-data/

References

Afshar, H. M., Domke, J., et al. Reflection, Refraction, and Hamiltonian Monte Carlo. In NeurIPS, pp. 3007-3015, 2015.
Alexeev, N., Isomurodov, J., Sukhov, V., Korotkevich, G., and Sergushichev, A. Markov Chain Monte Carlo for Active Module Identification Problem. BMC Bioinformatics, 21(6):1-20, 2020.

Altekar, G., Dwarkadas, S., Huelsenbeck, J. P., and Ronquist, F. Parallel Metropolis Coupled Markov Chain Monte Carlo for Bayesian Phylogenetic Inference. Bioinformatics, 20(3):407-415, 2004.

Andrieu, C. and Thoms, J. A Tutorial on Adaptive MCMC. Statistics and Computing, 18(4):343-373, 2008.

Andrieu, C., De Freitas, N., Doucet, A., and Jordan, M. I. An Introduction to MCMC for Machine Learning. Machine Learning, 50(1):5-43, 2003.

Dai, H., Singh, R., Dai, B., Sutton, C., and Schuurmans, D. Learning Discrete Energy-based Models via Auxiliary-variable Local Exploration. In NeurIPS, pp. 10443-10455, 2020.

Dinh, V., Bilge, A., Zhang, C., and Matsen IV, F. A. Probabilistic Path Hamiltonian Monte Carlo. In ICML, pp. 1009-1018, 2017.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. Hybrid Monte Carlo. Physics Letters B, 195(2):216-222, 1987.

Gong, W., Li, Y., and Hernández-Lobato, J. M. Meta-Learning for Stochastic Gradient MCMC. In ICLR, 2019.

Grathwohl, W., Swersky, K., Hashemi, M., Duvenaud, D., and Maddison, C. J. Oops I Took A Gradient: Scalable Sampling for Discrete Distributions. In ICML, 2021.

Grenander, U. and Miller, M. I. Representations of Knowledge in Complex Systems. Journal of the Royal Statistical Society: Series B (Methodological), 56(4):549-581, 1994.

Habib, R. and Barber, D. Auxiliary Variational MCMC. In ICLR, 2018.

Hans, C., Dobra, A., and West, M. Shotgun Stochastic Search for Large P Regression. Journal of the American Statistical Association, 102(478):507-516, 2007.

Higdon, D. M. Auxiliary Variable Methods for Markov Chain Monte Carlo with Applications. Journal of the American Statistical Association, 93(442):585-595, 1998.

Hoffman, M. D. and Gelman, A. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593-1623, 2014.

Hornik, K., Stinchcombe, M., and White, H. Multilayer Feedforward Networks are Universal Approximators. Neural Networks, 1989.

Jaini, P., Nielsen, D., and Welling, M. Sampling in Combinatorial Spaces with SurVAE Flow Augmented MCMC. In AISTATS, pp. 3349-3357, 2021.

Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Levy, D., Hoffman, M. D., and Sohl-Dickstein, J. Generalizing Hamiltonian Monte Carlo with Neural Networks. In ICLR, 2018.

MacKay, D. J. C. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

Neal, R. Probabilistic Inference Using Markov Chain Monte Carlo Methods. Department of Computer Science, University of Toronto, Toronto, Ontario, Canada, 1993.

Nijkamp, E., Hill, M., Han, T., Zhu, S. C., and Wu, Y. N. On the Anatomy of MCMC-Based Maximum Likelihood Learning of Energy-Based Models. In AAAI, pp. 5272-5280, 2020.

Nishimura, A., Dunson, D. B., and Lu, J. Discontinuous Hamiltonian Monte Carlo for Discrete Parameters and Discontinuous Likelihoods. Biometrika, 107(2):365-380, 2020.

Pakman, A. and Paninski, L. Auxiliary-Variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions. In NeurIPS, 2013.

Pasarica, C. and Gelman, A. Adaptively Scaling the Metropolis Algorithm Using Expected Squared Jumped Distance. Statistica Sinica, pp. 343-364, 2010.

Pompe, E., Holmes, C., Łatuszyński, K., et al.
A Framework for Adaptive MCMC Targeting Multimodal Distributions. Annals of Statistics, 48(5):2930-2952, 2020.

Robert, C. and Casella, G. Monte Carlo Statistical Methods. Springer Science & Business Media, 2013.

Song, J., Zhao, S., and Ermon, S. A-NICE-MC: Adversarial Training for MCMC. In NeurIPS, 2017.

Sun, H., Dai, H., Xia, W., and Ramamurthy, A. Path Auxiliary Proposal for MCMC in Discrete Space. In ICLR, 2022.

Titsias, M. K. and Yau, C. The Hamming Ball Sampler. Journal of the American Statistical Association, 112(520):1598-1611, 2017.

Wang, T., Wu, Y., Moore, D. A., and Russell, S. J. Meta-Learning MCMC Proposals. In NeurIPS, pp. 4150-4160, 2018.

Zanella, G. Informed Proposals for Local MCMC in Discrete Spaces. Journal of the American Statistical Association, 115(530):852-865, 2020.

Zhang, Y., Ghahramani, Z., Storkey, A. J., and Sutton, C. Continuous Relaxations for Discrete Hamiltonian Monte Carlo. NeurIPS, 25:3194-3202, 2012.

Zhu, M. Sample Adaptive MCMC. In NeurIPS, volume 32, 2019.

A. Detailed Balance

We want to prove that $p(x)T(x'|x) = p(x')T(x|x')$ for all $x' \neq x$, i.e. that

$$p(x)\, A(x', x)\, \frac{g\big(\tfrac{p(x')}{p(x)}\big)\, \mathbb{1}[x' \in N(x)]}{Z(x)} = p(x')\, A(x, x')\, \frac{g\big(\tfrac{p(x)}{p(x')}\big)\, \mathbb{1}[x \in N(x')]}{Z(x')}$$

By observing that $\mathbb{1}[x' \in N(x)] = \mathbb{1}[x \in N(x')]$ and using the balancing property, we can simplify the previous equality to obtain the following relation:

$$p(x)\, A(x', x)\, \frac{g\big(\tfrac{p(x')}{p(x)}\big)}{Z(x)} = p(x)\, A(x, x')\, \frac{g\big(\tfrac{p(x')}{p(x)}\big)}{Z(x')}$$

Therefore, we can apply standard algebra to simplify even more:

$$\frac{A(x', x)}{Z(x)} = \frac{A(x, x')}{Z(x')}$$

Finally, recall that for balancing functions $A(x, x') = \min\big\{1, \frac{p(x)Q(x'|x)}{p(x')Q(x|x')}\big\} = \min\big\{1, \frac{Z(x')}{Z(x)}\big\}$ and, symmetrically, $A(x', x) = \min\big\{1, \frac{Z(x)}{Z(x')}\big\}$; therefore both sides of the previous equality reduce to $\min\{1/Z(x), 1/Z(x')\}$ and the equality becomes an identity, thus proving detailed balance.

B. Ergodicity

Let us consider a Markov chain, namely

$$T(x'|x) = A(x', x)\, Q(x'|x) + \mathbb{1}[x' = x] \sum_{x'' \in X} \big(1 - A(x'', x)\big)\, Q(x''|x) \qquad (7)$$

with a proposal of the following form:

$$Q(x'|x) = \frac{g\big(\tfrac{p(x')}{p(x)}\big)\, \mathbb{1}[x' \in N(x)]}{Z(x)} \qquad (8)$$

We can prove the ergodicity of the Markov chain for the case where the fixed-point distribution satisfies p(x) > 0 for every x ∈ X and then extend it to a general distribution p. Now, assume that p(x) > 0 for any point x ∈ X, g(t) > 0 for any t > 0, and X is a d-dimensional discrete space. Then, the Markov chain in Eq. 7 with proposal defined according to Eq. 8 can reach any state x' from any state x in d steps with non-zero probability. More formally, we can construct a new Markov chain by applying the original one d times and identify its transition probability with $T^d(x'|x)$. We can easily check, thanks to our assumptions, that $T^d(x'|x) > 0$ for any $x, x'$. In other words, the original Markov chain is regular. This is sufficient to satisfy the assumptions of the fundamental theorem of homogeneous Markov chains (Neal, 1993), thus proving ergodicity. We can extend the previous result to any arbitrary p (namely considering cases where p(x) = 0 for some x ∈ X). This can be achieved by modifying our assumptions on g, namely considering that g(t) > 0 for any t ≥ 0, and reusing the same proof strategy.

C. Proof of Theorem 1

We want to prove the following theorem.

Theorem C.1. Consider $X = \{0,1\}^d$ and $p(x) = \tilde{p}(x)/\Gamma$, where $\Gamma$ is a normalizing constant. Define two auxiliary distributions $Q_1$ and $Q_2$, such that for all $x \in X$, $Q_1(x) > 0$ whenever $p(x) > 0$, and for all $x' \in N(x)$, $Q_2(x') > 0$ whenever $Q(x'|x) > 0$.
Therefore,

(a) if $M(x) \doteq 1 - \sum_{x' \in N(x)} A(x', x)\, Q(x'|x)$, then

$$L_\theta = \mathbb{E}_{x \sim Q_1,\, x' \sim Q_2}\!\left[\frac{\tilde{p}(x)\, Q(x'|x)}{Q_1(x)\, Q_2(x')}\, A(x', x) \log\frac{A(x', x)\, Q(x'|x)}{\tilde{p}(x')}\right] + \mathbb{E}_{x \sim Q_1}\!\left[\frac{\tilde{p}(x)}{Q_1(x)}\, M(x) \log\frac{M(x)}{\tilde{p}(x)}\right]$$

and $I_\theta$ equals $L_\theta$ up to constants independent of $\theta$;

(b) if $M(x) \doteq 1 - A(\bar{x}, x)\, Q(\bar{x}|x)$, where $\bar{x}$ is randomly sampled according to a uniform distribution over $N(x)$, then

$$L_\theta = \mathbb{E}_{x \sim Q_1,\, x' \sim Q_2}\!\left[\frac{\tilde{p}(x)\, Q(x'|x)}{Q_1(x)\, Q_2(x')}\, A(x', x) \log\frac{A(x', x)\, Q(x'|x)}{\tilde{p}(x')}\right] + \mathbb{E}_{x \sim Q_1}\!\left[\frac{\tilde{p}(x)}{Q_1(x)}\left(\eta\, M(x) - \frac{\tilde{p}(x)(\log\eta + 1)}{\Gamma}\right)\right]$$

for $\eta \in \mathbb{R}^+$.

Proof. Let us start from the definition of $I_\theta$ and use the fact that $T_\theta(x'|x) = A(x', x)Q(x'|x) + \mathbb{1}[x' = x] \sum_{x'' \in X} (1 - A(x'', x))\, Q(x''|x)$. We have

$$I_\theta = KL\big(p(x)\, T_\theta(x'|x)\, \big\|\, p(x)\, p(x')\big) = \sum_{x \in X} \sum_{x' \in X} p(x)\, T_\theta(x'|x) \log\frac{T_\theta(x'|x)}{p(x')}$$

Since $T_\theta(\cdot|x)$ is supported on $N(x) \cup \{x\}$, $Q(x|x) = 0$ and $p(x') = \tilde{p}(x')/\Gamma$, we can restrict the inner sum, separate the term $x' = x$ and pull out the constant $\log\Gamma$:

$$I_\theta = \sum_{x \in X} \sum_{x' \in N(x)} p(x)\, A(x', x)\, Q(x'|x) \log\frac{A(x', x)\, Q(x'|x)}{\tilde{p}(x')} + \sum_{x \in X} p(x)\, M(x) \log\frac{M(x)}{\tilde{p}(x)} + \log\Gamma$$

where $M(x) = 1 - \sum_{x' \in N(x)} A(x', x)\, Q(x'|x)$ is the probability of rejecting a move from $x$. Re-weighting the two sums with the auxiliary distributions $Q_1$ and $Q_2$ (which are positive wherever $p(x)$ and $Q(x'|x)$ are positive, respectively) yields

$$I_\theta = \sum_{x \in X} \sum_{x' \in N(x)} Q_1(x)\, Q_2(x')\, \frac{p(x)\, Q(x'|x)}{Q_1(x)\, Q_2(x')}\, A(x', x) \log\frac{A(x', x)\, Q(x'|x)}{\tilde{p}(x')} + \sum_{x \in X} Q_1(x)\, \frac{p(x)}{Q_1(x)}\, M(x) \log\frac{M(x)}{\tilde{p}(x)} + \log\Gamma$$

which, after expressing $p(x) = \tilde{p}(x)/\Gamma$ and writing the sums as expectations under $Q_1$ and $Q_2$, gives case (a).

For case (b), we exploit the relation $\log y \leq \eta y - \log\eta - 1$, which holds for all $y > 0$ and $\eta > 0$, to upper-bound the logarithmic rejection term, and we observe that $1 - \sum_{x' \in N(x)} A(x', x)\, Q(x'|x) \leq 1 - A(\bar{x}, x)\, Q(\bar{x}|x)$ for any $\bar{x} \in N(x)$. Combining the two bounds and defining $M(x) = 1 - A(\bar{x}, x)\, Q(\bar{x}|x)$, with $\bar{x}$ drawn uniformly from $N(x)$, we obtain case (b).

Table 4. Summary of the properties of different approaches.

Method | Target likelihood evaluations per sampling step | Number of variables modified per sampling step
Gibbs 2 | 4 | 2
Gibbs 4 | 16 | 4
Gibbs 10 | 1024 | 10
HB-10-1 | 20 | 1
HB-10-2 | 180 | 2
HB-10-3 | 960 | 3
FLSB 2 | 1 | 1

D. Hyperparameters Used in the Experiments

- Learning rate γ = 1e-2 for the SGD optimizer with momentum.
- Burn-in iterations: K = 2000 (Ising), K = 24000 (RBM), K = 500 (UAI).
- Iterations for sampling: 30000 (Ising), 120000 (RBM), 10000 (UAI).
- Batch size: N = 30 (Ising), N = 16 (RBM), N = 5 (UAI).
- MLP network with one hidden layer of 10 neurons (Ising and RBM) and monotonic network with 20 blocks of 20 neurons (UAI). Refer to Appendix G for additional experiments on UAI with the MLP.
- For the experiments on RBMs, we reused the code provided by (Grathwohl et al., 2021).

E. Further Results for Ising

See Figure 6.
F. Additional Experiments for RBM

We provide additional comparisons of FLSB 2 against sampling strategies modifying more than one variable per step. In particular, we compare FLSB 2 against Gibbs 2, Gibbs 4 and Gibbs 10, and also against HB-10-1, HB-10-2 and HB-10-3. Results are shown in Figures 8 and 9, respectively. It is also important to mention that FLSB 2 uses a much smaller number of target likelihood evaluations than these approaches, as summarized in Table 4.

G. Additional Experiments on UAI

We repeated the experiments on UAI using an MLP network with one hidden layer of 10 neurons, shown in Figure 10. We found that in this domain, characterized by large close-to-zero mass regions, the initial random input x is likely to fall outside the distribution support. In such a case, there is no need to have balancing functions with g(0) > 0. Notably, this does not occur when using a monotonic network, as inspected by the experiments in Figure 11, where we train the monotonic network to match a function like max{1, t}.

Figure 6. Traceplots on four cases of the Ising model (30×30) for the mixing phase. (a) Case 1: independent-noisy, (b) case 2: independent-clean, (c) case 3: dependent-noisy, (d) case 4: dependent-clean.

Figure 7. Traceplots on the UAI benchmarks (100 variables, near-deterministic dependencies) for the mixing phase.

Figure 8. Comparison of FLSB 2 against block Gibbs sampling with block sizes of 2, 4 and 10 variables on RBMs.

Figure 9. Comparison of FLSB 2 against the Hamming Ball sampler using 10 variables per block and updating 1, 2 and 3 variables per step.

Figure 10. Samplers' performance on Markov networks from the UAI competition (100 variables) with the MLP network. (a) MN 1 (burn-in), (b) MN 2 (burn-in), (c) MN 1 (mixing), (d) MN 2 (mixing): (a)-(b) are the traceplots for the burn-in phase, while (c)-(d) show the autocorrelation function for the mixing phase.

Figure 11. Training the monotonic network to match the max{1, t} balancing function. Panels show iterations 0, 1000, 2000, 3000, 4000, 5000 and 6000.