# Generative Marginalization Models

Sulin Liu, Peter J. Ramadge, Ryan P. Adams (Princeton University)

Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria. PMLR 235, 2024.

We introduce marginalization models (MAMs), a new family of generative models for high-dimensional discrete data. They offer scalable and flexible generative modeling by explicitly modeling all induced marginal distributions. Marginalization models enable fast approximation of arbitrary marginal probabilities with a single forward pass of the neural network, which overcomes a major limitation of arbitrary marginal inference models such as any-order autoregressive models. MAMs also address the scalability bottleneck encountered in training any-order generative models for high-dimensional problems in the context of energy-based training, where the goal is to match the learned distribution to a given desired probability (specified by an unnormalized log-probability function such as an energy or reward function). We propose scalable methods for learning the marginals, grounded in the concept of marginalization self-consistency. We demonstrate the effectiveness of the proposed model on a variety of discrete data distributions, including images, text, physical systems, and molecules, under both maximum likelihood and energy-based training. MAMs achieve orders-of-magnitude speedups in evaluating marginal probabilities in both settings. For energy-based training tasks, MAMs enable any-order generative modeling of high-dimensional problems beyond the scale of previous methods. Code is available at github.com/PrincetonLIPS/MaM.

## 1. Introduction

Deep generative models have enabled remarkable progress across diverse fields, including image generation, audio synthesis, natural language modeling, and scientific discovery. However, there remains a pressing need to better support efficient probabilistic inference for key questions involving marginal probabilities p(x_S) and conditional probabilities p(x_U | x_V), for appropriate subsets S, U, V of the variables. The ability to directly address such quantities is critical in applications such as outlier or machine-generated-content detection [59, 48], masked language modeling [15, 85], image inpainting [86], and constrained protein/molecule design [81, 65]. Furthermore, the capacity to conduct such inferences for arbitrary subsets of variables empowers users to leverage the model according to their specific needs and preferences. For instance, in protein design, scientists may want to manually guide the generation of a protein from a user-defined substructure along a particular path over the relevant variables. This requires the generative model to perform arbitrary marginal inferences.

[Figure 1 graphic: comparison of MaM (MLE/EB), ARM (MLE), ARM (EB), GFlowNet-TB (MLE/EB), and ARFlow (MLE/EB) on training cost and likelihood-inference time/memory per sample (in NN forward passes), arbitrary marginal inference, and any-order versus fixed/learned-order generation.]
Figure 1. Scalability of sequential discrete generative models. The y-axis unit is the number of NN forward passes required.

Toward this end, neural autoregressive models (ARMs) [3, 38] have shown strong performance in conditional/marginal inference, based on the idea of modeling a high-dimensional joint distribution as a factorization of univariate conditionals using the chain rule of probability.
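To make the cost of this factorization concrete, the sketch below contrasts chain-rule likelihood evaluation in an ARM, which needs one network call per variable, with the single-call marginal evaluation that MAMs provide. This is a minimal illustrative sketch rather than the paper's implementation: `conditional_net` and `marginal_net` are hypothetical networks, and the code assumes a single unbatched sequence of categorical variables.

```python
import torch

def arm_log_likelihood(x, conditional_net):
    """Chain-rule log-likelihood of an ARM:
        log p(x) = sum_{d=1}^{D} log p(x_d | x_{<d}).
    `conditional_net` (hypothetical) maps a prefix x[:d] to logits over the
    K possible values of the next variable, so one sequence of length D
    costs D network forward passes.
    """
    D = x.shape[0]
    log_p = torch.zeros(())
    for d in range(D):                               # O(D) network calls
        logits = conditional_net(x[:d])              # shape: (K,)
        log_p = log_p + torch.log_softmax(logits, dim=-1)[x[d]]
    return log_p

def mam_log_marginal(x_masked, marginal_net):
    """A marginalization model instead reads a partially observed input
    (unobserved entries masked) and outputs log p(x_S) directly, so any
    marginal or full joint likelihood costs a single forward pass."""
    return marginal_net(x_masked)                    # O(1) network calls
```

The loop is exactly the chain-rule factorization; the D-versus-1 gap in network calls is the scaling difference summarized in Figure 1.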
Many efforts have been made to scale up ARMs and enable any-order generative modeling in the maximum likelihood estimation (MLE) setting [38, 78, 24], and great progress has been made in applications such as masked language modeling [85] and image inpainting [24]. However, evaluating the marginal likelihood of a sequence of D variables still requires O(D) neural network forward passes with the most widely used modern architectures (e.g., Transformers [80] and U-Nets [62]). This scaling makes it difficult to evaluate likelihoods on the long sequences arising in data such as natural language and proteins.

[Figure 2 graphic: three SELFIES strings for the same ring system whose first token is [?], [Cl], or [F], with probability bars illustrating pθ([?]...) = pθ([Cl]...) + pθ([F]...); on the corresponding bit strings, p(?010??) = p(0010??) + p(1010??), where x1 is the variable marginalized out.]
Figure 2. Marginalization models (MAMs) enable estimation of any marginal probability with a neural network θ that learns to marginalize out variables. The figure illustrates marginalization of a single variable on bit strings (representing molecules) with two alternatives (versus K in general) for clarity. The bars represent probability masses.

In addition to MLE, the energy-based training (EB) setting has recently received growing interest because of its applications in the sciences [49, 12, 35]. Instead of empirical data samples, we only have access to an unnormalized (log-)probability function (specified by a reward or energy function) that can be evaluated pointwise and that the generative model must match. In such settings, ARMs are limited to fixed-order generative modeling and lack scalability in training: the subsampling techniques developed to scale the training of conditionals under MLE are no longer applicable when matching log-probabilities in energy-based training (see Section 4.3 for details).

To enhance scalability and flexibility in the generative modeling of discrete data, we propose a new family of generative models, marginalization models (MAMs), that directly model the marginal distribution p(x_S) for any subset of variables x_S of x. Direct access to marginals has two important advantages: 1) it significantly speeds up inference for any marginal, and 2) it enables scalable training of any-order generative models under both the MLE and EB settings. The unique structure of the model allows it to simultaneously represent the coupled collection of all marginal distributions of a given discrete joint probability mass function. For the model to be valid, it must be consistent with the sum rule of probability, a condition we refer to as marginalization self-consistency (see Figure 2 and the toy sketch below); learning to enforce this condition with scalable training objectives is one of the key contributions of this work.

We show that MAMs can be trained under both maximum likelihood and energy-based training with scalable learning objectives. We demonstrate the effectiveness of MAMs in both settings on a variety of discrete data distributions, including binary images, text, physical systems, and molecules. We empirically show that MAMs achieve orders-of-magnitude speedups in marginal likelihood evaluation. For energy-based training, MAMs scale the training of any-order generative models to high-dimensional problems that previous methods cannot reach.

## 2. Background

We first review two prevalent settings for training a generative model: maximum likelihood estimation and energy-based training. Then we introduce autoregressive models.
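As a concrete complement to Figure 2, the toy example below checks the marginalization self-consistency (sum-rule) constraint by brute force on a two-variable binary joint distribution. This is an illustrative sketch only: the probability values are made up, and a real MAM replaces the explicit summation with a learned, single-forward-pass estimate of each marginal that is trained to satisfy the same identity.

```python
import math

# Toy joint over two binary variables (probability values are illustrative).
p_joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def marginal(observed):
    """Sum rule: p(x_S) is the sum of the joint over all settings of the
    unobserved variables. `observed` maps variable index -> observed value."""
    return sum(p for x, p in p_joint.items()
               if all(x[i] == v for i, v in observed.items()))

# Marginalization self-consistency for x1 (cf. Figure 2, where the first
# position of the string is marginalized out):
#     p(x2 = 0) = p(x1 = 0, x2 = 0) + p(x1 = 1, x2 = 0)
lhs = marginal({1: 0})
rhs = marginal({0: 0, 1: 0}) + marginal({0: 1, 1: 0})
assert math.isclose(lhs, rhs)
```

In a MAM, each side of this identity is estimated by a single network evaluation, and training penalizes the mismatch between the two for randomly chosen variables and orderings, which enforces the sum rule without ever summing over all completions.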
**Maximum likelihood (MLE)**  Given a dataset D = {x^{(i)}}_{i=1}^{N} drawn i.i.d. from a data distribution p = p_data, we aim to learn the distribution pθ(x) via maximum likelihood estimation:

$$\max_\theta \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_\theta(x)\big] \;\approx\; \frac{1}{N}\sum_{i=1}^{N} \log p_\theta\big(x^{(i)}\big), \tag{1}$$

which is equivalent to minimizing the Kullback-Leibler divergence under the empirical distribution, i.e., minimizing D_KL(p_data(x) || pθ(x)). This is the setting most commonly used for generating images (e.g., diffusion models [69, 22, 70]) and language (e.g., GPT [58]), where we can easily draw observed data from the distribution.

**Energy-based training (EB)**  In other cases, data from the distribution are not available. Instead, we have access to an unnormalized probability distribution f(·), typically specified as f(x) = exp(r(x)/τ), where r(x) is an energy (or reward) function and τ > 0 is a temperature parameter. In this setting, the objective is to match pθ(x) to f(x)/Z, where Z is the normalization constant of f. This can be done by minimizing the KL divergence [49, 84, 12]:

$$\min_\theta \; D_{\mathrm{KL}}\big(p_\theta(x) \,\|\, f(x)/Z\big). \tag{2}$$

The reward function r(x) can be defined either by human preferences or by the physical system from first principles. For example, (a) in aligning large language models, r(x) can represent human preferences [51, 50]; (b) in molecular/material design, it can specify the proximity of a sample's measured or calculated properties to some functional desiderata [2]; and (c) in modeling the thermodynamic equilibrium ensemble of physical systems, it is the (negative) energy function of a given sample [49, 84, 12, 35]. The training objective in Equation (2) can be optimized with a Monte Carlo estimate of its gradient via the REINFORCE algorithm [83]. A generative model pθ allows us to efficiently generate samples approximately from the target distribution, which would otherwise be much more expensive via MCMC with the energy function f(·).

**Autoregressive models**  Autoregressive models (ARMs) [3, 38] model a complex high-dimensional distribution p(x) by factorizing it into univariate conditionals using the chain rule:

$$\log p_\phi(x) = \sum_{d=1}^{D} \log p_\phi\big(x_d \mid x_{<d}\big).$$

Energy-based training of MAM (one iteration of the training loop):
- Sample x ∼ q(x) via persistent block Gibbs sampling.
- Sample d ∼ U(1, …, D) and an ordering σ ∼ U(S_D).
- L_penalty ← squared error of Equation (7), for d and σ with x.
- ∇_{θ,ϕ} D_KL ← REINFORCE estimate with x.
- ∇_{θ,ϕ} ← ∇_{θ,ϕ} D_KL + λ ∇_{θ,ϕ} L_penalty.
- Update θ and ϕ with the gradient.

## A.4. Connections between MAMs and GFlowNets

In this section, we identify an interesting connection between generative marginalization models and GFlowNets. The two types of models are designed with different motivations: GFlowNets are motivated by learning a policy that generates according to an energy function, whereas MAMs are motivated by any-order generation through learning to perform marginalization. Under certain conditions, however, the marginalization self-consistency condition derived from the definition of marginals in Equation (4) is equivalent to the detailed balance constraint in GFlowNets.

Observation 1. When the directed acyclic graph (DAG) used for generation in a GFlowNet is specified by the following conditions, there is an equivalence between the marginalization self-consistency condition in Equation (7) for MAM and the detailed balance constraint proposed for GFlowNets [4]. In particular, the pθ(xσ(d)|xσ(