# XOR-CD: Linearly Convergent Constrained Structure Generation

Fan Ding 1, Jianzhu Ma 2, Jinbo Xu 3, Yexiang Xue 1

1 Department of Computer Science, Purdue University, West Lafayette, USA. 2 Institute for Artificial Intelligence, Peking University, Beijing, China. 3 Toyota Technological Institute at Chicago, Illinois, USA. Correspondence to: Fan Ding, Yexiang Xue.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021.

We propose XOR-Contrastive Divergence learning (XOR-CD), a provable approach for constrained structure generation, which remains difficult for state-of-the-art neural network and constraint reasoning approaches. XOR-CD harnesses XOR-Sampling to generate samples from the model distribution in CD learning and is guaranteed to generate valid structures. In addition, XOR-CD has a linear convergence rate towards the global maximum of the likelihood function within a vanishing constant when learning exponential family models. Constraint satisfaction enabled by XOR-CD also boosts its learning performance. Our real-world experiments on data-driven experimental design, dispatching route generation, and sequence-based protein homology detection demonstrate the superior performance of XOR-CD compared to baseline approaches in generating valid structures as well as capturing the inductive bias in the training set.

1. Introduction

Generative modeling has received tremendous success in recent years. Notable examples include image synthesis (Goodfellow et al., 2014; Radford et al., 2015; Isola et al., 2017; Brock et al., 2018), music composition (Briot et al., 2017; Engel et al., 2017; Prenger et al., 2019), molecule synthesis and drug discovery (Liu et al., 2018; Kusner et al., 2017; Jin et al., 2018), and more. Nevertheless, learning generative models over a constrained space still remains a major research challenge.

Figure 1. An illustration of the sequence-based protein homology detection problem. (Left) The problem is to predict the alignment of two amino acid sequences S1 and S2, where one amino acid from one sequence can be aligned either to one amino acid from the other sequence (a match) or to a gap (an insertion, marked by "-"). (Right) Biological constraint: such an alignment must form a path from the top-left to the bottom-right corner of the alignment matrix, where a diagonal transition represents a match, and a horizontal or a vertical transition represents an insertion.

Take the example of protein homology detection, an important biological application considered in this paper: the task is to align two protein sequences, where each amino acid in one sequence can be aligned either to an amino acid in the other sequence or to a gap. Biological constraints require that such an alignment form a continuous path from the top-left corner to the bottom-right corner in the alignment matrix (see Figure 1). Given two protein sequences, the learning problem is to predict which alignment is more likely based on samples from the training set. Previous constraint satisfaction approaches (Rychlewski et al., 2000) identify an alignment satisfying all constraints, yet are optimal only with respect to a rigid, expert-defined objective, which may fail to include essential biological factors. Learning-based approaches (Söding, 2005; Ma et al., 2014) harness highly flexible neural models for alignment prediction.
However, the predictions often violate constraints (i.e., they do not form paths). Similar challenges are present in many real-world problems: it is difficult to generate structures which simultaneously (i) satisfy constraints, and (ii) effectively capture the inductive bias present in the training set.

We present XOR-CD, a constrained generative model based on contrastive divergence and XOR-Sampling, which converges to the global optimum of the likelihood function of an exponential family model within a vanishing constant in a linear number of steps. Rather than using Markov Chain Monte Carlo (MCMC) to sample from the current model distribution as in classical contrastive divergence, XOR-CD generates samples using XOR-Sampling, a recent approach with provable guarantees. XOR-Sampling reduces the sampling problem into queries to NP oracles subject to randomized XOR constraints. The empirical probability of getting one sample can be bounded within a constant multiplicative factor of the true probability. Our main contribution is to embed XOR-Sampling into contrastive divergence learning, which yields a learning approach with provable guarantees. XOR-CD advances the state-of-the-art in constrained structure generation in the following ways:

1. Constraint Satisfaction: Because XOR-CD reduces the sample generation problem into constraint satisfaction subject to randomized constraints, the structures generated from XOR-CD are guaranteed to satisfy constraints, addressing a key limitation of many neural network based structure generation approaches.

2. Linear Convergence Speed to the Global Optimum: We are able to prove that XOR-CD has a linear convergence rate towards the global maximum of the likelihood function within a vanishing constant when learning exponential family models.

3. Constraint Satisfaction Improves Learning Performance: We observe empirically that XOR-CD learns faster and better than state-of-the-art neural-based approaches in constrained generation domains. We hypothesize that the improvement in learning performance is due to better constraint satisfaction offered by XOR-CD. Because the samples generated from the model distribution always satisfy constraints, XOR-CD can focus on learning the structural differences between the samples generated from the model distribution and those from the data. Baseline approaches, on the contrary, spend most of their time struggling to generate valid structures.

We demonstrate the power of XOR-CD on three real-world applications. Aside from the protein homology detection problem, our second application is optimal experiment design, which tests n crops in an n-by-n field. Agriculturalists require every crop to be planted exactly once in every row and column, forming a so-called Latin square. Yet, other implicit criteria can only be learned from a dataset of good designs, therefore making it a learning problem. Our third application is dispatching route generation, in which we suggest routes for delivery drivers. The routes have to form a Hamiltonian cycle, visiting each requested location once and only once. Aside from this hard requirement, they also need to be similar to historical routes, satisfying drivers' implicit preferences. In all three applications, our method generates structures that not only have higher likelihood than those of competing approaches but also satisfy constraints 100% of the time, while the validity rate of competing approaches is less than 20%.
In addition, the distributions of the valid structures generated by XOR-CD closely resemble those in the training set, demonstrating that XOR-CD can successfully capture the inductive bias in the training set. Furthermore, the learned XOR-CD model can be used to complete partially filled structures. These completed structures satisfy constraints 100% of the time and are close to those in the training set.

2. Preliminaries

2.1. Exponential Family Models

We consider discrete exponential family models over random variables $X \in \mathcal{X} \subseteq \{0,1\}^n$ with parameters $\theta \in \mathbb{R}^d$:

$$P_\theta(X) = c(X)\, e^{\theta^T \phi(X) - A(\theta)}, \quad (1)$$

where $c(X)$ is the carrier measure, $\phi : \mathcal{X} \to \mathbb{R}^d$ is the sufficient statistics, and $A(\theta)$ is the log partition function:

$$A(\theta) = \log \sum_{x \in \mathcal{X}} c(x)\, e^{\theta^T \phi(x)}. \quad (2)$$

Notice that $A(\theta)$ contains a discrete integral over a constrained structure space $\mathcal{X}$, which makes the entire problem computationally intractable. For example, in the protein homology detection application, $\mathcal{X}$ represents the space of all alignments that form valid paths. Given $x_1, x_2, \ldots, x_N$ from the training set, the learning problem is to find the optimal parameters which maximize the log likelihood $l(\theta) = \frac{1}{N}\sum_{i=1}^N \log P_\theta(x_i)$:

$$l(\theta) = \frac{1}{N}\sum_{i=1}^N \theta^T \phi(x_i) - A(\theta) + \text{constant}. \quad (3)$$

Given the learned model, the inference problem is to complete a partial structure so as to maximize the joint probability: $x^* = \arg\max_{x : (x_p, x) \in \mathcal{X}} P_\theta(x_p, x)$. Here, $x_p$ represents a partially filled structure, and $x$ is the assignment to the remaining variables. Back to the protein homology detection example, the learning problem is to identify which types of alignments are more likely in the training dataset, while the inference problem is to predict an alignment given the amino acid sequences of two proteins.

$A(\theta)$ is a convex function of $\theta$. Denote by $\nabla A(\theta)$ the gradient vector of $A(\theta)$. We can prove that $\nabla A(\theta)$ is the expectation of the sufficient statistics $\phi(x)$ under $P_\theta$. That is,

$$\nabla A(\theta) = E_{P_\theta}[\phi(x)] = \sum_{x \in \mathcal{X}} \phi(x)\, P_\theta(x). \quad (4)$$

2.2. Contrastive Divergence Learning

Contrastive Divergence (CD) (Hinton, 2002b) applies stochastic gradient ascent to maximize the log likelihood of an exponential family model. From Equations 3 and 4, the gradient of the log likelihood can be written as:

$$\nabla l(\theta) = E_D[\phi(X)] - E_{P_\theta}[\phi(X)]. \quad (5)$$

Here, $E_D$ denotes the expectation over the data distribution, and $E_{P_\theta}$ denotes the expectation over the current model distribution $P_\theta$. Let $\{x_1, \ldots, x_M\}$ be a mini-batch of training data. CD uses the sample mean $\frac{1}{M}\sum_{i=1}^M \phi(x_i)$ to approximate $E_D[\phi(X)]$. Denote by $x_i^{(k)}$ the sample obtained after taking $k$ Markov Chain Monte Carlo (MCMC) steps following the current model distribution $P_\theta(x)$, starting from the data point $x_i$. CD uses $\frac{1}{M}\sum_{i=1}^M \phi(x_i^{(k)})$ to approximate $E_{P_\theta}[\phi(X)]$. Overall, the gradient of the log likelihood is approximated by $g_{cd}(\theta) = \frac{1}{M}\sum_{i=1}^M \phi(x_i) - \frac{1}{M}\sum_{i=1}^M \phi(x_i^{(k)})$. CD hence iterates the update $\theta_{t+1} = \theta_t + \eta\, g_{cd}(\theta_t)$ until convergence, where $\eta$ is the learning rate.
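As a concrete illustration of the CD update above, the following is a minimal sketch of one CD-k gradient estimate and parameter step for a small exponential family model over binary vectors. This is our own toy example rather than anything from the paper: the feature map, the Gibbs sampler, the constant carrier measure c(x) = 1, and all names are assumptions made purely for illustration.

```python
import numpy as np

def phi(x):
    """Sufficient statistics: the bits themselves plus products of adjacent bits."""
    return np.concatenate([x, x[:-1] * x[1:]])

def gibbs_sweep(x, theta, rng):
    """One Gibbs sweep for P_theta(x) proportional to exp(theta^T phi(x)),
    assuming carrier measure c(x) = 1."""
    x = x.copy()
    for i in range(len(x)):
        x[i] = 0.0
        e0 = theta @ phi(x)
        x[i] = 1.0
        e1 = theta @ phi(x)
        p1 = 1.0 / (1.0 + np.exp(e0 - e1))   # P(x_i = 1 | the other coordinates)
        x[i] = float(rng.random() < p1)
    return x

def cd_gradient(theta, batch, k, rng):
    """CD-k estimate of grad l(theta) = E_D[phi] - E_{P_theta}[phi] (Equation 5)."""
    data_term = np.mean([phi(x) for x in batch], axis=0)
    chains = []
    for x in batch:                          # negative chains start from the data
        for _ in range(k):
            x = gibbs_sweep(x, theta, rng)
        chains.append(phi(x))
    return data_term - np.mean(chains, axis=0)

# Toy usage: n = 6 binary variables, feature dimension d = 6 + 5 = 11.
rng = np.random.default_rng(0)
theta = np.zeros(11)
batch = rng.integers(0, 2, size=(8, 6)).astype(float)
theta = theta + 0.1 * cd_gradient(theta, batch, k=10, rng=rng)   # one CD update
```

XOR-CD, introduced in Section 3, keeps exactly this update shape but replaces the MCMC negative samples with samples drawn by XOR-Sampling.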
2.3. XOR Sampling

Our method leverages recent advancements in XOR sampling (Ermon et al., 2013b), which reduces the sampling problem into queries to NP oracles subject to XOR constraints. XOR sampling guarantees that the probability of drawing a sample is sandwiched within a multiplicative constant of the true probability. We only present the general idea of XOR-Sampling on unweighted functions here and refer the readers to (Ermon et al., 2013b) for the weighted case. For the unweighted case, assuming $w(x)$ takes binary values, we need to draw samples from the set $W = \{x : w(x) = 1\}$ uniformly at random; i.e., if $|W| = 2^l$, then each member of $W$ should be sampled with probability $2^{-l}$.

XOR-Sampling proceeds by adding $k$ randomized XOR constraints $XOR_k(x) = 1$ to the original problem and returns an element uniformly at random from the constrained set $W_k = \{x : w(x) = 1, XOR_k(x) = 1\}$ once $|W_k|$ is small enough to be handled by an exact sampler. $k$ is increased from 1 until $|W_k|$ becomes small. Because the $k$-th XOR constraint removes, at random, half of the elements of the previous set $W_{k-1}$, one can prove a constant bound on the probability of obtaining any given sample from XOR sampling (Gomes et al., 2007a;b).

For the weighted case, one needs to draw samples from an unnormalized function $w(x)$, i.e., the probability of getting a sample $x_0$ is proportional to $w(x_0)$. The idea is to discretize $w(x)$ and transform the weighted problem into an unweighted one with additional variables. Our paper uses the constant approximation bounds of XOR sampling on weighted functions through the following theorem. The details of the discretization scheme and the choice of the parameters of the original algorithm needed to reach the bound in Theorem 1 are in the supplementary materials.

Theorem 1. (Ermon et al., 2013b) Let $1 < \delta \le 2$, $0 < \gamma < 1$, and let $w : \{0,1\}^n \to \mathbb{R}^+$ be an unnormalized probability density function with normalized distribution $P(x) \propto w(x)$. Then, with probability at least $1 - \gamma$, XOR-Sampling$(w, \delta, \gamma)$ succeeds and outputs a sample $x_0$ by querying $O(n \ln(n/\gamma))$ NP oracles. Upon success, each $x_0$ is produced with probability $P'(x_0)$ satisfying $\frac{1}{\delta} P(x_0) \le P'(x_0) \le \delta P(x_0)$. Moreover, let $\phi : \{0,1\}^n \to \mathbb{R}^+$ be a non-negative function; then the expectation of one sampled $\phi(x)$ satisfies

$$\frac{1}{\delta}\, E_{P(x)}[\phi(x)] \le E_{P'(x)}[\phi(x)] \le \delta\, E_{P(x)}[\phi(x)]. \quad (7)$$

3. XOR-Contrastive Divergence

We propose XOR-CD, a new contrastive divergence method for constrained structure generation on exponential family models, which is guaranteed to converge to the global maximum of the likelihood function within a vanishing constant in a linear number of CD iterations. XOR-CD breaks down the gradient of the log likelihood function into the divergence of the expectations of the sufficient statistics over the training data and over the current model distribution, following the CD framework. However, XOR-CD leverages XOR-Sampling to generate samples when estimating $E_{P_\theta}[\phi(X)]$.

The detailed procedure of XOR-CD is shown in Algorithm 1. XOR-CD takes as input the exponential family model $P_\theta(X)$ with sufficient statistics $\phi(X)$ and carrier measure $c(X)$, training data $\{x_i\}_{i=1}^N$, initial model parameter $\theta_0$, the learning rate $\eta$, the number of CD iterations $T$, XOR-Sampling parameters $(\delta, \gamma)$, and batch sizes $M, K$, and outputs the learned parameter $\bar\theta_T$. To approximate $E_{P_{\theta_t}}[\phi(X)]$ at step $t$, XOR-CD draws $K$ samples $x'_1, \ldots, x'_K$ from $P_{\theta_t}(X)$ using XOR-Sampling, where $K$ is a user-determined sample size. Because XOR-Sampling has a failure rate, XOR-CD repeatedly calls XOR-Sampling until all $K$ samples are obtained successfully (lines 2-6). Then, XOR-CD also draws $M$ samples from the training set $\{x_i\}_{i=1}^N$ uniformly at random to approximate $E_D[\phi(X)]$. Once all the samples are obtained, XOR-CD uses $g_t = \frac{1}{M}\sum_{j=1}^M \phi(x_j) - \frac{1}{K}\sum_{j=1}^K \phi(x'_j)$ as an approximation of the gradient of the log likelihood. $\theta$ is updated following the rule $\theta_{t+1} = \theta_t + \eta\, g_t$ for $T$ steps, where $\eta$ is the learning rate. Finally, the average $\bar\theta_T = \frac{1}{T}\sum_{t=1}^T \theta_t$ is the final output of the algorithm.

Algorithm 1 XOR-CD
Input: $\theta_0$, $c(X)$, $\phi(X)$, $T$, $\eta$, $\delta$, $\gamma$, $M$, $K$, $\{x_i\}_{i=1}^N$
1: for $t = 0$ to $T-1$ do
2:   $j \leftarrow 1$; while $j \le K$ do
3:     $x' \leftarrow$ XOR-Sampling$(c(X)\, e^{\theta_t^T \phi(X)}, \delta, \gamma)$
4:     if $x' \ne$ Failure then
5:       $x'_j \leftarrow x'$
6:       $j \leftarrow j + 1$
7:   Sample $\{x_j\}_{j=1}^M$ uniformly from $\{x_i\}_{i=1}^N$
8:   $g_t \leftarrow \frac{1}{M}\sum_{j=1}^M \phi(x_j) - \frac{1}{K}\sum_{j=1}^K \phi(x'_j)$;  $\theta_{t+1} \leftarrow \theta_t + \eta\, g_t$
9: return $\bar\theta_T = \frac{1}{T}\sum_{t=1}^T \theta_t$
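The loop of Algorithm 1 can be sketched in a few lines of code. In the sketch below, xor_sample is a brute-force stand-in for XOR-Sampling: it enumerates {0,1}^n, prunes it with random parity (XOR) constraints, and then samples exactly from the small surviving set. This sidesteps the weight discretization and the NP-oracle queries used by the real algorithm and is only feasible for toy problem sizes; all names and the failure handling are our own assumptions, not the authors' implementation.

```python
import itertools
import numpy as np

def xor_sample(log_w, n, rng, small=8):
    """Toy stand-in for XOR-Sampling over x in {0,1}^n with unnormalized weight
    exp(log_w(x)): add random parity (XOR) constraints until only a handful of
    candidates survive, then sample one of them exactly. A real implementation
    issues NP-oracle queries instead of enumerating, discretizes the weights as
    described above, and may report Failure; here Failure is an empty survivor set."""
    candidates = [np.array(b, dtype=float) for b in itertools.product([0, 1], repeat=n)]
    while len(candidates) > small:
        a = rng.integers(0, 2, size=n)          # random subset of variables
        parity = rng.integers(0, 2)             # random parity bit
        kept = [x for x in candidates if int(a @ x) % 2 == parity]
        if not kept:
            return None                         # Failure: the caller retries
        candidates = kept
    w = np.exp([log_w(x) for x in candidates])
    return candidates[rng.choice(len(candidates), p=w / w.sum())]

def xor_cd(phi, data, theta0, T, eta, M, K, rng):
    """Sketch of Algorithm 1: at each step, estimate E_D[phi] - E_{P_theta}[phi]
    with XOR samples, take a gradient ascent step, and return the averaged iterate.
    `data` is a 2-D numpy array of training structures."""
    theta, iterates, n = theta0.copy(), [], data.shape[1]
    for _ in range(T):
        samples = []
        while len(samples) < K:                 # lines 2-6: retry until K successes
            x = xor_sample(lambda z: theta @ phi(z), n, rng)
            if x is not None:
                samples.append(x)
        batch = data[rng.choice(len(data), size=M)]
        g = (np.mean([phi(x) for x in batch], axis=0)
             - np.mean([phi(x) for x in samples], axis=0))
        theta = theta + eta * g
        iterates.append(theta)
    return np.mean(iterates, axis=0)            # averaged parameters, as in line 9
```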
3.1. Linear Convergence to the Global Optimum

We can show that XOR-CD converges to the global optimum of the log likelihood function up to a vanishing term. Moreover, the speed of the convergence is linear with respect to the number of contrastive divergence steps.

Denote $Var_D(\phi(x)) = E_D[\|\phi(x)\|_2^2] - \|E_D[\phi(x)]\|_2^2$ and $Var_{P_\theta}(\phi(x)) = E_{P_\theta}[\|\phi(x)\|_2^2] - \|E_{P_\theta}[\phi(x)]\|_2^2$ as the total variations of $\phi(x)$ with respect to the data distribution $P_D$ and the model distribution $P_\theta$. The precise mathematical theorem states:

Theorem 2. (main) Let $P_\theta(X) : \mathcal{X} \to \mathbb{R}^+$ be the exponential family model of Equation 1. Given data points $\{x_i\}_{i=1}^N$, the log likelihood is $l(\theta) = \frac{1}{N}\sum_{i=1}^N \log P_\theta(x_i)$. Denote $OPT = \max_\theta l(\theta)$. Let $Var_D(\phi(x)) \le \sigma_1^2$, $\max_\theta Var_{P_\theta}(\phi(x)) \le \sigma_2^2$, and $\|E_{P_\theta}[\phi(x)]\|_2^2 \le \varepsilon^2$. Suppose $1 < \delta \le 2$ is used in XOR-Sampling, the learning rate $\eta$ is set sufficiently small as a function of $\delta$, and $\bar\theta_T$ is the output of XOR-CD. We have $OPT - E[l(\bar\theta_T)] \le \frac{\delta\,\|\theta_0 - \theta^*\|_2^2}{2\eta T}$ plus a tail term that depends on $\sigma_1^2$, $\sigma_2^2$, $\varepsilon^2$, $\delta$, and the batch sizes $M$ and $K$.

XOR-CD is the first provable algorithm which converges, up to a tail term, to the global maximum of the likelihood function for exponential family models. Moreover, the rate of convergence is linear in the number of SGD iterations $T$. Previous approaches do not have such tight bounds. Variational inference approaches such as Variational Auto-encoders (VAEs) (Kingma & Welling, 2013) optimize the Evidence Lower Bound (ELBO). However, the gap between the lower bound and the true likelihood can become arbitrarily large. Expectation Propagation (EP) methods (Minka, 2013; Dehaene & Barthelmé, 2015) compute an upper bound of the likelihood, which can be arbitrarily loose as well. Various Generative Adversarial Nets (GANs) (Goodfellow et al., 2014; Radford et al., 2015; Isola et al., 2017) and flow models (Kingma & Dhariwal, 2018; Prenger et al., 2019) do not have theoretical bounds.

The main challenge in proving Theorem 2 lies in the fact that we cannot ensure the unbiasedness of the gradient. Because the partition function $A(\theta)$ is convex with respect to $\theta$ in exponential family models, a gradient ascent algorithm can be proven to be linearly convergent towards the maximum of the likelihood function if the expectation of the estimated gradient is unbiased, i.e., $E[g_t] = \nabla l(\theta_t)$. However, even though we apply XOR-sampling, we still cannot guarantee the unbiasedness of $g_t$. Instead, using Theorem 1, our bounds for $g_t$ take the following form:

$$\frac{1}{\delta}\,[\nabla l(\theta_t)]^+ \le E[g_t^+] \le \delta\,[\nabla l(\theta_t)]^+, \quad (8)$$

$$\delta\,[\nabla l(\theta_t)]^- \le E[g_t^-] \le \frac{1}{\delta}\,[\nabla l(\theta_t)]^-. \quad (9)$$

Here, $[f]^+$ denotes the positive part of $f$, i.e., $[f]^+ = \max\{f, 0\}$, and $[f]^-$ denotes the negative part of $f$, i.e., $[f]^- = \min\{f, 0\}$. The bounds in Equations 8 and 9 can be proven from the fact that $\nabla l(\theta) = E_D[\phi(X)] - E_{P_\theta}[\phi(X)]$ by applying Equation 7. The proof of Theorem 2 relies on the following new result (Theorem 3) on Stochastic Gradient Descent (SGD) algorithms which only have access to constant-factor approximate gradient vectors. As far as we know, previous SGD convergence analyses largely require the unbiasedness of the gradient. We are the first to extend SGD convergence bounds to biased cases. Theorem 3 requires the function $f$ to be $L$-smooth; $f(\theta)$ is $L$-smooth if and only if $\|\nabla f(\theta_1) - \nabla f(\theta_2)\|_2 \le L\,\|\theta_1 - \theta_2\|_2$. Notice that the conditions of Theorem 2 automatically guarantee the $L$-smoothness of the log likelihood.

Theorem 3. Let $f : \mathbb{R}^d \to \mathbb{R}$ be an $L$-smooth convex function and $\theta^* = \arg\min_\theta f(\theta)$. In iteration $t$ of SGD, $g_t$ is the estimated gradient, i.e., $\theta_{t+1} = \theta_t - \eta\, g_t$.
If $Var(g_t) \le \sigma^2$, and there exists $c \ge 1$ such that $\frac{1}{c}[\nabla f(\theta_t)]^+ \le E[g_t^+] \le c\,[\nabla f(\theta_t)]^+$ and $c\,[\nabla f(\theta_t)]^- \le E[g_t^-] \le \frac{1}{c}[\nabla f(\theta_t)]^-$, then for any $T > 1$ and a step size $\eta$ chosen sufficiently small as a function of $c$ and $L$, letting $\bar\theta_T = \frac{1}{T}\sum_{t=1}^T \theta_t$, we have $E[f(\bar\theta_T)] - f(\theta^*) \le \frac{c\,\|\theta_0 - \theta^*\|_2^2}{2\eta T}$ plus a term proportional to the gradient variance $\sigma^2$.

The proofs of Theorems 2 and 3 are left to the supplementary materials. Here we outline the sketch of the proof of Theorem 3.

Proof. (sketch for Theorem 3) One can show that, under the conditions of Theorem 3, we must have (via Lemma 2, stated and proved in the supplementary materials):

$$\frac{1}{c}\,\|E[g_t]\|_2^2 \le \langle \nabla f(\theta_t), E[g_t] \rangle \le c\,\|E[g_t]\|_2^2,$$
$$\frac{1}{c}\,\langle E[g_t], \theta_t - \theta^* \rangle \le \langle \nabla f(\theta_t), \theta_t - \theta^* \rangle \le c\,\langle E[g_t], \theta_t - \theta^* \rangle.$$

By the $L$-smoothness of $f$, for the $t$-th iteration,

$$f(\theta_{t+1}) \le f(\theta_t) + \langle \nabla f(\theta_t), \theta_{t+1} - \theta_t \rangle + \frac{L}{2}\,\|\theta_{t+1} - \theta_t\|_2^2 = f(\theta_t) - \eta\,\langle \nabla f(\theta_t), g_t \rangle + \frac{L\eta^2}{2}\,\|g_t\|_2^2. \quad (11)$$

By the convexity of $f$, we have

$$f(\theta_t) \le f(\theta^*) + \langle \nabla f(\theta_t), \theta_t - \theta^* \rangle. \quad (12)$$

Combining Lemma 2 with Equations 11 and 12, one can show that $E[f(\theta_{t+1})] - f(\theta^*) \le \frac{c}{2\eta}\, E\big[\|\theta_t - \theta^*\|_2^2 - \|\theta_{t+1} - \theta^*\|_2^2\big]$ plus a term proportional to $\sigma^2$. Summing this inequality over $t = 0, \ldots, T-1$ gives $\sum_t E[f(\theta_{t+1}) - f(\theta^*)] \le \frac{c}{2\eta}\big(\|\theta_0 - \theta^*\|_2^2 - E[\|\theta_T - \theta^*\|_2^2]\big)$ plus the accumulated variance term. Finally, by Jensen's inequality, $T\, E[f(\bar\theta_T)] \le \sum_{t=1}^T E[f(\theta_t)]$. Combining the above, we get $E[f(\bar\theta_T)] \le f(\theta^*) + \frac{c\,\|\theta_0 - \theta^*\|_2^2}{2\eta T}$ plus the variance term. This completes the proof.

Figure 2. An intuitive explanation of why constraint satisfaction enabled by XOR-CD improves the overall learning performance. (a) Updating steps of traditional CD. (b) Updating steps of XOR-CD. In the training data, valid structures are scattered across several isolated regions due to combinatorial constraints (green curves in both plots denote the training data distribution and green dots denote the training samples). Many negative samples generated by traditional CD do not satisfy constraints (blue triangles in the left plot). Therefore, traditional CD spends many iterations minimizing the likelihood of invalid structures (blue arrows in the left plot). XOR-CD converges faster to the ground-truth data distribution because all of its updates are used to match the data distribution within the regions that satisfy constraints (orange arrows in the right plot).

The proof of Theorem 2 applies Theorem 3 to the log likelihood function, noting that $l(\theta)$ is $L$-smooth when the total variation $Var(\phi(x))$ is bounded (proved by a separate lemma). The lemmas and the proofs are left to Section B.4 of the supplementary materials. Theorem 2 states that, in expectation, the difference between the output $\bar\theta_T$ of the XOR-CD algorithm and the true optimum $OPT$ is bounded by a term that is inversely proportional to the number of iterations $T$ plus a tail term involving $\sigma_1^2$, $\sigma_2^2$, $\varepsilon^2$, $\delta$, and the batch sizes. To reduce the tail term for a fixed number of steps $T$, we can generate more samples at each iteration to reduce the variance (increase $M$ and $K$). In addition, to quantify the computational complexity of XOR-CD, we prove the following theorem in the supplementary materials detailing the number of queries to NP oracles needed by XOR-CD.

Theorem 4. XOR-CD in Algorithm 1 uses $O(Tn \ln\frac{n}{\gamma} + TK)$ queries to NP oracles.
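As a quick numerical illustration of the setting of Theorem 3 (our own toy check, not part of the paper): even when every gradient coordinate is distorted by a random multiplicative factor between 1/c and c, which preserves the sign pattern required by the theorem, plain gradient descent on a smooth convex function still drives the averaged iterate close to the optimum. The quadratic objective and the constant step size below are arbitrary choices; the theorem's exact step-size condition in c and L is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, c, T, eta = 10, 2.0, 2000, 0.05
theta_star = rng.normal(size=d)               # optimum of f(theta) = 0.5 * ||theta - theta_star||^2
theta, iterates = np.zeros(d), []

for t in range(T):
    grad = theta - theta_star                 # the true gradient of f
    factor = rng.uniform(1.0 / c, c, size=d)  # multiplicative, sign-preserving distortion
    theta = theta - eta * factor * grad       # descent step with the biased estimate
    iterates.append(theta)

theta_bar = np.mean(iterates, axis=0)
print(np.linalg.norm(theta_bar - theta_star))  # small: the averaged iterate lands near the optimum
```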
3.2. Constraint Satisfaction Improves Learning

In the previous section we proved the theoretical convergence of XOR-CD towards the global optimum of the likelihood function. Although XOR-CD has to query NP oracles, which are significantly more expensive than, e.g., MCMC sampling, we notice that XOR-CD converges to the global optimum faster than classical CD approaches in real-world experiments, especially on constrained structure generation problems (see the experiment section). Notice that our observation is different from that of (Hinton, 2002a), where CD is observed to work reasonably well even if the number of MCMC steps $k$ is kept far smaller than that required for good mixing. We attribute the observational difference to the types of problems we consider, which are mainly constrained structure generation problems. The capability to generate negative samples that satisfy constraints becomes important for likelihood learning in this setting.

We use Figure 2 as an intuitive explanation of why the constraint satisfaction enabled by XOR-CD leads to an improvement in learning performance. Figure 2(a) depicts one iteration of a traditional CD process. Here, CD tries to match the current model distribution (shown by the blue dashed line) with the data distribution (shown by the green line), by increasing the likelihood of the training samples and decreasing the likelihood of the negative samples generated, typically, by MCMC (denoted by the pulling of the blue arrows). Because MCMC does not guarantee the constraint satisfaction of the negative samples, traditional CD spends much time pulling down the model likelihood in regions which violate constraints. On the contrary, Figure 2(b) depicts one iteration of XOR-CD. Because the negative samples are generated provably by XOR-sampling, they satisfy constraints. This allows XOR-CD to focus on matching the likelihood within the regions that satisfy constraints, hence leading to faster matching of the data distribution.

4. Related Work

There is a fruitful line of work on generative machine learning. Energy-based models (Hinton & Salakhutdinov, 2006; Bengio & Delalleau, 2009; Carreira-Perpinan & Hinton, 2005) take advantage of either exponential families (Hinton, 2002b; Jiang et al., 2018; Durkan et al., 2020) or neural networks (Belanger & McCallum, 2016; Belanger et al., 2017) for structure modelling. Qiu et al. (2019) leverage coupling of Markov chains to obtain unbiased samples within the contrastive divergence framework, which, however, is hard to achieve in practice. Score matching based methods (Bao et al., 2020; Song & Ermon, 2020; Pang et al., 2020) estimate the score function in order to get rid of the intractable partition function. Deep generative models such as graph neural networks (Grover et al., 2019; Zhou et al., 2018) and normalizing flow models (Kingma & Dhariwal, 2018; Prenger et al., 2019) have been widely used recently. Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Radford et al., 2015; Isola et al., 2017) learn the structure in a likelihood-free manner. While learning the evidence lower bound, soft constraints have been embedded into Variational Auto-Encoders (VAEs) (Kingma & Welling, 2013) for molecule design (Kusner et al., 2017; Jin et al., 2018). However, these deep generative methods can hardly deal with hard combinatorial constraints.

Previous approaches embed machine learning models into optimization by, e.g., integrating neural networks and decision trees with constraint programming (Lallouet & Legtchenko, 2007), or introducing a Neuron global constraint that represents a pre-trained neural network (Lombardi & Gualandi, 2016; Lombardi et al., 2017).
Machine learning approaches have also been used to solve constraint reasoning and optimization problems (Galassi et al., 2018; Vinyals et al., 2015; Khalil et al., 2017). Graves et al. (2016) employ neural networks for discrete structure generation, while Wang et al. (2019), Amos & Kolter (2017), Agrawal et al. (2019), and de Avila Belbute-Peres et al. (2020) integrate logical reasoning and differentiable optimization problems within deep learning architectures. Parity constraints have been proposed for both sampling (Gomes et al., 2007a; Ermon et al., 2013b) and counting problems (Ermon et al., 2013a; Chakraborty et al., 2014; Achlioptas & Theodoropoulos, 2017; Achlioptas et al., 2018; Ding et al., 2019) in probabilistic inference. These approaches provide constant approximation guarantees on either the probability of the samples or the estimated values of discrete integration.

5. Experiments

In this section we show the superior performance of XOR-CD on four structure generation experiments: one on synthetic data generated by a known model, and the other three on dispatching route generation, optimal experiment design, and sequence-based protein homology detection. One baseline is Contrastive Divergence CD-100, denoted Gibbs-CD, which uses Gibbs sampling (Carreira-Perpinan & Hinton, 2005) with 100 steps to obtain the samples from the model distribution. We also compare with Generative Adversarial Networks (GAN) (Goodfellow et al., 2014), belief propagation-based CD approaches (BP-CD), and the recent BPChain-CD (Fan & Xue, 2020). Various experiment settings are left to the supplementary materials. In Figure 3, we show that XOR-CD learns the highest log likelihood on a synthetic dataset (details in the supplementary materials).

Figure 3. Averaged log likelihood of 100 structures generated by different learning algorithms on a discrete exponential family model, varying the number of variables. The structures generated by XOR-CD have the highest average log-likelihoods.

5.1. Dispatching Route Generation

We consider the problem of generating dispatching routes for delivery drivers. The delivery routes need to form Hamiltonian cycles, where each location is visited once and exactly once. For this experiment, we assume the number of delivery locations $n$ is fixed, although it is not difficult to extend our approach to the cases where $n$ varies.

Figure 4. XOR-CD outperforms competing approaches by producing 100% valid delivery routes and experiment designs while capturing the inductive bias in the training data. (Left) The percentage of valid routes generated by different algorithms, varying the number of locations. (Middle) The dashed line shows the percentage of valid routes generated by different algorithms when the number of locations is 10. The bars show the distributions of these valid routes grouped by the traveling distance d. The distribution of XOR-CD closely matches that of the training data, with the smallest KL divergence (0.026). (Right) The dashed line shows the percentage of valid experiment designs generated by different algorithms on 5×5 grids. The bars show the distributions of these valid structures grouped by variance. XOR-CD generates 100% valid designs and has the smallest KL divergence (0.075) to the training data distribution.
In the learning phase, we are given a dataset of historical trips, where each trip is represented as a permutation of the $n$ locations, denoting the order in which these locations are visited. We learn an exponential family model to capture the likelihood of different permutations. Specifically, denote by $x_{i,j}$ a binary indicator variable, which is 1 if and only if the $i$-th location to visit is location $j$. The exponential family model is $\Pr(x) \propto \exp(\theta_0 + \sum_{i,j} \theta_{i,j}\, x_{i,j} + \sum_{i,j,k} \theta_{i,j,k}\, x_{i,j}\, x_{i+1,k})$, and the $\theta$'s are the parameters to learn. The learning rate is fixed at 0.1 and the total number of epochs $T$ is 500. There is also a timeout of 10 hours for all algorithms. We set both $M$ and $K$ to 100, and the parameters for XOR-Sampling were kept the same as in (Ermon et al., 2013b). We leave the data generation process, and how we add the Hamiltonian cycle constraint into XOR-CD, to the supplementary materials.

In solving the inference problem during testing, we have a series of additional constraints detailing the requirements of a new day's delivery. Such constraints include, e.g., that certain locations must be visited first, or that one location must be visited after another. Therefore, the inference problem is to find an assignment to all $x_{i,j}$ variables which maximizes the likelihood while satisfying the Hamiltonian cycle constraint and the additional constraints.

We first examine the validity of the generated routes. The left panel of Figure 4 shows the percentage of valid Hamiltonian cycles generated by different algorithms, varying the number of delivery locations from 5 to 10. We can see that XOR-CD generates 100% valid Hamiltonian structures while the competing methods at best generate 40% valid routes. The red dashed line in the middle panel shows the percentage of valid Hamiltonian cycles generated by different algorithms when the number of locations is 10.

We then examine whether the generated routes resemble those in the training data. To validate this, we evaluate the distribution of the total lengths of the generated routes. Without imposing additional constraints, the length distribution of the generated routes should closely resemble that of the training set for a successful learning algorithm. The bars in Figure 4 (middle) show such distributions for the valid routes generated by each algorithm. The last column shows the distribution of the training data. We can see the distribution from XOR-CD closely matches the data distribution, with a KL divergence of 0.026. The other approaches are worse. This indicates that XOR-CD is able to capture the inductive bias of the training set better than competing approaches.

5.2. Optimal Experiment Design

We further consider the optimal experiment design problem. Here, we generate an experiment design in the form of a Latin square, which is an $n$-by-$n$ matrix in which each entry is planted with one of crops 1 to $n$. Each crop needs to be planted exactly once in each row and column. Let $x_{i,j,k}$ be an indicator variable which is 1 if and only if crop $k$ is planted at the $(i,j)$-th entry. The exponential family model is $\Pr(x) \propto \exp(\theta_0 + \sum_{i,j,k} \theta_{i,j,k}\, x_{i,j,k} + \sum_{i,j,m,l} \theta^1_{i,j,m,l}\, \phi(x_{i,j,m}, x_{i+1,j,l}) + \sum_{i,j,m,l} \theta^2_{i,j,m,l}\, \phi(x_{i,j,m}, x_{i,j+1,l}))$. The learning parameters are set the same as in the previous task. The learning problem is to identify the values of the $\theta$'s. During testing, the inference problem is to generate experiment designs from partially filled matrices that satisfy the Latin square requirement while closely resembling those in the training set.
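To make the Latin-square encoding concrete, the following is a small sketch (our own illustration, not the experiment code) of the indicator tensor x_{i,j,k}, a validity check, and the unnormalized log-potential. Two simplifications are assumptions on our part: the pairwise feature phi is taken to be the product of the two indicators, and the pairwise parameters are shared across grid positions, whereas the model above allows position-specific parameters.

```python
import numpy as np

def is_latin_square(x):
    """x is an n×n×n 0/1 indicator tensor; x[i, j, k] = 1 iff crop k is planted in
    cell (i, j). A valid design places exactly one crop per cell and each crop
    exactly once per row and per column."""
    return ((x.sum(axis=2) == 1).all()      # one crop per cell
            and (x.sum(axis=1) == 1).all()  # each crop once per row
            and (x.sum(axis=0) == 1).all()) # each crop once per column

def log_potential(x, theta0, theta_unary, theta_v, theta_h):
    """Unnormalized log-probability theta^T phi(x) for the model sketched above.
    theta_unary has shape (n, n, n); theta_v and theta_h have shape (n, n) and
    couple the crop in one cell with the crop in the cell below / to the right."""
    n = x.shape[0]
    val = theta0 + np.sum(theta_unary * x)
    for i in range(n - 1):
        for j in range(n):
            val += x[i, j, :] @ theta_v @ x[i + 1, j, :]   # vertical neighbours
    for i in range(n):
        for j in range(n - 1):
            val += x[i, j, :] @ theta_h @ x[i, j + 1, :]   # horizontal neighbours
    return val

# Usage on a 5x5 design given as crop labels 0..4 per cell:
labels = np.array([[(i + j) % 5 for j in range(5)] for i in range(5)])
x = np.eye(5)[labels]                        # one-hot encode -> shape (5, 5, 5)
assert is_latin_square(x)
theta_unary = np.zeros((5, 5, 5))
theta_v = theta_h = np.zeros((5, 5))
print(log_potential(x, 0.0, theta_unary, theta_v, theta_h))   # 0.0 for all-zero parameters
```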
We first examine the validity of the experiment designs. Here we consider generating 5×5 Latin squares. The percentage of valid Latin squares is shown as the red dashed curve in Figure 4 (right). Again, XOR-CD generates 100% valid experiment designs while competing approaches at best generate 20%. To judge how well the generated Latin squares resemble those in the training set, we evaluate the distribution of the spatial variance of the generated Latin squares. This metric has been used as an additional criterion for good experiment designs (see, e.g., Gomes et al., 2004; Smith et al.; Le Bras et al., 2012). Without partially filled cells, the generated Latin squares should be close in spatial variance to those in the training set. The bars in Figure 4 (right) show that the distribution of the spatial variance of the Latin squares generated by XOR-CD matches that of the training set best, with a KL divergence of 0.075. In addition, Figure 5 shows that XOR-CD is able to complete a partially filled Latin square resembling those in the training set, while Gibbs-CD and GAN cannot generate valid structures.

Figure 5. XOR-CD (column 4) generates valid Latin squares from partially filled structures (column 1), which all have variance 4, the most frequent variance in the training dataset. Gibbs-CD (column 2) and GAN (column 3) cannot generate valid Latin squares (constraint violations shown in red boxes).

5.3. Sequence-based Protein Homology Detection

We also consider a real-world task, sequence-based protein homology detection. Our approach is based upon comparing protein sequence profiles, which are derived from multiple sequence alignments (MSAs) of homologs in a protein family. Let S1 and S2 be two sequences of amino acids. Our goal is to align the two sequences. Our tasks are: (1) (learning) given a dataset of aligned pairs of amino acid sequences, learn the likelihoods of different alignments of the two sequences; (2) (inference) given a new pair of amino acid sequences, determine their most likely alignment. The exponential family model for protein alignment is similar to the one used in (Ma et al., 2014), where we use flow constraints to guarantee that the solution forms a valid path in the alignment matrix. The details are left to the supplementary materials.

We constructed the training set from the PDB40 dataset (Wu & Xu, 2020). Following the practice of (Ma et al., 2014), the reference alignment (ground truth) is generated by DeepAlign (Wang et al., 2013). Our experiment data include those pairs whose ground-truth alignment has fewer than 50 gap positions (excluding the gaps at the beginning and end) and whose total length is at most 200. The test set is made up of 50 randomly sampled sequences from the PDB40 dataset, separated from the training set.

Table 1. (Upper) XOR-CD and dynamic programming generate 100% valid alignments for protein homology detection, while Gibbs-CD cannot. (Lower) Precision and recall of the alignments found by different approaches. XOR-CD outperforms the two baselines by a large margin, even when both metrics are calculated taking into account both valid and invalid alignments. 4-offset is a relaxed measure.

|              | Program | Gibbs-CD | XOR-CD |
|--------------|---------|----------|--------|
| valid align. | 100%    | 0%       | 100%   |

|          | Precision (exact match) | Precision (4-offset) | Recall (exact match) | Recall (4-offset) |
|----------|-------------------------|----------------------|----------------------|-------------------|
| Program  | 28.7%                   | 39.5%                | 33.5%                | 41.2%             |
| Gibbs-CD | 39.6%                   | 47.8%                | 37.9%                | 45.4%             |
| XOR-CD   | 48.8%                   | 54.3%                | 45.3%                | 52.1%             |

Following common practice, we use precision and recall as evaluation metrics. Precision is the fraction of correctly aligned amino acid pairs among all predicted pairs, and recall is the fraction of correctly aligned pairs among all ground-truth pairs. Notice these two metrics are local and can be computed even when the global alignment is invalid (does not form a path in the alignment matrix).
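For reference, here is a small sketch of how these metrics can be computed from predicted and ground-truth alignments, each given as a set of (i, j) pairs meaning that position i of S1 is matched to position j of S2. The data structures and the exact reading of the relaxed 4-offset criterion (a pair counts as correct if the other alignment matches the same i to a j within 4 positions) are our assumptions; the paper does not spell these details out.

```python
def _covered(pairs_a, pairs_b, offset):
    """Count pairs in pairs_a that are matched by pairs_b up to `offset` positions."""
    by_i = {}
    for i, j in pairs_b:
        by_i.setdefault(i, []).append(j)
    return sum(any(abs(j - jb) <= offset for jb in by_i.get(i, ()))
               for i, j in pairs_a)

def precision_recall(pred, truth, offset=0):
    """offset=0 gives exact-match precision/recall; offset=4 gives the relaxed metric."""
    precision = _covered(pred, truth, offset) / len(pred) if pred else 0.0
    recall = _covered(truth, pred, offset) / len(truth) if truth else 0.0
    return precision, recall

# Example: two of three predicted pairs are exactly right; the third is off by 2.
pred = {(1, 1), (2, 3), (5, 8)}
truth = {(1, 1), (2, 3), (5, 6)}
print(precision_recall(pred, truth, offset=0))   # (0.666..., 0.666...)
print(precision_recall(pred, truth, offset=4))   # (1.0, 1.0)
```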
We compare XOR-CD with dynamic programming and Gibbs-CD. Dynamic programming uses an expert-defined objective (Rychlewski et al., 2000) with a few learned terms; it always produces valid alignments. As shown in Table 1 (Upper), both dynamic programming and XOR-CD generate 100% valid alignments, while Gibbs-CD cannot. XOR-CD outperforms both baselines in precision and recall in Table 1 (Lower). 4-offset is a relaxed metric that considers an alignment correct if it is off by at most 4 positions. Using this relaxed metric, XOR-CD still outperforms both baselines by 7% in precision and 14% in recall. Notice these metrics are computed taking into account both valid and invalid alignments. In summary, XOR-CD outperforms the baselines in all learning metrics while also generating 100% valid alignments.

6. Conclusion

We proposed XOR-CD, a novel algorithm for constrained structure generation. We showed theoretically that XOR-CD has a linear convergence rate to the global optimum for exponential family models. Empirically, we demonstrated the superior performance of XOR-CD on three real-world constrained structure generation tasks. In all tasks, XOR-CD generates 100% valid structures, and these generated structures closely match those in the training set. Future work includes extending XOR-CD to deep generative models.

Acknowledgements

This research was supported by NSF grants IIS-1850243 and CCF-1918327. We thank the anonymous reviewers for their comments and suggestions.

References

Achlioptas, D. and Theodoropoulos, P. Probabilistic model counting with short XORs. In International Conference on Theory and Applications of Satisfiability Testing, pp. 3-19. Springer, 2017.

Achlioptas, D., Hammoudeh, Z., and Theodoropoulos, P. Fast and flexible probabilistic model counting. In International Conference on Theory and Applications of Satisfiability Testing, 2018.

Agrawal, A., Amos, B., Barratt, S., Boyd, S., Diamond, S., and Kolter, J. Z. Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, pp. 9562-9574, 2019.

Amos, B. and Kolter, J. Z. OptNet: Differentiable optimization as a layer in neural networks. arXiv preprint arXiv:1703.00443, 2017.

Bao, F., Li, C., Xu, T., Su, H., Zhu, J., and Zhang, B. Bi-level score matching for learning energy-based latent variable models. Advances in Neural Information Processing Systems, 33, 2020.

Barton, J. P., De Leonardis, E., Coucke, A., and Cocco, S. ACE: adaptive cluster expansion for maximum entropy graphical model inference. Bioinformatics, 32(20):3089-3097, 2016.

Belanger, D. and McCallum, A. Structured prediction energy networks. In International Conference on Machine Learning, pp. 983-992, 2016.

Belanger, D., Yang, B., and McCallum, A. End-to-end learning for structured prediction energy networks. arXiv preprint arXiv:1703.05667, 2017.

Bengio, Y. and Delalleau, O. Justifying and generalizing contrastive divergence. Neural Computation, 21(6):1601-1621, 2009.

Briot, J.-P., Hadjeres, G., and Pachet, F.-D. Deep learning techniques for music generation: a survey. arXiv preprint arXiv:1709.01620, 2017.

Brock, A., Donahue, J., and Simonyan, K.
Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Carreira-Perpinan, M. A. and Hinton, G. E. On contrastive divergence learning. In AISTATS, volume 10, pp. 33-40. Citeseer, 2005.

Chakraborty, S., Fremont, D. J., Meel, K. S., Seshia, S. A., and Vardi, M. Y. Distribution-aware sampling and weighted model counting for SAT. In AAAI, 2014.

de Avila Belbute-Peres, F., Economon, T. D., and Kolter, J. Z. Combining differentiable PDE solvers and graph neural networks for fluid flow prediction. 2020.

Dehaene, G. P. and Barthelmé, S. Bounding errors of expectation-propagation. In Advances in Neural Information Processing Systems, pp. 244-252, 2015.

Ding, F., Wang, H., Sabharwal, A., and Xue, Y. Towards efficient discrete integration via adaptive quantile queries. arXiv preprint arXiv:1910.05811, 2019.

Durkan, C., Murray, I., and Papamakarios, G. On contrastive learning for likelihood-free inference. arXiv preprint arXiv:2002.03712, 2020.

Engel, J., Resnick, C., Roberts, A., Dieleman, S., Norouzi, M., Eck, D., and Simonyan, K. Neural audio synthesis of musical notes with WaveNet autoencoders. In International Conference on Machine Learning, pp. 1068-1077. PMLR, 2017.

Ermon, S., Gomes, C. P., Sabharwal, A., and Selman, B. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In Proceedings of the 30th International Conference on Machine Learning, ICML, 2013a.

Ermon, S., Gomes, C. P., Sabharwal, A., and Selman, B. Embed and project: Discrete sampling with universal hashing. In Advances in Neural Information Processing Systems (NIPS), 2013b.

Fan, D. and Xue, Y. Contrastive divergence learning with chained belief propagation. In International Conference on Probabilistic Graphical Models, 2020.

Galassi, A., Lombardi, M., Mello, P., and Milano, M. Model Agnostic Solution of CSPs via Deep Learning: A Preliminary Study. In Proceedings of CPAIOR, volume 10848 of LNCS, pp. 254-262. Springer, 2018.

Gomes, C., Sellmann, M., Van Es, C., and Van Es, H. The challenge of generating spatially balanced scientific experiment designs. In International Conference on Integration of Artificial Intelligence (AI) and Operations Research (OR) Techniques in Constraint Programming, pp. 387-394. Springer, 2004.

Gomes, C. P., Sabharwal, A., and Selman, B. Near-uniform sampling of combinatorial spaces using XOR constraints. In Schölkopf, B., Platt, J. C., and Hoffman, T. (eds.), Advances in Neural Information Processing Systems 19, pp. 481-488. MIT Press, 2007a.

Gomes, C. P., Van Hoeve, W.-J., Sabharwal, A., and Selman, B. Counting CSP solutions using generalized XOR constraints. In Proceedings of the 22nd National Conference on Artificial Intelligence, 2007b.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwinska, A., Colmenarejo, S. G., Grefenstette, E., Ramalho, T., Agapiou, J., Badia, A. P., Hermann, K. M., Zwols, Y., Ostrovski, G., Cain, A., King, H., Summerfield, C., Blunsom, P., Kavukcuoglu, K., and Hassabis, D. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471-476, 2016.

Grover, A., Zweig, A., and Ermon, S. Graphite: Iterative generative modeling of graphs.
In International Conference on Machine Learning, pp. 2434-2444. PMLR, 2019.

Hinton, G. and Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Comput., pp. 1771-1800, 2002a.

Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002b.

Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125-1134, 2017.

Jiang, B., Wu, T.-Y., Jin, Y., Wong, W. H., et al. Convergence of contrastive divergence algorithm in exponential family. The Annals of Statistics, 46(6A):3067-3098, 2018.

Jin, W., Barzilay, R., and Jaakkola, T. S. Junction tree variational autoencoder for molecular graph generation. In ICML, 2018.

Khalil, E., Dai, H., Zhang, Y., Dilkina, B., and Song, L. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pp. 6348-6358, 2017.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215-10224, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kusner, M. J., Paige, B., and Hernández-Lobato, J. M. Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 1945-1954, 2017.

Lallouet, A. and Legtchenko, A. Building Consistencies for Partially Defined Constraints with Decision Trees and Neural Networks. International Journal on Artificial Intelligence Tools, 16(4):683-706, 2007.

Le Bras, R., Gomes, C., and Selman, B. From streamlined combinatorial search to efficient constructive procedures. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.

Liu, Q., Allamanis, M., Brockschmidt, M., and Gaunt, A. Constrained graph variational autoencoders for molecule design. In Advances in Neural Information Processing Systems, pp. 7795-7804, 2018.

Lombardi, M. and Gualandi, S. A lagrangian propagator for artificial neural networks in constraint programming. Constraints, 21(4):435-462, 2016.

Lombardi, M., Milano, M., and Bartolini, A. Empirical decision model learning. Artif. Intell., 244:343-367, 2017.

Ma, J., Wang, S., Wang, Z., and Xu, J. MRFalign: protein homology detection through alignment of Markov random fields. PLoS Comput Biol, 10(3):e1003500, 2014.

Minka, T. P. Expectation propagation for approximate Bayesian inference. arXiv preprint arXiv:1301.2294, 2013.

Pang, T., Xu, K., Li, C., Song, Y., Ermon, S., and Zhu, J. Efficient learning of generative models via finite-difference score matching. arXiv preprint arXiv:2007.03317, 2020.

Ping, W. and Ihler, A. Belief propagation in conditional RBMs for structured prediction. arXiv preprint arXiv:1703.00986, 2017.

Prenger, R., Valle, R., and Catanzaro, B. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617-3621. IEEE, 2019.

Qiu, Y., Zhang, L., and Wang, X. Unbiased contrastive divergence algorithm for training energy-based latent variable models. In International Conference on Learning Representations, 2019.
Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Rychlewski, L., Li, W., Jaroszewski, L., and Godzik, A. Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Science, 9(2):232-241, 2000.

Smith, C., Gomes, C., and Fernandez, C. Streamlining local search for spatially balanced Latin squares.

Söding, J. Protein homology detection by HMM-HMM comparison. Bioinformatics, 21(7):951-960, 2005.

Song, Y. and Ermon, S. Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems, 33, 2020.

Vinyals, O., Fortunato, M., and Jaitly, N. Pointer networks. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, pp. 2692-2700, 2015.

Wang, P.-W., Donti, P. L., Wilder, B., and Kolter, Z. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. arXiv preprint arXiv:1905.12149, 2019.

Wang, S., Ma, J., Peng, J., and Xu, J. Protein structure alignment beyond spatial proximity. Scientific Reports, 3:1448, 2013.

Wu, F. and Xu, J. Deep template-based protein structure prediction. bioRxiv, 2020.

Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., and Sun, M. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.