# On Parameter Tying by Quantization

Li Chou, Somdeb Sarkhel
Department of Computer Science, The University of Texas at Dallas
{lkc130030,sxs104721}@utdallas.edu

Nicholas Ruozzi
Department of Computer Science, The University of Texas at Dallas
nicholas.ruozzi@utdallas.edu

Vibhav Gogate
Department of Computer Science, The University of Texas at Dallas
vgogate@hlt.utdallas.edu

## Abstract

The maximum likelihood estimator (MLE) is generally asymptotically consistent but is susceptible to overfitting. To combat this problem, regularization methods which reduce the variance at the cost of (slightly) increasing the bias are often employed in practice. In this paper, we present an alternative variance reduction (regularization) technique that quantizes the MLE estimates as a post-processing step, yielding a smoother model having several tied parameters. We provide and prove error bounds for our new technique and demonstrate experimentally that it often yields models having higher test-set log-likelihood than the ones learned using the MLE. We also propose a new importance sampling algorithm for fast approximate inference in models having several tied parameters. Our experiments show that our new inference algorithm is superior to existing approaches such as Gibbs sampling and MC-SAT on models having tied parameters, learned using our quantization-based approach.

## Introduction

Weight (parameter) learning is a central problem in probabilistic graphical models. It is often expressed as the following maximum likelihood estimation (MLE) task: find an assignment of values to all parameters that maximizes the log-likelihood of the data. MLE is a popular weight learning formulation because it has several desirable theoretical properties, including consistency (convergence in the limit to the correct value) and asymptotic efficiency (best-possible estimator in the limit). However, in practice, when the data size is small or the number of parameters is large (or both), MLE yields parameter estimates having high variance, and the resulting models have poor predictive (generalization) accuracy. To combat this, regularization methods such as ℓ1 and ℓ2 regularization, and their Bayesian counterparts, Laplace and Gaussian priors, which reduce the variance by penalizing large parameter values, are often employed in practice.

In this paper, we consider an alternative regularization method: parameter tying, which forces several parameters of the graphical model to take the same value. Roughly speaking, since the sufficient statistics of the tied parameters can be combined, the number of samples used to estimate a specific parameter increases, which in turn reduces the variance (but increases the bias). Parameter tying is employed in a large variety of graphical models and their extensions such as hidden Markov models (Baum et al. 1970), dynamic Bayesian networks (Murphy 2002), conditional random fields (Lafferty, McCallum, and Pereira 2001), and statistical relational models (Getoor and Taskar 2007). However, a key feature of this existing work is that it assumes that the tied parameters are specified a priori. As a result, the full power of parameter tying as a regularization method has not yet been exploited. We deviate from the existing approach and seek methods that automatically tie the parameters by analyzing the data.
A straightforward approach to the parameter tying problem is to express it as a constrained optimization problem. Given $m$ parameters and a hyperparameter $k$, which bounds the maximum number of tied parameters and controls the amount of regularization, find a set of at most $k$ equality constraints that are mutually exclusive (no two constraints share a parameter) and exhaustive (all parameters are included) such that the likelihood is maximized. Unfortunately, this problem is computationally difficult as it includes structure learning as a special case. To obviate this computational difficulty, we propose a greedy approach that ties parameters by quantizing the learned weights. Namely, we propose to find a many-to-one function $f$ that maps the $m$ parameter values to a set having $k$ values, such that the sum of the Euclidean distances between each parameter $w$ and its quantized value $f(w)$ is minimized. This problem can be solved optimally using dynamic programming (Wang and Song 2011). We show that despite its simplicity, our new approach has several desirable theoretical properties and guarantees. In particular, we provide and prove novel error bounds which show that if the quantization is accurate, then the difference between the average log-likelihoods of the original model and the quantized model is quite small. Experimentally, we show that the quantized model often outperforms the untied model on a wide range of datasets.

A key benefit of parameter tying is that it introduces symmetries in the probabilistic model, and exploiting them makes inference easier and more efficient. To this end, we develop a novel importance sampling algorithm for fast, accurate sampling in parameter tied models. Our experiments conclusively show that our new sampling algorithm outperforms MC-SAT (Poon and Domingos 2006), a popular sampling algorithm that combines Markov chain Monte Carlo sampling and satisfiability solution sampling, as well as Gibbs sampling (baseline), in the presence of tied parameters.

## Background

We represent variables by capital letters (e.g., $X$), values in the domain of a variable by corresponding small letters (e.g., $x$), and an assignment of a value $x$ to a variable $X$ by $\bar{x}$. We represent sets of variables by bold capital letters (e.g., $\mathbf{X}$) and an assignment of values to all variables in the set $\mathbf{X}$ by $\bar{\mathbf{x}}$. For simplicity of presentation, we will assume that all domains are binary unless otherwise noted.

A log-linear probabilistic graphical model (PGM), denoted by $M$, is a triple $\langle \mathbf{X}, F, \theta \rangle$ where $\mathbf{X} = \{X_1, \dots, X_n\}$ is a set of variables, $F = \{F_1, \dots, F_m\}$ is a set of features, and $\theta = \{\theta_1, \dots, \theta_m\}$ is the set of weights (parameters) such that $\theta_i$ is the weight of feature $F_i$. $M$ represents the following probability distribution:

$$P_\theta(\bar{\mathbf{x}}) = \frac{1}{Z(\theta)} \exp\left(\sum_i \theta_i F_i(\bar{\mathbf{x}})\right)$$

where $F_i(\bar{\mathbf{x}})$ is 1 if the assignment $\bar{\mathbf{x}}$ evaluates $F_i$ to true and 0 otherwise, and $Z(\theta) = \sum_{\bar{\mathbf{x}}} \exp\left(\sum_i \theta_i F_i(\bar{\mathbf{x}})\right)$ is the normalization constant or the partition function. Bayesian networks are directed graphical models in which each parameter $\theta_i$ has a probabilistic interpretation and equals the log of a conditional probability, and the partition function equals one.
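To make the definition concrete, here is a minimal Python sketch (illustrative only; the three disjunctive features and their weights are our own toy choices, not from the paper) that computes $P_\theta(\bar{\mathbf{x}})$ for a small binary model by brute-force enumeration of the partition function:

```python
import itertools
import math

# Illustrative log-linear model over three binary variables (a, b, c).
# Each feature maps a full assignment to {0, 1}; weights are arbitrary.
features = [
    lambda x: int(x["a"] or x["b"]),   # F1
    lambda x: int(x["b"] or x["c"]),   # F2
    lambda x: int(x["a"] or x["c"]),   # F3
]
theta = [0.5, 0.5, 1.2]

def unnormalized(x):
    """exp(sum_i theta_i * F_i(x)) for one full assignment x."""
    return math.exp(sum(t * f(x) for t, f in zip(theta, features)))

# Partition function Z(theta): sum over all 2^3 assignments.
assignments = [dict(zip("abc", bits)) for bits in itertools.product([0, 1], repeat=3)]
Z = sum(unnormalized(x) for x in assignments)

x = {"a": 1, "b": 0, "c": 1}
print("P(x) =", unnormalized(x) / Z)
```

Enumerating all $2^n$ assignments is only feasible for tiny models; it is used here purely to make $Z(\theta)$ explicit.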
The two main inference problems over PGMs are: (1) marginal inference, which is defined as computing the marginal probability of a variable given an assignment to a subset of variables (evidence), and (2) maximum-a-posteriori inference, which is defined as finding an assignment of values to all non-evidence variables that has the maximum probability. In this paper, we focus on the former.

### Weight (Parameter) Learning

We assume that the PGM structure is known, namely the variables and the features are given, while the weights are unknown and need to be estimated from data. We further assume that our dataset is such that all variables are observed and there are no missing values. Under these assumptions, given a training dataset $D = \{\bar{\mathbf{x}}^{(1)}, \dots, \bar{\mathbf{x}}^{(D)}\}$ for variables $\mathbf{X}$, the weight learning task can be expressed as maximizing the following log-likelihood function:

$$\ell(\theta : D) = \log \prod_{i=1}^{D} P_\theta(\bar{\mathbf{x}}^{(i)}) = \sum_{i=1}^{D} \log P_\theta(\bar{\mathbf{x}}^{(i)})$$

This method of parameter learning or estimation is referred to as maximum likelihood estimation (MLE). MLE has the desirable consistency property, which guarantees that the MLE solution converges to the true unknown parameter with high probability for sufficiently large samples. Under the assumptions given above, the log-likelihood function is concave. The optimum, and thus the parameters (MLE solution), can be found using standard gradient descent methods. In Bayesian networks, the MLE solution can be computed in closed form (since each parameter is a conditional probability).

### Importance Sampling

Importance sampling (IS) (Liu 2001) is a Monte Carlo simulation technique, commonly used to evaluate the expectation $E_P[f(\mathbf{X})] = \sum_{\bar{\mathbf{x}}} f(\bar{\mathbf{x}}) P(\bar{\mathbf{x}})$ of a real function $f(\cdot)$ according to a probability distribution $P(\cdot)$. The general idea is to generate samples $\bar{\mathbf{x}}^{(1)}, \dots, \bar{\mathbf{x}}^{(N)}$ from a proposal distribution $Q$ satisfying $f(\bar{\mathbf{x}}) P(\bar{\mathbf{x}}) > 0 \Rightarrow Q(\bar{\mathbf{x}}) > 0$, and then estimate $E_P[f(\mathbf{X})]$ as follows:

$$E_P[f(\mathbf{X})] \approx \frac{\sum_{i=1}^{N} f(\bar{\mathbf{x}}^{(i)}) \, w(\bar{\mathbf{x}}^{(i)})}{\sum_{i=1}^{N} w(\bar{\mathbf{x}}^{(i)})}$$

where $w(\bar{\mathbf{x}}^{(i)}) = \frac{P(\bar{\mathbf{x}}^{(i)})}{Q(\bar{\mathbf{x}}^{(i)})}$ is called the importance weight of the sample $\bar{\mathbf{x}}^{(i)}$. Note that the importance weight needs to be known only up to a multiplicative constant. Importance sampling can be used to estimate both the partition function (normalizing constant) as well as one-variable marginals. In this paper we focus on computing the single-variable marginal probability, which can be estimated as

$$\hat{P}_N(\bar{x}) = \frac{\sum_{i=1}^{N} \mathbb{1}\{\bar{x} \in \bar{\mathbf{x}}^{(i)}\} \, w(\bar{\mathbf{x}}^{(i)})}{\sum_{i=1}^{N} w(\bar{\mathbf{x}}^{(i)})} \qquad (1)$$

where $\mathbb{1}\{\bar{x} \in \bar{\mathbf{x}}\} = 1$ iff $\bar{\mathbf{x}}$ contains the assignment $\bar{x}$, and 0 otherwise, and $w(\bar{\mathbf{x}}) = \frac{\exp(\sum_i \theta_i F_i(\bar{\mathbf{x}}))}{Q(\bar{\mathbf{x}})}$. In this paper, we will make the standard assumption that $Q$ is a Bayesian network (Fung and Chang 1989; Ortiz and Kaelbling 2000; Cheng and Druzdzel 2001; Gogate 2009), since it is easy to generate independent samples from a Bayesian network using logic (forward) sampling (Pearl 1988). The quality of the estimate, namely the accuracy of $\hat{P}_N(\bar{x})$, is highly dependent on how far $Q$ is from $P$, and as a result most of the research on importance sampling is about designing a good $Q$. In this paper, we design a $Q$ for parameter tied PGMs.
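As a concrete illustration of Eq. (1), the following minimal sketch estimates a single-variable marginal for the toy model above. The uniform proposal is our simplifying assumption for illustration, whereas the paper assumes $Q$ is a Bayesian network:

```python
import itertools
import math
import random

# Same toy model as the previous sketch (repeated so this is self-contained).
features = [
    lambda x: int(x["a"] or x["b"]),
    lambda x: int(x["b"] or x["c"]),
    lambda x: int(x["a"] or x["c"]),
]
theta = [0.5, 0.5, 1.2]

def unnormalized(x):
    return math.exp(sum(t * f(x) for t, f in zip(theta, features)))

def estimate_marginal(var, val, num_samples=100_000, seed=0):
    """Estimate P(var = val) via Eq. (1) with a uniform proposal Q."""
    rng = random.Random(seed)
    num, den = 0.0, 0.0
    for _ in range(num_samples):
        x = {v: rng.randint(0, 1) for v in "abc"}
        # Q(x) = 1/8 is constant, so it cancels in the normalized estimator;
        # the weight w(x) then reduces to the unnormalized model score.
        w = unnormalized(x)
        num += w * (x[var] == val)
        den += w
    return num / den

print("P(a = 1) ~", estimate_marginal("a", 1))
```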
### Quantization, Clustering and Partition

Quantization is the process of mapping a larger set of real numbers to a smaller set. Formally, given two sets of real numbers $\mathbf{x}$ and $\mathbf{y}$ with $|\mathbf{x}| \geq |\mathbf{y}|$, a quantization function $Q$, called a quantizer, is a surjective function from the elements of $\mathbf{x}$ to the elements of $\mathbf{y}$ such that $|x_i - Q(x_i)| \leq \epsilon_i$ for all $x_i \in \mathbf{x}$. Here $\epsilon_i$ is referred to as the error of quantization. Often we are interested in a $Q$ which minimizes the average error of quantization, i.e., $\frac{1}{|\mathbf{x}|} \sum_i |x_i - Q(x_i)|$.

The optimal quantization problem is closely related to the optimal k-means clustering problem, which is defined as follows: given a set of observations $\mathbf{x} = \{x_1, \dots, x_n\}$, find a $k$-partition $S = \{S_1, \dots, S_k\}$ of $\mathbf{x}$ (namely, $\forall i \; S_i \subseteq \mathbf{x}$ and $\bigcup_{i=1}^{k} S_i = \mathbf{x}$) such that the following objective function is minimized:

$$\sum_{i=1}^{k} \sum_{x \in S_i} \|x - \mu_{S_i}\|_d$$

where $\|\cdot\|_d$ is a distance measure and $\mu_{S_i}$ (called the cluster center) is the mean of the elements in $S_i$. The most popular distance measure is the Euclidean distance. A function $Q : \mathbf{x} \rightarrow \boldsymbol{\mu}$, defined as $Q(x) = \mu_{S_i}$ iff $x$ is in cluster $S_i$, where $\boldsymbol{\mu} = \{\mu_{S_1}, \dots, \mu_{S_k}\}$, is clearly a quantizer of $\mathbf{x}$. Hence, we can use any clustering algorithm (e.g., the k-means algorithm) for quantization. In this work, it turns out that we will only need to solve one-dimensional clustering problems, which can be solved optimally in polynomial time, $O(m^2 k)$, using dynamic programming (Wang and Song 2011). In this paper, we leverage this algorithm for finding optimal quantizations, as sketched below.
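The paper uses the Ckmeans.1d.dp algorithm of Wang and Song (2011); the sketch below is only an illustrative Python reimplementation of that style of dynamic program under the squared-Euclidean objective, returning the quantizer $Q$ as a value-to-cluster-mean map:

```python
def kmeans_1d(values, k):
    """Optimal 1-D k-means via dynamic programming (Wang & Song 2011 style).

    Returns a dict mapping each input value to its cluster mean, i.e., a
    quantizer Q. Runs in O(n^2 * k) using prefix sums for interval costs.
    """
    xs = sorted(values)
    n = len(xs)
    assert 1 <= k <= n
    # Prefix sums for O(1) within-interval sum-of-squared-deviations.
    ps = [0.0] * (n + 1)   # prefix sum of x
    ps2 = [0.0] * (n + 1)  # prefix sum of x^2
    for i, x in enumerate(xs):
        ps[i + 1] = ps[i] + x
        ps2[i + 1] = ps2[i] + x * x

    def cost(lo, hi):  # cost of one cluster covering xs[lo..hi] inclusive
        s = ps[hi + 1] - ps[lo]
        s2 = ps2[hi + 1] - ps2[lo]
        return s2 - s * s / (hi - lo + 1)

    INF = float("inf")
    # D[j][i]: best cost of splitting xs[0..i] into j clusters.
    D = [[INF] * n for _ in range(k + 1)]
    back = [[0] * n for _ in range(k + 1)]  # start index of the last cluster
    for i in range(n):
        D[1][i] = cost(0, i)
    for j in range(2, k + 1):
        for i in range(j - 1, n):
            for m in range(j - 1, i + 1):
                c = D[j - 1][m - 1] + cost(m, i)
                if c < D[j][i]:
                    D[j][i], back[j][i] = c, m
    # Recover cluster boundaries and build the quantizer.
    Q, hi = {}, n - 1
    for j in range(k, 0, -1):
        lo = back[j][hi] if j > 1 else 0
        mean = (ps[hi + 1] - ps[lo]) / (hi - lo + 1)
        for x in xs[lo:hi + 1]:
            Q[x] = mean
        hi = lo - 1
    return Q

weights = [0.11, 0.12, 0.13, 0.95, 1.02, 2.50]
print(kmeans_1d(weights, k=3))  # three clusters: low, middle, high
```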
## Quantization for Weight Learning

We define a parameter tied graphical model $M_t$ as a tuple $\langle \mathbf{X}, \theta, C \rangle$, where $\mathbf{X}$ is the set of variables, $\theta$ is the parameter set, and $C$ is a set of equality constraints of the form $\theta_i = \theta_j$ for some $\theta_i, \theta_j \in \theta$. Since equality constraints are equivalence relations, $C$ induces a partition over $\theta$ where each part of the partition corresponds to an equivalence class. We are interested in learning the optimal parameter tied model $M_t^*$ from data. This problem can be defined as follows. Given training data $D$ on variables $\mathbf{X}$, find the constraint set $C$ and parameters $\theta$ such that $C$ induces a $k$-partition on $\theta$ and the (log-)likelihood of the data is maximized.

Solving the aforementioned optimization problem is hard because it requires searching over all possible constraint sets, which is clearly impractical. In particular, the number of possible constraint sets of size $k$ for $m$ parameters equals the number of partitions of size $k$ of a set of size $m$. This number is given by the Stirling number of the second kind, $\left\{{m \atop k}\right\}$, which grows exponentially with $m$. The total number of partitions of a set is given by the Bell number, $B_m = \sum_{k=1}^{m} \left\{{m \atop k}\right\}$.[^1]

[^1]: The fact that the optimization problem is computationally difficult also follows from the observation that it includes structure learning as a special case where the number of true features is $k - 1$.

**Our approach.** To remedy this computational difficulty, we propose the following greedy approach. First, we learn the parameters of the log-linear model using MLE. Then, we quantize these parameters to $k$ levels using the one-dimensional k-means clustering algorithm.

### Error Bound for Quantization

Although our approach is simple and straightforward, we show next that it yields models that have a high log-likelihood score yet fewer parameters, under the assumption that the quantization error is small. Formally, let $\theta = (\theta_1, \dots, \theta_m)$ denote the parameters of a log-linear model learned from a dataset $D$ having $D$ examples. Without loss of generality, we assume that all weights are positive. Let $\mu = (\mu_1, \dots, \mu_k)$, $k \leq m$, be a quantization of $\theta$ with respect to the quantizer $Q$ between $\theta$ and $\mu$ such that for all $\theta_i \in \theta$ the following holds:

$$|\theta_i - Q(\theta_i)| \leq \epsilon$$

where $\epsilon \geq 0$ is a small constant. Let $\ell(\theta : D)$ and $\ell(\mu : D)$ denote the log-likelihoods of $D$ with respect to $\theta$ and $\mu$ respectively. Then, we can prove that the difference between the average log-likelihood scores of the quantized model and the original model is bounded by $2m\epsilon$. Formally (the proof is available in the extended version):

**Theorem 1.** $\frac{1}{D} \left( \ell(\theta : D) - \ell(\mu : D) \right) \leq 2m\epsilon$.

As the number of quantization levels ($k$) increases, the quantization error $\epsilon$ decreases, and it vanishes when $k = m$. As a result, the bound specified in Theorem 1 becomes tighter, and our greedy learning approach yields more accurate results while using a smaller number of parameters.

The log-likelihood score of the quantized log-linear model can be further improved by simply relearning the model, treating all parameters that are quantized to the same value as tied. Formally, since the quantizer $Q$ induces a $k$-partition on the original parameter set $\theta$, we set up the learning problem as follows:

$$\begin{aligned} \underset{\theta}{\text{maximize}} \quad & \ell(\theta : D) \\ \text{subject to} \quad & \{\theta_i = \theta_j \mid \theta_i, \theta_j \in \theta, \; Q(\theta_i) = Q(\theta_j)\} \end{aligned} \qquad (2)$$

The optimization problem given above has a concave objective function; it therefore has a single maximum and can be solved using the same approaches (e.g., gradient descent) that are used to solve the MLE problem. As mentioned earlier, the key benefit of the formulation in Eq. (2) is that it has fewer (unique) parameters, and therefore the data statistics are estimated using a larger sample size than the ones used in the unconstrained (MLE) version of this problem. From standard sampling theory (Liu 2001), this reduces the variance of the estimates.

**Proposition 1.** Let $\mu$ denote the set of quantized parameters, $Q$ be the corresponding quantizer, and $\rho$ be the relearned parameters (the optimal solution of the optimization problem given in Eq. (2)). Then $\ell(\rho : D) \geq \ell(\mu : D)$.

*Proof.* Since $\mu$ is a feasible solution of the problem given in Eq. (2) (it satisfies all the constraints) and $\rho$ is the optimal solution, it follows that $\ell(\rho : D) \geq \ell(\mu : D)$.
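In practice, Eq. (2) can be solved with any gradient method by reparameterizing: keep one free weight per cluster of the partition induced by $Q$, and sum the member features' log-likelihood gradients. A minimal sketch under illustrative assumptions (toy features, toy data, brute-force expectations; none of this is from the paper):

```python
import itertools
import math

# Reparameterized relearning for Eq. (2): F1 and F2 share one weight.
features = [
    lambda x: int(x["a"] or x["b"]),   # F1, tied with F2 by Q
    lambda x: int(x["b"] or x["c"]),   # F2
    lambda x: int(x["a"] or x["c"]),   # F3
]
clusters = [[0, 1], [2]]  # partition of feature indices induced by Q
data = [
    {"a": 1, "b": 0, "c": 1}, {"a": 1, "b": 1, "c": 1},
    {"a": 0, "b": 0, "c": 1}, {"a": 0, "b": 0, "c": 0},
]
assignments = [dict(zip("abc", bits)) for bits in itertools.product([0, 1], repeat=3)]

def expand(shared):
    """Map one weight per cluster back to the full per-feature vector."""
    theta = [0.0] * len(features)
    for w, members in zip(shared, clusters):
        for i in members:
            theta[i] = w
    return theta

def expectations(theta):
    """E_theta[F_i] by brute-force enumeration (fine for a tiny model)."""
    scores = [math.exp(sum(t * f(x) for t, f in zip(theta, features)))
              for x in assignments]
    Z = sum(scores)
    return [sum(s * f(x) for s, x in zip(scores, assignments)) / Z
            for f in features]

shared = [0.0] * len(clusters)
for _ in range(1000):
    exps = expectations(expand(shared))
    for c, members in enumerate(clusters):
        # Gradient of a shared weight = sum of its members' MLE gradients:
        # sum_i ( empirical_count(F_i) - |D| * E_theta[F_i] ).
        grad = sum(sum(features[i](x) for x in data) - len(data) * exps[i]
                   for i in members)
        shared[c] += 0.05 * grad / len(data)
print("relearned tied weights:", shared)
```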
## Slice Importance Sampling

We observe that parameter tying introduces symmetries in the network which can be exploited to design a fast, approximate inference algorithm. In this section, we propose one such algorithm, which is based on slice importance sampling (Neal 2000; Gogate and Domingos 2010). In slice importance sampling, the proposal distribution is defined over the features rather than over the variables. For instance, we can define a proposal distribution $Q(F) = Q(F_1) \prod_{i=2}^{m} Q(F_i \mid F_1, \dots, F_{i-1})$ over the set of features $F$. Sampling each feature in order from $Q$ yields a 0/1 assignment to the features, where 0 indicates that the negation of the feature is true while 1 indicates that the feature is true. We can consider this 0/1 assignment as a slice over the possible assignments to the variables in the following sense: all (variable) assignments that satisfy all features assigned to 1 and all negations of features assigned to 0 have the same probability. Uniformly sampling over this subset of variable assignments (the slice) gives us the required sample. The benefit of slice sampling is that all variable assignments having the same probability in the distribution represented by the log-linear model have the same probability in the proposal distribution (assuming uniform sampling over the slice can be done). This reduces the variance (Neal 2000; Gogate and Domingos 2010).

In our new algorithm, the main idea is to define the proposal distribution over the tied features rather than over the individual features, further reducing the variance. We first create a set of super-features $G$, such that each super-feature contains features with tied parameters, i.e.,

$$G_i = \{F_j \mid F_j \text{ has parameter } \theta_i\}, \qquad (3)$$

and define a proposal over them as follows (using the chain rule): $Q(G) = Q(G_1) \prod_{i=2}^{k} Q(G_i \mid G_1, \dots, G_{i-1})$.

However, one problem with defining a proposal over the super-features is that the number of values each super-feature $G_i$ can take is exponential in the number of features it contains, namely exponential in $|G_i|$. To compactly represent these exponentially many assignments, we use so-called counting assignments (Milch et al. 2008; Jha et al. 2010) as follows. We partition the assignments to the features in the set $G_i$ into $|G_i| + 1$ subsets, where the $j$-th subset contains all assignments in which exactly $j$ features in $G_i$ are assigned to true and the remaining are assigned to false. Thus, each super-feature $G_i$ can take $|G_i| + 1$ values, yielding an exponential reduction in complexity.

Another benefit of using counting assignments is that they yield better proposal distributions. In particular, the set of valid counting assignments to all super-features partitions the set of assignments to the variables into equiprobable subsets, where each subset is composed of assignments to variables that satisfy a counting assignment to all super-features. In an ideal proposal distribution all such assignments must have the same probability, and defining the proposal over the counting assignments preserves this property.

**Example 1.** Consider a log-linear model having three features $F_1 = a \vee b$, $F_2 = b \vee c$, and $F_3 = a \vee c$, and three binary variables $a$, $b$, and $c$. Let the features $F_1$ and $F_2$ have the same weight $\theta_1$, and let $\theta_2$ be the weight associated with $F_3$. We define two super-features: $G_1 = \{F_1, F_2\}$ and $G_2 = \{F_3\}$. $G_1$ has four possible assignments, $\{(F_1 = 0, F_2 = 0), (F_1 = 0, F_2 = 1), (F_1 = 1, F_2 = 0), (F_1 = 1, F_2 = 1)\}$, but only three possible counting assignments $\{0, 1, 2\}$, where 0, 1, and 2 correspond to the subsets of assignments $\{(F_1 = 0, F_2 = 0)\}$, $\{(F_1 = 0, F_2 = 1), (F_1 = 1, F_2 = 0)\}$, and $\{(F_1 = 1, F_2 = 1)\}$ respectively. The reader can verify that the assignments to the variables that satisfy either $(F_1 = 0, F_2 = 1)$ or $(F_1 = 1, F_2 = 0)$ have the same probability; the short sketch following Eq. (4) below checks this claim by enumeration.

```
Algorithm 1: Tied Weight Importance Sampling
Input:  A log-linear model M = <X, F, mu> with k unique weights;
        number of samples N
Output: Importance weighted samples

 1: Create one super-feature G_i for each parameter mu_i
 2: Construct a proposal distribution Q(G) over the super-features
 3: for s = 1 to N do
 4:   S = {}; w(s) = 1
 5:   for i = 1 to k do
 6:     j_i ~ Q(G_i | G_1, ..., G_{i-1})
 7:     Add j_i randomly selected features from G_i to S
 8:     Add the negations of the features from G_i not selected in the
        previous step to S
 9:     w(s) = w(s) * C(|G_i|, j_i) * exp(j_i * mu_i) / Q(G_i = j_i | G_1, ..., G_{i-1})
10:   end for
11:   Sample x(s) ~ U_SAT(S)
12:   w(s) = w(s) * #S
13: end for
14: return (x(s), w(s)) for s = 1 to N
```

However, counting assignments introduce the following problem. To generate a sample, we need to generate an assignment to the variables uniformly at random from the subset of variable assignments that satisfy the counting assignment. Unfortunately, this problem is extremely challenging, and to our knowledge no general-purpose algorithms exist for it. To alleviate this computational difficulty, we propose to use the following proposal distribution:

$$Q(F_1, \dots, F_{|G_i|} \mid F_j \in G_i, \; G_i = t) = \frac{1}{\binom{|G_i|}{t}} \qquad (4)$$

Sampling from this proposal distribution yields a 0/1 assignment to the features, and the only problem that remains to be solved is generating an assignment to the variables uniformly at random from the subset of variable assignments that satisfy the given assignment to the features.
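The equiprobability claim in Example 1 can be checked directly by enumeration. A minimal sketch, assuming the disjunctive reading of the features in Example 1 and illustrative weight values:

```python
import itertools
import math

# Check Example 1's claim: with F1 and F2 sharing weight theta1, the
# variable assignments consistent with counting value G1 = 1 (exactly one
# of F1, F2 true) all have the same probability.
theta1, theta2 = 0.7, 1.3
F1 = lambda x: int(x["a"] or x["b"])
F2 = lambda x: int(x["b"] or x["c"])
F3 = lambda x: int(x["a"] or x["c"])
weighted = [(F1, theta1), (F2, theta1), (F3, theta2)]

assignments = [dict(zip("abc", bits)) for bits in itertools.product([0, 1], repeat=3)]
score = lambda x: math.exp(sum(t * f(x) for f, t in weighted))
Z = sum(score(x) for x in assignments)

for x in assignments:
    if F1(x) + F2(x) == 1:  # counting assignment G1 = 1
        print(x, "P =", score(x) / Z)  # both matching lines print the same P
```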
Generating such an assignment uniformly at random can be reduced to the uniform solution sampling problem, a well-researched problem for which a number of general-purpose solvers and techniques exist (Gogate and Dechter 2011; Wei, Erenrich, and Selman 2004; Gogate 2009).

Algorithm 1 formally describes our proposed slice importance sampling algorithm. The sampling process begins by constructing a proposal distribution $Q$ over the super-features $G_i$, each grouping the features associated with the same parameter $\mu_i$. Then we sample a counting assignment $j_i$ for each super-feature $G_i$ from the proposal $Q$ in a selected order (step 6). Then we select $j_i$ random features from $G_i$ and set their assignment to 1; the rest of the features in $G_i$ are assigned 0. This sampled 0/1 assignment to all features defines a satisfiability problem $S$. Solutions of this satisfiability problem correspond to the subset of variable assignments that have the same probability. Thus, to generate a sample, all we have to do is uniformly sample the solutions of $S$ (step 11). For this (procedure $U_{SAT}$), we can use uniform solution samplers such as SampleSAT (Wei, Erenrich, and Selman 2004) and SampleSearch (Gogate and Dechter 2011); in our experiments, we used the latter. The weight $w^{(s)}$ of the generated sample is proportional to the ratio between the probability of generating the sample from $M$ and the probability of generating it from the proposal distribution, and is computed iteratively in steps 9 and 12.

In our experiments, we have used a simple proposal which factorizes independently over the super-features. Let $\mu_1, \dots, \mu_k$ be the parameters and let $G_i$ denote the super-feature associated with $\mu_i$. We define $Q(G_i)$ as the following binomial distribution:

$$Q(G_i = t) = \frac{\binom{|G_i|}{t} \, e^{\mu_i t}}{(1 + e^{\mu_i})^{|G_i|}} \qquad (5)$$

where $t \in \{0, \dots, |G_i|\}$. Notice that the binomial is defined over $|G_i| + 1$ points, as opposed to the conventional proposal, which would have been defined over $2^{|G_i|}$ points.
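Eq. (5) is exactly a Binomial($|G_i|$, $p$) distribution with $p = e^{\mu_i} / (1 + e^{\mu_i})$, which makes it easy to sample. The sketch below (function names are our own) implements steps 6-9 of Algorithm 1 for one super-feature; note that for this particular proposal the per-super-feature weight factor $\binom{|G_i|}{j_i} e^{\mu_i j_i} / Q(G_i = j_i)$ collapses to the constant $(1 + e^{\mu_i})^{|G_i|}$:

```python
import math
import random

def sample_counting_value(n, mu, rng):
    """Sample t ~ Q(G = t) of Eq. (5), i.e., Binomial(n, e^mu / (1 + e^mu))."""
    p = math.exp(mu) / (1.0 + math.exp(mu))
    return sum(rng.random() < p for _ in range(n))

def sample_super_feature_slice(feature_ids, mu, rng):
    """Steps 6-9 of Algorithm 1 for one super-feature G_i: sample a counting
    value t, pick the t features to set true uniformly (Eq. (4)), and return
    the 0/1 feature assignment plus this super-feature's weight factor."""
    n = len(feature_ids)
    t = sample_counting_value(n, mu, rng)
    true_set = set(rng.sample(feature_ids, t))
    q_t = math.comb(n, t) * math.exp(mu * t) / (1.0 + math.exp(mu)) ** n
    factor = math.comb(n, t) * math.exp(mu * t) / q_t   # = (1 + e^mu)^n
    return {f: int(f in true_set) for f in feature_ids}, factor

rng = random.Random(0)
slice_assignment, w_factor = sample_super_feature_slice(["F1", "F2"], 0.7, rng)
print(slice_assignment, w_factor)
# The remaining steps (11-12) require a uniform SAT solution sampler such as
# SampleSearch, which is outside the scope of this sketch.
```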
## Experiments

We evaluated the performance of our quantized approach on both learning and inference tasks using several publicly available benchmark datasets from the UAI 2008 probabilistic inference competition repository (http://graphmod.ics.uci.edu/uai08). All experiments were performed on quad-core Intel i7 based machines with 16GB of RAM running Ubuntu.

### Weight Learning

First, we compared our quantized tied weight learning algorithms to MLE with a Laplacian prior on a collection of Bayesian network learning problems. For each selected Bayesian network, we used forward sampling to generate 100 sets of 6,000 training, 2,000 validation, and 2,000 test data points. Using the training data, we learned three models corresponding to the MLE, the quantized MLE, and an MLE obtained by relearning after quantization, for different values of $k$. The performance of each learning technique was evaluated using the average log-likelihood over the test set. We consider three kinds of Bayesian networks: two-layered noisy-or Bayesian networks (Savicky and Vomlel 2009), relational Bayesian networks constructed from the Primula tool (UCLA), and grid networks (Sang, Beame, and Kautz 2005). The various networks and their respective numbers of parameters are shown in Table 1.

| Network  | Total # Param. | True k | Est. k | Max Error (MLE) | Max Error (Q) | Max Error (RL) |
|----------|----------------|--------|--------|-----------------|---------------|----------------|
| bn2o     | 20510          | 20491  | 500    | 0.339           | 0.339         | 0.299          |
| students | 1308           | 13     | 20     | 0.141           | 0.141         | 0.140          |
| grid     | 1089           | 596    | 400    | 0.0186          | 0.0186        | 0.0181         |
| friends  | 3899           | 6      | 10     | 0.008           | 0.008         | 0.007          |

Table 1: Network information and error analysis for the MLE, Quantized (Q), and Relearned (RL) models.

We experimented on various noisy-or networks; the results for these networks were consistently similar. Figure 1(a) shows the result for one of them. For these models, the MLE has a higher average log-likelihood than the quantized MLE at low values of $k$, but we obtained a significant performance improvement over the MLE by relearning after quantization. As $k$ increases, relearning greatly outperforms the MLE and reaches a steady level. The performance difference tends to zero as $k$ converges to the actual number of parameters in the network. Even for small $k$, the relatively small difference in log-likelihood appears to be a reasonable trade-off for the drastic reduction in the number of model parameters.

We also selected two relational networks and one grid network. The results for these models appear in Figure 1(b)-(d). The performance for these networks is similar to the noisy-or networks. For small $k$, the performance improvement from relearning is realized fairly immediately, while the quantized model also converges to the MLE quickly.

We conducted an error analysis on parameter estimation between the three different learned models and the true model, using the validation set to obtain an optimally estimated $k$. With the exception of bn2o, our estimated $k$ is reasonably close to the true $k$ in the various models, as shown in the third and fourth columns of Table 1. The bn2o networks contain many very similar parameter values and thus resulted in a much lower estimated $k$. For each model set, we calculated the average absolute point-wise error between the true and learned parameters and selected the maximum among the experimental set. The results are also shown in Table 1. Relearning has a consistently lower error than MLE and quantization, while quantization has an error similar to MLE with fewer parameters. This reduction in the number of parameters often translates to better inference performance at prediction time, as we show in the next subsection.

*Figure 1: Average log-likelihood on test data plotted for each learned graphical model (MLE, Quantized and Relearned), varying the value of k, the level of quantization; panels (c) and (d) show the students and friends networks.*

### Inference

We compared our proposed approximate inference method based on slice importance sampling (denoted TW) to MC-SAT (Poon and Domingos 2006), as implemented in the Alchemy system (Kok et al. 2006), and to Gibbs sampling for inference in Bayesian networks. Each algorithm was run for 500 seconds and then evaluated by computing the average Hellinger distance (Kokolakis and Nanopoulos 2001) between the single-variable marginals obtained by each algorithm and the marginals obtained by an exact solver. The results were averaged over 10 runs of each algorithm, using the MLE parameters as well as parameters that were relearned (RL) after quantization.

*Figure 2: Average Hellinger distance between the exact and the approximate one-variable marginals, plotted as a function of k, the level of quantization, for MS MLE (MC-SAT MLE), MS RL (MC-SAT Relearned), TW MLE (Tied Weight MLE), TW RL (Tied Weight Relearned), GIBBS MLE (Gibbs MLE), and GIBBS RL (Gibbs Relearned). Results for each of the network types (noisy-or, relational and grid) are shown.*

Figure 2(a)-(c) shows our experimental results on each of the three types of networks (noisy-or, relational and grid). The average Hellinger distances are consistently similar between MC-SAT and our method across the three network types when no parameters are tied (the MLE case).
This is expected, as MC-SAT is a special case of our algorithm. However, with the relearned parameters, our algorithm significantly outperforms both MC-SAT and Gibbs sampling on the students and grid networks. This shows that even though the test-set log-likelihoods of the MLE solution and the parameter tied models are roughly the same, at prediction time (estimating marginals), models having tied parameters outperform untied models, provided methods such as TW that explicitly exploit the tied parameters are used.

One explanation for the poor performance of MC-SAT on parameter tied models is that MC-SAT is based on local search with a strong bias towards features that have high weights. Thus, without the ability to make large moves, MC-SAT is not able to efficiently traverse the state space. Analogously, MCMC techniques such as Gibbs sampling can also become trapped within a local region and may require a large number of samples to escape. Since our algorithm systematically partitions the overall state space through the tied weight structure, it is able to move across the various regions more easily.

## Conclusion

We proposed a greedy method to learn tied parameter models that quantizes the parameters learned via maximum likelihood estimation using k-means clustering. Despite its simplicity, we demonstrated empirically that our approach can be used both as a regularizer and as a technique to reduce model complexity while maintaining predictive performance, which comports with the theoretical bounds on the error resulting from quantization that we provided. We also introduced a new importance sampling technique that exploits the symmetry resulting from the quantization in order to sample more effectively from tied parameter models than MC-SAT and Gibbs sampling. In future work, we plan to investigate applications of these techniques to Markov networks, bounds on the sample complexity required to obtain the correct quantization, and applications of quantization to structure learning.

## Acknowledgements

This work was supported in part by the DARPA Probabilistic Programming for Advanced Machine Learning Program under AFRL prime contract number FA8750-14-C-0005. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views or official policies, either expressed or implied, of DARPA, AFRL, ARO, or the US government.

## References

Baum, L. E.; Petrie, T.; Soules, G.; and Weiss, N. 1970. A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains. The Annals of Mathematical Statistics 41(1):164–171.

Cheng, J., and Druzdzel, M. J. 2001. Confidence Inference in Bayesian Networks. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 75–82.

Fung, R. M., and Chang, K. 1989. Weighing and Integrating Evidence for Stochastic Simulation in Bayesian Networks. In Proceedings of the Fifth Conference on Uncertainty in Artificial Intelligence, 209–220.

Getoor, L., and Taskar, B., eds. 2007. Introduction to Statistical Relational Learning. MIT Press.

Gogate, V., and Dechter, R. 2011. SampleSearch: Importance Sampling in Presence of Determinism. Artificial Intelligence 175(2):694–729.

Gogate, V., and Domingos, P. 2010. Formula-Based Probabilistic Inference. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, 210–219.

Gogate, V. 2009. Sampling Algorithms for Probabilistic Graphical Models with Determinism. Ph.D. Dissertation, University of California, Irvine.
Jha, A.; Gogate, V.; Meliou, A.; and Suciu, D. 2010. Lifted Inference from the Other Side: The Tractable Features. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems (NIPS), 973–981.

Kok, S.; Sumner, M.; Richardson, M.; Singla, P.; Poon, H.; and Domingos, P. 2006. The Alchemy System for Statistical Relational AI. Technical report, Department of Computer Science and Engineering, University of Washington, Seattle, WA. http://alchemy.cs.washington.edu.

Kokolakis, G., and Nanopoulos, P. 2001. Bayesian Multivariate Micro-Aggregation Under the Hellinger's Distance Criterion. Research in Official Statistics 4:117–125.

Lafferty, J.; McCallum, A.; and Pereira, F. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning, 282–289.

Liu, J. S. 2001. Monte Carlo Strategies in Scientific Computing. Springer Publishing Company, Incorporated.

Milch, B.; Zettlemoyer, L. S.; Kersting, K.; Haimes, M.; and Kaelbling, L. P. 2008. Lifted Probabilistic Inference with Counting Formulas. In Proceedings of the Twenty-Third National Conference on Artificial Intelligence, 1062–1068.

Murphy, K. 2002. Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. Dissertation, UC Berkeley, Computer Science Division.

Neal, R. 2000. Slice Sampling. Annals of Statistics 31(3):705–767.

Ortiz, L. E., and Kaelbling, L. P. 2000. Adaptive Importance Sampling for Estimation in Structured Domains. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 446–454.

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.

Poon, H., and Domingos, P. 2006. Sound and Efficient Inference with Probabilistic and Deterministic Dependencies. In Proceedings of the Twenty-First National Conference on Artificial Intelligence, 458–463.

Sang, T.; Beame, P.; and Kautz, H. 2005. Solving Bayesian Networks by Weighted Model Counting. In Proceedings of the Twentieth National Conference on Artificial Intelligence, 475–482.

Savicky, P., and Vomlel, J. 2009. Triangulation Heuristics for BN2O Networks. In Symbolic and Quantitative Approaches to Reasoning with Uncertainty. Springer Berlin Heidelberg. 566–577.

Wang, H., and Song, M. 2011. Ckmeans.1d.dp: Optimal k-means Clustering in One Dimension by Dynamic Programming. The R Journal 3(2):29–33.

Wei, W.; Erenrich, J.; and Selman, B. 2004. Towards Efficient Sampling: Exploiting Random Walk Strategies. In Proceedings of the Nineteenth National Conference on Artificial Intelligence, 670–676.