# testing_probabilistic_circuits__12e80427.pdf Testing Probabilistic Circuits Yash Pote r Kuldeep S. Meel School of Computing, National University of Singapore Probabilistic circuits (PCs) are a powerful modeling framework for representing tractable probability distributions over combinatorial spaces. In machine learning and probabilistic programming, one is often interested in understanding whether the distributions learned using PCs are close to the desired distribution. Thus, given two probabilistic circuits, a fundamental problem of interest is to determine whether their distributions are close to each other. The primary contribution of this paper is a closeness test for PCs with respect to the total variation distance metric. Our algorithm utilizes two common PC queries, counting and sampling. In particular, we provide a poly-time probabilistic algorithm to check the closeness of two PCs, when the PCs support tractable approximate counting and sampling. We demonstrate the practical efficiency of our algorithmic framework via a detailed experimental evaluation of a prototype implementation against a set of 475 PC benchmarks. We find that our test correctly decides the closeness of all 475 PCs within 3600 seconds. 1 Introduction Probabilistic modeling is at the heart of modern computer science, with applications ranging from image recognition and image generation [29, 30] to weather forecasting [3]. Probabilistic models have a multitude of representations, such as probabilistic circuits (PCs) [9], graphical models [19], generative networks [16], and determinantal point processes [20]. Of particular interest to us are PCs, which are known to support guaranteed inference and thus have applications in safety-critical fields such as healthcare [2, 25]. In this work, we will focus on PCs that are fragments of the Negation Normal Form (NNF), specifically DNNFs, d-DNNFs, SDNNFs, and PIs [13]. We refer to the survey by Choi et al. [9] for more details regarding PCs. Given two distributions P and Q, a fundamental problem is to determine whether they are close. Closeness between distributions is frequently quantified using the total variation (TV) distance, d T V (P, Q) = (1/2) P Q 1, where is the ℓ1 norm [21, 6]. Thus, stated formally, closeness testing is the problem of deciding whether d T V (P, Q) ε or d T V (P, Q) η for 0 ε < η 1. Determining the closeness of models has applications in AI planning [13], bioinformatics [31, 33, 35] and probabilistic program verification [15, 23]. Equivalence testing is a special case of closeness testing, where one tests if d T V (P, Q) = 0. Darwiche and Huang [13] initiated the study of equivalence testing of PCs by designing an equivalence test for d-DNNFs. An equivalence test is, however, of little use in contexts where the PCs under test The accompanying tool, available open source, can be found at https://github.com/meelgroup/teq. The Appendix is available in the accompanying supplementary material. The authors decided to forgo the old convention of alphabetical ordering of authors in favor of a randomized ordering, denoted by r . The publicly verifiable record of the randomization is available at https://www.aeaweb.org/journals/policies/random-author-order/search with confirmation code: icoxrj0s NB2L. For citation of the work, authors request that the citation guidelines by AEA for random author ordering be followed. 35th Conference on Neural Information Processing Systems (Neur IPS 2021). encode non-identical distributions that are nonetheless close enough for practical purposes. Such situations may arise due to the use of approximate PC compilation [10] and sampling-based learning of PCs [26, 27]. As a concrete example, consider PCs that are learned via approximate methods such as stochastic gradient descent [27]. In such a case, PCs are likely to converge to close but non-identical distributions. Given two such PCs, we would like to know whether they have converged to distributions close to each other. Thus, we raise the question: Does there exist an efficient algorithm to test the closeness of two PC distributions? In this work, we design the first closeness test for PCs with respect to TV distance, called Teq. Assuming the tested PCs allow poly-time approximate weighted model counting and sampling, Teq runs in polynomial time. Formally, given two PC distributions P and Q, and three parameters (ε,η,δ), for closeness(ε), farness(η), and tolerance(δ), Teq returns Accept if d T V (P, Q) ε and Reject if d T V (P, Q) η with probability at least 1 δ. Teq makes atmost O((η ε) 2 log(δ 1)) calls to the sampler and exactly 2 calls to the counter. Teq builds on a general distance estimation technique of Canonne and Rubinfeld [4] that estimates the distance between two distributions with a small number of samples. In the context of PCs, the algorithm requires access to an exact sampler and an exact counter. Since not all PCs support exact sampling and counting, we modify the technique presented in [4] to allow for approximate samples and counts. Furthermore, we implement and test Teq on a dataset of publicly available PCs arising from applications in circuit testing. Our results show that closeness testing can be accurate and scalable in practice. For some NNF fragments, such as DNNF, no sampling algorithm is known, and for fragments such as PI, sampling is known to be NP-hard [32]. Since Teq requires access to approximate weighted counters and samplers to achieve tractability, the question of determining the closeness of the PCs mentioned above remains unanswered. Thus, we investigate further and characterize the complexity of closeness testing for a broad range of PCs. Our characterization reveals that PCs from the fragments d-DNNFs and SDNNFs can be tested for closeness in poly-time via Teq, owing to the algorithms of Darwiche [11] and Arenas et al. [1]. We show that the SDNNF approximate counting algorithm of Arenas et al. [1] can be extended to log-linear SDNNFs using chain formulas [8]. Then, using previously known results, we also find that there are no poly-time equivalence tests for PCs from PI and DNNF, conditional on widely believed complexity-theoretic conjectures. Our characterization also reveals some open questions regarding the complexity of closeness and equivalence testing of PCs. The rest of the paper is organized in the following way. We define the notation and discuss related work in Section 2. We then present the main contribution of the paper, the closeness test Teq, and the associated proof of correctness in Section 3. We present our experimental findings in Section 4, and then discuss the complexity landscape of closeness testing in Section 5. We conclude the paper and discuss some open problems in Section 6. Due to space constraints, we defer some proofs to the supplementary Section A. 2 Background Let ϕ : {0, 1}n {0, 1} be a circuit over n Boolean variables. An assignment σ {0, 1}n to the variables of ϕ is a satisfying assignment if ϕ(σ) = 1. The set of all satisfying assignments of ϕ is Rϕ. If |Rϕ| > 0, then ϕ is said to be satisfiable and if |Rϕ| = 2n, then ϕ is said to be valid. We use |ϕ| to denote the size of circuit ϕ, where the size is the total number of vertices and edges in the circuit DAG. The polynomial hierarchy (PH) contains the classes ΣP 1 (NP) and ΠP 1 (co-NP) along with generalizations of the form ΣP i and ΠP i where ΠP i+1 = co-NPΠP i and ΣP i+1 = NPΣP i [34]. The classes ΣP i and ΠP i are said to be at level i. If it is shown that two classes on the same or consecutive levels are equal, the hierarchy collapses to that level. Such a collapse is considered unlikely, and hence is used as the basic assumption for showing hardness results, including the ones we present in the paper. 2.1 Probability distributions A weight function w : {0, 1}n Q+ assigns a positive rational weight to each assignment σ. We extend the definition of w to also allow circuits as input: w(ϕ) = P σ Rϕ w(σ). For weight function w and circuit ϕ, w(ϕ) is the weighted model count (WMC) of ϕ w.r.t. w. In this paper, we focus on log-linear weight functions as they capture a wide class of distributions, including those arising from graphical models, conditional random fields, and skip-gram models [24]. Log-linear models are represented as literal-weighted functions, defined as: Definition 1. For a set X of n variables, a weight function w is called literal-weighted if there is a poly-time computable map w : X Q (0, 1) such that for any assignment σ {0, 1}n : w(x) if x = 1 1 w(x) if x = 0 For all circuits ϕ, and log-linear weight functions w, w(ϕ) can be represented in size polynomial in the input. Probabilistic circuits: A probabilistic circuit is a satisfiable circuit ϕ along with a weight function w. ϕ and w together define a discrete probability distribution on the set {0, 1}n that is supported over Rϕ. We denote the p.m.f. of this distribution as: P(ϕ, w)(σ) = 0 ϕ(σ) = 0 w(σ)/w(ϕ) ϕ(σ) = 1 In this paper, we study circuits that are fragments of the Negation Normal Form (NNF). A circuit ϕ in NNF is a rooted, directed acyclic graph (DAG), where each leaf node is labeled with true, false, v or v; and each internal node is labeled with a or and can have arbitrarily many children. We focus on four fragments of NNF, namely, Decomposable NNF(DNNF), deterministic-DNNF(d-DNNF), Structured DNNF(SDNNF), and Prime Implicates(PI). For further information regarding circuits in NNF, refer to the survey [14] and the paper [28]. The TV distance of two probability distributions P(ϕ1, w1) and P(ϕ2, w2) over {0, 1}n is defined as: d T V (P(ϕ1, w1), P(ϕ2, w2)) = 1 σ {0,1}n |P(ϕ1, w1)(σ) P(ϕ2, w2)(σ)|. P(ϕ1, w1) and P(ϕ2, w2) are said to be (1) equivalent if d T V (P(ϕ1, w1), P(ϕ2, w2)) = 0, (2) ε-close if d T V (P(ϕ1, w1), P(ϕ2, w2)) ε, and (3) η-far if d T V (P(ϕ1, w1), P(ϕ2, w2)) η. Our closeness testing algorithm Teq, assumes access to an approximate weighted counter Awct(α, β, ϕ, w), and an approximate weighted sampler Samp(α, β, ϕ, w). We define their behavior as follows: Definition 2. Awct(α, β, ϕ, w) takes a circuit ϕ, a weight function w, a tolerance parameter α > 0 and a confidence parameter β > 0 as input and returns the approximate weighted model count of ϕ w.r.t. w such that 1 + α Awct(α, β, ϕ, w) (1 + α)w(ϕ) 1 β Tractable approximate counting algorithms for PCs are known as Fully Polynomial Randomised Approximation Schemes (FPRAS). The running time of an FPRAS is given by T(α, β, ϕ) = poly(α 1, log(β 1), |ϕ|). Definition 3. Samp(α, β, ϕ, w) takes a circuit ϕ, a weight function w, a tolerance parameter α > 0 and a confidence parameter β > 0 as input and returns either (1) a satisfying assignment σ sampled approximately w.r.t. weight function w with probability 1 β or (2) a symbol indicating failure with probability < β. In other words, whenever Samp samples σ: P(ϕ, w)(σ) 1 + α Pr[Samp(α, β, ϕ, w) = σ] (1 + α)P(ϕ, w)(σ) Tractable approximate sampling algorithms for PCs are known as Fully Polynomial Almost Uniform Samplers (FPAUS). The running time of an FPAUS for a single sample is given by T(α, β, ϕ) = poly(α 1, log(β 1), |ϕ|). In the rest of the paper [m] denotes the set {1, 2, . . . m}, 1(e) represents the indicator variable for event e, and E(v) represents the expectation of random variable v. 2.2 Related work Closeness testing: Viewing circuit equivalence testing through the lens of distribution testing, we see that the d-DNNF equivalence test of Darwiche and Huang [13] can be interpreted as an equivalence test for uniform distribution on the satisfying assignments of d-DNNFs. This relationship between circuit equivalence testing and closeness testing lets us rule out the existence of distributional equivalence tests for all those circuits for which circuit equivalence is already known to be hard under complexity-theoretic assumptions. We will explore this further in Section 5.2. Distribution testing: Discrete probability distributions are typically defined over an exponentially large number of points; hence a lot of recent algorithms research has focused on devising tests that require access to only a sublinear or even constant number of points in the distribution [5]. In this work, we work with distributions over {0, 1}n, and thus we aim to devise algorithms with running time at most polynomial in n. Previous work in testing distributions over Boolean functions has focused on the setting where the distributions offer pair-conditional sampling access [7, 22]. Using pair-conditional sampling access, Meel r et al. [22] were able to test distributions for closeness using O(tilt(ϕ)2/(η 6ε)3η) queries, where tilt is the ratio of the probabilities of the most and least probable element in the support. 3 Teq: a tractable algorithm for closeness testing In this section, we present the main contribution of the paper: a closeness test for PCs, Teq. The pseudocode of Teq is given in Algorithm 1. Given satisfiable circuits ϕ1, ϕ2 and weight functions w1, w2 along with parameters (ε, η, δ), Teq decides whether the TV distance between P(ϕ1, w1) and P(ϕ2, w2) is lesser than ε or greater than η with confidence at least 1 δ. Teq assumes access to an approximate weighted counter Awct(α, β, ϕ, w), and an approximate weighted sampler Samp(α, β, ϕ, w). We define their behavior in the following two definitions. The algorithm Teq starts by computing constants γ and m. Then it queries the Awct routine with circuit ϕ1 and weight function w1 to obtain a p 1 + γ/4 1 approximation of w1(ϕ1) with confidence at least 1 δ/8. A similar query is made for ϕ2 and w2 to obtain an approximate value for w2(ϕ2). These values are stored in k1 and k2, respectively. Teq maintains a m-sized array Γ, to store the estimates for r(σi). Teq now iterates m times. In each iteration, it generates one sample σi through the Samp call on line 7. There is a small probability of at most δ/4m that this call fails and returns . Teq only samples from one of the two PCs. The algorithm then proceeds to compute the weight of assignment σi w.r.t. the weight functions w1 and w2 and stores it in s1 and s2, respectively. Using the weights and approximate weighted counts stored in k1, k2 the algorithm computes the value r(σi) on line 10, where r(σi) is an approximation of the ratio of the probability of σi in the distribution P(ϕ2, w2) to its probability in P(ϕ1, w1). Since σi was sampled from P(ϕ1, w1), its probability in P(ϕ1, w1) cannot be 0, ensuring that there is no division by 0. If the ratio r(σi) is less than 1, then Γ[i] is updated with the value 1 r(σi) otherwise the value of Γ[i] remains 0. After the m iterations, Teq sums up the values in the array Γ. If the sum is found to be less than threshold m(ε + γ), Teq returns Accept and otherwise returns Reject. The following theorem asserts the correctness of Teq. Theorem 1. Given two satisfiable probabilistic circuits ϕ1, ϕ2 and weight functions w1, w2, along with parameters ε < η < 1 and δ < 1, A. If d T V (P(ϕ1, w1), P(ϕ2, w2)) ε, then Teq(ϕ1, w1, ϕ2, w2, ε, η, δ) returns Accept with probability at least (1 δ). B. If d T V (P(ϕ1, w1), P(ϕ2, w2)) η, then Teq(ϕ1, w1, ϕ2, w2, ε, η, δ) returns Reject with probability at least (1 δ). The following theorem states the running time of the algorithm, Theorem 2. Let γ = η ε, then the time complexity of Teq is in O TAwct(γ, δ, max(|ϕ1|, |ϕ2|)) + TSamp(γ, δ, max(|ϕ1|, |ϕ2|)) log(δ 1) γ2 . If the underlying Algorithm 1 Teq(ϕ1, w1, ϕ2, w2, ε, η, δ) 1: γ (η ε)/2 2: m 2 log(4/δ)/γ2 3: Γ [0] m 4: k1 Awct( p 1 + γ/4 1, δ/8, ϕ1, w1) 5: k2 Awct( p 1 + γ/4 1, δ/8, ϕ2, w2) 6: for i {1, 2 . . . , m} do 7: σi Samp(γ/(4η 2γ), δ/4m, ϕ1, w1) 8: if σi = then 9: s1 w1(σi), s2 w2(σi) 10: r(σi) s2 s1 11: if r(σi) < 1 then 12: Γ[i] 1 r(σi) 13: if P i [m] Γ[i] m(ε + γ) then 14: Return Accept 15: else 16: Return Reject PCs support approximate counting and sampling in polynomial time, then the running time of Teq is also polynomial in terms of γ, log(δ 1) and max(|ϕ1|, |ϕ2|). To improve readability, we use P1 to refer to the distribution P(ϕ1, w1) and P2 to refer to P(ϕ2, w2). 3.1 Proving the correctness of Teq In this subsection, we present the theoretical analysis of Teq, and the proof of Theorem 1(A). We will defer the proofs of Theorem 1(B) and Theorem 2 to the supplementary Section A.4.2 and Section A.4.3, respectively. For the purpose of the proof, we will first define events Pass1, Pass2 and Good. Events Pass1, Pass2 are defined w.r.t. the function calls Awct( p 1 + γ/4 1, δ/8, ϕ1, w1) and Awct( p 1 + γ/4 1, δ/8, ϕ2, w2), respectively (as on lines 4, 5 of Algorithm 1). Pass1 and Pass2 represent the events that the two calls correctly return p 1 + γ/4 approximations of the weighted model counts of ϕ1 and ϕ2 i.e. w1(ϕ1) 1+γ/4 Awct( p 1 + γ/4 1, δ/8, ϕ1, w1) ( p 1 + γ/4)w1(ϕ1), and 1+γ/4 Awct( p 1 + γ/4 1, δ/8, ϕ2, w2) ( p 1 + γ/4)w2(ϕ2). From the definition of Awct, we have Pr[Pass1], Pr[Pass2] 1 δ/8. Let Faili denote the event that Samp (Algorithm 1, line 7) returns the symbol in the ith iteration of the loop. By the definition of Samp we know that i [m] Pr[Faili] < δ/4m. The analysis of Teq requires that all m Samp calls and both Awct calls return correctly. We denote this super-event as Good = T i [m] Faili Pass1 Pass2. Applying the union bound we see that the probability of all calls to Awct and Samp returning without error is at least 1 δ/2: Pr[Good] = 1 Pr[ [ i [m] Faili Pass1 Pass2] 1 m δ/4m 2 δ/8 = 1 δ/2 (1) We will now state a lemma, which we will prove in the supplementary Section A.4. Lemma 1. Good r(σ) P2(σ) P1(σ) γ/4 P2(σ) We now prove the lemma critical for our proof of correctness of Teq. Lemma 2. Assuming the event Good, let A = P σ {0,1}n 1 (r(σ) < 1) (1 r(σ)) P1(σ) , then 1. If d T V (P1, P2) ε , then A ε + γ/4 2. If d T V (P1, P2) η , then A η γ/4 Proof. If P x (P1(x) P2(x)) = 0, then 1 2 P x |P1(x) P2(x)| = P x:P1(x) P2(x)>0 (P1(x) P2(x)). Using this fact we see that, d T V (P1, P2) = X σ:P2(σ) 1(r(σ) < 1)}; S3 = {σ : 1( P2(σ) P1(σ) < 1) < 1(r(σ) < 1)}. The definition implies that the indicator 1(r(σ) < 1) is 0 for all assignments in the set S2, and is 1 for all assignments in S3. Similarly 1( P2(σ) P1(σ) < 1) takes value 1 and 0 for all elements in S2 and S3, respectively. Now we bound the magnitude of B, P1(σ) < 1 (1 r(σ)) 1 (r(σ) < 1) P1(σ) For bj > 0, we have that | P j |aj|bj, and thus: P1(σ) < 1 (1 r(σ)) 1 (r(σ) < 1) P1(σ) We can split the summation into three terms based on the sets in which the assignments lie. Some summands take the value 0 in a particular set, so we don t include them in the term. σ S1 1 P2(σ) P1(σ) < 1 r(σ) P2(σ) σ S2 1 P2(σ) P1(σ) < 1 1 P2(σ) σ S3 1 (r(σ) < 1) (1 r(σ)) P1(σ) Since we know that σ S2, r(σ) > 1 and σ S3, P2(σ) P1(σ) > 1, we can alter the second and third terms of the inequality in the following way: σ S1 1 P2(σ) P1(σ) < 1 r(σ) P2(σ) σ S2 1 P2(σ) P1(σ) < 1 r(σ) P2(σ) σ S3 1 (r(σ) < 1) P2(σ) P1(σ) r(σ) P1(σ) X Using our assumption of the event Good and Lemma 1, |B| P σ {0,1}n γ/4 P1(σ) γ/4 Since d T V (P1, P2) A = B, we get |d T V (P1, P2) A| γ/4. We can now deduce that if d T V (P1, P2) ε, then A ε + γ/4 and if d T V (P1, P2) η, then A η γ/4. Using Teq to test PCs in general. Exact weighted model counting(WMC) is a commonly supported query on PCs. In the language of PC queries, a WMC query is known as the marginal (MAR) query. Conditional inference (CON) is another well studied PC query. Using CON and MAR, one can sample from the distribution encoded by a given PC. It is known that if a PC has the structural properties of smoothness and decomposability, then the CON and MAR queries can be computed tractably. For the definitions of the above terms and further details, please refer to the survey [9]. 4 Evaluation To evaluate the performance of Teq, we implemented a prototype in Python. The prototype uses WAPS3 [17] as a weighted sampler to sample over the input d-DNNF circuits. The primary objective of our experimental evaluation was to seek an answer to the following question: Is Teq able to determine the closeness of a pair of probabilistic circuits by returning Accept if the circuits are ε-close and Reject if they are η-far? We test our tool Teq in the following two settings: A. The pair of PCs represent small randomly generated circuits and weight functions. B. The pair of PCs are from the set of publicly available benchmarks arising from sampling and counting tasks. Our experiments were conducted on a high performance compute cluster with Intel Xeon(R) E5-2690 v3@2.60GHz CPU cores. For each benchmark, we use a single core with a timeout of 7200 seconds. 4.1 Setting A - Synthetic benchmarks Dataset Our dataset for experiments conducted in setting A consisted of randomly generated 3CNFs and with random literal weights. Our dataset consisted of 3-CNFs with {14, 15, 16, 17, 18} variables. Since the circuits are small, we validate the results by computing the actual total variation distance using brute-force. d T V Benchmark ϵ η Actual Result Expected Result 14_1 0.9 0.99 0.740 A A 14_2 0.8 0.9 0.764 A A 15_3 0.75 0.94 0.804 R A/R 17_4 0.75 0.9 0.941 R R 18_2 0.75 0.9 0.918 R R Table 1: Runtime performance of Teq. We experiment with 375 random PCs with known d T V , and out of the 375 benchmarks we display 5 in the table and the rest in the supplementary Section B. In the table A represents Accept and R represents Reject. In the last column A/R represents that both Accept and Reject are acceptable outputs for Teq. Results Our tests terminated with the correct result in less than 10 seconds on all the randomly generated PCs we experimented with. We present the empirical results in Table 1. The first column indicates the benchmark s name, the second and third indicate the parameters ε and η on which we executed Teq. The fourth column indicates the actual d T V distance between the two benchmark PCs. The fifth column indicates the output of Teq, and the sixth indicates the expected result. The full detailed results are presented in the appendix Section B. 4.2 Setting B - Real-world benchmarks Dataset We conducted experiments on a range of publicly available benchmarks arising from sampling and counting tasks4. Our dataset contained 100 d-DNNF circuits with weights. We have assigned random weights to literals wherever weights were not readily available. For the empirical evaluation of Teq, we needed pairs of weighted d-DNNFs with known d T V distance. To generate such a dataset, we first chose a circuit and a weight function, and then we synthesized new weight functions using the technique of one variable perturbation, described in the appendix Section B.1. 3https://github.com/meelgroup/WAPS 4https://zenodo.org/record/3793090 d T V ε d T V η Benchmark Result Teq(s) Result Teq(s) or-70-10-8-UC-10 A 23.2 R 22.82 s641_15_7 A 33.66 R 33.51 or-50-5-4 A 414.17 R 408.59 Project Service3 A 356.15 R 356.14 s713_15_7 A 24.86 R 24.41 or-100-10-2-UC-30 A 31.04 R 31.0 s1423a_3_2 A 153.13 R 152.81 s1423a_7_4 A 104.93 R 103.51 or-50-5-10 A 283.05 R 282.97 or-60-20-6-UC-20 A 363.32 R 362.8 Table 2: Runtime performance of Teq. We experiment with 100 PCs with known d T V , and out of the 100 benchmarks we display 10 in the table and the rest in the appendix B. In the table A represents Accept and R represents Reject. The value of the closeness parameter is ε = 0.01 and the farness parameter is η = 0.2. Results We set the closeness parameter ε, farness parameter η and confidence δ for Teq to be 0.01, 0.2 and 0.01, respectively. The chosen parameters imply that if the input pair of probabilistic circuits are 0.01 close in d T V , then Teq returns Accept with probability atleast 0.99, otherwise if the circuits are 0.2 far in d T V , the algorithm returns Reject with probability at least 0.99. The number of samples required for Teq (indicated by the variable m as on line 2 of Algorithm 1) depends only on ε, η, δ and for the values we have chosen, we find that we require m = 294 samples. Our tests terminated with the correct result in less than 3600 seconds on all the PCs we experimented with. We present the empirical results in Table 2. The first column indicates the benchmark s name, the second and third indicate the result and runtime of Teq when presented with a pair of ε-close PCs as input. Similarly, the fourth and fifth columns indicate the result and observed runtime of Teq when the input PCs are η-far . The full set of results are presented in the supplementary Section B. 5 A characterization of the complexity of testing In this section, we characterize PCs according to the complexity of closeness and equivalence testing. We present the characterization in Table 3. The results presented in the table can be separated into (1) hardness results, and (2) upper bounds. The hardness results, presented in Section 5.2, are largely derived from known complexity-theoretic results. The upper bounds, presented in Section 5.1, are derived from a combination of established results, our algorithm Teq and the exact equivalence test of Darwiche and Huang [13](presented in supplementary Section A.1 for completeness). 5.1 Upper bounds In Table 3 we label the pair of classes of PCs that admit a poly-time closeness and equivalence test with green symbols C and E respectively. Darwiche and Huang [13] provided an equivalence test for d-DNNF s. From Theorem 1, we know that PCs that supports the Awct and Samp queries in poly-time must also admit a poly-time approximate equivalence test. A weighted model counting algorithms for d-DNNFs was first provided by Darwiche [11], and a weighted sampler was provided by [17]. Arenas et al. [1] provided the first approximate counting and uniform sampling algorithm for SDNNFs. Using the following lemma, we show that with the use of chain formulas, the uniform sampling and counting algorithms extend to log-linear SDNNF distributions as well. Lemma 3. Given a SDNNF formula ϕ (with a v-tree T), and a weight function w, Samp(ϕ, w) requires polynomial time in the size of ϕ. The proof is provided in the supplementary Section A.5. NNF PI DNNF SDNNF d-DNNF NNF EC PI EC UU DNNF EC EU EU SDNNF EC EU EU EC d-DNNF EC UU EU EC EC Table 3: Summary of results. C (resp. E) indicates that a poly-time closeness (resp. equivalence) test exists. C (resp. E) indicates that a poly-time closeness (equivalence) test exists only if PH collapses. U indicates that the existence of a poly-time test is not known. The table is best viewed in color. 5.2 Hardness In Table 3, we claim that the pairs of classes of PCs labeled with symbols C and E , cannot be tested in poly-time for closeness equivalence, respectively. Our claim assumes that the polynomial hierarchy (PH) does not collapse. To prove the hardness of testing the labeled pairs, we combine previously known facts about PCs and a few new arguments. Summarizing for brevity, We start off by observing that PC families are in a hierarchy, with CNF NNF and DNF SDNNF DNNF [14]. We then reduce the problem of satisfiability testing of CNFs (NP-hard) and validity testing of DNFs (co-NP-hard) into the problem of equivalence and closeness testing of PCs, in Propositions 1, 2 and 3. These propositions and their proofs can be found in the supplementary Section A.5. We then connect the existence of poly-time algorithms for equivalence to the collapse of PH via a complexity result due to Karp and Lipton [18]. The NP-hardness of deciding the equivalence of pairs of DNNFs and pairs of SDNNFs was first shown by Pipatsrisawat and Darwiche [28]. We recast their proofs in the language of distribution testing for the sake of completeness in the supplementary Section A.5. 6 Conclusion and future work In this paper, we studied the problem of closeness testing of PCs. Before our work, poly-time algorithms were known only for the special case of equivalence testing of PCs; and, no poly-time closeness test was known for any PC. We provided the first such test, called Teq, that used ideas from the field of distribution testing to design a novel algorithm for testing the closeness of PCs. We then implemented a prototype for Teq, and tested it on publicly available benchmarks to determine the runtime performance. Experimental results demonstrate the effectiveness of Teq in practice. We also characterized PCs with respect to the complexity of deciding equivalence and closeness. We combined known hardness results, reductions, and our proposed algorithm Teq to classify pairs of PCs according to closeness and equivalence testing complexity. Since the characterization is incomplete, as seen in Table 3, there are questions left open regarding the existence of tests for certain PCs, which we leave for future work. Broader Impact Recent advances in probabilistic modeling techniques have led to increased adoption of the said techniques in safety-critical domains, thus creating a need for appropriate verification and testing methodologies. This paper seeks to take a step in this direction and focuses on testing properties of probabilistic models likely to find use in safety-critical domains. Since our guarantees are probabilistic, practical adoption of such techniques still requires careful design to handle failures. Acknowledgments and Disclosure of Funding We are grateful to the anonymous reviewers of UAI 2021 and Neur IPS 2021 for their constructive feedback that greatly improved the paper. We would also like to thank Suwei Yang and Lawqueen Kanesh for their useful comments on the earlier drafts of the paper. This work was supported in part by National Research Foundation Singapore under its NRF Fellowship Programme[NRFNRFFAI1-2019-0004 ] and AI Singapore Programme [AISG-RP-2018-005], and NUS ODPRT Grant [R-252-000-685-13]. The computational work for this article was performed on resources of the National Supercomputing Centre, Singapore (https://www.nscc.sg). [1] Marcelo Arenas, Luis Alberto Croquevielle, Rajesh Jayaram, and Cristian Riveros. When is Approximate Counting for Conjunctive Queries Tractable? In STOC, 2021. [2] Dominik Aronsky and Peter J Haug. Diagnosing community-acquired pneumonia with a bayesian network. In Proceedings of the AMIA Symposium, 1998. [3] Rafael Cano, Carmen Sordo, and José M Gutiérrez. Applications of bayesian networks in meteorology. In Advances in Bayesian networks. 2004. [4] Clément Canonne and Ronitt Rubinfeld. Testing probability distributions underlying aggregated data. In Automata, Languages, and Programming, 2014. [5] Clément L Canonne. A survey on distribution testing: Your data is big. but is it blue? Theory of Computing, 2020. [6] Clément L. Canonne, Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Testing bayesian networks. IEEE Transactions on Information Theory, 2020. [7] Sourav Chakraborty and Kuldeep S. Meel. On testing of uniform samplers. In AAAI, 2019. [8] Supratik Chakraborty, Dror Fried, Kuldeep S Meel, and Moshe Y Vardi. From weighted to unweighted model counting. In IJCAI, 2015. [9] Yoo Jung Choi, Antonio Vergari, and Guy Van den Broeck. Probabilistic circuits: A unifying framework for tractable probabilistic models. 2020. [10] Karine Chubarian and György Turán. Interpretability of bayesian network classifiers: Obdd approximation and polynomial threshold functions. In ISAIM, 2020. [11] Adnan Darwiche. On the tractable counting of theory models and its application to truth maintenance and belief revision. Journal of Applied Non-Classical Logics, 2001. [12] Adnan Darwiche. A differential approach to inference in bayesian networks. Journal of the ACM (JACM), 2003. [13] Adnan Darwiche and Jinbo Huang. Testing equivalence probabilistically. In Technical Report, 2002. [14] Adnan Darwiche and Pierre Marquis. A knowledge compilation map. JAIR, 2002. [15] Saikat Dutta, Owolabi Legunsen, Zixin Huang, and Sasa Misailovic. Testing probabilistic programming systems. In Proc. of Joint Meeting on ESE and FSE, 2018. [16] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014. [17] Rahul Gupta, Shubham Sharma, Subhajit Roy, and Kuldeep S Meel. Waps: Weighted and projected sampling. In TACAS, 2019. [18] Richard M. Karp and Richard J. Lipton. Some connections between nonuniform and uniform complexity classes. In STOC, 1980. [19] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009. [20] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. ar Xiv preprint ar Xiv:1207.6083, 2012. [21] Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh. Pacgan: The power of two samples in generative adversarial networks. Neur IPS, 2018. [22] Kuldeep S. Meel r , Yash Pote r , and Sourav Chakraborty. On Testing of Samplers. In Neur IPS, 2020. [23] Andrzej S. Murawski and Joël Ouaknine. On probabilistic program equivalence and refinement. In Martín Abadi and Luca de Alfaro, editors, CONCUR, 2005. [24] K.P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012. [25] Agnieszka Oni sko, Marek J Druzdzel, and Hanna Wasyluk. Extension of the hepar ii model to multiple-disorder diagnosis. In Intelligent Information Systems. 2000. [26] Robert Peharz, Steven Lang, Antonio Vergari, Karl Stelzner, Alejandro Molina, Martin Trapp, Guy Van den Broeck, Kristian Kersting, and Zoubin Ghahramani. Einsum networks: Fast and scalable learning of tractable probabilistic circuits. In ICML, 2020. [27] Robert Peharz, Antonio Vergari, Karl Stelzner, Alejandro Molina, Xiaoting Shao, Martin Trapp, Kristian Kersting, and Zoubin Ghahramani. Random sum-product networks: A simple and effective approach to probabilistic deep learning. In UAI. PMLR, 2020. [28] Knot Pipatsrisawat and Adnan Darwiche. New compilation languages based on structured decomposability. In AAAI, 2008. [29] Arthur R Pope and David G Lowe. Probabilistic models of appearance for 3-d object recognition. International Journal of Computer Vision, 2000. [30] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. ar Xiv preprint ar Xiv:1511.06434, 2015. [31] Yasir Rahmatallah, Frank Emmert-Streib, and Galina Glazko. Gene sets net correlations analysis (gsnca): a multivariate differential coexpression test for gene sets. Bioinformatics, 2014. [32] Dan Roth. On the hardness of approximate reasoning. Artificial Intelligence, 82(1-2), 1996. [33] Nicolas Städler and Sach Mukherjee. Multivariate gene-set testing based on graphical models. Biostatistics, 2015. [34] Larry J. Stockmeyer. The polynomial-time hierarchy. Theoretical Computer Science, 1976. [35] Weiwei Yin, Swetha Garimalla, Alberto Moreno, Mary R Galinski, and Mark P Styczynski. A tree-like bayesian structure learning algorithm for small-sample datasets from complex biological model systems. BMC Systems Biology, 2015.