# on_scalable_testing_of_samplers__280915ff.pdf

On Scalable Testing of Samplers

Yash Pote r Kuldeep S. Meel School of Computing, National University of Singapore

In this paper we study the problem of testing of constrained samplers over highdimensional distributions with (ε, η, δ) guarantees. Samplers are increasingly used in a wide range of safety-critical ML applications, and hence the testing problem has gained importance. For n-dimensional distributions, the existing state-of-the-art algorithm, Barbarik2, has a worst case query complexity of exponential in n and hence is not ideal for use in practice. Our primary contribution is an exponentially faster algorithm that has a query complexity linear in n and hence can easily scale to larger instances. We demonstrate our claim by implementing our algorithm and then comparing it against Barbarik2. Our experiments on the samplers w Unigen3 and w STS, ﬁnd that Barbarik3 requires 10 fewer samples for w Unigen3 and 450 fewer samples for w STS as compared to Barbarik2.

1 Introduction

The constrained sampling problem is to draw samples from high-dimensional distributions over constrained spaces. A constrained sampler Q(ϕ, w) takes in a set of constraints ϕ : {0, 1}n {0, 1} and a weight function w : {0, 1}n R>0, and returns a sample σ ϕ 1(1) with probability proportional to w(σ). Constrained sampling is a core primitive of many statistical inference methods used in ML, such as Sequential Monte Carlo[29], Markov Chain Monte Carlo(MCMC)[3, 9], Simulated Annealing [4], and Variational Inference [24]. Sampling from real-world distributions is often computationally intractable, and hence, in practice, samplers are heuristical and lack theoretical guarantees. For such samplers, it is an important problem to determine whether the sampled distribution is close to the desired distribution, and this problem is known as testing of samplers. The problem was formalised in [14, 27] as follows: Given access to a target distribution P, a sampler Q(ϕ, w), and three parameters (ε, η, δ), with probability at least 1 δ, return (1) Accept if d (P, Q(ϕ, w)) < ε, or (2) Reject if d T V (P, Q(ϕ, w)) > η. Here d T V is the total variation distance, d the multiplicative distance, and ε, η, and δ are parameters for closeness, farness and conﬁdence respectively. Access to distribution P is via the DUAL oracle, and access to Q is via the PCOND and SAMP oracles (deﬁned in Section 2.1)

There is substantial interest in the testing problem due to the increasing use of ML systems in real-world applications where safety is essential, such as medicine [2], transportation [8, 25], and warfare [28]. For the ML systems that incorporate samplers, the typical testing approach has been to show the convergence of the sampler with the target distribution via empirical tests that rely on heuristics and do not provide any guarantees [19, 22, 31, 34]. In a recent work [27], a novel framework, called Barbarik2, was proposed that could test a given sampler while providing (ε, η, δ) guarantees, using O tilt(P)2

η(η 3ε)3 queries, where tilt(P) := max σ1,σ2 {0,1}n P(σ1) P(σ2) for P(σ2) > 0. Since the tilt(P)

The accompanying tool, available open source, can be found at https://github.com/meelgroup/barbarik The authors decided to forgo the old convention of alphabetical ordering of authors in favor of a randomized ordering, denoted by r . The publicly veriﬁable record of the randomization is available at https://www. aeaweb.org/journals/policies/random-author-order/search with conﬁrmation code: Lrr1ec P-xv14. For citations, the authors request that the citation guidelines by AEA for random author ordering be followed.

36th Conference on Neural Information Processing Systems (Neur IPS 2022).

can take arbitrary values, we observe that the query complexity can be prohibitively large3. On the

other hand, the best known lower bound for the problem, derived from [30], is Ω

this work, we take a step towards bridging this gap with our algorithm, Barbarik3, that has a query complexity of O n log n (η 11.6ε)η3 + n

η2 , representing an exponential improvement over the state of the art.

To be of any real value, testing tools must be able to scale to larger instances. In the case of constrained samplers, the only existing testing tool, Barbarik2, is not scalable owing to its query complexity. The lack of scalability is illustrated by the following fact: product distributions are the simplest possible constrained distributions, and given a union of two n-dimensional product distributions, Barbarik2 requires more than 108 queries for n > 30. On the other hand, the query complexity of Barbarik3 scales linearly with n, the number of dimensions, thus making it more appropriate for practical use.

We implement Barbarik3 and compare it against Barbarik2 to determine their relative performance. In our experiments, we consider two sets of problems, (1) constrained sampling benchmarks, (2) scalable benchmarks and two constrained samplers w STS and w Unigen3. We found that to complete the test Barbarik3 required at least 450 fewer samples from w STS and 10 fewer samples from w Unigen3 as compared to Barbarik2. Moreover, Barbarik3 terminates with a result on at least 3 more benchmarks than Barbarik2 in each experiment.

Our contributions can be summarized as follows:

1. For the problem of testing of samplers, we provide an exponential improvement in query complexity over the current state of the art test Barbarik2. Our test, Barbarik3, makes a total of O n log n (η 11.6ε)η3 + n

η2 queries, where O hides polylog factors of ε, η and δ.

2. We present an extensive empirical evaluation of Barbarik3 and contrast it with Barbarik2. The results indicate that Barbarik3 requires far fewer samples and terminates on more benchmarks when compared to Barbarik2.

We deﬁne the notation and discuss related work in Section 2. We then present the main contribution of the paper, the test Barbarik3, and its proof of correctness in Section 3. We present our experimental ﬁndings in Section 4 and then we conclude the paper and discuss some open problems in Section 5. Due to space constraints, we defer some proofs and the full experimental results to the supplementary Section A and B respectively.

2 Notation and preliminaries

Probability distributions In this paper we deal with samplers that sample from discrete probability distributions over high-dimensional spaces. We consider the sample space to be the n-dimensional Boolean hypercube {0, 1}n. A constrained sampler Q takes in a set of constraints ϕ : {0, 1}n {0, 1} and a weight function w : {0, 1}n R>0, and samples from the distribution Q(ϕ, w) deﬁned as

Q(ϕ, w)(σ) = w(σ)/w(ϕ) σ ϕ 1(1) 0 σ ϕ 1(0), where w(ϕ) = P

σ ϕ 1(1) w(σ). To improve readability,

we use Q to refer to the distribution Q(ϕ, w). For an element i, D(i) denotes it s probability in distribution D and i D represents that i is sampled from D. For any non-empty set S {0, 1}n, DS is the distribution D conditioned on set S, and D(S) is the probability of S in D i.e., D(S) = P

The total variation (TV) distance of two probability distributions D1 and D2 is deﬁned as: d T V (D1, D2) = 1 2 P

i {0,1}n |D1(i) D2(i)|. For S {0, 1}n, we deﬁne d T V (S)(D1, D2) = 1 2 P

i S |D1(i) D2(i)|. The multiplicative distance of D2 from D1 is deﬁned as: d (D1, D2) = maxi {0,1}n |D2(i)/D1(i) 1|. The two notions of distance obey the identity: 2d T V (D1, D2) d (D1, D2).

In the rest of the paper, E[v] represents the expectation of random variable v and [k] represents the set {1, 2 . . . , k}.

3A simple modiﬁcation reveals that in terms of n, η, ε, the bound is O 4n

Tools used in the analysis

Proposition 1 (Hoeffding). For independent 0-1 random variables Xi, X = Pk i=1 Xi, and t 0, Pr(X EX > t) exp 2t2k and Pr(EX X > t) exp 2t2k

Proposition 2 (Chebyshev). Given bounded r.v. X, we have Pr(|X E[X]| < E[X]) > E[X]2

E[X2] Proposition 3. Given distributions D1 and D2 supported on {0, 1}n, and a set S {0, 1}n,

i S D1(i)D2(i) > (D1(S) + D2(S) 2d T V (S)(D1, D2))2

The proof can be found in the Appendix A.1

If we are given samples {s1, s2, . . . , sn} from a distribution D over [k], then the empirical distribution

b D is deﬁned to be b D(i) = 1

j=1 1{sj=i}.

Proposition 4 (See [11] for a simple proof). Suppose D is a distribution over [k], and b D is constructed

using max k η2 , 2 ln(2/δ)

η2 samples from D. Then d T V (D, b D) η with probability at least 1 δ.

2.1 Testing with the help of oracles

In distribution testing, we are given samples from an unknown distribution P over a large support {0, 1}n, and the task is to test whether P satisﬁes some property of interest. One of the important properties we care about is whether P is close to another distribution Q, and this subﬁeld of testing is known as closeness testing. It was shown by Valiant and Batu et al. that the ability to draw samples from P and Q is not powerful enough, as at least Ω(22n/3) samples are required to provide any sort of probabilistic guarantee for closeness testing. Since n is usually large, it was desirable to ﬁnd tests that could solve the closeness testing problem using polynomially many samples in n.

Motivated by the above requirement, Canonne et al. and Chakraborty et al. introduced the conditional sampling oracle (COND), that is a more powerful way to access distributions. A COND oracle for distribution D over {0, 1}n takes as input a set S {0, 1}n with D(S) > 0, and returns a sample i S with probability D(i)/D(S). It has been shown that the use of the COND oracle, and its variants, drastically reduces the sample complexity of many tasks in distribution testing [1, 21, 12, 15, 6, 26, 7, 17, 13, 30] (see [10] for an extensive survey). In this paper, we consider the pair-conditioning (PCOND) oracle, which is a special case of the COND oracle with the restriction that |S| = 2 i.e., the size of the conditioning set has to be two. To engineer practical PCOND oracle access into constrained samplers, we use the chain formula construction introduced in [14].

With the same goal of designing tests with polynomial sample complexity, a different kind of oracle, known as the DUAL oracle, was proposed by Canonne et al.. The DUAL oracle allows one to sample from a given distribution and also query the distribution for the probability of arbitrary elements of the support. Tractable DUAL oracle access is supported by a number of distribution representations, such as the fragments of probabilistic circuits (PC) that support the EVI query [18]. In our experimental evaluation, we use distributions from one such fragment: weighted d-DNNFs. Weighted d-DNNFs are a class of arithmetic circuits with properties that enable DUAL oracle access in time linear in the size of the circuit [16, 23].

3 Barbarik3: an algorithm for testing samplers

We start by providing a brief overview of our testing algorithm before providing the full analysis.

3.1 Algorithm outline

The pseudocode of Barbarik3 is given in Algorithm 1. We adapt the deﬁnition of bucketing of distributions from [30] for use in our analysis. Deﬁnition 1. For a given k N>0, the bucketing of {0, 1}n with respect to P is deﬁned as follows: For 1 i k, let Si = {b : 2 i < P(b) 2 i+1} and let S0 = {0, 1}n \ S

i [k] Si. Given

Algorithm 1 Barbarik3(P, Q, η, ε, δ)

1: k n + log2(100/η) 2: for i = 1 to k do 3: Si = {b : 2 i < P(b) 2 i+1} 4: S0 = {0, 1}n \ S

i [k] Si 5: BP is the distribution over [k] {0} where we sample i BP if we sample j P and j Si 6: BQ is the distribution over [k] {0} where we sample i BQ if we sample j Q and j Si 7: θ η/20 8: bd Out Bucket(BP, BQ, k, θ, δ/2) 9: if bd > ε/2 + θ then 10: Return Reject 11: ε2 bd + θ 12: Return In Bucket(P, Q, k, ε, ε2, η, δ/2)

any distribution D over {0, 1}n, we deﬁne a distribution BD over [k] {0} as: for 0 i k, BD(i) = D(Si). We call BD the bucket distribution of D and Si the ith bucket.

Barbarik3 takes as input two distributions P and Q deﬁned over the support {0, 1}n, along with the parameters for closeness(ε), farness(η), and conﬁdence(δ). On Line 1, Barbarik3 computes the value of k using η and the number of dimensions n. Then, using DUAL access to P, and SAMP access to Q, Barbarik3 creates bucket distributions BP and BQ as in Defn. 1, in the following way: To sample from BP, Barbarik3 ﬁrst draws a sample j P, then using the DUAL oracle, determines the value of P(j). Then, if j lies in the ith bucket i.e., 2 i < P(j) 2 i+1, the algorithm takes sample i as the sample from BP. Similarly, to draw a sample from BQ, Barbarik3 draws a sample j Q and then, using the DUAL oracle to ﬁnd P(j), ﬁnds i such that j lies in the ith bucket, and then uses i as the sample.

Barbarik3 then calls two subroutines, Out Bucket (Section 3.4) and In Bucket (Section 3.3). The Out Bucket subroutine returns an θ-multiplicative estimate of the TV distance between BP and BQ, the two bucket distributions of P and Q, with an error of at most δ/2. If it is found on Line 9 that the estimate bd is greater than ε/2 + θ, we know that d T V (P, Q) > ε/2 and also that d (P, Q) > ε, and hence the algorithm returns Reject. Otherwise, the algorithm calls the In Bucket subroutine.

Now suppose that d T V (P, Q) η. Then, for ε2 (Line 11), it is either the case that d T V (BP, BQ) > ε2 or else d T V (BP, BQ) ε2. In the former case, the algorithm returns Reject on Line 10, and in the latter case the In Bucket subroutine returns Reject. In both cases, the failure probability is at most δ/2. Thus Barbarik3 returns Reject on given η-far input distributions with probability at least 1 δ.

We will now prove the following theorem:

Theorem 1. Barbarik3(P, Q, η, ε, δ) takes in distributions P and Q deﬁned over {0, 1}n, and parameters η (0, 1], ε [0, η/11.6) and δ (0, 1/2]. Barbarik3 has DUAL access to P, and PCOND+SAMP access to Q. With probability at least 1 δ, Barbarik3 returns

Accept if d (P, Q) ε Reject if d T V (P, Q) > η

Barbarik3 has query complexity O n log(n) η3(η 11.6ε) + n

η2 , where O hides polylog factors of ε, η and δ.

3.2 Lower bound

The lower bound comes from the paper of Narayanan [30], where it appears in Theorem 1.6. Phrased in the jargon of our paper, the lower bound states that distinguishing between d T V (P, Q) > η and d (P, Q) = 0 requires Ω( p

n/ log(n)/η2) samples. Note that the lower bound is shown on a special case (ε = 0) of our problem. Hence the lower bound applies to our problem as well. Furthermore, the lower bound is shown for the case where distribution P provides full access, i.e., the algorithm can make arbitrary queries to P. This is a stronger access model than DUAL. Since the lower bound is for a stronger access model, it extends to our problem as well.

3.3 The In Bucket subroutine

In this section, we present the In Bucket subroutine, whose behavior is stated in the following lemma.

Lemma 1. In Bucket(P, Q, k, ε, ε2, η, δ) takes as input two distributions P, Q, an integer k and parameters ε, ε2, η, δ. If d (P, Q) ε, In Bucket returns Accept. If d T V (P, Q) η and d T V (BP, BQ) < ε2, then In Bucket returns Reject. In Bucket errs with probability at most δ .

Algorithm 2 In Bucket(P, Q, k, ε, ε2, η, δ)

1: ε1 (0.99η 3.25ε2 2ε/(1 ε))/1.05 + 2ε/(1 ε) 2: m

k/(0.99η 3.25ε2 ε1) 3: α (ε1 + 2ε/(1 ε))/2

4: t l ln(4/δ) ln(10/(10 ε1+α)) m

5: for t iterations do 6: ΓP m samples from P 7: i [k]Γi P ΓP Si Si is deﬁned in Defn. 1 8: ΓQ m samples from Q 9: i [k]Γi Q ΓQ Si 10: for all j [k] s.t. |Γj P|, |Γj Q| > 0 do

11: p Γj P p is an arbitrary sample from the set Γj P 12: q Γj Q q is an arbitrary sample from the set Γj Q 13: h P(p) P(p)+P(q)(1+ 2ε

14: ℓ P(p) P(p)+P(q)(1+α)

15: r l 2 ln(4mt/δ)

16: bc Bias(Q, p, q, r) 17: if bc (h + ℓ)/2 then 18: Return Reject 19: Return Accept

Algorithm 3 Bias(Q, p, q, r)

1: if p and q are identical then 2: Return 0.5 3: ΓQ{p,q} r samples from Q{p,q} 4: Return # of times p appears in ΓQ{p,q}

In Bucket makes extensive use of the PCOND oracle access to Q via the Bias subroutine, which we describe in the following subsection.

The Bias subroutine The Bias subroutine takes in distribution Q, two elements p, q and a positive integer r. Then, using the PCOND oracle, Bias draws r samples from the conditional distribution Q{p,q} and returns the number of times it sees p in the r samples. It can be seen that the returned value is an empirical estimate of Q(p) Q(p)+Q(q). Let the estimate be c cpq. We use the Hoeffding bound in Prop. 1, and the value of r from Line 15 of Alg. (2) to show that:

Pr Q(p) Q(p) + Q(q) c cpq h ℓ

δ 4mt Pr c cpq Q(p) Q(p) + Q(q) h ℓ

Here t represents the number of iterations of the outer loop (Line 4), and m is the number of samples drawn from BP and BQ. Together, there are at most mt pairs of samples that are passed to the Bias oracle. Since in each invocation of Bias, the probability of error is δ/4mt, using the union bound we ﬁnd that the probability that all mt Bias calls return correctly is at least 1 δ/4 and thus with probability at least 1 δ/4, the empirical estimate c cpq is closer than (h ℓ)/4 to Q(p) Q(p)+Q(q).

Henceforth we assume: c cpq Q(p) Q(p) + Q(q)

3.3.1 The Accept case

In this section we will provide an analysis of the case when d (P, Q) < ε. We will now state a proposition required for the remaining proofs, the proof of which we relegate to Appendix A.4.

Proposition 5. Let P, Q be distributions and let p P and q Q. Then,

1. If d (P, Q) < ε then

Q(p) Q(p) + Q(q) P(p) P(p) + (1 + 2ε 1 ε)P(q)

2. If d T V (P, Q) > ε1, then for 0 α < ε1, with probability at least (d T V (P, Q) α)/2,

Q(p) Q(p) + Q(q) < P(p) P(p) + (1 + α)P(q)

From our assumption (1), we know that for all invocations of Bias, with probability at least 1 δ/4, c cpq Q(p) Q(p)+Q(q) (h ℓ)/2. Using Prop. 5, and using the value of h given on Line 13, we can

see that Q(p) Q(p)+Q(q) > h. From this we can observe that for all invocations of Bias, c cpq > (h + ℓ)/2 and the test does not return Reject in any iteration, hence eventually returning Accept. Thus, in the case that d (P, Q) < ε, the In Bucket subroutine returns Accept with probability at least 1 δ/4.

3.3.2 The Reject case

In this section we analyse the case when d T V (P, Q) η and d T V (BP, BQ) ε2 and we will show that the algorithm returns Reject with probability at least 1 δ. For the purpose of the proof we will deﬁne a set of bad buckets Bad [k]. Note that bucket {0} is not in Bad.

Deﬁnition 2. Bad = {i [k] : d T V (PSi, QSi) > ε1 BP(i)/BQ(i) [5 1, 2]}

Suppose we have an indicator variable Xr,s constructed as follows: draw m samples from P and Q, and if the rth sample from P and the sth sample from Q both belong to some bucket b Bad, then Xr,s = 1 else Xr,s = 0. Then,

E[Xr,s] = X

b Bad BP(b)BQ(b) > (BP(Bad) + BQ(Bad) 2d T V (Bad)(BP, BQ))2

The inequality is by the application of Prop. 3.

We analyse the expression for the expectation in the following lemma, the proof of which we relegate to Appendix A.2

BQ(Bad) + BP(Bad) 2d T V (Bad)(BQ, BP) > 2 0.99η 13

Using Lemma 2 we immediately derive the fact that E[Xr,s] > 0.99η 13

4 ε2 ε1 2 /K. Let X = P

r,s [m] Xr,s. Given m samples from P and Q, Pr(X 1) is the probability that there is at least one bucket in Bad that is sampled at least once each in both sets of samples.

Lemma 3. Pr(X 1) > 1/5

The proof can be found in Appendix A.3.

Henceforth we will condition on the the event that X 1. In such a case, we know that for some k Bad, there is a sample p PSk and a sample q QSk. Then for such a pair of samples (p, q), and some α, Prop. 5 tells us that with probability at least (d T V (P, Q) α)/2 we have

Q(p) Q(p) + Q(q) < P(p) P(p) + (1 + α)P(q)

Using the assumption made in (1), we immediately have that c cpq Q(p) Q(p)+Q(q) + h ℓ

2 . But from

Prop. 5 we have that Q(p) Q(p)+Q(q) < ℓand hence c cpq < (h + ℓ)/2. Since d T V (P, Q) ε1, we see that if X 1, then with probability at least (ε1 α)/2, the iteration returns Reject.

Then, using Lemma 3 we see that in every iteration, with probability at least (ε1 α)/10, In Bucket returns Reject. There are t iterations, where t (line 4) is chosen such that the overall probability of the test returning Reject is at least 1 δ/2.

3.4 The Out Bucket subroutine

The Out Bucket subroutine takes as input two distributions D1, D2 over k + 1 elements and two parameters θ and δ. Then with probability at least 1 δ, In Bucket returns a θ-multiplicative estimate for d T V (D1, D2).

The Out Bucket starts by drawing max 4(k+1)

θ2 , 8 ln(4/δ)

θ2 samples from the two distributions D1

and D2, and constructs the empirical distributions c D1 and c D2. Then from Prop. 4, we know that with probability at least 1 δ, both d T V (D1, c D1) θ/2 and d T V (D2, c D2) θ/2.

From the triangle inequality we have that,

d T V (c D1, c D2) d T V (D1, c D1) + d T V (D2, c D2) + d T V (D1, D2) < θ + d T V (D1, D2)

and also that,

d T V (D1, D2) d T V (D1, c D1) + d T V (D2, c D2) + d T V (c D1, c D2) < θ + d T V (c D1, c D2)

Thus with probability at least 1 δ, the returned estimate d T V (c D1, c D2) satisﬁes |d T V (c D1, c D2) d T V (D1, D2)| < θ.

Query and runtime complexity The number of queries made by Out Bucket to P and Q is given by O n η2 , where O hides polylog factors of ε, η and δ. The number of queries required

by In Bucket is given by mtr. Bounding the terms individually, we see that m = O n η 11.6ε ,

t = O 1 η and r = O log n

η2 . Thus mtr = O n log n (η 11.6ε)η3 and hence the total query complexity

is O n log n (η 11.6ε)η3 + n

4 Evaluation

To evaluate the performance of Barbarik3 and test the quality of publicly available samplers, we implemented Barbarik3 in Python. Our evaluation took inspiration from the experiments presented in previous work [14, 27], and we used the same framework to evaluate our proposed algorithm. The role of target distribution P was played by WAPS4 [23]. WAPS compiles the input Boolean formula into a representation that allows exact sampling and exact probability computation, thereby implementing the SAMP and EVAL oracles needed for our test.

For the role of sampler Q(ϕ, w), we used the state-of-the-art samplers w STS and w Unigen3. w Unigen3 [32] is a hashing-based sampler that provides (ε, δ) guarantees on the quality of the samples. w STS [20] is a sampler designed for sampling over challenging domains such as energy barriers and highly asymmetric spaces. w STS generates samples much faster than w Unigen3, albeit

4https://github.com/meelgroup/WAPS

without any guarantees on the quality of the samples. To implement PCOND access, we use the Kernel construction from [14]. Kernel takes in ϕ and two assignments σ1, σ2, and returns a function bϕ on m variables, such that: (1) m > n, (2) ϕ and bϕ are similar in structure, and (3) for σ bϕ 1(1), it holds that σ supp(ϕ) {σ1, σ2}. Here σ supp(ϕ) denotes the projection of σ on the variables of ϕ.

For the closeness(ε), farness(η), and conﬁdence(δ) parameters, we choose the values 0.05, 0.9 and 0.2. This setting implies that for a given distribution P, and for a given sampler Q(ϕ, w), Barbarik3 returns (1) Accept if d (P, Q(ϕ, w)) < 0.05, and (2) Reject if d T V (P, Q(ϕ, w)) > 0.9, with probability at least 0.8. Our empirical evaluation sought to answer the question: How does the performance of Barbarik3 compare with the state-of-the-art tester Barbarik2?

Our experiments were conducted on a high-performance compute cluster with Intel Xeon(R) E52690v3@2.60GHz CPU cores. We use a single core with 4GB memory with a timeout of 16 hours for each benchmark. We set a sample limit of 108 samples for our experiments due to our limited computational resources. The complete experimental data along with the running time of instances, is presented in the Appendix B.

4.1 Setting A - scalable benchmarks

Dataset Our dataset consists of the union of two n-dimensional product distributions, for n {4, 7, 10, . . . , 118}. We have 39 problems in the dataset. We represent the union of two product distributions as the constraint: ϕ(σ) = V2k i=1(σ3k+1 σi) V3k i=2k+1( σ3k+1 σi), and the weight function: w(σ) = Q3k i=2k+1 3σi, where σi is the value of σ at position i.

Results We observe that in the case of w STS, Barbarik2 can handle only 12 instances within the sample limit of 108. On the other hand, Barbarik3 can handle all 39 instances using at the most 106 samples. In the case of w Unigen3, Barbarik2 solves 5 instances, and Barbarik3 can handle 17 instances.

Figure 1 shows a cactus plot comparing the sample requirement of Barbarik3 and Barbarik2. The x-axis represents the number of benchmarks and y-axis represents the number of samples, a point (x, y) implies that the relevant tester took less than y number of samples to distinguish between d T V (P, Q(ϕ, w)) > η and d (P, Q(ϕ, w)) < ε, for x many benchmarks. We display the set of benchmarks for which at least one of the two tools terminated within the sample limit of 108. We want to highlight that the y-axis is in log-scale, thus showing the sample efﬁciency of Barbarik3 compared to Barbarik2. For every benchmark, we compute the ratio of the number of samples required by Barbarik2 to test a sampler and the number of samples required by Barbarik3. The geometric mean of these ratios indicates the mean speedup. We ﬁnd that the Barbarik3 s speedup on w STS is 451 and on w Unigen3 is 10 .

4.2 Setting B - real-life benchmarks

Dataset We experiment on 87 constraints drawn from a collection of publicly available benchmarks arising from sampling and counting tasks5. We use distributions from the log-linear family. In a loglinear distribution, the probability of an element σ ϕ 1(1) is given as: Pr[σ] exp (Pn i=1 σiθi), where θi Rn >0. We found that w Unigen3 was not able to sample from most of the benchmarks in the dataset within the given time limit, and hence we present the results only for w STS.

Results We ﬁnd that Barbarik3 terminated with a result on all 87 instances from the set of real-life benchmarks, while Barbarik2 could only terminate on 16. We present the results of our experiments in Table 1. The ﬁrst column indicates the benchmark s name, and the second column has the number of dimensions of the space the distribution is deﬁned on. The third and ﬁfth columns indicate the number of samples required by Barbarik2 and Barbarik3. The fourth and sixth columns report the output of Barbarik2 and Barbarik3.

5https://zenodo.org/record/3793090

0 5 10 15 20 25 30 35 38 104

# of samples (logscale)

Barbarik2 Barbarik3

0 2 4 6 8 10 12 14 16 # of instances

# of samples (logscale)

Barbarik2 Barbarik3

Figure 1: Cactus plot: Barbarik3 vs. Barbarik2. We set the sample limit to be 108, and our dataset consists of 39 benchmarks. The plot shows all the instances where at least one of the two tools terminated within the time limit of 16 hours and sample limit of 108.

Table 1: Runtime performance of Barbarik3. We experiment with 87 benchmarks, and out of the 87 benchmarks we display 15 in the table and we display the full data in Appendix B . In the table A represents Accept, R represents Reject and TO represents that the tester either asked for more than 108 samples or did not terminate in the given time limit of 16 hours.

Barbarik2 Barbarik3

Benchmark Dimensions Result # of samples Result # of samples

Set Test.sk_9_21 21 R 2817 R 58000 Pollard.sk_1_10 10 R 7606 R 36000 s444_3_2 24 R 848148 R 64000 s526a_3_2 24 R 848148 R 64000 s510_15_7 25 R 12708989 R 66000 s27_new_7_4 7 A 23997012 R 30000 s298_15_7 17 R 38126967 R 50000 s420_3_2 34 TO - R 83000 s382_3_2 24 TO - R 64000 s641_3_2 54 TO - R 123000 111.sk_2_36 36 TO - R 87000 7.sk_4_50 50 TO - R 115000 56.sk_6_38 38 TO - R 91000 s820a_15_7 23 TO - R 62000 Project Service3.sk_12_55 55 TO - R 125000

5 Conclusion

In this paper, we studied the problem of testing constrained samplers over high-dimensional distributions with (ε, η, δ) guarantees. For n-dimensional distributions, the existing state-of-the-art testing algorithm, Barbarik2, has a worst-case query complexity that is exponential in n and hence is not ideal for use in practice. We provided an exponentially faster algorithm, Barbarik3, that has a query complexity linear in n and hence can easily scale to larger instances. We implemented Barbarik3 and tested the samplers w STS and w Unigen3 to determine their sample complexity in practice. The results demonstrate that Barbarik3 is signiﬁcantly more sample efﬁcient than Barbarik2, requiring 450 fewer samples when it tested w STS and 10 fewer samples when it tested w Unigen3. Since there is a n gap between the upper bound provided by our work and the lower bound shown in [30], the problem of designing a more sample efﬁcient algorithm or ﬁnding a stronger lower bound, remains open.

Limitations For a given farness parameter η, Barbarik3 requires the value of the closeness parameter ε to lie in the interval [0, η/11.6). In the case of Barbarik2, the previous state-of-the-art test, the permissible values of ε for a given η lie in the interval [0, η/3). Thus, Barbarik3 supports testing with only a subset of parameter values that Barbarik2 support.

Acknowledgments and Disclosure of Funding

We are grateful to the anonymous reviewers of Neur IPS 2022 for their constructive feedback that greatly improved the paper. We thank Ayush Jain and Shyam Narayanan for helpful discussions regarding the paper. We would also like to thank Anna L.D. Latour and Priyanka Golia for their useful comments on the code and the earlier drafts of the paper.

This work was supported in part by National Research Foundation Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme, NRF Fellowship Programme [NRF-NRFFAI1-2019-0004] and Ministry of Education Singapore Tier 2 grant [MOE-T2EP201210011], Ministry of Education Singapore Tier 1 Grant [R-252-000-B59-114]. The computational work for this article was performed on resources of the National Supercomputing Centre, Singapore.

[1] Jayadev Acharya, Clément L. Canonne, and Gautam Kamath. A chasm between identity and equivalence testing with conditional queries. Electron. Colloquium Comput. Complex., 2014.

[2] Filippo Amato, Alberto López, Eladia María Peña-Méndez, Petr Vaˇnhara, Aleš Hampl, and Josef Havel. Artiﬁcial neural networks in medical diagnosis, 2013.

[3] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. An introduction to mcmc for machine learning. Machine learning, 2003.

[4] Christophe Andrieu, Nando De Freitas, and Arnaud Doucet. Reversible jump mcmc simulated annealing for neural networks. ar Xiv preprint ar Xiv:1301.3833, 2013.

[5] Tu gkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D Smith, and Patrick White. Testing closeness of discrete distributions. Journal of the ACM (JACM), 2013.

[6] Rishiraj Bhattacharyya and Sourav Chakraborty. Property testing of joint distributions using conditional samples. ACM Transactions on Computation Theory (TOCT), 2018.

[7] Eric Blais, Clément L Canonne, and Tom Gur. Distribution testing lower bounds via reductions from communication complexity. ACM Transactions on Computation Theory (TOCT), 2019.

[8] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. ar Xiv preprint ar Xiv:1604.07316, 2016.

[9] Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of markov chain monte carlo. CRC press, 2011.

[10] Clément L Canonne. A survey on distribution testing: Your data is big. but is it blue? Theory of Computing, 2020.

[11] Clément L Canonne. A short note on learning discrete distributions. ar Xiv preprint ar Xiv:2002.11457, 2020.

[12] Clément L Canonne, Dana Ron, and Rocco A Servedio. Testing probability distributions using conditional samples. SIAM Journal on Computing, 2015.

[13] Clément L Canonne, Xi Chen, Gautam Kamath, Amit Levi, and Erik Waingarten. Random restrictions of high dimensional distributions and uniformity testing with subcube conditioning. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA). SIAM, 2021.

[14] Sourav Chakraborty and Kuldeep S Meel. On testing of uniform samplers. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, 2019.

[15] Sourav Chakraborty, Eldar Fischer, Yonatan Goldhirsh, and Arie Matsliah. On the power of conditional samples in distribution testing. SIAM Journal on Computing, 2016.

[16] Mark Chavira and Adnan Darwiche. On probabilistic inference by weighted model counting. Artiﬁcial Intelligence, 2008.

[17] Xi Chen, Rajesh Jayaram, Amit Levi, and Erik Waingarten. Learning and testing junta distributions with sub cube conditioning. In Conference on Learning Theory. PMLR, 2021.

[18] Yoo Jung Choi, Antonio Vergari, and Guy Van den Broeck. Probabilistic circuits: A unifying framework for tractable probabilistic models. 2020.

[19] Mary Kathryn Cowles and Bradley P. Carlin. Markov chain monte carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association, 1996.

[20] Stefano Ermon, Carla P. Gomes, and Bart Selman. Uniform solution sampling using a constraint solver as an oracle. In UAI, 2012.

[21] Moein Falahatgar, Ashkan Jafarpour, Alon Orlitsky, Venkatadheeraj Pichapati, and Ananda Theertha Suresh. Faster algorithms for testing under conditional sampling. In Conference on Learning Theory. PMLR, 2015.

[22] Andrew Gelman and Donald B. Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, 1992.

[23] Rahul Gupta, Shubham Sharma, Subhajit Roy, and Kuldeep S Meel. Waps: Weighted and projected sampling. In TACAS, 2019.

[24] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 2013.

[25] Kyle D Julian, Mykel J Kochenderfer, and Michael P Owen. Deep neural network compression for aircraft collision avoidance systems. Journal of Guidance, Control, and Dynamics, 2019.

[26] Gautam Kamath and Christos Tzamos. Anaconda: A non-adaptive conditional sampling algorithm for distribution testing. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2019.

[27] Kuldeep S. Meel r , Yash Pote r , and Sourav Chakraborty. On testing of samplers. In Proceedings of Advances in Neural Information Processing Systems(Neur IPS), 12 2020.

[28] Forrest E. Morgan, Benjamin Boudreaux, Andrew J. Lohn, Mark Ashby, Christian Curriden, Kelly Klima, and Derek Grossman. Military Applications of Artiﬁcial Intelligence: Ethical Concerns in an Uncertain World. RAND Corporation, Santa Monica, CA, 2020.

[29] Christian A Naesseth, Fredrik Lindsten, and Thomas B Schön. Elements of sequential monte carlo. ar Xiv preprint ar Xiv:1903.04797, 2019.

[30] Shyam Narayanan. On tolerant distribution testing in the conditional sampling model. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA). SIAM, 2021.

[31] Christian P Robert, George Casella, and George Casella. Monte Carlo statistical methods, volume 2. Springer, 1999.

[32] Mate Soos, Stephan Gocht, and Kuldeep S Meel. Tinted, detached, and lazy cnf-xor solving and its applications to counting and sampling. In International Conference on Computer Aided Veriﬁcation, 2020.

[33] Paul Valiant. Testing symmetric properties of distributions. SIAM Journal on Computing, 40, 2011.

[34] Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. Rank-normalization, folding, and localization: An improved ˆR for assessing convergence of MCMC (with Discussion). Bayesian analysis, 2021.

1. For all authors...

(a) Do the main claims made in the abstract and introduction accurately reﬂect the paper s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes]

(c) Did you discuss any potential negative societal impacts of your work? [Yes] (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] 2. If you are including theoretical results...

(a) Did you state the full set of assumptions of all theoretical results? [Yes] (b) Did you include complete proofs of all theoretical results? [Yes] 3. If you ran experiments...

(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A] (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...

(a) If your work uses existing assets, did you cite the creators? [Yes] (b) Did you mention the license of the assets? [Yes]

(c) Did you include any new assets either in the supplemental material or as a URL? [Yes] (d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? [N/A] (e) Did you discuss whether the data you are using/curating contains personally identiﬁable information or offensive content? [N/A] 5. If you used crowdsourcing or conducted research with human subjects...

(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]