Published as a conference paper at ICLR 2025

COMPUTATIONAL EXPLORATIONS OF TOTAL VARIATION DISTANCE

Arnab Bhattacharyya (University of Warwick), Sutanu Gayen (IIT Kanpur), Kuldeep S. Meel (University of Toronto and Georgia Institute of Technology), Dimitrios Myrisiotis (CNRS@CREATE LTD.), A. Pavan (Iowa State University), N. V. Vinodchandran (University of Nebraska-Lincoln)

(Work done while the author was affiliated with the National University of Singapore.)

ABSTRACT

We investigate some previously unexplored (or underexplored) computational aspects of total variation (TV) distance. First, we give a simple deterministic polynomial-time algorithm for checking equivalence between mixtures of product distributions, over arbitrary alphabets. This corresponds to a special case, whereby the TV distance between the two distributions is zero. Second, we prove that unless NP = RP, it is impossible to efficiently estimate the TV distance between arbitrary Ising models, even in a bounded-error randomized setting.

1 INTRODUCTION

The total variation (TV) distance between distributions P and Q over a common sample space D, denoted by $d_{TV}(P, Q)$, is defined as
$$d_{TV}(P, Q) := \max_{S \subseteq D}\,(P(S) - Q(S)) = \frac{1}{2} \sum_{x \in D} |P(x) - Q(x)| = \sum_{x \in D} \max(0,\, P(x) - Q(x))\,.$$
The TV distance satisfies many basic properties that make it a versatile and fundamental measure for quantifying the dissimilarity between probability distributions. First, it has an explicit probabilistic interpretation: the TV distance between two distributions is the maximum gap between the probabilities they assign to a single event. Second, it satisfies many mathematically desirable properties: it is bounded and lies in [0, 1], it is a metric, and it is invariant with respect to bijections. Third, it satisfies an interesting composability property: given $f(g_1, g_2, \ldots, g_n)$, suppose we replace $g_2$ with $g_2'$ such that $d_{TV}(g_2, g_2') \le \varepsilon$. Then the TV distance between $f(g_1, g_2, \ldots, g_n)$ and $f(g_1, g_2', \ldots, g_n)$ is at most $\varepsilon$. For these reasons, the total variation distance is a central distance measure employed in a wide range of areas, including probability and statistics (Mitzenmacher & Upfal, 2005), machine learning (Shalev-Shwartz & Ben-David, 2014), information theory (Cover & Thomas, 2006), cryptography (Stinson, 1995), data privacy (Dwork, 2006), and pseudorandomness (Vadhan, 2012).
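To make the three equivalent expressions above concrete, here is a minimal illustration (ours, not the paper's) for distributions over a finite sample space, represented as dictionaries mapping outcomes to probabilities:

```python
# A minimal sketch of the three equivalent forms of TV distance for finite
# distributions. Function names and the dictionary representation are ours.

def tv_distance(P, Q):
    """Half the L1 distance between two finite distributions."""
    support = set(P) | set(Q)
    return 0.5 * sum(abs(P.get(x, 0.0) - Q.get(x, 0.0)) for x in support)

def tv_distance_onesided(P, Q):
    """The equivalent sum of positive parts, sum_x max(0, P(x) - Q(x))."""
    support = set(P) | set(Q)
    return sum(max(0.0, P.get(x, 0.0) - Q.get(x, 0.0)) for x in support)

P = {0: 0.5, 1: 0.5}
Q = {0: 1 / 3, 1: 2 / 3}
assert abs(tv_distance(P, Q) - tv_distance_onesided(P, Q)) < 1e-12
print(tv_distance(P, Q))  # 1/6
```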
Lately, the computational aspects of TV distance have attracted a lot of attention. Sahai & Vadhan (2003) established, in a seminal work, that additively approximating the TV distance between two distributions that are samplable by Boolean circuits is hard for SZK (Statistical Zero Knowledge). The complexity class SZK is fundamental to cryptography and is believed to be computationally hard. Subsequent works captured variations of this theme (Goldreich et al., 1999; Malka, 2015; Dixon et al., 2020): for example, Goldreich et al. (1999) showed that the problem of deciding whether a distribution samplable by a Boolean circuit is close to or far from the uniform distribution is complete for NISZK (Non-Interactive Statistical Zero Knowledge). Moreover, Cortes et al. (2007), Lyngsø & Pedersen (2002), and Kiefer (2018) showed that it is undecidable to check whether the TV distance between two hidden Markov models is greater than a threshold or not, and that it is #P-hard to additively approximate it. Finally, Bhattacharyya et al. (2023) showed that (a) exactly computing the TV distance between product distributions is #P-complete, and (b) multiplicatively approximating the TV distance between Bayes nets is NP-hard.

On an algorithmic note, Bhattacharyya et al. (2020) designed efficient algorithms to additively approximate the TV distance between distributions that are efficiently samplable and efficiently computable (including the case of ferromagnetic Ising models). In particular, they designed efficient algorithms for additively approximating the TV distance of structured high-dimensional distributions such as Bayesian networks, Ising models, and multivariate Gaussians. In a similar vein, Pote & Meel (2021) studied a related property testing variant of TV distance for distributions encoded by circuits.

Multiplicative approximation of TV distance has received less attention compared to additive approximation. Recently, Bhattacharyya et al. (2023) gave an FPTAS for estimating the TV distance between an arbitrary product distribution and a product distribution with a bounded number of distinct marginals. Feng et al. (2023) designed an FPRAS for multiplicatively approximating the TV distance between two arbitrary product distributions, and Feng et al. (2024) gave an FPTAS for the same task. Finally, Bhattacharyya et al. (2024) gave an FPRAS for estimating the TV distance between Bayes nets of small treewidth.

In this paper we address some previously unexplored (or under-explored) computational aspects of total variation distance relating to mixtures of product distributions and Ising models.

1.1 EQUIVALENCE CHECKING FOR MIXTURES OF PRODUCT DISTRIBUTIONS

Mixtures of product distributions constitute a natural and important class of distributions that have been studied in the mathematics and computer science literature. For instance, it is a standard observation that any distribution can be described by some (possibly large) mixture of product distributions (see Observation 9 in Appendix A). Freund & Mansour (1999) gave an efficient algorithm for learning a mixture of two product distributions over the Boolean domain. As part of their analysis, they showed that given two mixtures of two product distributions, their KL divergence can be upper bounded by that of the components and a certain distance between the mixture coefficients. However, this upper bound does not lead to an equivalence checking algorithm. A related problem in machine learning is source identification, whereby one is asked to identify the source parameters of a distribution. Gordon et al. (2021), Gordon & Schulman (2022), and Gordon et al. (2023) give algorithms for source identification of a mixture of k product distributions on n bits, when given as input approximate values of multilinear moments.

We focus on the equivalence checking problem for mixtures of product distributions. Note that while it is easy to check whether two product distributions are equivalent, that is, by checking whether their respective Bernoulli parameters are equal, it is not clear how to do so for the case of mixtures of product distributions. This is so because there are mixtures of product distributions that are equal (as distributions) even though different sets of Bernoulli parameters describe them. For example, consider the case where we have two mixtures over one bit, namely $P = 1 \cdot P_1 + 0 \cdot P_2$ and $Q = \frac{1}{2} Q_1 + \frac{1}{2} Q_2$, where $P_1 = P_2 = \mathrm{Bern}(\frac{1}{2})$, while $Q_1 = \mathrm{Bern}(\frac{1}{3})$ and $Q_2 = \mathrm{Bern}(\frac{2}{3})$. In this case, $P = Q = \mathrm{Bern}(\frac{1}{2})$, but the parameters of P and Q are different.
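As a quick numerical check of this example (our own illustration; the helper name is ours), one can evaluate both parameterizations on each outcome:

```python
# Both parameterizations induce the same distribution Bern(1/2) on one bit.

def mixture_pmf(weights, bernoullis, x):
    # Pr[X = x] under the mixture, x in {0, 1}; each component is Bern(p).
    return sum(w * (p if x == 1 else 1 - p) for w, p in zip(weights, bernoullis))

P_weights, P_params = [1.0, 0.0], [0.5, 0.5]
Q_weights, Q_params = [0.5, 0.5], [1 / 3, 2 / 3]
for x in (0, 1):
    assert abs(mixture_pmf(P_weights, P_params, x)
               - mixture_pmf(Q_weights, Q_params, x)) < 1e-12
```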
We present a simple deterministic polynomial-time algorithm for checking equivalence between mixtures of product distributions. Let us first formally define mixtures of product distributions. Let $w_1, \ldots, w_k$ be real numbers (weights) such that $0 \le w_i \le 1$ for all $1 \le i \le k$ and $\sum_{i=1}^k w_i = 1$, and let $P_1, \ldots, P_k$ be n-dimensional product distributions over an alphabet $\Sigma$. The distribution P specified by the tuple $(w_1, \ldots, w_k, P_1, \ldots, P_k)$ is a mixture of products if for all $x \in \Sigma^n$ it is the case that $P(x) = \sum_{i=1}^k w_i P_i(x)$. For a distribution P, we denote by $P^{\le i}$ its marginal on the first i variables. We may now state our first main result.

Theorem 1. There is a deterministic algorithm E that, given two mixtures of product distributions P and Q, specified by $(w_1, \ldots, w_k, P_1, \ldots, P_k)$ and $(v_1, \ldots, v_k, Q_1, \ldots, Q_k)$, respectively, decides whether P = Q or not. Moreover, if $P \ne Q$, then E outputs some $x \in \Sigma^i$ (with $i \le n$) such that $P^{\le i}(x) \ne Q^{\le i}(x)$. The running time of E is $O(nk^4|\Sigma|^4)$.

(Note that the algorithm outlined in Theorem 1 has input size $\Omega(kn|\Sigma|)$.) The primary conceptual contribution of our work is a connection between equivalence checking for mixtures of distributions and basis construction over an appropriately chosen vector space. The connection lends itself to a construction that makes the algorithm, as well as its proof, accessible to undergraduates.
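To make the representation used in Theorem 1 concrete, here is a short sketch (ours; the names and the data layout are assumptions) that evaluates a mixture of product distributions at a point, exactly as in the definition $P(x) = \sum_i w_i P_i(x)$:

```python
# A mixture of k product distributions over Sigma^n is stored as k weights
# plus, for each component i and coordinate j, a table of marginals.
import itertools

def product_pmf(marginals, x):
    # marginals[j][symbol] = Pr[X_j = symbol]; x is a tuple over the alphabet.
    p = 1.0
    for j, s in enumerate(x):
        p *= marginals[j][s]
    return p

def mixture_of_products_pmf(weights, components, x):
    # P(x) = sum_i w_i * P_i(x), exactly as in the definition above.
    return sum(w * product_pmf(m, x) for w, m in zip(weights, components))

# Example: k = 2 components over Sigma = {0, 1}, n = 2.
comps = [
    [{0: 0.5, 1: 0.5}, {0: 0.9, 1: 0.1}],
    [{0: 0.2, 1: 0.8}, {0: 0.4, 1: 0.6}],
]
weights = [0.3, 0.7]
total = sum(mixture_of_products_pmf(weights, comps, x)
            for x in itertools.product((0, 1), repeat=2))
assert abs(total - 1.0) < 1e-12  # the mixture is a distribution
```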
1.2 HARDNESS OF APPROXIMATING TOTAL VARIATION DISTANCE BETWEEN ISING MODELS

The Ising model (Ising, 1925; Lenz, 1920), originally developed to describe ferromagnetism in statistical mechanics, serves as a cornerstone in the study of phase transitions and critical phenomena. It consists of discrete variables, known as spins, which can take values of either +1 or -1. These spins are arranged on a lattice, and their interactions with nearest neighbors lead to a rich tapestry of behavior, including spontaneous magnetization and phase transitions at critical temperatures. One of the most fascinating aspects of the Ising model is its ability to illustrate complex systems using simple rules. For instance, in 2D, it exhibits a second-order phase transition at a critical temperature, where the system changes from a disordered state to an ordered state as temperature decreases. This model has been extensively studied, leading to profound insights not only in physics but also in fields such as biology, sociology, and computer science. For more information, the reader is invited to consult the survey by Cipra (1987).

On another note, the computational study of the Ising model has become increasingly relevant. With its relatively simple structure (interacting binary spins on a lattice), the Ising model serves as an ideal platform for exploring computational techniques ranging from Monte Carlo simulations to mean-field approximations. Monte Carlo methods, in particular, are widely used to investigate thermodynamic properties of the Ising model, as they allow for efficient sampling of spin configurations at various temperatures, enabling the computation of quantities like magnetization and susceptibility. Some notable algorithmic results along these lines are those of Kasteleyn (1963) and Fisher (1966), who showed that the evaluation of the partition function for planar Ising models can be reduced to appropriate determinant computations.

Moreover, Jerrum & Sinclair (1993) devised an efficient Monte Carlo approximation algorithm for estimating the partition function of arbitrary ferromagnetic (whereby all the $w_{i,j}$'s are positive) Ising models. On the other hand, there are works pertaining to the intractability of computing various quantities of interest regarding Ising models (Welsh, 1993), such as the partition function outlined above. For example, Jerrum & Sinclair (1993) show that unless NP = RP, there is no fully polynomial-time randomized approximation scheme (FPRAS) to estimate the partition function of arbitrary Ising models. Moreover, Istrail (2000) proves that computing the partition function (for various kinds of Ising models) is NP-complete. The second part of our work falls in this latter category.

Let us first fix some notation. We focus on Ising models P such that for all $x \in \{-1, 1\}^n$ the probability that the underlying system of spins assumes the configuration x is
$$P(x) = \frac{\exp\left(\sum_{i,j} w_{i,j} x_i x_j + \sum_i h_i x_i\right)}{Z}\,,$$
whereby $Z := \sum_{y \in \{-1,1\}^n} \exp\left(\sum_{i,j} w_{i,j} y_i y_j + \sum_i h_i y_i\right)$ is the partition function of P, and $\{w_{i,j}\}_{i,j}$ and $\{h_i\}_i$ are the parameters of the system.

We prove that it is hard to estimate the TV distance between Ising models under the very mild complexity-theoretic assumption NP $\ne$ RP, which states that Boolean formula satisfiability (SAT) does not admit any one-sided-error randomized polynomial-time algorithm, that is, a randomized polynomial-time algorithm that may output a false positive answer with small probability (Arora & Barak, 2009).

Theorem 2. If NP $\ne$ RP, then there is no FPRAS that estimates the TV distance between any two Ising models.

Our proof draws on the hardness result of Jerrum & Sinclair (1993), and shows that the partition function of Ising models can be reduced to the TV distance between Ising models by a simple, efficient, approximation-preserving reduction. The main ingredients of this reduction are as follows. First, we prove that estimating the partition function of any Ising model reduces to estimating any atomic marginal of the form $\Pr_P[x_k = 1]$, for any variable $x_k$ and any Ising model P (see Proposition 6). Then we show that estimating any atomic marginal of the form $\Pr_P[x_k = 1]$, for any variable $x_k$ and any Ising model P, can be reduced to estimating the TV distance between Ising models P and Q, whereby Q depends on P (see Proposition 7).

1.3 PAPER ORGANIZATION

We give some preliminaries in Section 2. We prove Theorem 1 in Section 3 and Theorem 2 in Section 4. We conclude in Section 5 with some interesting open problems. Observation 9 is proved in Appendix A and Proposition 6 is proved in Appendix B.

2 PRELIMINARIES

We require the following folklore result, which is an application of Gaussian elimination.

Proposition 3. There is a deterministic algorithm G that gets as input a set of vectors V and outputs a maximum-size subset $S \subseteq V$ of linearly independent vectors. The running time of G is $O(|V|^4)$.

An n-dimensional product distribution R over an alphabet $\Sigma$ is described by the $n|\Sigma|$ parameters $\{\Pr_R[X_i = y]\}_{i \in [n],\, y \in \Sigma}$, so that $R(x) = \prod_{i=1}^n \Pr_R[X_i = x_i]$ for all $x \in \Sigma^n$. For an n-dimensional product distribution R over an alphabet $\Sigma$, we denote its marginal over the first $1 \le j \le n$ coordinates by $R^{\le j}$. Note that for any $x \in \Sigma^j$ we have $R^{\le j}(x) = \prod_{i=1}^j \Pr_R[X_i = x_i]$.
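The algorithm G of Proposition 3 can be realized by standard row reduction. The following is a minimal sketch (our own illustration; the function name and the use of floating-point arithmetic with a tolerance are ours). It greedily keeps each vector that is not in the span of the vectors kept so far:

```python
# A sketch of Proposition 3 via Gaussian elimination over the reals.

def independent_subset(vectors, tol=1e-9):
    basis, kept = [], []                    # basis holds reduced forms
    for v in vectors:
        w = list(v)
        for b in basis:                     # eliminate against current basis
            pivot = next(i for i, x in enumerate(b) if abs(x) > tol)
            w = [wi - (w[pivot] / b[pivot]) * bi for wi, bi in zip(w, b)]
        if any(abs(x) > tol for x in w):    # v is independent of the basis
            basis.append(w)
            kept.append(v)
    return kept

vs = [(1.0, 0.0, 1.0), (0.0, 1.0, 1.0), (1.0, 1.0, 2.0), (0.0, 0.0, 1.0)]
print(independent_subset(vs))  # drops the third vector, the sum of the first two
```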
We shall also require the following notion of an approximation algorithm.

Definition 4. A function $f : \{0, 1\}^* \to \mathbb{R}$ admits a fully polynomial-time randomized approximation scheme (FPRAS) if there is a randomized algorithm A such that for every input x (of length n) and parameters $\varepsilon, \delta > 0$, the algorithm A outputs an $\varepsilon$-multiplicative approximation of f(x), i.e., a value that lies in the interval $[f(x)/(1+\varepsilon),\, (1+\varepsilon) f(x)]$, with probability at least $1 - \delta$. The running time of A is polynomial in n, $1/\varepsilon$, and $1/\delta$.

3 EQUIVALENCE CHECKING FOR MIXTURES OF PRODUCT DISTRIBUTIONS

Let us now prove Theorem 1. First, observe that P = Q if and only if $P^{\le j} = Q^{\le j}$ for all $1 \le j \le n$. Indeed, if P = Q, then every marginal of P matches the respective marginal of Q (in symbols, $P^{\le j} = Q^{\le j}$ for all $1 \le j \le n$); otherwise, there would be some $1 \le j \le n$ and $y \in \Sigma^j$ such that $P^{\le j}(y) \ne Q^{\le j}(y)$, and the latter would establish the existence of an $x := (y, z) \in \Sigma^n$ (for some $z \in \Sigma^{n-j}$) such that $P(x) \ne Q(x)$. Conversely, if $P^{\le j} = Q^{\le j}$ for all $1 \le j \le n$, then $P^{\le n} = Q^{\le n}$, which in particular implies that P = Q.

Note that the condition $P^{\le j} = Q^{\le j}$ for all $1 \le j \le n$ is equivalent, by the definitions of P and Q, to the condition $\sum_{i=1}^k w_i P_i^{\le j} = \sum_{i=1}^k v_i Q_i^{\le j}$ for all $1 \le j \le n$. Thus, if $P \ne Q$, then there is some $1 \le j \le n$ such that $\sum_{i=1}^k w_i P_i^{\le j} \ne \sum_{i=1}^k v_i Q_i^{\le j}$. We will use an inductive argument on $1 \le j \le n$ to show that these conditions can be efficiently checked (in either case).

Base Case. For j = 1, we can efficiently check whether it is the case that $\sum_{i=1}^k w_i P_i^{\le 1} = \sum_{i=1}^k v_i Q_i^{\le 1}$. This is done by checking, for all $x \in \Sigma$, that
$$\sum_{i=1}^k w_i \Pr_{P_i}[X_1 = x] - \sum_{i=1}^k v_i \Pr_{Q_i}[X_1 = x] = 0\,.$$
If these tests pass, then the algorithm proceeds with the inductive argument outlined below (otherwise, it outputs x). Towards this, we will now find a basis $B_1$ for the set of coefficient vectors of the equations
$$\left\{ \sum_{i=1}^k w_i \Pr_{P_i}[X_1 = x] \cdot z_i - \sum_{i=1}^k v_i \Pr_{Q_i}[X_1 = x] \cdot z_{k+i} = 0 \right\}_{x \in \Sigma}$$
over the variables $z_1, \ldots, z_{2k}$. Note that the size of $B_1$ is at most $\min(2k, |\Sigma|) \le 2k$. We can find $B_1$ as follows. We appeal to Proposition 3 and run the algorithm G outlined there on the set of vectors
$$\left\{ \left( w_1 \Pr_{P_1}[X_1 = x], \ldots, w_k \Pr_{P_k}[X_1 = x], -v_1 \Pr_{Q_1}[X_1 = x], \ldots, -v_k \Pr_{Q_k}[X_1 = x] \right) \right\}_{x \in \Sigma}$$
(in time $O(|\Sigma|^4)$). Then, we define $B_1$ to be the set of independent vectors output by G.

Induction Hypothesis. Assume that for some $j \ge 1$ it is the case that $\sum_{i=1}^k w_i P_i^{\le j} = \sum_{i=1}^k v_i Q_i^{\le j}$, and that we have a basis $B_j$ for the set of coefficient vectors of the following equations
$$\left\{ \sum_{i=1}^k w_i P_i^{\le j}(x) \cdot z_i - \sum_{i=1}^k v_i Q_i^{\le j}(x) \cdot z_{k+i} = 0 \right\}_{x \in \Sigma^j}$$
over the variables $z_1, \ldots, z_{2k}$. Note that $B_j$ is of size at most $\min(2k, |\Sigma|^j) \le 2k$.

Induction Step. We will establish that we can check whether P and Q agree up to coordinate j + 1, and compute a basis $B_{j+1}$ for the respective set of coefficient vectors of the equations that capture this equivalence. To see whether P and Q agree up to coordinate j + 1, one needs to check that
$$\sum_{i=1}^k w_i P_i^{\le j}(x) \Pr_{P_i}[X_{j+1} = y] - \sum_{i=1}^k v_i Q_i^{\le j}(x) \Pr_{Q_i}[X_{j+1} = y] = 0$$
for all $x \in \Sigma^j$ and $y \in \Sigma$. A crucial observation (that follows from the induction hypothesis) is that we only need to check whether these equations hold for the assignments x that correspond to vectors in $B_j$, and the values $y \in \Sigma$. (Note that each basis vector $b \in B_j$ can be specified by an assignment $x_b \in \Sigma^j$. This follows from the way these basis vectors are constructed; see the discussion after Claim 5 below.) If any of these tests fails, then the algorithm outputs (x, y); else, it continues as follows.
To proceed with the induction, it would suffice to show how to construct a basis $B_{j+1}$ for the set of coefficient vectors of the following equations over the variables $z_1, \ldots, z_{2k}$, namely
$$\left\{ \sum_{i=1}^k w_i P_i^{\le j}(x) \Pr_{P_i}[X_{j+1} = y] \cdot z_i - \sum_{i=1}^k v_i Q_i^{\le j}(x) \Pr_{Q_i}[X_{j+1} = y] \cdot z_{k+i} = 0 \right\}_{x \in \Sigma^j,\, y \in \Sigma}\,.$$
Let $B_j = \{b_1 = (b_{1,1}, \ldots, b_{1,2k}), \ldots, b_m = (b_{m,1}, \ldots, b_{m,2k})\}$, whereby $m \le 2k$, and let $C := \bigcup_{\ell=1}^m C_\ell$, where for each $1 \le \ell \le m$,
$$C_\ell := \left\{ \left( b_{\ell,1} \Pr_{P_1}[X_{j+1} = y], \ldots, b_{\ell,k} \Pr_{P_k}[X_{j+1} = y],\; b_{\ell,k+1} \Pr_{Q_1}[X_{j+1} = y], \ldots, b_{\ell,2k} \Pr_{Q_k}[X_{j+1} = y] \right) \right\}_{y \in \Sigma}\,.$$
We require the following claim. Below, we denote the dot product of vectors $a, b \in \mathbb{R}^s$ by $\langle a, b \rangle$; that is, $\langle a, b \rangle = \sum_{i=1}^s a_i b_i$.

Claim 5. It is the case that $B_{j+1} \subseteq C$.

Claim 5 helps explain how, for any vector $b \in B_{j+1}$, one may extract an assignment $x_b \in \Sigma^{j+1}$ that corresponds to b. This is done by keeping track of the Bernoulli parameters that appear in b: there is exactly one such parameter for each variable, and it refers to a symbol in $\Sigma$.

Proof of Claim 5. Let $y \in \Sigma$. We have that
$$\sum_{i=1}^k w_i P_i^{\le j+1}(x, y) \cdot z_i - \sum_{i=1}^k v_i Q_i^{\le j+1}(x, y) \cdot z_{k+i} = \sum_{i=1}^k w_i P_i^{\le j}(x) \Pr_{P_i}[X_{j+1} = y] \cdot z_i - \sum_{i=1}^k v_i Q_i^{\le j}(x) \Pr_{Q_i}[X_{j+1} = y] \cdot z_{k+i}\,.$$
Let $z := (z_1, \ldots, z_{2k})$. By the induction hypothesis, for any $x \in \Sigma^j$, the expression $\sum_{i=1}^k w_i P_i^{\le j}(x) \cdot z_i - \sum_{i=1}^k v_i Q_i^{\le j}(x) \cdot z_{k+i}$ can be written as a linear combination of $\{\langle b_\ell, z \rangle\}_{\ell=1}^m$. Concretely, there exist $d_1, \ldots, d_m \in \mathbb{R}$ such that
$$\sum_{i=1}^k w_i P_i^{\le j}(x) \cdot z_i - \sum_{i=1}^k v_i Q_i^{\le j}(x) \cdot z_{k+i} = \sum_{\ell=1}^m d_\ell \langle b_\ell, z \rangle\,.$$
Since the corresponding coefficients of each $z_i$ must be equal between the LHS and the RHS of this equation, we get that for all $1 \le i \le k$,
$$w_i P_i^{\le j}(x) = \sum_{\ell=1}^m d_\ell b_{\ell,i} \quad \text{and} \quad -v_i Q_i^{\le j}(x) = \sum_{\ell=1}^m d_\ell b_{\ell,k+i} \tag{1}$$
for some $d_1, \ldots, d_m$. Then it is straightforward to see that for every $y \in \Sigma$,
$$\sum_{i=1}^k w_i P_i^{\le j}(x) \Pr_{P_i}[X_{j+1} = y] \cdot z_i - \sum_{i=1}^k v_i Q_i^{\le j}(x) \Pr_{Q_i}[X_{j+1} = y] \cdot z_{k+i} = \sum_{\ell=1}^m d_\ell \langle c_\ell, z \rangle\,,$$
whereby $c_\ell$ is the vector within the set $C_\ell$ that corresponds to the setting $X_{j+1} = y$. To see why this equation holds, we consider the coefficients of each $z_i$ on either side of this equation and show that they are equal. Let us consider the case where $1 \le i \le k$. (The case where $k+1 \le i \le 2k$ is similar.) On the LHS, the coefficient of $z_i$ is equal to
$$w_i P_i^{\le j}(x) \Pr_{P_i}[X_{j+1} = y] = \sum_{\ell=1}^m d_\ell b_{\ell,i} \Pr_{P_i}[X_{j+1} = y]\,,$$
by Equation (1), while on the RHS it is equal to $\sum_{\ell=1}^m d_\ell c_{\ell,i} = \sum_{\ell=1}^m d_\ell b_{\ell,i} \Pr_{P_i}[X_{j+1} = y]$, by the definition of $c_\ell$. This shows that $B_{j+1}$ is a subset of C, and the proof is complete.

By Claim 5, the basis $B_{j+1}$ can be chosen from among the (at most) $m|\Sigma| = O(k|\Sigma|)$ vectors of C. However, not all of these $O(k|\Sigma|)$ vectors are (necessarily) independent. By Proposition 3, we can find the (at most) 2k independent vectors among the $O(k|\Sigma|)$ vectors in time $O(k^4|\Sigma|^4)$. These vectors constitute $B_{j+1}$. This concludes the inductive description of our algorithm.

Running Time. Let $T(n, k, \Sigma)$ denote the running time of this procedure on mixtures of size k over n variables with alphabet $\Sigma$. We have the recurrence relation $T(n, k, \Sigma) \le T(n-1, k, \Sigma) + O(k^4|\Sigma|^4)$, which in particular yields $T(n, k, \Sigma) = O(nk^4|\Sigma|^4)$.
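The following is a compact sketch of the whole procedure (our own rendering of the proof above, not code from the paper). We adopt the sign convention that the Q-entries of each coefficient vector are negated, so the marginal test at each stage amounts to checking that every tracked coefficient vector sums to zero; prefixes in $\Sigma^j$ are carried along so that a witness can be output on failure:

```python
def prune_to_basis(tagged, tol=1e-9):
    # Proposition 3: keep a maximal independent subset, remembering the
    # prefix in Sigma^j that produced each kept vector.
    reduced, kept = [], []
    for tag, vec in tagged:
        r = list(vec)
        for b in reduced:
            piv = next(i for i, x in enumerate(b) if abs(x) > tol)
            f = r[piv] / b[piv]
            r = [ri - f * bi for ri, bi in zip(r, b)]
        if any(abs(x) > tol for x in r):
            reduced.append(r)
            kept.append((tag, vec))
    return kept

def check_equivalence(w, P, v, Q, n, sigma, tol=1e-9):
    # P[i][j][s] = Pr_{P_i}[X_{j+1} = s]; likewise Q. Returns (True, None)
    # if the two mixtures are equal, else (False, witness_prefix).
    k = len(w)
    frontier = [((s,),
                 [w[i] * P[i][0][s] for i in range(k)]
                 + [-v[i] * Q[i][0][s] for i in range(k)]) for s in sigma]
    for j in range(1, n + 1):
        for prefix, vec in frontier:           # the marginal test at stage j
            if abs(sum(vec)) > tol:
                return False, prefix
        if j == n:
            break
        basis = prune_to_basis(frontier, tol)  # B_j, at most 2k vectors
        frontier = [(prefix + (s,),            # extend by coordinate j + 1
                     [b[i] * P[i][j][s] for i in range(k)]
                     + [b[k + i] * Q[i][j][s] for i in range(k)])
                    for prefix, b in basis for s in sigma]
    return True, None

# The one-bit example from the introduction (Bern(p) as {0: 1-p, 1: p}).
w, P = [1.0, 0.0], [[{0: 0.5, 1: 0.5}], [{0: 0.5, 1: 0.5}]]
v, Q = [0.5, 0.5], [[{0: 2 / 3, 1: 1 / 3}], [{0: 1 / 3, 1: 2 / 3}]]
print(check_equivalence(w, P, v, Q, 1, (0, 1)))  # (True, None)
```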
4 HARDNESS OF APPROXIMATING TOTAL VARIATION DISTANCE BETWEEN ISING MODELS

In this section we prove Theorem 2.

4.1 REDUCING THE PARTITION FUNCTION TO ATOMIC MARGINALS

We first observe the following. The proof is standard and is given in Appendix B.

Proposition 6. If there is some FPRAS that estimates any atomic marginal of any Ising model P, namely $\Pr_P[x_k = 1]$ for any variable $x_k$, then there is some FPRAS that estimates the partition function of any Ising model.

4.2 REDUCING THE ATOMIC MARGINALS TO TV DISTANCE

We prove the following.

Proposition 7. If there is some FPRAS that estimates the TV distance between any two Ising models, then there is some FPRAS that estimates any atomic marginal of any Ising model P, namely $\Pr_P[x_k = 1]$ for any variable $x_k$.

Proof. Assume that there is some FPRAS that estimates the TV distance between any two Ising models. We will show that there is some FPRAS that estimates the atomic marginals of any Ising model P, namely $\Pr_P[x_k = 1]$ for any variable $x_k$. Let $\varepsilon$ be the desired accuracy of the FPRAS that estimates the atomic marginals of any Ising model P.

Fix some Ising model P with parameters $\{w_{i,j}\}_{i,j}$ and $\{h_i\}_i$. We shall first introduce a new dummy variable $x_0$. Let $P_0$ be a new Ising model over $x_0, \ldots, x_n$ with parameters $\{w_{i,j}\}_{i,j}$ and $\{h_i\}_i$ extended so that $w_{0,i} = 0$ for all $i > 0$ and $h_0 < 0$ (that is, $h_0$ is a negative quantity which we will fix later). Note that under these conditions $x_0$ is independent from every other node $x_1, \ldots, x_n$. Let us define $E(x) := \exp\left(\sum_{i,j} w_{i,j} x_i x_j + \sum_{i>0} h_i x_i\right)$ for all x. Since $x_0$ does not interact with the other spins, we have
$$\Pr_{P_0}[x_0 = 1] = \sum_{x : x_0 = 1} P_0(x) = \frac{\sum_{x : x_0 = 1} \exp\left(\sum_{i,j} w_{i,j} x_i x_j + h_0 + \sum_{i>0} h_i x_i\right)}{\sum_x \exp\left(\sum_{i,j} w_{i,j} x_i x_j + h_0 x_0 + \sum_{i>0} h_i x_i\right)} = \frac{\exp(h_0) \sum_x E(x)}{\exp(h_0) \sum_x E(x) + \exp(-h_0) \sum_x E(x)} = \frac{\exp(2h_0)}{\exp(2h_0) + 1}\,.$$
We shall now define another Ising model Q over $x_0, \ldots, x_n$ as follows. The model Q has parameters $\{w'_{i,j}\}_{i,j}$ and $\{h'_i\}_i$ so that $w'_{i,j} := w_{i,j}$ if $(i,j) \ne (0,k)$, $w'_{0,k} := w_{0,k} + \delta$ for some $\delta > 1$, and $h'_i = h_i$ for all i. We have that for all $x \in \{-1, 1\}^{n+1}$ it is the case that
$$Q(x) = \frac{\exp\left(\sum_{i,j} w'_{i,j} x_i x_j + \sum_i h'_i x_i\right)}{Z_Q} = \frac{\exp(\delta x_0 x_k) \exp\left(\sum_{i,j} w_{i,j} x_i x_j + \sum_i h_i x_i\right)}{Z_Q} = \exp(\delta x_0 x_k) \cdot P_0(x) \cdot \frac{Z_{P_0}}{Z_Q}\,,$$
whereby $Z_{P_0}$ and $Z_Q$ are the partition functions of $P_0$ and Q, respectively. Thus, if $x_0 x_k = 1$, then $Q(x) \propto \exp(\delta) P_0(x)$; else $Q(x) \propto \exp(-\delta) P_0(x)$. Moreover, for any x such that $x_0 x_k = 1$,
$$Q(x) = \frac{\exp(\delta) P_0(x)}{\sum_{x : x_0 x_k = 1} \exp(\delta) P_0(x) + \sum_{x : x_0 x_k = -1} \exp(-\delta) P_0(x)} = \frac{P_0(x)}{\sum_{x : x_0 x_k = 1} P_0(x) + \exp(-2\delta) \sum_{x : x_0 x_k = -1} P_0(x)} \ge \frac{P_0(x)}{\sum_x P_0(x)} = P_0(x)\,.$$
Similarly, for any x such that $x_0 x_k = -1$,
$$Q(x) = \frac{\exp(-\delta) P_0(x)}{\sum_{x : x_0 x_k = 1} \exp(\delta) P_0(x) + \sum_{x : x_0 x_k = -1} \exp(-\delta) P_0(x)} = \frac{P_0(x)}{\sum_{x : x_0 x_k = 1} \exp(2\delta) P_0(x) + \sum_{x : x_0 x_k = -1} P_0(x)} \le \frac{P_0(x)}{\sum_x P_0(x)} = P_0(x)\,.$$
That is, $P_0(x) \ge Q(x)$ if and only if $x_0 x_k = -1$. We now have
$$d_{TV}(P_0, Q) = \sum_x \max(0,\, P_0(x) - Q(x)) = \sum_{x : P_0(x) \ge Q(x)} (P_0(x) - Q(x)) = \sum_{x : x_0 x_k = -1} (P_0(x) - Q(x)) = \sum_{x : x_0 x_k = -1} P_0(x) - \sum_{x : x_0 x_k = -1} Q(x)\,.$$
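Before bounding the two sums on the right-hand side, here is a small brute-force sanity check of the construction so far (our own illustration; the model size and parameter values are arbitrary). With $\delta$ large and $h_0$ very negative, $d_{TV}(P_0, Q)$ should approach $\Pr_P[x_k = 1]$:

```python
import itertools, math, random

def ising_pmf(n, w, h):
    # w: dict {(i, j): weight}, h: dict {i: field}; returns {config: prob}.
    weights = {}
    for x in itertools.product((-1, 1), repeat=n):
        e = sum(wij * x[i] * x[j] for (i, j), wij in w.items()) \
            + sum(hi * x[i] for i, hi in h.items())
        weights[x] = math.exp(e)
    Z = sum(weights.values())
    return {x: p / Z for x, p in weights.items()}

random.seed(0)
n, k = 3, 2                               # variables x_1..x_3; target x_k
w = {(i, j): random.uniform(-1, 1) for i in range(n) for j in range(i + 1, n)}
h = {i: random.uniform(-1, 1) for i in range(n)}
P = ising_pmf(n, w, h)
marginal = sum(p for x, p in P.items() if x[k] == 1)

# Build P0 and Q on x_0..x_n (index 0 is the dummy spin).
h0, delta = -8.0, 8.0
w0 = {(i + 1, j + 1): wij for (i, j), wij in w.items()}
hP0 = {i + 1: hi for i, hi in h.items()}
hP0[0] = h0
P0 = ising_pmf(n + 1, w0, hP0)
wQ = dict(w0)
wQ[(0, k + 1)] = delta                    # couple the dummy spin to x_k
Q = ising_pmf(n + 1, wQ, hP0)

dtv = sum(max(0.0, P0[x] - Q[x]) for x in P0)
print(marginal, dtv)                      # nearly equal for large delta, -h0
```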
Moreover,
$$\sum_{x : x_0 x_k = -1} P_0(x) = \Pr_{P_0}[x_0 x_k = -1] = \Pr_{P_0}[x_0 = 1, x_k = -1] + \Pr_{P_0}[x_0 = -1, x_k = 1] = \Pr_{P_0}[x_0 = 1] \Pr_{P_0}[x_k = -1] + \Pr_{P_0}[x_0 = -1] \Pr_{P_0}[x_k = 1]$$
$$= \Pr_{P_0}[x_0 = 1] \Pr_P[x_k = -1] + \Pr_{P_0}[x_0 = -1] \Pr_P[x_k = 1] = \Pr_{P_0}[x_0 = 1] + \left(1 - 2\Pr_{P_0}[x_0 = 1]\right) \Pr_P[x_k = 1]$$
$$= \frac{\exp(2h_0)}{\exp(2h_0) + 1} + \left(1 - \frac{2\exp(2h_0)}{\exp(2h_0) + 1}\right) \Pr_P[x_k = 1]\,,$$
whereby we have used the independence of $x_0$ from the remaining spins and the fact that $\Pr_{P_0}[x_k = 1] = \Pr_P[x_k = 1]$. That is,
$$d_{TV}(P_0, Q) = \frac{\exp(2h_0)}{\exp(2h_0) + 1} - \sum_{x : x_0 x_k = -1} Q(x) + \left(1 - \frac{2\exp(2h_0)}{\exp(2h_0) + 1}\right) \Pr_P[x_k = 1]\,. \tag{2}$$
Furthermore,
$$\sum_{x : x_0 x_k = -1} Q(x) = \sum_{x : x_0 x_k = -1} \exp(-\delta)\, P_0(x)\, \frac{Z_{P_0}}{Z_Q} \le \exp(-\delta)\, \frac{Z_{P_0}}{Z_Q}\,.$$
Intuitively, note that if (in the above) we let $\delta \to \infty$ and $h_0 \to -\infty$, then $d_{TV}(P_0, Q) = \Pr_P[x_k = 1]$, and an approximation of $d_{TV}(P_0, Q)$ implies an approximation of $\Pr_P[x_k = 1]$. We now make this formal by quantifying $h_0$ and $\delta$ finitely. It would suffice to show that
$$\left| d_{TV}(P_0, Q) - \Pr_P[x_k = 1] \right| \le \frac{\varepsilon}{2} \Pr_P[x_k = 1]\,.$$
This could then be combined with an $(\varepsilon/2)$-multiplicative approximation of $d_{TV}(P_0, Q)$ to yield the desired $\varepsilon$-multiplicative approximation of $\Pr_P[x_k = 1]$. By Equation (2) and the triangle inequality,
$$\left| d_{TV}(P_0, Q) - \Pr_P[x_k = 1] \right| \le \frac{\exp(2h_0)}{\exp(2h_0) + 1} + \exp(-\delta)\frac{Z_{P_0}}{Z_Q} + \frac{2\exp(2h_0)}{\exp(2h_0) + 1} \Pr_P[x_k = 1] \le 3\exp(2h_0) + \exp(-\delta)\frac{Z_{P_0}}{Z_Q}\,,$$
using $\exp(2h_0)/(\exp(2h_0)+1) \le \exp(2h_0)$ and $\Pr_P[x_k = 1] \le 1$. Let $W := \max_{i,j} |w_{i,j}|$ and $H := \max_i |h_i|$. Then
$$\Pr_P[x_k = 1] = \frac{\sum_{x : x_k = 1} \exp\left(\sum_{i,j} w_{i,j} x_i x_j + \sum_i h_i x_i\right)}{\sum_y \exp\left(\sum_{i,j} w_{i,j} y_i y_j + \sum_i h_i y_i\right)} \ge \frac{2^{n-1} \exp\left(-W(n+1)^2 - (n+1)H\right)}{2^{n+1} \exp\left(W(n+1)^2 + (n+1)H\right)} = \frac{\exp\left(-W(n+1)^2 - (n+1)H\right)}{4\exp\left(W(n+1)^2 + (n+1)H\right)}\,.$$
Choosing $-h_0$ and $\delta$ to be $\Omega(\mathrm{poly}(n, H, W, 1/\varepsilon, Z_{P_0}/Z_Q)) = \Omega(\mathrm{poly}(n, H, W, 1/\varepsilon))$ (note that $Z_{P_0}/Z_Q$ can be bounded by some polynomial in n, H, W), we may ensure the desired inequality
$$3\exp(2h_0) + \exp(-\delta)\frac{Z_{P_0}}{Z_Q} \le \frac{\varepsilon}{2} \cdot \frac{\exp\left(-W(n+1)^2 - (n+1)H\right)}{4\exp\left(W(n+1)^2 + (n+1)H\right)} \le \frac{\varepsilon}{2} \Pr_P[x_k = 1]\,.$$
An identical argument can be employed to get a multiplicative approximation of $\Pr_P[x_k = -1]$ (by letting $h_0 \to +\infty$ above). Since $d_{TV}(P_0, Q)$ multiplicatively approximates $\Pr_P[x_k = 1]$, this is an approximation-preserving reduction, as any multiplicative approximation of $d_{TV}(P_0, Q)$ is a multiplicative approximation of $\Pr_P[x_k = 1]$.

4.3 PROOF OF THEOREM 2

To prove Theorem 2 we shall require the following result of Jerrum & Sinclair (1993).

Theorem 8 (Jerrum & Sinclair (1993)). If NP $\ne$ RP, then there is no FPRAS that estimates the partition function of any Ising model.

We now turn to the proof of Theorem 2.

Proof of Theorem 2. By Proposition 6, if there exists some FPRAS that estimates any atomic marginal of any Ising model, then there exists some FPRAS that estimates the partition function of any Ising model. By Proposition 7, if there exists some FPRAS that estimates the TV distance between Ising models, then there exists some FPRAS that estimates any atomic marginal of any Ising model. That is, if there exists some FPRAS that estimates the TV distance between Ising models, then there exists some FPRAS that estimates the partition function of any Ising model. By Theorem 8, if NP $\ne$ RP, then there is no FPRAS that estimates the partition function of any Ising model. That is, if NP $\ne$ RP, then there is no FPRAS that estimates the TV distance between Ising models. This concludes the proof.

5 DISCUSSION

We have shown how to efficiently check whether two mixtures of product distributions are equivalent or not. One might wonder whether our ideas can be applied to check equivalence between mixtures of other, more expressive, kinds of distributions, such as Bayes nets.
Bayes nets (Pearl, 1989) constitute a probabilistic graphical model, with numerous applications in machine learning and inference, that naturally generalizes product distributions in that it allows for dependencies among different variables. Bhattacharyya et al. (2023) showed that it is NP-hard to decide whether the total variation distance between two Bayes nets P and Q is equal to 0 or not, so one cannot hope to extend our methods to testing equivalence between mixtures of Bayes net distributions (unless, of course, P = NP). Can our ideas be extended to testing equivalence between mixtures of some subclass of Bayes net distributions, such as Bayes net distributions whose underlying graph is a tree or has small treewidth?

Our second result, namely the hardness of approximating the TV distance between Ising models, helps us further understand the intricacies of the Ising model and the consequences of complexity-theoretic conjectures such as NP $\ne$ RP. Can we extend this complexity-theoretic hardness of approximation to other classes of probability distributions, such as factor graphs or general undirected probabilistic graphical models?

ACKNOWLEDGEMENTS

DesCartes: This research is supported by the National Research Foundation, Prime Minister's Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme. This work was supported in part by the National Research Foundation Singapore under its NRF Fellowship Programme [NRF-NRFFAI1-2019-0004] and an Amazon Research Award. We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) [funding reference number RGPIN-2024-05956]. The work of AB was supported in part by the National Research Foundation Singapore under its NRF Fellowship Programme (NRF-NRFFAI2019-0002) and an Amazon Faculty Research Award. The work of SG was supported in part by the SERB award CRG/2022/007985. Pavan's work is partly supported by NSF awards 2130536, 2342245, and 2413849. Vinod's work is partly supported by NSF awards 2130608, 2342244, and 2413848.

REFERENCES

Sanjeev Arora and Boaz Barak. Computational Complexity - A Modern Approach. Cambridge University Press, 2009. ISBN 978-0-521-42426-4. URL http://www.cambridge.org/catalogue/catalogue.asp?isbn=9780521424264.

Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S. Meel, and N. V. Vinodchandran. Efficient distance approximation for structured high-dimensional distributions via learning. In Proc. of NeurIPS, 2020.

Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S. Meel, Dimitrios Myrisiotis, A. Pavan, and N. V. Vinodchandran. On approximating total variation distance. In Proceedings of IJCAI, 2023.

Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S. Meel, Dimitrios Myrisiotis, A. Pavan, and N. V. Vinodchandran. Total variation distance meets probabilistic inference. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=6OSLjErBhh.

Barry A. Cipra. An introduction to the Ising model. American Mathematical Monthly, 94:937-959, 1987. URL https://api.semanticscholar.org/CorpusID:10158214.

Corinna Cortes, Mehryar Mohri, and Ashish Rastogi. Lp distance and equivalence of probabilistic automata. Int. J. Found. Comput. Sci., 18(4):761-779, 2007.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (2. ed.). Wiley, 2006.

Peter Dixon, Sutanu Gayen, Aduri Pavan, and N. V. Vinodchandran.
Perfect zero knowledge: New upperbounds and relativized separations. In Rafael Pass and Krzysztof Pietrzak (eds.), Proc. of TCC, volume 12550 of Lecture Notes in Computer Science, pp. 684-704. Springer, 2020.

Cynthia Dwork. Differential privacy. In Michele Bugliesi, Bart Preneel, Vladimiro Sassone, and Ingo Wegener (eds.), Automata, Languages and Programming, 33rd International Colloquium, ICALP 2006, Venice, Italy, July 10-14, 2006, Proceedings, Part II, volume 4052 of Lecture Notes in Computer Science, pp. 1-12. Springer, 2006.

Weiming Feng, Heng Guo, Mark Jerrum, and Jiaheng Wang. A simple polynomial-time approximation algorithm for the total variation distance between two product distributions. TheoretiCS, 2, 2023.

Weiming Feng, Liqiang Liu, and Tianren Liu. On deterministically approximating total variation distance. In Proceedings of SODA, 2024.

Michael E. Fisher. On the dimer solution of planar Ising models. Journal of Mathematical Physics, 7(10):1776-1781, 10 1966. ISSN 0022-2488. doi: 10.1063/1.1704825. URL https://doi.org/10.1063/1.1704825.

Yoav Freund and Yishay Mansour. Estimating a mixture of two product distributions. In Proceedings of COLT, 1999.

Oded Goldreich, Amit Sahai, and Salil P. Vadhan. Can statistical zero knowledge be made non-interactive? or On the relationship of SZK and NISZK. In Proc. of CRYPTO, pp. 467-484, 1999.

Spencer Gordon, Bijan H. Mazaheri, Yuval Rabani, and Leonard J. Schulman. Source identification for mixtures of product distributions. In Conference on Learning Theory, COLT, 2021.

Spencer L. Gordon and Leonard J. Schulman. Hadamard extensions and the identification of mixtures of product distributions. IEEE Trans. Inf. Theory, 68(6):4085-4089, 2022.

Spencer L. Gordon, Erik Jahn, Bijan Mazaheri, Yuval Rabani, and Leonard J. Schulman. Identification of mixtures of discrete product distributions in near-optimal sample and time complexity. CoRR, abs/2309.13993, 2023.

Ernst Ising. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik, 31:253-258, 1925. URL https://api.semanticscholar.org/CorpusID:122157319.

Sorin Istrail. Statistical mechanics, three-dimensionality and NP-completeness: I. Universality of intracatability for the partition function of the Ising model across non-planar surfaces (extended abstract). In Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, STOC '00, pp. 87-96, New York, NY, USA, 2000. Association for Computing Machinery. ISBN 1581131844. doi: 10.1145/335305.335316. URL https://doi.org/10.1145/335305.335316.

Mark Jerrum and Alistair Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM J. Comput., 22(5):1087-1116, 1993.

P. W. Kasteleyn. Dimer statistics and phase transitions. Journal of Mathematical Physics, 4(2):287-293, 02 1963. ISSN 0022-2488. doi: 10.1063/1.1703953. URL https://doi.org/10.1063/1.1703953.

Stefan Kiefer. On computing the total variation distance of hidden Markov models. In Proc. of ICALP, pp. 130:1-130:13, 2018.

W. Lenz. Beitrag zum Verständnis der magnetischen Erscheinungen in festen Körpern. Z. Phys., 21:613-615, 1920. URL http://cds.cern.ch/record/460663.

Rune B. Lyngsø and Christian N. S. Pedersen. The consensus string problem and the complexity of comparing hidden Markov models. J. Comput. Syst. Sci., 65(3):545-569, 2002.

Lior Malka. How to achieve perfect simulation and a complete problem for non-interactive perfect zero-knowledge. J. Cryptol., 28(3):533-550, 2015.
Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.

Judea Pearl. Probabilistic Reasoning in Intelligent Systems - Networks of Plausible Inference. Morgan Kaufmann Series in Representation and Reasoning. Morgan Kaufmann, 1989.

Yash Pote and Kuldeep S. Meel. Testing probabilistic circuits. In Proc. of NeurIPS, 2021.

Amit Sahai and Salil P. Vadhan. A complete problem for statistical zero knowledge. J. ACM, 50(2):196-249, 2003.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning - From Theory to Algorithms. Cambridge University Press, 2014.

Douglas R. Stinson. Cryptography - Theory and Practice. Discrete Mathematics and Its Applications Series. CRC Press, 1995.

Salil P. Vadhan. Pseudorandomness. Found. Trends Theor. Comput. Sci., 7(1-3):1-336, 2012.

Dominic Welsh. Complexity: Knots, Colourings and Counting. London Mathematical Society Lecture Note Series. Cambridge University Press, 1993.

A THE EXPRESSIBILITY OF MIXTURES OF PRODUCTS

We require the following observation.

Observation 9. Let D be a distribution over $\Sigma^n$ for some alphabet $\Sigma$. Then for $k := |\Sigma^n|$ there exists a mixture of product distributions $(w_1, \ldots, w_k, P_1, \ldots, P_k)$ over $\Sigma^n$ such that for all $x \in \Sigma^n$ it is the case that $D(x) = \sum_{i=1}^k w_i P_i(x)$.

Proof. Let $x_1, \ldots, x_k$ be the elements of $\Sigma^n$. For all $1 \le i \le k$, let $P_i$ be a product distribution on n variables such that $P_i(y) = 1$ if $y = x_i$ and $P_i(y) = 0$ otherwise. We can construct such a product distribution $P_i$ as follows. First, note that $P_i(y) = \prod_{j=1}^n p_{i,j}(y_j)$, whereby $p_{i,j}(z)$ is the probability that the j-th coordinate of $P_i$ takes the value $z \in \Sigma$. Let $x_i = (x_{i,1}, \ldots, x_{i,n})$, whereby $x_{i,j} \in \Sigma$. Then for all j define $p_{i,j}(z) := 1$ if $z = x_{i,j}$ and $p_{i,j}(z) := 0$ otherwise. Moreover, let $w_i := D(x_i)$ and note that $\sum_{i=1}^k w_i = 1$. The desired equality then follows by straightforward calculations.
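The construction in this proof is easy to make concrete. The following sketch (our own illustration; the function name is ours) builds the point-mass mixture for a small distribution over $\{0,1\}^2$ and verifies the equality $D(x) = \sum_i w_i P_i(x)$:

```python
# Any distribution D over Sigma^n is a mixture of |Sigma|^n point masses,
# each of which is a (degenerate) product distribution.
import itertools

def as_mixture_of_products(D, sigma, n):
    weights, components = [], []
    for xi in itertools.product(sigma, repeat=n):
        weights.append(D.get(xi, 0.0))                # w_i := D(x_i)
        # p_{i,j}(z) = 1 if z == x_{i,j} else 0: a deterministic marginal.
        components.append([{z: 1.0 if z == s else 0.0 for z in sigma}
                           for s in xi])
    return weights, components

D = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
weights, comps = as_mixture_of_products(D, (0, 1), 2)
for x in itertools.product((0, 1), repeat=2):
    mix = sum(wt * c[0][x[0]] * c[1][x[1]] for wt, c in zip(weights, comps))
    assert abs(mix - D[x]) < 1e-12
```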
B PROOF OF PROPOSITION 6

Let us first assume that there is some FPRAS M that runs in time $t_M$ and estimates any atomic marginal of any Ising model. We will prove that there is some FPRAS that estimates the partition function of any Ising model.

Let $P_1$ be an Ising model over variables $x_1, \ldots, x_n$ with parameters $\{w_{i,j}\}_{i,j}$ and $\{h_i\}_i$, and with partition function $Z_1$. We will show how to estimate $Z_1$. To this end, let us define a new Ising model $P_2$ over variables $x_2, \ldots, x_n$ with parameters $\{w^{(2)}_{i,j}\}_{i,j}$ and $\{h^{(2)}_i\}_i$ so that $w^{(2)}_{i,j} := w_{i,j}$ for all $2 \le i < j \le n$ and $h^{(2)}_i := w_{1,i} + h_i$ for all $2 \le i \le n$. Let $Z_2$ be the partition function of $P_2$. Let also
$$E_1(x) := \exp\left(\sum_{i,j} w_{i,j} x_i x_j + \sum_i h_i x_i\right)\,, \qquad E_2(x) := \exp\left(\sum_{i,j \ne 1} w^{(2)}_{i,j} x_i x_j + \sum_{i \ne 1} h^{(2)}_i x_i\right)$$
for all x, and note that $Z_1 = \sum_x E_1(x)$ and $Z_2 = \sum_x E_2(x)$. We have
$$\sum_{x : x_1 = 1} E_1(x) = \sum_{x : x_1 = 1} \exp\left(\sum_{i,j} w_{i,j} x_i x_j + \sum_i h_i x_i\right) = \sum_{x : x_1 = 1} \exp\left(\sum_{i,j \ne 1} w_{i,j} x_i x_j + \sum_{i \ne 1} w_{1,i} x_i + \sum_{i \ne 1} h_i x_i + h_1\right)$$
$$= \exp(h_1) \sum_x \exp\left(\sum_{i,j \ne 1} w_{i,j} x_i x_j + \sum_{i \ne 1} (w_{1,i} + h_i) x_i\right) = \exp(h_1) \sum_x \exp\left(\sum_{i,j \ne 1} w^{(2)}_{i,j} x_i x_j + \sum_{i \ne 1} h^{(2)}_i x_i\right) = \exp(h_1) \sum_x E_2(x) = \exp(h_1)\, Z_2\,.$$
Moreover,
$$\Pr_{P_1}[x_1 = 1] = \frac{1}{Z_1} \sum_{x : x_1 = 1} E_1(x) \quad \text{or} \quad Z_1 = \frac{1}{\Pr_{P_1}[x_1 = 1]} \sum_{x : x_1 = 1} E_1(x)\,.$$
Combining the above, we have that
$$Z_1 = \frac{1}{\Pr_{P_1}[x_1 = 1]} \sum_{x : x_1 = 1} E_1(x) = \frac{Z_2 \exp(h_1)}{\Pr_{P_1}[x_1 = 1]}\,.$$
This equality yields a natural recursive algorithm for computing $Z_1$, whereby the computation of $Z_1$ is reduced to the computation of $Z_2$ by making use of the algorithm M that estimates the marginal $\Pr_{P_1}[x_1 = 1]$. This continues until we reach in the recursion some Ising model $P_n$ over the single variable $x_n$. At this point the computation of $Z_n$ is trivial, as $Z_n = \exp(h^{(n)}_n) + \exp(-h^{(n)}_n)$.

We will now bound the approximation and confidence errors of this algorithm. Note that
$$Z_1 = \frac{Z_2 \exp(h_1)}{\Pr_{P_1}[x_1 = 1]} = \cdots = \frac{Z_n \exp(h_1) \exp(h^{(2)}_2) \cdots \exp(h^{(n-1)}_{n-1})}{\Pr_{P_1}[x_1 = 1] \Pr_{P_2}[x_2 = 1] \cdots \Pr_{P_{n-1}}[x_{n-1} = 1]}\,,$$
whereby each atomic marginal in the denominator is approximated within a $(1 + \varepsilon_0)$ ratio, for $\varepsilon_0 := \varepsilon/n$, by making use of M. This yields the desired approximation ratio of $(1 + \varepsilon_0)^{n-1} \le (1 + \varepsilon_0)^n$, which is at most $1 + O(n\varepsilon_0) = 1 + O(\varepsilon)$. We argue similarly for the confidence: running M with confidence parameter $\delta_0 := \delta/n$ yields, by a union bound over the $n - 1$ invocations of M, the desired overall confidence of at least $1 - \delta$. That is, the outlined reduction is approximation preserving.

What is left is to bound the running time of this recursive procedure. Let $T(n)$ be the running time in question. We have that $T(n) \le T(n-1) + t_M(n, 1/\varepsilon_0, 1/\delta_0) + O(1)$, so $T(n) = O(n \cdot t_M(n, 1/\varepsilon_0, 1/\delta_0)) = O(\mathrm{poly}(n, 1/\varepsilon_0, 1/\delta_0))$. This concludes the proof.
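The recursion above is easy to render in code. In the following sketch (our own illustration; all names are ours), we stand in for the hypothetical FPRAS M with an exact brute-force marginal computation, so the identity $Z_1 = Z_2 \exp(h_1)/\Pr_{P_1}[x_1 = 1]$ can be verified exactly on a small model:

```python
import itertools, math

def exact_marginal_x1(n, w, h):
    # Pr[x_1 = +1], computed by brute force; this plays the role of M.
    def weight(x):
        return math.exp(sum(wij * x[i] * x[j] for (i, j), wij in w.items())
                        + sum(hi * x[i] for i, hi in h.items()))
    total = top = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        wt = weight(x)
        total += wt
        if x[0] == 1:
            top += wt
    return top / total

def partition_via_marginals(n, w, h):
    if n == 1:                      # base case: Z_n = e^{h_n} + e^{-h_n}
        h1 = h.get(0, 0.0)
        return math.exp(h1) + math.exp(-h1)
    p1 = exact_marginal_x1(n, w, h)
    # Fold variable x_1 into the fields: h_i <- h_i + w_{1,i}, drop x_1.
    w2 = {(i - 1, j - 1): wij for (i, j), wij in w.items() if i >= 1}
    h2 = {i - 1: h.get(i, 0.0) + w.get((0, i), 0.0) for i in range(1, n)}
    return partition_via_marginals(n - 1, w2, h2) * math.exp(h.get(0, 0.0)) / p1

w = {(0, 1): 0.3, (0, 2): -0.2, (1, 2): 0.5}
h = {0: 0.1, 1: -0.4, 2: 0.2}
Z_direct = sum(math.exp(sum(wij * x[i] * x[j] for (i, j), wij in w.items())
                        + sum(hi * x[i] for i, hi in h.items()))
               for x in itertools.product((-1, 1), repeat=3))
print(Z_direct, partition_via_marginals(3, w, h))   # the two values agree
```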