Learnability of Parameter-Bounded Bayes Nets

Arnab Bhattacharyya1*, Davin Choo2*, Sutanu Gayen3*, Dimitrios Myrisiotis4*
1University of Warwick, United Kingdom; 2Harvard University, USA; 3IIT Kanpur, India; 4CNRS@CREATE LTD., Singapore

Bayes nets are extensively used in practice to efficiently represent joint probability distributions over a set of random variables and to capture dependency relations. Prior work has shown that, given a distribution P defined as the marginal distribution of a Bayes net, it is NP-hard to decide whether there is a parameter-bounded Bayes net that represents P; this problem is called LEARN. In this work, we extend the NP-hardness result for LEARN and prove the NP-hardness of a promise search variant of LEARN, whereby the Bayes net in question is guaranteed to exist and one is asked to find such a Bayes net. We complement our hardness result with a positive result about the sample complexity that is sufficient to recover a parameter-bounded Bayes net that is close (in TV distance) to a given distribution P, represented by some parameter-bounded Bayes net, thereby generalizing a degree-bounded sample complexity result from the literature.

1 Introduction

Bayesian networks (Pearl 1988), or simply Bayes nets, are directed acyclic graphs (DAGs), accompanied by a collection of conditional probability distributions (one for each vertex), that are used to represent joint probability distributions over dependent random variables in an elegant and succinct manner. As an example, consider a distribution P over five Boolean variables X = {X1, X2, X3, X4, X5}. Regardless of the dependencies between the variables, P can always be represented by a lookup table that has 2^5 − 1 = 31 entries, with one entry for each possible Boolean outcome except one. However, known dependencies between variables may induce a sparser representation.
If X2 depends on X1, X3 depends on {X2, X5}, and X4 depends on X3, then the joint distribution P(x1, . . . , x5) decomposes as P(x1) · P(x2 | x1) · P(x3 | x2, x5) · P(x4 | x3) · P(x5). In fact, one can represent P with a relatively sparse Bayes net G (see Figure 1) with conditional probability tables (CPTs) associated with each vertex. Observe that 10 < 31 numbers suffice to describe the CPTs: one for each of the Bernoulli distributions of X1 and X5, two each for the conditional probability distributions of X2 and X4, and four for that of X3. In the rest of this work, we refer to the numbers used in defining the CPTs above as parameters.

*Work done while the author was affiliated with the National University of Singapore, Singapore.
Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Left: A Bayes net G such that the distribution P of our example is represented by G. Right: A Bayes net H such that the distribution that arises from P after marginalizing out X3 is represented by H.

It is a standard result (Pearl 1988) that there exists a Bayes net G of p parameters that represents a probability distribution P if and only if P is Markov with respect to some Bayes net H of p parameters that has the same underlying DAG as G. (It could be that G = H, but this is not necessary.) Here, the property of a distribution P being Markov with respect to a Bayes net G means that a certain graphical separation condition in the underlying DAG of G, known as d-separation, implies conditional independence in P (see Section 2.2 for formal definitions). We shall make use of this equivalence in the sequel.

A series of works studied the problem of learning the underlying DAG of a Bayes net from data by maximizing a certain scoring criterion over candidate DAGs; see, e.g., (Cooper and Herskovits 1992; Spiegelhalter et al. 1993; Heckerman, Geiger, and Chickering 1995).
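The factorization above is straightforward to evaluate in code. The following sketch (the variable names and the CPT encoding are our own illustration, not taken from the paper) computes the joint probability of an assignment from per-node CPTs on a small two-node example:

```python
from math import prod

# Each node maps to (parent list, CPT); the CPT maps a tuple of parent
# values to {value: probability}.  This encoding is illustrative only.
net = {
    "X1": ([], {(): {0: 0.6, 1: 0.4}}),
    "X2": (["X1"], {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}}),
}

def joint_probability(net, assignment):
    # P(x) = prod_i P(x_i | pa(X_i)), the Bayes net factorization.
    return prod(
        cpt[tuple(assignment[u] for u in parents)][assignment[x]]
        for x, (parents, cpt) in net.items()
    )

# The joint probabilities over all assignments sum to 1.
total = sum(joint_probability(net, {"X1": a, "X2": b})
            for a in (0, 1) for b in (0, 1))
```

Here each node's CPT stores one distribution per parent configuration, which is exactly where the parameter count of a Bayes net comes from.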
This task was later shown to be NP-hard by Chickering (1996), which raised the following natural and fundamental question: Given a succinct description of a distribution P (that is not in terms of a Bayes net), how easy is it to find a Bayes net G such that P is Markov with respect to G? Unfortunately, Chickering, Heckerman, and Meek (2004) showed that deciding whether or not a given distribution P is Markov with respect to some Bayes net of at most p ∈ N parameters is NP-hard.

The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

Remark 1.1. In (Chickering, Heckerman, and Meek 2004), the distribution P is described as a certain marginal of a Bayes net of small in-degree, i.e., a succinct Bayes net description over variables X, along with a subset of variables S ⊆ X to marginalize out. For example, any distribution P that is Markov with respect to the left Bayes net in Figure 1 is, after marginalizing out X3, Markov with respect to the right Bayes net in Figure 1. Note that all possible distributions over X are Markov with respect to some Bayes net over a clique, but such a Bayes net requires 2^|X| − 1 parameters.

Regarding upper bounds, there are well-known algorithms for learning the underlying DAG of a Bayes net from distributional samples, such as the PC (Spirtes, Glymour, and Scheines 2000) and GES (Chickering 2002) algorithms. Recently, Brustle, Cai, and Daskalakis (2020) also gave finite sample guarantees for learning Bayes nets (that have n nodes, each taking values over an alphabet Σ) from samples. When given the promise that the underlying DAG has in-degree bounded by d, Brustle, Cai, and Daskalakis (2020, Theorem 10) assert that, using

Õ( n · |Σ|^(d+1) · log(n|Σ|) + n · d · log n )   (1)

samples from the underlying distribution P (where Õ hides the dependency on ε and δ), one can learn P up to total variation (TV) distance ε with probability at least 1 − δ. One standard way to measure the complexity of a Bayes net is by imposing an upper bound on any node's in-degree.
In this work, we measure the complexity in terms of the number of parameters p, which is a more fine-grained measure than the maximum in-degree d. For instance, a Bayes net on n Boolean variables with maximum in-degree d could have as few as O(n + 2^d) parameters (e.g., a star graph where d leaves point towards the center of the star) and as many as Ω(n · 2^d) parameters (e.g., a complete graph).

1.1 Our Contributions

Our two main contributions are that we extend the hardness result of Chickering, Heckerman, and Meek (2004) and generalize the finite sample complexity result of Brustle, Cai, and Daskalakis (2020).

Contribution 1. We extend the hardness result of Chickering, Heckerman, and Meek (2004) to the setting where the Bayes net in question is promised to have a small number of parameters. In computational complexity theory, this is known as a promise problem, which generalizes a decision problem in that the input is promised to belong to a certain subset of possible inputs. Our new hardness result confirms the common intuition that it is hard to search for a Bayes net G with respect to which a given probability distribution is Markov, even if it is known that the distribution in question is Markov with respect to a Bayes net that has a small number of parameters.

Definition 1.2 (The REALIZABLE-LEARN problem). Given as input variables X = (X1, . . . , Xn), a probability distribution P on X, a parameter bound p ∈ N, and the promise that there exists a Bayes net G with at most p parameters such that P is Markov with respect to G, output a Bayes net G′ with at most p parameters such that P is Markov with respect to G′.

Theorem 1.3. REALIZABLE-LEARN is NP-hard.

Technically speaking, REALIZABLE-LEARN is a promise search problem. While NP-hardness results usually revolve around decision problems, NP-hardness naturally extends to the more general case of search problems, as follows.
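The gap between the two complexity measures can be checked concretely. The sketch below (function and variable names are ours, not the paper's) counts parameters for the two extreme examples on n Boolean nodes:

```python
def bn_param_count(parents, k=2):
    # A node with in-degree d contributes (k - 1) * k**d parameters,
    # where k is the alphabet size (k = 2 for Boolean variables).
    return sum((k - 1) * k ** len(ps) for ps in parents.values())

n, d = 10, 5

# Star: d leaves all point towards the center node X0.
star = {f"X{i}": [] for i in range(1, n)}
star["X0"] = [f"X{i}" for i in range(1, d + 1)]

# Complete DAG: node Xi has all of X0, ..., X(i-1) as parents.
complete = {f"X{i}": [f"X{j}" for j in range(i)] for i in range(n)}
```

The star has (n − 1) + 2^d parameters while the complete DAG has 2^n − 1, even though both have maximum in-degree at least d.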
A promise search problem P is NP-hard if the existence of a polynomial-time algorithm for P implies the existence of a polynomial-time algorithm for some NP-hard decision problem Q. In our case, Q is about degree-bounded feedback arc sets.

Contribution 2. We generalize the finite sample result of Brustle, Cai, and Daskalakis (2020) from the degree-bounded setting to the parameter-bounded setting.

Theorem 1.4 (Approximating parameter-bounded Bayes nets using samples). Fix any accuracy parameter ε > 0 and confidence parameter δ > 0. Given sample access to a distribution P over n variables, each defined on the alphabet Σ, and the promise that P is Markov with respect to a Bayes net with at most p parameters,

O( (log(1/δ) / ε²) · ( p · log(n|Σ|/ε) + n · log n · log( p / (n(|Σ| − 1)) ) / log |Σ| ) )

samples from P suffice to learn the underlying DAG of a Bayes net G with at most p parameters and to define a distribution Q that is Markov with respect to G such that d_TV(P, Q) ≤ ε, with success probability at least 1 − δ.

Notice that when all in-degrees are at most d, we have p ≤ (|Σ| − 1) · n · |Σ|^d, so our result generalizes the bound of Brustle, Cai, and Daskalakis (2020) given in Equation (1).¹ Finally, we note that while the algorithm behind Theorem 1.4 runs in time polynomial in 1/δ and 1/ε², it has an exponential dependency on the number of samples drawn from P, similar to Brustle, Cai, and Daskalakis (2020).

1.2 Paper Outline

After preliminaries in Section 2, we give a high-level overview of the techniques behind our results in Section 3. We then formally prove Theorem 1.3 and Theorem 1.4 in Section 4 and Section 5, respectively. Finally, we conclude with an open problem in Section 6.

2 Preliminaries

2.1 Notation

We denote the set of natural numbers by N, and all logarithms refer to the natural log. Distributions are written as P, Q and graphs in calligraphic letters, e.g., G, H, K.
For variables/nodes, we use capital letters, small letters for the values they take, and boldface versions for collections of variables, e.g., X = x for a single variable and X = x (boldface) for a collection.¹ As shorthands, we write [n] for {1, . . . , n} and P(x) for P(X = x). We will often identify the set of variables of a distribution with the nodes of a graph. Problems and algorithms are named in the typewriter font in full caps and capitalized, respectively, e.g., PROBLEM and Algorithm. We will also often use Σ to denote the alphabet set of a variable and Δ_|Σ| to denote the corresponding (conditional) probability simplex.

¹One can further generalize Theorem 1.4 to the case where each node has a different alphabet size, e.g., Xi has alphabet Σi, but this is a straightforward extension.

2.2 Graph-Theoretical Notions

Let G = (X, E) be a fully directed graph on |X| = n vertices and |E| edges, where adjacencies are denoted with dashes, e.g., X – Y, and arc directions are denoted with arrows, e.g., X → Y. For any node X ∈ X, we write Pa_G(X) ⊆ X to denote its parents and pa_G(X) to denote the values they take. A degree sequence of a graph G on vertex set X = {X1, . . . , Xn} is a list of degrees d = (d1, . . . , dn) of all vertices in the graph, i.e., vertex Xi has degree di. A graph H = (X, E) is said to realize degree sequence d if the degrees of X in H agree with d. Realizability is defined in a similar fashion for in-degree sequences. The graph G is called a directed acyclic graph (DAG) if it does not contain any directed cycles, and is said to be complete if for every two of its nodes U, V ∈ X there is either an edge V → U or an edge U → V, i.e., the underlying undirected graph is a clique. A vertex Vi on any simple path V1 – . . . – Vk is called a collider if the arcs are such that V(i−1) → Vi ← V(i+1).

A Bayesian network (or Bayes net) G for a set of n variables X1, . . .
, Xn is described by a DAG (X, E) and n corresponding conditional probability tables (CPTs); e.g., the CPT for Xi ∈ X describes P(xi | pa_G(Xi)) for all possible values of xi and pa_G(Xi). The joint distribution P factorizes as P(x) = ∏_{i=1}^{n} P(xi | pa_G(Xi)), and we say that G represents P.

All independence constraints that hold in the joint distribution of a Bayes net that has underlying DAG G are exactly captured by the d-separation criterion (Pearl 1988, Section 3.3.1). Two nodes X, Y ∈ X are said to be d-separated in a DAG G = (X, E) given a set Z ⊆ X \ {X, Y} if and only if there is no Z-active path in G between X and Y; a Z-active path is a simple path Q such that any vertex from Z on Q occurs as a collider and any vertex from X \ Z appears as a non-collider. Two nodes are d-connected if they are not d-separated. It is known that X is d-separated from its non-descendants given its parents (Pearl 1988, Section 3.3.1, Corollary 4).

A probability distribution P is said to be Markov with respect to a DAG G if d-separation in G implies conditional independence in P. Note that any distribution is Markov with respect to the complete DAG, since there are no d-separations implied by this kind of DAG. (Moreover, any Bayes net over a complete DAG requires 2^|X| − 1 parameters to describe.) A probability distribution P is said to be Markov with respect to a Bayes net G if P is Markov with respect to the underlying DAG of G.

2.3 Some Problems of Interest

Chickering, Heckerman, and Meek (2004) introduced a decision problem about learning Bayes nets from data, called LEARN, and proved that LEARN is NP-hard by showing a reduction from the NP-hard decision problem degree-bounded feedback arc set (DBFAS). See Definition 2.1 and Definition 2.2.

Definition 2.1 (The DBFAS decision problem).
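D-separation can be decided by a reachability search over active trails. The following sketch is not from the paper; it is the standard algorithm (in the style of Koller and Friedman), which also implements the general criterion in which a collider is unblocked when it is in Z or has a descendant in Z:

```python
from collections import defaultdict

def d_separated(parents, x, y, z):
    # Reachability over states (node, direction): 'up' means the trail
    # arrived at the node against an arc, 'down' means along an arc.
    children = defaultdict(set)
    for v, ps in parents.items():
        for p in ps:
            children[p].add(v)
    # Ancestors of z (inclusive): a collider on a trail is unblocked
    # exactly when it lies in this set.
    anc, stack = set(), list(z)
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents.get(v, ()))
    visited, reachable = set(), set()
    frontier = [(x, "up")]
    while frontier:
        v, d = frontier.pop()
        if (v, d) in visited:
            continue
        visited.add((v, d))
        if v not in z:
            reachable.add(v)
        if d == "up" and v not in z:
            frontier += [(p, "up") for p in parents.get(v, ())]
            frontier += [(c, "down") for c in children[v]]
        elif d == "down":
            if v not in z:
                frontier += [(c, "down") for c in children[v]]
            if v in anc:  # unblocked collider: trail may turn upward
                frontier += [(p, "up") for p in parents.get(v, ())]
    return y not in reachable
```

For example, in the chain X1 → X2 → X3, conditioning on X2 d-separates X1 from X3, whereas in the collider X1 → X3 ← X2, conditioning on X3 d-connects X1 and X2.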
Given a directed graph G = (X, E) with maximum vertex degree 3 and a positive integer k ≤ |E|, determine whether there is a subset of edges E′ ⊆ E of size |E′| ≤ k such that E′ contains at least one directed edge from every directed cycle in G.

Definition 2.2 (The LEARN decision problem). Given variables X = (X1, . . . , Xn), a probability distribution P over X, and a parameter bound p ∈ N, determine whether there exists a Bayes net G with at most p parameters such that P is Markov with respect to G.

In our work, we focus on the particular LEARN instances (X, P, p) used in (Chickering, Heckerman, and Meek 2004), namely LEARN-DBFAS.

Definition 2.3 (The LEARN-DBFAS decision problem). Let R denote the reduction of Chickering, Heckerman, and Meek (2004) from DBFAS to LEARN. We define LEARN-DBFAS to be the set of instances of LEARN that are in the range of R. That is, for any instance IL of LEARN-DBFAS, there is some instance ID of DBFAS such that R(ID) = IL.

An independence oracle for a distribution P is an oracle that can determine, in constant time, whether or not U ⊥ V | Z for any U, V ∈ X and Z ⊆ X \ {U, V}. We will use the following result by Chickering, Heckerman, and Meek (2004).

Theorem 2.4 ((Chickering, Heckerman, and Meek 2004)). LEARN-DBFAS is NP-hard even when one is given access to an independence oracle.

2.4 Selecting a Close Distribution With Finite Samples

The classic method to select an approximate distribution amongst a set of candidate distributions is the Scheffé tournament of Devroye and Lugosi (2001), which provides a logarithmic dependency on the number of candidates. In our work, we will use the Scheffé-based algorithm of Daskalakis and Kamath (2014),² which, given sample access to an input distribution and explicit access to some candidate distributions, outputs with high probability a candidate distribution that is sufficiently close, in total variation (TV) distance (d_TV), to the input distribution.

Theorem 2.5 ((Daskalakis and Kamath 2014)).
Fix any accuracy parameter ε > 0 and confidence parameter δ > 0. Suppose there is a distribution P over variables X and a collection of explicit distributions Q = {Q1, . . . , Qm}, where each distribution Qi is defined over the same set X, and there exists some Q ∈ Q such that d_TV(P, Q) ≤ ε. Then, there is an algorithm that uses O( (log(1/δ) / ε²) · log m ) samples from P and returns some Q′ ∈ Q such that d_TV(P, Q′) ≤ 10ε, with success probability at least 1 − δ and running time poly(m, 1/δ, 1/ε²).

²Their result is actually more general than what we state here. For instance, they only require sample access to the distributions in Q = {Q1, . . . , Qm}, while our setting is simpler as we have explicit descriptions of each of these distributions.

To curate a set of candidates Q, we rely on the following lemma of Brustle, Cai, and Daskalakis (2020), which states that any distribution Q that approximately agrees with P on the local conditional distribution at each node will be close in TV distance to P on the entire domain.

Lemma 2.6 ((Brustle, Cai, and Daskalakis 2020)). Suppose P and Q are Bayes nets on the same DAG G = (X, E) with n nodes. If d_TV( P(X | Pa_G(X) = σ), Q(X | Pa_G(X) = σ) ) ≤ ε/n for all nodes X ∈ X and all possible parent values σ ∈ Σ^|Pa_G(X)|, then d_TV(P, Q) ≤ ε.

Although there are infinitely many possible distributions, since we are satisfied with an approximately close distribution, one can discretize the space via an ε-net.

Definition 2.7 (ε-nets; (Vershynin 2018)). Fix a metric space (T, d). For any subset K ⊆ T and ε > 0, a subset N ⊆ K is called an ε-net of K if every point in K is within distance ε of some point in N. That is, for all x ∈ K there exists x0 ∈ N such that d(x, x0) ≤ ε. We say that N ε-covers K.

As we shall see in Section 5, the candidate set Q will be created by computing an (ε/n)-net with respect to the TV distance and then applying Lemma 2.6 suitably.

2.5 Other Related Work

We have already referred to some papers that are relevant to our work. We resume this discussion here.
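The TV distance used throughout is simple to compute for explicit finite distributions. A minimal sketch (our own helper, with distributions encoded as dictionaries):

```python
def tv_distance(p, q):
    # d_TV(P, Q) = (1/2) * sum_x |P(x) - Q(x)| over a finite domain,
    # with distributions given as {outcome: probability} dicts.
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)
```

For instance, two distributions with disjoint supports are at TV distance 1, and a distribution is at distance 0 from itself.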
Dasgupta (1999) considers the task of learning the maximum-likelihood polytree from data. The main result of that paper is that the optimal branching (or Chow–Liu tree) is a good approximation to the best polytree. This result is complemented by the observation that this learning problem is NP-hard, even to solve approximately within some constant factor.

Teyssier and Koller (2005) propose a simple heuristic method for addressing the task of learning Bayes nets. Their approach is based on the fact that the best network (of bounded in-degree) consistent with a given node ordering can be found efficiently.

Elidan and Gould (2008) present a method for learning Bayes nets of bounded treewidth that employs global structure modifications and is polynomial both in the size of the graph and the treewidth bound.

Friedman, Nachman, and Pe'er (2013) introduce an algorithm that achieves learning by restricting the search space. Their iterative algorithm restricts the parents of each variable to belong to a small subset of candidates. They then search for a network that satisfies these constraints, and the learned network is used to select better candidates for the next iteration.

Ganian and Korchemna (2021) investigate the parameterized complexity of Bayesian Network Structure Learning (BNSL). They show that parameterizing BNSL by the size of a feedback edge set yields fixed-parameter tractability. In a similar vein, Grüttemeier and Komusiewicz (2020) study the fixed-parameter tractability of this problem with respect to the number of edges.

3 Technical Overview

Here, we give a brief high-level overview of the techniques behind our results of Theorem 1.3 and Theorem 1.4.

3.1 NP-Hardness of the Realizable Case

By Theorem 2.4, it suffices to prove that the existence of a polynomial-time algorithm for REALIZABLE-LEARN implies that LEARN-DBFAS instances can be solved in polynomial time when one has access to an independence oracle.
The desired result will then follow from the facts that we can efficiently (a) compute the number of parameters of a Bayes net and (b) decide whether a given distribution is Markov with respect to a given Bayes net (when given access to an independence oracle).

Suppose we have a polynomial-time algorithm Learner for REALIZABLE-LEARN. Note that it is conventional to assume that such an algorithm always halts within some polynomial time bound and outputs some Bayes net, even when the respective promise is violated (Goldreich 2006). Indeed, if there is an algorithm A that solves a promise problem P and runs in time T, then whenever A attempts to run for more than T steps, we can terminate it and report as its output an arbitrary Bayes net.

We define and analyze the following reduction: Given an arbitrary instance (X, P, p) of LEARN-DBFAS, run Learner to obtain a Bayes net G. Then check whether G has at most p parameters and (using an independence oracle) check whether P is Markov with respect to G. If both of these checks are positive, then output YES. Otherwise, output NO. See Section 4 for the formal proof of Theorem 1.3.

3.2 Approximately Learning Parameter-Bounded Bayes Networks

The main idea is to construct an ε-net over all distributions that are Markov with respect to some DAG satisfying the parameter upper bound p, and then apply a well-known bound from the density estimation literature. For this purpose, we need to count all possible Bayes nets that satisfy the parameter upper bound p. By a counting argument, we see that there are not many possible DAGs that give rise to some Bayes net of at most p parameters. Then, by another counting argument, we see that only a few conditional distributions are needed to cover all conditional distributions of Bayes nets over a DAG that realizes a given in-degree sequence. Thus we are able to bound the number of distributions needed to cover all possible distributions that are Markov with respect to such a Bayes net G.
See Section 5 for the formal proof of Theorem 1.4.

4 REALIZABLE-LEARN is NP-hard

To show that REALIZABLE-LEARN is hard, we reduce LEARN-DBFAS to REALIZABLE-LEARN, making polynomially many calls to an independence oracle. Given any polynomial-time algorithm Learner that solves REALIZABLE-LEARN, we forward the LEARN-DBFAS instance to Learner and examine the produced Bayes net G. We describe a polynomial-time procedure Reduction that uses an independence oracle to determine whether we should correctly output YES or NO for the given LEARN-DBFAS instance. See Figure 2 for a pictorial illustration of our reduction strategy.

Figure 2: A DBFAS instance is mapped to a LEARN-DBFAS instance (X, P, p), which is passed, together with the additional promise, to Learner as a REALIZABLE-LEARN instance; the produced Bayes net G is then checked in order to output YES or NO. In this work, we show that if one can learn such a Bayes net G (via some blackbox polynomial-time algorithm Learner), then there is a polynomial-time algorithm Reduction that correctly answers LEARN-DBFAS. Therefore, REALIZABLE-LEARN is also NP-hard.

We begin with the following lemma.

Lemma 4.1. Given a Bayes net over G = (X, E), one can compute the number of its parameters in polynomial time.

Proof. Let ΣX denote the alphabet set of X ∈ X. Then, the number of parameters of G is ∑_{X∈X} (|ΣX| − 1) · ∏_{U∈Pa_G(X)} |ΣU|, which can be computed in polynomial time.

We require the following notion.

Definition 4.2 (Important edges). Let G = (V, E) be a DAG and P be a distribution over V. Then, an edge e ∈ E is called (G, P)-important if P is Markov with respect to G but is not Markov with respect to G′ = (V, E \ {e}).

To check whether P is Markov with respect to G, one could verify that every d-separation in G implies conditional independence in P. However, this computation seems to be intractable. In contrast, Corollary 4.5 gives a polynomial-time algorithm that checks this using an independence oracle. The correctness of Corollary 4.5 follows from Lemma 4.3 and Lemma 4.4.

Lemma 4.3.
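The formula in Lemma 4.1 is direct to implement. A minimal sketch (names are ours; it assumes the introduction's reading that each node with parent configurations c and alphabet size k contributes (k − 1) · c parameters):

```python
from math import prod

def num_parameters(parents, alphabet_size):
    # Lemma 4.1: sum over nodes X of (|Sigma_X| - 1) times the product
    # of the parent alphabet sizes -- computable in polynomial time.
    return sum(
        (alphabet_size[x] - 1) * prod(alphabet_size[u] for u in ps)
        for x, ps in parents.items()
    )

# The five-node example from the introduction (all alphabets Boolean):
parents = {"X1": [], "X2": ["X1"], "X3": ["X2", "X5"], "X4": ["X3"], "X5": []}
sizes = {v: 2 for v in parents}
```

On the introductory example this gives 1 + 2 + 4 + 2 + 1 = 10 parameters.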
Suppose P on variables X is Markov with respect to G = (X, E). Then, an edge A → B in E is not (G, P)-important if A ⊥ B | Pa_G(B) \ {A}.

Proof. Consider an arbitrary edge A → B in E such that A ⊥ B | Pa_G(B) \ {A}. Say A = Xj and B = Xk. Letting G′ = (X, E \ {(A, B)}) be the subgraph of G that does not contain the edge A → B, we see that

P(x) = ∏_{i=1}^{n} P(xi | pa_G(Xi))   (*)
     = P(xk | pa_G(Xk)) · ∏_{i∈[n]\{k}} P(xi | pa_G(Xi))
     = P(xk | pa_G(Xk) \ {xj}) · ∏_{i∈[n]\{k}} P(xi | pa_G(Xi))   (**)
     = P(xk | pa_G′(Xk)) · ∏_{i∈[n]\{k}} P(xi | pa_G′(Xi))   (***)
     = ∏_{i=1}^{n} P(xi | pa_G′(Xi)),

where (*) is due to P being Markov with respect to G, (**) is due to Xj ⊥ Xk | Pa_G(Xk) \ {Xj}, and (***) is due to the definition of G′ (only the parent set of Xk changes). Since P(x) = ∏_{i=1}^{n} P(xi | pa_G′(Xi)), we see that P is also Markov with respect to G′, and so the edge A → B is not (G, P)-important.

Lemma 4.4. Suppose a distribution P on variables X is Markov with respect to a DAG G = (X, E). Let G′ = (X, E′) be an edge-induced DAG of G with E′ ⊆ E. Then, P is Markov with respect to G′ if and only if A ⊥ B | Pa_G(B) \ {A} for all edges A → B in E \ E′.

Proof. We prove each direction separately.

(⇒) Suppose that P is Markov with respect to G′. Consider an arbitrary edge A → B ∈ E \ E′. Since A → B is an edge of the DAG G, A is a non-descendant of B in G, and A remains a non-descendant of B in G′ after removing the edge A → B. So, A and B are d-separated in G′ given Pa_G(B) \ {A}, and so A ⊥ B | Pa_G(B) \ {A} by the Markov property.

(⇐) Suppose that A ⊥ B | Pa_G(B) \ {A} for all edges A → B in E \ E′. Order the edges in E \ E′ in an arbitrary sequence, say e1, . . . , e_|E\E′|. Let us remove these edges sequentially, resulting in a sequence of edge-induced DAGs G = G0, G1, . . . , G_|E\E′| = G′, where Gi is the edge-induced DAG obtained by removing the edges {e1, . . . , ei} from G. Observe that non-descendant relationships are preserved as we remove edges, i.e., if A is a non-descendant of B in G, then it is also a non-descendant of B in Gi for any i ∈ {1, . . . , |E \ E′|}.
So, we can apply Lemma 4.3 repeatedly: for any i ∈ {1, . . . , |E \ E′|}, the edge ei is not (G(i−1), P)-important, so P is Markov with respect to Gi. That is, P is Markov with respect to G′.

Corollary 4.5. Suppose P is a distribution over X and G is a Bayes net over the same set of variables X. Then, there is a polynomial-time algorithm that uses an independence oracle for P to decide whether or not P is Markov with respect to G.

Proof. Consider the following algorithm Checker: Given the Bayes net G over a DAG (X, E′), consider a complete DAG K = (X, E) with E′ ⊆ E, i.e., a complete supergraph of (X, E′). If every edge A → B ∈ E \ E′ satisfies A ⊥ B | Pa_K(B) \ {A}, output YES. Otherwise, output NO. Note that, since any distribution is Markov with respect to the complete DAG (see Section 2.2), P is Markov with respect to K. The correctness of Checker then follows from Lemma 4.4. Checker runs in polynomial time: K can be created in time polynomial in the size of X, and the number of edges in E \ E′ to check is polynomial in the size of X.

We are now ready to formally prove our first main result.

Proof of Theorem 1.3. It suffices to show that the existence of a polynomial-time algorithm for REALIZABLE-LEARN implies that LEARN-DBFAS instances can be answered in polynomial time with access to an independence oracle; see Figure 2. Suppose we have a polynomial-time algorithm Learner for REALIZABLE-LEARN. Let us define a reduction algorithm Reduction as follows: Given an instance (X, P, p) of LEARN-DBFAS, run Learner to obtain a Bayes net G (see Section 3.1; this is a natural assumption for algorithms solving a promise search problem). Compute the number of parameters of G. Run the algorithm Checker of Corollary 4.5 on G to check whether or not P is Markov with respect to G. Output YES if G has at most p parameters and P is Markov with respect to G; else, output NO.
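Checker can be sketched as follows. This is our own illustration, not the paper's pseudocode; it assumes the complete supergraph K is built along a topological order of G's DAG (so that G's edges are a subset of K's), and models the independence oracle as a callable:

```python
def checker(parents, order, indep):
    # Checker (Corollary 4.5): embed the DAG G (given by `parents`)
    # into a complete DAG K along a topological order `order` of G,
    # then query the oracle for every edge of K that is missing from G.
    # `indep(a, b, s)` answers whether A _|_ B | S holds in P.
    for j, b in enumerate(order):
        pa_k = set(order[:j])  # parents of b in the complete DAG K
        for a in order[:j]:
            if a in parents.get(b, ()):
                continue  # edge a -> b is already present in G
            if not indep(a, b, frozenset(pa_k - {a})):
                return False  # P is not Markov with respect to G
    return True
```

As a sanity check, for a product distribution every conditional independence holds, so an always-true oracle certifies that such a P is Markov with respect to the empty graph, while an oracle denying A ⊥ B rejects any graph missing the corresponding edge.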
The correctness of Reduction follows from the assumption that Learner produces a Bayes net G with at most p parameters such that P is Markov with respect to G (if the underlying promise is satisfied; otherwise, the output is an arbitrary Bayes net), and from the correctness of Checker. By assumption, Learner is a polynomial-time algorithm. By Lemma 4.1, we can compute the number of parameters of G in polynomial time. By Corollary 4.5, Checker is also a polynomial-time algorithm. Therefore, the overall running time of Reduction is polynomial.

5 Approximating Bayes Nets

Our strategy for proving our finite sample complexity result (Theorem 1.4) follows that of Brustle, Cai, and Daskalakis (2020, Theorem 10), but we specialize the analysis to the setting where we are given a parameter bound instead of a degree bound. As discussed in Section 1, our result is a generalization of their result since an upper bound on the in-degrees implies a (possibly loose) parameter upper bound.

5.1 Some Graph Counting Arguments

To prove Theorem 1.4, we require an upper bound on the number of possible Bayes nets on n nodes that have at most p parameters (Lemma 5.2). To obtain such a result, we first relate the number of parameters p to a given in-degree sequence (d1, . . . , dn) of a Bayes net; then we upper bound the total number of Bayes nets that have at most p parameters by summing over all suitable in-degree sequences d = (d1, . . . , dn).

Consider an arbitrary Bayes net G with in-degree sequence (d1, . . . , dn), with each node taking on |Σ| values. The conditional distribution of vertex Xi is fully described once we know P(xi | pa_G(Xi)) for |Σ| − 1 possible values of xi and each of the |Σ|^di possible values of pa_G(Xi). Therefore, the Bayes net has ∑_{i=1}^{n} (|Σ| − 1) · |Σ|^di = (|Σ| − 1) · ∑_{i=1}^{n} |Σ|^di parameters. Note that this is the exact same reasoning as in Lemma 4.1. So, if the Bayes net has at most p parameters, then

∑_{i=1}^{n} |Σ|^di = |Σ|^d1 + . . . + |Σ|^dn ≤ p / (|Σ| − 1).
(2)

By the AM-GM inequality, we have that

∑_{i=1}^{n} |Σ|^di ≥ n · ( ∏_{i=1}^{n} |Σ|^di )^(1/n) = n · |Σ|^((1/n) ∑_{i=1}^{n} di).   (3)

Combining Equations (2) and (3) gives us

d1 + . . . + dn ≤ n · log( p / (n(|Σ| − 1)) ) / log |Σ|.   (4)

The following lemma is a combinatorial fact upper bounding the number of graphs that realize a given degree sequence; it may be of independent interest beyond its use in proving Lemma 5.2.

Lemma 5.1. Given an in-degree sequence d = (d1, . . . , dn) of non-negative integers, there are at most ∏_{i=1}^{n} C(n−1, di) DAGs that realize d.

Proof. Fix an arbitrary labelling of the vertices from X1 to Xn and consider the sequential process of adding incoming edges to X1, . . . , Xn. For X1, there are C(n−1, d1) ways to add d1 incoming edges that end at X1. For X2, there are at most C(n−1, d2) possibilities. For X3, there are at most C(n−1, d3) possibilities, and so on. Note that some of these choices are incompatible with earlier edge choices, as the newly added edges may create directed cycles; this is why we only obtain an upper bound. We repeat this edge-adding process until all vertices have their incoming edges added to the graph. So, the upper bound is ∏_{i=1}^{n} C(n−1, di).

Lemma 5.2. Suppose that every node takes on at most |Σ| values. Then, there are at most

(n−1)^( n · log( p / (n(|Σ|−1)) ) / log |Σ| ) · ( e · ( log( p / (n(|Σ|−1)) ) / log |Σ| + 1 ) )^n

possible DAGs over n nodes that may be used to define some Bayes net that has at most p parameters.

Proof. By Lemma 5.1, there are at most ∏_{i=1}^{n} C(n−1, di) possible DAGs realizing any fixed in-degree sequence d = (d1, . . . , dn). Let (*) denote the condition that an in-degree sequence d yields a graph that admits a Bayes net with at most p parameters. Then,

∑_{d satisfies (*)} ∏_{i=1}^{n} C(n−1, di)
  ≤ ∑_{d satisfies (*)} (n−1)^(d1+...+dn)
  ≤ ∑_{d satisfies (*)} (n−1)^( n · log( p / (n(|Σ|−1)) ) / log |Σ| )
  ≤ C( n · log( p / (n(|Σ|−1)) ) / log |Σ| + n, n ) · (n−1)^( n · log( p / (n(|Σ|−1)) ) / log |Σ| )
  ≤ ( e · ( log( p / (n(|Σ|−1)) ) / log |Σ| + 1 ) )^n · (n−1)^( n · log( p / (n(|Σ|−1)) ) / log |Σ| ),

where the first and last inequalities hold because C(n, k) ≤ (en/k)^k ≤ n^k, the second inequality is due to Equation (4), and the third inequality is obtained via a standard stars-and-bars count.
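The stars-and-bars step can be sanity-checked by brute force (a small verification of ours, not part of the paper): the number of non-negative in-degree sequences (d1, . . . , dn) with sum at most s equals the number of solutions of d0 + d1 + . . . + dn = s with a slack variable d0, which is C(s + n, n).

```python
from itertools import product
from math import comb

def count_bounded_sequences(n, s):
    # Brute force: in-degree sequences (d1, ..., dn) of non-negative
    # integers with d1 + ... + dn <= s.
    return sum(1 for d in product(range(s + 1), repeat=n) if sum(d) <= s)
```

For example, with n = 3 and s = 4 there are C(7, 3) = 35 such sequences.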
That is, we introduce an auxiliary slack variable d0 and count the number of non-negative integer solutions of

d0 + d1 + . . . + dn = ⌊ n · log( p / (n(|Σ|−1)) ) / log |Σ| ⌋.

5.2 Proof of Theorem 1.4

We are now ready to prove Theorem 1.4. Lemma 2.6 tells us that it suffices to approximate the local conditional distribution at each node well. So, we will consider an (ε/n)-net over all such local distributions and then apply a tournament-style argument (Theorem 2.5) to pick a good candidate amongst the joint distributions obtained by combining such candidate local distributions.

Proof of Theorem 1.4. Fix a DAG G realizing an arbitrary in-degree sequence d = (d1, . . . , dn). Then, there are |Σ|^d1 + . . . + |Σ|^dn local conditional distributions for any Bayes net over G. From Equation (2) above, we know that

∑_{i=1}^{n} |Σ|^di = |Σ|^d1 + . . . + |Σ|^dn ≤ p / (|Σ| − 1).

Now, consider an arbitrary local distribution over k = |Σ| values, and let us upper bound the number of points in an (ε/n)-net for this metric space. Observe that each possible distribution is essentially an element of the probability simplex Δk. To get an (ε/n)-net of Δk, we discretize vectors by rounding each coordinate to its nearest multiple of ε/(n|Σ|). If π is a probability vector and rπ is its rounding, then ‖π − rπ‖1 ≤ (ε/(n|Σ|)) · |Σ| = ε/n. Therefore, the number of discretized vectors is at most O( (n|Σ|/ε)^|Σ| ), and for any fixed DAG G, there is a set of

m1 ≤ O( (n|Σ|/ε)^( p|Σ| / (|Σ|−1) ) )   (5)

distributions that (ε/n)-cover all possible joint distributions that can be Markov with respect to a Bayes net over G. Meanwhile, by Lemma 5.2, there are at most

(n−1)^( n · log( p / (n(|Σ|−1)) ) / log |Σ| ) · ( e · ( log( p / (n(|Σ|−1)) ) / log |Σ| + 1 ) )^n   (6)

possible DAGs that may be used to define a Bayes net on n nodes that has at most p parameters. Let m2 denote the quantity in Equation (6). We can now define a set of distributions Q over n variables such that there exists Q ∈ Q with d_TV(P, Q) ≤ ε. Let us denote m = |Q|.
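The discretization of the simplex can be made explicit. The sketch below (our own illustration) enumerates all probability vectors over k outcomes whose coordinates are multiples of 1/N; by stars and bars, this grid has C(N + k − 1, k − 1) points:

```python
from math import comb

def simplex_grid(k, N):
    # All probability vectors over k outcomes whose coordinates are
    # multiples of 1/N: the kind of discretization used for the net.
    def rec(k, rem):
        if k == 1:
            return [(rem,)]
        return [(h,) + t for h in range(rem + 1) for t in rec(k - 1, rem - h)]
    return [tuple(c / N for c in v) for v in rec(k, N)]

grid = simplex_grid(3, 10)
```

Any distribution on the grid spacing, such as (0.5, 0.3, 0.2) with N = 10, is covered exactly; a general distribution is within L1 distance proportional to k/N of the grid.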
Putting together the above bounds, we see that $m = m_1 \cdot m_2$ candidates suffice, where $m_1$ and $m_2$ are from Equations (5) and (6). Therefore, with

$$O\left( \frac{1}{\varepsilon^2} \left( \log m + \log \frac{1}{\delta} \right) \right) = O\left( \frac{1}{\varepsilon^2} \left( p \log \frac{n |\Sigma|}{\varepsilon} + n \left( \frac{\log \frac{p}{n(|\Sigma| - 1)}}{\log |\Sigma|} + 1 \right) \log \left( n \log \frac{p}{n(|\Sigma| - 1)} \right) + \log \frac{1}{\delta} \right) \right)$$

samples from $P$, Theorem 2.5 chooses a distribution $Q$ amongst the $m$ candidates such that $d_{\mathrm{TV}}(P, Q) \le \varepsilon$, with success probability at least $1 - \delta$.

6 Conclusion

In this work, we showed that it is NP-hard to find a parameter-bounded Bayes net that represents some distribution $P$, given sample access to $P$, even under the promise that such a Bayes net exists. On a positive note, we gave a finite sample complexity bound that suffices to produce a Bayes net representing a probability distribution $Q$ that is close in TV distance to $P$. Our results generalize earlier known results of Chickering, Heckerman, and Meek (2004) and Brustle, Cai, and Daskalakis (2020), respectively.

An intriguing open question is as follows: Suppose we are given sample access to a distribution $P$ and are promised that there exists a Bayes net on a DAG $G$ with at most $p$ parameters such that $P$ is Markov with respect to $G$. Is it hard to find a Bayes net on a DAG $G'$ that has at most $\alpha p$ parameters such that $P$ is Markov with respect to $G'$ (where $G'$ need not equal $G$), for some constant $\alpha > 1$? Note that the hardness construction of Chickering, Heckerman, and Meek (2004) only displayed an additive gap in the parameter bound. We conjecture that it is also hard to obtain such a multiplicative gap in the parameter bound, even in the promise setting.

Acknowledgements

We would like to thank Dimitris Zoros for helpful discussions. DesCartes: This research is supported by the National Research Foundation, Prime Minister's Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme. This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-PhD/2021-08-013).
The work of AB was supported in part by National Research Foundation Singapore under its NRF Fellowship Programme (NRF-NRFFAI-2019-0002) and an Amazon Faculty Research Award. SG's work was partially supported by the SERB CRG Award CRG/2022/007985 and an IIT Kanpur initiation grant. Part of this work was done while the authors were visiting the Simons Institute for the Theory of Computing.

References

Brustle, J.; Cai, Y.; and Daskalakis, C. 2020. Multi-item mechanisms without item-independence: Learnability via robustness. In Proceedings of the 21st ACM Conference on Economics and Computation, 715–761.

Chickering, D. M. 1996. Learning Bayesian networks is NP-complete. Learning from Data: Artificial Intelligence and Statistics V, 121–130.

Chickering, D. M. 2002. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov): 507–554.

Chickering, M.; Heckerman, D.; and Meek, C. 2004. Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research, 5: 1287–1330.

Cooper, G. F.; and Herskovits, E. 1992. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9: 309–347.

Dasgupta, S. 1999. Learning Polytrees. In Laskey, K. B.; and Prade, H., eds., UAI '99: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden, July 30 - August 1, 1999, 134–141. Morgan Kaufmann.

Daskalakis, C.; and Kamath, G. 2014. Faster and sample near-optimal algorithms for proper learning mixtures of Gaussians. In Conference on Learning Theory, 1183–1213. PMLR.

Devroye, L.; and Lugosi, G. 2001. Combinatorial methods in density estimation. Springer Science & Business Media.

Elidan, G.; and Gould, S. 2008. Learning Bounded Treewidth Bayesian Networks.
In Koller, D.; Schuurmans, D.; Bengio, Y.; and Bottou, L., eds., Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, 417–424. Curran Associates, Inc.

Friedman, N.; Nachman, I.; and Pe'er, D. 2013. Learning Bayesian Network Structure from Massive Datasets: The Sparse Candidate Algorithm. CoRR, abs/1301.6696.

Ganian, R.; and Korchemna, V. 2021. The Complexity of Bayesian Network Learning: Revisiting the Superstructure. In Ranzato, M.; Beygelzimer, A.; Dauphin, Y. N.; Liang, P.; and Vaughan, J. W., eds., Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, 430–442.

Goldreich, O. 2006. On Promise Problems: A Survey. In Goldreich, O.; Rosenberg, A. L.; and Selman, A. L., eds., Theoretical Computer Science, Essays in Memory of Shimon Even, volume 3895 of Lecture Notes in Computer Science, 254–290. Springer.

Grüttemeier, N.; and Komusiewicz, C. 2020. Learning Bayesian Networks Under Sparsity Constraints: A Parameterized Complexity Analysis. In Bessiere, C., ed., Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, 4245–4251. International Joint Conferences on Artificial Intelligence Organization. Main track.

Heckerman, D.; Geiger, D.; and Chickering, D. M. 1995. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Mach. Learn., 20(3): 197–243.

Pearl, J. 1988. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann.

Spiegelhalter, D. J.; Dawid, A. P.; Lauritzen, S. L.; and Cowell, R. G. 1993. Bayesian analysis in expert systems. Statistical Science, 219–247.

Spirtes, P.; Glymour, C. N.; and Scheines, R. 2000. Causation, prediction, and search. MIT Press.

Teyssier, M.; and Koller, D. 2005.
Ordering-Based Search: A Simple and Effective Algorithm for Learning Bayesian Networks. In UAI '05, Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence, Edinburgh, Scotland, July 26-29, 2005, 548–549. AUAI Press.

Vershynin, R. 2018. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press.