# Consistency of Neural Causal Partial Identification

**Jiyuan Tan** (Management Science and Engineering, Stanford University, Stanford, CA 94305; jiyuantan@stanford.edu), **Jose Blanchet** (Management Science and Engineering, Stanford University, Stanford, CA 94305; jose.blanchet@stanford.edu), **Vasilis Syrgkanis** (Management Science and Engineering, Stanford University, Stanford, CA 94305; vsyrgk@stanford.edu)

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

## Abstract

Recent progress in Neural Causal Models (NCMs) showcased how identification and partial identification of causal effects can be automatically carried out via training of neural generative models that respect the constraints encoded in a given causal graph [52, 3]. However, formal consistency of these methods has only been proven for the case of discrete variables or for linear causal models. In this work, we prove the consistency of partial identification via NCMs in a general setting with both continuous and categorical variables. Further, our results highlight the impact of the design of the underlying neural network architecture in terms of depth and connectivity, as well as the importance of applying Lipschitz regularization in the training phase. In particular, we provide a counterexample showing that without Lipschitz regularization this method may not be asymptotically consistent. Our results are enabled by new results on the approximability of Structural Causal Models (SCMs) via neural generative models, together with an analysis of the sample complexity of the resulting architectures and how that translates into an error in the constrained optimization problem that defines the partial identification bounds.

## 1 Introduction

Identifying causal quantities from observational data is an important problem in causal inference which has wide applications in economics [1], social science [20], health care [34, 18], and recommendation systems [9]. One common approach is to transform causal quantities into statistical quantities using the ID algorithm [49] and deploy general-purpose methods to estimate the statistical quantity. However, in the presence of unobserved confounding, typically the causal quantity of interest will not be point-identified by observational data, unless special mechanisms are present in the data generating process (e.g., instruments, unconfounded mediators, proxy controls). In the presence of unobserved confounding, the ID algorithm will typically fail to return a statistical quantity and declare the causal quantity as non-identifiable. One remedy to this problem, which we focus on in this paper, is partial identification, which aims to give informative bounds for causal quantities based on the available data.

At a high level, partial identification bounds can be defined as follows: find the maximum and the minimum value that a target causal quantity can take, among all Structural Causal Models (SCMs) that give rise to the same observed data distribution and respect the given causal graph (as well as any other structural constraints that one is willing to impose). Note that in the presence of unobserved confounding, there will typically exist many structural mechanisms that could give rise to the same observational distribution but have vastly different counterfactual distributions.
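As a concrete toy illustration of this point (a hypothetical example, not taken from the paper), the following sketch builds two SCMs over a binary treatment T and outcome Y that induce exactly the same observational distribution of (T, Y) but have different average treatment effects, so no amount of observational data alone can distinguish them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# SCM A (no confounding): T ~ Bern(0.5), Y := T.
def sample_A(n):
    t = rng.integers(0, 2, n)
    return t, t

# SCM B (pure confounding): U ~ Bern(0.5), T := U, Y := U (Y ignores T entirely).
def sample_B(n):
    u = rng.integers(0, 2, n)
    return u, u

# Both models generate (T, Y) = (0, 0) or (1, 1) with probability 1/2 each,
# so their observational distributions coincide.
for name, sampler in [("A", sample_A), ("B", sample_B)]:
    t, y = sampler(n)
    print(name, "P(T=1) ~", t.mean(), " P(Y=T) ~", (y == t).mean())

# Interventional quantities differ: under do(T=t), SCM A gives Y(t) = t, so ATE_A = 1,
# while in SCM B the outcome equation ignores T, so Y(1) - Y(0) = U - U = 0.
print("ATE under SCM A:", 1, " ATE under SCM B:", 0)
```

Any method that only matches the observational distribution can therefore at best return the interval spanned by such observationally equivalent models, which is exactly what the optimization formulation below makes precise.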
Hence, partial identification can be formulated as solving a max and a min optimization problem [3]:

$$\max_{\mathcal{M} \in \mathcal{C}} \big/ \min_{\mathcal{M} \in \mathcal{C}} \ \theta(\mathcal{M}), \quad \text{subject to } P^{\mathcal{M}}(V) = P^{\mathcal{M}^*}(V) \ \text{ and } \ \mathcal{G}_{\mathcal{M}} = \mathcal{G}_{\mathcal{M}^*}, \tag{1}$$

where $\theta(\mathcal{M})$ is the causal quantity of interest, $\mathcal{M}^*$ is the true model, $V$ is the set of observed nodes, $P^{\mathcal{M}}(V)$ is the distribution of $V$ in SCM $\mathcal{M}$, $\mathcal{C}$ is a collection of causal models, and $\mathcal{G}_{\mathcal{M}}$ is the causal graph of $\mathcal{M}$ (see Section 2 for formal definitions).

Two recent lines of work explore this optimization approach to partial identification. The first line deals with discrete Structural Causal Models (SCMs), where all observed variables are finitely supported. In this case, (1) becomes a Linear Programming (LP) or polynomial programming problem and tight bounds can be obtained [37, 4, 5, 27, 44, 45, 40, 8, 54, 56, 14]. The second line of work focuses on continuous models and explores ways of solving (1) in continuous settings using various techniques [24, 32, 29, 38, 3, 41].

Recently, Xia et al. [51] formalized the connection between SCMs and generative models (see also [33] for an earlier version of a special case of this connection). This work showcased that SCMs can be interpreted as neural generative models, namely Neural Causal Models (NCMs), that follow a particular architecture respecting the constraints encoded in a given causal graph. Hence, counterfactual quantities of SCMs can be learned by optimizing over the parameters of the underlying generative models. However, there can be multiple models that lead to the same observed data distribution yet have different counterfactual distributions. Xia et al. [51] first analyze the approximation power of NCMs for discrete SCMs and employ the max/min approach to verify the identifiability of causal quantities, without the need to employ the ID algorithm. Balazadeh et al. [3] and Hu et al. [30] extend the method in [51] and re-purpose it to perform partial identification by solving the min and max problems in the partial identification formulation over neural causal models.

However, for SCMs with general random variables and functional relationships, the approximation error and consistency of this optimization-based approach to partial identification via NCMs have not been established. In particular, two problems remain open. First, given an arbitrary SCM, it is not yet known whether we can find an NCM that produces approximately the same interventional distribution as the original one. Although Xia et al. [51] show it is possible to represent any discrete SCM by an NCM, their construction relies heavily on the discreteness assumption and cannot be directly generalized to the general case. Moreover, Xia et al. [51] use step functions as the activation function in their construction, which may create difficulties in the training process since step functions are discontinuous with zero gradient almost everywhere. Second, since we only have access to $n$ samples from the true distribution $P^{\mathcal{M}^*}(V)$, we need to replace the constraints in (1) with an empirical version that uses the empirical distribution $P_n^{\mathcal{M}^*}(V)$ of the samples in place of the population distribution and looks for NCMs whose implied distribution lies within a small distance of this empirical distribution. Moreover, even the NCM distribution is typically only accessible through sampling; hence we will need to generate $m_n$ samples from the NCM and use their empirical distribution in place of the true distribution implied by the NCM.
Thus, in practice, we will use a constraint of the form $d\big(P_n^{\mathcal{M}^*}(V), P_{m_n}^{\mathcal{M}}(V)\big) \le \alpha_n$, where $\mathcal{M}$ is the candidate NCM, $d$ is some notion of distance between distributions, and $\alpha_n$ accounts for the sampling error. It is not clear that this approach is consistent, i.e., that it converges to the correct partial identification bounds as the sample size $n$ grows to infinity. Balazadeh et al. [3] only show the consistency of this approach for linear SCMs; consistency results for more general SCMs are still lacking in the neural causal literature.

In this paper, we establish representation and consistency results for general SCMs. Our contributions are summarized as follows.

- We show that, under suitable regularity assumptions, given any Lipschitz SCM we can approximate it by an NCM such that the Wasserstein distance between any interventional distribution of the NCM and that of the original SCM is small. Each random variable of the SCM is allowed to be continuous or categorical. We specify two neural network (NN) architectures that can be trained using common gradient-based optimization algorithms (Theorem 2, Theorem 3 and Corollary 1).
- To construct the NCM approximation, we develop a novel representation theorem for probability measures (Proposition 1) that may be of independent interest. Proposition 1 implies that, under certain assumptions, probability distributions supported on the unit cube can be simulated by pushing forward a multivariate uniform distribution.
- We highlight the importance of Lipschitz regularization by constructing a counterexample where the neural causal approach is not consistent without regularization (Proposition 2). Using Lipschitz regularization, we prove the consistency of the neural causal approach (Theorem 4).

**Related Work.** There exists a rich literature on partial identification of average treatment effects (ATE) [37, 4, 5, 27, 44, 45, 40, 8, 54, 56, 14, 7, 39, 26]. Balke and Pearl [4, 5] first give an algorithm to calculate bounds on the ATE in the Instrumental Variable (IV) setting. They show that, regardless of the exact distribution of the latent variables, it is sufficient to consider discrete latent variables as long as all endogenous variables are finitely supported. Moreover, they show that (1) is then an LP problem with a closed-form solution. This LP-based technique was generalized to several special classes of SCMs [44, 8, 27, 46]. For general discrete SCMs, [14, 56, 55] consider transforming problem (1) into a polynomial programming problem. Xia et al. [51] discover the connection between generative models and SCMs: they show that NCMs are expressive enough to approximate discrete SCMs and, by setting $\mathcal{C}$ in problem (1) to be the collection of NCMs, apply NCMs for identification and estimation. For causal models with continuous random variables, the constraints in (1) become integral equations, which makes the problem more difficult. One approach is to discretize the constraints. [24] uses a stochastic process representation of the causal model in the continuous IV setting and transforms the problem into a semi-infinite program. [32] relaxes the constraints to finitely many moment equations and solves the problem with the Augmented Lagrangian Method (ALM). The other approach is to use generative models to approximate the observational distribution and a metric to measure the distance between distributions. [33] first proposes to use GANs to learn causal implicit generative models for image data. Later, [29] uses the Wasserstein distance in the constraint and transforms the optimization problem into a min-max problem.
Similarly, [3] solves the optimization problem using the Sinkhorn distance to avoid instability during training. They propose to estimate the Average Treatment Derivative (ATD) and use the ATD to obtain a bound on the ATE. They also prove the consistency of the proposed estimator for linear SCMs. [41] uses a linear combination of basis functions to approximate response functions. [26] uses sieves to solve the resulting optimization problem in the IV setting. [19] proposes a neural network framework for sensitivity analysis under unobserved confounding.

**Organization of this paper.** In Section 2, we introduce the notation and basic concepts used throughout the paper. Next, in Section 3, we demonstrate how to construct NCMs that approximate a given SCM arbitrarily well; two kinds of architecture are given along with an approximation error analysis. In Section 4, we highlight the importance of Lipschitz regularization by giving a counterexample where the approach is not consistent without it. Then, leveraging the previous approximation results, we prove the consistency of this approach under regularization. Finally, we compare our method empirically with the traditional polynomial programming method in Section 4.1.

## 2 Preliminaries

First, we introduce the definition of an SCM. Throughout the paper, we use bold symbols to represent sets of random variables.

**Definition 1** (Structural Causal Model). A Structural Causal Model (SCM) is a tuple $\mathcal{M} = (V, U, \mathcal{F}, P(U), \mathcal{G}_0)$, where $V = \{V_i\}_{i=1}^{n_V}$ is the set of observed variables; $U = \{U_j\}_{j=1}^{n_U}$ is the set of latent variables; $P(U)$ is the distribution of the latent variables; and $\mathcal{G}_0$ is an acyclic directed graph whose nodes are $V$. The value of each observed variable $V_i$ is generated by

$$V_i = f_i\big(\mathrm{Pa}(V_i), U_{V_i}\big), \quad \text{where } V_i \notin \mathrm{Pa}(V_i) \text{ and } U_{V_i} \subseteq U, \tag{2}$$

where $\mathcal{F} = (f_1, \dots, f_{n_V})$, $\mathrm{Pa}(V)$ is the set of parents of $V$ in graph $\mathcal{G}_0$, and $U_{V_i}$ is the set of latent variables that affect $V_i$. Each $V_i$ takes either continuous values in $\mathbb{R}^{d_i}$ or categorical values in $[n_i]$. We extend the graph $\mathcal{G}_0$ by adding a bi-directed arrow between any $V_i, V_j \in \mathcal{G}_0$ if there exists a correlated latent variable pair $(U_k, U_l)$ with $U_k \in U_{V_i}$, $U_l \in U_{V_j}$. We call the extended graph $\mathcal{G}_{\mathcal{M}}$ the causal graph of $\mathcal{M}$. When we write $\mathrm{Pa}(V_i)$, we refer to the parents of $V_i$ in $\mathcal{G}_0$.

To connect with the literature, the causal graph we define is a kind of Acyclic Directed Mixed Graph (ADMG), which is often used to represent SCMs with unobserved variables [49]. Note that we allow one latent variable to enter several nodes, which differs from the common definition. We use $n_U$ and $n_V$ to denote the number of latent and observed variables, respectively.

Let $T \subseteq V$ be a set of treatment variables. The goal is to estimate causal quantities under a given intervention $T = t$. Formally, the structural equations of the intervened model are

$$T_i = t_i, \ T_i \in T; \qquad V_i(t) = f_i\big(\mathrm{Pa}(V_i), U_{V_i}\big), \ V_i \notin T.$$

We denote by $V_i(t)$ the value of $V_i$ under the intervention $T = t$ and by $P^{\mathcal{M}}(V(t))$ the distribution of $V(t)$ in $\mathcal{M}$. The notion of a C2-component [51] is defined as follows.

**Definition 2** (C2-Component). For a causal graph $\mathcal{G}$, a subset $C \subseteq V$ is a C2-component if each pair $V_i, V_j \in C$ is connected by a bi-directed arrow in $\mathcal{G}$ and $C$ is maximal. We provide a concrete example in Appendix A to explain all these notions.

We will make the following standard assumption about the independence of latent variables. Note that since we allow latent variables to enter multiple structural equations, this is more of a notational convention than an actual assumption. Also note that under this convention a bi-directed arrow essentially represents the existence of a common latent parent variable.
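To make Definition 1 and the intervention notation $V_i(t)$ concrete, here is a minimal sketch (with hypothetical structural functions, not taken from the paper) of a confounded SCM $T \leftarrow U \to Y$, $T \to Y$, showing how an intervention $T = t$ overwrites only the treatment equation while reusing the same latent mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_scm(n, t_do=None):
    """Sample (T, Y) from a toy SCM; if t_do is given, apply do(T = t_do)."""
    u = rng.normal(size=n)                      # latent confounder U
    eps_t, eps_y = rng.normal(size=n), rng.normal(size=n)
    t = (u + eps_t > 0).astype(float)           # T = f_T(U, eps_T)
    if t_do is not None:
        t = np.full(n, float(t_do))             # intervention: overwrite T only
    y = 2.0 * t + u + 0.5 * eps_y               # Y = f_Y(T, U, eps_Y)
    return t, y

# Observational mean of Y vs. interventional means E[Y(1)], E[Y(0)].
_, y_obs = sample_scm(100_000)
_, y1 = sample_scm(100_000, t_do=1)
_, y0 = sample_scm(100_000, t_do=0)
print("E[Y] (observational):", y_obs.mean())
print("E[Y(1)] - E[Y(0)] (ATE):", y1.mean() - y0.mean())   # close to 2.0
```

Because the latent $U$ also enters the treatment equation, the naive observational contrast $\mathbb{E}[Y \mid T=1] - \mathbb{E}[Y \mid T=0]$ is biased in this example, which is exactly the situation in which one resorts to the partial identification program (1).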
**Assumption 1.** All the latent variables in $U$ are independent.

To deal with categorical variables, we assume that the latent variables influencing a categorical variable consist of two parts: the shared confounders that influence the propensity functions, and independent noise that generates the categorical draws.

**Assumption 2.** The set of latent variables consists of two parts, $U = \{U_1, \dots, U_{n_U}\} \cup \{G_{V_i} : V_i \text{ is categorical}\}$. Precisely, if $V_i \in V$ is a categorical variable, the data generation process of $V_i$ satisfies

$$V_i = \arg\max_{k \in [n_i]} \Big\{ g_k^{V_i} + \log\big(f_i(\mathrm{Pa}(V_i), U_{V_i})\big)_k \Big\} \sim \mathrm{Categorical}\big(f_i / \|f_i\|_1\big),$$

where $G_{V_i} = (g_1^{V_i}, \dots, g_{n_i}^{V_i})$ are i.i.d. standard Gumbel variables and $U_{V_i} \subseteq \{U_1, \dots, U_{n_U}\}$.

This convention is without loss of generality at this point, but it will be useful when introducing Lipschitz restrictions on the structural equation functions. The Gumbel variables in the assumption can be replaced by any random variables that can generate categorical variables, and it can be shown that all discrete SCMs satisfy this assumption. Note that we implicitly assume that all categorical variables $V_i$ are supported on $[n_i]$ for some $n_i$; it is straightforward to generalize all results to any finite support. Next, we introduce Neural Causal Models (NCMs).

**Definition 3** (Neural Causal Model). A Neural Causal Model (NCM) is a special kind of SCM where $U = \{U_1, \dots, U_{n_U}\} \cup \{G_{V_i} : V_i \text{ categorical}\}$, all $U_i$ are i.i.d. multivariate uniform variables, the $G_{V_i}$ are i.i.d. Gumbel variables, and the functions in (2) are Neural Networks (NNs).

The definition of NCMs we use is slightly different from that in [52] because we need to handle mixed variables in our models. In (1), we would ideally take $\mathcal{C}$ to be the set of all SCMs. However, it is difficult to search over all SCMs since (1) then becomes an infinite-dimensional polynomial programming problem. As an alternative, we can search over all NCMs. One quantity of common interest in causal inference is the Average Treatment Effect (ATE).

**Definition 4** (Average Treatment Effect). For an SCM $\mathcal{M}$, the ATE at $T = t$ with respect to $T = t_0$ is given by $\mathrm{ATE}_{\mathcal{M}}(t) = \mathbb{E}_{u \sim P(U)}[Y(t) - Y(t_0)]$.

Partial identification can be formulated as estimating the solutions of the optimization problems (1) [3]. The max and min values $\overline{F}$ and $\underline{F}$ define the interval $[\underline{F}, \overline{F}]$, which is the smallest interval we can derive from the observed data without additional assumptions. In particular, if $\underline{F} = \overline{F}$, then the causal quantity is point-identified.

**Notations.** We use $\|\cdot\|_1$, $\|\cdot\|_\infty$ for the 1-norm and $\infty$-norm, and $[n]$ for $\{1, \dots, n\}$. Bold letters represent sets of random variables. We let $\mathcal{F}_L(K_1, K_2)$ be the class of $L$-Lipschitz continuous functions $f : K_1 \to K_2$; we may omit the domain and write $\mathcal{F}_L$ when it is clear from context. Let $\mathcal{H}(K_1, K_2)$ be the set of homeomorphisms from $K_1$ to $K_2$, i.e., bijective maps that are continuous in both directions. We define $\epsilon(\mathcal{F}_1, \mathcal{F}_2) = \sup_{f_2 \in \mathcal{F}_2} \inf_{f_1 \in \mathcal{F}_1} \|f_2 - f_1\|_\infty$ for function classes $\mathcal{F}_1, \mathcal{F}_2$. We use standard asymptotic notation $O(\cdot)$, $\Omega(\cdot)$. Given a measure $\mu$ on $\mathbb{R}^{d_1}$ and a measurable function $f : \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$, the push-forward measure $f_\#\mu$ is defined by $f_\#\mu(B) = \mu\big(f^{-1}(B)\big)$ for all measurable sets $B$. We use $P(X)$ to represent the distribution of a random variable $X$. Let $\Delta_n = \{(p_1, \dots, p_n) : \sum_{i=1}^n p_i = 1, \ p_i \ge 0\}$ be the probability simplex. We use $\mathrm{Categorical}(p)$, $p \in \Delta_n$, to denote the categorical distribution with event probabilities $p$. We let $W(\cdot, \cdot)$ be the Wasserstein-1 distance and $S_\lambda(\mu, \nu)$ the Sinkhorn distance [12] with regularization parameter $\lambda > 0$.
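Before moving on, here is a quick numerical sanity check of the Gumbel-max mechanism in Assumption 2 (the propensity values below are arbitrary illustrations): adding standard Gumbel noise to the log-propensities and taking an argmax reproduces the corresponding categorical distribution.

```python
import numpy as np

rng = np.random.default_rng(2)

f = np.array([2.0, 1.0, 3.0])        # unnormalized propensities f_i(Pa(V_i), U_{V_i})
p = f / f.sum()                      # Categorical(f / ||f||_1)

n = 500_000
gumbel = rng.gumbel(size=(n, len(f)))          # i.i.d. standard Gumbel g_k^{V_i}
v = np.argmax(gumbel + np.log(f), axis=1)      # V_i = argmax_k { g_k + log f_k }

empirical = np.bincount(v, minlength=len(f)) / n
print("target     :", np.round(p, 3))
print("Gumbel-max :", np.round(empirical, 3))  # agrees up to Monte Carlo error
```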
## 3 Approximation Error of Neural Causal Models

In this section, we study the expressive power of NCMs, which serves as a key ingredient in proving the consistency result. In particular, given an SCM $\mathcal{M}^*$, we want to construct an NCM $\hat{\mathcal{M}}$ such that the two causal models produce similar interventional results. Unlike in the discrete case [4, 51], the latent distributions can be extremely complicated in the general case. The main challenge is how to design the structure of NCMs to ensure strong approximation power. In the following, we first derive an upper bound on the Wasserstein distance between two causal models sharing the same causal graph. Using this result, we decompose the approximation error into two parts: the error caused by approximating the structural functions via neural networks, and the error of approximating the latent distributions. Then, we design different architectures for these two parts.

**Decomposing the Approximation Error.** First, we present a canonical representation of an SCM, which essentially states that we only need to consider the case where each latent variable $U_i$ corresponds to a C2-component of $\mathcal{G}$.

**Definition 5** (Canonical representation). An SCM $\mathcal{M}$ with causal graph $\mathcal{G}$ has canonical form if:

1. The set of latent variables consists of two sets,
$$U = \{U_C : C \text{ is a C2-component of } \mathcal{G}\} \cup \big\{G_{V_i} = (g_1^{V_i}, \dots, g_{n_i}^{V_i}) : V_i \text{ is categorical}\big\},$$
where the $U_C$ and $g_j^{V_i}$ are independent and the $g_j^{V_i}$ are standard Gumbel variables.

2. The structural equations have the form
$$V_i = \begin{cases} f_i\big(\mathrm{Pa}(V_i), U_{V_i}\big), & V_i \text{ is continuous}, \\ \arg\max_{k \in [n_i]} \big\{ g_k^{V_i} + \log\big(f_i(\mathrm{Pa}(V_i), U_{V_i})\big)_k \big\}, & V_i \text{ is categorical},\ \|f_i\|_1 = 1, \end{cases} \tag{3}$$
where $U_{V_i} = \{U_C : V_i \in C,\ C \text{ is a C2-component of } \mathcal{G}\}$ and $(x)_k$ is the $k$-th coordinate of the vector $x$. We further assume that the $f_i$ are normalized for categorical variables.

Given a function class $\mathcal{F}$, the SCM class $\mathcal{M}(\mathcal{G}, \mathcal{F}, U)$ consists of all canonical SCMs with causal graph $\mathcal{G}$ such that $f_i \in \mathcal{F}$, $i \in [n_V]$. Proposition 4 in the appendix shows that any SCM satisfying Assumptions 1 and 2 can be represented in this way, and we provide an example in Appendix B.1 illustrating how to obtain the canonical representation of a given SCM. Therefore, we restrict our attention to the class $\mathcal{M}(\mathcal{G}, \mathcal{F}, U)$. For two SCM classes $\mathcal{M}(\mathcal{G}, \mathcal{F}, U)$ and $\mathcal{M}(\mathcal{G}, \hat{\mathcal{F}}, \hat{U})$, we want to study how well the models in the first class can be represented by the second class. The Wasserstein distance between interventional distributions is used to measure the quality of the approximation. To approximate the functions in the structural equations, we need the following regularity assumptions.

**Assumption 3.** If $V_i$ is continuous, $f_i$ in (3) is $L_f$-Lipschitz continuous. If $V_i$ is categorical, the propensity functions $\big(f_i(\mathrm{Pa}(V_i), U_{V_i})\big)_j = P(V_i = j \mid \mathrm{Pa}(V_i), U_{V_i})$, $j \in [n_i]$, are $L_f$-Lipschitz continuous. Moreover, there exists a constant $K > 0$ such that $\max_{i \in [n_V],\, j \in [n_U]}\{\|V_i\|, \|U_j\|\} \le K$.

The following theorem summarizes our approximation error decomposition.

**Theorem 1.** Given any SCM $\mathcal{M} \in \mathcal{M}(\mathcal{G}, \mathcal{F}_L, U)$, let the treatment variable set be $T = \{T_k\}_{k=1}^{n_t}$ and suppose that Assumptions 1, 2 and 3 hold for $\mathcal{M}$ with Lipschitz constant $L$ and constant $K$. For any intervention $T = t$ and any $\hat{\mathcal{M}} \in \mathcal{M}(\mathcal{G}, \hat{\mathcal{F}}, \hat{U})$, we have
$$W\Big(P^{\mathcal{M}}(V(t)),\ P^{\hat{\mathcal{M}}}(\hat{V}(t))\Big) \le C_G(L, K)\left(\sum_{i=1}^{n_V} \big\|f_i - \hat{f}_i\big\|_\infty + W\Big(P^{\mathcal{M}}(U),\ P^{\hat{\mathcal{M}}}(\hat{U})\Big)\right),$$
where $C_G(L, K)$ is a constant that only depends on $L$, $K$ and the causal graph $\mathcal{G}$, and $f_i$, $\hat{f}_i$ are the structural functions of $\mathcal{M}$ and $\hat{\mathcal{M}}$, respectively.
[Figure 1: (a) Architecture of the wide neural network (shown for a 4-dimensional output). The first (yellow) part approximates the distribution on the different connected components of the support using the results from [43]; each block in this part has width $W_1$ and depth $L_1$. The second (blue) part transforms the distributions on the unit cubes into distributions on the support; each block in this part has width $W_2$ and depth $L_2$. The third (green) part is the Gumbel-Softmax layer; it combines the distributions on the different connected components of the support and outputs the final distribution. (b) Illustration of the first two parts of the architecture: each input coordinate $Z_i \sim \mathrm{Unif}(0,1)$ (one per connected component; $N_C$ denotes the number of connected components of the latent distribution) is first pushed forward to a unit cube and then, using Assumption 4, mapped onto the corresponding component of the support of $P(U)$. In this way, complicated measures can be approximated by pushing forward uniform variables.]

Theorem 1 separates the approximation error into two parts, which motivates us to construct the NCM in the following way. First, we approximate the functions $f_i$ in (3) by NNs $\hat{f}_i$. Then, we approximate the distribution of the latent variables by pushing forward uniform and Gumbel variables through neural networks, i.e., $\hat{U}_{C_j} = \hat{g}_j(Z_{C_j})$, where $\{C_j\}$ are the C2-components and the $Z_{C_j}$ are multivariate uniform and Gumbel random variables. The structural equations of the resulting approximating model $\hat{\mathcal{M}}$ are

$$\hat{V}_i = \begin{cases} \hat{f}_i\Big(\mathrm{Pa}(\hat{V}_i),\ \big(\hat{g}_j(Z_{C_j})\big)_{U_{C_j} \in U_{V_i}}\Big), & V_i \text{ is continuous}, \\ \arg\max_{k \in [n_i]} \Big\{ g_k + \log \Big(\hat{f}_i\big(\mathrm{Pa}(\hat{V}_i), (\hat{g}_j(Z_{C_j}))_{U_{C_j} \in U_{V_i}}\big)\Big)_k \Big\}, & V_i \text{ is categorical}, \end{cases} \tag{5}$$

where the constants $N_{C,j}$ (the number of connected components of the support of $U_j$) will be specified later.

For the first part, we need to approximate Lipschitz continuous functions. For simplicity, we assume that the domains of the functions are unit cubes; similar arguments hold for any bounded cubes. We denote by $\mathcal{NN}_{k_1,k_2}(W, L)$ the set of ReLU NNs with input dimension $k_1$, output dimension $k_2$, width $W$ and depth $L$. It has been shown that $\epsilon\big(\mathcal{NN}_{d_1,1}(2d_1 + 10, L_0),\ \mathcal{F}_L([0,1]^{d_1}, \mathbb{R})\big) \le O\big(L_0^{-2/d_1}\big)$ [53], where $\epsilon(\cdot, \cdot)$ denotes the approximation error defined in Section 2. For a vector-valued function, we can use a wider NN to approximate each coordinate and obtain a similar rate. For the second part, we approximate each $U_i$ individually by pushing forward i.i.d. multivariate uniform and Gumbel variables, $\hat{U}_{C_i} = \hat{g}_i(Z_{C_i})$, since the latent variables are independent by Assumption 1. To do so, we examine under which assumptions on a measure $P$ over $\mathbb{R}^n$ we can find an NN $\hat{g}$ such that $W(\hat{g}_\#\lambda, P)$ is small, where $\lambda(\cdot)$ is some reference measure.

### 3.1 Approximating Mixed Distributions by Wide Neural Networks

In this subsection, we extend the results in [43] to construct a wide neural network as the push-forward map. It turns out that, in order to get a good approximation of the target distribution, the shape of its support is essential.

**Assumption 4** (Mixed Distribution). The support of the measure $P$ has finitely many connected components $C_1, C_2, \dots, C_{N_C}$, i.e., $\mathrm{supp}(P) = \bigcup_{i=1}^{N_C} C_i$, and each component $C_i$ satisfies $\mathcal{H}([0,1]^{d_i^C}, C_i) \cap \mathcal{F}_L([0,1]^{d_i^C}, C_i) \neq \emptyset$ for some $d_i^C \ge 0$. Recall that $\mathcal{H}(K_1, K_2)$ is the set of homeomorphisms from $K_1$ to $K_2$ defined at the end of Section 2.
Assumption 4 encompasses almost all natural distributions. For example, distributions supported on $[0,1]^d$ or on closed balls, finitely supported distributions, and mixtures of these all satisfy this assumption. Assumption 4 allows us to transform the support of the target distribution into unit cubes, and the nice geometric properties of unit cubes facilitate our construction.

Now, we briefly explain the construction of the push-forward maps; an example is provided in Appendix B.1 to illustrate it. The NN architecture consists of three parts (see Figure 1a). The input dimension equals the number of connected components of the support, $N_C$. For each component $C_i$, let $H_i \in \mathcal{H}([0,1]^{d_i^C}, C_i) \cap \mathcal{F}_L([0,1]^{d_i^C}, C_i)$, where $d_i^C$ is the dimension of component $C_i$ in Assumption 4. Since $P = (H_i)_\#(H_i^{-1})_\# P$ on $C_i$, we can first approximate $(H_i^{-1})_\# P$, which is supported on a unit cube: [43] constructs a wide NN $\hat{g}$ of width $W$ and constant depth such that $W\big(\hat{g}_\#\lambda, (H_i^{-1})_\# P\big) \le O\big(W^{-1/d_i^C}\big)$, where $\lambda$ is the uniform measure on $[0,1]$. Then, we approximate the Lipschitz map $H_i$ in order to map the distribution back onto $C_i$. These are the first two parts (yellow and blue blocks in Figure 1a) of the architecture. Suppose the output of the $i$-th coordinate after the first two parts is $v_i$; the Gumbel-softmax layer in the third part (green box) combines the different components of the support. In particular, we want to output $v_i$ with probability $p_i = P(C_i)$. Letting $V = [v_1, \dots, v_{N_C}]$, this can be achieved by outputting $V X$, where $X = (X_1, \dots, X_{N_C})^\top$ is a one-hot random vector with $P(X_i = 1) = p_i$. To use backpropagation in training, we use the Gumbel-Softmax trick [31] to (approximately) simulate such a random vector,

$$\hat{X}_i^\tau = \frac{\exp\big((\log p_i + G_i)/\tau\big)}{\sum_{k=1}^{N_C} \exp\big((\log p_k + G_k)/\tau\big)},$$

where $\tau > 0$ is the temperature parameter (a hyperparameter) and $G_i \sim \mathrm{Gumbel}(0,1)$ are i.i.d. standard Gumbel variables. As $\tau \to 0$, the distribution of $\hat{X}^\tau$ converges almost surely to the categorical distribution [36, Proposition 1]. In particular, when $\tau = 0$, we write $\hat{X}_i^0 = \mathbb{I}\{i = \arg\max_j\{\log p_j + G_j\}\}$. The output of the last layer is $V \hat{X}^\tau$. Note that the Gumbel-softmax function is differentiable with respect to the parameters $\{\log p_i\}_{i=1,\dots,N_C}$; therefore, we can train the network with common gradient-based algorithms. Putting things together, we obtain the following theorem.

**Theorem 2.** Given any probability measure $P$ on $\mathbb{R}^d$ that satisfies Assumption 4, let $\lambda$ be the Lebesgue measure on $[0,1]^{N_C}$, where $N_C$ is defined in Assumption 4. There exists a neural network $\hat{g} = \hat{g}_3^\tau \circ \hat{g}_2 \circ \hat{g}_1$ with the above architecture (Figure 1a) such that
$$W(\hat{g}_\#\lambda, P) \le O\Big(W_1^{-1/\max_i\{d_i^C\}} + L_2^{-2/\max_i\{d_i^C\}} + (\tau - \tau\log\tau)\Big).$$
Here, $\hat{g}_i$, $i = 1, 2$, has the separable form $\big(\hat{g}_i^1(x_1), \dots, \hat{g}_i^{N_C}(x_{N_C})\big)$, with $\hat{g}_1^j \in \mathcal{NN}_{1, d_j^C}\big(W_1, \Theta(d_j^C)\big)$ and $\hat{g}_2^j \in \mathcal{NN}_{d_j^C, d}\big(\Theta(d\, d_j^C), L_2\big)$, $j \in [N_C]$, where the $d_j^C$ are the dimensions of the cubes in Assumption 4, and $\hat{g}_3^\tau$ is the Gumbel-Softmax layer with temperature parameter $\tau > 0$.

Note that $\hat{g}_3^\tau$ (the Gumbel-softmax layer) is actually a random function, since the coefficient vector $\hat{X}^\tau$ is a random variable. In this sense, $\hat{g}_\#\lambda$ can be viewed as pushing forward uniform and Gumbel variables through a neural net.
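The following PyTorch sketch illustrates this three-part generator on a toy target with two connected components (a minimal rendering of the Figure 1a idea, with made-up component networks and probabilities, not the paper's implementation). It uses `torch.nn.functional.gumbel_softmax` to mix the per-component outputs differentiably.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComponentMixer(nn.Module):
    """Push forward one Unif(0,1) input per component, then mix with Gumbel-Softmax."""
    def __init__(self, n_components=2, out_dim=1, width=32, tau=0.5):
        super().__init__()
        self.tau = tau
        # one small push-forward network per connected component (parts 1-2 of Fig. 1a)
        self.components = nn.ModuleList([
            nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, out_dim))
            for _ in range(n_components)
        ])
        # trainable log-probabilities log p_i of selecting each component (part 3)
        self.log_p = nn.Parameter(torch.zeros(n_components))

    def forward(self, z):                      # z: (batch, n_components) ~ Unif(0,1)
        outs = torch.stack(
            [net(z[:, i:i + 1]) for i, net in enumerate(self.components)], dim=1
        )                                      # (batch, n_components, out_dim)
        logits = self.log_p.expand(z.shape[0], -1)
        x = F.gumbel_softmax(logits, tau=self.tau)     # soft one-hot selector X^tau
        return (x.unsqueeze(-1) * outs).sum(dim=1)     # output V X^tau

gen = ComponentMixer()
z = torch.rand(4, 2)
print(gen(z).shape)        # torch.Size([4, 1]); gradients flow into log_p and both nets
```

Training such a generator then amounts to minimizing a distributional loss (e.g., the Sinkhorn distance used in Section 4) between samples pushed through the generator and the target samples.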
### 3.2 Approximating Mixed Distributions by Deep Neural Networks

In this subsection, we show that under one additional assumption on the distribution, deep ReLU networks have stronger power for approximating distributions, which means we can use fewer computational units to achieve the same worst-case theoretical approximation error.

**Assumption 5** (Lower Bound). Suppose that $P$ is supported on a compact set $K \subseteq \mathbb{R}^D$ and there exist a constant $C_f > 0$ and $f \in \mathcal{H}([0,1]^d, K) \cap \mathcal{F}_L([0,1]^d, K)$ such that, for any measurable set $B \subseteq [0,1]^d$, $P(f(B)) \ge C_f \lambda(B)$. Besides, if $d > 0$, $f^{-1}_\# P$ vanishes on the boundary $\partial[0,1]^d$.

Assumption 5 implies that the Radon–Nikodym derivative $d\lambda / d\big(f^{-1}_\# P\big)$ exists and is bounded by a constant depending on $C_f$. The next proposition extends the Skorokhod representation theorem [48]. It shows that under Assumption 5, it is possible to simulate any such distribution on the unit cube using a Hölder continuous curve and the uniform distribution on $[0,1]$.

**Proposition 1.** Let $\lambda$ be the Lebesgue measure on $[0,1]$. Given any probability measure $P$ that satisfies Assumption 5, there exists a continuous curve $\gamma : [0,1] \to \mathrm{supp}(P)$ such that $\gamma_\#\lambda = P$. Furthermore, if $d \ge 1$, $\gamma$ is $1/d$-Hölder continuous.

Results from [53] show that we can approximate any $d$-dimensional Hölder continuous function with index $\alpha$ by a deep ReLU network with depth $L$ and error $O(L^{-2\alpha/d})$. Leveraging this result, we can replace the first part of the architecture of the previous subsection with deep ReLU networks (see Figure 8 in the appendix); the remaining two parts are the same as in the construction of Figure 1a. The following theorem gives a sharper bound on the approximation error than Theorem 2.

**Theorem 3.** Under Assumption 4, if in addition $P$ restricted to each component satisfies Assumption 5, there exists a neural network $\hat{g} = \hat{g}_3^\tau \circ \hat{g}_2 \circ \hat{g}_1$ with the above architecture such that
$$W(\hat{g}_\#\lambda, P) \le O\Big(L_1^{-2/\max_i\{d_i^C\}} + L_2^{-2/\max_i\{d_i^C\}} + (\tau - \tau\log\tau)\Big),$$
where the $d_i^C$ are the dimensions of the connected components in Assumption 4, $\hat{g}_i$, $i = 1, 2$, has the form $\big(\hat{g}_i^1(x_1), \dots, \hat{g}_i^{N_C}(x_{N_C})\big)$, and $\hat{g}_1^j \in \mathcal{NN}_{1, d_j^C}\big(\Theta(d_j^C), L_1\big)$; $\hat{g}_2^j$, $\hat{g}_3^\tau$ and $\tau$ are the same as in Theorem 2.

Letting $N$ be the number of nonzero weights in a neural network, Theorem 3 shows that a deep NN can achieve $O(N^{-2/d})$ error, while by Theorem 2 a wide network can only achieve $O(N^{-1/d})$ error. Now we can put things together to construct NCM approximations. For simplicity, we omit the input and output dimensions of the neural networks. As mentioned in the previous sections, our construction (5) consists of two parts: $\hat{f}_i$ approximating the structural functions $f_i$ in (3), and $\hat{g}_j(Z_j) = \hat{g}_{3,j}^\tau \circ \hat{g}_{2,j} \circ \hat{g}_{1,j}(Z_j)$ approximating the latent variables $U_j$. Let $\mathrm{NCM}_{\mathcal{G}}(\mathcal{F}_0, \mathcal{F}_1, \mathcal{F}_2, \tau)$ be the collection of NCMs with structural equations (5) such that $\hat{f}_i \in \mathcal{F}_0$, $\hat{g}_{1,j} = (\hat{g}_{1,j}^1, \dots, \hat{g}_{1,j}^{N_{C,j}})$ with $\hat{g}_{1,j}^i \in \mathcal{F}_1$, and $\hat{g}_{2,j} = (\hat{g}_{2,j}^1, \dots, \hat{g}_{2,j}^{N_{C,j}})$ with $\hat{g}_{2,j}^i \in \mathcal{F}_2$, where $N_{C,j}$ is the number of connected components for $U_j$ and the $\mathcal{F}_i$ are function classes.

**Corollary 1.** Given a causal model $\mathcal{M}^* \in \mathcal{M}(\mathcal{G}, \mathcal{F}_L, U)$, suppose that Assumptions 1-3 hold and that the distributions of $U_C$ for all C2-components satisfy the assumptions of Theorem 3. Let $d^{\mathrm{in}}_{\max}$ and $d^{\mathrm{out}}_{\max}$ be the largest input and output dimensions of the $f_i$ in (3) and $d^U_{\max}$ be the largest dimension of the latent variables. There exists a neural causal model $\hat{\mathcal{M}} \in \mathrm{NCM}_{\mathcal{G}}\big(\mathcal{NN}(\Theta(d^{\mathrm{in}}_{\max} d^{\mathrm{out}}_{\max}), L_0),\ \mathcal{NN}(\Theta(d^U_{\max}), L_1),\ \mathcal{NN}(\Theta((d^U_{\max})^2), L_2),\ \tau\big)$ with structural equations (5) such that, for any intervention $T = t$,
$$W\Big(P^{\mathcal{M}^*}(V(t)),\ P^{\hat{\mathcal{M}}}(V(t))\Big) \le O\Big(L_0^{-2/d^{\mathrm{in}}_{\max}} + L_1^{-2/d^U_{\max}} + L_2^{-2/d^U_{\max}} + (\tau - \tau\log\tau)\Big).$$

Similar approximation results also hold for the wide NN architecture presented in Section 3.1; the proof can be easily adapted to that case.
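To make the construction (5) concrete, here is a minimal PyTorch sketch (an illustration with hypothetical layer sizes, not the authors' code) of an NCM for the instrumental-variable graph Z → T → Y with unobserved confounding between T and Y: one latent generator per C2-component pushes uniform noise forward, and each structural equation is a small MLP taking its parents and the generated latent.

```python
import torch
import torch.nn as nn

def mlp(d_in, d_out, width=32):
    return nn.Sequential(nn.Linear(d_in, width), nn.ReLU(), nn.Linear(width, d_out))

class IVNCM(nn.Module):
    """NCM for Z -> T -> Y with a bi-directed arrow T <-> Y (one C2-component {T, Y})."""
    def __init__(self, d_latent=2):
        super().__init__()
        self.d_latent = d_latent
        self.g_TY = mlp(d_latent, d_latent)   # push-forward generator for U_{TY}
        self.f_Z = mlp(1, 1)                  # Z = f_Z(noise), exogenous instrument
        self.f_T = mlp(1 + d_latent, 1)       # T = f_T(Z, U_TY)
        self.f_Y = mlp(1 + d_latent, 1)       # Y = f_Y(T, U_TY)

    def forward(self, n, do_t=None):
        u = self.g_TY(torch.rand(n, self.d_latent))   # latent of the C2-component {T, Y}
        z = self.f_Z(torch.rand(n, 1))
        t = self.f_T(torch.cat([z, u], dim=1))
        if do_t is not None:                  # intervention do(T = t): overwrite T only
            t = torch.full((n, 1), float(do_t))
        y = self.f_Y(torch.cat([t, u], dim=1))
        return z, t, y

ncm = IVNCM()
z, t, y = ncm(1024)                 # observational samples from the NCM
_, _, y1 = ncm(1024, do_t=1.0)      # interventional samples under do(T = 1)
print(y.shape, y1.mean().item())
```

Matching the observational distribution of (Z, T, Y) constrains the parameters of all four networks jointly, while interventional quantities such as E[Y(1)] can be read off by Monte Carlo under do(T = 1).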
## 4 Consistency of Neural Causal Partial Identification

In this section, we prove the consistency of the max/min optimization approach to partial identification. In the finite-sample setting, we consider the following estimator $F_n$ of the optimal (minimum) value of (1):

$$F_n = \min_{\hat{\mathcal{M}} \in \mathrm{NCM}_{\mathcal{G}}(\mathcal{F}_{0,n}, \mathcal{F}_{1,n}, \mathcal{F}_{2,n}, \tau_n)} \ \mathbb{E}_{t \sim \mu_T}\, \mathbb{E}^{\hat{\mathcal{M}}}\big[F(V_1(t), \dots, V_{n_V}(t))\big], \quad \text{s.t. } S_{\lambda_n}\big(P^{\hat{\mathcal{M}}}_{m_n}(V),\ P^{\mathcal{M}^*}_n(V)\big) \le \alpha_n, \tag{6}$$

where $P^{\mathcal{M}^*}_n$ and $P^{\hat{\mathcal{M}}}_{m_n}$ are empirical distributions of $P^{\mathcal{M}^*}$ and $P^{\hat{\mathcal{M}}}$ with sample sizes $n$ and $m_n$, $\mu_T$ is some given measure over interventions, and the classes $\mathcal{F}_{i,n}$ will be specified later. For example, the counterfactual outcome $\mathbb{E}[Y(1)]$ is a special case of this objective, and our results generalize easily to any linear combination of objective functions of this form. We use the Sinkhorn distance because it can be computed efficiently in practice [15]. We want to study whether $F_n$ gives a useful lower bound as $n \to \infty$.

To match the observational distribution, we need to increase the width or depth of the NNs we use. As the sample size increases, the number of parameters also grows to infinity, which creates difficulties in the analysis. To obtain consistency, we need to regularize the functions while preserving their approximation power. Surprisingly, if we do not use any regularization, the following proposition implies that consistency may fail even if the SCM is identifiable.

**Proposition 2** (Informal; see Proposition 5 for a formal version). There exist a constant $c > 0$ and an identifiable SCM $\mathcal{M}^*$ satisfying Assumptions 1-5 such that, for any $\epsilon > 0$, there exists an SCM $\mathcal{M}_\epsilon$ satisfying $W\big(P^{\mathcal{M}^*}(V), P^{\mathcal{M}_\epsilon}(V)\big) \le \epsilon$ and $|\mathrm{ATE}_{\mathcal{M}^*} - \mathrm{ATE}_{\mathcal{M}_\epsilon}| > c$. Here, $\mathcal{M}^*$ is the ground-truth model and the $\mathcal{M}_\epsilon$ are the models we use to approximate $\mathcal{M}^*$.

Proposition 2 implies that we need some regularization on $\mathcal{M}_\epsilon$: otherwise, even if the observational distributions are close, the ATEs can be far apart. In particular, we may want to regularize the Lipschitz constant of the NN. Much work has been done on imposing Lipschitz regularization during the training process [13, 50, 22, 42, 10]. We denote by $\mathrm{Lip}(f)$ the Lipschitz constant of a function $f$ and define the truncated Lipschitz NN class
$$\mathcal{NN}^{L_f, K}_{d_1, d_2}(W, L) = \big\{\max\{-K, \min\{f, K\}\} : f \in \mathcal{NN}_{d_1, d_2}(W, L),\ \mathrm{Lip}(f) \le L_f\big\}.$$
For simplicity, we omit the dimensions and use the shorthand $\mathcal{NN}^{L_f, K}(W, L)$.
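There are several practical ways to keep a trained network inside (or close to) such a class; the sketch below shows one common option, spectral normalization of the linear layers combined with clamping the output to [−K, K]. This is an illustrative choice (the product of layer norms only upper-bounds the Lipschitz constant, and it is not necessarily the regularizer used in the paper's experiments).

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

K = 5.0      # truncation level K from Assumption 3 (illustrative value)
LIP = 2.0    # target Lipschitz bound L_f (illustrative value)

class TruncatedLipschitzMLP(nn.Module):
    """ReLU MLP with spectrally normalized (hence 1-Lipschitz) linear layers, rescaled
    by LIP and clamped to [-K, K], as a practical stand-in for the class NN^{L_f, K}."""
    def __init__(self, d_in, d_out, width=64):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(d_in, width)), nn.ReLU(),
            spectral_norm(nn.Linear(width, width)), nn.ReLU(),
            spectral_norm(nn.Linear(width, d_out)),
        )

    def forward(self, x):
        # Each normalized layer has spectral norm ~1 and ReLU is 1-Lipschitz, so
        # LIP * net is (approximately) LIP-Lipschitz; clamping is 1-Lipschitz too.
        return torch.clamp(LIP * self.net(x), -K, K)

f = TruncatedLipschitzMLP(3, 1)
x = torch.randn(8, 3)
print(bool((f(x).abs() <= K).all()))   # True: outputs are truncated to [-K, K]
```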
The next theorem gives the consistency result for the min estimator. To state the theorem, we define $F^* = \mathbb{E}_{t \sim \mu_T}\, \mathbb{E}^{\mathcal{M}^*}\big[F(V_1(t), \dots, V_{n_V}(t))\big]$ to be the true value and $F^*_L$ to be the optimal value of the following optimization problem over SCMs with Lipschitz constant $L$:

$$F^*_L = \min_{\mathcal{M} \in \mathcal{M}(\mathcal{G}, \mathcal{F}^K_L, U),\ P(U)} \ \mathbb{E}_{t \sim \mu_T}\, \mathbb{E}^{\mathcal{M}}\big[F(V_1(t), \dots, V_{n_V}(t))\big], \quad \text{s.t. } W\big(P^{\mathcal{M}}(V), P^{\mathcal{M}^*}(V)\big) = 0, \tag{7}$$

where the minimum is taken over $\mathcal{M}(\mathcal{G}, \mathcal{F}^K_L, U)$ with $\mathcal{F}^K_L = \{f : \|f\|_\infty \le K,\ f \in \mathcal{F}_L\}$ and over all latent distributions $P(U)$. Note that if $L$ equals the Lipschitz bound $L_f$ that we assume on our structural functions, then $F^*_{L_f}$ is the sharp lower bound.

**Theorem 4.** Let $\mathcal{M}^*$ be any SCM satisfying the assumptions of Corollary 1. Suppose that the Lipschitz constant of the functions in $\mathcal{M}^*$ is $L_f$, that $F : \mathbb{R}^{n_V} \to \mathbb{R}$ in (6) is Lipschitz continuous, and that $\tau_n > 0$. Let $K > 0$ be the constant in Assumption 3, let $\hat{L}_f = \sqrt{d^{\mathrm{in}}_{\max} d^{\mathrm{out}}_{\max}}\, L_f$, and take
$$\mathcal{F}_{0,n} = \mathcal{NN}^{\hat{L}_f, K}\big(W_{0,n}, \Theta(\log d^{\mathrm{in}}_{\max})\big), \qquad \mathcal{F}_{1,n} = \mathcal{NN}^{\infty, \infty}\big(\Theta(d^U_{\max}), L_{1,n}\big), \qquad \mathcal{F}_{2,n} = \mathcal{NN}^{\infty, K}\big(\Theta((d^U_{\max})^2), L_{2,n}\big),$$
where $d^U_{\max}$, $d^{\mathrm{in}}_{\max}$ and $d^{\mathrm{out}}_{\max}$ are defined in Corollary 1. Take the radius to be $\alpha_n = \epsilon_n + s_n$, with
$$\epsilon_n = O\Big(W_{0,n}^{-1/d^{\mathrm{in}}_{\max}} + \textstyle\sum_{i=1}^2 L_{i,n}^{-2/d^U_{\max}} + \tau_n - \tau_n\log\tau_n\Big), \quad s_n = O\Big(m_n^{-1/(d^U_{\max}+2)}\log m_n + \delta_n + \log(n m_n)\,\lambda_n\Big), \quad \delta_n = O\Big(n^{-1/\max\{2, d^U_{\max}\}}\log^2(n)\Big).$$
If $m_n = \Omega(n)$, $L_{i,n} = \Theta\big(m_n^{d^U_{\max}/(2 d^U_{\max} + 4)}\big)$ for $i = 1, 2$, and $\lim_n \min\big\{\tau_n^{-1}, W_{0,n}, (\log(n m_n)\lambda_n)^{-1}\big\} = \infty$, then with probability 1,
$$\big[\liminf_n F_n,\ \limsup_n F_n\big] \subseteq \big[F^*_{\hat{L}_f},\ F^*\big].$$

Note that the theorem shows that the lower limit of the estimates is at least the point $F^*_{\hat{L}_f}$, where $\hat{L}_f$ is slightly larger than the original Lipschitz bound $L_f$ we impose on the structural functions. Hence, this point can potentially be slightly smaller than the sharp bound $F^*_{L_f}$. This worsening is due to the fact that we need to use NNs that satisfy a slightly worse Lipschitz property in order to ensure sufficient approximation power. Although Theorem 4 does not guarantee that $\{F_n\}$ converges to a point, it states that $\{F_n\}$ can only oscillate within the interval $[F^*_{\hat{L}_f}, F^*]$, which still gives a useful lower bound on the ground-truth value $F^*$. In particular, if the graph is identifiable, we have $F^* = F^*_{\hat{L}_f}$ and $\{F_n\}$ converges to $F^*$. Also note that in Theorem 4 we use wide NNs rather than deep NNs for $\mathcal{F}_{0,n}$, because the results in [53] show that wide NNs can approximate Lipschitz functions while controlling their Lipschitz constants (a property that is not yet established for deep NNs). Similar results can be obtained if wide neural nets are used for all components, invoking Theorem 2. As a special case, we leverage Theorem 4 to obtain a non-asymptotic rate for the ATE without confounding.

**Proposition 3** (Hölder continuity of the ATE). Given two causal models $\mathcal{M}, \hat{\mathcal{M}} \in \mathcal{M}(\mathcal{G}, \mathcal{F}_L, U)$ satisfying Assumption 1 and Assumption 3, let their observational distributions be $\nu$ and $\mu$, and suppose the norms of all variables are bounded by $K > 0$. If (1) (Overlap) $\nu$ is absolutely continuous with respect to a probability measure $P$ and the density satisfies $p_\nu(t \mid \mathrm{Pa}(T) = x) \ge \delta > 0$ for almost every $x$ and all $t \in [t_1, t_2]$, and (2) (No Confounding) there is no confounding in the causal graph $\mathcal{G}$, then
$$\int_{t_1}^{t_2} \big(\mathbb{E}^{\mathcal{M}}[Y(t)] - \mathbb{E}^{\hat{\mathcal{M}}}[\hat{Y}(t)]\big)^2 P(dt) \le \frac{2 C_W}{\delta}\, W(\mu, \nu) + 2(L+1)^{n_V}\, W^2(\mu, \nu)\,(t_2 - t_1),$$
where $C_W = 4(L+1)^{n_V} K + 2K\max\{(L+1)^{n_V}, 1\}$.

**Corollary 2.** Let $F$ and $\mu_T$ in Theorem 4 be $F(V) = Y$ and $\mu_T \sim \mathrm{Unif}([t_1, t_2])$, and let $\epsilon > 0$. Suppose that the assumptions of Proposition 3 and Theorem 4 hold. Then, with probability at least $1 - O(n^{-2})$, we have $|F_n - F^*| \le O(\sqrt{\alpha_n})$, where $F_n$, $F^*$ and $\alpha_n$ are defined in Theorem 4.
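In practice, the constrained problem (6) is typically relaxed to a penalized objective and optimized with stochastic gradients. The sketch below is schematic: it uses a hand-rolled entropic-OT Sinkhorn loss as a stand-in for $S_\lambda$, hypothetical hyperparameters, and the `IVNCM` generator sketched after Corollary 1, rather than the authors' implementation.

```python
import math
import torch

def sinkhorn_loss(x, y, eps=0.05, iters=50):
    """Entropic-OT transport cost between empirical samples x, y (stand-in for S_lambda)."""
    c = torch.cdist(x, y, p=1)                          # pairwise 1-norm costs
    log_a = torch.full((x.shape[0],), -math.log(x.shape[0]))
    log_b = torch.full((y.shape[0],), -math.log(y.shape[0]))
    f = torch.zeros_like(log_a)
    g = torch.zeros_like(log_b)
    for _ in range(iters):                              # log-domain Sinkhorn iterations
        f = -eps * torch.logsumexp((g[None, :] - c) / eps + log_b[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - c) / eps + log_a[:, None], dim=0)
    pi = torch.exp((f[:, None] + g[None, :] - c) / eps + log_a[:, None] + log_b[None, :])
    return (pi * c).sum()                               # cost under the entropic plan

# `ncm` is an IVNCM as in the earlier sketch; `data` is a tensor of observed (Z, T, Y) rows.
def train_lower_bound(ncm, data, penalty=10.0, steps=2000, m=512):
    opt = torch.optim.Adam(ncm.parameters(), lr=1e-3)
    for _ in range(steps):
        z, t, y = ncm(m)                                # m_n samples from the NCM
        fake = torch.cat([z, t, y], dim=1)
        real = data[torch.randint(0, data.shape[0], (m,))]
        _, _, y1 = ncm(m, do_t=1.0)                     # objective: E^{M_hat}[Y(1)]
        loss = y1.mean() + penalty * sinkhorn_loss(fake, real)
        opt.zero_grad(); loss.backward(); opt.step()
    return y1.mean().item()                             # approximate lower bound on E[Y(1)]
```

The max estimator is obtained by flipping the sign of the objective; the penalty weight is the Lagrangian counterpart of the radius constraint $\alpha_n$ in (6).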
### 4.1 Experiments

In this section, we examine the performance of our algorithm in two settings. We compare our algorithm with the Autobounds algorithm [14] on a binary IV example from [14] and on a continuous IV model.¹ Since Autobounds can only deal with discrete models, we discretize the continuous model for comparison purposes. Implementation details are provided in Appendix D. The setting of the first experiment is taken from [14, Section D.1]. This is a binary IV problem, so we can compute the optimal bound using LP [4]; we find that our bound is close to the optimal bound. The second experiment is a general IV example where the treatment is binary but the rest of the variables are continuous. The program that Autobounds solves after discretization contains $2^{14}$ variables. Even with such a fine discretization, the bound obtained by Autobounds is not tighter than that of our NCM approach. The details of the structural equations and the analysis can be found in Appendix D. We also provide an extra experiment on the counterexample of Proposition 5 in the appendix to show the effect of Lipschitz regularization.

| Setting | Algorithm | Average Bound | Optimal Bound | True Value |
|---|---|---|---|---|
| Binary IV | NCM (Ours) | [-0.49, 0.05] | [-0.45, -0.04] | -0.31 |
| Binary IV | Autobounds | [-0.45, -0.05] | [-0.45, -0.04] | -0.31 |
| General IV | NCM (Ours) | [2.49, 3.24] | -- | 3 |
| General IV | Autobounds | [1.40, 3.48] | -- | 3 |

Table 1: Experiment results for the two IV settings. The sample size is 5000 in each experiment. The experiments are repeated 10 times for the binary IV and 50 times for the continuous IV, and average bounds are reported. In all experiments, the bounds given by both algorithms cover the true values.

## Conclusion

In this paper, we provide theoretical justification for using NCMs for partial identification. We show that NCMs can be used to represent SCMs with complex unknown latent distributions under mild assumptions, and we prove the asymptotic consistency of the max/min estimator for partial identification of causal effects in general settings with both discrete and continuous variables. Our results also provide guidelines on the practical implementation of this method and on which hyperparameters are important, as well as recommendations on the values these hyperparameters should take for the consistency of the method. These practical guidelines were validated with a small set of targeted experiments, which also showcase superior performance of the neural-causal approach as compared to a prior main contender approach from econometrics and statistics that involves discretization and polynomial programming. An obvious next step in the theoretical foundation of neural-causal models is providing finite-sample guarantees for this method, which requires substantial further theoretical developments in the understanding of the geometry of the optimization program that defines the bounds on the causal effect of interest. We take a first step in that direction for the special case where there are no unobserved confounders, and view the general case as an exciting avenue for future work.

**Acknowledgement.** Vasilis Syrgkanis is supported by NSF Award IIS-2337916 and a 2022 Amazon Research Award.

¹ The code can be found at https://github.com/Jiyuan-Tan/NeuralPartialID.

## References

[1] Joshua Angrist and Guido Imbens. Identification and estimation of local average treatment effects, 1995.

[2] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, 1999.

[3] Vahid Balazadeh, Vasilis Syrgkanis, and Rahul G. Krishnan. Partial identification of treatment effects with implicit generative models, October 2022.

[4] Alexander Balke and Judea Pearl. Counterfactual probabilities: Computational methods, bounds and applications. In Uncertainty Proceedings 1994, pages 46-54. Elsevier, 1994.

[5] Alexander Balke and Judea Pearl. Bounds on treatment effects from studies with imperfect compliance. Journal of the American Statistical Association, 92(439):1171-1176, September 1997.

[6] Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1-17, 2019.

[7] Alexis Bellot. Towards bounding causal effects under Markov equivalence. arXiv preprint arXiv:2311.07259, 2023.

[8] Blai Bonet. Instrumentality tests revisited, January 2013.

[9] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising.
Journal of Machine Learning Research, 14(11), 2013. [10] Leon Bungert, René Raab, Tim Roith, Leo Schwinn, and Daniel Tenbrinck. CLIP: Cheap Lipschitz Training of Neural Networks. In Abderrahim Elmoataz, Jalal Fadili, Yvain Quéau, Julien Rabin, and Loïc Simon, editors, Scale Space and Variational Methods in Computer Vision, volume 12679, pages 307 319, Cham, 2021. Springer International Publishing. [11] Victor Chernozhukov, Han Hong, and Elie Tamer. Estimation and confidence regions for parameter sets in econometric models 1. Econometrica, 75(5):1243 1284, 2007. [12] Lenaic Chizat, Pierre Roussillon, Flavien Léger, François-Xavier Vialard, and Gabriel Peyré. Faster wasserstein distance estimation with the sinkhorn divergence. Advances in Neural Information Processing Systems, 33:2257 2269, 2020. [13] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In International conference on machine learning, pages 854 863. PMLR, 2017. [14] Guilherme Duarte, Noam Finkelstein, Dean Knox, Jonathan Mummolo, and Ilya Shpitser. An automated approach to causal inference in discrete settings, September 2021. [15] Jean Feydy. Geometric data analysis, beyond convolutions. Applied Mathematics, 2020. [16] Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouve, and Gabriel Peyré. Interpolating between optimal transport and mmd using sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2681 2690, 2019. [17] Nicolas Fournier and Arnaud Guillin. On the rate of convergence in wasserstein distance of the empirical measure. Probability theory and related fields, 162(3-4):707 738, 2015. [18] Dennis Frauen, Tobias Hatt, Valentyn Melnychuk, and Stefan Feuerriegel. Estimating average causal effects from patient trajectories. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 7586 7594, 2023. [19] Dennis Frauen, Fergus Imrie, Alicia Curth, Valentyn Melnychuk, Stefan Feuerriegel, and Mihaela van der Schaar. A Neural Framework for Generalized Causal Sensitivity Analysis, April 2024. [20] David A Freedman. Statistical models and causal inference: a dialogue with the social sciences. Cambridge University Press, 2010. [21] Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Efficient regression in metric spaces via approximate lipschitz extension. IEEE Transactions on Information Theory, 63(8):4838 4849, 2017. [22] Henry Gouk, Eibe Frank, Bernhard Pfahringer, and Michael J. Cree. Regularisation of neural networks by enforcing Lipschitz continuity. Machine Learning, 110(2):393 416, February 2021. [23] Henry Gouk, Eibe Frank, Bernhard Pfahringer, and Michael J Cree. Regularisation of neural networks by enforcing lipschitz continuity. Machine Learning, 110:393 416, 2021. [24] Florian Gunsilius. A path-sampling method to partially identify causal effects in instrumental variable models, June 2020. [25] Chris H. Hamilton and Andrew Rau-Chaplin. Compact hilbert indices for multi-dimensional data. In First International Conference on Complex, Intelligent and Software Intensive Systems (CISIS 07), pages 139 146. IEEE. [26] Sukjin Han and Shenshen Yang. A computational approach to identification of treatment effects for policy evaluation. Journal of Econometrics, 240(1):105680, 2024. [27] James J. Heckman and Edward J. Vytlacil. 
Instrumental variables, selection models, and tight bounds on the average treatment effect. In Wolfgang Franz, Michael Lechner, and Friedhelm Pfeiffer, editors, Econometric Evaluation of Labour Market Policies, volume 13, pages 1 15. Physica-Verlag HD, Heidelberg, 2001. [28] David Hilbert. Über die stetige abbildung einer linie auf ein flächenstück. Dritter Band: Analysis Grundlagen der Mathematik Physik Verschiedenes: Nebst Einer Lebensgeschichte, pages 1 2, 1935. [29] Yaowei Hu, Yongkai Wu, Lu Zhang, and Xintao Wu. A generative adversarial framework for bounding confounded causal effects. Proceedings of the AAAI Conference on Artificial Intelligence, 35(13):12104 12112, May 2021. [30] Yaowei Hu, Yongkai Wu, Lu Zhang, and Xintao Wu. A generative adversarial framework for bounding confounded causal effects. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 12104 12112, 2021. [31] Eric Jang, S. Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. Ar Xiv, abs/1611.01144, 2016. [32] Niki Kilbertus, Matt J. Kusner, and Ricardo Silva. A class of algorithms for general instrumental variable models, October 2020. [33] Murat Kocaoglu, Christopher Snyder, Alexandros G Dimakis, and Sriram Vishwanath. Causalgan: Learning causal implicit generative models with adversarial training. ar Xiv preprint ar Xiv:1709.02023, 2017. [34] Mohammad Lotfollahi, Anna Klimovskaia Susmelj, Carlo De Donno, Yuge Ji, Ignacio L Ibarra, F Alexander Wolf, Nafissa Yakubova, Fabian J Theis, and David Lopez-Paz. Compositional perturbation autoencoder for single-cell response modeling. Bio Rxiv, 2021. [35] Giulia Luise, Alessandro Rudi, Massimiliano Pontil, and Carlo Ciliberto. Differential Properties of Sinkhorn Approximation for Learning with Wasserstein Distance, May 2018. [36] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, March 2017. [37] Charles F Manski. Nonparametric bounds on treatment effects. The American Economic Review, 80(2):319 323, 1990. [38] Chengzhi Mao, Augustine Cha, Amogh Gupta, Hao Wang, Junfeng Yang, and Carl Vondrick. Generative interventions for causal learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3947 3956, 2021. [39] Valentyn Melnychuk, Dennis Frauen, and Stefan Feuerriegel. Partial Counterfactual Identification of Continuous Outcomes with a Curvature Sensitivity Model. Advances in Neural Information Processing Systems, 36:32020 32060, December 2023. [40] Caleb H. Miles, Phyllis Kanki, Seema Meloni, and Eric J. Tchetgen Tchetgen. On Partial Identification of the Pure Direct Effect, September 2015. [41] Kirtan Padh, Jakob Zeitler, David Watson, Matt Kusner, Ricardo Silva, and Niki Kilbertus. Stochastic causal programming for bounding treatment effects. In Conference on Causal Learning and Reasoning, pages 142 176. PMLR, 2023. [42] Patricia Pauli, Anne Koch, Julian Berberich, Paul Kohler, and Frank Allgöwer. Training robust neural networks using Lipschitz bounds. IEEE Control Systems Letters, 6:121 126, 2021. [43] Dmytro Perekrestenko, Léandre Eberhard, and Helmut Bölcskei. High-Dimensional Distribution Generation Through Deep Neural Networks, August 2022. [44] Roland R. Ramsahai. Causal bounds and observable constraints for non-deterministic models. Journal of Machine Learning Research, 13(29):829 848, 2012. [45] Amy Richardson, Michael G. Hudgens, Peter B. Gilbert, and Jason P. Fine. 
Nonparametric Bounds and Sensitivity Analysis of Treatment Effects. Statistical Science, 29(4), November 2014. [46] Michael C. Sachs, Gustav Jonzon, Arvid Sjölander, and Erin E. Gabriel. A general method for deriving tight symbolic bounds on causal effects, December 2021. [47] E. Shchepin. On hölder maps of cubes. Mathematical Notes, 87:757 767, 2010. [48] Anatoly V Skorokhod. Limit theorems for stochastic processes. Theory of Probability & Its Applications, 1(3):261 290, 1956. [49] Jin Tian and Judea Pearl. On the testable implications of causal models with hidden variables, December 2002. [50] Aladin Virmaux and Kevin Scaman. Lipschitz regularity of deep neural networks: Analysis and efficient estimation. Advances in Neural Information Processing Systems, 31, 2018. [51] Kevin Xia, Kai-Zhan Lee, Yoshua Bengio, and Elias Bareinboim. The causal-neural connection: Expressiveness, learnability, and inference, October 2022. [52] Kevin Xia, Yushu Pan, and Elias Bareinboim. Neural causal models for counterfactual identification and estimation. ar Xiv preprint ar Xiv:2210.00035, 2022. [53] D. Yarotsky. Optimal approximation of continuous functions by very deep relu networks. Ar Xiv, abs/1802.03620, 2018. [54] Junzhe Zhang and Elias Bareinboim. Bounding causal effects on continuous outcome. Proceedings of the AAAI Conference on Artificial Intelligence, 35(13):12207 12215, May 2021. [55] Junzhe Zhang and Elias Bareinboim. Non-parametric methods for partial identification of causal effects. Columbia Causal AI Laboratory Technical Report, 2021. [56] Junzhe Zhang, Jin Tian, and Elias Bareinboim. Partial counterfactual identification from observational and experimental data, October 2021. A Illustration of Notions in Section 2 V1 V3 V4 V5 Figure 2: An SCM example. V1 V3 V4 V5 Figure 3: The causal graph of this SCM. To further explain the notions in Section 2, we consider the following example. Let M be an SCM with the following structure equations. V1 = f1(V2, U1), V2 = f2(U1, U2), V3 = f3(V1, V2, U2, U4), (8) V4 = f4(V3.U3, U4), V5 = f5(V4, U3), where U1 and U2 are correlated and U3, U4 and (U1, U2) are independent. The causal model is shown in Figure 2 and its causal graph is shown in Figure 3. In this example, U V1 = {U1}, U V2 = {U1, U2}, U V3 = {U2, U4}, U V4 = {U3, U4}, U V5 = {U3}. Since U1, U2 are correlated, C1 = {V1, V2, V3} is one C2 component because all nodes in C1 are connected by bi-directed edges. Note that {V1, V2, V3, V4} is not a C2 component because V4 and V2 is not connected by any bi-directive edge. The rest of C2 components are C2 = {V4, V5}, C3 = {V3, V4}. Now, we consider the intervention V1 = t. Under this intervention, the structure equations can be obtained by setting V1 = t while keeping all other equations unchanged, i.e., V1(t) = t, V2(t) = f2(U1, U2), V3 = f3(V1, V2, U2, U4), V4 = f4(V3.U3, U4), V5(t) = f5(V4, U3). The figure of this model under intervention is shown in Figure 4. V1(t) V3(t) V4(t) V5(t) Figure 4: The SCM after intervention V1 = t. B Proof of approximation Theorems B.1 Illustration of how to construct canonical representations and neural architectures In this section, we illustrate how to construct canonical representations and neural architectures given a causal graph via a simple example. We consider the example in the previous section. The causal graph is shown in Figure 3. V1 V3 V4 V5 Figure 5: The canonical representation of Figure 3. V1 V3 V4 V5 Figure 6: The neural network architecture. 
Following the construction in Proposition 4, we use one latent variable for each C2 component. As we explain in Appendix A, this causal model has three C2 components, {V1, V2, V3}, {V3, V4}, {V4, V5}. In the canonical representation, exactly the latent variables enter their corresponding C2 component. The structure equation of this SCM is as follows. V1 = f1(V2, E1), V2 = f2(E1), V3 = f3(V1, V2, E1, E2), (9) V4 = f4(V3.E2), V5 = f5(V4, E3), If we set E1 = (U1, U2), E2 = U4, E3 = U3, (10) is equivalent to (8). Therefore, we can see from this example that the canonical representation does not lose any information about the SCM. Now, we show how to construct the NCM architecture from a canonical representation. As we mentioned in Section 3, we approximate the latent distribution by pushing forward uniform and Gumbel variables. The structure equation of the NCM is V1 = f θ1 1 (V2, gθ1 1 (Z1)), V2 = f θ2 2 (gθ1 1 (Z1)), V3 = f θ2 3 (V1, V2, gθ1 1 (Z1), gθ1 2 (Z2)), (10) V4 = f θ4 4 (V3.gθ1 2 (Z2)), V5 = f θ5 5 (V4, gθ3 2 (Z3)), where f θi i , gθj j are neural networks, Zi are join distribution of independent uniform and Gumbel variables, and gθj j has the special architecture described in Section 3.1 and Section 3.2. Figure 6 shows the architecture of the NCM. Each in-edge represents an input of a neural net. B.2 Proof of Theorem 1 The causal graph G does not specify how latent variables influence the observed variables. There could be many ways of recovering the latent variables from a graph G. Figure 7 shows an example where two different causal models have the same causal graph. The next proposition gives a canonical representation of a causal model. Figure 7: Example: two different SCMs with the same causal graph. Proposition 4. Suppose that Assumption 1 and Assumption 2 hold, given any SCM M with causal graph G and latent variables U, we can construct a canonical SCM ˆ M of the form (3) by merging the latent variables in U such that M and ˆ M produce the same intervention results. Besides, functions in ˆ M have the same smoothness as M. Proof of Proposition 4. Let G = {GVi : Vi is categorical} and the latent variables in the original model M be U = {U1, , Un U } G. By Assumption 1 and Assumption 2, U are independent and the structure equations of M have the form ( fi (Pa(Vi), U Vi) , Vi is continuous, arg maxk [ni] n g Vi k + log (fi (Pa(Vi), U Vi))k o , Vi is categorical, where g Vi k are i.i.d. standard Gumbel variables and U Vi {U1, , Un U } contains the latent variables that affect Vi. We regroup and merge the variables Ui to make the model have a canonical form while not changing the functions in the structure equations. Let D1, , Dn C V be the C2-component of M. For each Ui, we define the vertices that are affected by Ui as I(Ui) = {Vj : Ui U Vj}. We partition {U1, , Un U } into n C sets ˆU 1, , ˆU n C in the following way. For each Uk, Uk is in the set ˆU i if I(Uk) Di. If there are two components Di, Dj satisfy the condition, we put Uk into either of the sets ˆU i, ˆU j. Let ˆUi have the same distribution as the joint distribution of the random variables in set ˆU i. Let ˆ M M(G, F, ˆU) the SCM with structure equations fi Pa( ˆVi), ˆUk1, , ˆUkni , Vi is continuous, arg maxk [ni] n g Vi k + log fi Pa( ˆVi), ˆUk1, , ˆUkni o , Vi is categorical, , ˆU ki U Vi = , i [n V ]. Here, we slightly abuse the notation fi. fi ignores inputs from ˆU k1, , ˆU kni that are not in U Vi. Note that ˆU 1, , ˆU n C, G has the same distribution as U because we only merge some latent variables. 
In addition, the functions in the new model ˆ M has the same smoothness as the original model. We first verify that the causal graph of ˆ M is G. If there is a bi-directed arrow between nodes Vi, Vj in G, by the independence assumption, there must be one latent variable Uk U Vi U Vj. There exist one Dl such that I(Uk) Dl and ˆU l U Vi = , ˆU l U Vj = . Therefore, ˆUl will affect both Vi and Vj and there is a bi-directed arrow between node Vi, Vj in G ˆ M. Suppose that there is a bi-directed arrow between node Vi, Vj in G ˆ M, it means there exist a ˆ Uk such that ˆU k U Vi = , ˆU k U Vj = . Let Ul1 ˆU k U Vi, Ul2 ˆU k U Vj, then Vi I(Ul1) Dk, Vj I(Ul2) Dk. Since Dk is a C2-component, there exist a bi-directed arrow between node Vi, Vj in G. Therefore, G = G ˆ M. Finally, we verify that M and ˆ M produce the same intervention results. The intervention distribution can be viewed as the push-forward of the latent distribution, i.e., V = F(U), and the intervention operation only changes the function F. In our construction, we only merge the latent variables and leave the functions in structure equations being the same (except that they may ignore some coordinates in input). For any intervention T = t, suppose in M we have V (t) = Ft(U). Then, in ˆ M, we get ˆV (t) = Ft( ˆU). Since U and ˆU has the same distribution, the distribution of V (t), ˆV (t) are the same. Proof of Theorem 1. The structure equations of M have the form ( fi (Pa(Vi), U Vi) , Vi is continuous, arg maxk [ni] n g Vi k + log (fi (Pa(Vi), U Vi))k o , Vi is categorical, fi 1 = 1, where g Vi k are i.i.d. standard Gumbel variables and ˆ M M(G, ˆF, ˆU) has structure equations ˆfi Pa( ˆVi), ˆU Vi , Vi is continuous, arg maxk [ni] n ˆg Vi k + log ˆfi Pa( ˆVi), ˆU Vi o , Vi is categorical, ˆf1 1 = 1, where ˆg Vi k are i.i.d. standard Gumbel variables. Let the treatment variables set be T = {T1, , Tn T }. We may give all vertices and U 0 = {U1, , Un U }, an topology ordering (rearrange the subscript if necessary) U1, , Un U , T1, , Tn T , V1(t), , Vn V n T (t) such that for each directed edge (Vi(t), Vj(t)), vertex Vi(t) lies before vertex Vj(t) in the ordering, ensuring that all edges start from vertices that appear early in the order and end at vertices that appear later in the order. We put U 0 and T at the beginning because they are the root nodes of the intervened model. We denote µT U 0,T ,1:k (resp. ˆµT U 0,T ,1:k) to be the distribution of U 0, T , V1(t), , Vk(t) (resp. ˆU 0, T , ˆV1(t), , ˆVk(t)), µT Vk|U 0,T ,1:k 1 the distribution of Vk given U 0, T , V1(t), , Vk 1(t). Let S = Pn V i=1 fi ˆfi . Next, we prove that W µT U 0,T ,1:k, ˆµT ˆU 0,T ,1:k (L + 1)W µT U 0,T ,1:k 1, ˆµT ˆU 0,T ,1:k 1 + 2K2S. (11) By definition of Wasserstein-1 distance, W µT U 0,T ,1:k, ˆµT ˆU 0,T ,1:k = sup g Lip(1) Z g(u, t, v1, , vk)d µT U 0,T ,1:k ˆµT ˆU 0,T ,1:k = sup g Lip(1) Z g(u, t, v1, , vk) dµT Vk|U 0,T ,1:k 1 dµT U 0,T ,1:k 1 Z g(u, t, v1, , vk) dˆµT ˆVk| ˆU 0,T ,1:k 1 dˆµT ˆU 0,T ,1:k 1 = sup g Lip(1) Z g(u, t, v1, , vk) d µT Vk|U 0,T ,1:k 1 ˆµT ˆVk| ˆU 0,T ,1:k 1 dˆµT ˆU 0,T ,1:k 1 | {z } (1) + Z g(u, t, v1, , vk) dµT Vk|U 0,T ,1:k 1 d µT U 0,T ,1:k 1 ˆµT ˆU 0,T ,1:k 1 For (1), if Vk is a continuous variable, we have Z g(u, t, v1, , vk) d µT Vk|U 0,T ,1:k 1 ˆµT ˆVk| ˆU 0,T ,1:k 1 = g (u, t, v1, , vk 1, fk (pa(vk), u Vk)) g u, t, v1, , vk 1, ˆfk (pa(vk), u Vk) fk (pa(vk), u Vk) ˆfk (pa(vk), u Vk) S, (12) where we have used the Lipschitz property of g in the first inequality. 
If Vk is categorical, let ˆp (pa(vk), u Vk) = (ˆp1, , ˆpni) = ˆf (pa(vk), u Vk) and p (pa(vk), u Vk) = (p1, , pni) = f (pa(vk), u Vk) , we get Z g(u, t, v1, , vk) d µT Vk|U 0,T ,1:k 1 ˆµT ˆVk| b U 0,T ,1:k 1 k=1 (g(u, t, v1, , k) g(u, t, v1, , ni))(pk ˆpk) K k=1 |pk ˆpk| K2S, (13) where we use Vk K, ni K. For (2), if Vi is continuous, Z g(u, t, v1, , vk) dµT Vk|U 0,T ,1:k 1 d µT U 0,T ,1:k 1 ˆµT ˆU 0,T ,1:k 1 = Z g (u, t, v1, , vk 1, fk (pa(vk), u Vk)) d µT U 0,T ,1:k 1 ˆµT ˆU 0,T ,1:k 1 Since g, fk are Lipschitz continuous functions, g (u, t, v1, , vk 1, fk (pa(vk), u Vk)) is (L + 1)- Lipschitz continuous with respect to (u, t, v1, , vk 1). We have Z g (u, t, v1, , vk 1, fk (pa(vk), u Vk))d µT U 0,T ,1:k 1 ˆµT U 0,T ,1:k 1 (L + 1)W µT U 0,T ,1:k 1, ˆµT ˆU 0,T ,1:k 1 If Vi is categorical, Z g(u, t, x1, , xk) dµT Xk|U,T ,1:k 1 d µT U,T ,1:k 1 ˆµT ˆU,T ,1:k 1 Z g(u, t, x1, , k) d µT U,T ,1:k 1 ˆµT ˆU,T ,1:k 1 Since g, fk are L-Lipschitz continuous functions, g(u, t, v1, , vk 1, i) is L-Lipschitz continuous with respect to (u, t, v1, , vk 1). We have Z pkg(u, t, x1, , k) d µT U 0,T ,1:k 1 ˆµT U 0,T ,1:k 1 LW µT U 0,T ,1:k 1, ˆµT ˆU 0,T ,1:k 1 Combine (12)-(16), we prove Equation (11). By induction, one can easily get W µT U 0,T ,1:n V n T , ˆµT ˆU 0,T ,1:n V n T (L + 1)n V n T W µT U 0,T , ˆµT ˆU 0,T + 2K2S/L 2K2S/L = (L + 1)n V n T 1 L 2K2S + (L + 1)n V n T W µT U 0,T , ˆµT ˆU 0,T Let CG(L, K) = 2K2 max{ (L+1)n V n T 1 L , (L+1)n V n T } and notice that W µT U 0,T , ˆµT ˆU 0,T W P M(U), P ˆ M( ˆU) because interventions T = t are the same, we get W P M(V (t)), P ˆ M(V (t)) W µT U 0,T ,1:n V n T , ˆµT ˆU 0,T ,1:n V n T CG(L, K) S + W P M(U), P ˆ M( ˆU) . (17) B.3 Proof of results in Section 3.1 Proof of Theorem 2. Recall that the neural network consists of three parts ˆg = ˆgτ 3 ˆg2 ˆg1 and ˆg2, ˆg1 have a separable form, dealing with each coordinate of the input individually. As mentioned before, each coordinate approximates the distribution of one connected component of the support. We first construct ˆg1 and ˆg2, then use Gumbel-Softmax layer to combine each component together. To construct the first two parts, we only need to consider each coordinate individually. For the i-th component Ci, let µi = PCi/P(Ci), where PCi is measure P restricted to component Ci. Under Assumption 4, there exists a Lipschitz map gi 2 H([0, 1]d C i , Ci) FL([0, 1]d C i , Ci). By [43, Theorem VIII.1], there exists quantized Re Lu network ˆgi 1 of width W1 and depth Θ(d C i ) (let s = 1 n(n 1) in [43, Theorem VIII.1]), such that # µi, P(ˆgi 1(Ui)) O W 1/d C i 1 , where Ui are i.i.d. uniform random variables on [0, 1]. By [53, Theorem 2], there exist a deep Re Lu network ˆgi,j 2 of width Θ d C i and depth L2 such that j ˆgi,j 2 O L 2/d C i 2 , Let ˆgi 2(x) = ˆgi,1 2 (x), , ˆgi,d 2 (x) , we get gi 2 ˆgi 2 O L 2/d C i 2 . Thus, the width of ˆgi 2 is Θ d d C i . Let ˆgi(x) = ˆg1 i (x1), , ˆg NC i (x NC) , i = 1, 2, xk Rdi, d1 = 1, d2 = d, k = 1, , NC, where NC is the number of connected components. By Lemma 3, we get W µi, P(ˆgi 2 ˆgi 1(Ui)) = W gi 2 # gi 2 1 # µi, ˆgi 2 # P(ˆgi 1(Ui)) LW gi 2 1 # µi, P(ˆgi 1(Ui)) + gi 2 ˆgi 2 = O W 1/d C i 1 + L 2/d C i 2 . Next, let pi = P(Ci) and the distribution of ˆgk 2 ˆgk 1(Uk) be ˆµk, then we have P = PNC k=1 pkµk. By Lemma 4, we have k=1 pk W (µi, ˆµk) = O W 1/ max{d C i } 1 + L 2/ max{d C i } 2 . Finally, we analyze the error caused by the Gumbel-Softmax layer. Note that as temperature parameter τ 0, exp((log pi + Gi)/τ) PNC k=1 exp((log pk + Gk)/τ) a.s. 
One-hot(arg max i {Gi + log pi}) where One-hot(k) is NC-dimensional vector with the k-th coordinate equals to 1 and the remaining coordinates are 0 and Gi Gumbel(0, 1) are i.i.d. Gumbel distribution. Let ντ be the distribution of wτ. By Lemma 6, for 0 < τ < 1, we have W ντ, ν0 O(τ τ log τ). By Lemma 5, W ντ, P( ˆgi 2 ˆgi 1 (Ui)) , ν0, P( ˆgi 2 ˆgi 1 (Ui)) O (τ τ log τ) . (18) ˆgτ 3(x1, , x NC) = k=1 wτ k xk, xk Rd, wτ = wτ 1, , wτ NC . We denote g0 3 = limτ 0 gτ 3. By Assumption 4, the support is bounded. Thus, gi 2 and ˆgi 2 are bounded, i.e., ˆgi 2 (x) K. Since functions h(w, X) = w TX is Lipschitz continuous in the region w 1 = 1, X K, we get W P(ˆg0 3 ˆg2 ˆg1(Uk)), P(ˆgτ 3 ˆg2 ˆg1(Uk)) O (τ τ log τ) (19) from (18). Putting things together, we get W (P, P(ˆgτ 3 ˆg2 ˆg1(Uk))) W k=1 pkˆµk, P(ˆgτ 3 ˆg2 ˆg1(Uk)) + W P(ˆg0 3 ˆg2 ˆg1(Uk)), P(ˆgτ 3 ˆg2 ˆg1(Uk)) O τ τ log τ + W 1/ max{d C i } 1 + L 2/ max{d C i } 2 , where we use (19). B.4 Proof of results in Section 3.2 Gumbel-Softmax Space-filling Curve Lipschitz Homeomorphism Figure 8: Architecture of the deep neural network for 4 dimensional output. The first (yellow) part approximates the distribution on different connected components using deep Re Lu Networks. The remaining two parts are similar to the wide neural network in Figure 1a. Before we prove Proposition 1, we need to introduce some important notions. Let {Cn} be a sequence of partitions of the unit cube [0, 1]d. We say {Cn} has property (*) if the following properties are satisfied. 1. C0 = {C 1}, C 1 = [0, 1]d 2. Cn+1 = {Ci1,i2 ,in+1}ij=0, ,2d 1 is a refinement of Cn = {Ci1,i2 ,in}ij=0, ,2d 1, i.e., Ci1,i2 ,in = S2d 1 in+1=0 Ci1,i2 ,in+1. Besides, Cn is obtained by cutting the unit cube [0, 1]d by planes that are paralleled to the coordinate planes. The following lemma is important in the construction of Hilbert curve. Lemma 1 (Algorithm 3 in [25]). Given a sequence of partitions of the unit cube {Cn} that has property (*), there exist a ordering for each partition Cn, i.e., On : 1, , 2nd i1i2 in : ij = 0, , 2d 1 , such that the following two conclusions hold. 1. If (j 1) 2d < i j 2d, the first n digits of On+1(i) are the same as On(j). 2. Adjacent cubes in the ordering intersect, i.e., COn(i) COn(i+1) = . Furthermore, COn(i) COn(i+1) is a d 1 dimensional cube. Proof of Proposition 1. Without loss of generality, we can assume that P is supported on [0, 1]d, vanishes at the boundary and P(B) C1λ(B) for some constant C1 > 0 and all measurable sets B [0, 1]d. Suppose we have proven the conclusion for this case. By Assumption 5, there exists f H([0, 1]d, K) FL([0, 1]d, K) such that f 1 # P satisfies these conditions. There exists continuous function γ such that f 1 # P = γ#λ, which implies P = (f γ)#λ. In particular, if γ is 1/d-Holder continuous, f γ is also 1/d-Holder continuous because f is Lipschitz continuous. When d = 0, the constant map satisfies the requirement. We focus on the case d 1 in the following. The proof is to modify the construction of the Hilbert space-filling curve. By changing the velocity of the curve, the push-forward measure of the Hilbert space-filling curve can simulate a great number of distributions. We use λ to represent the Lebesgue measure [0, 1]. Proof Sketch: The construction of f is inspired by the famous Hilbert space-filling curve [28]. To illustrate the idea, let us first assume that P is absolutely continuous with respect to the Lebesgue measure λ and consider d = 2. 
In the k-th step, we divide the unit cube into 22k evenly closed cubes Cn = {C1,n, , C22k,n} such that Cn is a refinement of Cn 1. The construction of a standard Hilbert curve is to find a sequence of curves γn that go through all the cubes in Cn. The curve γn has one special property. If γn 1(t) Ck,n 1, then γm(t) Ck,n 1, m n. For example, in Figure 9, the points on the curve in the lower-left cubes will stay in the lower-left cubes. Note that (γn)#λ(Ck,n) = λ γ 1 n (Ck,n) is the time curve γn stays in cubes Ck,n. The idea is to change the speed of the curve so that P and (γn)#λ agree on cubes in Cn, i.e., P(Ck,n) = λ γ 1 n (Ck,n) . Since we assume that P is absolutely continuous, P( Ck,n) = 0 and we don t need to worry about how to divide the mass on the boundary. For example, let the green cubes in the Figure 9 to be C0 and suppose that γ1 starts from C0. We will change the speed so that γ1 spends P(C0) time in this region. In the next step, we divide C0 into four colored cubes C1, , C4 on the right. We change the speed again to let the time spent in each cube equal to P(Ci). Note that this construction preserves the aforementioned property, i.e., for t [0, P(C0)], γn(t) C0, n 1. As n , it can be proven that γn converges uniformly to a curve γ. We can also prove that P and γ#λ agree on n=1Cn. Given that n=1Cn generate the standard Borel algebra on [0, 1]2, we can conclude that P = γ#λ. In the general case, P may not be absolutely continuous. As a result, the boundary may not be P-measure zero. We will need to perturb the cutting planes to ensure their boundaries are measured zero. Figure 9: The construction of Hilbert curve. Preparation. For a n-dimensional cube C = Qn i=1[ai, bi], the diameter of C is diam(C) = p Pn i=1(bi ai)2. We also define the function R that measures the ratio between maximum edge and minimum edge. For a collection of cubes C = {C1, , Cm}, Ck = Qn i=1 ak i , bk i , R(C) is defined to be R(C) = maxi,k |ak i bk i | mini,k |ak i bk i | . R(C) measures the shape of the cubes in a collection. We need to control R(C) in the construction to obtain Holder continuity in Step 2. Step 1: Define partition. First, we construct a sequence of partitions that has the property (*) recursively. Let C0 = [0, 1]d . Suppose we have defined close cubes collection Cn = {Ci1,i2 ,in}ij=0, ,2d 1, , such that 1. Cn is obtained by cutting the unit cube [0, 1]d by planes that are paralleled to the coordinate planes. 2. Ci1,i2 ,in 1 = S in Ci1,i2 ,in and diam(Ci1,i2 ,in 1) 2/3 diam(Ci1,i2 ,in). 3. P( Ci1,i2 ,in) = 0 and R(Cn+1) (1 + 2 n) R(Cn). Next, we construct Cn+1 from Cn that preserves these properties. We refine the division Cn = {Ci1,i2 ,in}ij=0, ,2d 1 by union of hyperplane Pi,r = {y : yi = r} to get {Ci1,i2 ,in+1} such that Ci1,i2 ,in = S in+1 Ci1,i2 ,in+1 and diam(Ci1,i2 ,in 1) 2/3 diam(Ci1,i2 ,in). By property 1, Cn is obtained by dividing [0, 1]d using P1,r1 0, , P1,r1 2n 1, , Pi,ri j, , Pd,rd 2n, ri 0 = 0, ri 2n = 1, we can further cut the unit cube using hyperplane {Pi,(ri j+ri j+1)/2)}i=1, d,j=1, 2n 1. However, this construction may not satisfy property 3, P( Ci1,i2 ,in+1) = 0. We need to perturb each hyperplane to ensure a measure-zero boundary. Since Zi = r R : P x [0, 1]d, [x]i = r = 0 is at most countable, we can perturb the hyperplanes a little bit, i.e., {Pi,(ri j+ri j+1)/2+ϵi,j)}i=1, d,j=1, 2n 1 to ensure property 3. 
In this way, we can choose |ϵi,j| to be sufficiently small such that diam(Ci1,i2 ,in+1) 2/3 diam(Ci1,i2 ,in), and maxi,j ri j+1 ri j /2 + |ϵi,j| mini,j ri j+1 ri j /2 |ϵi,j| maxi,j ri j+1 ri j mini,j ri j+1 ri j 1 + 2 n . Therefore, R(Cn+1) (1 + 2 n) R(Cn) and it is easy to see that properties 1,2,3 are satisfied. Step 2: Construct Hilbert Curve. By Lemma 1, there exist an sequence of ordering {On} of {Cn} that satisfies The first n digits of On+1(i) with On(j), (j 1) 2d < i j 2d. COn+1(i) COn+1(i+1) are d 1 dimensional cubes. Let I0 = {[0, 1]}, we define In recursively. Suppose we have define interval collection In = {{Ii1,i2 ,in}ij=0, ,2d 1} such that Ii1,i2 ,in 1 = S in Ii1,i2 ,in and P(Ci1,i2 ,in) = |Ii1,i2 ,in|. Since P( Ci1,i2 ,in+1) = 0, we have P(Ci1,i2 ,in) = P in+1 P(Ci1,i2 ,in+1). With the ordering, we divide each IOn(j), 0 j 2nd 1 into 2d closed sub-interval IOn+1(i), j2d i < (j + 1)2d such that IOn+1(i) = [Pi 1 k=1 pk, Pi k=1 pk] , where pi = P(COn+1(i)). This construction satisfies |IOn+1(i)| = P(COn+1(i)). Note that by condition 3 in step 1, P vanishes at the boundary of the cubes. Therefore, |IOn(i)| = P(COn(i)) = j=(i 1)2d+1 P(COn+1(j)) = j=(i 1)2d+1 |IOn+1(j)|. (20) Now, we can construct a piecewise linear function γn+1(t) such that (γn+1)#λ(COn+1(i)) = P(COn+1(i)), (γn+1)#λ( COn+1(i)) = 0, i. (21) The idea is to construct a piecewise linear curve going through all cubes in Cn+1 exactly once and modify its speed. Take v0 Center(COn+1(1)), vi Center(COn+1(i) COn+1(i+1)), v2d(n+1) Center(COn+1(2d(n+1))), where Center(Qk i=1[ai, bi]) = ((ai + bi)/2)i=1, ,k is the center of a cube. vi is well-defined since COn+1(i) COn+1(i+1) are cubes of dimension d 1 by Lemma 1. One possible choice of γn+1(t) is γn+1(t) = v0 + 1 pi (vi vi 1) The first two curves γ1, γ2 are shown in Figure 9. It is straightforward to verify that γn+1(Pi k=1 pk) = vi and γn+1(IOn+1(i)) = γn+1([ k=1 pk]) COn+1(i). In fact, since vi 1, vi are on different surfaces of the cube COn+1(i) , the line segment between vi 1, vi lies inside the interior of COn+1(i) and γn+1 goes through all the cubes COn+1(i) exactly once. We have γ 1 n+1(Int(COn+1(i))) [Pi 1 k=1 pk, Pi k=1 pk] and (γn+1)#λ(COn+1(i)) = P(COn+1(i)), where Int( ) is the interior of a set. Besides, we also have (γn+1)#λ( COn+1(i)) λ(γ 1 n+1({vj}j=1 2d(n+1))) = 0. Finally, for any k1 k2 n, t [0, 1], by property 2 in step 1, (20) and (22), γk1(t), γk2(t) are in one cube Ci1,i2 ,in. and diam(Ci1,i2 ,in) 2 d. Thus, |γk1(t) γk2(t)| 2 d which implies {γn(t)} converges uniformly to one continuous function γ(t). Step 3: Verify Conclusion P = γ#λ. By W((γk)#λ, γ#λ) γk γ 0, (γk)#λ converges weakly to γ#λ. By construction, P( Ci1,i2 ,in) = 0 and (γn)#λ(Ci1,i2 ,in) = P(Ci1,i2 ,in). Therefore, by condition 2 in step 1, for any k n, (γk)#λ(Ci1,i2 ,in) = (γk)#λ( [ in+1, ,ik Ci1,i2 ,ik) in+1, ,ik (γk)#λ(Ci1,i2 ,ik) in+1, ,ik λ γ 1 k (Ci1,i2 ,ik) in+1, ,ik P(Ci1,i2 ,ik) = P(Ci1,i2 ,in), (23) where we use P( Ci1,i2 ,in) = 0 and (γk)#λ( Ci1,i2 ,in) = 0 in the second and the last equation. We claim that γ#λ( Ci1,i2 ,in) = 0. For k > n, we define Ci1,i2 ,ik Ci1,i2 ,in = Ci1,i2 ,ik. Let k > k1 n and Bϵ n = {x : d(x, Ci1,i2 ,in) < ϵ}, we have (γk)#λ (Int(Bk1)) (γk)#λ(Bk1) = P(Bk1) P B(2/3)k1 d n , k > k1, where we use (23) and the fact that (γk)#λ vanishes on Ci1,i2 ,ik in the first equality and diam(Ci1,i2 ,ik1) 2 d in the last inequality. 
Let k , by Portmanteau Theorem, we get γ#λ( Ci1,i2 ,in) γ#λ (Int(Bk1)) lim inf k (γk)#λ(Bk1) lim inf k P B(2/3)k1 d n = P B(2/3)k1 Since k1 is arbitrary, let k1 , γ#λ( Ci1,i2 ,in) lim k1 P B(2/3)k1 d n = P( Ci1,i2 ,in) = 0 and we conclude that γ#λ( Ci1,i2 ,in) = 0. By Portmanteau Theorem, let k in (23), we obtain γ#λ(Ci1,i2 ,in) = P(Ci1,i2 ,in). Let C = S i=1 Ci. Notice that for any open set U [0, 1]d, U = S C U,C C C, which means the σ-algebra generated by C contains all Borel sets. Besides, C is a π-system. Hence, λ γ 1(U) = P(U) for all Lebesgue measurable set U. Step 4: Holder Continuity of γ. We verify the condition of [47, Theorem 1]. In and Cn in the above construction correspond to developments αk, βk in [47, Theorem 1]. Define fk(IOk(i)) = COk(i), i = 1, , 2kd as a map from In to Cn. 1. By construction, if I1 Ik, I2 Ik 1, I1 I2, fk(I1) fk 1(I2). 2. If I1, I2 Ik and I1 I2 = , I1, I2 are adjacent. By Lemma 1, fk(I1), fk(I2) share a (d 1)-dimensional boundary. Therefore, fk(I1) fk(I2) = . 3. We have |Ck 1| 2d|Ck|. 4. By property 3, 1 + 2 k < . Let A = Q k=1 1 + 2 k . We have for any C Cn, by Assumption 5 diamd(C) R(Cn)λ(C) AC 1 1 P(C) AC 1 1 max I I (diam(I)) , where C1 is the constant in Assumption 5. 5. Since Ik consists of one-dimensional intervals, we have diam(Ik) = gap(Ik). Therefore, γ is 1/d-Holder continuous. Proof of Theorem 3. The proof is similar to the proof of Theorem 2. The difference is that in the first part, we use a deep Re LU network to approximate the Holder continuous curve γ instead of wide Re LU network. Let µi = PCi/P(Ci). By Assumption 4 and Assumption 5, for each connected component Ci of the support, there exists Lipschitz maps gi 2 H([0, 1]d C i , Ci) FL([0, 1]d C i , Ci) such that (gi 2) 1 # µi(B) Cgλ(B)/P(Ci), for any measurable set B [0, 1]d C i . By Proposition 1, there exist a 1/d C i -Holder continuous curve γi : [0, 1] [0, 1]d C i such that (γi)#λ = (gi 2) 1 # µi. By [53, Theorem 2] (take ω(x) = C x 1/d where C is the Holder continuity constant factor), there exists deep Re Lu network ˆgi,j 1 of width 12 and depth L1 such that ˆgi,j 1 (γi)j O L 2/d C i 1 , where (γi)j is the j-coordinate of γi. Let ˆgi 1(x) = ˆgi,1 1 (x), , ˆgi,d 1 (x) , we get ˆgi 1 γi O L 2/d C i 1 . Thus, the width of ˆgi 1 is Θ d C i . By Lemma 3 W gi 2 1 # µi, P(ˆgi 1(Ui)) = W (γi)#Ui, P(ˆgi 1(Ui)) ˆgi 1 γi O L 2/d C i 1 . The rest are the same as proof of Theorem 2. Proof of Corollary 1. Suppose that structure equations of M are ( fi (Pa(Vi), U Vi) , Vi is continuous, arg maxk [ni] n g Vi k + log (fi (Pa(Vi), U Vi))k o , Vi is categorical, fi 1 = 1 , fi FL. Let din i be the input dimension of fi, dout i be the output dimension, arg min ˆ fi NN din i ,dout i (dout i (2din i +10),L0) ˆfi fi , Vi continuous, max{0, arg min ˆ fi NN din i ,dout i (dout i (2din i +10),L0) ˆfi fi }, Vi categorical. We truncate the neural network at 0 for categorical variables because the propensity functions are required to be non-negative. This truncate operation will not influence the approximation error since fi are non-negative. According to Assumption 3, fi are Lipschitz continuous. By [53, Theorem 2], we have, if Vi is continuous, ˆfi fi O(L 1/din i 0 ) O(L 1/din max 0 ). If Vi is categorical, let S = ˆf (pa(vk), u Vk) f (pa(vk), u Vk) , p = f (pa(vk), u Vk) , ˆp = ˆf (pa(vk), u Vk) / ˆf 1. If S > 1/(2K), p ˆp 2KS. If S 1/(2K), ˆf 1 ˆf (pa(vk), u Vk) f (pa(vk), u Vk) + 1 ˆf 1 | ˆf 1 1| f ˆf 1 S (ni + 1) 1 ni S S 2(K + 1)S where we use Assumption 3 that ni K and S 1/(2K). 
Thus, $\big\| f_i - \hat f_i / \|\hat f_i\|_1 \big\|_\infty \le O(\|f_i - \hat f_i\|_\infty) \le O(L_0^{-1/d^{\mathrm{in}}_{\max}})$. By Theorem 3, there exists a neural network $\hat g_j$ with the architecture in Theorem 3 such that
$$W\big(P(U_j), P(\hat g_j(Z_j))\big) \le O\Big(L_1^{-2/d^U_j} + L_2^{-2/d^U_j} + (\tau - \tau\log\tau)\Big),$$
where $Z_j \sim U([0,1]^{N_{C,j}})$ i.i.d., $N_{C,j}$ is the number of connected components of the support of $U_j$, $d^U_j$ is the dimension of the latent variable $U_j$, and $L_1, L_2, \tau$ are hyperparameters of the neural network defined in Theorem 3. Let $\hat U_j = \hat g_j(Z_j)$. By Lemma 5,
$$W\big(P(U_1,\dots,U_{n_U}), P(\hat U_1,\dots,\hat U_{n_U})\big) \le \sum_{j=1}^{n_U} W\big(P(U_j), P(\hat g_j(Z_j))\big) \le O\Big(\sum_{j=1}^{n_U}\big( L_1^{-2/d^U_j} + L_2^{-2/d^U_j} + (\tau - \tau\log\tau)\big)\Big).$$
Let $\hat M$ be the causal model with the following structure equations
$$\hat V_i = \begin{cases} \hat f_i\big(\mathrm{Pa}(\hat V_i), (\hat g_j(Z_{C_j}))_{U_{C_j} \cap U^{V_i} \neq \emptyset}\big), & V_i \text{ is continuous},\\ \arg\max_{k\in[n_i]} \big\{ g_k + \log \hat f_i\big(\mathrm{Pa}(\hat V_i), (\hat g_j(Z_{C_j}))_{U_{C_j} \cap U^{V_i} \neq \emptyset}\big)_k \big\}, & V_i \text{ is categorical}. \end{cases}$$
By Theorem 1, we get
$$W\big(P^{M}(V(t)), P^{\hat M}(V(t))\big) \le O\Big( \sum_{V_i \text{ continuous}} \|f_i - \hat f_i\|_\infty + \sum_{V_i \text{ categorical}} \big\|f_i - \hat f_i/\|\hat f_i\|_1\big\|_\infty + W\big(P(U_1,\dots,U_{n_U}), P(\hat U_1,\dots,\hat U_{n_U})\big) \Big) \le O\Big(L_0^{-2/d^{\mathrm{in}}_{\max}} + L_1^{-2/d^U_{\max}} + L_2^{-2/d^U_{\max}} + (\tau - \tau\log\tau)\Big),$$
for any intervention $T = t$.

C Proof of Consistency

C.1 Inconsistent Counterexample (Proposition 2)

Proposition 5. There exist a constant $c > 0$ and an SCM $M^*$ satisfying Assumptions 1-5 with a backdoor causal graph such that there is no unobserved confounding in $M^*$, and for any $\epsilon > 0$ there exists an SCM $M_\epsilon$ with the same causal graph such that $W(P^{M^*}(V), P^{M_\epsilon}(V)) \le \epsilon$ and $|\mathrm{ATE}_{M^*} - \mathrm{ATE}_{M_\epsilon}| > c$.

Proof. Let $V = \{X, T, Y\}$ be the covariate, treatment and outcome respectively; $T$ is a binary variable. Let the structure equations of the causal model $M_\delta$ be
$$X = \begin{cases} -1 & \text{w.p. } 1/2,\\ 0 & \text{w.p. } 1/4,\\ \delta & \text{w.p. } 1/4, \end{cases} \qquad P(T = 1 \mid X = x) = \begin{cases} 1/2, & x = -1 \text{ or } 0,\\ 3/4, & x = \delta, \end{cases} \qquad Y = \begin{cases} T + U_y, & X < 0,\\ X/\delta + U_y, & X \ge 0, \end{cases}$$
where all latent variables are independent and $U_y$ is mean-zero noise. The observational distribution of $M_\delta$ is
$$p_{X,T,Y}(0, 1, y) = p_{U_y}(y)/8,\quad p_{X,T,Y}(0, 0, y) = p_{U_y}(y)/8,\quad p_{X,T,Y}(\delta, 1, y) = 3p_{U_y}(y-1)/16,\quad p_{X,T,Y}(\delta, 0, y) = p_{U_y}(y-1)/16,\quad p_{X,T,Y}(-1, 1, y) = p_{U_y}(y-1)/4,\quad p_{X,T,Y}(-1, 0, y) = p_{U_y}(y)/4.$$
As $\delta \to 0$, the distribution of $M_\delta$ converges in distribution to that of $M^*$, whose structure equations are
$$P(U_1 = 1) = 3/5,\; P(U_1 = 0) = 2/5,\qquad P(U_2 = 1) = 1/3,\; P(U_2 = 0) = 2/3,$$
$$X = \begin{cases} -1 & \text{w.p. } 1/2,\\ 0 & \text{w.p. } 1/2, \end{cases}\qquad P(T = 1 \mid X = x) = \begin{cases} 1/2, & x = -1,\\ 5/8, & x = 0, \end{cases}\qquad Y = \begin{cases} T + U_y, & X = -1,\\ U_1 + U_y, & X = 0,\, T = 1,\\ U_2 + U_y, & X = 0,\, T = 0, \end{cases}$$
where $U_1, U_2, U_y$ are independent. The observational distribution of $M^*$ is
$$p_{X,T,Y}(0, 1, y) = p_{U_y}(y)/8 + 3p_{U_y}(y-1)/16,\quad p_{X,T,Y}(0, 0, y) = p_{U_y}(y)/8 + p_{U_y}(y-1)/16,\quad p_{X,T,Y}(-1, 1, y) = p_{U_y}(y-1)/4,\quad p_{X,T,Y}(-1, 0, y) = p_{U_y}(y)/4.$$
It is easy to see that $M^*$ satisfies Assumptions 1-5. Some calculation gives $W(P^{M^*}(V), P^{M_\delta}(V)) \le \delta/2$, and it is easy to calculate the ATEs of the two models: $\mathrm{ATE}_{M^*} = 19/30$ and $\mathrm{ATE}_{M_\delta} = 1/2$.

This example shows that even as the Wasserstein distance between $P^{M^*}(V)$ and $P^{M_\delta}(V)$ converges to zero, the ATE of $M_\delta$ stays bounded away from that of $M^*$. As mentioned in the main body, this problem is caused by the violation of the Lipschitz continuity assumption for $M_\delta$: in $M_\delta$, $E[Y \mid X, T] = X/\delta$ on $\{X \ge 0\}$, so the Lipschitz constant explodes as $\delta \to 0$. This problem may also arise in (6). If the distribution ball $B_n = \{\hat M : W(P^{\hat M}, P_n^{M^*}) \le \alpha_n\}$ contains a small ball around the true distribution, $S_\epsilon = \{\hat M : W(P^{\hat M}, P^{M^*}) \le \epsilon\} \subseteq B_n$, and the NCM is expressive enough to approximate all the SCMs in $S_\epsilon$, then this example tells us that the resulting interval may not shrink to a single point as the sample size increases to infinity, even if the model is identifiable.

C.2 Proof of Theorem 4

To prove Theorem 4, we will need the following proposition.

Proposition 6.
Given a metric space (M, d) and sets Θ1 Θ2 Θn Θ M, where Θ = n=1Θn , positive sequences {ϵn}n N, {δn}n N such that limn ϵn = limn τn = limn δn = 0 and continuous functions f, gn : Θ R, suppose that 1. Θ is compact. 2. gn satisfies gn(θ1) gn(θ2) Lgd(θ1, θ2) + τn, θ1, θ2 Θ (24) and supθ Θ gn(θ) g(θ) δn, gn 0. 3. There exists a compact subset Θ Θ such that for any θ0 Θ , there exists θ Θn such that d(θ, θ0) ϵn. Consider the following optimization problems: min θ Θnf(θ), s.t. gn(θ) Lgϵn + δn, (25) min θ Θ f(θ), s.t. g(θ) = 0. (26) min θ Θ f(θ), s.t. g(θ) = 0. (27) We assume that the feasible region of (26) and (27) are nonempty. Let f n, f , f be the minimal of (25), (26) and (27) respectively. Then, [lim infk f n, lim supk f n] [f , f ]. Proof of Proposition 6. By compactness of Θ and Θ , the minimizers of (26) and (27) are achievable insider these two sets. Let θ Θ , θ Θ be the minimizer of (26) and (27). We first prove that lim supn f n f . Note that by (24), the limiting function g is Lg-Lipschitz. By condition 3, there exist θn Θn such that d θn, θ ϵn. Note that gn(θn) |g(θn) gn(θn)| + |g(θn) g (θ ) | + |g (θ ) | δn + Lgd (θn, θ ) δn + Lgϵn. Therefore, θn is a feasible point of (25). We have lim supn f n lim supn f(θn) = Next, we argue that lim infn f n f . If this equation does not hold, there exists ϵ > 0 and subsequence f nk k N such that f nk < f ϵ, k N. By compactness of Θ , for each k, there exists a subsequence of θ nk Θ such that θ nk satisfies constraint of (25) and f nk = f θ nk . By compactness, θ nk k N has a converging subseqence. Without loss of generality, we may assume k N converges to ˆθ Θ . Since gk converge uniformly to g, lim k gk θ nk lim k |gk θ nk g θ nk | + g θ nk = g ˆθ lim k αk = 0. ˆθ is a feasible point of Equation (26). Since f is continuous on Θ, f ˆθ = lim k f θ nk f ϵ. which leads to contradiction. Proof of Theorem 4. We begin by defining a proper metric space. By Assumption 3, all random variables are bounded. Suppose that maxi,j{ Vi , Uj } K. A canonical causal model M M(G, F, U) is decided by θM = (f1, , fn V , P(U1), , P(Un U )), where fi are functions in the structure equations (3) and Uj are uniform parts of the latent variables. We denote Mθ to be the SCM represented by θ, the underlying SCM be Mθ and P θ to be the distribution of Mθ. We consider the space M = FV1 FVn V P [ K, K]d U 1 P [ K, K]d U n U , FK L [ K, K]d V i,in, [ K, K]d V i,out , Vi continuous, {f : f 1 = 1, f FK L [ K, K]d V i,in, [ K, K]d V i,out } Vi categorical, d V i,in, d V i,out are the input and output dimensions of fi, FK L = {f : f K, f Lipschitz continuous} and P(K) is the probability space on K. For θ = (f1, , fn V , P(U1), , P(Un U )) , θ = f 1, , f n V , P(U 1), , P(U n U ) , we define a metric on M k=1 fk f k + k=1 W(P(Uk), P(U k)). Theorem 1 states that the Wasserstein distance between two causal models is Lipschitz with respect to metric d. Now, we define Θn. Let Pn = n P(U) : U is the latent distribution of ˆ M NCMG(F0,n, F1,n, F2,n, τn) o . In other words, Pn contains all the push-forward measures of the uniform distribution by neural networks. We denote Θn = ˆFV1,n ˆFVn V ,n Pn, where ˆFVi,n = F0,n, Vi continuous, {f/ f 1 : f F0,n} Vi categorical, and F0,n is defined in Theorem 4. Note that by construction, latent variables are independent and Pn can be decomposed into direct produce Pn,1 Pn,n U . Let Θ = n=1Θn, gn(θ) = Sλn P θ n (V ), P θ mn(V ) , f(θ) = Et µT EMθ[F(V1(t), , Vn V (t))] and Θ = {θ }. 
Note that f(θ) is a continuous function since by Theorem 1 and the fact that F is Lispchitz continuous, |f(θ) f(θ )| Et µT |W(P θ(V (t)), P θ (V (t))| Et µT [O(d(θ, θ ))] = O(d(θ, θ )). Now, we verify the conditions in Proposition 6. 1. By Arzelà Ascoli theorem, FK L [ K, K]d V i,in, [ K, K]d V i,out are precompact set with respect to the infinity norm in space of continuous functions. And thus FVi are compact sets. Since measures in P [ K, K]d U j are tight , P [ K, K]d U j are compact with respect to weak topology by Prokhorov s theorem. And the Wasserstein distance metricizes the weak topology. Thus, P [ K, K]d U j are compact and the space (M, d) is compact space. Closed set Θ M is also compact. 2. By definition of gn, |gn(θ) gn(θ )| = |Sλn P θ n (V ), P θ mn(V ) Sλn P θ n (V ), P θ mn(V ) | |W P θ n (V ), P θ mn(V ) W P θ n (V ), P θ mn(V ) | + 4(log(mn) + 1)λn W P θ mn(V ), P θ mn(V ) + 4(log(mn) + 1)λn O(d(θ, θ )) + 4(1 + log(nmn))λn, where we use Lemma 7 in the second inequality and triangle inequality and Theorem 1 in the third inequality. By the condition in Theorem 4, the second term (1 + log(nmn))λn 0 as n . Therefore, (24) is verified with τn = 4(1 + log(nmn))λn. We then verify the uniform convergence of gn. By triangle inequality, |gn(θ) g(θ)| = |Sλn P θ n (V ), P θ mn(V ) W P θ (V ), P θ(V ) | |Sλn P θ n (V ), P θ mn(V ) W P θ n (V ), P θ mn(V ) | + |W P θ n (V ), P θ mn(V ) W P θ (V ), P θ(V ) |. By Lemma 7, |Sλn P θ n (V ), P θ mn(V ) W P θ n (V ), P θ mn(V ) | 2(log(mn) + 1)λn and we get |gn(θ) g(θ)| 2(log(mn) + 1)λn + |W P θ n (V ), P θ mn(V ) W P θ n (V ), P θ(V ) | + |W P θ n (V ), P θ(V ) W P θ (V ), P θ(V ) | 2(log(mn) + 1)λn + W P θ mn(V ), P θ(V ) + W(P θ n (V ), P θ (V )) 2(log(mn) + 1)λn + O W(P θ n (U), P θ (U)) + W P θ(U), P θ mn(U) , where we use Theorem 1 in the last inequality. We first bound supθ W P θ(U), P θ mn(U) using standard VC dimension argument. Note that {U1, , Un U } are independent. By Lemma 5, W P θ(U), P θ mn(U) i=1 W P θ(Ui), P θ mn(Ui) . (29) By the construction in Theorem 3, P θ(Ui) P j [NC,i] pj P(fθi,j(Zi,j)), Zi,j Unif(0, 1), where fθi,j are neural networks with constant width and depth L1,n +L2,n, NC,i is the number of connected components of supp(P(Ui)) and θi,j are the parameters of the neural networks. By Lemma 4, sup θ W P θ(Ui), P θ mn(Ui) O( max j [NC,j] sup θi,j W(P(fθi,j(Zi,j)), Pmn(fθi,j(Zi,j))). By [6], the pseudo-dimension of NN(Θ(1), L) is O L2 log L . By boundness of all the neural networks and Lemma 8 2, with probability at least 1 exp O (ϵ n) d U max log(ϵ n) 1 + (Ln,1 + Ln,2)2 log(Ln,1 + Ln,2) log ϵ 1 n mnϵ2 n , the following event happens. sup θi,j W(P(fθi,j(Zi,j)), Pmn(fθi,j(Zi,j)) ϵn + ϵ n. 2Note that we truncate the neural network to ensure boundness in Theorem 4. This extra truncate operation f = max{ K, min{K, f}} can be viewed as two extra Re Lu layers, so Lemma 8 is still applicable. ϵ n = m 1/(d U max+2) n log mn, Ln,i = m d U max/(2d U max+4) n log mn, ϵn = Cm 1/(d U max+2) n log mn with constant C > 0 sufficiently large, we get sup θi,j W(P(fθi,j(Zi,j)), Pmn(fθi,j(Zi,j))) O m 1/(d U max+2) n log mn 1 O m 2 n . As long as mn = Ω(n), the Borel-Cantelli lemma implies that almost surely, there exists N > 0 such that when n > N, supθi,j W(P(fθi,j(Zi,j)), Pmn(fθi,j(Zi,j))) O m 1/(d U max+2) n log mn for all i, j. 
Therefore, almost surely, for sufficiently large n, sup θ W P θ(Ui), P θ mn(Ui) O m 1/(d U max+2) n log mn [17, Theorem 2] shows that for sufficiently large n, if δn = C1n 1/ max{d U max,2} log2(n), where C1 > 0 is a constant, P(W(P θ (Ui), P θ n (Ui)) > δn) exp( CC1 log2(n)) d U max = 1, exp CC1 log4(n) log2(2+C 1 1 n1/2 log 2(n)) d U max = 2, exp( CC1 log2/d U max(n)) d U max > 2, where C > 0 is a constant. Let C1 sufficiently large such that P(W(P θ (Ui), P θ n (Ui)) > δn) O n 2 . Therefore, P n=1 P(W(P θ (Ui), P θ n (Ui)) > δn) < . By Borel-Cantelli lemma, with probability 1, there exist N > 0 such that W(P θ (Ui), P θ n (Ui)) δn, n > N. By Equation (29), with probability 1, there exist N > 0 such that W(P θ (U), P θ n (U)) O(δn), n > N. (31) Therefore, by Equations (28), (30) and (31), almost surely, the following inequality holds eventually, sup θ Θ |gn(θ) g(θ)| O log(nmn)λn + δn + m 1/(d U max+2) n log mn where sn is defined in Theorem 4. 3. [53, Proposition 1] shows that given a L-Lipschitz continuous function f : [0, 1]m R, there exist ˆf NN m,1(W, Θ(log(m)) such that f ˆf O W 1/d and ˆf is m LLipschitz continuous. 3 For a multivariate function with output dimension m , we can approximate each coordinate individually and get a neural network approximation ˆf that is mm L-Lipschitz continuous and f ˆf O W 1/m . Next, we define the truncate operator TKf = min{K, max{ K, f}}, which is a contraction mapping, and prove that applying the truncate operator will not increase approximation error. If |f| K, we have f TK ˆf = TKf TK ˆf f ˆf O W 1/m . 3The original proof does not specify the depth of the neural network. The authors show that the network architectures can be chosen as consisting of O(W) parallel blocks each having the same architecture. Each block is used to realize the function ϕ(x) = (min(min k =s(xk xs + 1), min k (1 + xk), min k (1 xk)))+, x Rm. This function can be realized by feed-forward network with width O m2 and depth O(log m). Therefore, we get that sup f FK L ([ K,K]m,[ K,K]m ) inf ˆ f N N mm L,K m,m (W0,n,Θ(log m)) f ˆf O W 1/m 0,n . (32) Theorem 3 implies that for each latent variable Ui, there exist a neural network gi with architecture in Corollary 1 such that W(P(Ui), P(gi(Zi))) O( i=1 L 2/dmax i,n + τn(1 log τn)). By Assumption 3, we know Ui K, which implies W(P(Ui), P((TKgi)(Zi))) W(P(Ui), P(gi(Zi))) O( i=1 L 2/d U max i,n + τn(1 log τn)). Combining Equations (32) and (33) with the same proof as Corollary 1, it can be proven in the same way as Corollary 1 that there exists a θn Θn satisfying d(θ , θn) ϵn = O W 1/din max 0,n + L 2/d U max 1,n + L 2/d U max 2,n + τn(1 log τn) . Let Θ = {θ }, we have verified the third assumption in Proposition 6. Take the Wasserstein radius to be αn = O(sn + ϵn), Proposition 6 implies the conclusion. C.3 Proof of Proposition 3 Lemma 2. Let ( ˆT, ˆY ) µ, (T, Y ) ν and suppose that f(t) = Eν[Y |T], ˆf(t) = Eµ[ ˆY | ˆT] are L-Lipschitz continuous and |f(t)| K, | ˆf(t)| K, then we have Z (f(t) ˆf(t))2dν(dt) CW W(µ, ν), where CW = 4LK + 2K max{L, 1}. Proof of Lemma 2. By the duality formulation of Wasserstein-1 distance, we have W(µ, ν) = sup g Lip(1) Eµ[g( ˆT, ˆY )] Eν[g(T, Y )]. Let g0(t, y) = ( ˆf(t) f(t))y, we verify g0 is a Lipschitz continuous function in {(t, y) : (t, y) K}. |g0(t1, y1) g0(t2, y2)| |g0(t1, y1) g0(t1, y2)| + |g0(t1, y2) g0(t2, y2)| = |( ˆf(t1) f(t1))(y1 y2)| + |y2( ˆf(t1) f(t1) ( ˆf(t2) f(t2)))| 2K|y1 y2| + K(| ˆf(t1) ˆf(t2)| + |f(t1) f(t2)|) 2K|y1 y2| + 2KL|t1 t2|. 
Let $L_g = 2K\max\{L, 1\}$; we have proven that $g_0$ is $L_g$-Lipschitz continuous on $\{(t, y) : \|(t, y)\| \le K\}$. Thus,
$$L_g W(\mu, \nu) \ge E_\mu[g_0(\hat T, \hat Y)] - E_\nu[g_0(T, Y)] = E_\mu\big[(\hat f(\hat T) - f(\hat T))E[\hat Y \mid \hat T]\big] - E_\nu\big[(\hat f(T) - f(T))E[Y \mid T]\big] = E_\mu\big[(\hat f(\hat T) - f(\hat T))\hat f(\hat T)\big] - E_\nu\big[(\hat f(T) - f(T))f(T)\big]. \tag{34}$$
Now, let $h(t) = (\hat f(t) - f(t))\hat f(t)$. Following the same argument, it can be proven that $h$ is Lipschitz continuous with Lipschitz constant $L_h = 4LK$. Hence, $|E_\mu[h(\hat T)] - E_\nu[h(T)]| \le L_h W(\mu, \nu)$. Plugging this into (34), we get
$$(L_g + L_h) W(\mu, \nu) \ge E_\nu\big[(\hat f(T) - f(T))\hat f(T) - (\hat f(T) - f(T))f(T)\big] = E_\nu\big[(\hat f(T) - f(T))^2\big].$$

Proof of Proposition 3. If $Y$ is not a descendant of $T$, then $E[Y(t)] = E[Y]$ and we have
$$\int_{t_1}^{t_2} \big(E_M[Y(t)] - E_{\hat M}[\hat Y(t)]\big)^2 dt = \int_{t_1}^{t_2} \big(E_M[Y] - E_{\hat M}[\hat Y]\big)^2 dt = (t_2 - t_1)\big(E_M[Y] - E_{\hat M}[\hat Y]\big)^2 \le (t_2 - t_1)\, W\big(P^M(V), P^{\hat M}(\hat V)\big)^2.$$
Now, suppose that $Y$ is a descendant of $T$. Let $X = \mathrm{Pa}(T)$. Note that $f_y(x, t) = E_M[Y \mid X = x, T = t]$ is $L_y$-Lipschitz continuous with Lipschitz constant $L_y \le (L+1)^{n_V}$. This is because $Y$ can be written as $Y = F_0(X, T, U^y)$, where $U^y$ are latent variables independent of $(T, X)$ and $F_0$ is a composition of the functions $f_i$ in the structure equations. The composition of Lipschitz functions is still Lipschitz, and $F_0$ is a composition of at most $n_V$ $L$-Lipschitz functions, so $F_0$ is $(L+1)^{n_V}$-Lipschitz and so is $f_y(x, t) = E_{U^y}[F_0(x, t, U^y)]$. Recall that $\nu, \mu$ are the observational distributions of $M$ and $\hat M$ respectively in Proposition 3. Similarly, $\hat f_y(x, t) = E_{\hat M}[\hat Y \mid X = x, T = t]$ is $L_y$-Lipschitz continuous. By Lemma 2 and the overlap assumption, we have
$$C_W W(\mu, \nu) \ge E_\nu\big[(f_y(X, T) - \hat f_y(X, T))^2\big] = \int (f_y(x, t) - \hat f_y(x, t))^2\, \nu(dt \mid x)\,\nu(dx) \ge \delta \int (f_y(x, t) - \hat f_y(x, t))^2\, \nu(dx)\, P(dt) \ge \delta \int \Big(\int f_y(x, t)\,\nu(dx) - \int \hat f_y(x, t)\,\nu(dx)\Big)^2 P(dt).$$
Note that $\hat f_y(x, t)$ is Lipschitz continuous, so
$$\Big|\int \hat f_y(x, t)\,\nu(dx) - \int \hat f_y(x, t)\,\mu(dx)\Big| \le L_y W(\mu, \nu).$$
Therefore,
$$\Big(\int f_y(x, t)\,\nu(dx) - \int \hat f_y(x, t)\,\nu(dx)\Big)^2 \ge \frac{1}{2}\Big(\int f_y(x, t)\,\nu(dx) - \int \hat f_y(x, t)\,\mu(dx)\Big)^2 - \Big(\int \hat f_y(x, t)\,\mu(dx) - \int \hat f_y(x, t)\,\nu(dx)\Big)^2 \ge \frac{1}{2}\big(E_M[Y(t)] - E_{\hat M}[\hat Y(t)]\big)^2 - L_y^2 W^2(\mu, \nu).$$
Combining all of the above equations, we get
$$\int_{t_1}^{t_2} \big(E_M[Y(t)] - E_{\hat M}[\hat Y(t)]\big)^2 P(dt) \le \frac{2 C_W}{\delta} W(\mu, \nu) + 2 L_y^2 W^2(\mu, \nu)(t_2 - t_1).$$

Proof of Corollary 2. In the proof of Theorem 4, we showed that with probability at least $1 - O(n^{-2})$, (6) has feasible solutions and
$$W\big(P_n^{M^*}(V), P^{M^*}(V)\big) \le O(\alpha_n), \qquad W\big(P_{m_n}^{\theta_n}(V), P^{\theta_n}(V)\big) \le O(\alpha_n),$$
where the notations $\theta$ and $P^\theta$ are defined in the proof of Theorem 4 and $\theta_n$ is one of the minimizers of (6). By Lemma 7, we know that
$$W\big(P_n^{M^*}(V), P_{m_n}^{\theta_n}(V)\big) \le S_{\lambda_n}\big(P_n^{M^*}(V), P_{m_n}^{\theta_n}(V)\big) + 2(\log(m_n n) + 1)\lambda_n \le O(\alpha_n).$$
We get that
$$W\big(P^{M^*}(V), P^{\theta_n}(V)\big) \le W\big(P_n^{M^*}(V), P^{M^*}(V)\big) + W\big(P_n^{M^*}(V), P_{m_n}^{\theta_n}(V)\big) + W\big(P_{m_n}^{\theta_n}(V), P^{\theta_n}(V)\big) \le O(\alpha_n).$$
By Proposition 3, we get $|F_n - F^*| \le O(\sqrt{\alpha_n})$.

D Experiments

The structure equations of the generative models are (5). We use three-layer feed-forward neural networks of width 128 for each $\hat f_i$ and six-layer neural networks of width 128 for each $\hat g_j$. We use the Augmented Lagrangian Multiplier (ALM) method to solve the optimization problems, as in [3]. We run 600 epochs and use a batch size of 2048 in each epoch. $m_n$ is set to $m_n = n$. The "geomloss" package [16] is used to calculate the Sinkhorn distance. To impose Lipschitz regularization, we use the technique from [23] to apply layer-wise normalization to the weight matrices with respect to the infinity norm; the upper bound on the Lipschitz constant of each layer is set to 8. The $\tau$ in the Gumbel-Softmax layer is set to 0. A minimal sketch of these two ingredients is given below.
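The following is a minimal sketch (not the released training code) of the two ingredients just mentioned: the debiased Sinkhorn loss computed with the geomloss package and a layer-wise projection of the weight matrices onto an infinity-norm ball. The helper name `project_lipschitz_`, the `blur` value, and the commented training-loop snippet are illustrative assumptions; only the `SamplesLoss` interface is the actual geomloss API.

```python
import torch
from geomloss import SamplesLoss

# Debiased Sinkhorn divergence with p = 1; `blur` roughly plays the role of the
# entropic regularization strength lambda_n (0.05 is an illustrative value).
sinkhorn = SamplesLoss(loss="sinkhorn", p=1, blur=0.05, debias=True)

def project_lipschitz_(layer: torch.nn.Linear, bound: float = 8.0) -> None:
    """Rescale `layer.weight` so that its infinity-operator norm is at most `bound`.

    The infinity-operator norm of a weight matrix is its maximum absolute row sum;
    dividing by max(1, norm / bound) projects the layer back onto the constraint set
    after each gradient step, in the spirit of the layer-wise normalization of [23].
    """
    with torch.no_grad():
        w = layer.weight
        inf_norm = w.abs().sum(dim=1).max()
        w.mul_(1.0 / torch.clamp(inf_norm / bound, min=1.0))

# Inside a training step (model, optimizer and the two sample batches are assumed):
#   loss = sinkhorn(model_samples, data_samples)   # both of shape (batch, dim)
#   loss.backward(); optimizer.step()
#   for m in model.modules():
#       if isinstance(m, torch.nn.Linear):
#           project_lipschitz_(m, bound=8.0)
```

Projecting after every optimizer step keeps each layer inside the prescribed Lipschitz ball throughout training, which is the role the regularization plays in the consistency analysis.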
For the choice of the Wasserstein ball radius $\alpha_n$, we use the subsampling technique from [11, Section 3.4]. We can take the criterion function in [11] to be $Q(\theta) = W(P^{M^*}(V), P^{M_\theta}(V))$ and its empirical estimate to be $Q_n(\theta) = W(P_n^{M^*}(V), P_n^{M_\theta}(V))$. In the first step, we minimize the Wasserstein distance, $\hat\theta^*_n = \arg\min_\theta W(P_n^{M^*}(V), P_n^{M_\theta}(V))$. Then, [11] proposes to refine the radius $Q_n(\hat\theta^*_n)$ by taking a quantile of the subsampled quantities $\{\sup_{\theta : Q_n(\theta) \le \beta Q_n(\hat\theta^*_n)} Q_{j,b}(\theta)\}$, where $Q_{j,b}$ denotes the criterion function estimated on the $j$-th subsample of size $b$. However, it is time-consuming to solve this optimization problem many times. In practice, we set the radius $\alpha_n$ to be the 95% quantile of $\{W(P^{M^*}_j(V), P^{M_{\hat\theta^*_n}}_n(V))\}_{j=1}^{50}$, where $P^{M^*}_j$, $j = 1, \dots, 50$, are 50 subsamples drawn from $P_n^{M^*}(V)$ of size $b = 15\sqrt{n}$.

D.1 Discrete IV [14]

We consider the noncompliance binary IV example in [14, Section D.1]. The causal graph is shown in Figure 10. We can see from Table 2 that the bound given by the NCM is slightly worse than Autobounds but is still close to the optimal bound. Besides, the Autobounds bound does not cover the optimal bound in this example.

Figure 10: Instrumental Variable (IV) graph. Z is the instrumental variable, T is the treatment and Y is the outcome.

Algorithm | Average Bound | SD of length | Optimal Bound | True Value
NCM (Ours) | [-0.49, 0.05] | 0.05 | [-0.45, -0.04] | -0.31
Autobounds ([14]) | [-0.45, -0.05] | 0.02 | [-0.45, -0.04] | -0.31

Table 2: The sample size is taken to be 5000. We average over 10 runs with different random seeds. SD is short for standard deviation.

D.2 Continuous IV

Now we turn to the continuous IV setting, where $T$ is still binary but $Y$ and $Z$ can be continuous. Let $E^\lambda_i \sim \lambda Z_1 + (1 - \lambda)\,\mathrm{Unif}(-1, 1)$, where $Z_1$ is a standard Gaussian variable conditioned on $[-1, 1]$, and let the structure equations of $M_\lambda$ be
$$E_Y = E^\lambda_1,\quad U = E^\lambda_2,\quad Z = E^\lambda_3,\quad P(T = 1 \mid Z, U) = (1.5 + Z + 0.5U)/3,\quad Y = 3T - 1.5TU + U + E_Y,$$
where the $E^\lambda_i$ are independent. It is easy to see that the ATE of this model is 3 regardless of $\lambda$. A short simulation sketch of this data-generating process is given at the end of this subsection. In the experiment, we randomly choose ten $\lambda \sim \mathrm{Unif}(0, 1)$; for each $\lambda$, we run the algorithms 5 times to get the bound. We choose different latent distributions (indexed by $\lambda$) in the experiments to create more difficulty. Since Autobounds can only deal with discrete variables, we discretize $Z$ and $Y$: supposing that $Z \in [l, u]$, we map all points in the interval $[l + i(u-l)/k,\, l + (i+1)(u-l)/k]$, $i = 0, 1, \dots, k-1$, to the midpoint $l + (i + 1/2)(u-l)/k$. We choose $k = 8$ in the following experiments. This choice gives rise to polynomial programming problems with $2^{14}$ decision variables, which is quite large. Table 3 demonstrates the results. While both algorithms cover the true ATE well, we can see that the NCM gives much tighter bounds on average. The main reason may be that the discretized problem does not approximate the original problem well enough. It is possible that a larger discretization parameter $k$ can give a better bound, but since the size of the polynomial problem grows exponentially with $k$, the optimization problem may be intractable for large $k$. On the contrary, the NCM does not suffer from such computational difficulties.

Algorithm | Average Bound | SD of length | Success Rate | True Value
NCM (Ours) | [2.49, 3.24] | 0.49 | 50/50 | 3
Autobounds ([14]) | [1.40, 3.48] | 0.26 | 50/50 | 3

Table 3: We take a sample size of 5000. We randomly choose 10 $\lambda \sim \mathrm{Unif}(0, 1)$ and obtain 10 models $M_{\lambda_i}$. For each model $M_{\lambda_i}$, we run the two algorithms 5 times. The success rate is the number of times the obtained bounds cover the true ATE divided by the total number of experiments. SD is short for standard deviation.
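As referenced above, the following is a minimal NumPy sketch of the data-generating process $M_\lambda$ as we read it from the structure equations of Appendix D.2; the function names are ours and the rejection sampler for the truncated Gaussian is only one convenient choice. Running it confirms numerically that the interventional contrast $E[Y(1)] - E[Y(0)]$ is approximately 3 for any $\lambda$.

```python
import numpy as np

def truncated_normal(size, rng):
    """Standard Gaussian conditioned on [-1, 1], drawn by rejection sampling."""
    out = np.empty(size)
    filled = 0
    while filled < size:
        draw = rng.standard_normal(size)
        draw = draw[np.abs(draw) <= 1.0]
        take = min(size - filled, draw.size)
        out[filled:filled + take] = draw[:take]
        filled += take
    return out

def latent(lam, size, rng):
    """E_i^lambda = lam * Z_1 + (1 - lam) * Unif(-1, 1)."""
    return lam * truncated_normal(size, rng) + (1 - lam) * rng.uniform(-1.0, 1.0, size)

def sample_M_lambda(lam, n, rng, do_t=None):
    """Draw (Z, T, Y) from M_lambda; do_t fixes the treatment (an intervention)."""
    e_y, u, z = latent(lam, n, rng), latent(lam, n, rng), latent(lam, n, rng)
    p_t = (1.5 + z + 0.5 * u) / 3.0                     # propensity P(T = 1 | Z, U)
    t = np.full(n, float(do_t)) if do_t is not None else (rng.uniform(size=n) < p_t).astype(float)
    y = 3.0 * t - 1.5 * t * u + u + e_y
    return z, t, y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lam = rng.uniform()                                  # one randomly chosen lambda
    _, _, y1 = sample_M_lambda(lam, 200_000, rng, do_t=1)
    _, _, y0 = sample_M_lambda(lam, 200_000, rng, do_t=0)
    print(y1.mean() - y0.mean())                         # approximately 3 for any lambda
```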
D.3 Counterexample

We test our neural causal method on the counterexample in Appendix C.1. We choose the noise $U_y$ to be normal. The structure equations are
$$P(U_1 = 1) = 3/5,\; P(U_1 = 0) = 2/5,\qquad P(U_2 = 1) = 1/3,\; P(U_2 = 0) = 2/3,$$
$$X = \begin{cases} -1 & \text{w.p. } 1/2,\\ 0 & \text{w.p. } 1/2, \end{cases}\qquad P(T = 1 \mid X = x) = \begin{cases} 1/2, & x = -1,\\ 5/8, & x = 0, \end{cases}\qquad Y = \begin{cases} T + U_y, & X = -1,\\ U_1 + U_y, & X = 0,\, T = 1,\\ U_2 + U_y, & X = 0,\, T = 0. \end{cases}$$
We compare the confidence intervals of the unregularized neural causal method and the Lipschitz-regularized one under different architectures, choosing the layer-wise Lipschitz constant upper bounds to be 5 and 1.2. As a benchmark, we also include the bound obtained by the Double Machine Learning estimator using the EconML package. The results are shown in Figure 11. For this identifiable example, the Double ML estimator produces better bounds for the ATE. The intervals given by the regularized and unregularized NCM are similar, with the regularized one slightly better, in the left figure, where we use medium-sized NNs. However, in the right figure, the intervals obtained with regularization are tighter, although slightly biased, compared with the unregularized setting. From these two experiments, we conclude that the architecture of the NNs also influences the results, and that adding regularization during the training process can prevent extreme confidence intervals or inconsistency.

Figure 11: Comparison of the Lipschitz-regularized and unregularized neural causal algorithms. The two figures show the results for different architectures. The figure on the left uses a medium-sized NN (width 128, depth 3) to approximate each structural function, while the right figure uses extremely small NNs (width 3, depth 1). In all experiments, we use projected gradient steps to regularize the weights of the neural network. For each sample size, we repeat the experiment 5 times and take the average of the upper (lower) bound.

E Technical Lemmas

Lemma 3. Let $\mu, \hat\mu$ be two measures on a compact set $K \subseteq \mathbb R^{d_1}$. For any measurable functions $F, \hat F : K \to \mathbb R^{d_2}$, if $F$ is $L$-Lipschitz continuous, then
$$W(F_\#\mu, \hat F_\#\hat\mu) \le L\, W(\mu, \hat\mu) + \|F - \hat F\|_\infty.$$
Proof. For any 1-Lipschitz function $g : \mathbb R^{d_2} \to \mathbb R$,
$$E_{X\sim F_\#\mu}[g(X)] - E_{X\sim \hat F_\#\hat\mu}[g(X)] = E_{X\sim\mu}[g\circ F(X)] - E_{X\sim\hat\mu}[g\circ\hat F(X)] = \big(E_{X\sim\mu}[g\circ F(X)] - E_{X\sim\hat\mu}[g\circ F(X)]\big) + \big(E_{X\sim\hat\mu}[g\circ F(X)] - E_{X\sim\hat\mu}[g\circ\hat F(X)]\big).$$
Note that for any $x, y$, $|g\circ F(x) - g\circ F(y)| \le \|F(x) - F(y)\| \le L\|x - y\|$. Thus $g\circ F$ is $L$-Lipschitz continuous and
$$E_{X\sim\mu}[g\circ F(X)] - E_{X\sim\hat\mu}[g\circ F(X)] \le L\, W(\mu, \hat\mu). \tag{35}$$
For the second term,
$$E_{X\sim\hat\mu}[g\circ F(X)] - E_{X\sim\hat\mu}[g\circ\hat F(X)] \le E_{X\sim\hat\mu}\big[|g\circ F(X) - g\circ\hat F(X)|\big] \le E_{X\sim\hat\mu}\big[\|F(X) - \hat F(X)\|\big] \le \|F - \hat F\|_\infty. \tag{36}$$
Combining (35) and (36), we obtain the conclusion.

Lemma 4. Let $\mu_1,\dots,\mu_n, \hat\mu_1,\dots,\hat\mu_n$ be measures on $\mathbb R^d$. For any $p_1,\dots,p_n \in [0,1]$ such that $\sum_{k=1}^n p_k = 1$, we have
$$W\Big(\sum_{k=1}^n p_k\mu_k, \sum_{k=1}^n p_k\hat\mu_k\Big) \le \sum_{k=1}^n p_k W(\mu_k, \hat\mu_k).$$
Proof. For any 1-Lipschitz function $f : \mathbb R^d \to \mathbb R$, we have
$$E_{X\sim\sum_k p_k\mu_k}[f(X)] - E_{X\sim\sum_k p_k\hat\mu_k}[f(X)] = \sum_{k=1}^n p_k\big(E_{X_k\sim\mu_k}[f(X_k)] - E_{X_k\sim\hat\mu_k}[f(X_k)]\big) \le \sum_{k=1}^n p_k W(\mu_k, \hat\mu_k).$$
By the dual formulation of the Wasserstein-1 distance,
$$W\Big(\sum_{k=1}^n p_k\mu_k, \sum_{k=1}^n p_k\hat\mu_k\Big) = \sup_{f\ 1\text{-Lipschitz}} E_{X\sim\sum_k p_k\mu_k}[f(X)] - E_{X\sim\sum_k p_k\hat\mu_k}[f(X)] \le \sum_{k=1}^n p_k W(\mu_k, \hat\mu_k).$$

Lemma 5. Given two product measures $\mu = \mu_1\otimes\cdots\otimes\mu_n$ and $\hat\mu = \hat\mu_1\otimes\cdots\otimes\hat\mu_n$, we have
$$W(\mu, \hat\mu) \le \sum_{k=1}^n W(\mu_k, \hat\mu_k).$$
Proof. For any $\epsilon > 0$, let $\pi_k$ be a coupling of $\mu_k, \hat\mu_k$ such that $E_{(X_k,Y_k)\sim\pi_k}[\|X_k - Y_k\|_1] \le W(\mu_k, \hat\mu_k) + \epsilon$. Then
$$W(\mu, \hat\mu) \le E_{\pi_1\otimes\cdots\otimes\pi_n}[\|X - Y\|] \le \sum_{k=1}^n E_{(X_k,Y_k)\sim\pi_k}[\|X_k - Y_k\|_1] \le \sum_{k=1}^n \big(W(\mu_k, \hat\mu_k) + \epsilon\big).$$
Letting $\epsilon \to 0$, we get the result.

Lemma 6 (Approximation error of the Gumbel-Softmax layer). Given $1 > \tau > 0$ and $p_1,\dots,p_n > 0$, let $\mu_\tau$ be the distribution of $X^\tau = (X^\tau_1,\dots,X^\tau_n)$ with
$$X^\tau_k = \frac{\exp((\log p_k + G_k)/\tau)}{\sum_{j=1}^n \exp((\log p_j + G_j)/\tau)}, \tag{37}$$
where $G_k$ are i.i.d. standard Gumbel variables with density $\exp(-x - e^{-x})$.
Let µ0 be the distribution of X0 = limτ 0 Xτ = X0 1, , X0 n . Then, W µτ, µ0 2(n 1)τ 2n(n 1)e 1τ log τ. Proof. Without loss of generality, we may assume that Pn k=1 pk = 1. Otherwise, we can divide the denominator and numerator in (37) by Pn k=1 pk. Let n = {(x1, , xn) : Pn k=1 xk = 1}. We construct a transportation map T : n n as follows. T(x1, , xn) is a n-dimensional vector with all zeros except that the i = arg maxj xj-th entry being one. If there are multiple coordinates that is maximum, we choose the first coordinate and let the remaining coordinate be zero. It is easy to verify that T#µτ and µ0 have the same distribution [36]. For any δ > 0, we have P(|Gk1 + log pk1 (Gk2 + log pk2)| < δ) = Z Z y+log pk2 pk1 +δ y+log pk2 pk1 δ exp( x y exp( y) exp( x))dxdy exp( y exp( y))dy = 2δ, where we have used exp( x exp( x)) exp( 1). Let event Eδ = { i, j [n] : |Gi + log pi (Gj + log pj)| < δ}, we have i =j P(|Gi + log pi (Gj + log pj)| < δ) δn(n 1)e 1. Now, we calculate the transportation cost EXτ µτ [ T (Xτ) Xτ 1] P(Eδ)EXτ µτ [ T (Xτ) Xτ 1|Eδ] + P (Ec δ) EXτ µτ [ T (Xτ) Xτ 1|Ec δ] . Note that T (Xτ) Xτ 1 2. For the first term, we have P(Eδ)EXτ µτ [ T (Xτ) Xτ 1|Eδ] 2δn(n 1)e 1. For the second term, under event Ec δ, let kmax = arg maxi Xi, if j = kmax = arg maxi(log pk+Gk) exp((log pj + Gj)/τ) Pn k=1 exp((log pk + Gk)/τ) 1 = k =j exp((log pk + Gk)/τ) Pn k=1 exp((log pk + Gk)/τ) P k =j exp((log pk + Gk)/τ) exp((log pj + Gj)/τ) k =j exp((log pk + Gk (log pj + Gj))/τ) (n 1) exp( δ/τ). If j = kmax, we have exp((log pj + Gj)/τ) Pn k=1 exp((log pk + Gk)/τ) exp((log pj + Gj (log pmax + Gmax))/τ) exp( δ/τ). EXτ µτ [ T (Xτ) Xτ 1|Ec δ] 2(n 1) exp( δ/τ). We get an upper bound of the transportation cost, W µτ, µ0 EXτ µτ [ T (Xτ) Xτ 1] 2δn(n 1)e 1 + 2(n 1) exp( δ/τ). Let δ = τ log τ, we get W µτ, µ0 2(n 1)τ 2n(n 1)e 1τ log τ. Lemma 7 (Approximation Error of Sinkhorn distance). For any µ n, ν m and λ > 0, we have 0 W1,λ(µ, ν) W(µ, ν) (log(mn) + 1)λ, where W1,λ( , ) is the entropy regularized Wasserstein-1 distance. Moreover, we have the following estimation of approximation error. |Sλ(µ, ν) W(µ, ν)| 2(log(mn) + 1)λ. Proof. Let h(T) = Pn,m i,j=1 Tij log Tij + 1, by [35, Proposition 1], we have 0 W1,λ(µ, ν) W(µ, ν) cλ, (38) where c = max {h(T) : transportation plan T is achieve optimal loss W(µ, ν)}. Since Tij [0, 1] and f(x) = x log x is concave, we have h(T) = nm 1 nm i,j=1 Tij log Tij + 1. = 1 + log(nm). By definition of Sinkhorn distance, Sλ(µ, ν) = W1,λ(µ, ν) W1,λ(µ, µ)/2 W1,λ(ν, ν)/2. By (38), we get |Sλ(µ, ν) W(µ, ν)| 2(log(mn) + 1)λ. Lemma 8. Given a measure µ on Rd and a real function class F with output dimension d, suppose that the pseudo-dimension of F is less than d F < and there exists K > 0 such that |f(x)| K, f F, x Rd , then with probability at least 1 exp O δ d log δ 1 + d F log ϵ 1 nϵ2 , sup f F W(f#µ, f#µn) δ + ϵ for all δ, ϵ > 0, where µn is the empirical distribution of µ. Proof. By the dual formulation of the Wasserstein distance, sup f F W(f#µ, f#µn) = sup h F1(Rd,Rd),h(0)=0 sup f F EX µn[h f(X)] EX µ[h f(X)]. We define N1(ϵ, F, n) = sup x1, ,xn N(ϵ, {(f(x1), , f(xn) : f F)}, 1), where N(ϵ, S, 1) is the covering number of set S in ℓ1 norm. Obviously, if h is 1-Lipschitz, N1(ϵ, h F, n) N1(ϵ, F, n). By Theorem 18.4 in [2], for any fixed h, N1(ϵ, h F, n) N1(ϵ, F, n) e(d F + 1) 2e By Theorem 17.1 in [2], sup f F EX µn[h f(X)] EX µ[h f(X)] > ϵ exp O d F log ϵ 1 nϵ2 . Let Hδ be the δ-net of the set H = h : h F1 [ K, K]d, [ K, K]d , h(0) = 0 in ℓ norm. 
By Lemma 6 in [21], $|H_\delta| \le \exp\big(O(\delta^{-d}\log\delta^{-1})\big)$. Therefore, with probability no more than
$$\exp\Big(O\big(\delta^{-d}\log\delta^{-1} + d_F\log\epsilon^{-1}\big) - n\epsilon^2\Big),$$
we have
$$\sup_{h\in H_\delta,\, f\in F} E_{X\sim\mu_n}[h\circ f(X)] - E_{X\sim\mu}[h\circ f(X)] > \epsilon.$$
Notice that
$$\sup_{h\in H,\, f\in F} E_{X\sim\mu_n}[h\circ f(X)] - E_{X\sim\mu}[h\circ f(X)] \le 2\delta + \sup_{h\in H_\delta,\, f\in F} E_{X\sim\mu_n}[h\circ f(X)] - E_{X\sim\mu}[h\circ f(X)],$$
which implies that with probability no more than $\exp\big(O(\delta^{-d}\log\delta^{-1} + d_F\log\epsilon^{-1}) - n\epsilon^2\big)$,
$$\sup_{h\in H,\, f\in F} E_{X\sim\mu_n}[h\circ f(X)] - E_{X\sim\mu}[h\circ f(X)] > 2\delta + \epsilon.$$

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We accurately summarize our contributions in the abstract and introduction.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We mention some limitations as future directions in the conclusion section.
Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: All assumptions and detailed proof are given. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: We provide all implementation details in the appendix. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. 
In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide all implementation details in the appendix. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [Yes] Justification: We provide all implementation details in the appendix. Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: We report the confidence interval. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). 
It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We provide all implementation details in the appendix. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: We have reviewed the Code. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: This is a theoretical work. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. 
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: This is a theoretical work.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: This is a theoretical work.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: This is a theoretical work.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: This is a theoretical work.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: This is a theoretical work.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.