Conditional independence testing under misspecified inductive biases

Felipe Maia Polo, Department of Statistics, University of Michigan, felipemaiapolo@gmail.com
Yuekai Sun, Department of Statistics, University of Michigan, yuekai@umich.edu
Moulinath Banerjee, Department of Statistics, University of Michigan, moulib@umich.edu

Abstract

Conditional independence (CI) testing is a fundamental and challenging task in modern statistics and machine learning. Many modern methods for CI testing rely on powerful supervised learning methods to learn regression functions or Bayes predictors as an intermediate step; we refer to this class of tests as regression-based tests. Although these methods are guaranteed to control Type-I error when the supervised learning methods accurately estimate the regression functions or Bayes predictors of interest, their behavior is less understood when they fail due to misspecified inductive biases, that is, when the employed models are not flexible enough or when the training algorithm does not induce the desired predictors. We study the performance of regression-based CI tests under misspecified inductive biases. Namely, we propose new approximations or upper bounds for the testing errors of three regression-based tests that depend explicitly on misspecification errors. Moreover, we introduce the Rao-Blackwellized Predictor Test (RBPT), a regression-based CI test robust against misspecified inductive biases. Finally, we conduct experiments with artificial and real data, showcasing the usefulness of our theory and methods.

1 Introduction

Conditional independence (CI) testing is fundamental in modern statistics and machine learning (ML). Its use has become widespread in several different areas, from (i) causal discovery [12, 24, 34, 11] and (ii) algorithmic fairness [28] to (iii) feature selection/importance [5, 37] and (iv) transfer learning [25]. Due to its growing relevance across different sub-fields of statistics and ML, new testing methods of different natures, from regression-based to simulation-based tests, are frequently introduced. Regression-based CI tests, i.e., tests based on supervised learning methods, have become especially attractive in the past years due to (i) significant advances in supervised learning techniques, (ii) their suitability for high-dimensional problems, and (iii) their simplicity and ease of application. However, regression-based tests usually depend on the assumption that we can accurately approximate the regression functions or Bayes predictors of interest, which is hardly true if (i) the model classes are misspecified or (ii) the training algorithms do not induce the desired predictors, i.e., if we have misspecified inductive biases. Misspecified inductive biases typically lead to inflated Type-I error rates but can also cause tests to be powerless. Even though these problems frequently arise in practice, the effects of misspecification on CI hypothesis testing have received little theoretical attention. Moreover, current regression-based methods are usually not designed to be robust against misspecification errors, making CI testing less reliable. In this work, we study the performance of three major regression-based conditional independence tests under misspecified inductive biases and propose the Rao-Blackwellized Predictor Test (RBPT), which is more robust against misspecification.
In more detail, our main contributions are:

- We present new robustness results for three relevant regression-based conditional independence tests: (i) the Significance Test of Feature Relevance (STFR) [7], (ii) the Generalized Covariance Measure (GCM) test [31], and (iii) the REgression with Subsequent Independence Test (RESIT) [42, 24, 12]. Namely, we derive approximations or upper bounds for the testing errors that explicitly depend on the level of misspecification.
- We introduce the Rao-Blackwellized Predictor Test (RBPT), a modification of the Significance Test of Feature Relevance (STFR) [7] that is robust against misspecified inductive biases. In contrast with STFR and previous regression- and simulation-based methods (simulation-based tests usually rely on estimating conditional distributions), the RBPT does not require models to be correctly specified to guarantee Type-I error control.
- We develop theoretical results for the RBPT, and our experiments show that it robustly controls Type-I error while maintaining non-trivial power.

2 Preliminaries

Conditional independence testing. Let (X, Y, Z) be a random vector taking values in X × Y × Z ⊆ R^{d_X} × R^{d_Y} × R^{d_Z} and let P be a fixed family of distributions on the measurable space (X × Y × Z, B), where B = B(X × Y × Z) is the Borel σ-algebra. Let (X, Y, Z) ∼ P and assume P ∈ P. If P_0 ⊆ P is the set of distributions in P under which X ⊥ Y | Z, the problem of conditional independence testing can be expressed in the following way:

H_0 : P ∈ P_0   versus   H_1 : P ∈ P \ P_0.

In this work, we also write H_0 : X ⊥ Y | Z for the null and H_1 for its negation. We assume throughout that we have access to a dataset D^(n+m) = {(X_i, Y_i, Z_i)}_{i=1}^{n+m} of observations independent and identically distributed (i.i.d.) as (X, Y, Z), which splits into a test set D^(n)_te = {(X_i, Y_i, Z_i)}_{i=1}^{n} and a training set D^(m)_tr = {(X_i, Y_i, Z_i)}_{i=n+1}^{n+m}. For convenience, we use the training set to fit models and the test set to conduct hypothesis tests, even though other approaches are possible.

Misspecified inductive biases in modern statistics and machine learning. Traditionally, misspecified inductive biases in statistics have been linked to the concept of model misspecification and thus strictly related to the chosen model classes. For instance, if the best (Bayes) predictor for Y given X, f*, is a non-linear function of X, but we use a linear function to predict Y, then we say our model is misspecified because f* is not in the class of linear functions. In modern machine learning and statistics, however, it is known that the training algorithm also plays a crucial role in determining the trained model. For example, training overparameterized neural networks with stochastic gradient descent biases the models towards functions with good generalization [14, 33]. In addition, D'Amour et al. [9] showed that varying hyperparameter values during training can result in significant differences in the patterns learned by a neural network. The researchers found, for instance, that models with different random initializations exhibit varying levels of out-of-distribution accuracy in predicting skin health conditions for different skin types, indicating that each model learned distinct features from the images. The sensitivity of the trained model to different training settings suggests that even models capable of universal approximation may not accurately estimate the target predictor if the training biases do not induce the functions we want to learn.
We present a toy experiment to empirically demonstrate how the training algorithm can prevent us from accurately estimating the target predictor even when the model class is correctly specified, leading to invalid significance tests. We work in the context of a high-dimensional (overparameterized) regression with a training set of 250 observations and 500 covariates. We use the Generalized Covariance Measure (GCM) test [31] (see Appendix A.3 for more details) to conduct the CI test. The data are generated as Z ∼ N(0, I_500), X | Z ∼ N(β_X^⊤ Z, 1), and Y | X, Z ∼ N(β_Y^⊤ Z, 1), where the first five entries of β_X are set to 20 and the remaining entries are zero, while the last five entries of β_Y are set to 20 and the remaining entries are zero. This results in X and Y being conditionally independent given Z and depending on Z only through a small number of entries. Additionally, E[X | Z] = β_X^⊤ Z and E[Y | Z] = β_Y^⊤ Z, indicating that the linear model class is correctly specified. To perform the GCM test, we use the LASSO (an ℓ1 penalization term added to the empirical squared error) and the minimum-norm least-squares solution to fit linear models that predict X and Y given Z. In this problem, the LASSO fitting approach provides the correct inductive bias since β_X and β_Y are sparse. We set the significance level to α = 10% and estimate the Type-I error rate over 100 different training sets.

Figure 1: Type-I error rate is contingent on the training algorithm and not solely on the model classes. Unlike the minimum-norm solution, the LASSO fit gives the correct inductive bias in high-dimensional regression, providing better Type-I error control.

Figure 1 provides the empirical distribution of the Type-I error rate and illustrates that, despite using the same model class for both fitting methods, the training algorithm induces an undesired predictor in the minimum-norm case, implying an invalid test most of the time. On the other hand, the LASSO approach has better Type-I error control. In Appendix A, we give a similar example using the Significance Test of Feature Relevance (STFR) [7].

In this work, we formalize the idea of misspecified inductive biases in the following way. Assume that a training algorithm A is used to choose a model ĝ^(m) = A(D^(m)_tr) from the class G^(m). We further assume that the sequence (ĝ^(m))_{m∈N} converges to a limiting model g* in a relevant, context-dependent sense. We use different notions of convergence depending on the specific problem under consideration, which will be clear in the following sections. We say that g* "carries misspecified inductive biases" if it does not equal the target Bayes predictor or regression function f*. There are two possible reasons for g* carrying misspecified biases: either the limiting model class is small and does not include f*, or the training algorithm A cannot find the best possible predictor, even asymptotically.
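To make the toy experiment above concrete, here is a minimal sketch of how one might reproduce it. The GCM statistic follows the description in Appendix A.3; the function names, LASSO penalty value, and test-set size are illustrative assumptions rather than the authors' original code.

```python
# Toy experiment: GCM test with LASSO vs. minimum-norm least squares in an
# overparameterized linear model where X ⊥ Y | Z holds by construction.
import numpy as np
from numpy.linalg import pinv
from scipy.stats import norm
from sklearn.linear_model import Lasso

def gcm_pvalue(x_res, y_res):
    """Two-sided GCM p-value from residual products R_i = (X_i - g1(Z_i)) * (Y_i - g2(Z_i))."""
    r = x_res * y_res
    stat = np.sqrt(len(r)) * r.mean() / r.std()
    return 2 * (1 - norm.cdf(abs(stat)))

def simulate(fit="lasso", n_tr=250, n_te=250, d=500, seed=0):
    rng = np.random.default_rng(seed)
    beta_x = np.zeros(d); beta_x[:5] = 20.0    # X depends on the first 5 coordinates of Z
    beta_y = np.zeros(d); beta_y[-5:] = 20.0   # Y depends on the last 5 coordinates of Z
    def sample(n):
        Z = rng.normal(size=(n, d))
        X = Z @ beta_x + rng.normal(size=n)
        Y = Z @ beta_y + rng.normal(size=n)    # conditionally independent given Z
        return Z, X, Y
    Ztr, Xtr, Ytr = sample(n_tr)
    Zte, Xte, Yte = sample(n_te)
    if fit == "lasso":                         # correct inductive bias (sparse coefficients)
        fx, fy = Lasso(alpha=0.1).fit(Ztr, Xtr), Lasso(alpha=0.1).fit(Ztr, Ytr)
        x_hat, y_hat = fx.predict(Zte), fy.predict(Zte)
    else:                                      # minimum-norm least-squares solution
        x_hat, y_hat = Zte @ (pinv(Ztr) @ Xtr), Zte @ (pinv(Ztr) @ Ytr)
    return gcm_pvalue(Xte - x_hat, Yte - y_hat)

rejections = [simulate("minnorm", seed=s) < 0.10 for s in range(100)]
print("Estimated Type-I error (min-norm):", np.mean(rejections))
```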
Notation. We write E_P and Var_P for the expectation and variance of statistics computed using i.i.d. copies of (X, Y, Z) ∼ P. Consequently, P_P(A) = E_P[1_A], where 1_A is the indicator of an event A. If E_P and Var_P are conditioned on some other statistics, we assume those statistics are also computed using i.i.d. samples from P. As usual, Φ is the N(0, 1) distribution function. If (a_m)_{m∈N} and (b_m)_{m∈N} are sequences of scalars, then a_m = o(b_m) means a_m/b_m → 0 as m → ∞, and a_m = b_m + o(1) means a_m − b_m = o(1). If (V^(m))_{m∈N} is a sequence of random variables, where V^(m) is constructed using i.i.d. samples of P^(m) ∈ P for each m, then (i) V^(m) = o_p(1) means that for every ε > 0 we have P_{P^(m)}(|V^(m)| > ε) → 0 as m → ∞, (ii) V^(m) = O_p(1) means that for every ε > 0 there exists an M > 0 such that sup_{m∈N} P_{P^(m)}(|V^(m)| > M) < ε, (iii) V^(m) = a_m + o_p(1) means V^(m) − a_m = o_p(1), (iv) V^(m) = o_p(a_m) means V^(m)/a_m = o_p(1), and (v) V^(m) = O_p(a_m) means V^(m)/a_m = O_p(1). Finally, let (V^(m)_P)_{m∈N, P∈P} be a family of random variables whose distributions explicitly depend on m ∈ N and P ∈ P. We give an example to clarify what we mean by "explicitly" depending on a specific distribution. Let V^(m)_P = (1/m) Σ_{i=1}^m (X_i − µ_P), where µ_P = E_P[X]. Here, V^(m)_P explicitly depends on P because of the quantity µ_P. In this example, the X_i's outside the expectation can have an arbitrary distribution (unless stated otherwise), i.e., they could be determined by P or any other distribution. With this convention, (i) V^(m)_P = o_P(1) means that for every ε > 0 we have sup_{P∈P} P_P(|V^(m)_P| > ε) → 0 as m → ∞, (ii) V^(m)_P = O_P(1) means that for every ε > 0 there exists an M > 0 such that sup_{m∈N, P∈P} P_P(|V^(m)_P| > M) < ε, (iii) V^(m)_P = o_P(a_m) means V^(m)_P/a_m = o_P(1), and (iv) V^(m)_P = O_P(a_m) means V^(m)_P/a_m = O_P(1).

Related work. There is a growing literature on conditional independence testing, regarding both theoretical and methodological aspects (see, for example, Marx and Vreeken [21], Shah and Peters [31], Li and Fan [19], Neykov et al. [23], Watson and Wright [37], Kim et al. [15], Shi et al. [32], Scetbon et al. [30], Tansey et al. [35], Zhang et al. [39], and Ai et al. [1]). From the methodological point of view, there is a great variety of tests of different natures. Perhaps the most important groups of tests are (i) simulation-based tests [5, 3, 4, 32, 35, 20], (ii) regression-based tests [40, 24, 42, 38, 31, 7], (iii) kernel-based tests [10, 8, 34, 30], and (iv) information-theoretic tests [29, 16, 39]. Due to the advances in supervised and generative models in recent years, regression- and simulation-based tests have become particularly appealing, especially when Z is not low-dimensional or discrete. A related but different line of research constructs a lower confidence bound for the conditional dependence of X and Y given Z [41]. In that work, the authors propose a method that relies on computing the conditional expectation of a possibly misspecified regression model, which can be related to our method presented in Section 4. Despite the relationship between the methods, their motivations, assumptions, and contexts are different. Simulation-based tests depend on the fact that we can, implicitly or explicitly, approximate the conditional distributions P_{X|Z} or P_{Y|Z}. Two relevant simulation-based methods are the conditional randomization and conditional permutation tests (CRT/CPT) [5, 3, 4, 35]. For these tests, Berrett et al. [4] present robustness results showing that we can approximately control Type-I error even if our estimates of the conditional distributions are not perfect and we are in a finite-sample regime. However, it is also clear from their results that CRT and CPT might not control Type-I error asymptotically when the models for the conditional distributions are misspecified.
On the other hand, regression-based tests work under the assumption that we can accurately approximate the conditional expectations E[X | Z] and E[Y | Z] or other Bayes predictors, which is hardly true if the modeling and training inductive biases are misspecified. To the best of our knowledge, there are no published robustness results for regression-based CI tests like those presented by Berrett et al. [4]. We explore this literature gap.

3 Regression-based conditional independence tests under misspecified inductive biases

This section provides results for the Significance Test of Feature Relevance (STFR) [7]. Due to limited space, the results for the Generalized Covariance Measure (GCM) test [31] and the REgression with Subsequent Independence Test (RESIT) [42, 24, 12] are presented in Appendix A. From the results in Appendix A, one can easily derive a double-robustness property for both GCM and RESIT, implying that not all models need to be correctly specified or trained with the correct inductive biases for Type-I error control.

3.1 Significance Test of Feature Relevance (STFR)

The STFR method studied by Dai et al. [7] offers a scalable approach for conducting conditional independence testing by comparing the performance of two predictors. To apply this method, we first train two predictors ĝ^(m)_1 : X × Z → Y and ĝ^(m)_2 : Z → Y on the training set D^(m)_tr to predict Y given (X, Z) and Z, respectively. We assume that candidates for ĝ^(m)_2 are models in the same class as ĝ^(m)_1 but with X replaced by null entries. Using samples from the test set D^(n)_te, we conduct the test by rejecting H_0 : X ⊥ Y | Z if the statistic Λ^(n,m) := √n T̄^(n,m)/σ̂^(n,m) exceeds τ_α := Φ^{-1}(1 − α), where α ∈ (0, 1) is the significance level. We define T̄^(n,m) and σ̂^(n,m) as

T̄^(n,m) := (1/n) Σ_{i=1}^n T^(m)_i   and   σ̂^(n,m) := [ (1/n) Σ_{i=1}^n (T^(m)_i)^2 − ( (1/n) Σ_{i=1}^n T^(m)_i )^2 ]^{1/2},   (3.1)

with T^(m)_i := ℓ(ĝ^(m)_2(Z_i), Y_i) − ℓ(ĝ^(m)_1(X_i, Z_i), Y_i) + ε_i. Here, ℓ is a loss function, typically the one used during the training phase, and {ε_i}_{i=1}^n iid∼ N(0, ρ^2) are small artificial random noises that prevent σ̂^(n,m) from vanishing as the training set grows, thus allowing the asymptotic distribution of Λ^(n,m) to be standard normal under H_0 : X ⊥ Y | Z. If the p-value is defined as p(D^(n)_te, D^(m)_tr) = 1 − Φ(Λ^(n,m)), the test is equivalently given by

φ^STFR_α(D^(n)_te, D^(m)_tr) := 1 if p(D^(n)_te, D^(m)_tr) ≤ α, and 0 otherwise.   (3.2)

The rationale behind STFR is that if H_0 : X ⊥ Y | Z holds, then ĝ^(m)_1 and ĝ^(m)_2 should have similar performance on the test set. On the other hand, if H_0 does not hold, we expect ĝ^(m)_1 to have significantly better performance, and then we would reject the null hypothesis.
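A minimal sketch of the STFR statistic and p-value computation in Eqs. (3.1)-(3.2) follows, assuming the two predictors have already been fitted; the default squared-error loss, the noise scale rho, and the function names are illustrative choices rather than the authors' reference implementation.

```python
import numpy as np
from scipy.stats import norm

def stfr_pvalue(g1_pred, g2_pred, Y, loss=lambda yhat, y: (yhat - y) ** 2, rho=1e-3, seed=0):
    """g1_pred = ghat1(X_i, Z_i) and g2_pred = ghat2(Z_i), both evaluated on the test set."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(scale=rho, size=len(Y))         # artificial noise keeps sigma_hat away from 0
    T = loss(g2_pred, Y) - loss(g1_pred, Y) + eps    # T_i^(m) in Eq. (3.1)
    Lambda = np.sqrt(len(T)) * T.mean() / T.std()    # Lambda^(n,m)
    return 1 - norm.cdf(Lambda)                      # reject H0 if the p-value is <= alpha

# Example usage with placeholder predictions:
# p = stfr_pvalue(g1.predict(np.c_[X_te, Z_te]), g2.predict(Z_te), Y_te)
# reject = p <= 0.10
```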
That said, to control STFR's Type-I error, it is necessary that the risk gap between ĝ^(m)_1 and ĝ^(m)_2,

E_P[ℓ(ĝ^(m)_2(Z), Y) | D^(m)_tr] − E_P[ℓ(ĝ^(m)_1(X, Z), Y) | D^(m)_tr],

vanishes under H_0 as the training set size increases. Moreover, we need the risk gap to be positive under the alternative for the test to have non-trivial power. These conditions can be met if the risk gap of g*_{1,P} and g*_{2,P}, the limiting models of ĝ^(m)_1 and ĝ^(m)_2, is the same as the risk gap of the Bayes predictors f*_{1,P} ∈ argmin_{f_1} E_P[ℓ(f_1(X, Z), Y)] and f*_{2,P} ∈ argmin_{f_2} E_P[ℓ(f_2(Z), Y)], where the minimization is over the set of all measurable functions (we assume f*_{1,P} and f*_{2,P} to be well-defined and unique). However, the risk gap between ĝ^(m)_1 and ĝ^(m)_2 will typically not vanish if g*_{1,P} and g*_{2,P} are not the Bayes predictors, even under H_0. In general, we should expect g*_{1,P} to perform better than g*_{2,P} because the second predictor does not depend on X. Furthermore, their risk gap can be non-positive under the alternative even if f*_{1,P} performs better than f*_{2,P}. In Appendix A.2, we present two examples in which model misspecification plays an important role when conducting STFR. The examples show that Type-I error control and/or power can be compromised due to model misspecification.

To derive theoretical results, we adapt the assumptions from Dai et al. [7]:

Assumption 3.1. There are functions g*_{1,P}, g*_{2,P} and a constant γ > 0 such that both

| E_P[ℓ(ĝ^(m)_2(Z), Y) | D^(m)_tr] − E_P[ℓ(g*_{2,P}(Z), Y)] |   and   | E_P[ℓ(ĝ^(m)_1(X, Z), Y) | D^(m)_tr] − E_P[ℓ(g*_{1,P}(X, Z), Y)] |

are O_P(m^{-γ}).

Assumption 3.2. There exists a constant k > 0 such that E_P[|T^(m)_1|^{2+k} | D^(m)_tr] = O_P(1) as m → ∞.

Assumption 3.3. For every P ∈ P, there exists a constant σ^2_P > 0 such that Var_P[T^(m)_1 | D^(m)_tr] − σ^2_P = o_P(1) as m → ∞, and inf_{P∈P} σ^2_P > 0.

Finally, we present the results for this section. We start with an extension of Theorem 2 of Dai et al. [7] to the case of misspecified inductive biases.

Theorem 3.4. Suppose that Assumptions 3.1, 3.2, and 3.3 hold. If n is a function of m such that n → ∞ and n = o(m^{2γ}) as m → ∞, then

E_P[φ^STFR_α(D^(n)_te, D^(m)_tr)] = 1 − Φ( τ_α − √(n/σ^2_P) Ω^STFR_P ) + o(1),

where o(1) denotes uniform convergence over all P ∈ P as m → ∞ and

Ω^STFR_P := E_P[ℓ(g*_{2,P}(Z), Y)] − E_P[ℓ(g*_{1,P}(X, Z), Y)].

Theorem 3.4 demonstrates that the performance of STFR depends on the limiting models g*_{1,P} and g*_{2,P}. Specifically, if Ω^STFR_P > 0, then E_P[φ^STFR_α(D^(n)_te, D^(m)_tr)] → 1 even if H_0 : X ⊥ Y | Z holds. In practice, we should expect Ω^STFR_P > 0 because of how we set the class for ĝ^(m)_2. In contrast, we could have Ω^STFR_P ≤ 0, and then E_P[φ^STFR_α(D^(n)_te, D^(m)_tr)] ≤ α + o(1), even if the gap between the Bayes predictors is positive. See Appendix A.2 for examples of both scenarios. Next, we provide Corollary 3.6 to clarify the relationship between testing errors and misspecification errors. The corollary formalizes the intuition that controlling Type-I error is directly related to the misspecification of g*_{2,P}, while minimizing Type-II error is directly related to the misspecification of g*_{1,P}.

Definition 3.5. For a distribution P and a loss function ℓ, define the misspecification gaps

∆_{1,P} := E_P[ℓ(g*_{1,P}(X, Z), Y)] − E_P[ℓ(f*_{1,P}(X, Z), Y)]   and   ∆_{2,P} := E_P[ℓ(g*_{2,P}(Z), Y)] − E_P[ℓ(f*_{2,P}(Z), Y)].

The misspecification gaps in Definition 3.5 quantify the difference between the limiting predictors g*_{1,P}, g*_{2,P} and the Bayes predictors f*_{1,P}, f*_{2,P}, i.e., they give a measure of misspecification for g*_{1,P} and g*_{2,P}. Corollary 3.6 implies that STFR controls Type-I error asymptotically if ∆_{2,P} = 0, and guarantees non-trivial power if the degree of misspecification of g*_{1,P} is not large compared to the performance difference of the Bayes predictors ∆_P, that is, when ∆_P − ∆_{1,P} > 0.

Corollary 3.6 (Bounding testing errors). Suppose we are under the conditions of Theorem 3.4.
(Type-I error) If H_0 : X ⊥ Y | Z holds, then

E_P[φ^STFR_α(D^(n)_te, D^(m)_tr)] ≤ 1 − Φ( τ_α − √(n/σ^2_P) ∆_{2,P} ) + o(1),

where o(1) denotes uniform convergence over all P ∈ P_0 as m → ∞.
(Type-II error) In general, we have

1 − E_P[φ^STFR_α(D^(n)_te, D^(m)_tr)] ≤ Φ( τ_α − √(n/σ^2_P) (∆_P − ∆_{1,P}) ) + o(1),

where o(1) denotes uniform convergence over all P ∈ P as m → ∞ and ∆_P := E_P[ℓ(f*_{2,P}(Z), Y)] − E_P[ℓ(f*_{1,P}(X, Z), Y)].
4 A robust regression-based conditional independence test

This section introduces the Rao-Blackwellized Predictor Test (RBPT), a misspecification-robust conditional independence test based on ideas from both regression- and simulation-based CI tests. The RBPT assumes that we can implicitly or explicitly approximate the conditional distribution of X | Z and does not require inductive biases to be correctly specified. Because RBPT involves comparing the performance of two predictors and requires an approximation of the distribution of X | Z, we can directly compare it with the STFR [7] and the conditional randomization/permutation tests (CRT/CPT) [5, 4]. The RBPT can control Type-I error under relatively weaker assumptions than these other tests, allowing some misspecified inductive biases.

The RBPT can be summarized as follows: (i) we train ĝ^(m), which predicts Y given (X, Z), using D^(m)_tr; (ii) we obtain the Rao-Blackwellized predictor h^(m) by smoothing ĝ^(m), i.e., h^(m)(z) := ∫ ĝ^(m)(x, z) dP_{X|Z=z}(x); (iii) we compare its performance with ĝ^(m)'s using the test set D^(n)_te and a convex loss function ℓ (not necessarily the one used to train ĝ^(m)); and (iv) if the performance of ĝ^(m) is statistically better than h^(m)'s, we reject H_0 : X ⊥ Y | Z. The loss ℓ(ŷ, y) needs to be convex with respect to its first entry (ŷ) for all y. The test-set and training-set sizes, as well as the loss function ℓ, can be chosen using the heuristics introduced by Dai et al. [7].

Algorithm 1: Obtaining the p-value for the RBPT
1 Input: (i) test set D^(n)_te = {(X_i, Y_i, Z_i)}_{i=1}^n, (ii) initial predictor ĝ^(m), (iii) conditional distribution estimate Q̂^(m)_{X|Z}, (iv) convex loss function ℓ;
2 Output: p-value p;
3 For each i ∈ [n], get ĥ^(m)(Z_i) = ∫ ĝ^(m)(x, Z_i) dQ̂^(m)_{X|Z=Z_i}(x);
4 Compute Ξ^(n,m) := √n T̄^(n,m)/σ̂^(n,m), where T̄^(n,m) := (1/n) Σ_{i=1}^n T^(m)_i with T^(m)_i := ℓ(ĥ^(m)(Z_i), Y_i) − ℓ(ĝ^(m)(X_i, Z_i), Y_i) and σ̂^(n,m) is the {T_i}'s sample standard deviation (Eq. 3.1);
5 return p = 1 − Φ(Ξ^(n,m)).

The procedure described here bears a resemblance to the Rao-Blackwellization of estimators. In classical statistics, the Rao-Blackwell theorem [17] states that by taking the conditional expectation of an estimator with respect to a sufficient statistic, we obtain a better estimator when the loss function is convex. In our case, the variable Z can be seen as a "sufficient statistic" for Y under the conditional independence hypothesis H_0 : X ⊥ Y | Z. If H_0 holds and the loss ℓ(ŷ, y) is convex in its first argument, we can show using Jensen's inequality that the resulting model h^(m) has a lower risk than the initial model ĝ^(m), i.e., E_P[ℓ(h^(m)(Z), Y) | D^(m)_tr] − E_P[ℓ(ĝ^(m)(X, Z), Y) | D^(m)_tr] ≤ 0. Thus, the risk gap in RBPT is non-positive under H_0, in contrast with STFR's risk gap, which we should expect to be non-negative given the definition of ĝ^(m)_2 in that case. This fact negatively biases the RBPT test statistic, enabling better Type-I error control.

In practice, we cannot compute h^(m) exactly because P_{X|Z} is usually unknown. We therefore use an approximation Q̂^(m)_{X|Z}, which can be given explicitly, e.g., by probabilistic classifiers or conditional density estimators [13], or implicitly, e.g., by generative adversarial networks (GANs) [22, 3]. We assume that Q̂^(m)_{X|Z} is obtained using the training set. The approximate Rao-Blackwellized predictor is ĥ^(m)(z) := ∫ ĝ^(m)(x, z) dQ̂^(m)_{X|Z=z}(x), where the integral can be solved numerically when Q̂^(m)_{X|Z} has a known probability mass function or Lebesgue density (e.g., via the trapezoidal rule) or via Monte Carlo integration when we can only sample from Q̂^(m)_{X|Z}. Finally, for a fixed significance level α ∈ (0, 1), the test φ^RBPT_α is given by Equation 3.2, where the p-value is obtained via Algorithm 1.
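A minimal sketch of Algorithm 1 follows, using Monte Carlo integration to approximate ĥ^(m) when we can only sample from Q̂^(m)_{X|Z}. The interfaces g(X, Z) and sample_x_given_z, as well as the default squared-error loss, are illustrative assumptions rather than the authors' code.

```python
import numpy as np
from scipy.stats import norm

def rbpt_pvalue(g, sample_x_given_z, X, Z, Y, loss=lambda yhat, y: (yhat - y) ** 2, n_mc=100, seed=0):
    """g(X, Z): fitted predictor; sample_x_given_z(Z, rng): one draw of X from Qhat_{X|Z} per row of Z."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    # Step 3: hhat(Z_i) ≈ (1/n_mc) * sum_k g(x_ik, Z_i), with x_ik ~ Qhat_{X|Z=Z_i}
    h = np.zeros(n)
    for _ in range(n_mc):
        Xk = sample_x_given_z(Z, rng)
        h += g(Xk, Z) / n_mc
    # Step 4: statistic Xi^(n,m) from T_i = loss(hhat(Z_i), Y_i) - loss(g(X_i, Z_i), Y_i)
    T = loss(h, Y) - loss(g(X, Z), Y)
    Xi = np.sqrt(n) * T.mean() / T.std()
    # Step 5: one-sided p-value; reject H0: X ⊥ Y | Z when p <= alpha
    return 1 - norm.cdf(Xi)
```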
Before presenting results for the RBPT, we introduce some assumptions. Let Q*_{X|Z} represent the limiting model for Q̂^(m)_{X|Z}. The conditional distribution Q*_{X|Z} depends on the underlying distribution P, but we omit additional subscripts for ease of notation. Assumption 4.1 defines the limiting models and fixes a convergence rate.

Assumption 4.1. There are a function g*_P, a conditional distribution Q*_{X|Z}, and a constant γ > 0 such that

‖ĝ^(m)(X, Z) − g*_P(X, Z)‖_2 = O_P(m^{-γ})   and   E_P[ d_TV(Q̂^(m)_{X|Z}, Q*_{X|Z}) | D^(m)_tr ] = O_P(m^{-γ}),

where d_TV denotes the total variation (TV) distance. Additionally, assume that both Q̂^(m)_{X|Z} and Q*_{X|Z} are dominated by a common σ-finite measure that does not depend on Z or m.

The common dominating measure in Assumption 4.1 could be, for example, the Lebesgue measure on R^{d_X}. Next, Assumption 4.2 imposes additional constraints on the limiting model Q*_{X|Z}: the limiting misspecification level must be uniformly bounded over all P ∈ P.

Assumption 4.2. For all P ∈ P, the chi-square divergence χ^2(Q*_{X|Z} || P_{X|Z}) := ∫ (dQ*_{X|Z}/dP_{X|Z}) dQ*_{X|Z} − 1 is a well-defined integrable random variable and sup_{P∈P} E_P[ χ^2(Q*_{X|Z} || P_{X|Z}) ] < ∞.

Now, assume ĝ^(m) is chosen from a model class G^(m). Assumption 4.3 imposes constraints on the model classes {G^(m)} and the loss function ℓ.

Assumption 4.3. Assume (i) sup_{g∈G^(m)} sup_{(x,z)∈X×Z} ‖g(x, z)‖_1 ≤ M < ∞ for some real M > 0, uniformly over all m, and (ii) that ℓ is an L-Lipschitz loss function (with respect to its first argument) for some L > 0, i.e., for any ŷ, ŷ′, y ∈ Y, we have |ℓ(ŷ, y) − ℓ(ŷ′, y)| ≤ L ‖ŷ − ŷ′‖_2.

Assumption 4.3 holds by construction since we choose G^(m) and the loss function ℓ. The assumption is satisfied when, for example, (a) the models in ∪_m G^(m) are uniformly bounded, (b) ℓ(ŷ, y) = ‖ŷ − y‖^p_p with p ≥ 1, and (c) Y is a bounded subset of R^{d_Y}, i.e., in classification problems and most practical regression problems. The loss ℓ(ŷ, y) = ‖ŷ − y‖^p_p with p ≥ 1 is also convex with respect to its first entry and is therefore a suitable loss for RBPT. It is important to emphasize that ℓ does not need to be the loss function used during the training phase. For example, we could use ℓ(ŷ, y) = ‖ŷ − y‖^2_2 in classification problems, where y is a one-hot encoded class label and ŷ is a vector of predicted probabilities given by a model trained with the cross-entropy loss.

Theorem 4.4. Suppose that Assumptions 3.2, 3.3, 4.1, 4.2, and 4.3 hold. If n is a function of m such that n → ∞ and n = o(m^γ) as m → ∞, then

E_P[φ^RBPT_α(D^(n)_te, D^(m)_tr)] = 1 − Φ( τ_α − √(n/σ^2_P) Ω^RBPT_P ) + o(1),

where o(1) denotes uniform convergence over all P ∈ P as m → ∞ and Ω^RBPT_P = Ω^RBPT_{P,1} − Ω^RBPT_{P,2}, with

Ω^RBPT_{P,1} := E_P[ ℓ( ∫ g*_P(x, Z) dQ*_{X|Z}(x), Y ) ] − E_P[ ℓ( ∫ g*_P(x, Z) dP_{X|Z}(x), Y ) ],

Ω^RBPT_{P,2} := E_P[ ℓ(g*_P(X, Z), Y) ] − E_P[ ℓ( ∫ g*_P(x, Z) dP_{X|Z}(x), Y ) ]   (Jensen's gap).

When H_0 : X ⊥ Y | Z holds and ℓ is a strictly convex loss function (w.r.t. its first entry), we have Ω^RBPT_{P,2} > 0, which allows some room for the "incorrectness" of Q*_{X|Z}. (In practice, we do not need ℓ to be strictly convex for the Jensen's gap to be positive; it is necessary, however, that g*_P depends on X under H_0, a condition that usually holds when g*_P is not the Bayes predictor.) That is, from Theorem 4.4, as long as Ω^RBPT_P ≤ 0, i.e., if Q*_{X|Z}'s incorrectness (measured by Ω^RBPT_{P,1}) is not as large as the Jensen's gap Ω^RBPT_{P,2}, RBPT has asymptotic Type-I error control. Uniform asymptotic Type-I error control is possible if sup_{P∈P_0} Ω^RBPT_P ≤ 0.
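As an aside, the inequality Ω^RBPT_{P,2} ≥ 0 under H_0 follows from a short Jensen argument; the sketch below is our own reconstruction of the step referenced in this section, written in the paper's notation.

```latex
% Under H0 (X independent of Y given Z), the conditional law of X given (Z, Y)
% equals P_{X|Z}, so the smoothed predictor satisfies
%   h*(Z) = \int g^*_P(x, Z)\, dP_{X|Z}(x) = E_P[ g^*_P(X, Z) \mid Z, Y ].
% Jensen's inequality (convexity of \ell in its first argument) then gives
\mathbb{E}_P\!\left[\ell\!\left(\int g^*_P(x,Z)\,dP_{X|Z}(x),\,Y\right)\right]
  = \mathbb{E}_P\!\left[\ell\big(\mathbb{E}_P[g^*_P(X,Z)\mid Z,Y],\,Y\big)\right]
  \le \mathbb{E}_P\!\left[\ell\big(g^*_P(X,Z),\,Y\big)\right],
% i.e., \Omega^{\mathrm{RBPT}}_{P,2} \ge 0, with strict inequality when \ell is
% strictly convex and g^*_P genuinely depends on X.
```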
This is a considerable improvement over previous work (e.g., STFR, GCM, RESIT, CRT, CPT), since no model needs to converge to the ground truth as long as Ω^RBPT_{P,1} ≤ Ω^RBPT_{P,2}, which is a weaker condition. Note, however, that a small Ω^RBPT_{P,2} reduces the room for Q*_{X|Z}'s incorrectness. In the extreme case, when g*_P is the Bayes predictor, and therefore does not depend on X under H_0, we need Q*_{X|Z} = P_{X|Z} almost surely. (In this case, Assumption 3.3 does not hold; if we have high confidence that the models converge to the ground truth, we need to include artificial noises in the definition of T_i, as done in STFR by Dai et al. [7].) On the other hand, if g*_P is close to the Bayes predictor, RBPT has better power. This imposes an expected trade-off between Type-I error control and power. To compare with Berrett et al. [4]'s results for CRT and CPT, we can express our remark in terms of the TV distance between Q*_{X|Z} and P_{X|Z}. It can be shown that if E_P[d_TV(Q*_{X|Z}, P_{X|Z})] ≤ Ω^RBPT_{P,2}/(2ML), then Type-I error control is guaranteed (see Appendix A.5). This contrasts with Berrett et al. [4]'s results because E_P[d_TV(Q*_{X|Z}, P_{X|Z})] = 0 is not needed. We conclude this section with some relevant observations related to the RBPT.

On RBPT's power. Like STFR, non-trivial power is guaranteed if the predictor g*_P is good enough. Indeed, the second part of Corollary 3.6 can be applied to obtain an upper bound on RBPT's Type-II error.

Semi-supervised learning. Let Y denote a label variable. Situations in which unlabeled samples (X_i, Z_i) are abundant while labeled samples (X_i, Y_i, Z_i) are scarce happen in real applications of conditional independence testing [5, 4]. RBPT is well suited to those cases because the practitioner can use the abundant data to estimate P_{X|Z} flexibly. The semi-supervised learning scenario also applies to RBPT2, which we describe next.

Running RBPT when it is hard to estimate P_{X|Z}: the RBPT2. There might be situations in which it is hard to estimate the full conditional distribution P_{X|Z}. An alternative approach is to estimate the Rao-Blackwellized predictor directly using a second regressor. After training ĝ^(m), we use the training set, organized in pairs {(Z_i, ĝ^(m)(X_i, Z_i))}, to train a second predictor ĥ^(m) that predicts ĝ^(m)(X, Z) given Z. That predictor can be trained to minimize the mean squared error. The model ĥ^(m) should be more complex than ĝ^(m), in the sense that we should hope the first model performs better than the second under H_0 in predicting Y. Consequently, this approach is most effective when unlabeled samples are abundant, so that we can train ĥ^(m) using both the unlabeled data and the given training set. After obtaining ĥ^(m), the test is conducted as before. We name this version of RBPT "RBPT2". We include a note on how to adapt Theorem 4.4 for RBPT2 in Appendix A.6.
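Before moving to the experiments, here is a minimal sketch of the RBPT2 variant just described. The choice of a kernel ridge regressor with a polynomial kernel mirrors one of the choices used later in the experiments, but the interface and hyperparameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.kernel_ridge import KernelRidge

def rbpt2_pvalue(g, X_tr, Z_tr, X_te, Z_te, Y_te, loss=lambda yhat, y: (yhat - y) ** 2):
    """g(X, Z): already-fitted first-stage predictor of Y given (X, Z)."""
    # Second stage: learn z -> E[g(X, Z) | Z = z] by least squares on (Z_i, g(X_i, Z_i))
    # pairs; extra unlabeled (X, Z) pairs can be appended to the training arrays here.
    h = KernelRidge(kernel="poly", degree=2).fit(Z_tr, g(X_tr, Z_tr))
    T = loss(h.predict(Z_te), Y_te) - loss(g(X_te, Z_te), Y_te)
    Xi = np.sqrt(len(T)) * T.mean() / T.std()
    return 1 - norm.cdf(Xi)       # same one-sided test as in Algorithm 1
```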
5 Experiments

We empirically analyze RBPT/RBPT2 in the following experiments and compare them with relevant benchmarks, especially when the models used are misspecified (code is available at https://github.com/felipemaiapolo/cit). We assume α = 10% and ℓ(ŷ, y) = (ŷ − y)^2. The benchmarks encompass STFR [7], GCM [31], and RESIT [42], which represent regression-based CI tests. Furthermore, we examine the conditional randomization/permutation tests (CRT/CPT) [5, 4], which require the estimation of P_{X|Z}.

Artificial data experiments. Our setup takes inspiration from Berrett et al. [4], and the data are generated as

Z ∼ N(0, I_{d_Z}),   X | Z ∼ N((b^⊤ Z)^2, 1),   Y | X, Z ∼ N(cX + a^⊤ Z + γ(b^⊤ Z)^2, 1).

Here, d_Z denotes the dimensionality of Z, a and b are sampled from N(0, I_{d_Z}), the constant c determines the conditional dependence of X and Y given Z, and the parameter γ dictates the hardness of conditional independence testing: a non-zero γ implies potential challenges in Type-I error control, as there might be a pronounced marginal dependence between X and Y under H_0. Moreover, the training (resp. test) dataset consists of 800 (resp. 200) entries, and every predictor we employ is a linear regression. RESIT employs Spearman's correlation between residuals as a test statistic, while CRT and CPT deploy STFR's test statistic, all of them with p-values determined by conditional sampling/permutations executed 100 times (B = 100), considering Q̂_{X|Z} = N((b^⊤ Z)^2 + θ, 1). (For the CPT execution, the Python script at http://www.stat.uchicago.edu/~rina/cpt.html was used, operating a single MCMC chain and preserving all other parameters as defined by the original authors.) The value of θ gives the error level in approximating P_{X|Z}. To get ĥ in the two variations of RBPT, we either use Q̂_{X|Z} (RBPT) or kernel ridge regression (KRR) equipped with a polynomial kernel to predict ĝ_1(X, Z) from Z (RBPT2). We sample the generative parameters (a, b) five times using different random seeds, and for each iteration, we conduct 480 Monte Carlo simulations to estimate Type-I error and power. The presented results are the average (± standard deviation) estimated Type-I error/power across iterations.
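For concreteness, the sketch below shows the synthetic data-generating process and the (possibly misspecified) conditional Q̂_{X|Z} used in these simulations; the exact values of d_Z, c, γ, and θ are illustrative, and the function names are our own.

```python
import numpy as np

def generate_data(n, a, b, c=0.0, gamma=1.0, rng=None):
    """a, b: fixed generative parameters, drawn once per repetition from N(0, I_{d_Z})."""
    rng = rng or np.random.default_rng()
    Z = rng.normal(size=(n, len(a)))
    X = (Z @ b) ** 2 + rng.normal(size=n)                          # X | Z ~ N((b'Z)^2, 1)
    Y = c * X + Z @ a + gamma * (Z @ b) ** 2 + rng.normal(size=n)  # c = 0 gives H0: X ⊥ Y | Z
    return X, Y, Z

def qhat_sampler(b, theta=0.0):
    """Returns a sampler (Z, rng) -> X drawn from Qhat_{X|Z} = N((b'Z)^2 + theta, 1)."""
    return lambda Z, rng: (Z @ b) ** 2 + theta + rng.normal(size=len(Z))

# Example: one repetition under H0 with a misspecified Qhat (theta > 0)
rng = np.random.default_rng(0)
a, b = rng.normal(size=30), rng.normal(size=30)
X, Y, Z = generate_data(1000, a, b, c=0.0, gamma=1.0, rng=rng)
sampler = qhat_sampler(b, theta=0.5)   # can be plugged into the RBPT sketch above
```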
Figure 2: Type-I error rates (c = 0). In the first two plots, we set θ = 0 for RBPT, permitting Type-I error control across different d_Z values (Theorem 4.4), while RBPT2 allows Type-I error control for moderate d_Z. All the baselines fail to control Type-I error regardless of d_Z. The last two plots illustrate that CRT emerges as the least robust test in this context, succeeded by RBPT and CPT.

In Figure 2 (resp. Figure 4) we compare our methods' Type-I error rates (with c = 0) (resp. power) against the benchmarks. Regarding Figure 2, we focus on regression-based tests (STFR, GCM, and RESIT) in the first two plots and on simulation-based tests (CRT and CPT) in the last two plots. In the first two plots, it is not straightforward to compare the level of misspecification between our methods and the benchmarks, so we use this as an opportunity to illustrate Theorem 4.4 and the results from Section 3 and Appendix A. Fixing θ = 0 for RBPT, the Rao-Blackwellized predictor h is obtained exactly, permitting Type-I error control regardless of the chosen d_Z. Using KRR for RBPT2 makes ĥ close to h when d_Z is not large, which permits Type-I error control. When d_Z is large, more data are needed to fit ĥ, which can be accomplished using unlabeled data, as demonstrated in Figure 3 and discussed in Section 4. On the other hand, Type-I error control is always violated for STFR, GCM, and RESIT as γ grows. Regarding the final two plots, we can more readily assess the robustness of the methods when discrepancies arise between Q̂_{X|Z} and P_{X|Z}, as influenced by varying θ: Figure 2 illustrates that CRT is the least robust test in this context, followed by RBPT and CPT.

Figure 3: Making RBPT2 more robust using unlabeled data. With d_Z = 40, we gradually increase the unlabeled sample size from 0 to 1000 when fitting ĥ. The results show that a larger unlabeled sample size leads to effective Type-I error control. Even though we present this result for RBPT2, the same pattern is expected for RBPT in the presence of unlabeled data.

In Figure 4, we investigate how powerful RBPT and RBPT2 can be in practice when d_Z = 30. We compare our methods with CPT (with θ = 0), which shows practical robustness against misspecified inductive biases. Figure 4 shows that RBPT2 and CPT have similar power, while RBPT is slightly more conservative.

Figure 4: Power curves for different methods. We compare our methods with CPT (when θ = 0), which seems to have practical robustness against misspecified inductive biases. RBPT2 and CPT have similar power, while RBPT is slightly more conservative.

Some concluding remarks are in order. First, RBPT and RBPT2 have shown themselves to be practical and robust alternatives for conditional independence testing, exhibiting reasonable Type-I error control, mainly when employed in conjunction with a large unlabeled dataset, as well as reasonable power. Second, while CPT demonstrates notable robustness and relatively good power, its practicality falls short compared to RBPT (or RBPT2). This is because CPT needs a known density functional form for Q̂_{X|Z} (plus the execution of MCMC chains), whereas RBPT (resp. RBPT2) can rely on conventional Monte Carlo integration using samples from Q̂_{X|Z} (resp. supervised learning).

Real data experiments. For our subsequent experiments, we employ the car insurance dataset examined by Angwin et al. [2]. This dataset encompasses four US states (California, Illinois, Missouri, and Texas) and includes information from numerous insurance providers compiled at the ZIP code granularity. The data offer a risk metric and the insurance price charged to a hypothetical customer with fixed attributes from every ZIP code. ZIP codes are categorized as either minority or non-minority, contingent on the percentage of non-white residents. The variables in consideration are Z, denoting the driving risk; X, an indicator for minority ZIP codes; and Y, signifying the insurance price. A pertinent question revolves around the validity of the null hypothesis H_0 : X ⊥ Y | Z, essentially asking whether demographic biases influence pricing. We split our experiments into two parts. In the first part, our primary goal is to compare the Type-I error rate across the various tests. To ensure that H_0 is valid, we discretize Z into twenty distinct values and shuffle the Y values corresponding to each discrete Z value. If a test maintains Type-I error control, we expect it to reject H_0 for at most α = 10% of the companies in each state. In the second part, we focus on assessing the power of our methods. Given our lack of ground truth, we qualitatively compare the RBPT and RBPT2 findings with those obtained by the baseline methods and those delineated by Angwin et al. [2], who used a detailed and multifaceted approach. In this last experiment, we aggregate the analysis for each state without conditioning on the firm.
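The sketch below illustrates the shuffling step used to enforce H_0 in the first part of the real-data experiment: discretize Z and permute Y within each discrete Z value. The column names and the quantile-binning choice are assumptions about the data layout, not the authors' code.

```python
import numpy as np
import pandas as pd

def enforce_null(df, z_col="risk", y_col="price", n_bins=20, seed=0):
    """Returns a copy of df in which X ⊥ Y | (discretized Z) holds by construction."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out["z_bin"] = pd.qcut(out[z_col], q=n_bins, labels=False, duplicates="drop")
    # Permute Y within each discrete Z value, breaking any dependence beyond what Z explains
    out[y_col] = out.groupby("z_bin")[y_col].transform(
        lambda y: rng.permutation(y.to_numpy())
    )
    return out.drop(columns="z_bin")
```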
We resort to logistic regression for estimating the distribution of X | Z used by RBPT, GCM, CRT, and CPT. For RBPT2, we use a CatBoost regressor [26] to obtain the Rao-Blackwellized predictor. We omit RESIT in this experiment, as the additive model assumption is inappropriate. Both CRT and CPT utilize the same test statistic as STFR.

Figure 5: Type-I error control and power analysis using car insurance data [2]. The first plot shows that RBPT and RBPT2 have better control over Type-I errors compared to all other methods, including CPT. The second plot reveals that all methods give the same qualitative result, corroborating the findings of Angwin et al. [2] and suggesting that RBPT and RBPT2 can have good power while being more robust to Type-I errors.

The first plot of Figure 5 shows that RBPT and RBPT2 have better control over Type-I errors compared to all other methods, including CPT (we run the experiment for 48 different random seeds and report the average Type-I error rate). The second plot reveals that all methods give the same qualitative result: discrimination against minorities in ZIP codes is most evident in Illinois, followed by Texas, Missouri, and California. These findings corroborate those of Angwin et al. [2], indicating that our methodology has satisfactory power while maintaining robust Type-I error control.

6 Conclusion

In this work, we theoretically and empirically showed that widely used regression-based conditional independence tests are sensitive to the specification of inductive biases. Furthermore, we introduced the Rao-Blackwellized Predictor Test (RBPT), a misspecification-robust conditional independence test. RBPT is theoretically grounded and has been shown to perform well in practical situations compared to benchmarks.

Limitations and future work. Two limitations of RBPT are that (i) its robustness can lead to a more conservative test, as we have seen in the simulations, and (ii) it requires the estimation of the conditional distribution P_{X|Z}, which can be challenging. To overcome the second problem, we introduced a variation of RBPT, named RBPT2, in which the Rao-Blackwellized predictor is obtained in a supervised fashion by fitting a second model ĥ : Z → Y that predicts the outputs of the first model ĝ : X × Z → Y. However, this solution only works if ĥ is better than ĝ in predicting Y under H_0, which ultimately depends on the model class for ĥ and how that model is trained. Future research directions include (i) theoretically studying the power of RBPT in more detail and (ii) better understanding RBPT2 from a theoretical or methodological point of view, e.g., answering questions on how to choose and train the second model.

7 Acknowledgements

This paper is based upon work supported by the National Science Foundation (NSF) under grants no. 1916271, 2027737, 2113373, and 2113364.

References

[1] Chunrong Ai, Li-Hsien Sun, Zheng Zhang, and Liping Zhu. Testing unconditional and conditional independence via mutual information. Journal of Econometrics, 2022.
[2] Julia Angwin, Jeff Larson, Lauren Kirchner, and Surya Mattu. Minority neighborhoods pay higher car insurance premiums than white areas with the same risk, April 2017. URL https://www.propublica.org/article/minority-neighborhoods-higher-car-insurance-premiums-white-areas-same-risk. [Online; last accessed 09-September-2022].
[3] Alexis Bellot and Mihaela van der Schaar. Conditional independence testing using generative adversarial networks. Advances in Neural Information Processing Systems, 32, 2019.
[4] Thomas B Berrett, Yi Wang, Rina Foygel Barber, and Richard J Samworth. The conditional permutation test for independence while controlling for confounders. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1):175-197, 2020.
[5] Emmanuel Candes, Yingying Fan, Lucas Janson, and Jinchi Lv.
Panning for gold: modelx knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551 577, 2018. [6] Olivier Cappé, Eric Moulines, and Tobias Rydén. Inference in hidden markov models. In Proceedings of EUSFLAT conference, pages 14 16, 2009. [7] Ben Dai, Xiaotong Shen, and Wei Pan. Significance tests of feature relevance for a black-box learner. IEEE Transactions on Neural Networks and Learning Systems, 2022. [8] Gary Doran, Krikamol Muandet, Kun Zhang, and Bernhard Schölkopf. A permutation-based kernel conditional independence test. In UAI, pages 132 141, 2014. [9] Alexander D Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D Hoffman, et al. Underspecification presents challenges for credibility in modern machine learning. Journal of Machine Learning Research, 2020. [10] Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. Advances in neural information processing systems, 20, 2007. [11] Clark Glymour, Kun Zhang, and Peter Spirtes. Review of causal discovery methods based on graphical models. Frontiers in genetics, 10:524, 2019. [12] Patrik Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear causal discovery with additive noise models. Advances in neural information processing systems, 21, 2008. [13] Rafael Izbicki and Ann B. Lee. Converting high-dimensional regression to high-dimensional conditional density estimation. 2017. [14] Dimitris Kalimeris, Gal Kaplun, Preetum Nakkiran, Benjamin Edelman, Tristan Yang, Boaz Barak, and Haofeng Zhang. Sgd on neural networks learns functions of increasing complexity. Advances in neural information processing systems, 32, 2019. [15] Ilmun Kim, Matey Neykov, Sivaraman Balakrishnan, and Larry Wasserman. Local permutation tests for conditional independence. ar Xiv preprint ar Xiv:2112.11666, 2021. [16] Mariusz Kubkowski, Jan Mielniczuk, and Pawel Teisseyre. How to gain on power: Novel conditional independence tests based on short expansion of conditional mutual information. J. Mach. Learn. Res., 22:62 1, 2021. [17] Erich L Lehmann and George Casella. Theory of point estimation. Springer Science & Business Media, 2006. [18] Erich Leo Lehmann, Joseph P Romano, and George Casella. Testing statistical hypotheses, volume 3. Springer, 2005. [19] Chun Li and Xiaodan Fan. On nonparametric conditional independence tests for continuous variables. Wiley Interdisciplinary Reviews: Computational Statistics, 12(3):e1489, 2020. [20] Molei Liu, Eugene Katsevich, Lucas Janson, and Aaditya Ramdas. Fast and powerful conditional randomization testing via distillation. Biometrika, 109(2):277 293, 2022. [21] Alexander Marx and Jilles Vreeken. Testing conditional independence on discrete data using stochastic complexity. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 496 505. PMLR, 2019. [22] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. ar Xiv preprint ar Xiv:1411.1784, 2014. [23] Matey Neykov, Sivaraman Balakrishnan, and Larry Wasserman. Minimax optimal conditional independence testing. The Annals of Statistics, 49(4):2151 2177, 2021. [24] Jonas Peters, Joris M Mooij, Dominik Janzing, and Bernhard Schölkopf. Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15:2009 2053, 2014. 
[25] Felipe Maia Polo, Rafael Izbicki, Evanildo Gomes Lacerda Jr, Juan Pablo Ibieta-Jimenez, and Renato Vicente. A unified framework for dataset shift diagnostics. Information Sciences, 649: 119612, 2023. [26] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. Catboost: unbiased boosting with categorical features. Advances in neural information processing systems, 31, 2018. [27] Sidney Resnick. A probability path. Springer, 2019. [28] Ya acov Ritov, Yuekai Sun, and Ruofei Zhao. On conditional parity as a notion of nondiscrimination in machine learning. ar Xiv preprint ar Xiv:1706.08519, 2017. [29] Jakob Runge. Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information. In International Conference on Artificial Intelligence and Statistics, pages 938 947. PMLR, 2018. [30] Meyer Scetbon, Laurent Meunier, and Yaniv Romano. An asymptotic test for conditional independence using analytic kernel embeddings. ar Xiv preprint ar Xiv:2110.14868, 2021. [31] Rajen D Shah and Jonas Peters. The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics, 48(3):1514 1538, 2020. [32] Chengchun Shi, Tianlin Xu, Wicher Bergsma, and Lexin Li. Double generative adversarial networks for conditional independence testing. J. Mach. Learn. Res., 22:285 1, 2021. [33] Samuel Smith, Erich Elsen, and Soham De. On the generalization benefit of noise in stochastic gradient descent. In International Conference on Machine Learning, pages 9058 9067. PMLR, 2020. [34] Eric V Strobl, Kun Zhang, and Shyam Visweswaran. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. Journal of Causal Inference, 7(1), 2019. [35] Wesley Tansey, Victor Veitch, Haoran Zhang, Raul Rabadan, and David M Blei. The holdout randomization test for feature selection in black box models. Journal of Computational and Graphical Statistics, 31(1):151 162, 2022. [36] Alexandre B Tsybakov. Introduction to nonparametric estimation, 2009. URL https://doi.org/10.1007/b13794. Revised and extended from the, 9(10), 2004. [37] David S Watson and Marvin N Wright. Testing conditional independence in supervised learning algorithms. Machine Learning, 110(8):2107 2129, 2021. [38] Hao Zhang, Shuigeng Zhou, Jihong Guan, and Jun Huan. Measuring conditional independence by independent residuals for causal discovery. ACM Transactions on Intelligent Systems and Technology (TIST), 10(5):1 19, 2019. [39] Hao Zhang, Shuigeng Zhou, Kun Zhang, and Jihong Guan. Residual similarity based conditional independence test and its application in causal discovery. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 5942 5949, 2022. [40] Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Kernel-based conditional independence test and application in causal discovery. ar Xiv preprint ar Xiv:1202.3775, 2012. [41] Lu Zhang and Lucas Janson. Floodgate: inference for model-free variable importance. ar Xiv preprint ar Xiv:2007.01283, 2020. [42] Qinyi Zhang, Sarah Filippi, Seth Flaxman, and Dino Sejdinovic. Feature-to-feature regression for a two-step conditional independence test. In 33rd Conference on Uncertainty in Artificial Intelligence, UAI 2017. Association for Uncertainty in Artificial Intelligence (AUAI), 2017. 
A Extra content A.1 Misspecified inductive biases in modern statistics and machine learning Figure 6: Type-I error rate is contingent on the training algorithm and not solely on the model classes. Unlike PCR, the LASSO fit provides the correct inductive bias in highdimensional regression, controlling Type-I error. We present a toy experiment to empirically demonstrate how the training algorithm can prevent us from accurately estimating the Bayes predictor even when the model class is correctly specified, leading to invalid significance tests. We work in the context of a high-dimensional (overparameterized) regression with a training set of 250 observations and 300 covariates. We use the Significance Test of Feature Relevance test11 (STFR) [7] to conduct the CI test. The data are generated as Z N(0, I300), X | Z N(β Z, 1), Y | X, Z N(β Z, 1) where the first thirty entries of β are set to 1, and the remaining entries are zero. See that X Y | Z and that Y is linearly related to Z and (X, Z), and then the class of linear predictors is correctly specified when predicting Y from Z or (X, Z). To perform the STFR test, we use LASSO ( 1 penalization term added to empirical squared error) and principal components regression (PCR) to train the linear predictors. Since β is sparse, the LASSO fit provides the correct inductive bias while PCR leads to misspecification. We set the significance level to α = 1% and estimate the Type-I error rate for 100 different training sets. Figure 6 provides the Type-I error rate empirical distribution and illustrates that, despite using the same model class for both fitting methods, the training algorithm induces the wrong model in the PCR case, implying an invalid test most of the time. A.2 Examples on when STFR fails Examples A.1 and A.2 show simple situations in which Type-I error control is compromised or the conditional independence test has no power due to model misspecification. As we see in the next examples, Type-I error control is directly related to G2 misspecification, while Type-II error minimization directly relates to G1 misspecification. Example A.1 (No Type-I error control). Suppose Y = Z + Z2 + εy and X = Z2 + εx, where εy, εx N(0, 1) are independent noise variables and Z has finite variance. Consequently, X Y | Z. Let G1 and G2 be the classes of linear regressors with no intercept, i.e., G1 = {g1(x, z) = βxx + βzz : βx, βz R} and G2 = {g2(z) = βzz : βz R}. If ℓdenotes the mean squared error, we have that EP [ℓ(g 2,P (Z), Y )] EP [ℓ(g 1,P (X, Z), Y )] > 0 because the model class G2 is misspecified, that is, it does not contain the Bayes predictor given by the conditional expectation E[Y | Z]. Example A.2 (Powerless test). Suppose Y = Z + sin(X) + εy, where Z, X, εy iid N(0, 1). Consequently, X Y | Z. Define G1 and G2 as in Example A.1. If ℓdenotes the mean squared error, we have that12 EP [ℓ(g 2,P (Z), Y )] EP [ℓ(g 1,P (X, Z), Y )] = 0 even though the different Bayes predictors have a difference in performance. This happens because the model g 1,P is not close enough to the Bayes predictor EP [Y | X, Z]. A.3 Generalized Covariance Measure (GCM) test In the GCM test proposed by Shah and Peters [31], the expected value of the conditional covariance between X and Y given Z is estimated and then tested to determine if it equals zero. To simplify the exposition, we consider X and Y univariate and work in a setup similar to the STFR s. 
If (X, Y, Z) P, the GCM test relies on the observation that we can always write X = f 1,P (Z) + ϵ and Y = f 2,P (Z) + η, 11See Section 3.1 for more details 12Because EP [XY ] = EP [XEP [Y |X]] = EP [X 0] = 0. where f 1,P (Z) = EP [X | Z] and f 2,P (Z) = EP [Y | Z] while the error terms {ϵ, η} have zero mean when conditioned on Z. Consequently, we can write EP [Cov P (X, Y | Z)] = EP [ϵη]. To estimate EP [Cov P (X, Y | Z)], we can first fit two models ˆg(m) 1 : Z X and ˆg(m) 2 : Z Y, that approximate f 1,P and f 2,P, using the training set D(m) tr , and then compute an empirical version of EP [ϵη] = EP [(X f 1,P (Z))(Y f 2,P (Z))] using ˆg(m) 1 , ˆg(m) 2 , and D(n) te . In the GCM test, we reject H0 : X Y | Z if the statistic Γ(n,m) | n T (n,m)/ˆσ(n,m)| exceeds τα/2 Φ 1(1 α/2), depending on the test significance level α (0, 1). Here, T (n,m) and ˆσ(n,m) are defined as in 3.1 with T (m) i (Xi ˆg(m) 1 (Zi))(Yi ˆg(m) 2 (Zi)). If the p-value is defined as p(D(n) te , D(m) tr ) = 2(1 Φ(Γ(n,m))), the test φGCM α (D(n) te , D(m) tr ) is analogously given by Equation 3.2. Like the STFR, the GCM test depends on the models classes and implicitly on the training algorithm. If the limiting models g 1,P and g 2,P are not f 1,P and f 2,P, then Type-I error control is not guaranteed. We introduce definitions and assumptions. Assumption A.3 gives a rate of convergence for the models ˆg(m) j in the mean squared error sense. Definition A.4 gives a definition for the misspecification gaps. Assumption A.3. There are functions g 1,P, g 2,P, and a constant γ > 0 such that EP (ˆg(m) j (Z) g j,P (Z))2 | D(m) tr = OP(m γ), for j = 1, 2 Definition A.4. For each j {1, 2}, define the misspecification gap as δj,P g j,P f j,P. In the next result, we approximate GCM test Type-I error rate and power using the gaps in Definition A.4 and Assumptions A.3, 3.2, and 3.3 applied to this context. Theorem A.5. Suppose that Assumptions 3.2, 3.3, and A.3 hold. If n is a function of m such that n and n = o(mγ) as m , then EP [φGCM α (D(n) te , D(m) tr )] = 1 Φ τα/2 r n σ2 P ΩGCM P + Φ τα/2 r n σ2 P ΩGCM P where o(1) denotes uniform convergence over all P P as m and ΩGCM P EP [Cov P (X, Y | Z)] + EP [δ1,P (Z)δ2,P (Z)] From Theorem A.5, it is possible to verify that if δj,P (Z) is zero for at least one j {1, 2}, i.e., if at least one model converges to the conditional expectation, the GCM test asymptotically controls Type-I error. This can be seen as a double-robustness property of the GCM, which is not present13 in Shah and Peters [31]. If EP [δ1,P (Z)δ2,P (Z)] = 0, then EP [φGCM α (D(n) te , D(m) tr )] 1 as m even when H0 : X Y | Z. Under the alternative, if ΩGCM P = 0, Type-II error approaches 0 asymptotically. A.4 REgression with Subsequent Independence Test (RESIT) As revisited by Zhang et al. [42], the idea behind RESIT is to first residualize Y and X given Z and then test dependence between the residuals. It is similar to GCM, but requires the error terms and Z to be independent. When that assumption is reasonable, one advantage of RESIT over GCM is that it has power against a broader set of alternatives. In this section, we use a permutation test [18, Example 15.2.3] to assess the independence of residuals. We analyse RESIT s Type-I error control. 
If (X, Y, Z) P and (X, Y ) can be modeled as an additive noise model (ANM), that is, we can write X = f 1,P (Z) + ϵ and Y = f 2,P (Z) + η, (A.1) where f 1,P (Z) = EP [X | Z], f 2,P (Z) = EP [Y | Z], and the error terms (ϵ, η) are independent of Z, it is possible to show that X Y | Z ϵ η. To facilitate our analysis14, we consider first fitting two models ˆg(m) 1 and ˆg(m) 2 that approximate f 1,P and f 2,P using the training set D(m) tr and then test the independence of the residuals15 ˆϵi = Xi ˆg(m) 1 (Zi) and ˆηi = Yi ˆg(m) 2 (Zi) using the test set D(n) te . Define (i) (ˆϵ, ˆη) {(ˆϵi, ˆηi)}n i=1 (test set residuals vertically stacked in matrix form) and (ii) (ˆϵ, ˆη)(b) 13This property is clear in our result because we consider data splitting. 14In practice, data splitting is not necessary. However, this procedure helps when theoretically analyzing the method. 15We omit the residuals superscript to ease notation. as one of the B permutations, i.e., consider that we fix ˆϵ and permute ˆη row-wise. Let Ψ be a test statistic and Ψ((ˆϵ, ˆη)) and Ψ((ˆϵ, ˆη)(b)) its evaluation on the original residuals and the b-the permuted set. If the permutation p-value is given by p(D(n) te , D(m) tr ) = 1 + PB b=1 1[Ψ((ˆϵ, ˆη)(b)) Ψ((ˆϵ, ˆη))] 1 + B (A.2) a test φRESIT α aiming level α (0, 1) is given by Equation 3.2. Similarly to STFR and GCM, we consider g 1,P and g 2,P to be the limiting models for ˆg(m) 1 and ˆg(m) 2 . Different from GCM, both models g 1,P and g 2,P are multi-output since X and Y are not necessarily univariate. Now, we introduce some assumptions before we present our result for RESIT. Assumption A.6 gives a rate of convergence for the models ˆg(m) j in the mean squared error sense. Assumption A.6. There are models g 1,P, g 2,P, and a constant γ > 0 such that ˆg(m) j (Z) g j,P (Z) 2 = OP0(m γ), j = 1, 2 Assumption A.7 puts more structure on the distributions of the error terms (ϵ, η) and is a mild assumption. Assumption A.7. Assume that for all P P0, the distribution of (ϵ, η), Pϵ,η, is absolutely continuous with respect to the Lebesgue measure in Rd X Rd Y with L-Lipschitz density pϵ,η for a certain L > 0. That is, for any e1, e2 Rd X and h1, h2 Rd Y , we have |pϵ,η(e1, h1) pϵ,η(e2, h2)| L (e1, h1) (e2, h2) 2 We assume that L does not depend on P. Assumption A.8 states that some of the variables we work with are uniformly almost surely bounded over all P P0. This assumption is realistic in most practical cases. Assumption A.8. There is bounded Borel set A B(Rd X d Y ) such that inf P P0 PP ((X, Y ) A) = 1 inf P P0 inf g1,g2 PP ((g1(Z), g2(Z)) A) = 1 Here, infg1,g2 is taken over the model classes we consider (if the model classes vary with m, consider the union of model classes). In the following, we present the result for RESIT. For that result, let: (i) ϵ ϵ δ1,P (Zi) and η η δ2,P (Zi), where the misspecification gaps are given as in Definition A.4; (ii) d TV represent the total variation (TV) distance between two probability distributions [36]; and (iii) the superscript n, e.g., in P n ϵ ,η , represent a product measure. Theorem A.9. Under Assumptions A.6, A.7, and A.8, if H0 : X Y | Z holds and n is a function of m such that n and n = o(m γ 2 ) as m , then EP [φRESIT α (D(n) te , D(m) tr )] α + min{d TV(P n ϵ ,η , P n ϵ,η ), d TV(P n ϵ ,η , P n ϵ ,η)} + o(1) where o(1) denotes uniform convergence over all P P0 as m . From Theorem A.9, we can see that if at least one of the misspecification gaps δ1,P or δ2,P is null, then EP [φRESIT α (D(n) te , D(m) tr )] α under H0 : X Y | Z. 
A.5 RBPT extra derivation

Let $\mu_X$ be a dominating measure of $Q_{X|Z}$ and $P_{X|Z}$ that does not depend on $Z$. Let $q_{X|Z}$ (resp. $p_{X|Z}$) be the density of $Q_{X|Z}$ (resp. $P_{X|Z}$) with respect to $\mu_X$. Then
$$\begin{aligned}
\Omega^{RBPT}_{P,1} &= \mathbb{E}_P\!\left[\ell\!\left(\int g^*_P(x, Z)\, dQ_{X|Z}(x), Y\right) - \ell\!\left(\int g^*_P(x, Z)\, dP_{X|Z}(x), Y\right)\right] \\
&\leq L\, \mathbb{E}_P\!\left[\left\|\int g^*_P(x, Z)\, dQ_{X|Z}(x) - \int g^*_P(x, Z)\, dP_{X|Z}(x)\right\|_2\right] \\
&\leq L\, \mathbb{E}_P\!\left[\left\|\int g^*_P(x, Z)\, dQ_{X|Z}(x) - \int g^*_P(x, Z)\, dP_{X|Z}(x)\right\|_1\right] \\
&= L\, \mathbb{E}_P\!\left[\left\|\int g^*_P(x, Z)\big(q_{X|Z}(x \mid Z) - p_{X|Z}(x \mid Z)\big)\, d\mu_X(x)\right\|_1\right] \\
&\leq L\, \mathbb{E}_P\!\left[\int \big\|g^*_P(x, Z)\big\|_1\, \big|q_{X|Z}(x \mid Z) - p_{X|Z}(x \mid Z)\big|\, d\mu_X(x)\right] \\
&\leq ML\, \mathbb{E}_P\!\left[\int \big|q_{X|Z}(x \mid Z) - p_{X|Z}(x \mid Z)\big|\, d\mu_X(x)\right] = 2ML\, \mathbb{E}_P\big[d_{TV}(Q_{X|Z}, P_{X|Z})\big].
\end{aligned}$$
If $\mathbb{E}_P[d_{TV}(Q_{X|Z}, P_{X|Z})] \leq \Omega^{RBPT}_{P,2}/(2ML)$, then $\Omega^{RBPT}_P \leq 0$.

A.6 How to obtain a result for RBPT2?

We informally give some ideas on extending Theorem 4.4. Theorem 4.4 states that
$$\mathbb{E}_P[\phi^{RBPT}_\alpha(D^{(n)}_{te}, D^{(m)}_{tr})] = 1 - \Phi\!\left(\tau_\alpha - \sqrt{\tfrac{n}{\sigma^2_P}}\,\Omega^{RBPT}_P\right) + o(1),$$
where $\Omega^{RBPT}_P = \Omega^{RBPT}_{P,1} - \Omega^{RBPT}_{P,2}$ with
$$\Omega^{RBPT}_{P,1} \triangleq \mathbb{E}_P\!\left[\ell\!\left(\int g^*_P(x, Z)\, dQ_{X|Z}(x), Y\right) - \ell\!\left(\int g^*_P(x, Z)\, dP_{X|Z}(x), Y\right)\right]$$
$$\Omega^{RBPT}_{P,2} \triangleq \mathbb{E}_P\big[\ell(g^*_P(X, Z), Y)\big] - \mathbb{E}_P\!\left[\ell\!\left(\int g^*_P(x, Z)\, dP_{X|Z}(x), Y\right)\right].$$
If we wanted to adapt that result for RBPT2, we would have to redefine $\Omega^{RBPT}_{P,1}$. The analogue of $\Omega^{RBPT}_{P,1}$ for RBPT2 would be
$$\Omega^{RBPT2}_{P,1} \triangleq \mathbb{E}_P\big[\ell\big(\widetilde{\mathbb{E}}_P[g^*_P(X, Z) \mid Z], Y\big)\big] - \mathbb{E}_P\big[\ell\big(\mathbb{E}_P[g^*_P(X, Z) \mid Z], Y\big)\big],$$
where $\widetilde{\mathbb{E}}_P[g^*_P(X, Z) \mid Z = z]$ denotes the limiting regression model used to predict $g^*_P(X, Z)$ given $Z$. If we assume the existence of a large unlabeled dataset, deriving a result might be easier, as we can avoid the asymptotic details on the convergence of the second regression model by assuming that the limiting Rao-Blackwellization model, for a fixed initial predictor, is known. The only challenge is proving the convergence of $\mathbb{E}_P[\hat g(X, Z) \mid Z]$ to $\mathbb{E}_P[g^*_P(X, Z) \mid Z]$.

B Technical proofs

Lemma B.1. Assume we are under the conditions of Theorem 3.4. Then:
$$(\hat\sigma^{(n,m)})^2 - \mathrm{Var}_P[T^{(m)}_1 \mid D^{(m)}_{tr}] = o_{\mathcal{P}}(1) \quad \text{as } m \to \infty.$$
Proof. First, see that for an arbitrary ε > 0, there must be16 a sequence of probability measures in P, (P (m))m N, such that sup P P PP h (ˆσ(n,m))2 Var P [T (m) 1 | D(m) tr ] > ε i PP (m) h (ˆσ(n,m))2 Var P (m)[T (m) 1 | D(m) tr ] > ε i + 1 Pick one of such sequences. Then, to prove that (ˆσ(n,m))2 Var P [T (m) 1 | D(m) tr ] = o P(1) as m , it suffices to show that PP (m) h (ˆσ(n,m))2 Var P (m)[T (m) 1 | D(m) tr ] > ε i 0 as m Now, expanding (ˆσ(n,m))2 we get (ˆσ(n,m))2 = 1 i=1 (T (m) i )2 i=1 T (m) i h (T (m) i )2 EP (m)[(T (m) 1 )2 | D(m) tr ] i + EP (m)[(T (m) 1 )2 | D(m) tr ] i=1 T (m) i h (T (m) i )2 EP (m)[(T (m) 1 )2 | D(m) tr ] i i=1 T (m) i EP (m)[T (m) 1 | D(m) tr ] i=1 T (m) i EP (m)[T (m) 1 | D(m) tr ] (EP (m)[T (m) 1 | D(m) tr ])2 !
+ Var P (m)[T (m) 1 | D(m) tr ] h (T (m) i )2 EP (m)[(T (m) 1 )2 | D(m) tr ] i i=1 T (m) i EP (m)[T (m) 1 | D(m) tr ] 2EP (m)[T (m) 1 | D(m) tr ] i=1 T (m) i EP (m)[T (m) 1 | D(m) tr ] + Var P (m)[T (m) 1 | D(m) tr ] (ˆσ(n,m))2 Var P (m)[T (m) 1 | D(m) tr ] = h (T (m) i )2 EP (m)[(T (m) 1 )2 | D(m) tr ] i i=1 T (m) i EP (m)[T (m) 1 | D(m) tr ] 2EP (m)[T (m) 1 | D(m) tr ] i=1 T (m) i EP (m)[T (m) 1 | D(m) tr ] Using a law of large numbers for triangular arrays [6, Corollary 9.5.6] (we comment on needed conditions to use this result below) and the continuous mapping theorem, we have that 1 n Pn i=1 h (T (m) i )2 EP (m)[(T (m) 1 )2 | D(m) tr ] i = op(1) 1 n Pn i=1 T (m) i EP (m)[T (m) 1 | D(m) tr ] 2 = op(1) 2EP (m)[T (m) 1 | D(m) tr ] | {z } Op(1) (Assumption 3.2) 1 n Pn i=1 T (m) i EP (m)[T (m) 1 | D(m) tr ] = op(1) 16Because of the definition of sup. sup P P PP h (ˆσ(n,m))2 Var P [T (m) 1 | D(m) tr ] > ε i PP (m) h (ˆσ(n,m))2 Var P (m)[T (m) 1 | D(m) tr ] > ε i + 1 as m , i.e., (ˆσ(n,m))2 Var P [T (m) 1 | D(m) tr ] = o P(1) as m Conditions to use the law of large numbers. Let (P (m))m N be an arbitrary sequence of probability measures in P. Define our triangular arrays as n V (m) i,1 o 1 i n and n V (m) i,2 o 1 i n, where V (m) i,1 T (m) i EP (m)[T (m) 1 | D(m) tr ] and V (m) i,2 (T (m) i )2 EP (m)[(T (m) 1 )2 | D(m) tr ]. Now, we comment on the conditions for the law of large numbers [6, Corollary 9.5.6]: 1. This condition naturally applies by definition and because of Assumption 3.2. 2. From Assumption 3.2 and Resnick [27, Example 6.5.2], EP (m) h V (m) i,1 | D(m) tr i EP (m) h T (m) i | D(m) tr i + EP (m)[T (m) 1 | D(m) tr ] 2EP (m) h T (m) i | D(m) tr i = Op(1) EP (m) h V (m) i,2 | D(m) tr i 2EP (m) h (T (m) i )2 | D(m) tr i = Op(1) 3. Fix any ϵ > 0 and let k be defined as in Assumption 3.2. See that EP (m) h V (m) i,1 1[|V (m) i,1 | ϵn] | D(m) tr i = EP (m) h V (m) i,1 1[(|V (m) i,1 |/(ϵn))k 1] | D(m) tr i 1 (nϵ)k EP (m) h |V (m) i,1 |1+k | D(m) tr i = 1 (nϵ)k EP (m) h |T (m) i EP (m)[T (m) 1 | D(m) tr ]|1+k | D(m) tr i EP (m) h |T (m) i |1+k | D(m) tr i 1 1+k + EP (m) h |T (m) i | | D(m) tr i 1+k = 1 (nϵ)k Op(1) where the third inequality is obtained via Minkowski Inequality [27] and the fifth step is an application of Assumption 3.2 and Resnick [27, Example 6.5.2]. Analogously, define k = k/2 and see that EP (m) h V (m) i,2 1[|V (m) i,2 | ϵn] | D(m) tr i = EP (m) h V (m) i,2 1[(|V (m) i,2 |/(ϵn))k 1] | D(m) tr i 1 (nϵ)k EP (m) h |V (m) i,2 |1+k | D(m) tr i = 1 (nϵ)k EP (m) h |(T (m) i )2 EP (m)[(T (m) 1 )2 | D(m) tr ]|1+k | D(m) tr i EP (m) h (T (m) i )2(1+k ) | D(m) tr i 1 1+k + EP (m) h (T (m) i )2 | D(m) tr i 1+k EP (m) h (T (m) i )2+k | D(m) tr i 1 1+k + EP (m) h (T (m) i )2 | D(m) tr i 1+k = 1 (nϵ)k Op(1) Theorem 3.4. Suppose that Assumptions 3.1, 3.2, and 3.3 hold. If n is a function of m such that n and n = o(m2γ) as m , then EP [φSTFR α (D(n) te , D(m) tr )] = 1 Φ τα r n σ2 P ΩSTFR P where o(1) denotes uniform convergence over all P P as m and ΩSTFR P EP [ℓ(g 2,P (Z), Y )] EP [ℓ(g 1,P (X, Z), Y )] Proof. First, note that there must be17 a sequence of probability measures in P, (P (m))m N, such that EP [φSTFR α (D(n) te , D(m) tr )] 1 + Φ τα r n σ2 P ΩSTFR P EP (m)[φSTFR α (D(n) te , D(m) tr )] 1 + Φ τα r n σ2 P (m) ΩSTFR P (m) Then, it suffices to show that the RHS vanishes when we consider such a sequence (P (m))m N. 
Now, let us first decompose the test statistic Λ(n,m) in the following way: n T (n,m) EP (m)[T (m) 1 | D(m) tr ] ˆσ(n,m) + n EP (m)[T (m) 1 | D(m) tr ] ˆσ(n,m) n T (n,m) EP (m)[T (m) 1 | D(m) tr ] n EP (m)[ℓ(ˆg(m) 2 (Z1), Y1) ℓ(g 2,P (m)(Z1), Y1) | D(m) tr ] EP (m)[ℓ(ˆg(m) 1 (X1, Z1), Y1) ℓ(g 1,P (m)(X1, Z1), Y1) | D(m) tr ] n EP (m)[ℓ(g 2,P (m)(Z1), Y1)] EP (m)[ℓ(g 1,P (m)(X1, Z1), Y1)] n W (m) 1,P (m) ˆσ(n,m) + n W (m) 2,P (m) ˆσ(n,m) + nΩSTFR P (m) ˆσ(n,m) Given that n is a function of m, we omit it when writing the W (m) j,P (m) s. Define σ(m) P (m) q Var P (m)[T (m) 1 | D(m) tr ] and see that EP (m)[φSTFR α (D(n) te , D(m) tr )] = PP (m)[p(D(n) te , D(m) tr ) α] = PP (m) h 1 Φ Λ(n,m) α i = PP (m) h Λ(n,m) τα i n W (m) 1,P (m) ˆσ(n,m) + n W (m) 2,P (m) ˆσ(n,m) + nΩSTFR P (m) ˆσ(n,m) τα n W (m) 1,P (m) σP (m) + n W (m) 2,P (m) σP (m) + nΩSTFR P (m) σP (m) + τα τα ˆσ(n,m) n W (m) 1,P (m) σ(m) P (m) σP (m) + n W (m) 2,P (m) σP (m) + τα τα ˆσ(n,m) σ(m) P (m) σP (m) τα nΩSTFR P (m) σP (m) τα r n σ2 P (m) ΩSTFR P (m) + o(1) (B.1) 17Because of the definition of sup. Implying that EP [φSTFR α (D(n) te , D(m) tr )] 1 + Φ τα r n σ2 P ΩSTFR P = o(1) as m Justifying step B.1. First, from a central limit theorem for triangular arrays [6, Corollary 9.5.11], we have that W (m) 1,P (m) T (n,m) EP (m)[T (m) 1 | D(m) tr ] T (m) i EP (m)[T (m) 1 | D(m) tr ] we comment on the conditions to use this theorem below. Second, we have that n W (m) 1,P (m) n W (m) 1,P (m) σ(m) P (m) σP (m) + n W (m) 2,P (m) σP (m) + τα τα ˆσ(n,m) σ(m) P (m) σP (m) n W (m) 1,P (m) 1 σ(m) P (m) σP (m) n W (m) 2,P (m) σP (m) + τα σ(m) P (m) 1 ! σ(m) P (m) σP (m) 1 σ(m) P (m) σP (m) 1 σ(m) P (m) 1 = op(1) as m To see why the random quantity above converges to zero in probability, see that because of Assumption 3.3, Lemma B.1, and continuous mapping theorem, we have that σ(m) P (m) 1 = op(1) and σ(m) P (m) σP (m) 1 = op(1) as m Additionally, because of Assumptions 3.1, 3.3 and condition n = o(m2γ), we have that n W (m) 2,P (m) σP (m) = o(mγ)Op(m γ) o(mγ)Op(m γ) Finally, n W (m) 1,P (m) σ(m) P (m) σP (m) + n W (m) 2,P (m) σP (m) + τα τα ˆσ(n,m) σ(m) P (m) σP (m) = n W (m) 1,P (m) n W (m) 1,P (m) n W (m) 1,P (m) σ(m) P (m) σP (m) + n W (m) 2,P (m) σP (m) + τα τα ˆσ(n,m) σ(m) P (m) σP (m) n W (m) 1,P (m) σ(m) P (m) + op(1) N(0, 1) by Slutsky s theorem. Because N(0, 1) is a continuous distribution, we have uniform convergence of the distribution function [27][Chapter 8, Exercise 5] and we do not have to worry about the fact that nΩSTFR P (m) σP (m) depends on m. Conditions to apply the central limit theorem. Now, we comment on the conditions for the central limit theorem [6, Corollary 9.5.11]. Define our triangular array as n V (m) i o 1 i n, where V (m) i T (m) i EP (m) [T (m) i |D(m) tr ] 1. This condition naturally applies by definition and because of Assumption 3.2. 2. See that, for any m, we have that EP (m)[(V (m) i )2 | D(m) tr ] = 1 3. Fix any ϵ > 0 and let k be defined as in Assumption 3.2. 
See that EP (m) h (V (m) i )21[|V (m) i | ϵn] | D(m) tr i = = EP (m) h (V (m) i )21[(|V (m) i |/(ϵn))k 1] | D(m) tr i 1 (nϵ)k EP (m) h |V (m) i |2+k | D(m) tr i = 1 (nϵ)k EP (m) T (m) i EP (m)[T (m) i | D(m) tr ] 2+k | D(m) tr (σ(m) P (m))2+k EP (m) h |T (m) i |2+k | D(m) tr i 1 2+k + EP (m) h |T (m) i | | D(m) tr i 2+k 1 (σP (m))2+k + op(1) EP (m) h |T (m) i |2+k | D(m) tr i 1 2+k + EP (m) h |T (m) i | | D(m) tr i 2+k 1 (inf P P σP )2+k + op(1) Op(1) = 1 (nϵ)k Op(1) where the third inequality is obtained via Minkowski Inequality [27] and the last inequality is an application of Assumption 3.2 and Resnick [27, Example 6.5.2]. Corollary 3.6. Suppose we are under the conditions of Theorem 3.4. (Type-I error) If H0 : X Y | Z holds, then EP [φSTFR α (D(n) te , D(m) tr )] 1 Φ τα r n where o(1) denotes uniform convergence over all P P0 as m . (Type-II error) In general, we have 1 EP [φSTFR α (D(n) te , D(m) tr )] Φ τα r n σ2 P ( P 1,P ) + o(1) where o(1) denotes uniform convergence over all P P as m and P EP [ℓ(f 2,P (Z), Y )] EP [ℓ(f 1,P (X, Z), Y )]. Proof. Under the conditions of Theorem 3.4, we start proving that 1. P 1,P ΩSTFR P holds; 2. Under H0, ΩSTFR P 2,P holds. For (1), see that, ΩSTFR P = EP [ℓ(g 2,P (Z), Y )] EP [ℓ(g 1,P (X, Z), Y )] = EP [ℓ(g 2,P (Z), Y )] EP [ℓ(f 2,P (Z), Y )] ( 0) + EP [ℓ(f 1,P (X, Z), Y )] EP [ℓ(g 1,P (X, Z), Y )] + EP [ℓ(f 2,P (Z), Y )] EP [ℓ(f 1,P (X, Z), Y )] EP [ℓ(f 1,P (X, Z), Y )] EP [ℓ(g 1,P (X, Z), Y )] + P = P 1,P Now, for (2): ΩSTFR P = EP [ℓ(g 2,P (Z), Y )] EP [ℓ(g 1,P (X, Z), Y )] = EP [ℓ(g 2,P (Z), Y )] EP [ℓ(f 2,P (Z), Y )] + EP [ℓ(f 1,P (X, Z), Y )] EP [ℓ(g 1,P (X, Z), Y )] ( 0) + EP [ℓ(f 2,P (Z), Y )] EP [ℓ(f 1,P (X, Z), Y )] (= 0, because H0 holds) EP [ℓ(g 2,P (Z), Y )] EP [ℓ(f 2,P (Z), Y )] Finally, see that σ2 P ΩSTFR P σ2 P ΩSTFR P σ2 P ( P 1,P ) Combining these observations with Theorem 3.4, we get the result. Theorem A.5. Suppose that Assumptions A.3, 3.2, and 3.3 hold. If n is a function of m such that n and n = o(mγ) as m , then EP [φGCM α (D(n) te , D(m) tr )] = 1 Φ τα/2 r n σ2 P ΩGCM P + Φ τα/2 r n σ2 P ΩGCM P where o(1) denotes uniform convergence over all P P as m and ΩGCM P EP [Cov P (X, Y | Z)] + EP [δ1,P (Z)δ2,P (Z)] Proof. First, note that there must be18 a sequence of probability measures in P, (P (m))m N, such that EP [φGCM α (D(n) te , D(m) tr )] 1 + Φ τα/2 r n σ2 P ΩGCM P σ2 P ΩGCM P EP (m)[φGCM α (D(n) te , D(m) tr )] 1 + Φ τα/2 r n σ2 P (m) ΩGCM P (m) τα/2 r n σ2 P (m) ΩGCM P (m) Then, it suffices to show that the RHS vanishes when we consider such a sequence (P (m))m N. Now, let us first decompose the test statistic Γ(n,m) in the following way: n T (n,m) EP (m)[T (m) 1 | D(m) tr ] ˆσ(n,m) + n EP (m)[T (m) 1 | D(m) tr ] ˆσ(n,m) n T (n,m) EP (m)[T (m) 1 | D(m) tr ] n EP (m)[(ˆg(m) 1 (Z1) g 1,P (m)(Z1))(ˆg(m) 2 (Z1) g 2,P (m)(Z1)) | D(m) tr ] n EP (m)[(ˆg(m) 1 (Z1) g 1,P (m)(Z1))δ2,P (m)(Z1) | D(m) tr ] n EP (m)[δ1,P (m)(Z1)(ˆg(m) 2 (Z1) g 2,P (m)(Z1)) | D(m) tr ] + nΩGCM P (m) ˆσ(n,m) n W (m) 1,P (m) ˆσ(n,m) + n W (m) 2,P (m) ˆσ(n,m) + n W (m) 3,P (m) ˆσ(n,m) + n W (m) 4,P (m) ˆσ(n,m) + nΩGCM P (m) ˆσ(n,m) 18Because of the definition of sup. The terms involving one of the ϵ and η were all zero and were omitted. Given that n is a function of m, we omit it when writing the W (m) j,P (m) s. 
Define σ(m) P (m) q Var P (m)[T (m) 1 | D(m) tr ] and see that EP [φGCM α (D(n) te , D(m) tr )] = = PP [p(D(n) te , D(m) tr ) α] = PP h 1 Φ Γ(n,m) α/2 i = PP h Γ(n,m) τα/2 i n W (m) 1,P (m) ˆσ(n,m) + n W (m) 2,P (m) ˆσ(n,m) + n W (m) 3,P (m) ˆσ(n,m) + n W (m) 4,P (m) ˆσ(n,m) + nΩGCM P (m) ˆσ(n,m) τα/2 n W (m) 1,P (m) ˆσ(n,m) + n W (m) 2,P (m) ˆσ(n,m) + n W (m) 3,P (m) ˆσ(n,m) + n W (m) 4,P (m) ˆσ(n,m) + nΩGCM P (m) ˆσ(n,m) τα/2 n W (m) 1,P (m) σ(m) P (m) σP (m) + n W (m) 2,P (m) σP (m) + n W (m) 3,P (m) σP (m) + n W (m) 4,P (m) σP (m) + ˆσ(n,m) σ(m) P (m) σP (m) τα/2 τα/2 τα/2 nΩGCM P (m) σP (m) n W (m) 1,P (m) σ(m) P (m) σP (m) + n W (m) 2,P (m) σP (m) + n W (m) 3,P (m) σP (m) + n W (m) 4,P (m) σP (m) ˆσ(n,m) σ(m) P (m) σP (m) τα/2 + τα/2 τα/2 nΩGCM P (m) σP (m) τα/2 r n σ2 P (m) ΩGCM P (m) τα/2 r n σ2 P (m) ΩGCM P (m) + o(1) (B.2) Implying that EP [φGCM α (D(n) te , D(m) tr )] 1 + Φ τα/2 r n σ2 P ΩGCM P σ2 P ΩGCM P = o(1) as m Justifying step B.2. First, from a central limit theorem for triangular arrays [6, Corollary 9.5.11], we have that W (m) 1,P (m) T (n,m) EP (m)[T (m) 1 | D(m) tr ] T (m) i EP (m)[T (m) 1 | D(m) tr ] The conditions for the central limit theorem [6, Corollary 9.5.11] can be proven to hold like in Theorem 3.4 s proof. Second, we have that n W (m) 1,P (m) n W (m) 1,P (m) σ(m) P (m) σP (m) + n W (m) 2,P (m) σP (m) + n W (m) 3,P (m) σP (m) + n W (m) 4,P (m) σP (m) τα/2 + τα/2 ˆσ(n,m) σ(m) P (m) σP (m) n W (m) 1,P (m) 1 σ(m) P (m) σP (m) n W (m) 2,P (m) σP (m) n W (m) 3,P (m) σP (m) n W (m) 4,P (m) σP (m) + ! σ(m) P (m) σP (m) 1 1 σ(m) P (m) σP (m) = op(1) as m n W (m) 1,P (m) n W (m) 1,P (m) σ(m) P (m) σP (m) + n W (m) 2,P (m) σP (m) + n W (m) 3,P (m) σP (m) + n W (m) 4,P (m) σP (m) + τα/2 τα/2 ˆσ(n,m) σ(m) P (m) σP (m) n W (m) 1,P (m) 1 σ(m) P (m) σP (m) n W (m) 2,P (m) σP (m) n W (m) 3,P (m) σP (m) n W (m) 4,P (m) σP (m) + σ(m) P (m) 1 ! σ(m) P (m) σP (m) 1 σ(m) P (m) σP (m) 1 σ(m) P (m) 1 = op(1) as m To see why the random quantities above converge to zero in probability, see that because of Assumption 3.3, Lemma19 B.1, and continuous mapping theorem, we have that σ(m) P (m) 1 = op(1) and σ(m) P (m) σP (m) 1 = op(1) as m Additionally, because of Assumptions A.3, 3.3, Cauchy-Schwarz inequality, and condition n = o(mγ), we have that n W (m) 2,P (m) σP (m) n EP (m)[(ˆg(m) 1 (Z1) g 1,P (m)(Z1))(ˆg(m) 2 (Z1) g 2,P (m)(Z1)) | D(m) tr ] n EP (m)[(ˆg(m) 1 (Z1) g 1,P (m)(Z1))2 | D(m) tr ]EP (m)[(ˆg(m) 2 (Z1) g 2,P (m)(Z1))2 | D(m) tr ] o(mγ)Op(m 2γ) inf P P σP = op(1) n W (m) 3,P (m) σP (m) n EP (m)[(ˆg(m) 1 (Z1) g 1,P (m)(Z1))δ2,P (m)(Z1) | D(m) tr ] n EP (m)[(ˆg(m) 1 (Z1) g 1,P (m)(Z1))2 | D(m) tr ]EP (m)[(δ2,P (m)(Z1))2] o(mγ)Op(m γ) inf P P σP = op(1) n W (m) 4,P (m) σP (m) n EP (m)[(ˆg(m) 2 (Z1) g 2,P (m)(Z1))δ1,P (m)(Z1) | D(m) tr ] n EP (m)[(ˆg(m) 2 (Z1) g 2,P (m)(Z1))2 | D(m) tr ]EP (m)[(δ1,P (m)(Z1))2] o(mγ)Op(m γ) inf P P σP = op(1) 19We can apply this STFR s lemma because it still holds when we consider GCM s test statistic. 
n W (m) 1,P (m) σ(m) P (m) σP (m) + n W (m) 2,P (m) σP (m) + n W (m) 3,P (m) σP (m) + n W (m) 4,P (m) σP (m) τα/2 + τα/2 ˆσ(n,m) σ(m) P (m) σP (m) n W (m) 1,P (m) n W (m) 1,P (m) n W (m) 1,P (m) σ(m) P (m) σP (m) + n W (m) 2,P (m) σP (m) + n W (m) 3,P (m) σP (m) + n W (m) 4,P (m) σP (m) τα/2 + τα/2 ˆσ(n,m) n W (m) 1,P (m) σ(m) P (m) + op(1) N(0, 1) n W (m) 1,P (m) σ(m) P (m) σP (m) + n W (m) 2,P (m) σP (m) + n W (m) 3,P (m) σP (m) + n W (m) 4,P (m) σP (m) + τα/2 τα/2 ˆσ(n,m) σ(m) P (m) σP (m) n W (m) 1,P (m) n W (m) 1,P (m) n W (m) 1,P (m) σ(m) P (m) σP (m) + n W (m) 2,P (m) σP (m) + n W (m) 3,P (m) σP (m) + n W (m) 4,P (m) σP (m) + τα/2 τα/2 ˆσ(n,m) n W (m) 1,P (m) σ(m) P (m) + op(1) N(0, 1) by Slutsky s theorem. Because N(0, 1) is a continuous distribution, we have uniform convergence of the distribution function [27][Chapter 8, Exercise 5] and we do not have to worry about the fact that τα/2 q n σ2 P (m) ΩGCM P (m) or τα/2 q n σ2 P (m) ΩGCM P (m) depends on m. Lemma B.2. Let PU,V and P U,V be two distributions on U V, B(U V) , U V Rd U d V , with d U, d V 1. Assume PU and P U are dominated by a common σ-finite measure µ and that PV |U=u = P V |U=u is dominated by a σ-finite measure νu, for all u Rd U . Then, d TV(PU,V , P U,V ) = d TV(PU, P U) where d TV denotes the total variation distance between two probability measures. Proof. Let p U and p U denote the densities of PU and P U w.r.t. µ, and let p V |U( | u) denote the density of PV |U=u = P V |U=u w.r.t. νu. From Scheffe s theorem [36][Lemma 2.1], we have that: d TV(PU,V , P U,V ) = 1 Z Z |p U(u)p V |U(v | u) p U(u)p V |U(v | u)|dνu(v)dµ(u) Z Z |p V |U(v | u)|dνu(v) |p U(u) p U(u)|dµ(u) Z Z p V |U(v | u)dνu(v) |p U(u) p U(u)|dµ(u) Z |p U(u) p U(u)|dµ(u) = d TV(PU, P U) Lemma B.3. For all i [n], consider (ˆϵi, ˆηi) | D(m) tr Pˆϵ,ˆη|D(m) tr , (ϵi, ˆηi) | D(m) tr Pϵ,ˆη|D(m) tr , (ˆϵi, ηi) | D(m) tr Pˆϵ,η|D(m) tr , where ˆϵi = ϵi δ1,P (Zi) ˆg(m) 1 (Zi) g 1,P (Zi) and ˆηi = ηi δ2,P (Zi) ˆg(m) 2 (Zi) g 2,P (Zi) . Under Assumption A.7 and H0 : X Y | Z, we have that EP [φRESIT α (D(n) te , D(m) tr ) | D(m) tr ] α + min n d TV P n ˆϵ,ˆη|D(m) tr , P n ϵ,ˆη|D(m) tr , d TV P n ˆϵ,ˆη|D(m) tr , P n ˆϵ,η|D(m) tr where d TV denotes the total variation distance between two probability measures. Proof. Let us represent the stacked residuals (in matrix form) as (ˆϵ, ˆη) = {(ˆϵi, ˆηi)}n i=1. See that (ˆϵ, ˆη)(1), , (ˆϵ, ˆη)(B) | (ˆϵ, ˆη) = ( ϵ, η), D(m) tr d= (ϵ, ˆη)(1), , (ϵ, ˆη)(B) | (ϵ, ˆη) = ( ϵ, η), D(m) tr Then, because the random quantities above are conditionally discrete, their distribution is dominated by a counting measure depending on ( ϵ, η). Because the distribution of (ϵ, η) is absolutely continuous with respect to the Lebesgue measure, (ˆϵ, ˆη) | D(m) tr and (ϵ, ˆη) | D(m) tr are also absolutely continuous20 for every training set configuration, and then we can apply Lemma B.2 to get that ((ˆϵ, ˆη), (ˆϵ, ˆη)(1), , (ˆϵ, ˆη)(B)) | D(m) tr , ((ϵ, ˆη), (ϵ, ˆη)(1), , (ϵ, ˆη)(B)) | D(m) tr (ˆϵ, ˆη) | D(m) tr , (ϵ, ˆη) | D(m) tr In the last step, we abuse TV distance notation: by the TV distance of two random variables, we mean the TV distance of their distributions. 
Now, define the event ((e, h), (e, h)(1), , (e, h)(B)) : 1 + PB b=1 1[Ψ((e, h)(b)) Ψ((e, h))] By the definition of the TV distance, we have that (under H0): EP [φRESIT α (D(n) te , D(m) tr ) | D(m) tr ] = PP ((ˆϵ, ˆη), (ˆϵ, ˆη)(1), , (ˆϵ, ˆη)(B)) Aα | D(m) tr PP ((ϵ, ˆη), (ϵ, ˆη)(1), , (ϵ, ˆη)(B)) Aα | D(m) tr + d TV (ˆϵ, ˆη) | D(m) tr , (ϵ, ˆη) | D(m) tr (ˆϵ, ˆη) | D(m) tr , (ϵ, ˆη) | D(m) tr = α + d TV P n ˆϵ,ˆη|D(m) tr , P n ϵ,ˆη|D(m) tr where the last equality holds from the fact that, given the training set, rows of (ˆϵ, ˆη) and (ϵ, ˆη) are i.i.d.. See that PP ((ϵ, ˆη), (ϵ, ˆη)(1), , (ϵ, ˆη)(B)) Aα | D(m) tr α because H0 holds and therefore ϵi ˆηi | D(m) tr (making the permuted samples exchangeable). Using symmetry, we have that EP [φRESIT α (D(n) te , D(m) tr ) | D(m) tr ] α + min n d TV P n ˆϵ,ˆη|D(m) tr , P n ϵ,ˆη|D(m) tr , d TV P n ˆϵ,ˆη|D(m) tr , P n ˆϵ,η|D(m) tr Lemma B.4. For any i [n], consider that (ˆϵi, ˆηi) | D(m) tr Pˆϵ,ˆη|D(m) tr , (ϵi, ˆηi) | D(m) tr Pϵ,ˆη|D(m) tr , (ˆϵi, ηi) | D(m) tr Pˆϵ,η|D(m) tr (ϵ i , η i ) Pϵ ,η , (ϵi, η i ) Pϵ,η , (ϵ i , ηi) Pϵ ,η where ˆϵi = ϵi δ1,P (Zi) ˆg(m) 1 (Zi) g 1,P (Zi) , ϵ i = ϵi δ1,P (Zi), ˆηi = ηi δ2,P (Zi) ˆg(m) 2 (Zi) g 2,P (Zi) , and η i = ηi δ2,P (Zi). Then, under H0 : X Y | Z and Assumptions 20Given any training set configuration, the vectors (ˆϵi, ˆηi), (ϵi, ˆηi), (ˆϵi, ηi) are given by the sum of two independent random vectors where at least one of them is continuous because of Assumption A.7 and therefore the sum must be continuous, e.g., (ˆϵi, ˆηi) = (ϵi, ηi)+(g 1,P (Zi) ˆg(m) 1 (Zi) δ1,P (Zi), g 2,P (Zi) ˆg(m) 2 (Zi) δ2,P (Zi)). See Lemma B.4 for a proof. A.6 and A.7, all six distributions are absolutely continuous with respect to the Lebesgue measure and their densities are given by pˆϵ,ˆη|D(m) tr (e, h | D(m) tr ) = EP h pϵ e + δ1,P (Z) + ˆg(m) 1 (Z) g 1,P (Z) pη h + δ2,P (Z) + ˆg(m) 2 (Z) g 2,P (Z) | D(m) tr i pϵ,ˆη|D(m) tr (e, h | D(m) tr ) = pϵ(e) EP h pη h + δ2,P (Z) + ˆg(m) 2 (Z) g 2,P (Z) | D(m) tr i pˆϵ,η|D(m) tr (e, h | D(m) tr ) = EP h pϵ e + δ1,P (Z) + ˆg(m) 1 (Z) g 1,P (Z) | D(m) tr i pη(h) pϵ ,η (e, h) = EP [pϵ (e + δ1,P (Z)) pη (h + δ2,P (Z))] pϵ,η (e, h) = pϵ(e) EP [pη (h + δ2,P (Z))] pϵ ,η(e, h) = EP [pϵ (e + δ1,P (Z))] pη(h) Additionally, we have that (e,h) Rd X d Y pˆϵ,ˆη|D(m) tr (e, h | D(m) tr ) pϵ ,η (e, h) 2 = OP0(m γ) (e,h) Rd X d Y pϵ,ˆη|D(m) tr (e, h | D(m) tr ) pϵ,η (e, h) 2 = OP0(m γ) (e,h) Rd X d Y pˆϵ,η|D(m) tr (e, h | D(m) tr ) pϵ ,η(e, h) 2 = OP0(m γ) Proof. Assume we are under H0. In order to show that Pˆϵ,ˆη|D(m) tr is absolutely continuous w.r.t. Lebesgue measure (for each training set configuration) and that its density is given by pˆϵ,ˆη|D(m) tr (e, h | D(m) tr ) = EP h pϵ e + δ1,P (Z) + ˆg(m) 1 (Z) g 1,P (Z) pη h + δ2,P (Z) + ˆg(m) 2 (Z) g 2,P (Z) | D(m) tr i , it suffices to show that PP ((ˆϵ, ˆη) A | D(m) tr ) = = Z 1A(u, v)EP h pϵ u + δ1,P (Z) + ˆg(m) 1 (Z) g 1,P (Z) pη v + δ2,P (Z) + ˆg(m) 2 (Z) g 2,P (Z) | D(m) tr i d(u, v) for any measurable set A. 
Using Fubini s theorem, we get PP ((ˆϵ, ˆη) A | D(m) tr ) = = EP h 1A(ϵ δ1,P (Z) ˆg(m) 1 (Z) + g 1,P (Z), η δ2,P (Z) ˆg(m) 2 (Z) + g 2,P (Z)) | D(m) tr i Z 1A(e δ1,P (Z) ˆg(m) 1 (Z) + g 1,P (Z), h δ2,P (Z) ˆg(m) 2 (Z) + g 2,P (Z))pϵ,η(e, h)d(e, h) | D(m) tr Z 1A(e δ1,P (Z) ˆg(m) 1 (Z) + g 1,P (Z), h δ2,P (Z) ˆg(m) 2 (Z) + g 2,P (Z))pϵ(e)pη(h)d(e, h) | D(m) tr Z 1A(u, v)pϵ u + δ1,P (Z) + ˆg(m) 1 (Z) g 1,P (Z) pη v + δ2,P (Z) + ˆg(m) 2 (Z) g 2,P (Z) d(u, v) | D(m) tr = Z 1A(u, v)EP h pϵ u + δ1,P (Z) + ˆg(m) 1 (Z) g 1,P (Z) pη v + δ2,P (Z) + ˆg(m) 2 (Z) g 2,P (Z) | D(m) tr i d(u, v) The proof is analogous to the other distributions. Now, we proceed with the second part of the lemma. Using Assumptions A.6 and A.7, we get pˆϵ,ˆη|D(m) tr (e, h | D(m) tr ) pϵ ,η (e, h) 2 = = EP h pϵ e + δ1,P (Z) + ˆg(m) 1 (Z) g 1,P (Z) pη h + δ2,P (Z) + ˆg(m) 2 (Z) g 2,P (Z) pϵ (e + δ1,P (Z)) pη (h + δ2,P (Z)) | D(m) tr i 2 L2 EP h ˆg(m) 1 (Z) g 1(Z), ˆg(m) 2 (Z) g 2(Z) 2 | D(m) tr i 2 ˆg(m) 1 (Z) g 1(Z), ˆg(m) 2 (Z) g 2(Z) 2 2 | D(m) tr ˆg(m) 1 (Z) g 1,P (Z) 2 2 | D(m) tr ˆg(m) 2 (Z) g 2,P (Z) 2 2 | D(m) tr where the last inequality is obtained via Jensen s inequality. Because the upper bound obtained through Lipschitzness does not depend on (e, h), we get (e,h) Rd X d Y pˆϵ,ˆη|D(m) tr (e, h | D(m) tr ) pϵ ,η (e, h) 2 = OP0(m γ) The results for the other converging quantities are obtained in the same manner. Theorem A.9. Under Assumptions A.6, A.7, and A.8, if H0 : X Y | Z holds and n is a function of m such that n and n = o(m γ 2 ) as m , then EP [φRESIT α (D(n) te , D(m) tr )] α + min{d TV(P n ϵ ,η , P n ϵ,η ), d TV(P n ϵ ,η , P n ϵ ,η)} + o(1) where o(1) denotes uniform convergence over all P P0 as m . Proof. We are trying to prove that h EP [φRESIT α (D(n) te , D(m) tr )] α min{d TV(P n ϵ ,η , P n ϵ,η ), d TV(P n ϵ ,η , P n ϵ ,η)} i o(1) as m First, see that using Lemma B.3 we get EP [φRESIT α (D(n) te , D(m) tr )] = = EP [EP [φRESIT α (D(n) te , D(m) tr ) | D(m) tr ]] α + EP h min n d TV P n ˆϵ,ˆη|D(m) tr , P n ϵ,ˆη|D(m) tr , d TV P n ˆϵ,ˆη|D(m) tr , P n ˆϵ,η|D(m) tr α + min n EP h d TV P n ˆϵ,ˆη|D(m) tr , P n ϵ,ˆη|D(m) tr i , EP h d TV P n ˆϵ,ˆη|D(m) tr , P n ˆϵ,η|D(m) tr EP h d TV P n ˆϵ,ˆη|D(m) tr , P n ϵ,ˆη|D(m) tr i d TV(P n ϵ ,η , P n ϵ,η ) + d TV(P n ϵ ,η , P n ϵ,η ), EP h d TV P n ˆϵ,ˆη|D(m) tr , P n ˆϵ,η|D(m) tr d TV(P n ϵ ,η , P n ϵ ,η) + d TV(P n ϵ ,η , P n ϵ ,η) i ) ( EP h d TV P n ˆϵ,ˆη|D(m) tr , P n ϵ,ˆη|D(m) tr i d TV(P n ϵ ,η , P n ϵ,η ) + d TV(P n ϵ ,η , P n ϵ,η ), EP h d TV P n ˆϵ,ˆη|D(m) tr , P n ˆϵ,η|D(m) tr d TV(P n ϵ ,η , P n ϵ ,η) + d TV(P n ϵ ,η , P n ϵ ,η) i ) h EP [φRESIT α (D(n) te , D(m) tr )] α min{d TV(P n ϵ ,η , P n ϵ,η ), d TV(P n ϵ ,η , P n ϵ ,η)} i sup P P0 (m) P ( EP h d TV P n ˆϵ,ˆη|D(m) tr , P n ϵ,ˆη|D(m) tr i d TV(P n ϵ ,η , P n ϵ,η ) + d TV(P n ϵ ,η , P n ϵ,η ), EP h d TV P n ˆϵ,ˆη|D(m) tr , P n ˆϵ,η|D(m) tr i d TV(P n ϵ ,η , P n ϵ ,η) + d TV(P n ϵ ,η , P n ϵ ,η) min{d TV(P n ϵ ,η , P n ϵ,η ), d TV(P n ϵ ,η , P n ϵ ,η)} It suffices to show that sup P P0 (m) P = o(1) as m . Next step is to show that EP h d TV P n ˆϵ,ˆη|D(m) tr , P n ϵ,ˆη|D(m) tr i d TV(P n ϵ ,η , P n ϵ,η ) = o(1) and sup P P0 EP h d TV P n ˆϵ,ˆη|D(m) tr , P n ˆϵ,η|D(m) tr i d TV(P n ϵ ,η , P n ϵ ,η) = o(1) as m . Given the symmetry, we focus on the first problem. 
By the triangle inequality, we obtain EP h d TV P n ˆϵ,ˆη|D(m) tr , P n ϵ,ˆη|D(m) tr i EP h d TV P n ˆϵ,ˆη|D(m) tr , P n ϵ ,η i +d TV P n ϵ ,η , P n ϵ,η +EP h d TV P n ϵ,η , P n ϵ,ˆη|D(m) tr and consequently EP h d TV P n ˆϵ,ˆη|D(m) tr , P n ϵ,ˆη|D(m) tr i d TV P n ϵ ,η , P n ϵ,η sup P P0 EP h d TV P n ˆϵ,ˆη|D(m) tr , P n ϵ ,η i + sup P P0 EP h d TV P n ϵ,ˆη|D(m) tr , P n ϵ,η i We treat these terms separately. (I) Consider a sequence of probability distributions (P (m))m N in P0 such that sup P P0 EP h d TV P n ˆϵ,ˆη|D(m) tr , P n ϵ ,η i EP (m) h d TV P (m)n ˆϵ,ˆη|D(m) tr , P (m)n ϵ ,η i + 1 Here, the distributions P and P (m) determine not only the distribution associated with D(m) tr but also the distribution of (ˆϵ, ˆη, ϵ , η ). Because we have that d TV P n ˆϵ,ˆη|D(m) tr , P n ϵ ,η n d TV Pˆϵ,ˆη|D(m) tr , Pϵ ,η (Subadditivity of TV distance) Z pˆϵ,ˆη|D(m) tr (e, h | D(m) tr ) pϵ ,η (e, h) d(u, v) (Scheffe s theorem [36][Lemma 2.1]) (u,v) Rd X d Y pˆϵ,ˆη|D(m) tr (e, h | D(m) tr ) pϵ ,η (e, h) (Assumption A.8) (u,v) Rd X d Y pˆϵ,ˆη|D(m) tr (e, h | D(m) tr ) pϵ ,η (e, h) 2 !1/2 = o(mγ)OP0(m γ) 1/2 (Lemma B.4) where V is the volume of a ball containing the support of the RVs (existence of that ball is due to Assumption A.8), we also have that d TV P (m)n ˆϵ,ˆη|D(m) tr , P (m)n ϵ ,η = op(1), when D(m) tr samples come from the sequence (P (m))m N. By the Dominated Convergence Theorem (DCT) [27, Corollary 6.3.2], we have sup P P0 EP h d TV P n ˆϵ,ˆη|D(m) tr , P n ϵ ,η i EP (m) h d TV P (m)n ˆϵ,ˆη|D(m) tr , P (m)n ϵ ,η i + 1 To see why we can use the DCT here, realize that when samples come from P (m), W (m) = d TV P (m)n ˆϵ,ˆη|D(m) tr , P (m)n ϵ ,η can be seen as a measurable function going from an original probability space to some other space. Different distributions {P (m)} are due to different random variables while the original probability measure is fixed. Because of that, EP (m) h d TV P (m)n ˆϵ,ˆη|D(m) tr , P (m)n ϵ ,η i = E h W (m)i for the bounded random variable W (m), where the last expectation is taken in the original probability space. We apply the DCT in E h W (m)i . (II) Following the same steps as in part (I), we obtain sup P P0 EP h d TV P n ϵ,ˆη|D(m) tr , P n ϵ,η i = o(1) Going back to step B.3, consider another sequence of probability distributions (Q(m))m N in P0 such that sup P P0 (m) P (m) Q(m) + 1 (m) Q(m) = min ( EQ(m) h d TV Q(m)n ˆϵ,ˆη|D(m) tr , Q(m)n ϵ,ˆη|D(m) tr i d TV(Q(m)n ϵ ,η , Q(m)n ϵ,η ) + d TV(Q(m)n ϵ ,η , Q(m)n ϵ,η ), EQ(m) h d TV Q(m)n ˆϵ,ˆη|D(m) tr , Q(m)n ˆϵ,η|D(m) tr d TV(Q(m)n ϵ ,η , Q(m)n ϵ ,η) + d TV(Q(m)n ϵ ,η , Q(m)n ϵ ,η) i ) min{d TV(Q(m)n ϵ ,η , Q(m)n ϵ,η ), d TV(Q(m)n ϵ ,η , Q(m)n ϵ ,η)} Because of continuity of min and EP h d TV P n ˆϵ,ˆη|D(m) tr , P n ϵ,ˆη|D(m) tr i d TV(P n ϵ ,η , P n ϵ,η ) = o(1) and sup P P0 EP h d TV P n ˆϵ,ˆη|D(m) tr , P n ˆϵ,η|D(m) tr i d TV(P n ϵ ,η , P n ϵ ,η) = o(1) we have that = min{d TV(Q(m)n ϵ ,η , Q(m)n ϵ,η ) + o(1), d TV(Q(m)n ϵ ,η , Q(m)n ϵ ,η) + o(1)} min{d TV(Q(m)n ϵ ,η , Q(m)n ϵ,η ), d TV(Q(m)n ϵ ,η , Q(m)n ϵ ,η)} = min{d TV(Q(m)n ϵ ,η , Q(m)n ϵ,η ), d TV(Q(m)n ϵ ,η , Q(m)n ϵ ,η))} min{d TV(Q(m)n ϵ ,η , Q(m)n ϵ,η ), d TV(Q(m)n ϵ ,η , Q(m)n ϵ ,η)} + o(1) Finally implying, from step B.3, h EP [φRESIT α (D(n) te , D(m) tr )] α min{d TV(P n ϵ ,η , P n ϵ,η ), d TV(P n ϵ ,η , P n ϵ ,η)} i o(1) as m Theorem 4.4. Suppose that Assumptions 4.1, 4.2, 4.3, 3.2, and 3.3 hold. 
If n is a function of m such that n and n = o(mγ) as m , then EP [φRBPT α (D(n) te , D(m) tr )] = 1 Φ τα r n σ2 P ΩRBPT P where o(1) denotes uniform convergence over all P P as m and ΩRBPT P = ΩRBPT P,1 ΩRBPT P,2 ΩRBPT P,1 EP ℓ Z g P (x, Z)d Q X|Z(x), Y ℓ Z g P (x, Z)d PX|Z(x), Y # ΩRBPT P,2 | {z } Jensen s gap ℓ(g P (X, Z), Y ) ℓ Z g P (x, Z)d PX|Z(x), Y # Proof. First, note that there must be21 a sequence of probability measures in P, (P (m))m N, such that EP [φRBPT α (D(n) te , D(m) tr )] 1 + Φ τα r n σ2 P ΩRBPT P EP (m)[φRBPT α (D(n) te , D(m) tr )] 1 + Φ τα r n σ2 P (m) ΩRBPT P (m) Then, it suffices to show that the RHS vanishes when we consider such a sequence (P (m))m N. Now, let us first decompose the test statistic Ξ(n,m) in the following way: Ξ(n,m) n T (n,m) n T (n,m) EP (m)[T (m) 1 | D(m) tr ] ˆσ(n,m) + n EP (m)[T (m) 1 | D(m) tr ] ˆσ(n,m) n T (n,m) EP (m)[T (m) 1 | D(m) tr ] ˆσ(n,m) + n EP (m)[ℓ(ˆh(m)(Z1), Y1) ℓ(ˆg(m)(X1, Z1), Y1) | D(m) tr ] ˆσ(n,m) n T (n,m) EP (m)[T (m) 1 | D(m) tr ] n EP (m)[ℓ( R ˆg(m)(x, Z1)d ˆQ(m) X|Z(x), Y1) ℓ(ˆg(m)(X1, Z1), Y1) | D(m) tr ] n T (n,m) EP (m)[T (m) 1 | D(m) tr ] n EP (m)[ℓ( R ˆg(m)(x, Z1)d ˆQ(m) X|Z(x), Y1) ℓ( R ˆg(m)(x, Z1)d Q X|Z(x), Y1) | D(m) tr ] n EP (m)[ℓ( R ˆg(m)(x, Z1)d Q X|Z(x), Y1) ℓ( R g P (m)(x, Z1)d Q X|Z(x), Y1) | D(m) tr ] ˆσ(n,m) + n EP (m)[ℓ(g P (m)(X1, Z1), Y1) ℓ(ˆg(m)(X1, Z1), Y1) | D(m) tr ] ˆσ(n,m) + n ˆσ(n,m) EP (m) "h ℓ Z g P (m)(x, Z1)d Q X|Z(x), Y ℓ Z g P (m)(x, Z1)d P (m) X|Z(x), Y i + + h ℓ Z g P (m)(x, Z1)d P (m) X|Z(x), Y ℓ(g P (m)(X1, Z1), Y1) i# n W (m) 1,P (m) ˆσ(n,m) + n W (m) 2,P (m) ˆσ(n,m) + n W (m) 3,P (m) ˆσ(n,m) + n W (m) 4,P (m) ˆσ(n,m) + nΩRBPT P (m) ˆσ(n,m) 21Because of the definition of sup. Given that n is a function of m, we omit it when writing the W (m) j,P (m) s. Define σ(m) P (m) q Var P (m)[T (m) 1 | D(m) tr ] and see that EP (m)[φRBPT α (D(n) te , D(m) tr )] = (B.4) = PP (m)[p(D(n) te , D(m) tr ) α] = PP (m) h 1 Φ Ξ(n,m) α i = PP (m) h Ξ(n,m) τα i n W (m) 1,P (m) ˆσ(n,m) + n W (m) 2,P (m) ˆσ(n,m) + n W (m) 3,P (m) ˆσ(n,m) + n W (m) 4,P (m) ˆσ(n,m) + nΩRBPT P (m) ˆσ(n,m) τα n W (m) 1,P (m) σP (m) + n W (m) 2,P (m) σP (m) + n W (m) 3,P (m) σP (m) + n W (m) 4,P (m) σP (m) + nΩRBPT P (m) σP (m) + τα τα ˆσ(n,m) n W (m) 1,P (m) σ(m) P (m) σP (m) + n W (m) 2,P (m) σP (m) + n W (m) 3,P (m) σP (m) + n W (m) 4,P (m) σP (m) + τα τα ˆσ(n,m) σ(m) P (m) σP (m) τα nΩRBPT P (m) σP (m) τα r n σ2 P (m) ΩRBPT P (m) + o(1) (B.5) Implying that EP [φRBPT α (D(n) te , D(m) tr )] 1 + Φ τα r n σ2 P ΩRBPT P = o(1) as m Justifying step B.5. First, from a central limit theorem for triangular arrays [6, Corollary 9.5.11], we have that W (m) 1,P (m) T (n,m) EP (m)[T (m) 1 | D(m) tr ] T (m) i EP (m)[T (m) 1 | D(m) tr ] The conditions for the central limit theorem [6, Corollary 9.5.11] can be proven to hold like in Theorem 3.4 s proof. Second, we have that n W (m) 1,P (m) n W (m) 1,P (m) σ(m) P (m) σP (m) + n W (m) 2,P (m) σP (m) + n W (m) 3,P (m) σP (m) + n W (m) 4,P (m) σP (m) + τα τα ˆσ(n,m) σ(m) P (m) σP (m) n W (m) 1,P (m) 1 σ(m) P (m) σP (m) n W (m) 2,P (m) σP (m) n W (m) 3,P (m) σP (m) n W (m) 4,P (m) σP (m) + σ(m) P (m) 1 ! 
σ(m) P (m) σP (m) 1 σ(m) P (m) σP (m) 1 σ(m) P (m) 1 = op(1) as m To see why the random quantity above converges to zero in probability, see that because of Assumption 3.3, Lemma22 B.1, and continuous mapping theorem, we have that σ(m) P (m) 1 = op(1) and σ(m) P (m) σP (m) 1 = op(1) as m Additionally, because of Assumptions 4.1, 4.2, and 4.3 and condition n = o(mγ), we have that 22We can apply this STFR s lemma because it still holds when we consider GCM s test statistic. n σP (m) W (m) 2,P (m) n σP (m) EP (m) ℓ Z ˆg(m)(x, Z1)d ˆQ(m) X|Z(x), Y ℓ Z ˆg(m)(x, Z1)d Q X|Z(x), Y | D(m) tr n σP (m) EP (m) ℓ Z ˆg(m)(x, Z1)d ˆQ(m) X|Z(x), Y ℓ Z ˆg(m)(x, Z1)d Q X|Z(x), Y | D(m) tr σP (m) EP (m) Z ˆg(m)(x, Z1)d ˆQ(m) X|Z(x) Z ˆg(m)(x, Z1)d Q X|Z(x) 2 | D(m) tr σP (m) EP (m) Z ˆg(m)(x, Z1)d ˆQ(m) X|Z(x) Z ˆg(m)(x, Z1)d Q X|Z(x) 1 | D(m) tr σP (m) EP (m) Z ˆg(m)(x, Z1) 1 |ˆq(m) X|Z(x|Z) q X|Z(x|Z)|dµ(x) | D(m) tr σP (m) EP (m) Z |ˆq(m) X|Z(x|Z) q X|Z(x|Z)|dµ(x) | D(m) tr = 2LM n inf P P σP EP (m) h d TV( ˆQ(m) X|Z, Q X|Z) | D(m) tr i = 2LM inf P P σP o(mγ)Op(m 2γ) 1/2 n σP (m) W (m) 3,P (m) = EP (m) ℓ Z ˆg(m)(x, Z1)d Q X|Z(x), Y ℓ Z g P (m)(x, Z1)d Q X|Z(x), Y | D(m) tr n σP (m) EP (m) ℓ Z ˆg(m)(x, Z1)d Q X|Z(x), Y ℓ Z g P (m)(x, Z1)d Q X|Z(x), Y | D(m) tr σP (m) EP (m) Z ˆg(m)(x, Z1)d Q X|Z(x) Z g P (m)(x, Z1)d Q X|Z(x) 2 | D(m) tr σP (m) EP (m) Z ˆg(m)(x, Z1) g P (m)(x, Z1)d Q X|Z(x) 2 | D(m) tr σP (m) EP (m) Z ˆg(m)(x, Z1) g P (m)(x, Z1) 2 d Q X|Z(x) | D(m) tr σP (m) EP (m) Z ˆg(m)(x, Z1) g P (m)(x, Z1) 2 d Q X|Z d PX|Z (x)d PX|Z(x) | D(m) tr σP (m) EP (m) ˆg(m)(X1, Z1) g P (m)(X1, Z1) 2 d Q X|Z d PX|Z (X1) | D(m) tr EP (m) ˆg(m)(X1, Z1) g P (m)(X1, Z1) 2 2 | D(m) tr " d Q X|Z d PX|Z (X1) 2#!1/2 σP (m) EP (m) ˆg(m)(X1, Z1) g P (m)(X1, Z1) 2 2 | D(m) tr EP (m) Z d Q X|Z d PX|Z (x)d Q X|Z(x) 1/2 σP (m) EP (m) ˆg(m)(X1, Z1) g P (m)(X1, Z1) 2 2 | D(m) tr EP (m) χ2 Q X|Z||PX|Z + 1 1/2 LR n inf P P σP EP (m) ˆg(m)(X1, Z1) g P (m)(X1, Z1) 2 2 | D(m) tr = LR inf P P σP o(mγ)Op(m γ) 1/2 n σP (m) W (m) 4,P (m) = EP (m) h ℓ ˆg(m)(X1, Z1), Y1 ℓ g P (m)(X1, Z1), Y1 | D(m) tr i n σP (m) EP (m) h ℓ ˆg(m)(X1, Z1), Y1 ℓ g P (m)(X1, Z1), Y1 | D(m) tr i σP (m) EP (m) h ˆg(m)(X1, Z1) g P (m)(X1, Z1) 2 | D(m) tr i L n inf P P σP EP (m) ˆg(m)(X1, Z1) g P (m)(X1, Z1) 2 2 | D(m) tr = L inf P P σP o(mγ)Op(m γ) 1/2 where M is an upper bound for the sequence ˆg(m)(x, z) 1 m N (Assumption 4.3) and R is an upper bound for the sequence (EP (m) χ2 Q X|Z||PX|Z + 1 )1/2 m N (Assumption 4.2). Finally, n W (m) 1,P (m) σ(m) P (m) σP (m) + n W (m) 2,P (m) σP (m) + n W (m) 3,P (m) σP (m) + n W (m) 4,P (m) σP (m) + τα τα ˆσ(n,m) σ(m) P (m) σP (m) = n W (m) 1,P (m) n W (m) 1,P (m) n W (m) 1,P (m) σ(m) P (m) σP (m) + n W (m) 2,P (m) σP (m) + n W (m) 3,P (m) σP (m) + n W (m) 4,P (m) σP (m) + τα τα ˆσ(n,m) σ(m) P (m) σP (m) n W (m) 1,P (m) σ(m) P (m) + op(1) N(0, 1) by Slutsky s theorem. Because N(0, 1) is a continuous distribution, we have uniform convergence of the distribution function [27][Chapter 8, Exercise 5], and we do not have to worry about the fact that nΩRBPT P (m) σP (m) depends on m. C Experiments C.1 Running times Artificial-data experiments. Regarding running times (average per iteration), RBPT took 5 10 4s to run, RBPT2 took 1.9s, STFR took 7.5 4s, RESIT took 1.3 10 1s, GCM took 6 10 4s, CRT took 1.7 10 2s, and CPT took 6.2 10 1s, all in a Mac Book Air 2020 M1. Real-data experiments. 
Regarding running times (average per iteration), RBPT took $1.1 \times 10^{-1}$ s to run, RBPT2 took $2.4 \times 10^{-1}$ s, STFR took $10^{-3}$ s, GCM took $10^{-3}$ s, CRT took $2.5 \times 10^{-2}$ s, and CPT took $7.2 \times 10^{-1}$ s, all on a MacBook Air 2020 (M1).

C.2 Extra results

We include extra experiments in which $Y \mid X, Z$ has a skewed normal distribution with location $\mu = c\,X + a\,Z + \gamma\,(b\,Z)^2$, scale $\sigma = 1$, and shape $s = 3$ (shape $s = 0$ recovers the normal distribution). From the following plots, we can learn that the skewness often negatively impacts Type-I error control.

Figure 7: Type-I error rates ($c = 0$). In the first two plots, we set $\theta = 0$ for RBPT, permitting Type-I error control across different $d_Z$ values (Theorem 4.4), while RBPT2 allows Type-I error control for moderate $d_Z$. All the baselines fail to control Type-I errors regardless of $d_Z$. The last two plots illustrate that CRT emerges as the least robust test in this context, followed by RBPT and CPT.

Figure 8: (i) Making RBPT2 more robust using unlabeled data. With $d_Z = 40$, we gradually increase the unlabeled sample size from 0 to 1000 when fitting $\hat h$. The results show that a larger unlabeled sample size leads to effective Type-I error control. Even though we present this result for RBPT2, the same pattern is expected for RBPT in the presence of unlabeled data. (ii) Power curves for different methods. We compare our methods with CPT (when $\theta = 0$), which seems to have practical robustness against misspecified inductive biases. RBPT2 and CPT have similar power, while RBPT is slightly more conservative.
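As a reference for the data-generating process just described, the following sketch shows how such a skew-normal response could be simulated with SciPy; the dimensions, coefficient values, and function names are placeholders of ours, not the exact experimental settings.

```python
import numpy as np
from scipy.stats import skewnorm

def sample_skewed_response(X, Z, a, b, c, gamma=1.0, shape=3.0, rng=None):
    """Draw Y | X, Z from a skew-normal with location c.X + a.Z + gamma (b.Z)^2,
    scale 1, and shape `shape` (shape = 0 recovers the normal distribution)."""
    rng = np.random.default_rng(rng)
    mu = X @ c + Z @ a + gamma * (Z @ b) ** 2
    return skewnorm.rvs(a=shape, loc=mu, scale=1.0, random_state=rng)

# Illustrative usage with placeholder dimensions; c = 0 corresponds to H0 (X independent of Y given Z).
rng = np.random.default_rng(0)
n, d_x, d_z = 500, 1, 40
X, Z = rng.normal(size=(n, d_x)), rng.normal(size=(n, d_z))
a, b, c = rng.normal(size=d_z), rng.normal(size=d_z), np.zeros(d_x)
Y = sample_skewed_response(X, Z, a, b, c)
```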