Published as a conference paper at ICLR 2024

OUT-OF-VARIABLE GENERALIZATION FOR DISCRIMINATIVE MODELS

Siyuan Guo, Jonas Wildberger, Bernhard Schölkopf
University of Cambridge, United Kingdom; Max Planck Institute for Intelligent Systems, Tübingen, Germany
Correspondence to: siyuan.guo@tuebingen.mpg.de

ABSTRACT

The ability of an agent to do well in new environments is a critical aspect of intelligence. In machine learning, this ability is known as strong or out-of-distribution generalization. However, merely considering differences in distributions is inadequate for fully capturing differences between learning environments. In the present paper, we investigate out-of-variable generalization, which pertains to an agent's generalization capabilities concerning environments with variables that were never jointly observed before. This skill closely reflects the process of animate learning: we, too, explore Nature by probing, observing, and measuring proper subsets of variables at any given time. Mathematically, out-of-variable (OOV) generalization requires the efficient re-use of past marginal information, i.e., information over subsets of previously observed variables. We study this problem, focusing on prediction tasks across environments that contain overlapping, yet distinct, sets of causes. We show that, after fitting a classifier, the residual distribution in one environment reveals the partial derivative of the true generating function with respect to the unobserved causal parent in that environment. We leverage this information and propose a method that exhibits non-trivial out-of-variable generalization performance when facing an overlapping, yet distinct, set of causal predictors.

Code: https://github.com/syguo96/Out-of-Variable-Generalization

1 INTRODUCTION

Much of modern machine learning can be viewed as large-scale pattern recognition on suitably collected independent and identically distributed (i.i.d.) data. Its success builds on generalizing from one observation to the next, sampled from the same distribution. Animate intelligence differs from this in its ability to generalize from one problem to another. The machine learning community studies the latter under the term out-of-distribution (OOD) generalization (Shen et al., 2021; Parascandolo et al., 2021; Ahuja et al., 2021; Krueger et al., 2021; Zhang et al., 2021b;a; Schölkopf, 2022), where training and test data differ in their distributions. However, differences in distributions do not fully capture differences in environments. In the present work, we investigate generalization across environments in which different sets of variables are observed, referring to the problem as out-of-variable (OOV) generalization. While in practice many real-world situations exhibit both OOD and OOV aspects, we note that the OOV problem can occur even if there is no shift in the underlying distribution. In the present paper, we focus on this setting.

OOV generalization aims to transfer knowledge learnt from a set of source environments to a target environment that contains variables never jointly present in any of the sources, or even not present at all. OOV is a ubiquitous problem in inference. Scientific discovery synthesizes information and generalizes both out-of-distribution and out-of-variable (Seneviratne et al., 2018; Hey et al., 2009). Medicine is a field where machine learning is thought to have great potential, since it can learn from millions of patients while a doctor may only see a few thousand during their lifetime. However, we face strong limitations in guaranteeing dataset consistency.
To begin with, patients have unique circumstances, and some diseases and symptoms are rare. More generally, medical datasets come with different variable sets: diagnostic measurements vary greatly across patients (serum, various forms of imaging, genomics, proteomics, immunoassays, etc.). A good doctor, however, will be able to generalize across patients even if the measured variables are not identical. In practice, data scientists end up using imputation or simply ignoring rarely measured features. To realize the potential of AI in medicine, we need to understand the OOV problem.

To provide context, we briefly discuss pertinent research threads.

Missing data (Rubin, 1976) refers to settings in which covariates are missing for individual data points. Most approaches (Donner, 1982; Kim and Curry, 1977) either omit data that contain missing values or perform imputation. Our problem differs in that some variables are missing in entire environments.

Transfer learning studies how to re-use previous knowledge for future tasks. Recent work focused on transferring re-usable features (Long et al., 2015; Oquab et al., 2014; Tzeng et al., 2015) or model parameters (Dodge et al., 2020; Sermanet et al., 2013; Hoffman et al., 2014; Cortes et al., 2019; Wenzel et al., 2022) of discriminative models. Deep-learning-based approaches (Meyerson and Miikkulainen, 2020; Reed et al., 2022) embed variable relationships as proximity in latent spaces. Our work presents a theoretical study of OOV generalization, showing that without additional assumptions the discriminative OOV problem is not solvable: marginal consistency between source and target discriminative models does not uniquely determine a solution.

Marginal problems in the statistical (Vorob'ev, 1962; Sklar, 1996) and causal literature (Mejia et al., 2021; Gresele et al., 2022; Janzing, 2018; Janzing and Schölkopf, 2010; Evans and Didelez, 2021; Robins, 1999), on the other hand, study how to merge marginal information from different sources. Concrete methods may involve searching for a joint distribution that is consistent with marginal observations. The elegant work of Mejia et al. (2021) uses the maximum entropy principle to infer joint distributions compatible with observed marginal datasets; Gresele et al. (2022) study the existence of consistent causal models; Janzing (2018) aims to learn useful causal models that can predict properties of previously unobserved variable sets. Note that inferring the joint distribution for prediction tasks may be inefficient, in line with Vapnik's principle (Vapnik, 1999): given some task, one should avoid solving a more general problem as an intermediate step. Our work takes a different approach, showing that learning from a residual error distribution is sufficient to achieve identifiability in nontrivial discriminative OOV scenarios.

Causality has been argued to be related to the issue of generalization across domains (Zhang et al., 2015; Schölkopf et al., 2011; Arjovsky et al., 2019; Pearl and Bareinboim, 2022). Distribution shifts between domains can be modelled as sparse causal mechanism shifts (Bengio et al., 2019; Schölkopf, 2022; Perry et al., 2022), and the correct causal structure may help efficient modular adaptation (Parascandolo et al., 2018; Goyal et al., 2020).
Other work considers domain differences as shifts in spurious correlations, or aims to learn invariant causal information that is robust across environments (Schölkopf et al., 2011; Peters et al., 2016; Rojas-Carulla et al., 2018; Heinze-Deml et al., 2018; Arjovsky et al., 2019; Jiang and Veitch, 2022; Krueger et al., 2021; Parascandolo et al., 2021; Ahuja et al., 2021; Lu et al., 2021; Pfister et al., 2019). Causality approaches also include the transfer of causal effects across different experimental conditions (Pearl and Bareinboim, 2022; Bareinboim and Pearl, 2013; 2016; Degtiar and Rose, 2023). In the present paper, we highlight another connection between causality and generalization, where domain differences are due to the distinct sets of variables contained within them, and causal assumptions allow us to generalize even to variables that we have never seen during training.

Out-of-variable generalization studies the efficient re-use of marginal observations. We do not solve this problem, nor do we present a robust algorithm for real-world settings. We do present a proof of concept, proposing a setting and a predictor provably capable of leveraging additional information beyond what is typically used by discriminative models. We do so without the need to infer joint distributions. Our main contributions are:

- We contextualize (§1) and study OOV generalization (§2).
- We investigate challenges for common approaches (e.g., transferring reusable features or model parameters) to solve discriminative OOV problems (§3.2). We show (Theorem 1) that the marginal consistency condition alone does not permit identification of the target predictive function for common OOV scenarios.
- We study the identification problem in OOV scenarios when source and target covariates have dependent (§3.3.1) and independent (§3.3.2) structures. We find that the moments of the error distribution in the source domain reveal the partial derivative of the true generating function with respect to the unobserved causal parents (§3.3.4).
- We then propose an OOV predictor and evaluate its performance experimentally (§4), showing that our approach achieves a non-trivial degree of OOV transfer.

Figure 1: Example of an OOV scenario: (a) the blue box includes the variables observed in the source domain, and the orange box those observed in the target domain. A directed edge represents a causal relationship. With Y not observed in the target domain, the goal is to predict Y in the target domain using the source domain. Panels (b)-(e) show contour lines of various methods' predictions of E[Y | X2, X3]: (b) Proposed, (c) Oracle, (d) Marginal, (e) Mean Imputed. Our proposed predictor (b) closely matches the true expectation (c), an oracle solution trained as if we had sufficient data and observed all variables of interest. In contrast, the marginal and mean-imputed predictors (d, e) deviate far from the true expectation (for details, cf. §4).

Fig. 1 provides a toy example of our problem. At first glance, it would seem all but impossible to transfer any information from the source environment (blue box) to the target environment (orange box) about a variable that is unobserved in the source. The goal of the present paper is to show that, under certain causal and functional assumptions, there is a previously overlooked source of information in this OOV setting, making it possible after all.
2 OUT-OF-VARIABLE GENERALIZATION

Denote by X a random variable with values x, and by P a probability distribution with density p. Consider an acyclic structural causal model (SCM) M consisting of a collection of random variables and structural assignments (Pearl, 2009)

$$X_i := f_i(\mathrm{PA}_i, U_i), \quad i = 1, \ldots, n, \quad (1)$$

where $\mathrm{PA}_i$ are the parents or direct causes of $X_i$ and the $U_i$ are jointly independent noise variables. Given an SCM M, one can define its corresponding directed acyclic graph (DAG), where the incoming edges for each node are given by its parent set. A joint distribution generated from some SCM M with DAG G allows the Markov factorization

$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \mathrm{pa}_i^G), \quad (2)$$

where $\mathrm{pa}_i^G$ are the parents of $X_i$ in G. The factors ("mechanisms") in (2) are postulated to be independent:

Principle 1 (Independent Causal Mechanisms (ICM) (Peters et al., 2017)). A change in one mechanism $p(x_i \mid \mathrm{pa}_i^G)$ does not inform (Guo et al., 2022; Janzing and Schölkopf, 2010) or influence (Schölkopf et al., 2011) any of the other mechanisms $p(x_j \mid \mathrm{pa}_j^G)$, $i \neq j$.

Before defining OOV generalization, we begin by motivating the problem. Probabilistic representations (such as the Markov factorization in Eq. 2) have been argued to offer advantages for probabilistic inference (Koller and Friedman, 2010) and interpretability. We highlight that, in addition, the Markov factorization frees us from the need to observe all variables of interest at the same time:

Observation (Estimating the joint via causal modules). Suppose that we have knowledge of the causal DAG and would like to estimate the joint density p. Provided that for each variable $X_i$ we observe an environment containing $X_i$ and its causal parents, we can recover the joint density by multiplying (according to Eq. 2) the conditionals $p(x_i \mid \mathrm{pa}_i^G)$ estimated separately in the environments.

This phenomenon also occurs in undirected probabilistic graphical models, where the joint density $p(x_1, \ldots, x_n) = \frac{1}{Z} \prod_{c \in \text{cliques}} \psi_c(x_c)$ is recoverable given potentials learnt from environments that contain the variables appearing in each clique.
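To make the Observation concrete, here is a minimal sketch (our own illustration, not code from the paper or its repository). It assumes a known binary chain X -> Z -> W: the joint p(x, z, w) = p(x) p(z|x) p(w|z) is recovered from two environments, neither of which observes all three variables together.

```python
# Minimal sketch: recovering a joint density from causal modules estimated in
# separate, overlapping environments (assumed chain X -> Z -> W, binary variables).
import numpy as np

rng = np.random.default_rng(0)

def sample_chain(n):
    x = rng.integers(0, 2, n)                     # X ~ Bernoulli(0.5)
    z = (x + (rng.random(n) < 0.3)) % 2           # Z := X flipped w.p. 0.3
    w = (z + (rng.random(n) < 0.2)) % 2           # W := Z flipped w.p. 0.2
    return x, z, w

# Environment A observes (X, Z); environment B observes (Z, W).
xa, za, _ = sample_chain(100_000)
_, zb, wb = sample_chain(100_000)

def cond_prob(child, parent):
    """p(child = 1 | parent) estimated by counting."""
    return np.array([child[parent == v].mean() for v in (0, 1)])

p_x1 = xa.mean()                   # p(X = 1) from environment A
p_z1_given_x = cond_prob(za, xa)   # p(Z = 1 | X) from environment A
p_w1_given_z = cond_prob(wb, zb)   # p(W = 1 | Z) from environment B

def joint(x, z, w):
    px = p_x1 if x else 1 - p_x1
    pz = p_z1_given_x[x] if z else 1 - p_z1_given_x[x]
    pw = p_w1_given_z[z] if w else 1 - p_w1_given_z[z]
    return px * pz * pw

# Compare against a Monte Carlo estimate of the true joint.
xt, zt, wt = sample_chain(500_000)
for x, z, w in [(0, 0, 0), (1, 1, 0), (1, 0, 1)]:
    mc = np.mean((xt == x) & (zt == z) & (wt == w))
    print(f"p({x},{z},{w}): modular {joint(x, z, w):.3f} vs monte carlo {mc:.3f}")
```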
The above are the simplest cases of OOV generalization, yet they already illustrate that causal assumptions can help. We will study a more subtle case below. To this end, we model an environment E = (D, T) as a domain D and a task T. The domain contains a variable space $\mathcal{X} := (X_1, X_2, \ldots)$ and its joint probability distribution $P(\mathcal{X})$. Given a domain, a task contains a target variable space $\mathcal{Y} := Y$ and a predictor $f : \mathcal{X} \to \mathcal{Y}$. To differentiate between components belonging to the source and target environments, subscripts s and t are used.

Definition 1 (OOV Generalization). OOV uncertainty arises when the variable space of the target environment is not contained in any of the variable spaces of the source environments, i.e., $\forall s: \{\mathcal{X}_t, \mathcal{Y}_t\} \not\subseteq \{\mathcal{X}_s, \mathcal{Y}_s\}$. If a method for estimating a quantity in the target environment (e.g., a predictor $f_t$) improves by utilizing data from the source environments, we say it generalizes OOV.

While there is nothing inherently causal about OOV generalization, we use SCMs since it turns out that they allow the formulation of assumptions and methods that provably exhibit OOV generalization.

3 RESIDUAL GENERALIZATION UNDER CAUSAL ASSUMPTIONS

3.1 PROBLEM FORMULATION

For simplicity, we consider a univariate setting, referring to Appendix C.1 for a multivariate extension. Consider an SCM with additive noise (Hoyer et al., 2009)

$$Y := \phi(X_1, X_2, X_3) + \epsilon \quad (3)$$

with a function $\phi$, jointly independent causes $X_i$, and $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Assume that we do not have access to an environment jointly containing all variables $(X_1, X_2, X_3, Y)$. Instead, we have (Fig. 1a):

- a source environment with jointly observed $(X_1, X_2, Y)$, and
- a target environment with jointly observed $(X_2, X_3)$ and unobserved Y.

Our goal is to predict Y given $X_2, X_3$. This OOV scenario poses two challenges: without joint observations of $(X_2, X_3, Y)$, we cannot train or fine-tune a discriminative model in the target environment; further, due to the independence among the covariates, it is impossible to infer $X_3$ from the covariates observed in the source environment. Fig. 2a shows a visualization of the problem. To ground the problem in the real world, consider two medical labs collecting different sets of variables. Lab A collects $X_1$ = lifestyle factors and $X_2$ = blood test; Lab B, in addition to $X_2$, collects $X_3$ = genomics. Lab A is hospital-based and can measure diseases Y, whereas B is a research lab. The OOV problem asks: given a model trained to predict Y on Lab A's data, how should Lab B use this model for its own dataset, which differs in the set of input variables?

3.2 CHALLENGES IN DISCRIMINATIVE OOV GENERALIZATION

Transfer learning often transfers reusable features or model parameters of a discriminative model fitted in the source environment. This approach has inherent limitations when it comes to OOV scenarios: lacking the outcome variable Y, we cannot fine-tune in the target environment; further, the common features, in our case, are the common variables $X_2$ shared between environments. A naive approach is then to predict the target sample using the model restricted to $X_2$, i.e., the marginal predictor, cf. our experiments (§4). However, such a predictor yields constant predictions irrespective of changes in $X_3$. To leverage the shared variables further, we study the properties that the optimal target predictor is expected to satisfy, in order to restrict the set of potential target predictors. To see this concretely: suppose data is sufficient and one can observe all variables of interest; training discriminative models on each environment then yields the optimal predictive functions that minimize the mean squared error loss,

$$f_s(x_1, x_2) = \mathbb{E}_{X_3}[Y \mid x_1, x_2] \quad (4)$$
$$f_t(x_2, x_3) = \mathbb{E}_{X_1}[Y \mid x_2, x_3] \quad (5)$$

for the source and target environment, respectively. With the discriminative model $f_s$ fitted in the source environment, its residual distribution is the distribution of the differences between the observed value and the prediction, $Y - f_s(X_1, X_2) \mid X_1, X_2$.

Figure 2: Examples of OOV scenarios, (a) Marginal to Marginal, (b) Marginal to Joint, (c) Merge datasets, where the marginal consistency condition (6) alone does not permit the identification of the optimal predictive function in the corresponding target domain.

Note that the optimal predictive functions $f_s, f_t$ automatically satisfy the marginal consistency condition (see Appendix A): for any $x_2$, we have

$$\mathbb{E}_{X_1}[f_s(X_1, x_2)] = \mathbb{E}_{X_3}[f_t(x_2, X_3)]. \quad (6)$$

Suppose we have trained the optimal predictor in the source environment. The marginal consistency condition (6), enforcing consistency over the shared variables, then restricts the solution space for the predictor in the target environment.
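As a quick numerical illustration (a sketch of ours, with an arbitrary, hypothetical choice of $\phi$ and of the cause distributions), the following verifies condition (6) by Monte Carlo for the optimal predictors in Eqs. (4) and (5):

```python
# Numerical check of marginal consistency, Eq. (6), for the *optimal* predictors
# f_s and f_t. The generating function phi and distributions are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)

def phi(x1, x2, x3):                         # hypothetical generating function
    return np.sin(x1) + x2 * x3 + 0.5 * x1 * x2

x1_samples = rng.normal(1.0, 1.0, 50_000)    # p(x1)
x3_samples = rng.exponential(1.0, 50_000)    # p(x3)

def f_s(x1, x2):                             # E_{X3}[Y | x1, x2], Eq. (4)
    return phi(x1, x2, x3_samples).mean()

def f_t(x2, x3):                             # E_{X1}[Y | x2, x3], Eq. (5)
    return phi(x1_samples, x2, x3).mean()

for x2 in (-1.0, 0.0, 2.0):
    lhs = np.mean([f_s(x1, x2) for x1 in x1_samples[:2_000]])   # E_{X1}[f_s(X1, x2)]
    rhs = np.mean([f_t(x2, x3) for x3 in x3_samples[:2_000]])   # E_{X3}[f_t(x2, X3)]
    # The two sides agree up to Monte Carlo error.
    print(f"x2 = {x2:+.1f}: E[f_s] = {lhs:.2f}, E[f_t] = {rhs:.2f}")
```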
However, Theorem 1 shows that this restriction does not uniquely determine the target predictor, i.e., it does not permit identification of the optimal predictive function in the target environment for any of the scenarios shown in Fig. 2. See Appendix C.2 for the multivariate version and proofs.

Theorem 1. Consider the OOV scenarios in Fig. 2, each governed by the SCM described in §3.1. Suppose that the variables considered in Fig. 2a and Fig. 2b are real-valued and that the variables $X_1$ and $X_3$ in Fig. 2c are binary. We assume that for all i the marginal density $p_i(x_i)$ is known, and denote its support by $S_i := \{x \in \mathbb{R} \mid p_i(x) > 0\}$. Suppose that for all i there exist two distinct points $x, x' \in S_i$. Then, for any pair $f_s, f_t$ satisfying marginal consistency (6) and for any $R > 0$, there exists another function $f'_t$ with $\|f_t - f'_t\|_2 \geq R$ that also satisfies marginal consistency.

3.3 IDENTIFICATION IN OOV GENERALIZATION

We now study when identifiability of the optimal target predictive function can be achieved. To start with, we consider a different setting than our setup (Fig. 1), where the covariates of the source and target environments are dependent and the causes of the outcome variable are contained in the target environment.

3.3.1 WITH DEPENDENT COVARIATES

Theorem 2. Consider a target variable Y and its direct causes $\mathrm{PA}_Y$. Suppose that we observe:

- a source environment containing the variables (Z, Y); training a discriminative model on this environment yields a function $f_s(z) = \mathbb{E}[Y \mid Z = z]$,
- a target environment containing the variables $\mathrm{PA}_Y$.

Suppose $Y := \phi(\mathrm{PA}_Y) + \epsilon_Y$ and $Z = g(\mathrm{PA}_Y) + \epsilon_Z$, where g is known and invertible, with $\phi$ and $g^{-1}$ uniformly continuous. Then, in the limit $\mathbb{E}[|\epsilon_Z|] \to 0$, the composition of the discriminative model learnt in the source environment with g approaches the optimal predictor, i.e., $\forall\, \mathrm{pa}_Y: f_s \circ g(\mathrm{pa}_Y) \to \phi(\mathrm{pa}_Y)$.

Appendix C.3 details its multivariate statement and proof, and Fig. 5 in the Appendix shows an example of such a scenario. Informally, Theorem 2 states that one can identify the optimal target predictive function from the learnt source function if the dependence structure between the source and target covariates is known and satisfies the above assumptions. However, in real-world applications, the dependence structure between the source and target covariates may not be known or may not even exist. To further understand this OOV problem, we next study a more challenging scenario in which all covariates are independent of each other, and demonstrate a seemingly surprising result: under certain assumptions, the optimal target predictive function is identifiable without knowledge of any dependence structure among the covariates.

3.3.2 WITH INDEPENDENT COVARIATES

Given the theoretical results above on the limitations of current approaches to transferring discriminative models in OOV scenarios, we present a practical method for the base case illustrated in Fig. 1a and detail its underlying assumptions.

Simple Additive Model One solution to tackle the problem in Fig. 1a is to train separate discriminative models for each observed variable. For example, given the source environment, we learn function mappings on $(X_1, Y)$ and $(X_2, Y)$ as $f_1, f_2$. When facing a different set of variables, e.g., $(X_2, X_3)$, we can directly re-use the learnt $f_2$. With an additional collection of Y in the target domain, we then train a model on $(X_3, Y)$. The method offers a degree of compositional flexibility and circumvents the need to jointly observe all variables of interest, e.g., $(X_2, X_3, Y)$.
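As a sketch of this baseline (our illustration; the exact model and assumptions are in Appendix B), suppose $\phi$ is purely additive with independent causes. Then one simple combination rule is $f_t(x_2, x_3) \approx f_2(x_2) + f_3(x_3) - \mathbb{E}[Y]$, since each marginal predictor already absorbs the means of the causes it does not see. The combination rule below is our own derivation under these assumptions, not necessarily the paper's Appendix B formulation.

```python
# Sketch of the "simple additive model" baseline, assuming a purely additive phi
# with independent causes and a small additionally labelled batch of Y in the target.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x1, x2, x3 = rng.normal(size=(3, n))
    y = 2.0 * x1 + 3.0 * x2 - 1.0 * x3 + 0.1 * rng.normal(size=n)
    return x1, x2, x3, y

# Source environment: (X1, X2, Y) observed.
_, x2_s, _, y_s = sample(50_000)
b2 = np.polyfit(x2_s, y_s, 1)              # f2(x2) = b2[0] * x2 + b2[1]
ey = y_s.mean()                             # estimate of E[Y]

# Target environment: (X2, X3) observed, plus a modest labelled batch with Y.
_, _, x3_l, y_l = sample(2_000)
b3 = np.polyfit(x3_l, y_l, 1)              # f3(x3) = b3[0] * x3 + b3[1]

def f_t(x2, x3):
    # Combination rule for additive phi with independent causes.
    return np.polyval(b2, x2) + np.polyval(b3, x3) - ey

# Oracle: E[Y | x2, x3] = 3 * x2 - x3 (since E[2 * X1] = 0).
for x2, x3 in [(0.5, -1.0), (-2.0, 1.5)]:
    print(f"f_t({x2}, {x3}) = {f_t(x2, x3):+.2f}   oracle = {3 * x2 - x3:+.2f}")
```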
However, such a method first requires the collection of the variable Y in the target domain and assumes that the generating function of Y has only linear relationships with its causes (i.e., there is no interaction term such as $X_i X_j$, $i \neq j$). A detailed description of the model and its underlying assumptions can be found in Appendix B. Below, we propose a method to transfer in OOV scenarios that 1) does not require us to observe Y in the target domain, and 2) relaxes the linearity assumption. Note that the main idea is general enough to work for all scenarios in Fig. 2. We illustrate our method via an example (§3.3.3), with details in §3.3.4.

3.3.3 MOTIVATING EXAMPLE

Consider the problem described in §3.1 in the case where $\phi$ is a polynomial:

$$Y := \alpha_1 X_1 + \alpha_2 X_2 + \alpha_3 X_3 + \alpha_4 X_1 X_2 + \alpha_5 X_1 X_3 + \alpha_6 X_2 X_3 + \alpha_7 X_1 X_2 X_3 + \epsilon \quad (7)$$

Let $X_i$ have mean $\mu_i$ and variance $\sigma_i^2$ for all i. Given sufficient data and the observation of the variable Y in the target environment, training discriminative models in each environment would yield the optimal predictive functions that minimize the mean squared error:

$$f_s(x_1, x_2) = \alpha_3\mu_3 + (\alpha_1 + \alpha_5\mu_3)\,x_1 + (\alpha_2 + \alpha_6\mu_3)\,x_2 + (\alpha_4 + \alpha_7\mu_3)\,x_1 x_2$$
$$f_t(x_2, x_3) = \alpha_1\mu_1 + (\alpha_3 + \alpha_5\mu_1)\,x_3 + (\alpha_2 + \alpha_4\mu_1)\,x_2 + (\alpha_6 + \alpha_7\mu_1)\,x_2 x_3 \quad (8)$$

We first illustrate that fine-tuning, in this example, cannot identify the target predictive function. Note that the coefficients $\{\alpha_i\}$ are model parameters. We observe in (8) that the coefficients of the common term $x_2$ share some constituents between $f_s$ and $f_t$. One can thus expect that, during fine-tuning, these coefficients may adapt quickly. However, it is clear that one cannot uniquely determine the coefficients of $f_t$ without observing Y in the target environment, since the system of equations is under-determined, with eight unknown coefficients and only four estimated values, even in the above polynomial case.

No-noise regime First assume that there is no noise, i.e., $\epsilon = 0$. Although we do not observe the cause $X_3$, we nevertheless have information about it in the source environment: the unobserved variable acts as a noise term, and the residual distribution in the source environment carries a footprint of it. We will see below that, subject to suitable assumptions, this idea carries over to the noisy case.

With-noise regime Now consider additional additive noise. We will see that the idea outlined above carries over under suitable assumptions. The third moment of the residual distribution in the source environment takes the following form:

$$\mathbb{E}[(Y - f_s(x_1, x_2))^3 \mid x_1, x_2] = (\alpha_3 + \alpha_5 x_1 + \alpha_6 x_2 + \alpha_7 x_1 x_2)^3\, \mathbb{E}[(X_3 - \mu_3)^3] \quad (9)$$

We observe that the term in parentheses coincides exactly with the partial derivative, i.e.,

$$\frac{\partial \phi}{\partial x_3}(x_1, x_2, x_3) = \alpha_3 + \alpha_5 x_1 + \alpha_6 x_2 + \alpha_7 x_1 x_2 \quad (10)$$

With $\phi$ in (7) a polynomial, we know that the terms in the source environment with non-zero coefficients are $g(x_1, x_2) = [1, x_1, x_2, x_1 x_2]$. One can then fit a linear model with features g on the source environment and estimate its coefficients. The resulting predictor is $f_s(x_1, x_2) = \beta^T g(x_1, x_2)$, where $\beta_1 = \alpha_3\mu_3$, $\beta_2 = \alpha_1 + \alpha_5\mu_3$, $\beta_3 = \alpha_2 + \alpha_6\mu_3$, $\beta_4 = \alpha_4 + \alpha_7\mu_3$. It is clear that learning $\beta$ alone cannot uniquely determine the coefficients $\alpha_i$; this intuition is supported by Theorem 1. To illustrate the main idea of our method, consider the error in the source environment after fitting a linear predictive model $f_s$: $Y - f_s(x_1, x_2)$. Let W be a transformation of this error, $W = (Y - f_s(x_1, x_2))^3 / k_3$, where $k_3 = \mathbb{E}[(X_3 - \mu_3)^3]$ is estimated from the observed $X_3$ samples in the target environment. Fitting W against $(\alpha^T g(x_1, x_2))^3$ and estimating the coefficients $\alpha$, as suggested by (9), enables the estimation of the coefficients $\alpha_3, \alpha_5, \alpha_6, \alpha_7$. Combined with the estimated coefficients $\beta$, we can thus uniquely determine the coefficients of $f_t$ without the need to observe Y in the target environment.
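The following sketch (our own numerical check, with arbitrary coefficients and distributions) verifies Eq. (9): with binary $x_1, x_2$, the cell-wise third moment of the source residuals matches the cube of the partial derivative (10) times $\mathbb{E}[(X_3 - \mu_3)^3]$.

```python
# Numerical check of Eq. (9) for the polynomial example (7). Coefficients and
# distributions are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
a = dict(a1=1.0, a2=-0.5, a3=2.0, a4=0.8, a5=-1.2, a6=0.6, a7=0.3)

n = 1_000_000
x1 = rng.integers(0, 2, n).astype(float)    # binary causes so that conditioning
x2 = rng.integers(0, 2, n).astype(float)    # on (x1, x2) is exact
x3 = rng.exponential(1.0, n)                # skewed cause: E[(X3 - 1)^3] = 2
eps = rng.normal(0.0, 0.3, n)

y = (a["a1"]*x1 + a["a2"]*x2 + a["a3"]*x3 + a["a4"]*x1*x2
     + a["a5"]*x1*x3 + a["a6"]*x2*x3 + a["a7"]*x1*x2*x3 + eps)

# Fit f_s by least squares on g(x1, x2) = [1, x1, x2, x1*x2].
G = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(G, y, rcond=None)
resid = y - G @ beta

k3 = 2.0                                    # E[(X3 - mu3)^3] for Exp(1)
for v1 in (0.0, 1.0):
    for v2 in (0.0, 1.0):
        cell = (x1 == v1) & (x2 == v2)
        emp = np.mean(resid[cell] ** 3)
        deriv = a["a3"] + a["a5"]*v1 + a["a6"]*v2 + a["a7"]*v1*v2
        print(f"x1={v1:.0f}, x2={v2:.0f}: residual 3rd moment {emp:6.2f} "
              f"vs (d phi / d x3)^3 * k3 = {deriv**3 * k3:6.2f}")
```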
Discussion The intuition behind this seemingly surprising result is rather straightforward: $X_3$, though unobserved in the source, is a generating factor of Y. Its information is contained not only in the marginalized mean but also in the residual distribution of the error after fitting a discriminative model.

3.3.4 OUT-OF-VARIABLE LEARNING

This phenomenon extends to more general settings. Theorem 3 shows that the moments of the residuals still provide additional information about the partial derivative of the function $\phi$ with respect to $X_3$ for general nonlinear smooth functions. Appendix C.4 gives its multivariate statement and the proof.

Theorem 3. Consider the problem setup in §3.1 and assume the function $\phi$ is everywhere twice differentiable with respect to $X_3$. Suppose that from the source environment we learn a function $f_s(x_1, x_2) = \mathbb{E}[Y \mid x_1, x_2]$. Using a first-order Taylor approximation of $x_3 \mapsto \phi(x_1, x_2, x_3)$ for fixed $x_1, x_2$, the moments of the residual distribution in the source environment take the form

$$\mathbb{E}[(Y - f_s(x_1, x_2))^n \mid x_1, x_2] = \sum_{k=0}^{n} \binom{n}{k} \left(\frac{\partial \phi}{\partial x_3}(x_1, x_2, \mu_3)\right)^{n-k} \mathbb{E}[(X_3 - \mu_3)^{n-k}]\, \mathbb{E}[\epsilon^{k}]. \quad (11)$$

For n = 3, this reduces to

$$\mathbb{E}[(Y - f_s(x_1, x_2))^3 \mid x_1, x_2] = \left(\frac{\partial \phi}{\partial x_3}(x_1, x_2, \mu_3)\right)^3 \mathbb{E}[(X_3 - \mu_3)^3] + \mathbb{E}[\epsilon^3]. \quad (12)$$

Theorem 3 shows that the moments of the residual distribution include contributions both from the moments of the noise variable and from the propagated effects of variables unique to the target environment. When n = 3, most terms that involve the undesired noise variable disappear.

Corollary 4. For the OOV scenarios described in §3.1, learning from the moments of the error distribution allows exact identification of $\phi$ when $\phi(x_1, x_2, x_3) = \sum_{p,q} c_{p,q}\, h(x_1, x_2)^p\, x_3^q$, where $p, q \in \{0, 1\}$, $c_{p,q} \in \mathbb{R}$, and h can be any function.

To see Corollary 4 in action, recall that when $\phi$ is as in (7), our solution is analytically exact, as shown in (10). Theorem 3 and Corollary 4 demonstrate that in this challenging OOV scenario, where existing transfer learning methods fail to apply (cf. Theorem 1), learning from the residual distribution offers exact identification for a certain class of generating functions.
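To make explicit why the noise terms drop out at n = 3 (a short check under the first-order approximation, the independence of $X_3$ and $\epsilon$, and $\mathbb{E}[\epsilon] = 0$): writing $D := \frac{\partial \phi}{\partial x_3}(x_1, x_2, \mu_3)$, the residual is approximately $D\,(X_3 - \mu_3) + \epsilon$, and

$$\begin{aligned} \mathbb{E}\big[(D(X_3 - \mu_3) + \epsilon)^3\big] &= D^3\, \mathbb{E}[(X_3 - \mu_3)^3] + 3 D^2\, \mathbb{E}[(X_3 - \mu_3)^2]\, \mathbb{E}[\epsilon] + 3 D\, \mathbb{E}[X_3 - \mu_3]\, \mathbb{E}[\epsilon^2] + \mathbb{E}[\epsilon^3] \\ &= D^3\, \mathbb{E}[(X_3 - \mu_3)^3] + \mathbb{E}[\epsilon^3], \end{aligned}$$

since $\mathbb{E}[\epsilon] = 0$ and $\mathbb{E}[X_3 - \mu_3] = 0$; for Gaussian noise, $\mathbb{E}[\epsilon^3] = 0$ as well, leaving only the term that carries the partial derivative.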
Next, we build a practical predictor that utilizes the above theoretical insights and present experimental results to evaluate OOV learning performance. To start with, note that the target predictive function could be Monte Carlo approximated if the true function $\phi$ were known:

$$f_t(x_2, x_3) = \int \phi(x_1, x_2, x_3)\, p(x_1)\, dx_1 \approx \frac{1}{n}\sum_{i=1}^{n} \phi(x_{1,i}, x_2, x_3), \quad \text{where } x_{1,i} \sim p(x_1).$$

Assuming $\phi$ is smooth, a first-order Taylor approximation around $(x_1, x_2, \mu_3)$ lets us rewrite $\phi$ as

$$\phi(x_1, x_2, x_3) = \phi(x_1, x_2, \mu_3) + \frac{\partial \phi}{\partial x_3}(x_1, x_2, \mu_3)\,(x_3 - \mu_3) + O((x_3 - \mu_3)^2). \quad (13)$$

Taking expectations over $X_3$ on both sides of Eq. 13, we see that $f_s(x_1, x_2) \approx \phi(x_1, x_2, \mu_3)$. Theorem 3 states that we can estimate the partial derivative term from the third moment of the error distribution. We thus propose Moment Learn, an OOV estimate $\hat{f}_t$ of the target predictive function:

$$\hat{f}_t(x_2, x_3) = \frac{1}{n}\sum_{i=1}^{n}\Big[ f_s(x_{1,i}, x_2) + h_\theta(x_{1,i}, x_2)\,(x_3 - \mu_3)\Big], \quad \text{with } x_{1,i} \sim p(x_1), \quad (14)$$

where $h_\theta$ is an MLP parameterized by $\theta$, modelling the partial derivative by regressing on the third moment of the residual distribution from the source. Note that the proposed predictor is strictly better than naïvely marginalizing $f_s$ over the shared variables. The proposed estimate also satisfies the marginal consistency condition, as the second term in (14) vanishes when taking expectations. See Algorithm 1 in Appendix D.3 for a detailed procedure.
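Below is a minimal sketch of the Moment-Learn estimate in Eq. (14) (our illustration, not the paper's experimental code). The paper fits an MLP $h_\theta$ to the scaled cubed residuals; here, with binary $x_1, x_2$, the cell-wise conditional mean of the cubed residual stands in for $h_\theta$, and its signed cube root is used as the partial-derivative estimate. The generating function and all distributions are arbitrary illustrative choices.

```python
# Minimal sketch of the Moment-Learn predictor, Eq. (14), with a cell-wise
# stand-in for the MLP h_theta.
import numpy as np

rng = np.random.default_rng(1)

def phi(x1, x2, x3):
    return x1 + 2.0 * x2 * x3 + 1.5 * x3       # linear in x3 (cf. Corollary 4)

# --- Source environment: (X1, X2, Y) jointly observed -----------------------
n = 500_000
x1 = rng.integers(0, 2, n).astype(float)
x2 = rng.integers(0, 2, n).astype(float)
x3_hidden = rng.exponential(1.0, n)             # unobserved cause in the source
y = phi(x1, x2, x3_hidden) + rng.normal(0.0, 0.2, n)

G = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(G, y, rcond=None)    # f_s(x1, x2) = E[Y | x1, x2]
f_s = lambda a, b: beta[0] + beta[1] * a + beta[2] * b + beta[3] * a * b
resid = y - f_s(x1, x2)

# --- Target environment: (X2, X3) observed; estimate mu3 and k3 from its X3 --
x3_target = rng.exponential(1.0, 200_000)
mu3 = x3_target.mean()
k3 = np.mean((x3_target - mu3) ** 3)

# Partial-derivative estimate from the 3rd moment of the source residuals.
W = resid ** 3 / k3
d_hat = {(a, b): np.cbrt(np.mean(W[(x1 == a) & (x2 == b)]))   # stands in for h_theta
         for a in (0.0, 1.0) for b in (0.0, 1.0)}

x1_mc = rng.integers(0, 2, 5_000).astype(float)  # Monte Carlo samples of X1, Eq. (14)

def f_t(x2_val, x3_val):
    terms = [f_s(a, x2_val) + d_hat[(a, x2_val)] * (x3_val - mu3) for a in x1_mc]
    return float(np.mean(terms))

# Oracle: E[Y | x2, x3] = E[X1] + (2 * x2 + 1.5) * x3, with E[X1] = 0.5.
for x2_val, x3_val in [(0.0, 0.5), (1.0, 2.0)]:
    oracle = 0.5 + (2.0 * x2_val + 1.5) * x3_val
    print(f"f_t({x2_val}, {x3_val}) = {f_t(x2_val, x3_val):.2f}   oracle = {oracle:.2f}")
```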