# Learning from Biased Data: A Semi-Parametric Approach

Patrice Bertail*¹, Stephan Clémençon*², Yannick Guyonvarch*², Nathan Noiry*²

*Equal contribution. ¹Université Paris-Nanterre, France. ²Télécom Paris, France. Correspondence to: Patrice Bertail, Stephan Clémençon, Yannick Guyonvarch, Nathan Noiry.

*Proceedings of the 38th International Conference on Machine Learning*, PMLR 139, 2021. Copyright 2021 by the author(s).

**Abstract.** We consider risk minimization problems where the (source) distribution $P_S$ of the training observations $Z_1, \ldots, Z_n$ differs from the (target) distribution $P_T$ involved in the risk that one seeks to minimize. Under the natural assumption that $P_S$ dominates $P_T$, i.e. $P_T \ll P_S$, we develop a semi-parametric framework for the situation where we do not observe any sample from $P_T$, but rather have access to some auxiliary information at the target population scale. More precisely, assuming that the Radon-Nikodym derivative $dP_T/dP_S(z)$ belongs to a parametric class $\{g(z, \alpha) : \alpha \in A\}$ and that some (generalized) moments of $P_T$ are available to the learner, we propose a two-step learning procedure to perform the risk minimization task. We first select $\hat{\alpha}$ so as to match the moment constraints as closely as possible, and then reweight each (biased) training observation $Z_i$ by $g(Z_i, \hat{\alpha})$ in the final Empirical Risk Minimization (ERM) algorithm. We establish an $O_{\mathbb{P}}(1/\sqrt{n})$ generalization bound proving that, remarkably, the solution to the weighted ERM problem thus constructed achieves a learning rate of the same order as that attained in the absence of any sampling bias. Beyond these theoretical guarantees, numerical results providing strong empirical evidence of the relevance of the approach promoted in this article are displayed.

## 1. Introduction

In the classic formulation of predictive learning problems, the goal is to find $\theta$ in a parameter space $\Theta$ with minimum risk $R_P(\theta) = \mathbb{E}_P[\ell(Z, \theta)]$. Here $Z$ is a random variable, taking its values in some measurable space $\mathcal{Z}$, with unknown distribution $P$, and $\ell : \mathcal{Z} \times \Theta \to \mathbb{R}_+$ is a given loss function. In order to solve approximately the risk minimization problem

$$\inf_{\theta \in \Theta} R_P(\theta), \tag{1}$$

one assumes that a training dataset $\mathcal{D}_n = \{Z_1, \ldots, Z_n\}$ composed of $n \geq 1$ independent copies of the generic r.v. $Z$ is available. A natural learning procedure consists in replacing the unknown risk by its empirical version based on the $Z_i$'s and next solving

$$\inf_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \ell(Z_i, \theta). \tag{2}$$

The accuracy of this procedure, referred to as Empirical Risk Minimization (ERM in abbreviated form), is usually assessed by establishing upper confidence bounds on the excess risk of empirical minimizers, that is the difference between the risk of solutions $\hat{\theta}_n$ to (2) and the minimum risk attained over the class $\Theta$, under suitable assumptions on the loss function $\ell$ and the parameter space $\Theta$; see e.g. (Devroye et al., 1996). Such results offer statistical guarantees regarding the generalization capacity of the predictive rule encoded by the learned parameter $\hat{\theta}_n$ when applied to a new/test observation $Z_{\text{test}}$ with distribution $P$. The usual validity framework for ERM crucially relies on the assumption that the distributions of the random variables involved in the training and test/prediction stages are the same. However, this assumption is now highly arguable in a wide variety of situations.
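To fix ideas, here is a minimal illustration of the ERM objective (2) for the squared loss, with a least-squares fit as the empirical risk minimizer; the synthetic dataset below is purely illustrative and not taken from the paper:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic i.i.d. training data D_n = {(X_i, Y_i)} with Z = (X, Y).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(size=500)

# For the squared loss l(z, theta) = (y - theta^T x)^2, solving the
# empirical problem (2) amounts to ordinary least squares.
theta_hat = LinearRegression(fit_intercept=False).fit(X, y)
print(theta_hat.coef_)  # close to (1, -2) when train and test laws coincide
```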
Whereas, in the recent past, data collection was expensive and still performed by means of carefully elaborated experimental designs through surveys and questionnaires, practitioners more and more often have poor control over the information acquisition process in the Big Data era. The massive datasets captured by connected devices (e.g. IoT sensors) or collected via the Internet may be in significant part corrupted, or non-representative of the target population. For instance, as recently hotly discussed and debated (see e.g. (Wang et al., 2019)), certain facial recognition systems may be trained on public databases, possibly very different from the statistical population to which they will be applied when deployed. In other words, the data available for predictive learning may frequently stem from a source distribution $P_S$ that differs from the distribution of interest $P_T$, referred to here as the target distribution. One may refer to (Quiñonero-Candela et al., 2009) for an account of the different types of sampling biases or dataset shifts occurring frequently in practice.

In this paper, we consider risk minimization problems in the presence of selection/sampling bias. These are now the subject of much attention in the literature, see e.g. (Bolukbasi et al., 2016), (Zhao et al., 2017) or (Liu et al., 2016), and can be viewed as specific transfer learning problems, cf. (Pan and Yang, 2010) or (Weiss et al., 2016), insofar as the key idea is to find/approximate a suitable transformation of the source distribution $P_S$ so that relevant information can be transferred from the training dataset to the test population. Most contributions documented in the literature focus on the situation where two samples are observed, one drawn from $P_S$ and another one from $P_T$. In particular, two very popular transfer-learning algorithms have been proposed in this specific case: the Kullback-Leibler Importance Estimation Procedure (KLIEP) in (Sugiyama et al., 2007; 2008) and Kernel Mean Matching (KMM) in (Huang et al., 2007), see also (Gretton et al., 2009). As instances from the target distribution are available to learn a predictive rule, the task performed by these algorithms essentially consists of data augmentation: the reweighted source sample is merely viewed as a tool to increase the size of the target sample.

Here, the focus is on a more challenging problem, corresponding to many practical situations, where statistical learning must be based on a single sample drawn from $P_S$, combined with appropriate assumptions about the relation between the distributions $P_S$ and $P_T$. While it has seldom been considered in transfer learning, see however (Laforgue and Clémençon, 2019; Vogel et al., 2020), this setup has proved useful for various problems in Statistics. It is for instance used in the literature devoted to denoising tasks, where one aims at recovering the distribution of variables that are observed with some corrupting noise. It is generally assumed there that uncorrupted observations are not available, so that the distribution of interest has to be inferred from either one or several samples of noisy observations, based on structural assumptions on the nature of the noise, see e.g. (Chen et al., 2011; Kato and Sasaki, 2019; Kato et al., 2019). In order to overcome the lack of data drawn from the target distribution $P_T$, two crucial assumptions are made in the present paper.
First, the modelling we propose concerns the link function between $P_T$ and $P_S$: we assume that $P_T$ is absolutely continuous w.r.t. $P_S$ and that the Radon-Nikodym derivative $dP_T/dP_S(z)$ belongs to a parametric family $\mathcal{G}$. This connects the framework we develop for learning from biased data to semi-parametric statistics, see e.g. (Gilbert et al., 1999) or (Zhang, 2000): as a matter of fact, even if the elements of the class $\mathcal{G}$ are supposed to be characterized by a parameter $\alpha$ in $A \subset \mathbb{R}^p$ with $p \geq 1$, no parametric assumption is made on the class $\Theta$ encoding the candidate decision rules. While tractable, our approach offers great flexibility: as will be shown in the subsequent analysis, misspecified models $\mathcal{G}$ can be handled as well. Second, we assume that we have access to several moments under $P_T$, or estimators thereof. This situation is rather common in practice, in particular as a result of privacy constraints. For instance, in countries where national censuses are conducted, only socio-economic information at the population level is made available to the public. The information furnished by Internet providers is another relevant example: those providers allow users to freely download summary statistics on queries made on the web worldwide, but individual-level search data remains confidential.

Using both assumptions, the learning methodology we propose is close in spirit to the so-called calibration method in survey sampling (Deville and Särndal, 1992; Deville, 2000; Guggemos and Tillé, 2010) or its analogue in the econometrics literature (Imbens and Lancaster, 1994). It consists in solving a reweighted version of the ERM problem based on the $Z_i$'s, using some auxiliary information, namely known moments of $P_T$. It is implemented in two steps. First, the link function $dP_T/dP_S(z)$ is learned via a generalized-method-of-moments approach, see e.g. (Hansen, 1982): each (biased) training observation $Z_i$ drawn from $P_S$ can be reweighted by $g(Z_i, \alpha)$ using functions $g(\cdot, \alpha)$ in $\mathcal{G}$, and one searches for a minimizer over $\mathcal{G}$ of the distance between the reweighted empirical moments and the known moments of $P_T$. We establish nonasymptotic concentration guarantees for this estimator, see Proposition 1. Second, the learned link function is used to construct a reweighted version of the ERM problem (2). Considering a minimizer $\hat{\theta}$ of the reweighted empirical risk, we finally prove a nonasymptotic generalization bound which controls the gap between $R_{P_T}(\hat{\theta})$ and $\inf_{\theta \in \Theta} R_{P_T}(\theta)$, see Theorem 1.

The paper is organized as follows. Section 2 presents the theoretical framework for statistical learning based on biased training data considered throughout the article. The algorithmic approach we propose is described in Section 3, together with a rate bound analysis proving its accuracy. For illustration purposes, experimental results are displayed in Section 4 and some concluding remarks are collected in Section 5. The proofs of the main results are sketched in the Appendix section, while additional technical details are deferred to the Supplementary Material.

## 2. Theoretical Framework

In this section we present at length the framework we consider for statistical learning in the presence of sampling bias and introduce the main notations of the paper.
We also describe the semi-parametric approach we develop in the subsequent analysis and state the assumptions required to extend Empirical Risk Minimization with statistical guarantees in the form of nonasymptotic generalization bounds.

As described in the Introduction, $P_S$ and $P_T$ are two probability measures on a measurable space $\mathcal{Z}$, referred to as the source and target distributions respectively. Here and throughout, the target distribution is assumed to be absolutely continuous w.r.t. the source distribution, $\|h\|_\infty$ denotes the sup norm of any bounded function $h : \mathcal{Z} \to \mathbb{R}$, the space $\mathbb{R}^q$, $q \geq 1$, is equipped with the usual Euclidean norm $\|\cdot\|$, the unit sphere is denoted by $\mathbb{S}^{q-1} = \{u \in \mathbb{R}^q : \|u\| = 1\}$ and the Dirac mass at any point $x$ by $\delta_x$.

**Assumption 1 (Domination).** The probability measure $P_T$ is absolutely continuous with respect to the probability measure $P_S$. The corresponding Radon-Nikodym derivative is denoted by

$$w : z \in \mathcal{Z} \mapsto \frac{dP_T}{dP_S}(z). \tag{3}$$

**Remark 1 (Comparison with existing works).** Having in mind the generic decomposition $Z = (Y, X)$, with $Y$ a label we would like to infer from a feature vector $X$, our framework encompasses the so-called covariate shift model where $P(y \mid x)$ is fixed and $P(x)$ is allowed to vary. It also covers the question tackled in (Vogel et al., 2020): the authors consider a classification problem where only the class probabilities (of labels) can differ across source and target, and the probability of observing a positive label under $P_T$ is assumed to be known. In this very specific case, the ratio $dP_T^Y/dP_S^Y(y)$ is partly known and the problem boils down to estimating the class probabilities under the source distribution. Comparing our work to (Laforgue and Clémençon, 2019) is also instructive. Their setup differs from ours: instead of having several biased samples at our disposal with known biasing functions, we assume knowledge of several moments under the target and stipulate a parametric model for the transfer/likelihood function. The framework we develop is better adapted to certain practical situations, when learning from socio-demographic/economic data for instance, as discussed in Remark 2.

As formulated in the Introduction section, the goal pursued is to approximate a solution $\theta^*$ in $\Theta$ to the minimization problem (1) with $P = P_T$, based on i.i.d. training data $Z_1, \ldots, Z_n$ drawn from the distribution $P_S$, a priori different from $P_T$. Notice that under Assumption 1, the risk minimization problem (1) rewrites as

$$\inf_{\theta \in \Theta} \mathbb{E}_{P_S}[w(Z)\ell(Z, \theta)]. \tag{4}$$

In the (unrealistic) situation where the Radon-Nikodym derivative $w$ is known, one can solve the following weighted Empirical Risk Minimization problem:

$$\inf_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n w(Z_i)\ell(Z_i, \theta). \tag{5}$$

By means of a direct application of the contraction principle, a generalization bound for the excess of $P_T$-risk of solutions to (5) can be easily established, showing that the same learning rate as that attained in the case where $P_S = P_T$, i.e. $O_{\mathbb{P}}(1/\sqrt{n})$, is achieved; see Lemma 1 in (Vogel et al., 2020).

**Transfer learning with marginal constraints.** In practice, the weight function $w(z)$ is unknown. The approach we develop here is of plug-in type: an estimator $\hat{w}$ of $w$ based on some auxiliary information is first constructed, and a version of (5) where $w$ is replaced with $\hat{w}$ is next solved. To make such a strategy tractable, we assume that the weight function $w$ belongs to a parametric class.
**Assumption 2 (Parametric link function).** Let $A \subset \mathbb{R}^p$ be Borel-measurable and let $g : \mathcal{Z} \times A \to \mathbb{R}_+$ be a measurable function such that $\|g\|_\infty = \sup_{\alpha \in A} \|g(\cdot, \alpha)\|_\infty < +\infty$ and such that, for every $z \in \mathcal{Z}$, $\sup_{(\alpha_1, \alpha_2) \in A \times A} |g(z, \alpha_1) - g(z, \alpha_2)| / \|\alpha_1 - \alpha_2\| \leq L$ for some $L < +\infty$. The function $w$ belongs to the class $\mathcal{G} := \{g(\cdot, \alpha) : \alpha \in A\}$.

Among the numerous parametric models that can be considered in practice, the class below shall serve as a running example in this section. To the best of our knowledge, the class presented in Example 1 has not been investigated for estimating Radon-Nikodym derivatives and is thus a nice complement to existing approaches, which model $w$ as the exponential of some (semi-)parametric function, see e.g. (Gilbert et al., 1999; Zhang, 2000).

**Example 1 (Quadratic parametrization).** Let $z \in \mathcal{Z} \mapsto W(z)$ be a measurable function mapping $\mathcal{Z}$ to the set of $p \times p$ symmetric positive matrices. We define:

$$\mathcal{G}_W := \left\{ g(\cdot, \alpha) = \alpha^T W(\cdot) \alpha : \alpha \in \mathbb{R}^p \right\}.$$

The map $z \mapsto W(z)$ can be chosen based on some prior information about the form of the true Radon-Nikodym derivative $w$. As a simple example, we could pick $W(z)$ as a diagonal matrix with diagonal entries equal to positive-valued functions of $z$, which capture different characteristics of $w$, e.g. lightness/heaviness of the tails, occurrence of multiple local extrema, possible irregularities.

In the case where Assumption 2 is satisfied, there exists at least one element $\alpha^* \in A$ such that $w(\cdot) = g(\cdot, \alpha^*)$. Observe that, for this $\alpha^*$, the relation $\mathbb{E}_{P_S}[g(Z, \alpha^*)] = 1$ necessarily holds true. However, this relation is far from fully characterizing $\alpha^*$ and $w$ in general, in the sense that many elements in $A$, and hence many elements in $\mathcal{G}$, may satisfy it. This issue is all the more acute as the class $\mathcal{G}$ grows richer. Having access to extra relations linking $P_S$ and $P_T$ is therefore crucial to build a criterion able to tell how plausible distinct values in $A$ are. To design such a criterion, we assume that some extra features of the target distribution $P_T$ are accessible. More precisely, we suppose that the quantities

$$M_l := \mathbb{E}_{P_T}[m_l(Z)], \quad l = 1, \ldots, d, \tag{6}$$

are known, as well as the supposedly $P_T$-integrable functions $m_l : \mathcal{Z} \to \mathbb{R}$, $l \in \{1, \ldots, d\}$. If $w(\cdot) = g(\cdot, \alpha^*)$, the relation $\mathbb{E}_{P_S}[g(Z, \alpha^*) m_l(Z)] = M_l$ is satisfied for every $1 \leq l \leq d$.

**Remark 2 (Macro-information about the target population).** The assumption that quantities of the form $(M_1, \ldots, M_d)$, or estimators thereof, can be put at the learner's disposal is realistic in many practical situations. In the open data era, national statistical agencies increasingly enable access to a wealth of macro-level summary statistics on numerous topics (e.g. average wage, household composition, health status, life expectancy). For privacy reasons, the micro-level data behind those summary statistics is kept secret by national agencies. For instance, the InFuse portal of the Office for National Statistics in the United Kingdom, or the QuickFacts interface of the US Census Bureau, provide macro-information at fairly disaggregated geographical levels (county level and below in the US). In many fields, exhaustive data collection cannot be conducted at a very high frequency due to the substantial costs it induces. On the other hand, cheap internet surveys allow the collection of a huge quantity of data, but there is generally little or even no control over the responding population. This highlights the importance of designing statistical learning methods capable of exploiting biased microeconomic surveys combined with the extra information brought by macro-level surveys.
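To make the preceding discussion concrete, here is a minimal sketch combining the quadratic link of Example 1 with the moment relations (6); the diagonal entries of $W(z)$, the toy data and the moment function below are illustrative placeholders, not choices made in the paper:

```python
import numpy as np

def g(z, alpha):
    """Quadratic link of Example 1 with a diagonal W(z) (illustrative choice):
    g(z, alpha) = alpha^T Diag(z_1^2 + 1, ..., z_p^2 + 1) alpha."""
    z = np.asarray(z, dtype=float)
    W_diag = z ** 2 + 1.0  # positive entries, so W(z) is symmetric positive
    return float(np.sum(W_diag * np.asarray(alpha, dtype=float) ** 2))

# Under the model, a plausible alpha reweights source moments onto the known
# target moments: E_{P_S}[g(Z, alpha) m_l(Z)] = M_l for l = 1, ..., d.
rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 2))       # toy source sample
alpha = np.array([0.5, 0.5])
m1 = lambda z: z[0]                  # example moment function m_1(z) = z_1
lhs = np.mean([g(z, alpha) * m1(z) for z in Z])
print(lhs)                           # to be matched against a known M_1
```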
A single theoretical index incorporating all the accessible information on $P_T$ can be built:

$$\Psi_* : \alpha \mapsto \Psi_*(\alpha) = \left\| \mathbb{E}_{P_S}[g(Z, \alpha) m(Z)] - M \right\|, \tag{7}$$

with $m(Z) = (1, m_1(Z), \ldots, m_d(Z))^T$ and $M = (1, M_1, \ldots, M_d)^T$. The transfer criterion $\Psi_*$ is always nonnegative and equal to zero if and only if $\alpha$ satisfies all the constraints aforementioned. When $w \in \mathcal{G}$, there exists at least one element $\alpha^* \in A$ such that $\Psi_*(\alpha^*) = 0$. Attention should be paid to the fact that minimization of the criterion is relevant even in the misspecified case where $w \notin \mathcal{G}$. In this context, it may happen that there is no $\alpha \in A$ such that all the constraints are satisfied. However, the closer $\Psi_*(\alpha)$ is to zero, the more plausible $g(\cdot, \alpha)$. As discussed at length in the Supplementary Material, the analysis carried out in Section 3 can be extended to this situation, at the price of an additional approximation term in the generalization bound. The identifiability assumption below rules out some scenarios that are difficult to handle from a theoretical point of view.

**Assumption 3 (Uniqueness).** There exists a unique $\alpha^* \in A$ such that

$$\Psi_*(\alpha^*) = \min_{\alpha \in A} \Psi_*(\alpha).$$

Combined with the last point of Assumption 2 (i.e. $w \in \mathcal{G}$), Assumption 3 ensures that $w$ is the only element in $\mathcal{G}$ that satisfies all the moment constraints on which $\Psi_*$ is based. We say that $w$ is well-identified by the moment constraints considered. We maintain this hypothesis throughout the paper to simplify the analysis while conveying the main ideas. In the Supplementary Material, we study the more general case where several minimizers of $\Psi_*$ may exist and discuss the implications in terms of identification of $w$. Existence of a minimizer of $\Psi_*$ can be checked in many situations. This is the case for instance when $\Psi_*$ is continuous on $A$ and either $A$ is compact, or $A = \mathbb{R}^p$ and $\Psi_*$ is coercive (i.e. $\lim_{\|\alpha\| \to +\infty} \Psi_*(\alpha) = +\infty$).

**Example 2 (Quadratic parametrization, bis).** We denote by $\|\cdot\|$ the operator norm on matrices, that is $\|W\| = \sup_{\|x\|=1} \|Wx\|$. When $\mathcal{G} = \mathcal{G}_W$, the map $\Psi_*$ is continuous and attains its minimum on $A$ as long as $\mathbb{E}_{P_S}[\|W(Z)\|] < +\infty$ and $\mathbb{E}_{P_S}[\|W(Z)\| \, |m_l(Z)|] < +\infty$ for every $1 \leq l \leq d$. This map is also coercive since $\mathbb{E}_{P_S}[W(Z)]$ is symmetric and positive definite (see the Supplementary Material for a proof). Unfortunately, $\Psi_*(\alpha_1) = \Psi_*(\alpha_2)$ does not imply $\alpha_1 = \alpha_2$, so that uniqueness of the minimizer of $\Psi_*$ does not hold in general. We can however uniquely recover $g(\cdot, \alpha^*)$ under additional assumptions, for instance when $W(z)$ is diagonal for every $z \in \mathcal{Z}$ and $d = p - 1$. Since $w \in \mathcal{G}_W$, it is enough to solve $\Psi_*(\alpha) = 0$, which is equivalent to $\mathbb{E}_{P_S}[\alpha^T W(Z) \alpha \, m(Z)] = M$. Letting $\tilde{\alpha} := (\alpha_1^2, \ldots, \alpha_p^2)^T$, this can be rewritten as $\Gamma \tilde{\alpha} = M$, where $\Gamma$ is a $p \times p$ matrix whose first row equals $\mathbb{E}_{P_S}[\mathrm{diag}(W(Z))]^T$ and whose second to last rows equal $(\mathbb{E}_{P_S}[m_l(Z) \, \mathrm{diag}(W(Z))]^T)_{l=1}^{p-1}$. We conclude that $\tilde{\alpha} = \Gamma^{-1} M$ as long as $\Gamma$ is invertible. We can set $\alpha^* = (\sqrt{\tilde{\alpha}_1}, \ldots, \sqrt{\tilde{\alpha}_p})^T$ and remark that we can uniquely identify $g(\cdot, \alpha^*) = \alpha^{*T} W(\cdot) \alpha^*$.
## 3. Semi-Parametric Transfer Learning

In this section, the algorithmic approach we propose to solve the learning task described in the previous section is detailed, and a rate bound analysis is next carried out, providing generalization guarantees for the predictive rules built this way.

### 3.1. A Two-Stage Learning Approach

Here we present the extension of the ERM methodology we promote for statistical learning based on a biased training dataset. It is assumed that auxiliary information about the target population is available to the learner in the form of a vector of $P_T$-integrals of known functions $m_l(z)$. The learning procedure, of plug-in type, is implemented in two steps, summarized in Figure 1. One first minimizes over $A$ a statistical counterpart of the transfer criterion $\Psi_*(\alpha)$, obtained by replacing the source distribution $P_S$ in (7) with its empirical version based on the $Z_i$'s, namely

$$\Psi_n(\alpha) = \left\| \frac{1}{n} \sum_{i=1}^n g(Z_i, \alpha) m(Z_i) - M \right\|, \quad \alpha \in A. \tag{8}$$

The minimizer $\hat{\alpha}$ thus obtained is next used to form a reweighted version of the ERM problem (Rw-ERM in abbreviated form), where each observation $Z_i$ is weighted by $g(Z_i, \hat{\alpha})$. This can be seen as a surrogate for (5). Attention should be paid to the fact that the link function $g(\cdot, \hat{\alpha})$ computed in the first step of the procedure sketched above can be reused for additional learning tasks, involving other decision spaces and/or loss functions.

**Remark 3 (Ease of implementation).** Popular implementations of ERM-like learning procedures such as scikit-learn (Pedregosa et al., 2018) support a weight option which allows one to replace the empirical risk $(1/n) \sum_{i=1}^n \ell(Z_i, \theta)$ with $(1/n) \sum_{i=1}^n \omega_i \ell(Z_i, \theta)$ when specifying a list of weights $(\omega_i)_{i=1}^n$. In practice, minimizing the objective function (10) for a given ERM task can thus be done by feeding the weight option of existing algorithms with the weights $(g(Z_i, \hat{\alpha}))_{i=1}^n$ stemming from (9). At a more theoretical level, solving (10) amounts to replacing the canonical empirical distribution $(1/n) \sum_{i=1}^n \delta_{Z_i}$ in the ERM procedure with the measure $(1/n) \sum_{i=1}^n g(Z_i, \hat{\alpha}) \delta_{Z_i}$. We may also renormalize the weights so that they sum to one, in order to obtain a true probability distribution.

> **Weighted ERM through Semi-Parametric Transfer**
>
> **Input:** Biased training dataset $\{Z_1, \ldots, Z_n\}$, function $m(z)$, parameter $M$.
>
> 1. **Transfer criterion minimization.** Estimate the $\alpha$-parameter:
>    $$\hat{\alpha} \in \arg\min_{\alpha \in A} \Psi_n(\alpha). \tag{9}$$
> 2. **Weighted ERM.** Minimize the resulting reweighted empirical risk:
>    $$\hat{\theta} \in \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n g(Z_i, \hat{\alpha}) \ell(Z_i, \theta). \tag{10}$$
>
> **Output:** $(\hat{\alpha}, \hat{\theta})$
>
> Figure 1. The Rw-ERM Algorithm

In the subsequent analysis, we focus on the statistical performance of the predictive rules empirically defined by Rw-ERM. The design of practical algorithms (e.g. based on gradient descent) for approximately computing minimizers of the transfer criterion and of the reweighted empirical risk is beyond the scope of the subsequent analysis. A detailed account of the optimization techniques used in the experiments displayed in Section 4 is however given in the Supplementary Material.
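Following Remark 3, here is a compact end-to-end sketch of the two-step procedure of Figure 1. The helper names `Psi_n` and `rw_erm` are ours; the paper's experiments use gradient descent with bootstrap restarts for step (9), whereas Nelder-Mead is used below purely for simplicity:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LinearRegression

def Psi_n(alpha, Z, g, m, M):
    """Empirical transfer criterion (8): ||(1/n) sum_i g(Z_i, a) m(Z_i) - M||."""
    vals = np.array([g(z, alpha) * np.asarray(m(z)) for z in Z])  # (n, d + 1)
    return np.linalg.norm(vals.mean(axis=0) - np.asarray(M))

def rw_erm(Z, X, y, g, m, M, alpha0, estimator=None):
    """Two-step Rw-ERM: (9) transfer criterion minimization, then (10)
    weighted ERM via scikit-learn's sample_weight option (Remark 3)."""
    res = minimize(Psi_n, alpha0, args=(Z, g, m, M), method="Nelder-Mead")
    alpha_hat = res.x
    weights = np.array([g(z, alpha_hat) for z in Z])
    est = estimator if estimator is not None else LinearRegression()
    est.fit(X, y, sample_weight=weights)
    return alpha_hat, est
```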
### 3.2. Main Results - Generalization Bounds

Just like for the implementation, two stages are required to establish generalization guarantees for the Rw-ERM methodology. As a first go, we show that the parameter $\hat{\alpha}$ produced in the first step of the algorithm is close to the optimal value $\alpha^*$. Additional technical assumptions are required for this purpose. To obtain a concentration inequality on $\hat{\alpha}$, it is first crucial to describe the richness of the class of functions $\{g(\cdot, \alpha) m(\cdot) : \alpha \in A\}$. For this purpose, we consider the Rademacher complexity defined by

$$\mathcal{E}_n(A) := \mathbb{E}_\sigma \left[ \sup_{\alpha \in A} \left\| \frac{1}{n} \sum_{i=1}^n \sigma_i g(Z_i, \alpha) m(Z_i) \right\| \right],$$

where $\sigma_1, \ldots, \sigma_n$ are independent Rademacher variables, independent from the $Z_i$'s.

**Assumption 4.** There exist two constants $K_1, K_2 > 0$ such that $\sup_{1 \leq l \leq d+1} |g(z, \alpha) m_l(z)| \leq K_1$ and $\mathbb{E}_{P_S^n}[\mathcal{E}_n(A)] \leq K_2/\sqrt{n}$.

**Assumption 5.** There exist constants $\varepsilon, R, c > 0$ such that

(i) $\forall \alpha \in A$, $\|\alpha - \alpha^*\| > R \Rightarrow \Psi_*(\alpha) > \varepsilon$;

(ii) $\forall (v, t) \in \mathbb{S}^{p-1} \times [-R, R]$, $\Psi_*(\alpha^* + tv) \geq c|t|$.

**Remark 4.** Let us comment on the above assumptions. Assumption 4 is introduced to control the distance between $\Psi_n$ and $\Psi_*$ uniformly over $A$ and with large probability. Its first requirement allows us to resort to the Bousquet-Talagrand inequality (Boucheron et al., 2013, Theorem 12.5), and its second requirement is a high-level restriction on the complexity of our model. Many concrete examples of functional classes satisfy this second requirement. We illustrate this in the Supplementary Material, where we give simple primitive conditions on $A$ and $g$ ensuring $\mathbb{E}_{P_S^n}[\mathcal{E}_n(A)] \leq K_2/\sqrt{n}$. Assumption 5.(i) implies that the minimum of the function $\Psi_*$ (which is zero) cannot be arbitrarily approached by a sequence of parameters $(\alpha_i)_{i \geq 1}$ that does not converge to $\alpha^*$. Assumption 5.(ii) rules out situations where $\Psi_*$ is almost flat around its global minimizer. In the Supplementary Material, we give simple sufficient conditions in terms of the Hessian of $\Psi_*^2$.

Under the above assumptions, the functions $\Psi_n$ and $\Psi_*$ reach their minimum at roughly the same point, up to a term of order $1/\sqrt{n}$, with high probability.

**Proposition 1.** Let $\hat{\alpha} \in \arg\min_{\alpha \in A} \Psi_n(\alpha)$. Under Assumptions 1 to 5, for every $\delta \in (0, 1)$ there exist $C(\delta) < +\infty$ and $n_{\delta,\varepsilon}$ such that for every $n \geq n_{\delta,\varepsilon}$,

$$\mathbb{P}\left( \|\hat{\alpha} - \alpha^*\| > \frac{2C(\delta)}{c\sqrt{n}} \right) \leq \delta. \tag{11}$$

The constant $C(\delta)$ depends only on $\delta$, $K_1$, $K_2$ and $d$.

Proposition 1 provides theoretical guarantees on the first step (9) of our algorithm. The constants involved in its formulation are described in the Technical Proofs and the Supplementary Material. More precisely, the bound (11) controls the level of error incurred when replacing $\alpha^*$ with $\hat{\alpha}$ and is instrumental in deriving our main result, namely Theorem 1. Proposition 1 may be of independent interest, as it provides insight into the non-asymptotic behavior of a class of Generalized Method of Moments (GMM) problems. In our setting, the GMM criterion is the functional $\Psi_n$ we seek to minimize.

With Proposition 1 in hand, we are in a position to state a theoretical result that guarantees the generalization capacity of the minimizer of the weighted ERM problem (10). For this, we define the Rademacher complexity, this time associated to the class $\{\ell(\cdot, \theta) : \theta \in \Theta\}$:

$$\mathcal{E}_n(\Theta) := \mathbb{E}_\sigma \left[ \sup_{\theta \in \Theta} \left| \frac{1}{n} \sum_{i=1}^n \sigma_i \ell(Z_i, \theta) \right| \right].$$

We now state our main theorem, which ensures that the risk of the solution to (10) is close to the optimal risk with high probability.

**Theorem 1.** Suppose that $\|\ell\|_\infty := \sup_{\theta \in \Theta} \|\ell(\cdot, \theta)\|_\infty < +\infty$ and that Assumptions 1 to 5 hold. Let $\delta \in (0, 1)$. Then, there exist $C(\delta)$ and $n_{\delta,\varepsilon}$ (defined in Proposition 1) such that for every $n \geq n_{\delta,\varepsilon}$, with probability at least $1 - \delta$,

$$R_{P_T}(\hat{\theta}) - \inf_{\theta \in \Theta} R_{P_T}(\theta) \leq \frac{4 L \|\ell\|_\infty C(\delta/2)}{c\sqrt{n}} + 4 \|g\|_\infty \, \mathbb{E}_{P_S^n}[\mathcal{E}_n(\Theta)] + \frac{\|g\|_\infty \|\ell\|_\infty}{\sqrt{n}} \left( 1 + 2\sqrt{\ln(2/\delta)} \right).$$

This theorem shows that, remarkably, the same learning rate as that attained if training observations sampled from the target data generating process were available is achieved by the Rw-ERM method. The generalization bound can be decomposed into two terms. The first one, which depends on $C(\delta/2)$, stems from Proposition 1 and quantifies the error between $g(\cdot, \hat{\alpha})$ and $w(\cdot)$, while the second one comes from a stochastic control of (5). For completeness, notice finally that, under standard regularity assumptions on the class $\{\ell(\cdot, \theta) : \theta \in \Theta\}$, the complexity term $\mathbb{E}_{P_S^n}[\mathcal{E}_n(\Theta)]$ is of order $O(1/\sqrt{n})$.

## 4. Numerical Experiments

In this section, we present two numerical experiments which complement the previous theoretical analysis.
We start with a simulated dataset where we design both the source and target distributions. We then turn to a more realistic framework: we use the Life Expectancy Dataset and create some distributional shifts on the observations.

**Artificial data.** The simulation scheme is as follows. We consider a regression model with features $X^S = (X_1^S, X_2^S)^T$ and outcome $Y^S$. To form the feature vector, we independently draw $X_1^S \sim \mathcal{N}(0, 1)$ and $X_2^S \sim \Gamma(2, 1)$. The outcome variable is generated as $Y^S = 0.5(c_1 X_1^S + c_2 X_2^S + c_3 X_1^S X_2^S) + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, 1)$ is independent of $(X_1^S, X_2^S)$. The target distribution is then defined through the following Radon-Nikodym derivative:

$$w(x_1, x_2, y) = \frac{dP_T}{dP_S}(x_1, x_2, y) = \alpha^{*T} W(x_1, x_2, y) \alpha^*,$$

where $\alpha^* = (1, 1, 2)^T$ and $W(x_1, x_2, y) = \mathrm{Diag}(x_1^2, 4x_2^2, 2y^2)$.

*Figure 2. Histograms of the source (blue) and target (orange) marginals of the first feature, the second feature and the outcome ($n_{\text{obs}} = 100{,}000$).*

Notice that this seemingly simple model already displays striking differences between the source and the target distributions. As depicted in Figure 2, the unimodal distributions of $X_1^S$ and $Y^S$ turn into bimodal densities, while the density of the second feature is shifted to the right. Starting from a sample $(Z_i^S)_{i=1}^{n_{\text{obs}}}$ of size $n_{\text{obs}}$ drawn from $P_S$, we seek to learn a regressor which generalizes well under $P_T$ by implementing our algorithm given by (9) and (10). We assume we know two marginal moments under $P_T$, namely $\mathbb{E}_{P_T}[X_1]$ and $\mathbb{E}_{P_T}[X_2]$ (these two quantities can be computed from the data generating process described above). To estimate $\alpha^*$, we implement a gradient descent algorithm to minimize $\Psi_{n_{\text{obs}}}$. To avoid getting trapped in potential local minima, we rerun the descent algorithm $n_{\text{boot}}$ times using a bootstrapping rationale presented in the Supplementary Material. Among the sequence $(\alpha^{(b)})_{b=1}^{n_{\text{boot}}}$ thus constructed, we select $\arg\min_{\alpha \in \{\alpha^{(b)}\}_{b=1}^{n_{\text{boot}}}} \Psi_{n_{\text{obs}}}(\alpha)$ as our final estimator $\hat{\alpha}$. In the last step, we train several regression-type algorithms (OLS, SVR, RF) on $(Z_i)_{i=1}^{n_{\text{obs}}}$ with weights $(g(Z_i, \hat{\alpha}))_{i=1}^{n_{\text{obs}}}$. In parallel, we train those regression-type algorithms on the same training sample, but without weights, and on another sample of size $n_{\text{obs}}$ drawn from $P_T$. We finally evaluate the performance of these three approaches through their MSE score computed on $n_{\text{test}}$ new observations drawn from $P_T$. We repeat the whole procedure $n_{\text{rep}}$ times. Table 1 presents the mean and standard error of the MSE of the algorithms under consideration for the following choices of parameters: $n_{\text{obs}} = 10{,}000$, $n_{\text{test}} = 500$, $n_{\text{rep}} = 100$ and $n_{\text{boot}} = 100$.

Table 1. Performance evaluation on simulated data using MSE.

| Algorithm | Rw-ERM($P_S$) | ERM($P_T$) | ERM($P_S$) |
|-----------|---------------|------------|------------|
| OLS       | 3.8 ± 0.4     | 3.8 ± 0.4  | 6.3 ± 0.7  |
| SVR       | 1.5 ± 0.5     | 1.2 ± 0.3  | 2.8 ± 0.8  |
| RF        | 1.7 ± 0.2     | 1.6 ± 0.2  | 2.5 ± 0.4  |

No matter the ERM task carried out (i.e. OLS, SVR or RF), we remark that the predictive accuracy of our new procedure (Rw-ERM($P_S$)) is almost as good as that of an ERM trained on unbiased data (ERM($P_T$)). On the other hand, a regressor built on $P_S$ without reweighting (ERM($P_S$)) does not generalize well under $P_T$.
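A minimal sketch of this simulation scheme follows. The coefficients `c1`, `c2`, `c3` are left unspecified in the text, so the values below are placeholders; target draws are produced by importance resampling with probabilities proportional to the stated derivative $w$, a standard device that the paper does not detail:

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = 10_000
c1, c2, c3 = 1.0, 1.0, 1.0  # placeholders: the paper does not report c_1, c_2, c_3

# Source sample: X1 ~ N(0, 1), X2 ~ Gamma(2, 1), Y = 0.5(c1 X1 + c2 X2 + c3 X1 X2) + eps.
x1 = rng.normal(0.0, 1.0, n_obs)
x2 = rng.gamma(2.0, 1.0, n_obs)
y = 0.5 * (c1 * x1 + c2 * x2 + c3 * x1 * x2) + rng.normal(0.0, 1.0, n_obs)

# Stated Radon-Nikodym derivative w(z) = alpha*^T Diag(x1^2, 4 x2^2, 2 y^2) alpha*
# with alpha* = (1, 1, 2).
a = np.array([1.0, 1.0, 2.0])
w = a[0] ** 2 * x1 ** 2 + a[1] ** 2 * 4.0 * x2 ** 2 + a[2] ** 2 * 2.0 * y ** 2

# Approximate target sample via importance resampling proportional to w.
idx = rng.choice(n_obs, size=n_obs, p=w / w.sum())
x1_t, x2_t, y_t = x1[idx], x2[idx], y[idx]
```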
**Experimental results on the Life Expectancy Dataset.** Let us now consider the slightly more realistic situation where we do not have access to the weight function $g(z, \alpha^*)$. We use the Life Expectancy Dataset and only keep the Adult Mortality Rate ($x_1$) and Alcohol Consumption ($x_2$) features in order to predict the Life Expectancy ($y$) output. We drop the observations containing a missing value and normalize the resulting sample (of size 2,735), whose empirical distribution serves as the target. The data are divided into two groups: $G_1$, containing 90% of the data ($n_{G_1} = 2{,}462$), and $G_2$, containing the rest ($n_{G_2} = 273$). $G_2$ corresponds to the test set. The source sample is artificially built from $G_1$ by means of a stratified sampling procedure. We first partition $G_1$ thanks to a grid on the two-dimensional space spanned by the Adult Mortality Rate and Life Expectancy variables, and we assign a probability weight $p_{ij}$ to each box $B_{ij}$. Then, a source observation is created by selecting a box $B_{ij}$ with probability $p_{ij}$ and drawing uniformly at random an element of $G_1 \cap B_{ij}$; a sketch of this construction is given after Table 2 below. Figure 3 provides a heatmap representation of this procedure: each box $B_{ij}$ is associated with its probability of being sampled in the source representation (left-hand graph), and with the true proportion of observations it contains in the target representation (right-hand graph).

*Figure 3. Heatmap representation of the stratified sampling procedure (axes: Adult Mortality Rate vs. Life Expectancy).*

Using this stratified sampling process, we build a sample of $n_{G_1}$ source observations. We then run our two-step procedure on this sample, using the parametric class $\{g(z, \alpha) = \alpha^T \mathrm{Diag}(x_1^2, y^2) \alpha : \alpha \in \mathbb{R}^2\}$ for the first step, and the OLS and SVR algorithms for the second step. We compare the MSE performance of our approach with the MSE performances of the same machine learning algorithms, respectively trained without weights on the same source sample and on the initial sample $G_1$, whose empirical law is roughly equal to the target distribution. The corresponding scores are displayed in Table 2.

Table 2. Performance evaluation on real data using MSE.

| Algorithm | Rw-ERM($P_S$) | ERM($P_T$) | ERM($P_S$) |
|-----------|---------------|-------------|-------------|
| OLS       | 0.50 ± 0.08   | 0.45 ± 0.07 | 0.71 ± 0.09 |
| SVR(0.01) | 0.71 ± 0.19   | 0.40 ± 0.07 | 0.83 ± 0.11 |
| SVR(0.1)  | 0.39 ± 0.08   | 0.35 ± 0.07 | 0.50 ± 0.09 |
| SVR(1)    | 0.39 ± 0.08   | 0.34 ± 0.07 | 0.37 ± 0.07 |

In the OLS case, our approach appears efficient: the MSE is very close to the one obtained when training directly on the target, whereas unweighted training on the source performs poorly. The SVR algorithm is run for three different values of the parameter $C$ (0.01, 0.1 and 1), corresponding to an increasing amount of complexity in the model learned during the second step of our algorithm. In contrast, the complexity of the first step corresponds to the quality of the chosen parametric class and remains constant. This could explain why the performance of our method draws nearer to that of the naive ERM($P_S$) as $C$ becomes larger. We conjecture that a solution to alleviate this issue would be to increase the complexity of the first step, by choosing a higher-dimensional parametric class for instance.
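As announced above, here is a minimal sketch of the stratified source construction; the grid resolution, the box probabilities and the helper name `biased_resample` are our own assumptions, since the paper does not report the exact grid or the $p_{ij}$ values:

```python
import numpy as np

def biased_resample(U, box_probs, n_out, rng):
    """Build a biased source sample from G1 by stratified sampling:
    pick a box B_ij with probability p_ij, then draw uniformly in G1 ∩ B_ij.

    U         : (n, 2) array of the two stratification variables
                (Adult Mortality Rate, Life Expectancy) over G1
    box_probs : (k, k) array of box probabilities p_ij (renormalized below)
    """
    k = box_probs.shape[0]
    # Regular grid over the range of each variable.
    edges0 = np.linspace(U[:, 0].min(), U[:, 0].max(), k + 1)
    edges1 = np.linspace(U[:, 1].min(), U[:, 1].max(), k + 1)
    i = np.clip(np.digitize(U[:, 0], edges0) - 1, 0, k - 1)
    j = np.clip(np.digitize(U[:, 1], edges1) - 1, 0, k - 1)
    box_of = i * k + j                       # flat box index of each observation
    p = box_probs.ravel() / box_probs.sum()
    out = []
    while len(out) < n_out:
        b = rng.choice(k * k, p=p)           # select a box with probability p_ij
        members = np.flatnonzero(box_of == b)
        if members.size > 0:                 # redraw if the selected box is empty
            out.append(rng.choice(members))  # uniform draw within the box
    return np.array(out)                     # indices into G1
```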
We proposed a flexible semi-parametric approach (Rw-ERM) to leverage this auxiliary knowledge, which can be decomposed into two steps: 1) an appropriate link function in a parametric class is first learned by minimizing an empirical criterion involving the known moments, 2) based on the former, a weighted version of the ERM procedure is next performed. Under natural assumptions on our model, we provide non-asymptotic guarantees on this procedure, revealing that the learning rate achieved by Rw-ERM is of the same order as that attained when training a learning algorithm directly on the target. We also provided preliminary experimental results illustrating the relevance of the procedure. Let us finally mention some exciting future lines of investigation. A first interesting question concerns the choice of the parametric class: given some a priori knowledge on the shape of the link, are there some natural choices to make? In another direction, our approach could be extended in a non-parametric fashion, where the link function is estimated directly using flexible learning algorithms with loss given by the empirical transfer criterion Ψn. Technical Proofs The proofs of the results stated in Section 3 are sketched below. We start with the proof of Proposition 1 followed by that of Theorem 1. For the sake of completeness, we state two additional lemmas which are instrumental in obtaining our results. The proofs of these lemmas together with all additional details are postponed to the Supplementary Material. Proof of Proposition 1 The proof of Proposition 1 first consists in introducing for every δ (0, 1) Ωδ,n = ( sup α A Ψn(α) Ψ (α) C(δ, n) n On the event Ωδ,n, the functions Ψn and Ψ are uniformly close. Combining this fact with Assumption 5, it is then possible to prove that the inequality ||ˆα α || C(δ, n)/(c n) holds deterministically on the event Ωδ,n. Therefore, the demonstration essentially boils down to obtaining the following lemma, whose proof is postponed to the Supplementary Material. Lemma 1. Let δ (0, 1). Under Assumption 4, with probability at least 1 δ Ψn(α) Ψ (α) C(δ, n) n , where C(δ, n) is an explicit quantity given in the proof which depends only on δ, n, K1, K2 and d that satisfies supn 1 C(δ, n) < + . Assuming Lemma 1 is proved, let us give more details on the approach we have just described. Let δ (0, 1). We place ourselves on the event Ωδ,n. In order to lighten notation, we introduce t := C(δ, n)/ n. Let ˆα arg min Ψn(α). We claim that ||ˆα α || R. Indeed, if one had ||ˆα α || > R, then one would have that Ψ (ˆα) > ε by Assumption 5.(i). Since we placed ourselves on the event Ωδ,n, we would then have the following lower bound Ψn(ˆα) Ψ (ˆα) t > ε t. The latter would contradict the fact that Ψn attains its minimum at ˆα, since Ψn(α ) t and since t < ε t for large enough n. More precisely, the latter inequality is true whenever n nδ,ε := 4 supn 1 C(δ, n)2/ε2. Therefore there exists some v Sp 1 and some λ0 [0, R] such that ˆα = α + λ0v. This prompts us to introduce the functions: fn(λ) := Ψn(α + λv), f (λ) := Ψ (α + λv), which are well-defined for every λ [0, R]. λ3 0 λmin λmax Figure 4. In red: the function f . In blue: the function fn, which attains its minimum at λ1 0, λ2 0 and λ3 0. The plain gray curves corresponds to the translations by t of f . They delimit the area to which the function fn belongs on the event Ωδ,n. 
*The heavy black line is the function $\lambda \mapsto c|\lambda|$, the lower bound on $f_*$ given by Assumption 5.(ii), and the dashed black line corresponds to its translation by $-t$. The points most distant from 0 at which the function $f_n$ could attain its minimum are $\lambda_{\min}$ and $\lambda_{\max}$, in magenta.*

These reals satisfy $-2t/c \leq \lambda_{\min}$ and $\lambda_{\max} \leq 2t/c$ under Assumption 5.(ii): the worst situation corresponds to $f_*(\lambda) = c|\lambda|$, in which case $f_n$ could possibly attain its minimal value at $-2t/c$ (or $2t/c$), which is the abscissa of the (green) point of intersection between the line $y = t$ and the line $y = -cx - t$ (or $y = cx - t$). Finally, we claim that $\lambda_0 \leq \frac{2t}{c}$. Indeed, if one had $\lambda_0 > \frac{2t}{c}$, then $\min_{\alpha \in A} \Psi_n(\alpha) = \Psi_n(\hat{\alpha}) = f_n(\lambda_0) \geq f_*(\lambda_0) - t \geq c\lambda_0 - t > t$ by Assumption 5.(ii), which would contradict the fact that $\hat{\alpha} \in \arg\min_{\alpha \in A} \Psi_n(\alpha)$, since $\Psi_n(\alpha^*) \leq t$. See Figure 4 for an illustration. Hence, on the event $\Omega_{\delta,n}$, for every $n \geq n_{\delta,\varepsilon}$, $\|\hat{\alpha} - \alpha^*\| \leq 2t/c = 2C(\delta, n)/(c\sqrt{n})$. We set $C(\delta) := \sup_{n \geq 1} C(\delta, n)$ to conclude. $\square$

**Proof of Theorem 1.** To avoid space-consuming formulas, we introduce the two following notations for the reweighted risk and its empirical counterpart:

$$R^\alpha(\theta) := \mathbb{E}_{P_S}[g(Z, \alpha) \ell(Z, \theta)], \qquad R_n^\alpha(\theta) := \frac{1}{n} \sum_{i=1}^n g(Z_i, \alpha) \ell(Z_i, \theta).$$

We remark that for every $\theta \in \Theta$, $R_{P_T}(\theta) = R^{\alpha^*}(\theta)$, so that

$$R_{P_T}(\hat{\theta}) - \inf_{\theta \in \Theta} R_{P_T}(\theta) = R^{\alpha^*}(\hat{\theta}) - \inf_{\theta \in \Theta} R^{\alpha^*}(\theta) =: (I).$$

Let $\theta^*$ be such that $R^{\alpha^*}(\theta^*) = \inf_{\theta \in \Theta} R^{\alpha^*}(\theta)$. Using that $R_n^{\hat{\alpha}}(\hat{\theta}) - R_n^{\hat{\alpha}}(\theta^*) \leq 0$ by definition of $\hat{\theta}$, we can deduce that

$$(I) \leq 2 \sup_{\theta \in \Theta} \left| R_n^{\hat{\alpha}}(\theta) - R_n^{\alpha^*}(\theta) \right| + 2 \sup_{\theta \in \Theta} \left| R_n^{\alpha^*}(\theta) - R^{\alpha^*}(\theta) \right|. \tag{13}$$

Using Assumption 2, the boundedness of $\ell$ and the bound (11), we can bound the first term with probability at least $1 - \delta/2$ as follows:

$$2 \sup_{\theta \in \Theta} \left| R_n^{\hat{\alpha}}(\theta) - R_n^{\alpha^*}(\theta) \right| \leq 2 L \|\ell\|_\infty \|\hat{\alpha} - \alpha^*\| \leq \frac{4 L \|\ell\|_\infty C(\delta/2)}{c\sqrt{n}}.$$

The second term in (13) can be bounded using Lemma 2, which is proved in the Supplementary Material.

**Lemma 2.** Let $\delta \in (0, 1)$. Under Assumption 2 and $\|\ell\|_\infty < +\infty$, with probability at least $1 - \delta$,

$$\sup_{\theta \in \Theta} \left| R_n^{\alpha^*}(\theta) - R^{\alpha^*}(\theta) \right| \leq 4 \|g\|_\infty \, \mathbb{E}_{P_S^n}[\mathcal{E}_n(\Theta)] + \frac{\|g\|_\infty \|\ell\|_\infty}{\sqrt{n}} + \frac{2 \|g\|_\infty \|\ell\|_\infty \sqrt{\ln(1/\delta)}}{\sqrt{n}}.$$

We apply Lemma 2 with $\delta/2$ instead of $\delta$. This ends the proof. $\square$

## Acknowledgements

The authors are greatly indebted to the chair DSAIDIS of Télécom Paris for its support.

## References

Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., and Kalai, A. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In NIPS, pages 4349-4357.

Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford.

Chen, X., Hong, H., and Nekipelov, D. (2011). Nonlinear models of measurement errors. Journal of Economic Literature, 49(4):901-937.

Deville, J.-C. (2000). Generalized calibration and application to weighting for non-response. In Bethlehem, J. G. and van der Heijden, P. G. M., editors, COMPSTAT, pages 65-76, Heidelberg. Physica-Verlag HD.

Deville, J.-C. and Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418):376-382.

Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer.

Gilbert, P. B., Lele, S. R., and Vardi, Y. (1999). Maximum likelihood estimation in semiparametric selection bias models with application to AIDS vaccine trials. Biometrika, 86(1):27-43.

Giné, E. and Nickl, R. (2015). Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.

Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Schölkopf, B. (2009). Covariate shift by kernel mean matching. In NIPS 2009.
Guggemos, F. and Tillé, Y. (2010). Penalized calibration in survey sampling: Design-based estimation assisted by mixed models. Journal of Statistical Planning and Inference, 140(11):3199-3212.

Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica, 50(4):1029-1054.

Huang, J., Gretton, A., Borgwardt, K. M., Schölkopf, B., and Smola, A. J. (2007). Correcting sample selection bias by unlabeled data. In NIPS, pages 601-608.

Imbens, G. W. and Lancaster, T. (1994). Combining micro and macro data in microeconometric models. The Review of Economic Studies, 61(4):655-680.

Kato, K. and Sasaki, Y. (2019). Uniform confidence bands for nonparametric errors-in-variables regression. Journal of Econometrics, 213(2):516-555.

Kato, K., Sasaki, Y., and Ura, T. (2019). Inference based on Kotlarski's identity.

Laforgue, P. and Clémençon, S. (2019). Statistical learning from biased training samples.

Liu, Z., Yang, J.-a., Liu, H., and Wang, W. (2016). Transfer learning by sample selection bias correction and its application in communication specific emitter identification. JCM, 11:417-427.

Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Müller, A., Nothman, J., Louppe, G., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2018). Scikit-learn: Machine learning in Python.

Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (2009). Dataset Shift in Machine Learning. The MIT Press.

Sugiyama, M., Nakajima, S., Kashima, H., von Bünau, P., and Kawanabe, M. (2007). Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS, volume 7, pages 1433-1440.

Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., and Kawanabe, M. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699-746.

van der Vaart, A. and Wellner, J. (1996). Weak Convergence of Empirical Processes: with Applications to Statistics. Springer-Verlag New York.

Vogel, R., Achab, M., Clémençon, S., and Tillier, C. (2020). Weighted empirical risk minimization: Sample selection bias correction based on importance sampling. In Proceedings of ICMA.

Wang, M., Deng, W., Hu, J., Tao, X., and Huang, Y. (2019). Racial faces in the wild: Reducing racial bias by information maximization adaptation network. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, pages 692-702. IEEE.

Weiss, K., Khoshgoftaar, T. M., and Wang, D. (2016). A survey of transfer learning. Journal of Big Data, 3(1):9.

Zhang, B. (2000). M-estimation under a two-sample semiparametric model. Scandinavian Journal of Statistics, 27(2):263-280.

Zhao, J., Wang, T., Yatskar, M., Ordonez, V., and Chang, K.-W. (2017). Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In EMNLP.