# Double-Weighting for Covariate Shift Adaptation

José I. Segovia-Martín¹, Santiago Mazuelas¹ ², Anqi Liu³

**Abstract.** Supervised learning is often affected by a covariate shift in which the marginal distributions of instances (covariates x) of training and testing samples ptr(x) and pte(x) are different but the label conditionals coincide. Existing approaches address such covariate shift by either using the ratio pte(x)/ptr(x) to weight training samples (reweighted methods) or using the ratio ptr(x)/pte(x) to weight testing samples (robust methods). However, the performance of such approaches can be poor under support mismatch or when the above ratios take large values. We propose a minimax risk classification (MRC) approach for covariate shift adaptation that avoids such limitations by weighting both training and testing samples. In addition, we develop effective techniques that obtain both sets of weights and generalize the conventional kernel mean matching method. We provide novel generalization bounds for our method that show a significant increase in the effective sample size compared with reweighted methods. The proposed method also achieves enhanced classification performance in both synthetic and empirical experiments.

## 1. Introduction

Most supervised learning methods assume that training and testing samples are drawn i.i.d. from the same underlying distribution. However, practical scenarios are often affected by a covariate shift in which the marginal distributions of instances (covariates x) of training and testing samples ptr(x) and pte(x) are different (see e.g., (Sugiyama & Kawanabe, 2012; Quiñonero-Candela et al.,

¹Basque Center for Applied Mathematics (BCAM), Bilbao, Spain. ²IKERBASQUE, Basque Foundation for Science. ³CS Department, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland, USA. Correspondence to: José I. Segovia-Martín, Santiago Mazuelas, Anqi Liu.
Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

2008)), while the conditional label distribution stays the same. In such scenarios, conventional supervised classification methods, such as empirical risk minimization, can perform poorly because the empirical risk approximates the training expected risk rather than the test expected risk.

Most of the existing methods for covariate shift adaptation are based on the reweighted approach (Sugiyama & Kawanabe, 2012; Quiñonero-Candela et al., 2008; Cortes et al., 2008; Huang et al., 2006). These methods weight loss functions at training using the ratio pte(x)/ptr(x), so that training samples that are more likely under the test distribution are assigned higher weights (see Fig. 1), increasing their relevance at training. Such ratios can be estimated using training and testing instances (Tsuboi et al., 2009; Yamada et al., 2011; Liu et al., 2013). Reweighted methods are designed for situations where the support of ptr contains that of pte. However, even if such condition is satisfied, reweighted methods may achieve poor performance if the ratio pte(x)/ptr(x) takes large values at certain training samples, leading to inaccurate estimates of expected losses (see e.g., (Cortes & Mohri, 2014; Reddi et al., 2015)). Such problems can be alleviated by flattening the above ratio (Shimodaira, 2000; Yamada et al., 2011), by utilizing a regularization term based on the unweighted solution (Reddi et al., 2015), or by directly estimating weights for training samples through kernel mean matching (KMM) methods (Gretton et al., 2008; Huang et al., 2006).
Robust methods for covariate shift adaptation (Liu & Ziebart, 2014; 2017; Chen et al., 2016) are derived from a distributionally robust learning framework, where the feature expectation matching constraints are obtained from training samples but the adversarial risk is defined on the test distribution. Such methods weight feature functions at testing using the ratio ptr(x)/pte(x) (see Fig. 1). The resulting parametric form produces less confident predictions when testing samples are less likely under the training distribution. Robust methods are designed for situations where the support of pte contains that of ptr. However, even if such condition is satisfied, robust methods may achieve poor performance if the ratio ptr(x)/pte(x) takes large values at certain testing samples, leading to overconfident classification rules.

Figure 1. Different approaches for covariate shift adaptation (training and testing instances follow Gaussian distributions with probability mass concentrated in the red and blue circles, resp.). Reweighted methods weight training instances x using the ratio β(x) = pte(x)/ptr(x), while robust methods weight testing instances x using the ratio α(x) = ptr(x)/pte(x). The proposed approach utilizes weights both for training and testing instances and can avoid large weights β(x) by reducing the corresponding α(x), and avoid large weights α(x) by reducing the corresponding β(x).

In practice, the distributions of training and testing instances can differ in an arbitrary manner (e.g., their supports may not be contained in each other). This paper proposes a learning methodology that can tackle such general covariate shift and addresses the limitations of existing approaches by weighting both training and testing samples (see Fig. 1).
In particular, the methods proposed are based on minimax risk classifiers (MRCs) (Mazuelas et al., 2022; 2023) and utilize weighted averages of training samples to estimate expectations of weighted feature functions under the test distribution. Specifically, the main contributions of the paper are as follows.

- We present a learning framework for general covariate shift adaptation based on a double-weighting of both training and testing samples. Our framework encompasses existing approaches for specific choices of weights.
- We propose effective techniques that obtain weights for training and testing samples, and generalize the conventional KMM that only obtains weights for training samples.
- We develop generalization bounds for the proposed methods that show a significant increase in effective sample size compared with reweighted approaches.
- We experimentally assess the performance improvement obtained by the proposed techniques in multiple covariate shift scenarios.

Notations. Calligraphic upper case letters represent sets; bold lower and upper case letters represent vectors and matrices, respectively; for a vector v, v(i) denotes its i-th component, |v| denotes its component-wise absolute value, and (v)₊ denotes its component-wise positive part; 1 denotes a vector with all components equal to 1; ‖·‖₁, ‖·‖∞, and ‖·‖H denote the 1-norm, the infinity norm, and the Hilbert space norm of the argument, respectively; ⪯ and ⪰ represent vector component-wise inequalities; N(m, Σ) denotes the pdf of a Gaussian r.v. x with mean m and covariance matrix Σ; and Ep{·} denotes the expectation of its argument w.r.t. distribution p.

## 2. Preliminaries

This section describes the learning setup, the two main existing approaches for covariate shift adaptation, and the framework of MRCs.

Setup. Let X be the set of instances and Y the set of labels, represented by the set {1, . . . , |Y|}. We denote by Δ(X × Y) the set of probability distributions over X × Y, and by T(X, Y) the set of classification rules.
For h ∈ T(X, Y), we denote by h(y|x) the probability with which instance x ∈ X is classified with label y ∈ Y. We use the notation pte for the underlying distribution at test time, and (x1, y1), (x2, y2), . . . , (xn, yn) for the set of training samples. The ℓ-risk of a classification rule h is its expected classification loss with respect to the true underlying distribution at test pte, i.e., R(h) = Epte{ℓ(h, (x, y))}. The learning objective is to use the training samples to find a classification rule h that has small ℓ-risk R(h). In this paper, we consider the 0-1-loss and the log-loss:

ℓ01(h, (x, y)) = 1 − h(y|x)    (1)
ℓlog(h, (x, y)) = −log h(y|x).    (2)

Covariate shift. Under covariate shift, the training samples (x1, y1), (x2, y2), . . . , (xn, yn) follow a distribution ptr(x, y) such that the marginal distributions of instances differ, pte(x) ≠ ptr(x), but the label conditionals coincide, ptr(y|x) = pte(y|x). In addition, covariate shift methods assume that the ratio between ptr(x) and pte(x) is known or that t unsupervised samples xn+1, xn+2, . . . , xn+t from pte(x) are available at training. In previous literature, it is also usually assumed that the training support contains that at testing or vice versa. In this paper, we consider general scenarios of covariate shift in which such supports are not required to contain each other.

### 2.1. Main existing approaches

Reweighted methods. Most of the techniques for covariate shift adaptation are based on the reweighted approach (Sugiyama & Kawanabe, 2012; Shimodaira, 2000; Zadrozny, 2004; Cortes et al., 2008; Dudík et al., 2005; Lin et al., 2002). These methods exploit the fact that, for any function f, we have

Epte{f(x, y)} = Eptr{β(x)f(x, y)}, for β(x) = pte(x)/ptr(x),    (3)

which is valid whenever pte(x) > 0 ⇒ ptr(x) > 0. Reweighted methods weight loss functions at training by means of the weight function β(x) in (3), as detailed in Appendix A.
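As a quick sanity check of the identity in (3), the following sketch (not from the paper; the Gaussian marginals and the test function f are illustrative assumptions) verifies by Monte Carlo that reweighting training samples by β(x) recovers a test expectation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed 1-D marginals for illustration: p_tr = N(0, 1), p_te = N(0.5, 1).
x_tr = rng.normal(0.0, 1.0, size=200_000)

# Density ratio beta(x) = p_te(x) / p_tr(x) = exp(0.5 x - 0.125) in closed form.
beta = np.exp(0.5 * x_tr - 0.125)

# Reweighted training average of f(x) = x^2 approximates E_pte[x^2] = 1 + 0.5^2.
est_reweighted = np.mean(beta * x_tr**2)
assert abs(est_reweighted - 1.25) < 0.05
```

The same mechanism underlies reweighted empirical risk minimization: the loss plays the role of f above.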
Using these weights, such methods can account for the fact that some training instances are unlikely at testing, and assign low relevance to such instances at training (see Fig. 1). Reweighted methods assume the support of ptr contains that of pte (i.e., pte(x) > 0 ⇒ ptr(x) > 0) so that (3) is valid. Even if this condition is satisfied, such methods may achieve poor performance if the ratio β(x) in (3) takes large values at certain training samples. In these cases, the learning process is dominated by a few training samples (see e.g., (Cortes & Mohri, 2014; Gretton et al., 2008)). The flattening approach alleviates such problems using smoothed weights for training samples, determined by a hyperparameter γ as (pte(x)/ptr(x))^γ in (Shimodaira, 2000) and as pte(x)/(γpte(x) + (1 − γ)ptr(x)) in (Yamada et al., 2011).

Robust methods. Robust methods under covariate shift (Liu & Ziebart, 2014; 2017) exploit the fact that, for any function f,

Epte{α(x)f(x, y)} = Eptr{f(x, y)}, for α(x) = ptr(x)/pte(x),    (4)

which is valid whenever ptr(x) > 0 ⇒ pte(x) > 0. Robust methods weight feature functions at testing¹ by means of the weight function α(x) in (4), as detailed in Appendix A. Using these weights, such methods can account for the fact that some testing instances are unlikely at training, and consider rules that assign low-confidence predictions to such instances (see Fig. 1). Robust methods assume the support of pte contains that of ptr (i.e., ptr(x) > 0 ⇒ pte(x) > 0) so that (4) is valid. Even if this condition is satisfied, such methods may achieve poor performance if the ratio α(x) in (4) takes large values at certain testing samples. In these cases, the classification rule would only provide confident predictions at a few testing samples.

¹The robust bias-aware prediction weights the feature functions both at training and testing, as the weight appears in the predictive parametric form. Here we emphasize the weights at testing to show a symmetric view between the two methods.
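The symmetric identity (4) admits the same kind of Monte Carlo check; in this sketch (with the same kind of assumed Gaussian marginals, not from the paper) the α-weighted test expectation matches the unweighted training expectation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed marginals for illustration: p_tr = N(0, 1), p_te = N(0.5, 1).
x_te = rng.normal(0.5, 1.0, size=200_000)

# Density ratio alpha(x) = p_tr(x) / p_te(x) = exp(-0.5 x + 0.125) in closed form.
alpha = np.exp(-0.5 * x_te + 0.125)

# E_pte[alpha(x) f(x)] matches E_ptr[f(x)] = 1, the second moment of N(0, 1).
est_robust = np.mean(alpha * x_te**2)
assert abs(est_robust - 1.0) < 0.05
```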
The connection and symmetric relation between reweighted and robust methods is also shown in Theorem 3 of (Liu & Ziebart, 2014). They can both be regarded as special cases of the adversarial risk minimization framework (Fathony et al., 2016) when the feature expectation matching constraints are set to match different empirical estimates of the features. To enable covariate shift adaptation in general cases (e.g., training and testing supports not contained in each other), this paper proposes a learning framework that avoids the limitations of existing methods by utilizing a double-weighting approach.

### 2.2. Minimax Risk Classifiers

Similarly to other approaches based on robust risk minimization (RRM) (Farnia & Tse, 2016; Fathony et al., 2016), MRC methods (Mazuelas et al., 2022; 2023) do not require that the training samples follow the same distribution as the testing samples. MRCs minimize the worst-case expected loss with respect to distributions in uncertainty sets that can contain the true underlying distribution with high probability. The uncertainty sets are given by constraints on the expectation of a function Φ : X × Y → R^m referred to as the feature mapping. Such a function can be defined using one-hot encodings of the elements of Y as Φ(x, y) = e_y ⊗ x, where e_y is the y-th element of the canonical basis of R^|Y| and ⊗ denotes the Kronecker product. Given the uncertainty set U, we say that a classification rule h^U is an ℓ-MRC for U if

h^U ∈ arg min_{h ∈ T(X,Y)} max_{p ∈ U} ℓ(h, p)    (5)

and we denote by R(U) the minimax risk against U, i.e.,

R(U) = min_{h ∈ T(X,Y)} max_{p ∈ U} ℓ(h, p)    (6)

where ℓ(h, p) denotes the expected loss of classification rule h w.r.t. distribution p.

## 3. Framework for Adaptation to General Covariate Shift

This section first describes the proposed double-weighting approach and the corresponding MRC learning methodology.
We then describe its relationship with existing techniques, present finite-sample generalization bounds for the proposed methods, and discuss the trade-off involved in the choice of weight functions.

### 3.1. Double-weighting

The proposed framework considers both training and testing weights, β(x) and α(x) (see Fig. 1). We exploit the fact that, for any function f, the equality

Epte{α(x)f(x, y)} = Eptr{β(x)f(x, y)}    (7)

can be attained by multiple choices of weights α(x) and β(x). For instance, it is satisfied taking

α(x) = min{C ptr(x)/pte(x), 1},  β(x) = min{pte(x)/ptr(x), C}    (8)

for any C > 0, since α(x)pte(x) = β(x)ptr(x) for every x ∈ X. Notice that the equality in (7) is satisfied taking weights as in (8) even if the supports of ptr and pte do not contain each other.

Such a double-weighting approach can avoid the limitations of reweighted and robust methods. For x ∈ X with large ratio pte(x)/ptr(x), using a small α(x) makes it possible to have α(x)pte(x) = β(x)ptr(x) with moderate values of β(x). Reciprocally, for x ∈ X with large ratio ptr(x)/pte(x), using a small β(x) makes it possible to have α(x)pte(x) = β(x)ptr(x) with moderate values of α(x). For instance, using weights as in (8) we have that β(x) ≤ C and α(x) ≤ 1 for any x ∈ X. Considering both weights β(x) and α(x), we can both assign low relevance to training instances that are unlikely at testing, and assign low-confidence predictions to testing instances that are unlikely at training.

### 3.2. MRC learning framework using double-weighting

The proposed framework adapts to general covariate shift by constructing the uncertainty set U in (5) using both weights α(x) and β(x). In particular, we use feature mappings weighted by α(x) as Φα(x, y) = α(x)Φ(x, y) and constrain the difference between the expectation and empirical mean estimates of feature mappings as follows:

U = {p ∈ Δ(X × Y) : |Ep{Φα(x, y)} − τ| ⪯ λ and p(x) = pte(x), ∀x ∈ X}    (9)

where τ denotes the mean vector of expectation estimates, and λ is a vector that determines the confidence with which pte(x, y) ∈ U.
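The weight pair in (8) can be sketched directly. In the snippet below (the density arrays are placeholders; `C` is the bound on β from the text), the key identity α(x)pte(x) = β(x)ptr(x) holds pointwise while both weights stay bounded:

```python
import numpy as np

def double_weights(p_tr_x, p_te_x, C):
    """Weights of (8): alpha(x) = min{C p_tr(x)/p_te(x), 1} and
    beta(x) = min{p_te(x)/p_tr(x), C}. Assumes strictly positive densities."""
    alpha = np.minimum(C * p_tr_x / p_te_x, 1.0)
    beta = np.minimum(p_te_x / p_tr_x, C)
    return alpha, beta

rng = np.random.default_rng(0)
# Placeholder marginal densities evaluated at 1,000 instances.
p_tr_x = rng.uniform(0.01, 1.0, size=1000)
p_te_x = rng.uniform(0.01, 1.0, size=1000)
alpha, beta = double_weights(p_tr_x, p_te_x, C=2.0)

# The double-weighting identity and the bounds alpha <= 1, beta <= C.
assert np.allclose(alpha * p_te_x, beta * p_tr_x)
assert alpha.max() <= 1.0 and beta.max() <= 2.0
```

Either the first or the second minimum is active at each x, which is exactly how a large density ratio on one side is traded for a small weight on the other.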
The expectation of the feature mapping Φα(x, y) is estimated using averages of training samples weighted by β(x) as

τ = (1/n) Σ_{i=1}^{n} Φβ(xi, yi), for Φβ(x, y) = β(x)Φ(x, y).    (10)

Notice that the mean vector τ is an unbiased estimator of Epte{Φα(x, y)} for any choice of weights satisfying α(x)pte(x) = β(x)ptr(x), in particular for those given by (8). In addition, the accuracy of τ can be improved using weights α(x) that avoid large weights β(x), as discussed in Section 3.4 below.

Convex optimization. We next show how MRCs corresponding with uncertainty sets (9) can be learned by solving the convex optimization problem

min_µ −τᵀµ + Epte(x){ϕℓ(µ, x, α(x))} + λᵀ|µ|    (11)

where ϕℓ is a function defined according to the loss. For the 0-1-loss, we have

ϕ01(µ, x, α(x)) = 1 + max_{C ⊆ Y, C ≠ ∅} ( Σ_{y ∈ C} Φα(x, y)ᵀµ − 1 ) / |C|    (12)

and, for the log-loss, we have

ϕlog(µ, x, α(x)) = log Σ_{y ∈ Y} exp( Φα(x, y)ᵀµ ).    (13)

Theorem 3.1. Let τ, λ ∈ R^m be such that the uncertainty set U in (9) is not empty. If µ* is a solution of (11) for the 0-1-loss, the classification rule

h^U(y|x) = ( α(x)Φ(x, y)ᵀµ* − ϕ01(µ*, x, α(x)) + 1 )₊    (14)

is a 0-1-MRC for U. If µ* is a solution of (11) for the log-loss, the classification rule

h^U(y|x) = exp( α(x)Φ(x, y)ᵀµ* − ϕlog(µ*, x, α(x)) )    (15)

is a log-MRC for U. In addition, the minimax risk R(U) is given by

R(U) = −τᵀµ* + Epte(x){ϕℓ(µ*, x, α(x))} + λᵀ|µ*|.    (16)

Proof. See Appendix B.

Remarks. The optimization in (11) can be addressed in practice using conventional optimization methods such as stochastic gradient descent. If unlabeled instances from the test distribution are available at training, they can directly be used to obtain samples corresponding to the (sub)gradient of Epte(x){ϕℓ(µ, x, α(x))}, since the function ϕℓ does not depend on labels. If the marginals ptr(x) and pte(x) are known, training samples can be used to obtain samples of the above gradient using (3).
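For intuition, the following sketch (not the MRCpy implementation; the toy data, the feature map, λ = 0, and α = β = 1 — i.e., no shift — are simplifying assumptions) minimizes an empirical version of (11) for the log-loss with `scipy` and predicts through the deterministic rule in (17):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Toy binary data: class y = 1 centered at (1, 1), class y = 0 at (-1, -1).
n, d = 200, 2
y = rng.integers(0, 2, size=n)
X = rng.normal(0.0, 1.0, size=(n, d)) + np.where(y[:, None] == 1, 1.0, -1.0)

def phi(x, label):
    """One-hot feature map Phi(x, y) = e_y (Kronecker) x, so m = 2d."""
    out = np.zeros(2 * d)
    out[label * d:(label + 1) * d] = x
    return out

# Mean vector tau as in (10) with beta = 1.
tau = np.mean([phi(X[i], y[i]) for i in range(n)], axis=0)

def objective(mu):
    # Empirical (11) for log-loss: -tau^T mu + (1/n) sum_i log sum_y exp(Phi(x_i, y)^T mu).
    scores = np.stack([X @ mu[:d], X @ mu[d:]], axis=1)
    return -tau @ mu + logsumexp(scores, axis=1).mean()

mu_star = minimize(objective, np.zeros(2 * d)).x

# Deterministic rule (17): argmax_y Phi(x, y)^T mu_star.
pred = np.argmax(np.stack([X @ mu_star[:d], X @ mu_star[d:]], axis=1), axis=1)
accuracy = np.mean(pred == y)
assert accuracy > 0.8
```

Under covariate shift, the average inside `objective` would instead run over testing instances weighted by α, and τ would use the β-weighted average of (10).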
This theorem is novel in that we apply the weights α and β for covariate shift adaptation, even though its general form is analogous to the results in (Mazuelas et al., 2022; 2023), which study MRCs with training and testing data sampled i.i.d. from the same distribution.

Regularization. The convex optimization problem (11) carries out an L1-type regularization, where the regularization parameter is given by the vector λ. The regularization term in (11) allows to penalize each component of the parameter µ differently, so that feature components with poorly estimated expectations (i.e., components i with large λ(i)) are strongly penalized.

Classification rule. The deterministic classifier associated with h^U classifies each instance with the label maximizing h^U(y|x). For both losses, this deterministic classifier is given by

arg max_{y ∈ Y} h^U(y|x) = arg max_{y ∈ Y} α(x)Φ(x, y)ᵀµ* = arg max_{y ∈ Y} Φ(x, y)ᵀµ*.    (17)

Such deterministic classifiers, denoted by h^U_d, allow us to classify testing samples even if we do not know the weights α(x) associated with them.

Predictive confidence. The values of α(x) adjust the confidence with which each sample is classified. For instance, for very small values of α(x), the classifier h^U uniformly assigns labels in the set Y for both losses, i.e., h^U(y|x) = 1/|Y| for all y ∈ Y.

Relation with existing approaches. The general framework proposed above encompasses existing approaches, as detailed in Appendix A for binary classification with log-loss. The usage of weights α(x) = 1, β(x) = pte(x)/ptr(x) leads to reweighted methods (Sugiyama & Kawanabe, 2012), approximating Epte(x){ϕℓ(µ, x, α(x))} in (11) using training instances. The usage of weights α(x) = ptr(x)/pte(x), β(x) = 1 leads to robust methods (Liu & Ziebart, 2014), approximating the gradient of Epte(x){ϕℓ(µ, x, α(x))} in (11) using training instances.
### 3.3. Generalization bounds

The following shows generalization bounds for the proposed methods in Section 3.2. Such bounds are given in terms of the smallest minimax risk, R^∞, that corresponds with the uncertainty set given by the exact expectations, and is defined by

R^∞ = min_µ −Epte{Φα(x, y)}ᵀµ + Epte(x){ϕℓ(µ, x, α(x))}.    (18)

The MRC corresponding to the smallest minimax risk R^∞ could only be obtained by an exact estimation of the expectation of the feature mapping Φα, which in turn would require an infinite number of training samples. The theorem below shows risk bounds for the proposed MRCs in terms of minimax risks R(U) and smallest minimax risks R^∞.

Theorem 3.2. Let U be a non-empty uncertainty set given by (9) and h^U be an ℓ-MRC for U. If µ* and µ^∞ are solutions of (11) and (18), respectively, then we have that

R(h^U) ≤ R(U) + ( |τ − Epte{Φα(x, y)}| − λ )₊ᵀ |µ*|    (19)

R(h^U) ≤ R^∞ + λᵀ( |µ^∞| − |µ*| ) + |τ − Epte{Φα(x, y)}|ᵀ |µ* − µ^∞|.    (20)

Proof. See Appendix B.

Note that the minimax risk R(U) obtained at learning offers an upper bound for the ℓ-risk if λ ⪰ |τ − Epte{Φα(x, y)}|, and an approximate upper bound for general λ. In addition, the difference between the risk R(h^U) and the smallest minimax risk R^∞ decreases with the estimation error |τ − Epte{Φα(x, y)}|. We next show how the proposed methods can lead to a significant increase in effective sample size compared with reweighted methods.

Corollary 3.3. Let U be a non-empty uncertainty set given by (9) with λ = 0, and h^U be an ℓ-MRC for U. Let the weights α(x) and β(x) be given by (8) with C = B/√D for D ≥ 1 and

B = sup_{x ∈ X} pte(x)/ptr(x).    (21)

Then, with probability at least 1 − δ, we have that

R(h^U) ≤ R^∞ + M (B/√D) √(2 log(2m/δ)/n) ‖µ* − µ^∞‖₁    (22)

where M is a constant satisfying ‖Φ(x, y)‖∞ ≤ M for all x ∈ X, y ∈ Y.

Proof. A direct consequence of Theorem 3.2 and Hoeffding's inequality.
As described in (Cortes et al., 2010; Yu & Szepesvári, 2012), reweighted methods have an estimation error of order B√(log(1/δ)/n), so the proposed methods can achieve an effective sample size increased by a factor of D using the double-weighting given by (8) with C = B/√D. The next section discusses more broadly such an increase in effective sample size and the corresponding trade-off with prediction confidence.

### 3.4. Choice of weight functions

Existing reweighted and robust methods, as well as the proposed general framework, utilize weights α(x) and β(x) in the estimation of expectations:

(1/n) Σ_{i=1}^{n} β(xi)f(xi, yi) ≈ Epte{α(x)f(x, y)}.    (23)

The error of such estimates is determined by the weights β(x). If α(x) and β(x) satisfy α(x)pte(x) = β(x)ptr(x), using Hoeffding's inequality we have that

| (1/n) Σ_{i=1}^{n} β(xi)f(xi, yi) − Epte{α(x)f(x, y)} | ≤ ‖β‖∞ ‖f‖∞ √(2 log(2/δ)/n)    (24)

with probability at least 1 − δ. In particular, for reweighted methods the bound (24) becomes B‖f‖∞√(2 log(2/δ)/n), with B given by (21), as shown in (Cortes et al., 2010; Yu & Szepesvári, 2012).

The error in the expectation estimates in (23) decreases when the weights α(x) are chosen adequately. In particular, using small values of α(x) we can achieve α(x)pte(x) = β(x)ptr(x) with moderate values of β(x). Such improvement comes at the expense of using classification rules with significant confidence only in the subregion of X in which α(x) is significantly larger than 0.

The above trade-off between error in expectation estimates and confidence of classification rules can be addressed using pairs of weights of the form (8) and varying the value of C. For values C ≥ B, we have α(x) = 1 and β(x) = pte(x)/ptr(x), which corresponds to the reweighted approach. For values C < B, the expectation estimates improve as we decrease C since ‖β‖∞ ≤ C. However, the corresponding classification rules would only predict with significant confidence in the subregion of X where α(x) is significantly larger than 0.
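The effect of shrinking C on the bound (24) can be seen numerically. This sketch (an illustrative discrete example of our own, not from the paper) estimates the α-weighted test expectation from β-weighted training samples; flattening β with a smaller C visibly reduces the estimator's spread:

```python
import numpy as np

rng = np.random.default_rng(1)

# Discrete instance space with a harsh ratio p_te/p_tr on the last atom.
p_tr = np.array([0.49, 0.49, 0.02])
p_te = np.array([0.25, 0.25, 0.50])   # ratio p_te/p_tr is 25 on the last atom
f = np.array([1.0, -1.0, 2.0])        # arbitrary bounded test function
B = np.max(p_te / p_tr)               # = 25, as in (21)

def weights(C):
    # The weight pair of (8): alpha <= 1, beta <= C, alpha*p_te = beta*p_tr.
    return np.minimum(C * p_tr / p_te, 1.0), np.minimum(p_te / p_tr, C)

def estimator_std(C, n=100, reps=2000):
    """Spread of the beta-weighted training average around E_pte[alpha f]."""
    alpha, beta = weights(C)
    target = np.sum(p_te * alpha * f)  # exact alpha-weighted test expectation
    xs = rng.choice(3, size=(reps, n), p=p_tr)
    ests = (beta[xs] * f[xs]).mean(axis=1)
    return np.std(ests - target)

# Shrinking C from B flattens beta and tightens the Hoeffding bound (24).
assert estimator_std(B / 5) < estimator_std(B)
```

The estimator stays unbiased for the α-weighted expectation in both cases; only the target (and with it the confidence of the resulting rule) changes with C.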
Such subregion shrinks as C decreases because it is composed of the x ∈ X where pte(x) is not significantly larger than Cptr(x). In the following, we present methods that obtain weights α and β addressing the above trade-off, and generalize conventional KMM methods.

## 4. Double-Weighting Kernel Mean Matching

The KMM method obtains weights β ∈ R^n for n training instances x1, x2, . . . , xn using t testing instances xn+1, xn+2, . . . , xn+t (Huang et al., 2006; Gretton et al., 2008). We propose the double-weighting KMM (DW-KMM) method that obtains weights β ∈ R^n for the n training instances together with weights α ∈ R^t for the t testing instances by solving the optimization problem

min_{α,β} ‖ (1/t) Σ_{i=1}^{t} α(i)K(xn+i) − (1/n) Σ_{i=1}^{n} β(i)K(xi) ‖_H
s.t. 0 ≤ β(i) ≤ B/√D, for i = 1, . . . , n
     0 ≤ α(i) ≤ 1, for i = 1, . . . , t
     (1/t)‖α − 1‖₂² ≤ 1 − 1/√D    (25)

where K : X → H is a feature map corresponding with a reproducing kernel Hilbert space (RKHS) H with kernel k(x, x̃) = ⟨K(x), K(x̃)⟩_H. As described above, the hyperparameter D ≥ 1 in (25) balances the trade-off between error in expectation estimates and confidence of the classification. For D = 1, the optimization problem becomes that of KMM for reweighted methods (Huang et al., 2006; Gretton et al., 2008).

Performance guarantees. The proposed approach is an empirical version of the following (population) problem given by exact expectations:

min_{α(x),β(x)} ‖ Epte(x){α(x)K(x)} − Eptr(x){β(x)K(x)} ‖_H
s.t. 0 ≤ β(x) ≤ B/√D, 0 ≤ α(x) ≤ 1, ∀x ∈ X
     Epte(x){α(x)} = Eptr(x){β(x)}
     Epte(x){(α(x) − 1)²} ≤ 1 − 1/√D.    (26)

The minimum value of (26) is zero since (8) with C = B/√D is a feasible solution. Then, solutions of (26), β̂(x) and α̂(x), provide consistent estimators of expectations because

Epte(x,y){α̂(x)Φ(x, y)} = Eptr(x,y){β̂(x)Φ(x, y)}    (27)

is satisfied if the kernel k is characteristic or if Epte(y|x){Φ(x, y)} belongs to H, analogously as shown in (Yu & Szepesvári, 2012). With finite samples, the following theorem bounds the difference between the empirical means in feature space for solutions of (26).

Theorem 4.1.
If β̂(x) and α̂(x) are solutions of (26), then with probability at least 1 − δ we have that

‖ (1/n) Σ_{i=1}^{n} β̂(xi)K(xi) − (1/t) Σ_{i=1}^{t} α̂(xn+i)K(xn+i) ‖_H ≤ (1 + √(2 log(2/δ))) κ √(B²/(Dn) + 1/t)    (28)

where the constant κ satisfies |k(x, x)| ≤ κ² for all x ∈ X.

Proof. See Appendix C.

Relation with conventional KMM. The solutions β̂(x) for conventional KMM in reweighted methods satisfy, with probability at least 1 − δ,

‖ (1/n) Σ_{i=1}^{n} β̂(xi)K(xi) − (1/t) Σ_{i=1}^{t} K(xn+i) ‖_H ≤ (1 + √(2 log(2/δ))) κ √(B²/n + 1/t)    (29)

as shown in Lemma 4 of (Huang et al., 2006) and equation (10) in (Yu & Szepesvári, 2012). Therefore, the proposed DW-KMM allows to significantly improve the effective sample size by exploiting the usage of weights α. Analogously to the results shown in Section 3.4, the effective sample size of the proposed methods is D times larger than that of existing KMM for reweighted methods.

## 5. Practical Algorithm

In this section, we present a practical algorithm for the proposed Double-Weighting for General Covariate Shift (DW-GCS), detailed in Algorithm 1. We first compute the weights α and β by solving (25); then, we learn the classifier's parameters by solving (11) using the mean vector τ defined in (10) and the confidence vector λ.

Algorithm 1. The proposed algorithm: DW-GCS
Input: training samples (x1, y1), (x2, y2), . . . , (xn, yn); testing instances xn+1, xn+2, . . . , xn+t; hyperparameter D
Output: weights β̂ and α̂; classifier parameters µ*; minimax risk R(U)
1: β̂, α̂ ← solution of (25)
2: τ ← (1/n) Σ_{i=1}^{n} β̂(i)Φ(xi, yi)
3: λ ← solution of (31)
4: µ* ← solution of (30) using (12) for the 0-1-loss and (13) for the log-loss
5: R(U) ← −τᵀµ* + (1/t) Σ_{i=1}^{t} ϕℓ(µ*, xn+i, α̂(i)) + λᵀ|µ*|

Computing weights and learning MRCs. The weights α and β are computed by solving the convex optimization problem (25), which is a quadratic problem as detailed in Appendix D. The optimization in (11) can be addressed by approximating the expectation by means of the t testing instances xn+1, xn+2, . . .
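To make the DW-KMM problem concrete, here is a small sketch that fits both weight vectors with a Gaussian kernel via `scipy`'s SLSQP (an illustrative simplification of our own: a generic solver rather than the paper's quadratic-programming implementation, box constraints on β and α, and a budget on how far α may deviate from 1, mirroring the population constraint E{(α(x) − 1)²} ≤ 1 − 1/√D in (26); `B` and `D` below are assumed values):

```python
import numpy as np
from scipy.optimize import minimize

def rbf_gram(A, B, sigma=1.0):
    """Gaussian kernel Gram matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def dw_kmm(X_tr, X_te, B=10.0, D=4.0):
    """Jointly fit training weights beta and testing weights alpha."""
    n, t = len(X_tr), len(X_te)
    K_rr = rbf_gram(X_tr, X_tr)
    K_ee = rbf_gram(X_te, X_te)
    K_re = rbf_gram(X_tr, X_te)

    def objective(v):
        beta, alpha = v[:n], v[n:]
        # Squared RKHS norm of (1/t) sum_i alpha_i K(x_te_i) - (1/n) sum_i beta_i K(x_tr_i).
        return (alpha @ K_ee @ alpha / t**2
                - 2 * (beta @ K_re @ alpha) / (n * t)
                + beta @ K_rr @ beta / n**2)

    cons = [{"type": "ineq",
             "fun": lambda v: (1 - 1 / np.sqrt(D)) - ((v[n:] - 1) ** 2).mean()}]
    bounds = [(0.0, B / np.sqrt(D))] * n + [(0.0, 1.0)] * t
    res = minimize(objective, np.ones(n + t), bounds=bounds,
                   constraints=cons, method="SLSQP")
    return res.x[:n], res.x[n:], objective

rng = np.random.default_rng(0)
X_tr = rng.normal(0.0, 1.0, size=(30, 1))
X_te = rng.normal(1.0, 1.0, size=(30, 1))
beta, alpha, objective = dw_kmm(X_tr, X_te)

# Fitted weights should match the kernel means at least as well as uniform weights.
assert objective(np.concatenate([beta, alpha])) <= objective(np.ones(60)) + 1e-9
```

The uniform vector is feasible here, so the fitted weights can only improve the kernel mean match; training points far from the test mass receive small β, and test points far from the training mass receive small α.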
, xn+t as

min_µ −τᵀµ + (1/t) Σ_{i=1}^{t} ϕℓ(µ, xn+i, α(i)) + λᵀ|µ|    (30)

which is an unconstrained convex optimization problem and can be efficiently solved by conventional methods.

Hyperparameters. In principle, both hyperparameters λ and D could be obtained by cross-validation. However, standard cross-validation is not valid under covariate shift (Sugiyama et al., 2007). We hence avoid cross-validation and determine both parameters as follows. As detailed in Section 3.4, the hyperparameter D serves to address the trade-off between error in expectation estimates and confidence of classification rules. For instance, values of D close to 1 can be effective in situations with a large number of samples, while higher values of D can be effective with a reduced number of samples. This is shown by the theoretical results in the paper, since the estimation error in the proposed methods is of order O(1/√(Dn)), as described by the performance bounds in Corollary 3.3 and Theorem 4.1.

In practice, we propose to select the hyperparameter D taking advantage of the minimax risk provided at the learning stage by the presented methods. Specifically, we select the value of D that achieves the lowest minimax risk over a certain range D ≥ 1. Note that, as described in Theorem 3.2, the minimax risk R(U) obtained at learning offers an upper bound for the risk if λ ⪰ |τ − Epte{Φα(x, y)}|, and an approximate upper bound for general λ. Therefore, the proposed selection method uses the value of D that results in the lowest upper bound over a range of values for D. Appendix E further illustrates the adequacy of such an approach in practice.

The second hyperparameter λ is determined by solving

min_{p,λ} 1ᵀλ
s.t. | Σ_{i=1}^{t} Σ_{y ∈ Y} p(y|xn+i)Φα(xn+i, y) − τ | ⪯ λ
     Σ_{y ∈ Y} p(y|xn+i) = 1/t, for i = 1, . . . , t    (31)

which ensures that the uncertainty set used is non-empty.

Complexity and implementation without testing instances. The computational complexity of the proposed methods is similar to that of existing methods for covariate shift adaptation.
Specifically, the step of DW-KMM that obtains the weights has a similar complexity as that of conventional KMM. The main difference is that (25) has t additional variables and t + 1 additional constraints corresponding to the weights α. The step that obtains the classifier parameters by solving the convex optimization problem (11) has the same complexity as that of conventional methods. Finally, the step that determines the hyperparameters not only avoids the usage of cross-validation but can also reduce complexity. In particular, cross-validation with P partitions would require solving (11) P times for each candidate value of the hyperparameters, while the proposed methods only require solving (31) and (11) once for each candidate value of D.

Algorithm 1 details the implementation of DW-GCS in cases where testing instances are available at training. The proposed methods can be implemented with small modifications in cases where only training instances are available and the marginals (or their ratios) are known. In these cases, the weights α(x) and β(x) can be determined using (8) with C = B/√D instead of solving (25), and the optimization (11) can be addressed using the training instances instead of testing instances, making use of equality (3).

## 6. Experiments

This section shows experimental results for the proposed approach in comparison with existing methods on synthetic and real datasets. Reweighted and robust approaches are implemented as in (Sugiyama et al., 2007; Liu & Ziebart, 2014) and described in Appendix A, the flattening method is implemented as in (Shimodaira, 2000), RuLSIF is implemented as in (Yamada et al., 2011), the KMM method is implemented as in (Huang et al., 2006), and the proposed methods are implemented as described in Alg. 1.
The source code for the presented methods is publicly available in the library MRCpy (Bondugula et al., 2023) and the experimental setup at https://github.com/MachineLearningBCAM/MRCs-for-Covariate-Shift-Adaptation. For existing methods, the regularization parameter has been fine-tuned as shown in Appendix E. For the proposed methods, hyperparameters are obtained as described in Section 5. The results in this section are complemented by those in Appendix E, which provides further implementation details and experimental results. In particular, the appendix shows that selecting the hyperparameter D with lowest minimax risk results in performances near those obtained with the best value of D by grid search.

Figure 2. Classification error for different types of covariate shift (parameter δ ∈ {0.05, 0.1, 0.2, 0.35, 0.4, 0.45}; methods: no adaptation, reweighted, robust, and DW-GCS). In the case δ = 0.05, the training support contains that at testing, while in the case δ = 0.45 we have the opposite.

Experiments with synthetic data. In the first set of results we show how the proposed approach can achieve covariate shift adaptation in situations where existing methods are challenged. For such results, the training and testing samples are drawn from the distributions

ptr(x) = (0.5 − δ)N(m1, Σ1) + (0.5 + δ)N(m2, Σ2)
pte(x) = (1 − δ)N(m1, Σ1) + δN(m2, Σ2)    (32)

with m1 = [−3/2, 0]ᵀ, m2 = [3/2, 0]ᵀ, Σ1 = Σ2 = (1/4)I, and labels y = 1 if x(1)x(2) ≥ 0 and y = 2 otherwise. We use values δ ∈ {0.05, 0.1, 0.2, 0.35, 0.4, 0.45} to simulate different relations between the marginals of training and testing instances. We utilize the non-linear feature mapping given by the instances' components and their products, and implement existing and proposed methods using the exact marginals. In addition, for each type of covariate shift (value of δ), we carry out 1,000 random repetitions with 100 training and testing samples.

Results.
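The synthetic covariate shift in (32) can be reproduced with a short sketch (the sampler below is our own illustration of the stated mixtures and label rule):

```python
import numpy as np

def sample_shift(delta, n, rng):
    """Draw n training and n testing instances from the mixtures in (32)."""
    m1, m2 = np.array([-1.5, 0.0]), np.array([1.5, 0.0])
    cov = 0.25 * np.eye(2)

    def mixture(w1):  # w1 = weight of the component centered at m1
        pick = rng.random(n) < w1
        a = rng.multivariate_normal(m1, cov, n)
        b = rng.multivariate_normal(m2, cov, n)
        return np.where(pick[:, None], a, b)

    x_tr = mixture(0.5 - delta)
    x_te = mixture(1.0 - delta)
    # Labels come from the shared conditional: y = 1 iff x(1) x(2) >= 0.
    y_tr = np.where(x_tr[:, 0] * x_tr[:, 1] >= 0, 1, 2)
    return x_tr, y_tr, x_te

rng = np.random.default_rng(0)
x_tr, y_tr, x_te = sample_shift(delta=0.05, n=500, rng=rng)

# For small delta, p_te concentrates near m1, so the test mean shifts left
# while the training instances stay spread over both components.
assert x_te[:, 0].mean() < -1.0
assert set(np.unique(y_tr)) <= {1, 2}
```

For δ near 0.5 the roles reverse, which is how the experiment sweeps from "training support contains testing" to the opposite.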
Figure 2 shows box-plots corresponding to the classification error of existing and proposed approaches, compared with that obtained without covariate shift adaptation (α(x) = β(x) = 1). The results in the figure show how reweighted (resp. robust) methods obtain poor performances in situations where the support of training (resp. testing) instances does not contain that of testing (resp. training) instances. The proposed methods, on the other hand, can leverage the presented double-weighting approach and adapt to more general covariate shifts.

Experiments with real datasets. In the second set of results, we assess the performance of the proposed methods in comparison with existing techniques using real datasets. In particular, reweighted and robust methods are implemented with marginal distributions estimated using log-linear models as shown in (Bickel et al., 2007; 2009). We generate covariate shift in the datasets following (Huang et al., 2006) and (Gretton et al., 2008). In particular, we select training and testing samples with different probabilities based on the medians of each of the first 3 features, and based on the median of the first principal component of the features. In (Huang et al., 2006) and (Gretton et al., 2008), covariate shift is generated with a biased sampling for testing instances, which are drawn with probability δte if the first principal component or feature is larger than a certain value. In those works, the training samples are sampled uniformly, so the generated covariate shifts correspond to situations where the support of training samples contains that of testing samples. In the numerical results of Table 1, we generate more general covariate shifts by using a biased sampling for both training and testing instances (with probabilities δtr = 0.7 and δte = 0.3). These covariate shifts correspond to situations where the supports of training and testing samples have a certain overlap but need not be contained in each other.
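The biased sampling used to generate covariate shift can be sketched as below. The exact acceptance rule in (Huang et al., 2006) may differ; the median thresholding and the rule for assigning leftover samples to the test split are assumptions for illustration:

```python
import numpy as np

def biased_split(x, delta_tr=0.7, delta_te=0.3, rng=None):
    """Select disjoint training/testing index sets with biased probabilities.

    Samples whose first feature exceeds its median are kept for training
    with probability delta_tr (and the rest with 1 - delta_tr); samples not
    selected for training are kept for testing with probability delta_te
    (resp. 1 - delta_te). This creates overlapping but different supports.
    """
    rng = rng or np.random.default_rng()
    above = x[:, 0] > np.median(x[:, 0])
    p_tr = np.where(above, delta_tr, 1 - delta_tr)
    p_te = np.where(above, delta_te, 1 - delta_te)
    tr = rng.random(len(x)) < p_tr
    te = ~tr & (rng.random(len(x)) < p_te)
    return np.flatnonzero(tr), np.flatnonzero(te)
```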
Additionally, we include experimental results using the 20 Newsgroups dataset, which is intrinsically affected by covariate shift since the training and testing partitions correspond to different times (Zhang et al., 2013). We consider the same 5 binary problems used in (Zhang et al., 2013), utilize the 1,000 features with highest Pearson's correlation, and randomly sample 1,000 training and 1,000 testing samples in each repetition.

Results. Table 1 shows the averaged classification error corresponding to the different datasets and covariate shift situations, together with the standard deviations over 100 random partitions, as detailed in Appendix E. The first column of the table describes the different covariate shift datasets generated as described above.

Table 1. Classification errors in 21 scenarios show that the proposed methods can more adequately adapt to general covariate shift. Values in bold show the best classification error in each scenario.

Dataset         Shift          Reweighted  Flattening  RuLSIF     Robust     KMM        DW-GCS 0-1  DW-GCS log
Blood           Feature 1      .55 ± .08   .48 ± .11   .29 ± .04  .34 ± .06  .32 ± .03  .30 ± .03   .31 ± .03
                Feature 2      .39 ± .03   .38 ± .03   .39 ± .03  .40 ± .03  .39 ± .04  .38 ± .05   .38 ± .05
                Feature 3      .43 ± .05   .41 ± .05   .36 ± .04  .39 ± .04  .36 ± .04  .34 ± .03   .35 ± .03
                PCA            .48 ± .05   .48 ± .05   .29 ± .05  .44 ± .05  .30 ± .05  .28 ± .04   .28 ± .04
Breast Cancer   Feature 1      .05 ± .02   .05 ± .03   .05 ± .02  .06 ± .03  .06 ± .02  .04 ± .02   .04 ± .02
                Feature 2      .06 ± .02   .05 ± .02   .06 ± .03  .07 ± .03  .06 ± .03  .04 ± .02   .04 ± .02
                Feature 3      .05 ± .02   .05 ± .02   .05 ± .02  .06 ± .03  .05 ± .02  .04 ± .02   .04 ± .02
                PCA            .03 ± .01   .03 ± .01   .03 ± .01  .03 ± .01  .03 ± .01  .02 ± .01   .02 ± .01
Haberman        Feature 1      .48 ± .07   .47 ± .08   .31 ± .06  .41 ± .09  .34 ± .10  .28 ± .07   .29 ± .06
                Feature 2      .46 ± .08   .44 ± .08   .31 ± .06  .39 ± .08  .36 ± .10  .29 ± .08   .30 ± .07
                Feature 3      .33 ± .05   .33 ± .05   .33 ± .05  .36 ± .06  .42 ± .08  .35 ± .07   .36 ± .06
                PCA            .43 ± .12   .42 ± .12   .29 ± .05  .42 ± .11  .35 ± .08  .30 ± .08   .31 ± .07
Ringnorm        Feature 1      .27 ± .02   .26 ± .02   .25 ± .02  .26 ± .02  .25 ± .02  .25 ± .02   .25 ± .02
                Feature 2      .28 ± .02   .27 ± .02   .25 ± .02  .27 ± .02  .26 ± .02  .25 ± .02   .25 ± .02
                Feature 3      .28 ± .02   .27 ± .02   .25 ± .02  .27 ± .02  .26 ± .03  .25 ± .02   .25 ± .02
                PCA            .32 ± .03   .29 ± .03   .25 ± .02  .26 ± .02  .28 ± .02  .27 ± .02   .26 ± .02
20 Newsgroups   comp vs sci    .41 ± .02   .41 ± .02   .41 ± .02  .42 ± .03  .40 ± .02  .22 ± .02   .22 ± .02
                comp vs talk   .37 ± .03   .37 ± .03   .37 ± .03  .40 ± .05  .34 ± .03  .11 ± .02   .11 ± .02
                rec vs sci     .43 ± .02   .42 ± .02   .42 ± .02  .42 ± .03  .41 ± .02  .17 ± .02   .17 ± .02
                rec vs talk    .40 ± .03   .40 ± .03   .40 ± .03  .41 ± .03  .38 ± .03  .15 ± .02   .15 ± .02
                sci vs talk    .41 ± .03   .41 ± .02   .41 ± .02  .41 ± .04  .39 ± .02  .20 ± .02   .20 ± .02

Overall, the experimental results show that the proposed method provides improved adaptation to general covariate shifts, even in situations where the supports of training and testing samples are not contained in each other. These results agree with the discussion in Sections 2.1 and 3.1 as well as the theoretical results in Corollary 3.3 and Theorem 4.1, which show how the proposed methodology can be effective in situations where existing methods based on single weights are challenged. The improvement obtained by the presented methods can be clearly observed by comparing with the results of the KMM method, since that technique is the most closely related to the proposed one. In particular, the results show that significant performance improvements can be obtained by using a double weighting of both training and testing samples, solving (25), instead of using the existing KMM method (which solves (25) fixing the weights α to one).

7. Conclusion

Existing approaches for covariate shift adaptation use the ratios between marginal distributions to either weight training or testing samples. However, the performance of such approaches can be poor when the marginals' supports are not contained in each other or when the marginals' ratios take large values. This paper proposes a minimax risk classification (MRC) approach for covariate shift adaptation that avoids such limitations by weighting both training and testing samples.
We present effective techniques that obtain both sets of weights, generalizing the conventional kernel mean matching method that only obtains weights for training samples. In addition, we present generalization bounds for the proposed methods that show a significant increase in effective sample size. The unifying approach and the learning methods proposed can enable techniques capable of adapting to more general scenarios affected by covariate shift.

Acknowledgments

Funding in direct support of this work has been provided by projects PID2019-105058GA-I00, CNS2022-135203, and CEX2021-001142-S funded by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR, programmes ELKARTEK and BERC 2022-2025 funded by the Basque Government, the project Early Prognosis of COVID-19 Infections via Machine Learning funded by the AXA Research Fund, and the JHU-Amazon AI2AI Faculty Award.

References

Altun, Y. and Smola, A. Unifying divergence minimization and statistical inference via convex duality. In Proceedings of the 19th Annual Conference on Computational Learning Theory, pp. 139–153, 2006.

Bickel, S., Brückner, M., and Scheffer, T. Discriminative learning for differing training and test distributions. In Proceedings of the 24th International Conference on Machine Learning, pp. 81–88, 2007.

Bickel, S., Brückner, M., and Scheffer, T. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10:2137–2155, 2009.

Bondugula, K., Álvarez, V., Segovia-Martín, J. I., Pérez, A., and Mazuelas, S. MRCpy: A library for minimax risk classifiers. arXiv preprint arXiv:2108.01952, 2023.

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Chen, X., Monfort, M., Liu, A., and Ziebart, B. D. Robust covariate shift regression.
In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 1270–1279, 2016.

Cortes, C. and Mohri, M. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.

Cortes, C., Mohri, M., Riley, M., and Rostamizadeh, A. Sample selection bias correction theory. In Proceedings of the 19th International Conference on Algorithmic Learning Theory, pp. 38–53, 2008.

Cortes, C., Mansour, Y., and Mohri, M. Learning bounds for importance weighting. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems, pp. 442–450, 2010.

Dua, D. and Graff, C. UCI Machine Learning Repository, 2017. URL http://archive.ics.uci.edu/ml.

Dudík, M., Schapire, R. E., and Phillips, S. J. Correcting sample selection bias in maximum entropy density estimation. In Proceedings of the 19th Annual Conference on Neural Information Processing Systems, pp. 323–330, 2005.

Farnia, F. and Tse, D. A minimax approach to supervised learning. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pp. 4240–4248, 2016.

Fathony, R., Liu, A., Asif, K., and Ziebart, B. D. Adversarial multiclass classification: A risk minimization perspective. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pp. 559–567, 2016.

Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Schölkopf, B. Covariate shift by kernel mean matching. In Dataset Shift in Machine Learning, pp. 131–160. MIT Press, 2008.

Huang, J., Smola, A. J., Gretton, A., Borgwardt, K. M., and Schölkopf, B. Correcting sample selection bias by unlabeled data. In Proceedings of the 20th Annual Conference on Neural Information Processing Systems, pp. 601–608, 2006.

Kanamori, T., Hido, S., and Sugiyama, M. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10:1391–1445, 2009.

Lin, Y., Lee, Y., and Wahba, G.
Support vector machines for classification in nonstandard situations. Machine Learning, 46(1):191–202, 2002.

Liu, A. and Ziebart, B. D. Robust classification under sample selection bias. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems, pp. 37–45, 2014.

Liu, A. and Ziebart, B. D. Robust covariate shift prediction with general losses and feature views. arXiv preprint arXiv:1712.10043, 2017.

Liu, S., Yamada, M., Collier, N., and Sugiyama, M. Change-point detection in time-series data by relative density-ratio estimation. Neural Networks, 43:72–83, 2013.

Mazaheri, B., Jain, S., and Bruck, J. Robust correction of sampling bias using cumulative distribution functions. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, pp. 3546–3556, 2020.

Mazuelas, S., Shen, Y., and Pérez, A. Generalized maximum entropy for supervised classification. IEEE Transactions on Information Theory, 68(4):2530–2550, 2022.

Mazuelas, S., Romero, M., and Grünwald, P. Minimax risk classifiers with 0-1 loss. arXiv preprint arXiv:2201.06487, 2023.

Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset Shift in Machine Learning. MIT Press, 2008.

Reddi, S. J., Póczos, B., and Smola, A. Doubly robust covariate shift correction. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pp. 2949–2955, 2015.

Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

Sugiyama, M. and Kawanabe, M. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press, 2012.

Sugiyama, M., Krauledat, M., and Müller, K.-R. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985–1005, 2007.
Tsuboi, Y., Kashima, H., Hido, S., Bickel, S., and Sugiyama, M. Direct density ratio estimation for large-scale covariate shift adaptation. Journal of Information Processing, 17:138–155, 2009.

Wen, J., Yu, C.-N., and Greiner, R. Robust learning under uncertain test distributions: Relating covariate shift to model misspecification. In Proceedings of the 31st International Conference on Machine Learning, pp. 631–639, 2014.

Yamada, M., Suzuki, T., Kanamori, T., Hachiya, H., and Sugiyama, M. Relative density-ratio estimation for robust distribution comparison. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems, pp. 594–602, 2011.

Yu, Y.-L. and Szepesvári, C. Analysis of kernel mean matching under covariate shift. In Proceedings of the 29th International Conference on Machine Learning, pp. 1147–1154, 2012.

Zadrozny, B. Learning and evaluating classifiers under sample selection bias. In Proceedings of the 21st International Conference on Machine Learning, pp. 114, 2004.

Zhang, K., Zheng, V. W., Wang, Q., Kwok, J. T., Yang, Q., and Marsic, I. Covariate shift in Hilbert space: A solution via surrogate kernels. In Proceedings of the 30th International Conference on Machine Learning, pp. 388–395, 2013.

A. Detailed derivations describing existing methods and relation with the proposed framework

The following describes reweighted and robust methods for binary classification with Y = {−1, 1} and log-loss. In particular, we show how, using the specific weights in (3) and (4), such methods can be obtained from Theorem 3.1 in Section 3.2 corresponding to the proposed framework.
Reweighted methods consider classification rules of the form

h(y|x) = 1 / ( 1 + exp{ −y xᵀµ } )    (33)

and learn the parameter µ using the fact that equality (3) in Section 2.1 allows one to estimate expected losses with respect to the test distribution using training samples, since

E_{pte(x,y)} log( 1 + exp{ −y xᵀµ } ) = E_{ptr(x,y)} β(x) log( 1 + exp{ −y xᵀµ } )

for β(x) = pte(x)/ptr(x). Robust methods consider classification rules of the form

h(y|x) = 1 / ( 1 + exp{ −α(x) y xᵀµ } )    (34)

with α(x) = ptr(x)/pte(x). Such methods learn the parameter µ using the fact that equality (4) in Section 2.1 allows one to estimate the expected gradient of losses with respect to the test distribution using training samples, since

E_{pte(x,y)} ∇µ log( 1 + exp{ −(ptr(x)/pte(x)) y xᵀµ } )
 = −E_{pte(x,y)} α(x) y xᵀ / ( 1 + exp{ (ptr(x)/pte(x)) y xᵀµ } )
 = −E_{ptr(x,y)} y xᵀ / ( 1 + exp{ (ptr(x)/pte(x)) y xᵀµ } )

for α(x) = ptr(x)/pte(x).

Regarding their derivation from Theorem 3.1 corresponding to the proposed framework: taking Φ(x, y) = y x/2, the optimization problem in (11) of Theorem 3.1 becomes

min_µ  −(1/n) Σ_{i=1}^{n} β(xi) yi xiᵀµ/2 + E_{pte(x)} log( exp{ α(x) xᵀµ/2 } + exp{ −α(x) xᵀµ/2 } )    (35)

in binary classification with log-loss. If α(x) = 1 and β(x) = pte(x)/ptr(x), the classifier in (15) of Theorem 3.1 coincides with that of reweighted methods in (33). In addition, using (3) in Section 2.1 and approximating the expectation with training samples, the optimization in (35) becomes

min_µ  (1/n) Σ_{i=1}^{n} ( pte(xi)/ptr(xi) ) log( 1 + exp{ −yi xiᵀµ } )    (36)

which coincides with that of reweighted logistic regression (Sugiyama & Kawanabe, 2012). If α(x) = ptr(x)/pte(x) and β(x) = 1, the classifier in (15) of Theorem 3.1 coincides with that of robust methods in (34). In addition, using (4) in Section 2.1, the gradient of the objective function in (35) becomes

−(1/n) Σ_{i=1}^{n} yi xiᵀ/2 + E_{ptr(x)} (xᵀ/2) ( 1 − exp{ −(ptr(x)/pte(x)) xᵀµ } ) / ( 1 + exp{ −(ptr(x)/pte(x)) xᵀµ } )    (37)

which coincides with that shown in equation (7) in (Liu & Ziebart, 2014) for robust methods.

B.
Proofs for Section 3

The proofs of Theorems 3.1 and 3.2 below are done for the case of finite X. The proofs for infinite X can be carried out analogously using Fenchel duality instead of Lagrange duality, similarly to what is done in (Altun & Smola, 2006; Mazuelas et al., 2023).

Proof of Theorem 3.1. Firstly, for each h ∈ T(X, Y), we have that

max_{p∈U} ℓ(h, p) = max_p  lᵀp − I₊(p)
                    s.t.  Σ_{y∈Y} p(x, y) = pte(x), ∀x ∈ X
                          τ − λ ≤ Φαᵀp ≤ τ + λ    (38)

where l, p, and Φα denote the vectors and matrix with rows ℓ(h, (x, y)), p(x, y), and Φα(x, y)ᵀ, respectively, for x ∈ X, y ∈ Y, and I₊(p) = 0 if p ≥ 0 and +∞ otherwise. Optimization problem (38) has Lagrange dual

min_{µ1,µ2,ν(x)}  −(τ − λ)ᵀµ1 + (τ + λ)ᵀµ2 + E_{pte(x)} ν(x) + f*(Φα(µ1 − µ2) − ν)
s.t.  µ1, µ2 ≥ 0

where ν is the vector in R^{|X||Y|} whose component corresponding to (x, y), for x ∈ X, y ∈ Y, is given by ν(x), and f* is the conjugate function of f(p) = −lᵀp + I₊(p), given by

f*(w) = sup_{p≥0} { wᵀp + lᵀp } = 0 if w ≤ −l and +∞ otherwise.

Therefore, the Lagrange dual above becomes

min_{µ1,µ2,ν(x)}  −(τ − λ)ᵀµ1 + (τ + λ)ᵀµ2 + E_{pte(x)} ν(x)
s.t.  µ1, µ2 ≥ 0
      Φα(x, y)ᵀ(µ1 − µ2) − ν(x) ≤ −ℓ(h, (x, y)), ∀x ∈ X, y ∈ Y.

It is easy to see that the solution (µ1, µ2) of such optimization problem satisfies µ1^(i) µ2^(i) = 0 for any i such that λ^(i) > 0. Then λᵀ(µ1 + µ2) = λᵀ|µ1 − µ2| and, taking µ = µ1 − µ2, the Lagrange dual above is equivalent to

min_{µ,ν(x)}  −τᵀµ + λᵀ|µ| + E_{pte(x)} ν(x)
s.t.  Φα(x, y)ᵀµ − ν(x) ≤ −ℓ(h, (x, y)), ∀x ∈ X, y ∈ Y

which has the same value as max_{p∈U} ℓ(h, p) since the constraints in (38) are affine and U is non-empty. Hence,

min_{h∈T(X,Y)} max_{p∈U} ℓ(h, p) = min_{h,µ,ν(x)}  −τᵀµ + λᵀ|µ| + E_{pte(x)} ν(x)
s.t.  Φα(x, y)ᵀµ − ν(x) ≤ −ℓ(h, (x, y)), ∀x ∈ X, y ∈ Y.

For 0-1-loss we have that, for some h ∈ T(X, Y),

Φα(x, y)ᵀµ − ν(x) ≤ −1 + h(y|x), ∀x ∈ X, y ∈ Y
⇔ Σ_{y∈C} ( Φα(x, y)ᵀµ − ν(x) + 1 ) ≤ 1, ∀C ⊆ Y, x ∈ X
⇔ ( Σ_{y∈C} Φα(x, y)ᵀµ + |C| − 1 ) / |C| ≤ ν(x), ∀C ⊆ Y, x ∈ X
⇔ ν(x) ≥ ϕ01(µ, x, α(x)), ∀x ∈ X.
Therefore, for each µ, we have that any classification rule satisfying

h(y|x) ≥ Φα(x, y)ᵀµ − ϕ01(µ, x, α(x)) + 1, ∀x ∈ X, y ∈ Y

is a solution of

min_{h,ν(x)}  E_{pte(x)} ν(x) = E_{pte(x)} ϕ01(µ, x, α(x))
s.t.  Φα(x, y)ᵀµ − ν(x) + 1 ≤ h(y|x), ∀x ∈ X, y ∈ Y

and the result is obtained because, for any x ∈ X, we have that

Σ_{y∈Y} ( Φα(x, y)ᵀµ − ϕ01(µ, x, α(x)) + 1 )₊ ≥ 1

because otherwise there would exist νx < ϕ01(µ, x, α(x)) such that

Σ_{y∈Y} ( Φα(x, y)ᵀµ − νx + 1 )₊ ≤ 1

which contradicts the definition of ϕ01(µ, x, α(x)).

The case of log-loss is analogous to the case of 0-1-loss above, taking into account that, for some h ∈ T(X, Y),

Φα(x, y)ᵀµ − ν(x) ≤ log h(y|x), ∀x ∈ X, y ∈ Y
⇔ Σ_{y∈Y} exp{ Φα(x, y)ᵀµ − ν(x) } ≤ 1, ∀x ∈ X
⇔ ν(x) ≥ log Σ_{y∈Y} exp{ Φα(x, y)ᵀµ } = ϕlog(µ, x, α(x)), ∀x ∈ X.

The lemma below is used in the proof of Theorem 3.2.

Lemma B.1. Let U be the uncertainty set given by (9) for τ, λ ∈ Rᵐ, and let h be a classification rule. If

R01(U, h) = min_µ  −τᵀµ + E_{pte(x)} max_{y∈Y} { 1 + α(x)Φ(x, y)ᵀµ − h(y|x) } + λᵀ|µ|    (39)
Rlog(U, h) = min_µ  −τᵀµ + E_{pte(x)} max_{y∈Y} { α(x)Φ(x, y)ᵀµ − log h(y|x) } + λᵀ|µ|    (40)

then, for any p ∈ U,

ℓ01(h, p) ≤ R01(U, h)    (41)
ℓlog(h, p) ≤ Rlog(U, h).    (42)

Proof of Lemma B.1. The proof is analogous to that of Theorem 5 in (Mazuelas et al., 2023). The case U = ∅ is trivial. For the case U ≠ ∅, we first calculate the Lagrange dual of the optimization problem min_{p̂∈U} E_{p̂} q for a general function q : X × Y → R. Then we consider the fact that, for any p ∈ U and h ∈ T(X, Y),

min_{p̂∈U} ℓ(h, p̂) ≤ ℓ(h, p) ≤ max_{p̂∈U} ℓ(h, p̂)
max_{p̂∈U} ℓ01(h, p̂) = −min_{p̂∈U} E_{p̂} { h(y|x) − 1 }
max_{p̂∈U} ℓlog(h, p̂) = −min_{p̂∈U} E_{p̂} log h(y|x)

for 0-1-loss and log-loss, respectively. First, we have that min_{p̂∈U} E_{p̂} q is equal to

min_{p̂}  qᵀp̂ + I₊(p̂)
s.t.  Σ_{y∈Y} p̂(x, y) = pte(x) for all x ∈ X
      τ − λ ≤ Φαᵀp̂ ≤ τ + λ    (43)

where p̂, q, and Φα denote the vectors and matrix with rows p̂(x, y), q(x, y), and α(x)Φ(x, y)ᵀ, respectively, for x ∈ X, y ∈ Y, and I₊(p̂) = 0 if p̂ ≥ 0 and +∞ otherwise.
Optimization problem (43) has Lagrange dual

max_{µ1,µ2,ν(x)}  (τ − λ)ᵀµ1 − (τ + λ)ᵀµ2 + E_{pte(x)} ν(x) − f*(Φα(µ1 − µ2) + ν)
s.t.  µ1, µ2 ≥ 0

where ν denotes the vector in R^{|X||Y|} whose component corresponding to (x, y), for x ∈ X, y ∈ Y, is given by ν(x), and f* is the conjugate function of f(p̂) = qᵀp̂ + I₊(p̂), which becomes f*(w) = 0 if w ≤ q and +∞ otherwise. Therefore, the previous Lagrange dual becomes

max_{µ1,µ2,ν(x)}  (τ − λ)ᵀµ1 − (τ + λ)ᵀµ2 + E_{pte(x)} ν(x)
s.t.  µ1, µ2 ≥ 0
      Φα(µ1 − µ2) + ν ≤ q

which is equivalent to

max_{µ1,µ2}  (τ − λ)ᵀµ1 − (τ + λ)ᵀµ2 + E_{pte(x)} min_{y∈Y} { q(x, y) − α(x)Φ(x, y)ᵀ(µ1 − µ2) }
s.t.  µ1, µ2 ≥ 0.

Taking µ = µ1 − µ2, the Lagrange dual problem is equivalent to

max_µ  τᵀµ + E_{pte(x)} min_{y∈Y} { q(x, y) − α(x)Φ(x, y)ᵀµ } − λᵀ|µ|

which has the same value as its primal min_{p̂∈U} E_{p̂} q since the constraints defining U are affine and U ≠ ∅. Then, we have that

max_{p̂∈U} ℓ01(h, p̂) = −min_{p̂∈U} E_{p̂} { h(y|x) − 1 } = min_µ  −τᵀµ + E_{pte(x)} max_{y∈Y} { 1 + α(x)Φ(x, y)ᵀµ − h(y|x) } + λᵀ|µ|
max_{p̂∈U} ℓlog(h, p̂) = −min_{p̂∈U} E_{p̂} log h(y|x) = min_µ  −τᵀµ + E_{pte(x)} max_{y∈Y} { α(x)Φ(x, y)ᵀµ − log h(y|x) } + λᵀ|µ|

for 0-1-loss and log-loss, respectively.

Proof of Theorem 3.2. For inequality (19), let U∞ be the uncertainty set given by the exact mean vector τ∞ = E_{pte} Φα(x, y), i.e.,

U∞ = { p ∈ ∆(X × Y) : E_p Φα(x, y) = τ∞ and p(x) = pte(x), ∀x ∈ X }.    (44)

It is clear that pte(x, y) ∈ U∞; then, for 0-1-loss, using Lemma B.1 and the definition of h^U(y|x) in (14), we have that

R(h^U) ≤ R01(U∞, h^U) = min_µ  −τ∞ᵀµ + E_{pte(x)} max_{y∈Y} { 1 + α(x)Φ(x, y)ᵀµ − h^U(y|x) }
 ≤ −τ∞ᵀµ* + E_{pte(x)} max_{y∈Y} { 1 + α(x)Φ(x, y)ᵀµ* − h^U(y|x) }    (45)
 ≤ −τ∞ᵀµ* + E_{pte(x)} max_{y∈Y} ϕ01(µ*, x, α(x))    (46)
 = −τ∞ᵀµ* + E_{pte(x)} ϕ01(µ*, x, α(x))    (47)
 = R(U) + (τ − τ∞)ᵀµ* − λᵀ|µ*|    (48)

where, for inequality (45)-(46), we have used the fact that h^U(y|x) ≥ α(x)Φ(x, y)ᵀµ* − ϕ01(µ*, x, α(x)) + 1, and for equality (47)-(48) we have added and subtracted τᵀµ* and λᵀ|µ*| and used the definition of R(U) in (16).
For log-loss, using Lemma B.1 and the definition of h^U(y|x) in (15), we have that

R(h^U) ≤ Rlog(U∞, h^U) = min_µ  −τ∞ᵀµ + E_{pte(x)} max_{y∈Y} { α(x)Φ(x, y)ᵀµ − log h^U(y|x) }
 ≤ −τ∞ᵀµ* + E_{pte(x)} max_{y∈Y} { α(x)Φ(x, y)ᵀµ* − log h^U(y|x) }    (49)
 = −τ∞ᵀµ* + E_{pte(x)} max_{y∈Y} ϕlog(µ*, x, α(x))    (50)
 = −τ∞ᵀµ* + E_{pte(x)} ϕlog(µ*, x, α(x))    (51)
 = R(U) + (τ − τ∞)ᵀµ* − λᵀ|µ*|    (52)

where, for equality (49)-(50), we have used the fact that log h^U(y|x) = α(x)Φ(x, y)ᵀµ* − ϕlog(µ*, x, α(x)), and for equality (51)-(52) we have added and subtracted τᵀµ* and λᵀ|µ*| and used the definition of R(U) in (16).

For inequality (20), note that, using the definition of µ∞ and (47) (resp. (51)) for 0-1-loss (resp. log-loss), we have that

R(h^U) ≤ −τ∞ᵀµ* + E_{pte(x)} ϕℓ(µ*, x, α(x))
 = −τᵀµ* + E_{pte(x)} ϕℓ(µ*, x, α(x)) + λᵀ|µ*| + (τ − τ∞)ᵀµ* − λᵀ|µ*|
 ≤ −τᵀµ∞ + E_{pte(x)} ϕℓ(µ∞, x, α(x)) + λᵀ|µ∞| + (τ − τ∞)ᵀµ* − λᵀ|µ*|
 = R∞ + λᵀ( |µ∞| − |µ*| ) + (τ − τ∞)ᵀµ* + (τ∞ − τ)ᵀµ∞
 ≤ R∞ + λᵀ( |µ∞| − |µ*| ) + |τ − τ∞|ᵀ|µ* − µ∞|

where the second inequality follows since µ* minimizes −τᵀµ + E_{pte(x)} ϕℓ(µ, x, α(x)) + λᵀ|µ|.

C. Proofs for Section 4

Proof of Theorem 4.1. The proof is analogous to Example 6.3 in (Boucheron et al., 2013), which shows a Hoeffding-type inequality in Hilbert space. We consider n + t independent random variables taking values in the Hilbert space H given by

f_i = (1/n) β̂(x_i) K(x_i)      for i = 1, 2, . . . , n
f_i = −(1/t) α̂(x_i) K(x_i)     for i = n + 1, n + 2, . . . , n + t    (53)

and we want to bound || Σ_{i=1}^{n+t} f_i ||_H. We have that

||f_i||_H ≤ (B/(n√D)) κ    for i = 1, 2, . . . , n
||f_i||_H ≤ (1/t) κ        for i = n + 1, n + 2, . . . , n + t.    (54)

Taking v = κ²( B²/(Dn) + 1/t ) and using the bounded differences inequality (Theorem 6.2 in (Boucheron et al., 2013)), we have that, for all l > 0,

P{ || Σ_{i=1}^{n+t} f_i ||_H ≥ E|| Σ_{i=1}^{n+t} f_i ||_H + l } ≤ exp( −l²/(2v) ).

Finally, using Hölder's inequality and by independence, we have that

( E|| Σ_{i=1}^{n+t} f_i ||_H )² ≤ Σ_{i=1}^{n+t} E ||f_i||²_H ≤ v

so that

|| (1/n) Σ_{i=1}^{n} β̂(x_i) K(x_i) − (1/t) Σ_{i=1}^{t} α̂(x_{n+i}) K(x_{n+i}) ||_H ≤ √v ( 1 + √(2 log(1/δ)) )

with probability at least 1 − δ.

D. Quadratic version of DW-KMM

The convex optimization in (25) is a quadratic problem since the squared norm in H can be written as

|| (1/t) Σ_{i=1}^{t} α^(i) K(x_{n+i}) − (1/n) Σ_{i=1}^{n} β^(i) K(x_i) ||²_H
 = (1/t²) Σ_{i,j=1}^{t} α^(i) α^(j) k(x_{n+i}, x_{n+j}) + (1/n²) Σ_{i,j=1}^{n} β^(i) β^(j) k(x_i, x_j) − (2/(nt)) Σ_{i=1}^{t} Σ_{j=1}^{n} α^(i) β^(j) k(x_{n+i}, x_j)
 = [βᵀ/n, −αᵀ/t] K [β/n; −α/t]

where K is the kernel matrix over the n + t training and testing instances, given by K^(i,j) = k(x_i, x_j) for i, j = 1, 2, . . . , n + t. Therefore, the optimization problem (25) is equivalent to the quadratic optimization problem

min_{α,β}  [βᵀ/n, −αᵀ/t] K [β/n; −α/t]
s.t.  0 ≤ β ≤ (B/√D) 1,  0 ≤ α ≤ 1
      | βᵀ1/n − αᵀ1/t | ≤ ε
      (1/√t) ||α − 1|| ≤ 1 − 1/√D.    (56)

E. Implementation details and additional experimental results

This appendix details the datasets and settings used for the experiments in Section 6 and shows additional experiments. For the experiments in Section 6, we have considered four binary classification datasets, available in the UCI repository (Dua & Graff, 2017) and previously used in multiple papers on covariate shift adaptation (Gretton et al., 2008; Huang et al., 2006; Kanamori et al., 2009; Mazaheri et al., 2020; Wen et al., 2014). In addition, we use the 20 Newsgroups dataset, which is intrinsically affected by covariate shift (Zhang et al., 2013). Table 2 details the characteristics of the datasets used in the experiments. The table also shows the parameter σ used in the computation of the kernel matrix K for the RuLSIF, KMM, and DW-KMM methods, which is determined using the common heuristic based on nearest neighbors with K = 50, as is done in (Wen et al., 2014). For the results obtained using the flattening method in (Shimodaira, 2000) and the RuLSIF method in (Yamada et al., 2011) we considered the hyperparameter γ = 0.5, which is the default value used in those papers.

Table 2. Datasets used in the experiments.
Dataset          Covariates  Samples       Ratio of majority class  σ
Blood            3           748           76.20%                   0.7491
Breast Cancer    9           683           65.01%                   1.6064
Haberman         3           306           75.53%                   1.3024
Ringnorm         20          7400          50.49%                   3.8299
comp vs sci      1000        5309 / 3534   55.31%                   23.5628
comp vs talk     1000        4888 / 3256   60.06%                   23.4890
rec vs sci       1000        4762 / 3169   50.17%                   24.5642
rec vs talk      1000        4341 / 2891   55.02%                   25.1129
sci vs talk      1000        4325 / 2880   54.85%                   24.8320

In the additional experiments we study the effectiveness of the proposed selection method for the hyperparameter D. Specifically, Tables 3 and 4 show the average classification error for varying values of D on the datasets and covariate shifts shown in Table 1. The first column of these tables shows the classification error obtained when selecting D with the proposed method that minimizes the minimax risk, while the other columns show the classification error obtained using specific values of D. The values of the hyperparameter D have been chosen based on the last inequality in the optimization problem (25). Specifically, we take the values of D such that 1 − 1/√D ∈ {0, 0.1, . . . , 0.9}. As can be seen from the tables, the proposed selection method results in performances near those obtained with the best values of D.

Table 3. Classification error in 21 scenarios using DW-GCS methods with 0-1-loss, varying the value of the hyperparameter D.
Dataset         Shift          Proposed  D=1   D=1.2  D=1.6  D=2   D=2.8  D=4   D=6.3  D=11  D=25  D=100
Blood           Feature 1      0.30      0.32  0.31   0.31   0.31  0.30   0.30  0.30   0.29  0.29  0.30
                Feature 2      0.38      0.40  0.40   0.40   0.40  0.39   0.39  0.38   0.38  0.37  0.38
                Feature 3      0.34      0.39  0.37   0.36   0.36  0.35   0.34  0.34   0.34  0.33  0.34
                PCA            0.28      0.32  0.30   0.29   0.28  0.27   0.27  0.28   0.28  0.28  0.28
Breast Cancer   Feature 1      0.04      0.05  0.05   0.04   0.04  0.04   0.04  0.05   0.05  0.04  0.04
                Feature 2      0.04      0.06  0.06   0.05   0.05  0.05   0.06  0.06   0.06  0.04  0.04
                Feature 3      0.04      0.06  0.05   0.05   0.05  0.06   0.05  0.06   0.05  0.04  0.04
                PCA            0.02      0.03  0.03   0.02   0.02  0.03   0.03  0.03   0.02  0.02  0.02
Haberman        Feature 1      0.28      0.39  0.36   0.33   0.30  0.29   0.28  0.28   0.28  0.28  0.28
                Feature 2      0.29      0.39  0.37   0.35   0.33  0.32   0.30  0.30   0.30  0.30  0.29
                Feature 3      0.35      0.46  0.45   0.43   0.40  0.37   0.35  0.35   0.34  0.34  0.34
                PCA            0.30      0.40  0.38   0.35   0.32  0.30   0.29  0.29   0.29  0.30  0.29
Ringnorm        Feature 1      0.25      0.25  0.25   0.25   0.25  0.25   0.25  0.25   0.25  0.25  0.25
                Feature 2      0.25      0.25  0.26   0.25   0.25  0.25   0.25  0.25   0.25  0.25  0.25
                Feature 3      0.25      0.26  0.26   0.26   0.26  0.26   0.26  0.26   0.26  0.25  0.25
                PCA            0.27      0.30  0.29   0.29   0.28  0.28   0.27  0.27   0.26  0.27  0.27
20 Newsgroups   comp vs sci    0.22      0.25  0.24   0.23   0.22  0.21   0.21  0.21   0.21  0.21  0.21
                comp vs talk   0.11      0.17  0.16   0.14   0.12  0.11   0.10  0.10   0.10  0.11  0.11
                rec vs sci     0.17      0.19  0.18   0.18   0.17  0.17   0.16  0.16   0.16  0.16  0.16
                rec vs talk    0.15      0.18  0.17   0.16   0.15  0.15   0.14  0.14   0.14  0.14  0.14
                sci vs talk    0.20      0.22  0.21   0.20   0.19  0.19   0.18  0.18   0.18  0.19  0.19

Table 4. Classification error in 21 scenarios using DW-GCS methods with log-loss, varying the value of the hyperparameter D.
Dataset         Shift          Proposed  D=1   D=1.2  D=1.6  D=2   D=2.8  D=4   D=6.3  D=11  D=25  D=100
Blood           Feature 1      0.31      0.32  0.32   0.32   0.31  0.30   0.30  0.30   0.29  0.29  0.30
                Feature 2      0.38      0.41  0.40   0.40   0.40  0.39   0.38  0.38   0.38  0.38  0.38
                Feature 3      0.35      0.38  0.38   0.37   0.36  0.36   0.35  0.34   0.34  0.34  0.34
                PCA            0.28      0.32  0.30   0.29   0.29  0.28   0.28  0.28   0.28  0.28  0.28
Breast Cancer   Feature 1      0.04      0.05  0.05   0.05   0.05  0.04   0.04  0.05   0.05  0.04  0.04
                Feature 2      0.04      0.06  0.06   0.06   0.06  0.06   0.06  0.06   0.06  0.04  0.04
                Feature 3      0.04      0.06  0.06   0.06   0.06  0.06   0.05  0.06   0.05  0.04  0.04
                PCA            0.02      0.03  0.03   0.03   0.03  0.03   0.03  0.03   0.02  0.02  0.02
Haberman        Feature 1      0.29      0.38  0.36   0.34   0.31  0.30   0.29  0.29   0.29  0.29  0.28
                Feature 2      0.30      0.39  0.37   0.35   0.33  0.32   0.31  0.31   0.31  0.31  0.30
                Feature 3      0.36      0.46  0.44   0.43   0.40  0.37   0.36  0.35   0.35  0.35  0.34
                PCA            0.31      0.39  0.38   0.36   0.33  0.31   0.30  0.30   0.30  0.31  0.30
Ringnorm        Feature 1      0.25      0.25  0.25   0.25   0.25  0.25   0.25  0.25   0.25  0.25  0.25
                Feature 2      0.25      0.25  0.25   0.25   0.25  0.25   0.25  0.25   0.25  0.24  0.25
                Feature 3      0.25      0.26  0.26   0.26   0.25  0.26   0.25  0.25   0.25  0.25  0.25
                PCA            0.26      0.30  0.29   0.29   0.28  0.27   0.27  0.26   0.26  0.26  0.27
20 Newsgroups   comp vs sci    0.22      0.25  0.24   0.23   0.22  0.21   0.21  0.21   0.21  0.21  0.21
                comp vs talk   0.11      0.17  0.16   0.14   0.12  0.11   0.10  0.10   0.10  0.11  0.11
                rec vs sci     0.17      0.19  0.18   0.18   0.17  0.17   0.16  0.16   0.16  0.16  0.16
                rec vs talk    0.15      0.18  0.17   0.16   0.15  0.15   0.14  0.14   0.14  0.14  0.14
                sci vs talk    0.20      0.22  0.21   0.20   0.19  0.19   0.18  0.18   0.18  0.19  0.19
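As a closing illustration, the quadratic objective of DW-KMM derived in Appendix D can be evaluated directly. The sketch below is a minimal illustration assuming a Gaussian kernel (the kernel choice and helper name are not from the paper); it computes [βᵀ/n, −αᵀ/t] K [β/n; −α/t] over the stacked training and testing instances:

```python
import numpy as np

def dw_kmm_objective(xtr, xte, beta, alpha, sigma=1.0):
    """Evaluate the DW-KMM quadratic objective with a Gaussian kernel.

    Builds the (n+t) x (n+t) kernel matrix K over stacked instances and
    returns w^T K w for w = [beta/n; -alpha/t], which equals the squared
    RKHS norm of the weighted mean-embedding difference.
    """
    x = np.vstack([xtr, xte])
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    K = np.exp(-sq / (2 * sigma ** 2))                   # Gaussian kernel matrix
    n, t = len(xtr), len(xte)
    w = np.concatenate([beta / n, -alpha / t])
    return w @ K @ w
```

Minimizing this objective over α and β subject to the constraints in (56) can then be delegated to any quadratic programming solver.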