# Double-Weighting for Covariate Shift Adaptation

José I. Segovia-Martín¹, Santiago Mazuelas¹ ², Anqi Liu³

**Abstract.** Supervised learning is often affected by a covariate shift in which the marginal distributions of instances (covariates x) of training and testing samples ptr(x) and pte(x) are different but the label conditionals coincide. Existing approaches address such covariate shift by either using the ratio pte(x)/ptr(x) to weight training samples (reweighted methods) or using the ratio ptr(x)/pte(x) to weight testing samples (robust methods). However, the performance of such approaches can be poor under support mismatch or when the above ratios take large values. We propose a minimax risk classification (MRC) approach for covariate shift adaptation that avoids such limitations by weighting both training and testing samples. In addition, we develop effective techniques that obtain both sets of weights and generalize the conventional kernel mean matching method. We provide novel generalization bounds for our method that show a significant increase in the effective sample size compared with reweighted methods. The proposed method also achieves enhanced classification performance in both synthetic and empirical experiments.

## 1. Introduction

Most supervised learning methods assume that training and testing samples are drawn i.i.d. from the same underlying distribution. However, practical scenarios are often affected by a covariate shift in which the marginal distributions of instances (covariates x) of training and testing samples ptr(x) and pte(x) are different (see e.g., (Sugiyama & Kawanabe, 2012; Quiñonero-Candela et al.,

¹Basque Center for Applied Mathematics (BCAM), Bilbao, Spain. ²IKERBASQUE, Basque Foundation for Science. ³CS Department, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland, USA. Correspondence to: José I. Segovia-Martín, Santiago Mazuelas, Anqi Liu.
Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

2008)), while the conditional label distribution stays the same. In such scenarios, conventional supervised classification methods, such as empirical risk minimization, can perform poorly because the empirical risk approximates the training expected risk rather than the test expected risk.

Most of the existing methods for covariate shift adaptation are based on the reweighted approach (Sugiyama & Kawanabe, 2012; Quiñonero-Candela et al., 2008; Cortes et al., 2008; Huang et al., 2006). These methods weight loss functions at training using the ratio pte(x)/ptr(x), so that training samples that are more likely under the test distribution are assigned higher weights (see Fig. 1), increasing their relevance at training. Such ratios can be estimated using training and testing instances (Tsuboi et al., 2009; Yamada et al., 2011; Liu et al., 2013). Reweighted methods are designed for situations where the support of ptr contains that of pte. However, even if such condition is satisfied, reweighted methods may achieve poor performance if the ratio pte(x)/ptr(x) takes large values at certain training samples, leading to inaccurate estimates of expected losses (see e.g., (Cortes & Mohri, 2014; Reddi et al., 2015)). Such problems can be alleviated by flattening the above ratio (Shimodaira, 2000; Yamada et al., 2011), by utilizing a regularization term based on the unweighted solution (Reddi et al., 2015), or by directly estimating weights for training samples through kernel mean matching (KMM) methods (Gretton et al., 2008; Huang et al., 2006).
Robust methods for covariate shift adaptation (Liu & Ziebart, 2014; 2017; Chen et al., 2016) are derived from a distributionally robust learning framework, where the feature expectation matching constraints are obtained from training samples but the adversarial risk is defined on the test distribution. Such methods weight feature functions at testing using the ratio ptr(x)/pte(x) (see Fig. 1). The resulting parametric form produces less confident predictions when testing samples are less likely under the training distribution. Robust methods are designed for situations where the support of pte contains that of ptr. However, even if such condition is satisfied, robust methods may achieve poor performance if the ratio ptr(x)/pte(x) takes large values at certain testing samples, leading to overconfident classification rules.

Figure 1. Different approaches for covariate shift adaptation (training and testing instances follow Gaussian distributions with probability mass concentrated in the red and blue circles, resp.). Reweighted methods weight training instances x using the ratio β(x) = pte(x)/ptr(x), while robust methods weight testing instances x using the ratio α(x) = ptr(x)/pte(x). The proposed approach utilizes weights both for training and testing instances and can avoid large weights β(x) by reducing the corresponding α(x), and avoid large weights α(x) by reducing the corresponding β(x).

In practice, the distributions of training and testing instances can differ in an arbitrary manner (e.g., their supports may not be contained in each other). This paper proposes a learning methodology that can tackle such general covariate shift and addresses the limitations of existing approaches by weighting both training and testing samples (see Fig. 1).
In particular, the methods proposed are based on minimax risk classifiers (MRCs) (Mazuelas et al., 2022; 2023) and utilize weighted averages of training samples to estimate expectations of weighted feature functions under the test distribution. Specifically, the main contributions of the paper are as follows.

- We present a learning framework for general covariate shift adaptation based on a double-weighting of both training and testing samples. Our framework encompasses existing approaches for specific choices of weights.
- We propose effective techniques that obtain weights for training and testing samples, and generalize the conventional KMM that only obtains weights for training samples.
- We develop generalization bounds for the proposed methods that show a significant increase in effective sample size compared with reweighted approaches.
- We experimentally assess the performance improvement obtained by the proposed techniques in multiple covariate shift scenarios.

Notations. Calligraphic upper case letters represent sets; bold lower and upper case letters represent vectors and matrices, respectively; for a vector v, v(i) denotes its i-th component, |v| denotes its component-wise absolute value, and (v)₊ denotes its component-wise positive part; 1 denotes a vector with all components equal to 1; ‖·‖₁, ‖·‖∞, and ‖·‖H denote the 1-norm, the infinity norm, and the Hilbert space norm of the argument, respectively; ⪯ and ⪰ represent vector component-wise inequalities; N(m, Σ) denotes the pdf of a Gaussian r.v. x with mean m and covariance matrix Σ; and Ep{·} denotes the expectation of its argument w.r.t. distribution p.

## 2. Preliminaries

This section describes the learning setup, the two main existing approaches for covariate shift adaptation, and the framework of MRCs.

Setup. Let X be the set of instances and Y the set of labels, represented by the set {1, . . . , |Y|}. We denote by Δ(X × Y) the set of probability distributions over X × Y, and by T(X, Y) the set of classification rules.
For h ∈ T(X, Y), we denote by h(y|x) the probability with which instance x ∈ X is classified with label y ∈ Y. We use the notation pte for the underlying distribution at test time, and (x1, y1), (x2, y2), . . . , (xn, yn) for the set of training samples. The ℓ-risk of a classification rule h is its expected classification loss with respect to the true underlying distribution at test pte, i.e., R(h) = Epte{ℓ(h, (x, y))}. The learning objective is to use the training samples to find a classification rule h that has small ℓ-risk R(h). In this paper, we consider the 0-1-loss and the log-loss:

ℓ01(h, (x, y)) = 1 − h(y|x)    (1)
ℓlog(h, (x, y)) = −log h(y|x).    (2)

Covariate shift. Under covariate shift, the training samples (x1, y1), (x2, y2), . . . , (xn, yn) follow a distribution ptr(x, y) such that the marginal distributions of instances differ, pte(x) ≠ ptr(x), but the label conditionals coincide, ptr(y|x) = pte(y|x). In addition, covariate shift methods assume that the ratio between ptr(x) and pte(x) is known or that t unsupervised samples xn+1, xn+2, . . . , xn+t from pte(x) are available at training. In previous literature, it is also usually assumed that the training support contains that at testing or vice versa. In this paper, we consider general scenarios of covariate shift in which such supports are not required to contain each other.

### 2.1. Main existing approaches

Reweighted methods. Most of the techniques for covariate shift adaptation are based on the reweighted approach (Sugiyama & Kawanabe, 2012; Shimodaira, 2000; Zadrozny, 2004; Cortes et al., 2008; Dudík et al., 2005; Lin et al., 2002). These methods exploit the fact that, for any function f, we have

Epte{f(x, y)} = Eptr{β(x)f(x, y)}, for β(x) = pte(x)/ptr(x),    (3)

which is valid whenever pte(x) > 0 ⇒ ptr(x) > 0. Reweighted methods weight loss functions at training by means of the weight function β(x) in (3), as detailed in Appendix A.
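As a quick sanity check of the identity in (3), the following sketch (not from the paper; the Gaussian marginals and the test function f are illustrative assumptions) verifies by Monte Carlo that reweighting training samples by β(x) recovers a test expectation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed 1-D marginals for illustration: p_tr = N(0, 1), p_te = N(0.5, 1).
x_tr = rng.normal(0.0, 1.0, size=200_000)

# Density ratio beta(x) = p_te(x) / p_tr(x) = exp(0.5 x - 0.125) in closed form.
beta = np.exp(0.5 * x_tr - 0.125)

# Reweighted training average of f(x) = x^2 approximates E_pte[x^2] = 1 + 0.5^2.
est_reweighted = np.mean(beta * x_tr**2)
assert abs(est_reweighted - 1.25) < 0.05
```

The same mechanism underlies reweighted empirical risk minimization: the loss plays the role of f above.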
Using these weights, such methods can account for the fact that some training instances are unlikely at testing, and assign low relevance to such instances at training (see Fig. 1). Reweighted methods assume the support of ptr contains that of pte (i.e., pte(x) > 0 ⇒ ptr(x) > 0) so that (3) is valid. Even if this condition is satisfied, such methods may achieve poor performance if the ratio β(x) in (3) takes large values at certain training samples. In these cases, the learning process is dominated by a few training samples (see e.g., (Cortes & Mohri, 2014; Gretton et al., 2008)). The flattening approach alleviates such problems using smoothed weights for training samples, determined by a hyperparameter γ as (pte(x)/ptr(x))^γ in (Shimodaira, 2000) and as pte(x)/(γpte(x) + (1 − γ)ptr(x)) in (Yamada et al., 2011).

Robust methods. Robust methods under covariate shift (Liu & Ziebart, 2014; 2017) exploit the fact that, for any function f,

Epte{α(x)f(x, y)} = Eptr{f(x, y)}, for α(x) = ptr(x)/pte(x),    (4)

which is valid whenever ptr(x) > 0 ⇒ pte(x) > 0. Robust methods weight feature functions at testing¹ by means of the weight function α(x) in (4), as detailed in Appendix A. Using these weights, such methods can account for the fact that some testing instances are unlikely at training, and consider rules that assign low-confidence predictions to such instances (see Fig. 1). Robust methods assume the support of pte contains that of ptr (i.e., ptr(x) > 0 ⇒ pte(x) > 0) so that (4) is valid. Even if this condition is satisfied, such methods may achieve poor performance if the ratio α(x) in (4) takes large values at certain testing samples. In these cases, the classification rule would only provide confident predictions at a few testing samples.

¹The robust bias-aware prediction weights the feature functions both at training and testing, as the weight appears in the predictive parametric form. Here we emphasize the weights at testing to show a symmetric view between the two methods.
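The symmetric identity (4) admits the same kind of Monte Carlo check; in this sketch (with the same kind of assumed Gaussian marginals, not from the paper) the α-weighted test expectation matches the unweighted training expectation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed marginals for illustration: p_tr = N(0, 1), p_te = N(0.5, 1).
x_te = rng.normal(0.5, 1.0, size=200_000)

# Density ratio alpha(x) = p_tr(x) / p_te(x) = exp(-0.5 x + 0.125) in closed form.
alpha = np.exp(-0.5 * x_te + 0.125)

# E_pte[alpha(x) f(x)] matches E_ptr[f(x)] = 1, the second moment of N(0, 1).
est_robust = np.mean(alpha * x_te**2)
assert abs(est_robust - 1.0) < 0.05
```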
The connection and symmetric relation between reweighted and robust methods is also shown in Theorem 3 of (Liu & Ziebart, 2014). They can both be regarded as special cases of the adversarial risk minimization framework (Fathony et al., 2016) when the feature expectation matching constraints are set to match different empirical estimates of the features. To enable covariate shift adaptation in general cases (e.g., training and testing supports not contained in each other), this paper proposes a learning framework that avoids the limitations of existing methods by utilizing a double-weighting approach.

### 2.2. Minimax Risk Classifiers

Similarly to other approaches based on robust risk minimization (RRM) (Farnia & Tse, 2016; Fathony et al., 2016), MRC methods (Mazuelas et al., 2022; 2023) do not require that the training samples follow the same distribution as the testing samples. MRCs minimize the worst-case expected loss with respect to distributions in uncertainty sets that can contain the true underlying distribution with high probability. The uncertainty sets are given by constraints on the expectation of a function Φ : X × Y → R^m referred to as the feature mapping. Such a function can be defined using one-hot encodings of the elements of Y as Φ(x, y) = e_y ⊗ x, where e_y is the y-th element of the canonical basis of R^|Y| and ⊗ denotes the Kronecker product. Given the uncertainty set U, we say that a classification rule h^U is an ℓ-MRC for U if

h^U ∈ arg min_{h ∈ T(X,Y)} max_{p ∈ U} ℓ(h, p)    (5)

and we denote by R(U) the minimax risk against U, i.e.,

R(U) = min_{h ∈ T(X,Y)} max_{p ∈ U} ℓ(h, p)    (6)

where ℓ(h, p) denotes the expected loss of classification rule h w.r.t. distribution p.

## 3. Framework for Adaptation to General Covariate Shift

This section first describes the proposed double-weighting approach and the corresponding MRC learning methodology.
We then describe its relationship with existing techniques, present finite-sample generalization bounds for the proposed methods, and discuss the trade-off involved in the choice of weight functions.

### 3.1. Double-weighting

The proposed framework considers both training and testing weights, β(x) and α(x) (see Fig. 1). We exploit the fact that, for any function f, the equality

Epte{α(x)f(x, y)} = Eptr{β(x)f(x, y)}    (7)

can be attained by multiple choices of weights α(x) and β(x). For instance, it is satisfied taking

α(x) = min{C ptr(x)/pte(x), 1},  β(x) = min{pte(x)/ptr(x), C}    (8)

for any C > 0, since α(x)pte(x) = β(x)ptr(x) for every x ∈ X. Notice that the equality in (7) is satisfied taking weights as in (8) even if the supports of ptr and pte do not contain each other.

Such a double-weighting approach can avoid the limitations of reweighted and robust methods. For x ∈ X with large ratio pte(x)/ptr(x), using a small α(x) makes it possible to have α(x)pte(x) = β(x)ptr(x) with moderate values of β(x). Reciprocally, for x ∈ X with large ratio ptr(x)/pte(x), using a small β(x) makes it possible to have α(x)pte(x) = β(x)ptr(x) with moderate values of α(x). For instance, using weights as in (8) we have that β(x) ≤ C and α(x) ≤ 1 for any x ∈ X. Considering both weights β(x) and α(x), we can both assign low relevance to training instances that are unlikely at testing, and assign low-confidence predictions to testing instances that are unlikely at training.

### 3.2. MRC learning framework using double-weighting

The proposed framework adapts to general covariate shift by constructing the uncertainty set U in (5) using both weights α(x) and β(x). In particular, we use feature mappings weighted by α(x) as Φα(x, y) = α(x)Φ(x, y) and constrain the difference between the expectation and empirical mean estimates of feature mappings as follows:

U = {p ∈ Δ(X × Y) : |Ep{Φα(x, y)} − τ| ⪯ λ and p(x) = pte(x), ∀x ∈ X}    (9)

where τ denotes the mean vector of expectation estimates, and λ is a vector that determines the confidence with which pte(x, y) ∈ U.
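The weight pair in (8) can be sketched directly. In the snippet below (the density arrays are placeholders; `C` is the bound on β from the text), the key identity α(x)pte(x) = β(x)ptr(x) holds pointwise while both weights stay bounded:

```python
import numpy as np

def double_weights(p_tr_x, p_te_x, C):
    """Weights of (8): alpha(x) = min{C p_tr(x)/p_te(x), 1} and
    beta(x) = min{p_te(x)/p_tr(x), C}. Assumes strictly positive densities."""
    alpha = np.minimum(C * p_tr_x / p_te_x, 1.0)
    beta = np.minimum(p_te_x / p_tr_x, C)
    return alpha, beta

rng = np.random.default_rng(0)
# Placeholder marginal densities evaluated at 1,000 instances.
p_tr_x = rng.uniform(0.01, 1.0, size=1000)
p_te_x = rng.uniform(0.01, 1.0, size=1000)
alpha, beta = double_weights(p_tr_x, p_te_x, C=2.0)

# The double-weighting identity and the bounds alpha <= 1, beta <= C.
assert np.allclose(alpha * p_te_x, beta * p_tr_x)
assert alpha.max() <= 1.0 and beta.max() <= 2.0
```

Either the first or the second minimum is active at each x, which is exactly how a large density ratio on one side is traded for a small weight on the other.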
The expectation of the feature mapping Φα(x, y) is estimated using averages of training samples weighted by β(x) as

τ = (1/n) Σ_{i=1}^{n} Φβ(xi, yi), for Φβ(x, y) = β(x)Φ(x, y).    (10)

Notice that the mean vector τ is an unbiased estimator of Epte{Φα(x, y)} for any choice of weights satisfying α(x)pte(x) = β(x)ptr(x), in particular for those given by (8). In addition, the accuracy of τ can be improved using weights α(x) that avoid large weights β(x), as discussed in Section 3.4 below.

Convex optimization. We next show how MRCs corresponding with uncertainty sets (9) can be learned by solving the convex optimization problem

min_µ −τᵀµ + Epte(x){ϕℓ(µ, x, α(x))} + λᵀ|µ|    (11)

where ϕℓ is a function defined according to the loss. For the 0-1-loss, we have

ϕ01(µ, x, α(x)) = 1 + max_{C ⊆ Y, C ≠ ∅} ( Σ_{y ∈ C} Φα(x, y)ᵀµ − 1 ) / |C|    (12)

and, for the log-loss, we have

ϕlog(µ, x, α(x)) = log Σ_{y ∈ Y} exp( Φα(x, y)ᵀµ ).    (13)

Theorem 3.1. Let τ, λ ∈ R^m be such that the uncertainty set U in (9) is not empty. If µ* is a solution of (11) for the 0-1-loss, the classification rule

h^U(y|x) = ( α(x)Φ(x, y)ᵀµ* − ϕ01(µ*, x, α(x)) + 1 )₊    (14)

is a 0-1-MRC for U. If µ* is a solution of (11) for the log-loss, the classification rule

h^U(y|x) = exp( α(x)Φ(x, y)ᵀµ* − ϕlog(µ*, x, α(x)) )    (15)

is a log-MRC for U. In addition, the minimax risk R(U) is given by

R(U) = −τᵀµ* + Epte(x){ϕℓ(µ*, x, α(x))} + λᵀ|µ*|.    (16)

Proof. See Appendix B.

Remarks. The optimization in (11) can be addressed in practice using conventional optimization methods such as stochastic gradient descent. If unlabeled instances from the test distribution are available at training, they can directly be used to obtain samples corresponding to the (sub)gradient of Epte(x){ϕℓ(µ, x, α(x))}, since the function ϕℓ does not depend on labels. If the marginals ptr(x) and pte(x) are known, training samples can be used to obtain samples of the above gradient using (3).
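For intuition, the following sketch (not the MRCpy implementation; the toy data, the feature map, λ = 0, and α = β = 1 — i.e., no shift — are simplifying assumptions) minimizes an empirical version of (11) for the log-loss with `scipy` and predicts through the deterministic rule in (17):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Toy binary data: class y = 1 centered at (1, 1), class y = 0 at (-1, -1).
n, d = 200, 2
y = rng.integers(0, 2, size=n)
X = rng.normal(0.0, 1.0, size=(n, d)) + np.where(y[:, None] == 1, 1.0, -1.0)

def phi(x, label):
    """One-hot feature map Phi(x, y) = e_y (Kronecker) x, so m = 2d."""
    out = np.zeros(2 * d)
    out[label * d:(label + 1) * d] = x
    return out

# Mean vector tau as in (10) with beta = 1.
tau = np.mean([phi(X[i], y[i]) for i in range(n)], axis=0)

def objective(mu):
    # Empirical (11) for log-loss: -tau^T mu + (1/n) sum_i log sum_y exp(Phi(x_i, y)^T mu).
    scores = np.stack([X @ mu[:d], X @ mu[d:]], axis=1)
    return -tau @ mu + logsumexp(scores, axis=1).mean()

mu_star = minimize(objective, np.zeros(2 * d)).x

# Deterministic rule (17): argmax_y Phi(x, y)^T mu_star.
pred = np.argmax(np.stack([X @ mu_star[:d], X @ mu_star[d:]], axis=1), axis=1)
accuracy = np.mean(pred == y)
assert accuracy > 0.8
```

Under covariate shift, the average inside `objective` would instead run over testing instances weighted by α, and τ would use the β-weighted average of (10).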
This theorem is novel in that we apply the weights α and β for covariate shift adaptation, even though its general form is analogous to the results in (Mazuelas et al., 2022; 2023), which study MRCs with training and testing data sampled i.i.d. from the same distribution.

Regularization. The convex optimization problem (11) carries out an L1-type regularization, where the regularization parameter is given by the vector λ. The regularization term in (11) allows to penalize each component of the parameter µ differently, so that feature components with poorly estimated expectations (i.e., components i with large λ(i)) are strongly penalized.

Classification rule. The deterministic classifier associated with h^U classifies each instance with the label maximizing h^U(y|x). For both losses, this deterministic classifier is given by

arg max_{y ∈ Y} h^U(y|x) = arg max_{y ∈ Y} α(x)Φ(x, y)ᵀµ* = arg max_{y ∈ Y} Φ(x, y)ᵀµ*.    (17)

Such deterministic classifiers, denoted by h^U_d, allow us to classify testing samples even if we do not know the weights α(x) associated with them.

Predictive confidence. The values of α(x) adjust the confidence with which each sample is classified. For instance, for very small values of α(x), the classifier h^U uniformly assigns labels in the set Y for both losses, i.e., h^U(y|x) = 1/|Y| for all y ∈ Y.

Relation with existing approaches. The general framework proposed above encompasses existing approaches, as detailed in Appendix A for binary classification with log-loss. The usage of weights α(x) = 1, β(x) = pte(x)/ptr(x) leads to reweighted methods (Sugiyama & Kawanabe, 2012), approximating Epte(x){ϕℓ(µ, x, α(x))} in (11) using training instances. The usage of weights α(x) = ptr(x)/pte(x), β(x) = 1 leads to robust methods (Liu & Ziebart, 2014), approximating the gradient of Epte(x){ϕℓ(µ, x, α(x))} in (11) using training instances.
### 3.3. Generalization bounds

The following shows generalization bounds for the proposed methods in Section 3.2. Such bounds are given in terms of the smallest minimax risk, R^∞, that corresponds with the uncertainty set given by the exact expectations, and is defined by

R^∞ = min_µ −Epte{Φα(x, y)}ᵀµ + Epte(x){ϕℓ(µ, x, α(x))}.    (18)

The MRC corresponding to the smallest minimax risk R^∞ could only be obtained by an exact estimation of the expectation of the feature mapping Φα, which in turn would require an infinite number of training samples. The theorem below shows risk bounds for the proposed MRCs in terms of minimax risks R(U) and smallest minimax risks R^∞.

Theorem 3.2. Let U be a non-empty uncertainty set given by (9) and h^U be an ℓ-MRC for U. If µ* and µ^∞ are solutions of (11) and (18), respectively, then we have that

R(h^U) ≤ R(U) + ( |τ − Epte{Φα(x, y)}| − λ )₊ᵀ |µ*|    (19)

R(h^U) ≤ R^∞ + λᵀ( |µ^∞| − |µ*| ) + |τ − Epte{Φα(x, y)}|ᵀ |µ* − µ^∞|.    (20)

Proof. See Appendix B.

Note that the minimax risk R(U) obtained at learning offers an upper bound for the ℓ-risk if λ ⪰ |τ − Epte{Φα(x, y)}|, and an approximate upper bound for general λ. In addition, the difference between the risk R(h^U) and the smallest minimax risk R^∞ decreases with the estimation error |τ − Epte{Φα(x, y)}|. We next show how the proposed methods can lead to a significant increase in effective sample size compared with reweighted methods.

Corollary 3.3. Let U be a non-empty uncertainty set given by (9) with λ = 0, and h^U be an ℓ-MRC for U. Let the weights α(x) and β(x) be given by (8) with C = B/√D for D ≥ 1 and

B = sup_{x ∈ X} pte(x)/ptr(x).    (21)

Then, with probability at least 1 − δ, we have that

R(h^U) ≤ R^∞ + M (B/√D) √(2 log(2m/δ)/n) ‖µ* − µ^∞‖₁    (22)

where M is a constant satisfying ‖Φ(x, y)‖∞ ≤ M for all x ∈ X, y ∈ Y.

Proof. A direct consequence of Theorem 3.2 and Hoeffding's inequality.
As described in (Cortes et al., 2010; Yu & Szepesvári, 2012), reweighted methods have an estimation error of order B√(log(1/δ)/n), so the proposed methods can achieve an effective sample size increased by a factor of D using the double-weighting given by (8) with C = B/√D. The next section discusses more broadly such an increase in effective sample size and the corresponding trade-off with prediction confidence.

### 3.4. Choice of weight functions

Existing reweighted and robust methods, as well as the proposed general framework, utilize weights α(x) and β(x) in the estimation of expectations:

(1/n) Σ_{i=1}^{n} β(xi)f(xi, yi) ≈ Epte{α(x)f(x, y)}.    (23)

The error of such estimates is determined by the weights β(x). If α(x) and β(x) satisfy α(x)pte(x) = β(x)ptr(x), using Hoeffding's inequality we have that

| (1/n) Σ_{i=1}^{n} β(xi)f(xi, yi) − Epte{α(x)f(x, y)} | ≤ ‖β‖∞ ‖f‖∞ √(2 log(2/δ)/n)    (24)

with probability at least 1 − δ. In particular, for reweighted methods the bound (24) becomes B‖f‖∞√(2 log(2/δ)/n), with B given by (21), as shown in (Cortes et al., 2010; Yu & Szepesvári, 2012).

The error in the expectation estimates in (23) decreases when the weights α(x) are chosen adequately. In particular, using small values of α(x) we can achieve α(x)pte(x) = β(x)ptr(x) with moderate values of β(x). Such improvement comes at the expense of using classification rules with significant confidence only in the subregion of X in which α(x) is significantly larger than 0.

The above trade-off between error in expectation estimates and confidence of classification rules can be addressed using pairs of weights of the form (8) and varying the value of C. For values C ≥ B, we have α(x) = 1 and β(x) = pte(x)/ptr(x), which corresponds to the reweighted approach. For values C < B, the expectation estimates improve as we decrease C since ‖β‖∞ ≤ C. However, the corresponding classification rules would only predict with significant confidence in the subregion of X where α(x) is significantly larger than 0.
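The effect of shrinking C on the bound (24) can be seen numerically. This sketch (an illustrative discrete example of our own, not from the paper) estimates the α-weighted test expectation from β-weighted training samples; flattening β with a smaller C visibly reduces the estimator's spread:

```python
import numpy as np

rng = np.random.default_rng(1)

# Discrete instance space with a harsh ratio p_te/p_tr on the last atom.
p_tr = np.array([0.49, 0.49, 0.02])
p_te = np.array([0.25, 0.25, 0.50])   # ratio p_te/p_tr is 25 on the last atom
f = np.array([1.0, -1.0, 2.0])        # arbitrary bounded test function
B = np.max(p_te / p_tr)               # = 25, as in (21)

def weights(C):
    # The weight pair of (8): alpha <= 1, beta <= C, alpha*p_te = beta*p_tr.
    return np.minimum(C * p_tr / p_te, 1.0), np.minimum(p_te / p_tr, C)

def estimator_std(C, n=100, reps=2000):
    """Spread of the beta-weighted training average around E_pte[alpha f]."""
    alpha, beta = weights(C)
    target = np.sum(p_te * alpha * f)  # exact alpha-weighted test expectation
    xs = rng.choice(3, size=(reps, n), p=p_tr)
    ests = (beta[xs] * f[xs]).mean(axis=1)
    return np.std(ests - target)

# Shrinking C from B flattens beta and tightens the Hoeffding bound (24).
assert estimator_std(B / 5) < estimator_std(B)
```

The estimator stays unbiased for the α-weighted expectation in both cases; only the target (and with it the confidence of the resulting rule) changes with C.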
Such subregion shrinks as C decreases because it is composed of the x ∈ X where pte(x) is not significantly larger than Cptr(x). In the following, we present methods that obtain weights α and β addressing the above trade-off, and generalize conventional KMM methods.

## 4. Double-Weighting Kernel Mean Matching

The KMM method obtains weights β ∈ R^n for n training instances x1, x2, . . . , xn using t testing instances xn+1, xn+2, . . . , xn+t (Huang et al., 2006; Gretton et al., 2008). We propose the double-weighting KMM (DW-KMM) method that obtains weights β ∈ R^n for the n training instances together with weights α ∈ R^t for the t testing instances by solving the optimization problem

min_{α,β} ‖ (1/t) Σ_{i=1}^{t} α(i)K(xn+i) − (1/n) Σ_{i=1}^{n} β(i)K(xi) ‖_H
s.t. 0 ≤ β(i) ≤ B/√D, for i = 1, . . . , n
     0 ≤ α(i) ≤ 1, for i = 1, . . . , t
     (1/t)‖α − 1‖₂² ≤ 1 − 1/√D    (25)

where K : X → H is a feature map corresponding with a reproducing kernel Hilbert space (RKHS) H with kernel k(x, x̃) = ⟨K(x), K(x̃)⟩_H. As described above, the hyperparameter D ≥ 1 in (25) balances the trade-off between error in expectation estimates and confidence of the classification. For D = 1, the optimization problem becomes that of KMM for reweighted methods (Huang et al., 2006; Gretton et al., 2008).

Performance guarantees. The proposed approach is an empirical version of the following (population) problem given by exact expectations:

min_{α(x),β(x)} ‖ Epte(x){α(x)K(x)} − Eptr(x){β(x)K(x)} ‖_H
s.t. 0 ≤ β(x) ≤ B/√D, 0 ≤ α(x) ≤ 1, ∀x ∈ X
     Epte(x){α(x)} = Eptr(x){β(x)}
     Epte(x){(α(x) − 1)²} ≤ 1 − 1/√D.    (26)

The minimum value of (26) is zero since (8) with C = B/√D is a feasible solution. Then, solutions of (26), β̂(x) and α̂(x), provide consistent estimators of expectations because

Epte(x,y){α̂(x)Φ(x, y)} = Eptr(x,y){β̂(x)Φ(x, y)}    (27)

is satisfied if the kernel k is characteristic or if Epte(y|x){Φ(x, y)} belongs to H, analogously as shown in (Yu & Szepesvári, 2012). With finite samples, the following theorem bounds the difference between the empirical means in feature space for solutions of (26).

Theorem 4.1.
If β̂(x) and α̂(x) are solutions of (26), then with probability at least 1 − δ we have that

‖ (1/n) Σ_{i=1}^{n} β̂(xi)K(xi) − (1/t) Σ_{i=1}^{t} α̂(xn+i)K(xn+i) ‖_H ≤ (1 + √(2 log(2/δ))) κ √(B²/(Dn) + 1/t)    (28)

where the constant κ satisfies |k(x, x)| ≤ κ² for all x ∈ X.

Proof. See Appendix C.

Relation with conventional KMM. The solutions β̂(x) for conventional KMM in reweighted methods satisfy, with probability at least 1 − δ,

‖ (1/n) Σ_{i=1}^{n} β̂(xi)K(xi) − (1/t) Σ_{i=1}^{t} K(xn+i) ‖_H ≤ (1 + √(2 log(2/δ))) κ √(B²/n + 1/t)    (29)

as shown in Lemma 4 of (Huang et al., 2006) and equation (10) in (Yu & Szepesvári, 2012). Therefore, the proposed DW-KMM allows to significantly improve the effective sample size by exploiting the usage of weights α. Analogously to the results shown in Section 3.4, the effective sample size of the proposed methods is D times larger than that of existing KMM for reweighted methods.

## 5. Practical Algorithm

In this section, we present a practical algorithm for the proposed Double-Weighting for General Covariate Shift (DW-GCS), detailed in Algorithm 1. We first compute the weights α and β by solving (25); then, we learn the classifier's parameters by solving (11) using the mean vector τ defined in (10) and the confidence vector λ.

Algorithm 1. The proposed algorithm: DW-GCS
Input: training samples (x1, y1), (x2, y2), . . . , (xn, yn); testing instances xn+1, xn+2, . . . , xn+t; hyperparameter D
Output: weights β̂ and α̂; classifier parameters µ*; minimax risk R(U)
1: β̂, α̂ ← solution of (25)
2: τ ← (1/n) Σ_{i=1}^{n} β̂(i)Φ(xi, yi)
3: λ ← solution of (31)
4: µ* ← solution of (30) using (12) for the 0-1-loss and (13) for the log-loss
5: R(U) ← −τᵀµ* + (1/t) Σ_{i=1}^{t} ϕℓ(µ*, xn+i, α̂(i)) + λᵀ|µ*|

Computing weights and learning MRCs. The weights α and β are computed by solving the convex optimization problem (25), which is a quadratic problem as detailed in Appendix D. The optimization in (11) can be addressed by approximating the expectation by means of the t testing instances xn+1, xn+2, . . .
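To make the DW-KMM problem concrete, here is a small sketch that fits both weight vectors with a Gaussian kernel via `scipy`'s SLSQP (an illustrative simplification of our own: a generic solver rather than the paper's quadratic-programming implementation, box constraints on β and α, and a budget on how far α may deviate from 1, mirroring the population constraint E{(α(x) − 1)²} ≤ 1 − 1/√D in (26); `B` and `D` below are assumed values):

```python
import numpy as np
from scipy.optimize import minimize

def rbf_gram(A, B, sigma=1.0):
    """Gaussian kernel Gram matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def dw_kmm(X_tr, X_te, B=10.0, D=4.0):
    """Jointly fit training weights beta and testing weights alpha."""
    n, t = len(X_tr), len(X_te)
    K_rr = rbf_gram(X_tr, X_tr)
    K_ee = rbf_gram(X_te, X_te)
    K_re = rbf_gram(X_tr, X_te)

    def objective(v):
        beta, alpha = v[:n], v[n:]
        # Squared RKHS norm of (1/t) sum_i alpha_i K(x_te_i) - (1/n) sum_i beta_i K(x_tr_i).
        return (alpha @ K_ee @ alpha / t**2
                - 2 * (beta @ K_re @ alpha) / (n * t)
                + beta @ K_rr @ beta / n**2)

    cons = [{"type": "ineq",
             "fun": lambda v: (1 - 1 / np.sqrt(D)) - ((v[n:] - 1) ** 2).mean()}]
    bounds = [(0.0, B / np.sqrt(D))] * n + [(0.0, 1.0)] * t
    res = minimize(objective, np.ones(n + t), bounds=bounds,
                   constraints=cons, method="SLSQP")
    return res.x[:n], res.x[n:], objective

rng = np.random.default_rng(0)
X_tr = rng.normal(0.0, 1.0, size=(30, 1))
X_te = rng.normal(1.0, 1.0, size=(30, 1))
beta, alpha, objective = dw_kmm(X_tr, X_te)

# Fitted weights should match the kernel means at least as well as uniform weights.
assert objective(np.concatenate([beta, alpha])) <= objective(np.ones(60)) + 1e-9
```

The uniform vector is feasible here, so the fitted weights can only improve the kernel mean match; training points far from the test mass receive small β, and test points far from the training mass receive small α.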
, xn+t as

min_µ −τᵀµ + (1/t) Σ_{i=1}^{t} ϕℓ(µ, xn+i, α(i)) + λᵀ|µ|    (30)

which is an unconstrained convex optimization problem and can be efficiently solved by conventional methods.

Hyperparameters. In principle, both hyperparameters λ and D could be obtained by cross-validation. However, standard cross-validation is not valid under covariate shift (Sugiyama et al., 2007). We hence avoid cross-validation and determine both parameters as follows. As detailed in Section 3.4, the hyperparameter D serves to address the trade-off between error in expectation estimates and confidence of classification rules. For instance, values of D close to 1 can be effective in situations with a large number of samples, while higher values of D can be effective with a reduced number of samples. This is shown by the theoretical results in the paper, since the estimation error in the proposed methods is of order O(1/√(Dn)), as described by the performance bounds in Corollary 3.3 and Theorem 4.1.

In practice, we propose to select the hyperparameter D taking advantage of the minimax risk provided at the learning stage by the presented methods. Specifically, we select the value of D that achieves the lowest minimax risk over a certain range D ≥ 1. Note that, as described in Theorem 3.2, the minimax risk R(U) obtained at learning offers an upper bound for the risk if λ ⪰ |τ − Epte{Φα(x, y)}|, and an approximate upper bound for general λ. Therefore, the proposed selection method uses the value of D that results in the lowest upper bound over a range of values for D. Appendix E further illustrates the adequacy of such an approach in practice.

The second hyperparameter λ is determined by solving

min_{p,λ} 1ᵀλ
s.t. | Σ_{i=1}^{t} Σ_{y ∈ Y} p(y|xn+i)Φα(xn+i, y) − τ | ⪯ λ
     Σ_{y ∈ Y} p(y|xn+i) = 1/t, for i = 1, . . . , t    (31)

which ensures that the uncertainty set used is non-empty.

Complexity and implementation without testing instances. The computational complexity of the proposed methods is similar to that of existing methods for covariate shift adaptation.
Specifically, the step of DW-KMM that obtains the weights has a similar complexity as that of conventional KMM. The main difference is that (25) has t additional variables and t + 1 additional constraints corresponding to the weights α. The step that obtains the classifier parameters by solving the convex optimization problem (11) has the same complexity as that of conventional methods. Finally, the step that determines the hyperparameters not only avoids the usage of cross-validation but can also reduce complexity. In particular, cross-validation with P partitions would require solving (11) P times for each candidate value of the hyperparameters, while the proposed methods only require solving (31) and (11) once for each candidate value of D.

Algorithm 1 details the implementation of DW-GCS in cases where testing instances are available at training. The proposed methods can be implemented with small modifications in cases where only training instances are available and the marginals (or their ratios) are known. In these cases, the weights α(x) and β(x) can be determined using (8) with C = B/√D instead of solving (25), and the optimization (11) can be addressed using the training instances instead of testing instances, making use of equality (3).

## 6. Experiments

This section shows experimental results for the proposed approach in comparison with existing methods on synthetic and real datasets. Reweighted and robust approaches are implemented as in (Sugiyama et al., 2007; Liu & Ziebart, 2014) and described in Appendix A, the flattening method is implemented as in (Shimodaira, 2000), RuLSIF is implemented as in (Yamada et al., 2011), the KMM method is implemented as in (Huang et al., 2006), and the proposed methods are implemented as described in Alg. 1.
The source code for the presented methods is publicly available in the library MRCpy (Bondugula et al., 2023) and the experimental setup at https://github.com/MachineLearningBCAM/MRCs-for-Covariate-Shift-Adaptation. For existing methods, the regularization parameter has been fine-tuned as shown in Appendix E. For the proposed methods, hyperparameters are obtained as described in Section 5. The results in this section are complemented by those in Appendix E, which provides further implementation details and experimental results. In particular, the appendix shows that selecting the hyperparameter D with lowest minimax risk results in performances near those obtained with the best value of D by grid search.

Figure 2. Classification error for different types of covariate shift (parameter δ ∈ {0.05, 0.1, 0.2, 0.35, 0.4, 0.45}; methods: no adaptation, reweighted, robust, and DW-GCS). In the case δ = 0.05, the training support contains that at testing, while in the case δ = 0.45 we have the opposite.

Experiments with synthetic data. In the first set of results we show how the proposed approach can achieve covariate shift adaptation in situations where existing methods are challenged. For such results, the training and testing samples are drawn from the distributions

ptr(x) = (0.5 − δ)N(m1, Σ1) + (0.5 + δ)N(m2, Σ2)
pte(x) = (1 − δ)N(m1, Σ1) + δN(m2, Σ2)    (32)

with m1 = [−3/2, 0]ᵀ, m2 = [3/2, 0]ᵀ, Σ1 = Σ2 = (1/4)I, and labels y = 1 if x(1)x(2) ≥ 0 and y = 2 otherwise. We use values δ ∈ {0.05, 0.1, 0.2, 0.35, 0.4, 0.45} to simulate different relations between the marginals of training and testing instances. We utilize the non-linear feature mapping given by the instances' components and their products, and implement existing and proposed methods using the exact marginals. In addition, for each type of covariate shift (value of δ), we carry out 1,000 random repetitions with 100 training and testing samples.

Results.
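The synthetic covariate shift in (32) can be reproduced with a short sketch (the sampler below is our own illustration of the stated mixtures and label rule):

```python
import numpy as np

def sample_shift(delta, n, rng):
    """Draw n training and n testing instances from the mixtures in (32)."""
    m1, m2 = np.array([-1.5, 0.0]), np.array([1.5, 0.0])
    cov = 0.25 * np.eye(2)

    def mixture(w1):  # w1 = weight of the component centered at m1
        pick = rng.random(n) < w1
        a = rng.multivariate_normal(m1, cov, n)
        b = rng.multivariate_normal(m2, cov, n)
        return np.where(pick[:, None], a, b)

    x_tr = mixture(0.5 - delta)
    x_te = mixture(1.0 - delta)
    # Labels come from the shared conditional: y = 1 iff x(1) x(2) >= 0.
    y_tr = np.where(x_tr[:, 0] * x_tr[:, 1] >= 0, 1, 2)
    return x_tr, y_tr, x_te

rng = np.random.default_rng(0)
x_tr, y_tr, x_te = sample_shift(delta=0.05, n=500, rng=rng)

# For small delta, p_te concentrates near m1, so the test mean shifts left
# while the training instances stay spread over both components.
assert x_te[:, 0].mean() < -1.0
assert set(np.unique(y_tr)) <= {1, 2}
```

For δ near 0.5 the roles reverse, which is how the experiment sweeps from "training support contains testing" to the opposite.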
Figure 2 shows box-plots corresponding to the classification error of existing and proposed approaches, compared with that obtained without covariate shift adaptation (α(x) = β(x) = 1). The results in the figure show how reweighted (resp. robust) methods obtain poor performances in situations where the support of training (resp. testing) instances does not contain that of testing (resp. training) instances. The proposed methods, on the other hand, can leverage the presented double-weighting approach and adapt to more general covariate shifts.

Experiments with real datasets. In the second set of results, we assess the performance of the proposed methods in comparison with existing techniques using real datasets. In particular, reweighted and robust methods are implemented with marginal distributions estimated using log-linear models as shown in (Bickel et al., 2007; 2009). We generate covariate shift in the datasets following (Huang et al., 2006) and (Gretton et al., 2008). In particular, we select training and testing samples with different probabilities based on the medians of each of the first 3 features, and based on the median of the first principal component of the features. In (Huang et al., 2006) and (Gretton et al., 2008), covariate shift is generated with a biased sampling for testing instances, which are drawn with probability δte if the first principal component or feature is larger than a certain value. In those works, the training samples are sampled uniformly, so the generated covariate shifts correspond to situations where the support of training samples contains that of testing samples. In the numerical results of Table 1, we generate more general covariate shifts by using a biased sampling for both training and testing instances (with probabilities δtr = 0.7 and δte = 0.3). These covariate shifts correspond to situations where the supports of training and testing samples have a certain overlap but need not be contained in each other.
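The biased sampling used to generate covariate shift can be sketched as below. The exact acceptance rule in (Huang et al., 2006) may differ; the median thresholding and the rule for assigning leftover samples to the test split are assumptions for illustration:

```python
import numpy as np

def biased_split(x, delta_tr=0.7, delta_te=0.3, rng=None):
    """Select disjoint training/testing index sets with biased probabilities.

    Samples whose first feature exceeds its median are kept for training
    with probability delta_tr (and the rest with 1 - delta_tr); samples not
    selected for training are kept for testing with probability delta_te
    (resp. 1 - delta_te). This creates overlapping but different supports.
    """
    rng = rng or np.random.default_rng()
    above = x[:, 0] > np.median(x[:, 0])
    p_tr = np.where(above, delta_tr, 1 - delta_tr)
    p_te = np.where(above, delta_te, 1 - delta_te)
    tr = rng.random(len(x)) < p_tr
    te = ~tr & (rng.random(len(x)) < p_te)
    return np.flatnonzero(tr), np.flatnonzero(te)
```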
Additionally, we include experimental results using the 20 Newsgroups dataset, which is intrinsically affected by covariate shift since the training and testing partitions correspond to different times (Zhang et al., 2013). We consider the same 5 binary problems used in (Zhang et al., 2013), utilize the 1,000 features with highest Pearson's correlation, and randomly sample 1,000 training and 1,000 testing samples in each repetition.

Results. Table 1 shows the averaged classification error corresponding to the different datasets and covariate shift situations, together with the standard deviations over 100 random partitions, as detailed in Appendix E. The first column of the table describes the different covariate shift datasets generated as described above.

Table 1. Classification errors in 21 scenarios show that the proposed methods can more adequately adapt to general covariate shift. Values in bold show the best classification error in each scenario.

Dataset         Shift          Reweighted  Flattening  RuLSIF     Robust     KMM        DW-GCS 0-1  DW-GCS log
Blood           Feature 1      .55 ± .08   .48 ± .11   .29 ± .04  .34 ± .06  .32 ± .03  .30 ± .03   .31 ± .03
                Feature 2      .39 ± .03   .38 ± .03   .39 ± .03  .40 ± .03  .39 ± .04  .38 ± .05   .38 ± .05
                Feature 3      .43 ± .05   .41 ± .05   .36 ± .04  .39 ± .04  .36 ± .04  .34 ± .03   .35 ± .03
                PCA            .48 ± .05   .48 ± .05   .29 ± .05  .44 ± .05  .30 ± .05  .28 ± .04   .28 ± .04
Breast Cancer   Feature 1      .05 ± .02   .05 ± .03   .05 ± .02  .06 ± .03  .06 ± .02  .04 ± .02   .04 ± .02
                Feature 2      .06 ± .02   .05 ± .02   .06 ± .03  .07 ± .03  .06 ± .03  .04 ± .02   .04 ± .02
                Feature 3      .05 ± .02   .05 ± .02   .05 ± .02  .06 ± .03  .05 ± .02  .04 ± .02   .04 ± .02
                PCA            .03 ± .01   .03 ± .01   .03 ± .01  .03 ± .01  .03 ± .01  .02 ± .01   .02 ± .01
Haberman        Feature 1      .48 ± .07   .47 ± .08   .31 ± .06  .41 ± .09  .34 ± .10  .28 ± .07   .29 ± .06
                Feature 2      .46 ± .08   .44 ± .08   .31 ± .06  .39 ± .08  .36 ± .10  .29 ± .08   .30 ± .07
                Feature 3      .33 ± .05   .33 ± .05   .33 ± .05  .36 ± .06  .42 ± .08  .35 ± .07   .36 ± .06
                PCA            .43 ± .12   .42 ± .12   .29 ± .05  .42 ± .11  .35 ± .08  .30 ± .08   .31 ± .07
Ringnorm        Feature 1      .27 ± .02   .26 ± .02   .25 ± .02  .26 ± .02  .25 ± .02  .25 ± .02   .25 ± .02
                Feature 2      .28 ± .02   .27 ± .02   .25 ± .02  .27 ± .02  .26 ± .02  .25 ± .02   .25 ± .02
                Feature 3      .28 ± .02   .27 ± .02   .25 ± .02  .27 ± .02  .26 ± .03  .25 ± .02   .25 ± .02
                PCA            .32 ± .03   .29 ± .03   .25 ± .02  .26 ± .02  .28 ± .02  .27 ± .02   .26 ± .02
20 Newsgroups   comp vs sci    .41 ± .02   .41 ± .02   .41 ± .02  .42 ± .03  .40 ± .02  .22 ± .02   .22 ± .02
                comp vs talk   .37 ± .03   .37 ± .03   .37 ± .03  .40 ± .05  .34 ± .03  .11 ± .02   .11 ± .02
                rec vs sci     .43 ± .02   .42 ± .02   .42 ± .02  .42 ± .03  .41 ± .02  .17 ± .02   .17 ± .02
                rec vs talk    .40 ± .03   .40 ± .03   .40 ± .03  .41 ± .03  .38 ± .03  .15 ± .02   .15 ± .02
                sci vs talk    .41 ± .03   .41 ± .02   .41 ± .02  .41 ± .04  .39 ± .02  .20 ± .02   .20 ± .02

Overall, the experimental results show that the proposed method provides improved adaptation to general covariate shifts, even in situations where the supports of training and testing samples are not contained in each other. These results agree with the discussion in Sections 2.1 and 3.1 as well as the theoretical results in Corollary 3.3 and Theorem 4.1, which show how the proposed methodology can be effective in situations where existing methods based on single weights are challenged. The improvement obtained by the presented methods can be clearly observed by comparing with the results of the KMM method, since that technique is the most closely related to the proposed one. In particular, the results show that significant performance improvements can be obtained by using a double weighting of both training and testing samples, solving (25), instead of using the existing KMM method (which solves (25) fixing the weights α to one).

7. Conclusion

Existing approaches for covariate shift adaptation use the ratios between marginal distributions to either weight training or testing samples. However, the performance of such approaches can be poor when the marginals' supports are not contained in each other or when the marginals' ratios take large values. This paper proposes a minimax risk classification (MRC) approach for covariate shift adaptation that avoids such limitations by weighting both training and testing samples.
We present effective techniques that obtain both sets of weights, generalizing the conventional kernel mean matching method that only obtains weights for training samples. In addition, we present generalization bounds for the proposed methods that show a significant increase in effective sample size. The unifying approach and the learning methods proposed can enable techniques capable of adapting to more general scenarios affected by covariate shift.

Acknowledgments

Funding in direct support of this work has been provided by projects PID2019-105058GA-I00, CNS2022-135203, and CEX2021-001142-S funded by MCIN/AEI/10.13039/501100011033 and the European Union NextGenerationEU/PRTR, programmes ELKARTEK and BERC 2022-2025 funded by the Basque Government, the project Early Prognosis of COVID-19 Infections via Machine Learning funded by the AXA Research Fund, and the JHU-Amazon AI2AI Faculty Award.

References

Altun, Y. and Smola, A. Unifying divergence minimization and statistical inference via convex duality. In Proceedings of the 19th Annual Conference on Computational Learning Theory, pp. 139–153, 2006.

Bickel, S., Brückner, M., and Scheffer, T. Discriminative learning for differing training and test distributions. In Proceedings of the 24th International Conference on Machine Learning, pp. 81–88, 2007.

Bickel, S., Brückner, M., and Scheffer, T. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10:2137–2155, 2009.

Bondugula, K., Álvarez, V., Segovia-Martín, J. I., Pérez, A., and Mazuelas, S. MRCpy: A library for minimax risk classifiers. arXiv preprint arXiv:2108.01952, 2023.

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Chen, X., Monfort, M., Liu, A., and Ziebart, B. D. Robust covariate shift regression.
In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 1270–1279, 2016.

Cortes, C. and Mohri, M. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.

Cortes, C., Mohri, M., Riley, M., and Rostamizadeh, A. Sample selection bias correction theory. In Proceedings of the 19th International Conference on Algorithmic Learning Theory, pp. 38–53, 2008.

Cortes, C., Mansour, Y., and Mohri, M. Learning bounds for importance weighting. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems, pp. 442–450, 2010.

Dua, D. and Graff, C. UCI Machine Learning Repository, 2017. URL http://archive.ics.uci.edu/ml.

Dudík, M., Schapire, R. E., and Phillips, S. J. Correcting sample selection bias in maximum entropy density estimation. In Proceedings of the 19th Annual Conference on Neural Information Processing Systems, pp. 323–330, 2005.

Farnia, F. and Tse, D. A minimax approach to supervised learning. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pp. 4240–4248, 2016.

Fathony, R., Liu, A., Asif, K., and Ziebart, B. D. Adversarial multiclass classification: A risk minimization perspective. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pp. 559–567, 2016.

Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Schölkopf, B. Covariate shift by kernel mean matching. In Dataset Shift in Machine Learning, pp. 131–160. MIT Press, 2008.

Huang, J., Smola, A. J., Gretton, A., Borgwardt, K. M., and Schölkopf, B. Correcting sample selection bias by unlabeled data. In Proceedings of the 20th Annual Conference on Neural Information Processing Systems, pp. 601–608, 2006.

Kanamori, T., Hido, S., and Sugiyama, M. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10:1391–1445, 2009.

Lin, Y., Lee, Y., and Wahba, G.
Support vector machines for classification in nonstandard situations. Machine Learning, 46(1):191–202, 2002.

Liu, A. and Ziebart, B. D. Robust classification under sample selection bias. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems, pp. 37–45, 2014.

Liu, A. and Ziebart, B. D. Robust covariate shift prediction with general losses and feature views. arXiv preprint arXiv:1712.10043, 2017.

Liu, S., Yamada, M., Collier, N., and Sugiyama, M. Change-point detection in time-series data by relative density-ratio estimation. Neural Networks, 43:72–83, 2013.

Mazaheri, B., Jain, S., and Bruck, J. Robust correction of sampling bias using cumulative distribution functions. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, pp. 3546–3556, 2020.

Mazuelas, S., Shen, Y., and Pérez, A. Generalized maximum entropy for supervised classification. IEEE Transactions on Information Theory, 68(4):2530–2550, 2022.

Mazuelas, S., Romero, M., and Grünwald, P. Minimax risk classifiers with 0-1 loss. arXiv preprint arXiv:2201.06487, 2023.

Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset Shift in Machine Learning. MIT Press, 2008.

Reddi, S. J., Póczos, B., and Smola, A. Doubly robust covariate shift correction. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pp. 2949–2955, 2015.

Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

Sugiyama, M. and Kawanabe, M. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press, 2012.

Sugiyama, M., Krauledat, M., and Müller, K.-R. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985–1005, 2007.
Tsuboi, Y., Kashima, H., Hido, S., Bickel, S., and Sugiyama, M. Direct density ratio estimation for large-scale covariate shift adaptation. Journal of Information Processing, 17:138–155, 2009.

Wen, J., Yu, C.-N., and Greiner, R. Robust learning under uncertain test distributions: Relating covariate shift to model misspecification. In Proceedings of the 31st International Conference on Machine Learning, pp. 631–639, 2014.

Yamada, M., Suzuki, T., Kanamori, T., Hachiya, H., and Sugiyama, M. Relative density-ratio estimation for robust distribution comparison. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems, pp. 594–602, 2011.

Yu, Y.-L. and Szepesvári, C. Analysis of kernel mean matching under covariate shift. In Proceedings of the 29th International Conference on Machine Learning, pp. 1147–1154, 2012.

Zadrozny, B. Learning and evaluating classifiers under sample selection bias. In Proceedings of the 21st International Conference on Machine Learning, pp. 114, 2004.

Zhang, K., Zheng, V. W., Wang, Q., Kwok, J. T., Yang, Q., and Marsic, I. Covariate shift in Hilbert space: A solution via surrogate kernels. In Proceedings of the 30th International Conference on Machine Learning, pp. 388–395, 2013.

A. Detailed derivations describing existing methods and relation with the proposed framework

The following describes reweighted and robust methods for binary classification with Y = {−1, 1} and log-loss. In particular, we show how, using the specific weights in (3) and (4), such methods can be obtained from Theorem 3.1 in Section 3.2 corresponding to the proposed framework.
Reweighted methods consider classification rules of the form

h(y|x) = 1 / ( 1 + exp{ −y xᵀµ } )    (33)

and learn the parameter µ using the fact that equality (3) in Section 2.1 allows one to estimate expected losses with respect to the test distribution using training samples, since

E_{pte(x,y)} log( 1 + exp{ −y xᵀµ } ) = E_{ptr(x,y)} β(x) log( 1 + exp{ −y xᵀµ } )

for β(x) = pte(x)/ptr(x). Robust methods consider classification rules of the form

h(y|x) = 1 / ( 1 + exp{ −α(x) y xᵀµ } )    (34)

with α(x) = ptr(x)/pte(x). Such methods learn the parameter µ using the fact that equality (4) in Section 2.1 allows one to estimate the expected gradient of losses with respect to the test distribution using training samples, since

E_{pte(x,y)} ∇µ log( 1 + exp{ −(ptr(x)/pte(x)) y xᵀµ } )
 = −E_{pte(x,y)} α(x) y xᵀ / ( 1 + exp{ (ptr(x)/pte(x)) y xᵀµ } )
 = −E_{ptr(x,y)} y xᵀ / ( 1 + exp{ (ptr(x)/pte(x)) y xᵀµ } )

for α(x) = ptr(x)/pte(x).

Regarding their derivation from Theorem 3.1 corresponding to the proposed framework: taking Φ(x, y) = y x/2, the optimization problem in (11) of Theorem 3.1 becomes

min_µ  −(1/n) Σ_{i=1}^{n} β(xi) yi xiᵀµ/2 + E_{pte(x)} log( exp{ α(x) xᵀµ/2 } + exp{ −α(x) xᵀµ/2 } )    (35)

in binary classification with log-loss. If α(x) = 1 and β(x) = pte(x)/ptr(x), the classifier in (15) of Theorem 3.1 coincides with that of reweighted methods in (33). In addition, using (3) in Section 2.1 and approximating the expectation with training samples, the optimization in (35) becomes

min_µ  (1/n) Σ_{i=1}^{n} ( pte(xi)/ptr(xi) ) log( 1 + exp{ −yi xiᵀµ } )    (36)

which coincides with that of reweighted logistic regression (Sugiyama & Kawanabe, 2012). If α(x) = ptr(x)/pte(x) and β(x) = 1, the classifier in (15) of Theorem 3.1 coincides with that of robust methods in (34). In addition, using (4) in Section 2.1, the gradient of the objective function in (35) becomes

−(1/n) Σ_{i=1}^{n} yi xiᵀ/2 + E_{ptr(x)} (xᵀ/2) ( 1 − exp{ −(ptr(x)/pte(x)) xᵀµ } ) / ( 1 + exp{ −(ptr(x)/pte(x)) xᵀµ } )    (37)

which coincides with that shown in equation (7) in (Liu & Ziebart, 2014) for robust methods.

B.
Proofs for Section 3

The proofs of Theorems 3.1 and 3.2 below are done for the case of finite X. The proofs for infinite X can be carried out analogously using Fenchel duality instead of Lagrange duality, similarly to what is done in (Altun & Smola, 2006; Mazuelas et al., 2023).

Proof of Theorem 3.1. Firstly, for each h ∈ T(X, Y), we have that

max_{p∈U} ℓ(h, p) = max_p  lᵀp − I₊(p)
                    s.t.  Σ_{y∈Y} p(x, y) = pte(x), ∀x ∈ X
                          τ − λ ≤ Φαᵀp ≤ τ + λ    (38)

where l, p, and Φα denote the vectors and matrix with rows ℓ(h, (x, y)), p(x, y), and Φα(x, y)ᵀ, respectively, for x ∈ X, y ∈ Y, and I₊(p) = 0 if p ≥ 0 and +∞ otherwise. Optimization problem (38) has Lagrange dual

min_{µ1,µ2,ν(x)}  −(τ − λ)ᵀµ1 + (τ + λ)ᵀµ2 + E_{pte(x)} ν(x) + f*(Φα(µ1 − µ2) − ν)
s.t.  µ1, µ2 ≥ 0

where ν is the vector in R^{|X||Y|} whose component corresponding to (x, y), for x ∈ X, y ∈ Y, is given by ν(x), and f* is the conjugate function of f(p) = −lᵀp + I₊(p), given by

f*(w) = sup_{p≥0} { wᵀp + lᵀp } = 0 if w ≤ −l and +∞ otherwise.

Therefore, the Lagrange dual above becomes

min_{µ1,µ2,ν(x)}  −(τ − λ)ᵀµ1 + (τ + λ)ᵀµ2 + E_{pte(x)} ν(x)
s.t.  µ1, µ2 ≥ 0
      Φα(x, y)ᵀ(µ1 − µ2) − ν(x) ≤ −ℓ(h, (x, y)), ∀x ∈ X, y ∈ Y.

It is easy to see that the solution (µ1, µ2) of such optimization problem satisfies µ1^(i) µ2^(i) = 0 for any i such that λ^(i) > 0. Then λᵀ(µ1 + µ2) = λᵀ|µ1 − µ2| and, taking µ = µ1 − µ2, the Lagrange dual above is equivalent to

min_{µ,ν(x)}  −τᵀµ + λᵀ|µ| + E_{pte(x)} ν(x)
s.t.  Φα(x, y)ᵀµ − ν(x) ≤ −ℓ(h, (x, y)), ∀x ∈ X, y ∈ Y

which has the same value as max_{p∈U} ℓ(h, p) since the constraints in (38) are affine and U is non-empty. Hence,

min_{h∈T(X,Y)} max_{p∈U} ℓ(h, p) = min_{h,µ,ν(x)}  −τᵀµ + λᵀ|µ| + E_{pte(x)} ν(x)
s.t.  Φα(x, y)ᵀµ − ν(x) ≤ −ℓ(h, (x, y)), ∀x ∈ X, y ∈ Y.

For 0-1-loss we have that, for some h ∈ T(X, Y),

Φα(x, y)ᵀµ − ν(x) ≤ −1 + h(y|x), ∀x ∈ X, y ∈ Y
⇔ Σ_{y∈C} ( Φα(x, y)ᵀµ − ν(x) + 1 ) ≤ 1, ∀C ⊆ Y, x ∈ X
⇔ ( Σ_{y∈C} Φα(x, y)ᵀµ + |C| − 1 ) / |C| ≤ ν(x), ∀C ⊆ Y, x ∈ X
⇔ ν(x) ≥ ϕ01(µ, x, α(x)), ∀x ∈ X.
Therefore, for each µ, we have that any classification rule satisfying

h(y|x) ≥ Φα(x, y)ᵀµ − ϕ01(µ, x, α(x)) + 1, ∀x ∈ X, y ∈ Y

is a solution of

min_{h,ν(x)}  E_{pte(x)} ν(x) = E_{pte(x)} ϕ01(µ, x, α(x))
s.t.  Φα(x, y)ᵀµ − ν(x) + 1 ≤ h(y|x), ∀x ∈ X, y ∈ Y

and the result is obtained because, for any x ∈ X, we have that

Σ_{y∈Y} ( Φα(x, y)ᵀµ − ϕ01(µ, x, α(x)) + 1 )₊ ≥ 1

because otherwise there would exist νx < ϕ01(µ, x, α(x)) such that

Σ_{y∈Y} ( Φα(x, y)ᵀµ − νx + 1 )₊ ≤ 1

which contradicts the definition of ϕ01(µ, x, α(x)).

The case of log-loss is analogous to the case of 0-1-loss above, taking into account that, for some h ∈ T(X, Y),

Φα(x, y)ᵀµ − ν(x) ≤ log h(y|x), ∀x ∈ X, y ∈ Y
⇔ Σ_{y∈Y} exp{ Φα(x, y)ᵀµ − ν(x) } ≤ 1, ∀x ∈ X
⇔ ν(x) ≥ log Σ_{y∈Y} exp{ Φα(x, y)ᵀµ } = ϕlog(µ, x, α(x)), ∀x ∈ X.

The lemma below is used in the proof of Theorem 3.2.

Lemma B.1. Let U be the uncertainty set given by (9) for τ, λ ∈ Rᵐ, and let h be a classification rule. If

R01(U, h) = min_µ  −τᵀµ + E_{pte(x)} max_{y∈Y} { 1 + α(x)Φ(x, y)ᵀµ − h(y|x) } + λᵀ|µ|    (39)
Rlog(U, h) = min_µ  −τᵀµ + E_{pte(x)} max_{y∈Y} { α(x)Φ(x, y)ᵀµ − log h(y|x) } + λᵀ|µ|    (40)

then, for any p ∈ U,

ℓ01(h, p) ≤ R01(U, h)    (41)
ℓlog(h, p) ≤ Rlog(U, h).    (42)

Proof of Lemma B.1. The proof is analogous to that of Theorem 5 in (Mazuelas et al., 2023). The case U = ∅ is trivial. For the case U ≠ ∅, we first calculate the Lagrange dual of the optimization problem min_{p̂∈U} E_{p̂} q for a general function q : X × Y → R. Then we consider the fact that, for any p ∈ U and h ∈ T(X, Y),

min_{p̂∈U} ℓ(h, p̂) ≤ ℓ(h, p) ≤ max_{p̂∈U} ℓ(h, p̂)
max_{p̂∈U} ℓ01(h, p̂) = −min_{p̂∈U} E_{p̂} { h(y|x) − 1 }
max_{p̂∈U} ℓlog(h, p̂) = −min_{p̂∈U} E_{p̂} log h(y|x)

for 0-1-loss and log-loss, respectively. First, we have that min_{p̂∈U} E_{p̂} q is equal to

min_{p̂}  qᵀp̂ + I₊(p̂)
s.t.  Σ_{y∈Y} p̂(x, y) = pte(x) for all x ∈ X
      τ − λ ≤ Φαᵀp̂ ≤ τ + λ    (43)

where p̂, q, and Φα denote the vectors and matrix with rows p̂(x, y), q(x, y), and α(x)Φ(x, y)ᵀ, respectively, for x ∈ X, y ∈ Y, and I₊(p̂) = 0 if p̂ ≥ 0 and +∞ otherwise.
Optimization problem (43) has Lagrange dual

max_{µ1,µ2,ν(x)}  (τ − λ)ᵀµ1 − (τ + λ)ᵀµ2 + E_{pte(x)} ν(x) − f*(Φα(µ1 − µ2) + ν)
s.t.  µ1, µ2 ≥ 0

where ν denotes the vector in R^{|X||Y|} whose component corresponding to (x, y), for x ∈ X, y ∈ Y, is given by ν(x), and f* is the conjugate function of f(p̂) = qᵀp̂ + I₊(p̂), which becomes f*(w) = 0 if w ≤ q and +∞ otherwise. Therefore, the previous Lagrange dual becomes

max_{µ1,µ2,ν(x)}  (τ − λ)ᵀµ1 − (τ + λ)ᵀµ2 + E_{pte(x)} ν(x)
s.t.  µ1, µ2 ≥ 0
      Φα(µ1 − µ2) + ν ≤ q

which is equivalent to

max_{µ1,µ2}  (τ − λ)ᵀµ1 − (τ + λ)ᵀµ2 + E_{pte(x)} min_{y∈Y} { q(x, y) − α(x)Φ(x, y)ᵀ(µ1 − µ2) }
s.t.  µ1, µ2 ≥ 0.

Taking µ = µ1 − µ2, the Lagrange dual problem is equivalent to

max_µ  τᵀµ + E_{pte(x)} min_{y∈Y} { q(x, y) − α(x)Φ(x, y)ᵀµ } − λᵀ|µ|

which has the same value as its primal min_{p̂∈U} E_{p̂} q since the constraints defining U are affine and U ≠ ∅. Then, we have that

max_{p̂∈U} ℓ01(h, p̂) = −min_{p̂∈U} E_{p̂} { h(y|x) − 1 } = min_µ  −τᵀµ + E_{pte(x)} max_{y∈Y} { 1 + α(x)Φ(x, y)ᵀµ − h(y|x) } + λᵀ|µ|
max_{p̂∈U} ℓlog(h, p̂) = −min_{p̂∈U} E_{p̂} log h(y|x) = min_µ  −τᵀµ + E_{pte(x)} max_{y∈Y} { α(x)Φ(x, y)ᵀµ − log h(y|x) } + λᵀ|µ|

for 0-1-loss and log-loss, respectively.

Proof of Theorem 3.2. For inequality (19), let U∞ be the uncertainty set given by the exact mean vector τ∞ = E_{pte} Φα(x, y), i.e.,

U∞ = { p ∈ ∆(X × Y) : E_p Φα(x, y) = τ∞ and p(x) = pte(x), ∀x ∈ X }.    (44)

It is clear that pte(x, y) ∈ U∞; then, for 0-1-loss, using Lemma B.1 and the definition of h^U(y|x) in (14), we have that

R(h^U) ≤ R01(U∞, h^U) = min_µ  −τ∞ᵀµ + E_{pte(x)} max_{y∈Y} { 1 + α(x)Φ(x, y)ᵀµ − h^U(y|x) }
 ≤ −τ∞ᵀµ* + E_{pte(x)} max_{y∈Y} { 1 + α(x)Φ(x, y)ᵀµ* − h^U(y|x) }    (45)
 ≤ −τ∞ᵀµ* + E_{pte(x)} max_{y∈Y} ϕ01(µ*, x, α(x))    (46)
 = −τ∞ᵀµ* + E_{pte(x)} ϕ01(µ*, x, α(x))    (47)
 = R(U) + (τ − τ∞)ᵀµ* − λᵀ|µ*|    (48)

where, for inequality (45)-(46), we have used the fact that h^U(y|x) ≥ α(x)Φ(x, y)ᵀµ* − ϕ01(µ*, x, α(x)) + 1, and for equality (47)-(48) we have added and subtracted τᵀµ* and λᵀ|µ*| and used the definition of R(U) in (16).
For log-loss, using Lemma B.1 and the definition of h^U(y|x) in (15), we have that

R(h^U) ≤ Rlog(U∞, h^U) = min_µ  −τ∞ᵀµ + E_{pte(x)} max_{y∈Y} { α(x)Φ(x, y)ᵀµ − log h^U(y|x) }
 ≤ −τ∞ᵀµ* + E_{pte(x)} max_{y∈Y} { α(x)Φ(x, y)ᵀµ* − log h^U(y|x) }    (49)
 = −τ∞ᵀµ* + E_{pte(x)} max_{y∈Y} ϕlog(µ*, x, α(x))    (50)
 = −τ∞ᵀµ* + E_{pte(x)} ϕlog(µ*, x, α(x))    (51)
 = R(U) + (τ − τ∞)ᵀµ* − λᵀ|µ*|    (52)

where, for equality (49)-(50), we have used the fact that log h^U(y|x) = α(x)Φ(x, y)ᵀµ* − ϕlog(µ*, x, α(x)), and for equality (51)-(52) we have added and subtracted τᵀµ* and λᵀ|µ*| and used the definition of R(U) in (16).

For inequality (20), note that, using the definition of µ∞ and (47) (resp. (51)) for 0-1-loss (resp. log-loss), we have that

R(h^U) ≤ −τ∞ᵀµ* + E_{pte(x)} ϕℓ(µ*, x, α(x))
 = −τᵀµ* + E_{pte(x)} ϕℓ(µ*, x, α(x)) + λᵀ|µ*| + (τ − τ∞)ᵀµ* − λᵀ|µ*|
 ≤ −τᵀµ∞ + E_{pte(x)} ϕℓ(µ∞, x, α(x)) + λᵀ|µ∞| + (τ − τ∞)ᵀµ* − λᵀ|µ*|
 = R∞ + λᵀ( |µ∞| − |µ*| ) + (τ − τ∞)ᵀµ* + (τ∞ − τ)ᵀµ∞
 ≤ R∞ + λᵀ( |µ∞| − |µ*| ) + |τ − τ∞|ᵀ|µ* − µ∞|

where the second inequality follows since µ* minimizes −τᵀµ + E_{pte(x)} ϕℓ(µ, x, α(x)) + λᵀ|µ|.

C. Proofs for Section 4

Proof of Theorem 4.1. The proof is analogous to Example 6.3 in (Boucheron et al., 2013), which shows a Hoeffding-type inequality in Hilbert space. We consider n + t independent random variables taking values in the Hilbert space H given by

f_i = (1/n) β̂(x_i) K(x_i)      for i = 1, 2, . . . , n
f_i = −(1/t) α̂(x_i) K(x_i)     for i = n + 1, n + 2, . . . , n + t    (53)

and we want to bound || Σ_{i=1}^{n+t} f_i ||_H. We have that

||f_i||_H ≤ (B/(n√D)) κ    for i = 1, 2, . . . , n
||f_i||_H ≤ (1/t) κ        for i = n + 1, n + 2, . . . , n + t.    (54)

Taking v = κ²( B²/(Dn) + 1/t ) and using the bounded differences inequality (Theorem 6.2 in (Boucheron et al., 2013)), we have that, for all l > 0,

P{ || Σ_{i=1}^{n+t} f_i ||_H ≥ E|| Σ_{i=1}^{n+t} f_i ||_H + l } ≤ exp( −l²/(2v) ).

Finally, using Hölder's inequality and by independence, we have that

( E|| Σ_{i=1}^{n+t} f_i ||_H )² ≤ Σ_{i=1}^{n+t} E ||f_i||²_H ≤ v

so that

|| (1/n) Σ_{i=1}^{n} β̂(x_i) K(x_i) − (1/t) Σ_{i=1}^{t} α̂(x_{n+i}) K(x_{n+i}) ||_H ≤ √v ( 1 + √(2 log(1/δ)) )

with probability at least 1 − δ.

D. Quadratic version of DW-KMM

The convex optimization in (25) is a quadratic problem since the squared norm in H can be written as

|| (1/t) Σ_{i=1}^{t} α^(i) K(x_{n+i}) − (1/n) Σ_{i=1}^{n} β^(i) K(x_i) ||²_H
 = (1/t²) Σ_{i,j=1}^{t} α^(i) α^(j) k(x_{n+i}, x_{n+j}) + (1/n²) Σ_{i,j=1}^{n} β^(i) β^(j) k(x_i, x_j) − (2/(nt)) Σ_{i=1}^{t} Σ_{j=1}^{n} α^(i) β^(j) k(x_{n+i}, x_j)
 = [βᵀ/n, −αᵀ/t] K [β/n; −α/t]

where K is the kernel matrix over the n + t training and testing instances, given by K^(i,j) = k(x_i, x_j) for i, j = 1, 2, . . . , n + t. Therefore, the optimization problem (25) is equivalent to the quadratic optimization problem

min_{α,β}  [βᵀ/n, −αᵀ/t] K [β/n; −α/t]
s.t.  0 ≤ β ≤ (B/√D) 1,  0 ≤ α ≤ 1
      | βᵀ1/n − αᵀ1/t | ≤ ε
      (1/√t) ||α − 1|| ≤ 1 − 1/√D.    (56)

E. Implementation details and additional experimental results

This appendix details the datasets and settings used for the experiments in Section 6 and shows additional experiments. For the experiments in Section 6, we have considered four binary classification datasets, available in the UCI repository (Dua & Graff, 2017) and previously used in multiple papers on covariate shift adaptation (Gretton et al., 2008; Huang et al., 2006; Kanamori et al., 2009; Mazaheri et al., 2020; Wen et al., 2014). In addition, we use the 20 Newsgroups dataset, which is intrinsically affected by covariate shift (Zhang et al., 2013). Table 2 details the characteristics of the datasets used in the experiments. The table also shows the parameter σ used in the computation of the kernel matrix K for the RuLSIF, KMM, and DW-KMM methods, which is determined using the common heuristic based on nearest neighbors with K = 50, as is done in (Wen et al., 2014). For the results obtained using the flattening method in (Shimodaira, 2000) and the RuLSIF method in (Yamada et al., 2011) we considered the hyperparameter γ = 0.5, which is the default value used in those papers.

Table 2. Datasets used in the experiments.
Dataset          Covariates  Samples       Ratio of majority class  σ
Blood            3           748           76.20%                   0.7491
Breast Cancer    9           683           65.01%                   1.6064
Haberman         3           306           75.53%                   1.3024
Ringnorm         20          7400          50.49%                   3.8299
comp vs sci      1000        5309 / 3534   55.31%                   23.5628
comp vs talk     1000        4888 / 3256   60.06%                   23.4890
rec vs sci       1000        4762 / 3169   50.17%                   24.5642
rec vs talk      1000        4341 / 2891   55.02%                   25.1129
sci vs talk      1000        4325 / 2880   54.85%                   24.8320

In the additional experiments we study the effectiveness of the proposed selection method for the hyperparameter D. Specifically, Tables 3 and 4 show the average classification error for varying values of D on the datasets and covariate shifts shown in Table 1. The first column of these tables shows the classification error obtained when selecting D with the proposed method that minimizes the minimax risk, while the other columns show the classification error obtained using specific values of D. The values of the hyperparameter D have been chosen based on the last inequality in the optimization problem (25). Specifically, we take the values of D such that 1 − 1/√D ∈ {0, 0.1, . . . , 0.9}. As can be seen from the tables, the proposed selection method results in performances near those obtained with the best values of D.

Table 3. Classification error in 21 scenarios using DW-GCS methods with 0-1-loss, varying the value of the hyperparameter D.
Dataset         Shift          Proposed  D=1   D=1.2  D=1.6  D=2   D=2.8  D=4   D=6.3  D=11  D=25  D=100
Blood           Feature 1      0.30      0.32  0.31   0.31   0.31  0.30   0.30  0.30   0.29  0.29  0.30
                Feature 2      0.38      0.40  0.40   0.40   0.40  0.39   0.39  0.38   0.38  0.37  0.38
                Feature 3      0.34      0.39  0.37   0.36   0.36  0.35   0.34  0.34   0.34  0.33  0.34
                PCA            0.28      0.32  0.30   0.29   0.28  0.27   0.27  0.28   0.28  0.28  0.28
Breast Cancer   Feature 1      0.04      0.05  0.05   0.04   0.04  0.04   0.04  0.05   0.05  0.04  0.04
                Feature 2      0.04      0.06  0.06   0.05   0.05  0.05   0.06  0.06   0.06  0.04  0.04
                Feature 3      0.04      0.06  0.05   0.05   0.05  0.06   0.05  0.06   0.05  0.04  0.04
                PCA            0.02      0.03  0.03   0.02   0.02  0.03   0.03  0.03   0.02  0.02  0.02
Haberman        Feature 1      0.28      0.39  0.36   0.33   0.30  0.29   0.28  0.28   0.28  0.28  0.28
                Feature 2      0.29      0.39  0.37   0.35   0.33  0.32   0.30  0.30   0.30  0.30  0.29
                Feature 3      0.35      0.46  0.45   0.43   0.40  0.37   0.35  0.35   0.34  0.34  0.34
                PCA            0.30      0.40  0.38   0.35   0.32  0.30   0.29  0.29   0.29  0.30  0.29
Ringnorm        Feature 1      0.25      0.25  0.25   0.25   0.25  0.25   0.25  0.25   0.25  0.25  0.25
                Feature 2      0.25      0.25  0.26   0.25   0.25  0.25   0.25  0.25   0.25  0.25  0.25
                Feature 3      0.25      0.26  0.26   0.26   0.26  0.26   0.26  0.26   0.26  0.25  0.25
                PCA            0.27      0.30  0.29   0.29   0.28  0.28   0.27  0.27   0.26  0.27  0.27
20 Newsgroups   comp vs sci    0.22      0.25  0.24   0.23   0.22  0.21   0.21  0.21   0.21  0.21  0.21
                comp vs talk   0.11      0.17  0.16   0.14   0.12  0.11   0.10  0.10   0.10  0.11  0.11
                rec vs sci     0.17      0.19  0.18   0.18   0.17  0.17   0.16  0.16   0.16  0.16  0.16
                rec vs talk    0.15      0.18  0.17   0.16   0.15  0.15   0.14  0.14   0.14  0.14  0.14
                sci vs talk    0.20      0.22  0.21   0.20   0.19  0.19   0.18  0.18   0.18  0.19  0.19

Table 4. Classification error in 21 scenarios using DW-GCS methods with log-loss, varying the value of the hyperparameter D.
Dataset         Shift          Proposed  D=1   D=1.2  D=1.6  D=2   D=2.8  D=4   D=6.3  D=11  D=25  D=100
Blood           Feature 1      0.31      0.32  0.32   0.32   0.31  0.30   0.30  0.30   0.29  0.29  0.30
                Feature 2      0.38      0.41  0.40   0.40   0.40  0.39   0.38  0.38   0.38  0.38  0.38
                Feature 3      0.35      0.38  0.38   0.37   0.36  0.36   0.35  0.34   0.34  0.34  0.34
                PCA            0.28      0.32  0.30   0.29   0.29  0.28   0.28  0.28   0.28  0.28  0.28
Breast Cancer   Feature 1      0.04      0.05  0.05   0.05   0.05  0.04   0.04  0.05   0.05  0.04  0.04
                Feature 2      0.04      0.06  0.06   0.06   0.06  0.06   0.06  0.06   0.06  0.04  0.04
                Feature 3      0.04      0.06  0.06   0.06   0.06  0.06   0.05  0.06   0.05  0.04  0.04
                PCA            0.02      0.03  0.03   0.03   0.03  0.03   0.03  0.03   0.02  0.02  0.02
Haberman        Feature 1      0.29      0.38  0.36   0.34   0.31  0.30   0.29  0.29   0.29  0.29  0.28
                Feature 2      0.30      0.39  0.37   0.35   0.33  0.32   0.31  0.31   0.31  0.31  0.30
                Feature 3      0.36      0.46  0.44   0.43   0.40  0.37   0.36  0.35   0.35  0.35  0.34
                PCA            0.31      0.39  0.38   0.36   0.33  0.31   0.30  0.30   0.30  0.31  0.30
Ringnorm        Feature 1      0.25      0.25  0.25   0.25   0.25  0.25   0.25  0.25   0.25  0.25  0.25
                Feature 2      0.25      0.25  0.25   0.25   0.25  0.25   0.25  0.25   0.25  0.24  0.25
                Feature 3      0.25      0.26  0.26   0.26   0.25  0.26   0.25  0.25   0.25  0.25  0.25
                PCA            0.26      0.30  0.29   0.29   0.28  0.27   0.27  0.26   0.26  0.26  0.27
20 Newsgroups   comp vs sci    0.22      0.25  0.24   0.23   0.22  0.21   0.21  0.21   0.21  0.21  0.21
                comp vs talk   0.11      0.17  0.16   0.14   0.12  0.11   0.10  0.10   0.10  0.11  0.11
                rec vs sci     0.17      0.19  0.18   0.18   0.17  0.17   0.16  0.16   0.16  0.16  0.16
                rec vs talk    0.15      0.18  0.17   0.16   0.15  0.15   0.14  0.14   0.14  0.14  0.14
                sci vs talk    0.20      0.22  0.21   0.20   0.19  0.19   0.18  0.18   0.18  0.19  0.19
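As a closing illustration, the quadratic objective of DW-KMM derived in Appendix D can be evaluated directly. The sketch below is a minimal illustration assuming a Gaussian kernel (the kernel choice and helper name are not from the paper); it computes [βᵀ/n, −αᵀ/t] K [β/n; −α/t] over the stacked training and testing instances:

```python
import numpy as np

def dw_kmm_objective(xtr, xte, beta, alpha, sigma=1.0):
    """Evaluate the DW-KMM quadratic objective with a Gaussian kernel.

    Builds the (n+t) x (n+t) kernel matrix K over stacked instances and
    returns w^T K w for w = [beta/n; -alpha/t], which equals the squared
    RKHS norm of the weighted mean-embedding difference.
    """
    x = np.vstack([xtr, xte])
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    K = np.exp(-sq / (2 * sigma ** 2))                   # Gaussian kernel matrix
    n, t = len(xtr), len(xte)
    w = np.concatenate([beta / n, -alpha / t])
    return w @ K @ w
```

Minimizing this objective over α and β subject to the constraints in (56) can then be delegated to any quadratic programming solver.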