# Near-Optimal Linear Regression under Distribution Shift

Qi Lei, Wei Hu, Jason D. Lee (Princeton University)

*Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).*

**Abstract.** Transfer learning is essential when sufficient labeled data comes from the source domain but labeled data from the target domain is scarce. We develop estimators that achieve minimax linear risk for linear regression problems under distribution shift. Our algorithms cover different transfer learning settings including covariate shift and model shift. We also consider the cases where data are generated from either linear or general nonlinear models. We show that linear minimax estimators are within an absolute constant of the minimax risk even among nonlinear estimators for various source/target distributions.

## 1. Introduction

The success of machine learning crucially relies on the availability of labeled data. The data labeling process usually requires extensive human labor and can be very expensive and time-consuming, especially for large datasets like ImageNet (Deng et al., 2009). On the other hand, models trained on one dataset, despite performing well on test data from the same distribution they were trained on, are often sensitive to distribution shifts, i.e., they do not adapt well to related but different distributions. Even a small distributional shift can result in substantial performance degradation (Recht et al., 2018; Lu et al., 2020).

Transfer learning has been an essential paradigm to tackle the challenges associated with insufficient labeled data (Pan & Yang, 2009; Weiss et al., 2016; Long et al., 2017). The main idea is to make use of a source domain with plentiful labeled data (e.g., ImageNet) and to learn a model that performs well on the target domain (e.g., medical images) where few or no labels are available. Despite the lack of labeled data, we may still use unlabeled data from the target domain, which is usually much easier to obtain and can provide helpful information about the marginal distribution of the target domain.

Although this approach is integral to many applications, many fundamental questions are left open even in very basic settings. In this work, we focus on the setting of linear regression under distribution shift and ask the fundamental question of how to optimally learn a linear model for the target domain, using labeled data from a source domain together with unlabeled data (and possibly limited labeled data) from the target domain. We design a two-stage meta-algorithm that addresses this problem in various settings, including covariate shift (i.e., when $p(x)$ changes) and model shift (i.e., when $p(y\mid x)$ changes). Following the meta-algorithm, we develop estimators that achieve near-minimax risk (up to universal constant factors) among all linear estimation rules under some standard data concentration properties. Here linear estimators refer to all estimators that depend linearly on the label vector; these include almost all popular estimators for linear regression, such as ridge regression and its variants. When the second-moment matrices of the input variables in the source and target domains commute, we prove that our estimators achieve near-minimax risk among all possible estimators. We also provide a separation result demonstrating that our algorithm can be better than ridge regression by a multiplicative factor of $O(d^{1/4})$.
A crucial insight from our results is that when covariate shift is present, we need to apply data-dependent regularization that adapts to the change in the input distribution. For linear regression, this change is characterized by the input covariances of the source and target tasks, which can be estimated from unlabeled data. Our experiments verify that our estimator yields significant improvement over ridge regression and similar heuristics.

### 1.1. Related work

Different types of distribution shift are introduced in Storkey (2009); Quionero-Candela et al. (2009). Specifically, covariate shift occurs when the marginal distribution $P(x)$ changes from the source to the target domain (Heckman, 1979; Shimodaira, 2000; Huang et al., 2007). Wang et al. (2014); Wang & Schneider (2015) tackle model shift (a change in $P(y\mid x)$) provided the change is smooth as a function of $x$. Sun et al. (2011) design a two-stage reweighting method based on both covariate shift and model shift. Other methods, such as change of representation and adaptation through instance pruning, are proposed in Jiang & Zhai (2007). In this work, we focus on the above two kinds of distribution shift. Other distribution shift settings involving label/target shift (a change in $P(y)$) and conditional shift (a change in $P(x\mid y)$) are beyond the scope of this paper; some prior work focuses on these settings (see Saerens et al., 2002; Zhang et al., 2013; Lipton et al., 2018, and references therein). For instance, Zhang et al. (2013) exploit the benefit of multi-layer adaptation via a location-scale transformation on $x$.

Transfer learning and domain adaptation are sub-fields of machine learning that cope with distribution shift. A variety of prior work falls into the following categories. 1) Importance reweighting is mostly used for covariate shift (Shimodaira, 2000; Huang et al., 2007; Cortes et al., 2010). 2) One fruitful line of work focuses on exploring robust/causal features or domain-invariant representations (Wu et al., 2019), through invariant risk minimization (Arjovsky et al., 2019), distributionally robust minimization (Sagawa et al., 2019), human annotation (Srivastava et al., 2020), adversarial training (Long et al., 2017; Ganin et al., 2016), or by minimizing domain discrepancy measured by some distance metric (Pan et al., 2010; Long et al., 2013; Baktashmotlagh et al., 2013; Gong et al., 2013; Zhang et al., 2013; Wang & Schneider, 2014). 3) Several approaches seek gradual domain adaptation (Gopalan et al., 2011; Gong et al., 2012; Glorot et al., 2011; Kumar et al., 2020) through self-training or a gradual change in the training distribution.

Near-minimax estimation is introduced in Donoho (1994) for linear regression problems with Gaussian noise. For a more general setting, Juditsky et al. (2009) estimate linear functionals using convex programming. Blaker (2000) compares ridge regression with a minimax linear estimator under weighted squared error. Kalan et al. (2020) consider a setting similar to ours, namely minimax estimation under distribution shift, but focus on lower bounds for linear and one-hidden-layer neural network models. A few more interesting results are derived by minimizing generalization error bounds for distribution shift under various settings (David et al., 2010; Hanneke & Kpotufe, 2019; Ben-David et al., 2010; Zhao et al., 2019).

## 2. Preliminary

We formalize the setting considered in this paper for transfer learning under distribution shift.

**Notation and setup.**
Let $p_S(x)$ and $p_T(x)$ be the marginal distributions of $x$ in the source and target domains. The associated second-moment matrices are $\Sigma_S := \mathbb{E}_{p_S}[xx^\top]$ and $\Sigma_T := \mathbb{E}_{p_T}[xx^\top]$. Labeled data $(x, y)$ satisfies $\mathbb{E}_{p_S}[y\mid x] = \mathbb{E}_{p_T}[y\mid x] = f^*(x)$ and $y = f^*(x) + z$ with Gaussian noise $z \sim N(0, \sigma^2)$. We consider both a linear data-generation model ($f^*(x) := \mathbb{E}[y\mid x] = x^\top\beta^*$) and general nonlinear models. When the optimal linear model changes from the source to the target domain, we add a subscript for distinction, i.e., $\beta^*_S$ and $\beta^*_T$. We use bold symbols ($\mathbf{x}$) for vectors, lower-case letters ($x$) for scalars, and capital letters ($A$) for matrices.

We observe $n_S$ and $n_T$ labeled samples from the source and target domains, and $n_U$ unlabeled target samples. Labeled data is scarce in the target domain: $n_S \gg n_T$, and $n_T$ can be $0$. Specifically, data is collected as $X_S = [x_1 \mid x_2 \mid \cdots \mid x_{n_S}]^\top \in \mathbb{R}^{n_S\times d}$ with $x_i$, $i \in [n_S]$, drawn from $p_S$, and noise $z = [z_1, z_2, \ldots, z_{n_S}]^\top$, $z_i \sim N(0, \sigma^2)$. The labels are $y_S = [y_1, y_2, \ldots, y_{n_S}]^\top \in \mathbb{R}^{n_S}$; $y_T \in \mathbb{R}^{n_T}$ and $X_U \in \mathbb{R}^{n_U\times d}$ are defined similarly. Denote by $\hat\Sigma_S = X_S^\top X_S/n_S$ the empirical second-moment matrix. The positive part of a number is denoted by $(x)_+$.

**Minimax (linear) risk.** In this work, we focus on designing linear estimators $\hat\beta: \mathbb{R}^{n_S} \to \mathbb{R}^d$, $y_S \mapsto A y_S$,[^1] for $\beta^*_T \in B$. Here $\beta^*_T$ is the optimal linear model in the target domain, $\beta^*_T := \arg\min_{\beta} \mathbb{E}_{x\sim p_T,\, z\sim N(0,\sigma^2)}\big[(f^*(x) + z - x^\top\beta)^2\big]$.[^2] Our estimator is evaluated by the excess risk on the target domain, with the worst-case $\beta^*_T$ in some set $B$:
$$L_B(\hat\beta) = \max_{\beta^*_T\in B}\ \mathbb{E}_{y_S}\,\mathbb{E}_{x\sim p_T}\big(x^\top(\hat\beta(y_S) - \beta^*_T)\big)^2.$$
The minimax linear risk and the minimax risk among all estimators are respectively defined as
$$R_L(B) \equiv \min_{\hat\beta\ \text{linear in}\ y_S} L_B(\hat\beta); \qquad R_N(B) \equiv \min_{\hat\beta} L_B(\hat\beta).$$
The subscript "N" or "L" is a mnemonic for "non-linear" or "linear" estimators. $R_N$ is the optimal risk with no restriction placed on the class of estimators; $R_L$ only considers the linear function class for $\hat\beta$. A minimax linear estimator and a minimax estimator are estimators that respectively attain $R_L$ and $R_N$ within universal multiplicative constants. Normally we only consider $B = \{\beta \mid \|\beta\|_2 \le r\}$. When there is no ambiguity, we abbreviate $\hat\beta(y_S)$ as $\hat\beta$.

[^1]: $A \in \mathbb{R}^{d\times n_S}$ may depend in an arbitrary way on $X_S$, $n_S$, or $\Sigma_T$. The estimator is linear in the observation $y_S$.
[^2]: We do not distinguish linear and affine regression, since one can simply add a constant coordinate to $x$ to account for the intercept.

**Our meta-algorithm.** Our paper considers different settings with distribution shift. Our methods are unified under the following meta-algorithm:

Step 1: Construct an unbiased sufficient statistic $\hat\beta_{SS}$[^3] for the unknown parameter.

Step 2: Construct $\hat\beta_{MM}$, a linear function of the sufficient statistic $\hat\beta_{SS}$, that minimizes $L_B(\hat\beta_{MM})$.

For each setting, we will show that $\hat\beta_{MM}$ achieves the minimax linear risk $R_L$. Furthermore, under some conditions, the minimax risk $R_N$ is uniformly lower bounded by a universal constant times $L_B(\hat\beta_{MM})$.

[^3]: With samples $y_S$, a statistic $t = T(y_S)$ is sufficient for the underlying parameter $\beta^*$ if the conditional probability distribution of the data $y_S$, given the statistic $t = T(y_S)$, does not depend on the parameter $\beta^*$.

**Outline.** In the sections below, we tackle the problem in several different settings. In Section 3, we design algorithms with only covariate shift and a linear data-generation model ($f^*$ is linear): unsupervised domain adaptation ($n_T = 0$) in Section 3.1, and supervised domain adaptation ($n_T > 0$) in Section 3.4. Section 4 is about linear regression with approximation error ($n_T = 0$ and $f^*(x)$ a general nonlinear function). Finally, we consider model shift for linear models ($\beta^*_S \neq \beta^*_T$) in Section 5.
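To make the notation concrete, the following minimal sketch (our own illustrative setup, not from the paper: Gaussian covariates, a linear $f^*$, and arbitrary diagonal spectra) generates source data and evaluates the target excess risk $\mathbb{E}_{x\sim p_T}\big(x^\top(\hat\beta-\beta^*_T)\big)^2 = \|\Sigma_T^{1/2}(\hat\beta-\beta^*_T)\|_2^2$ for a candidate estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_S, sigma = 5, 1000, 1.0

# Hypothetical source/target second-moment matrices (diagonal for simplicity).
Sigma_S = np.diag(1.0 / np.arange(1, d + 1))
Sigma_T = np.diag(np.arange(1, d + 1, dtype=float))

beta_star = rng.normal(size=d)                         # ground-truth linear model
X_S = rng.multivariate_normal(np.zeros(d), Sigma_S, size=n_S)
y_S = X_S @ beta_star + sigma * rng.normal(size=n_S)   # y = x^T beta* + z

def target_excess_risk(beta_hat, beta_star, Sigma_T):
    """E_{x~p_T}(x^T(beta_hat - beta*))^2 = ||Sigma_T^{1/2}(beta_hat - beta*)||^2."""
    diff = beta_hat - beta_star
    return float(diff @ Sigma_T @ diff)

beta_ols = np.linalg.solve(X_S.T @ X_S, X_S.T @ y_S)   # ordinary least squares on source data
print(target_excess_risk(beta_ols, beta_star, Sigma_T))
```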
## 3. Covariate shift with linear models

In this section, we consider the setting with only covariate shift and a linear $f^*$. That is, only $\Sigma_S$ (the marginal distribution $p_S(x)$) changes to $\Sigma_T$ (the marginal distribution $p_T(x)$), while $f^* = \mathbb{E}[y\mid x] = x^\top\beta^*$ (the conditional distribution $p(y\mid x)$) is shared.

### 3.1. Unsupervised domain adaptation with linear models

We observe $n_S$ samples from the source domain, $y_S = X_S\beta^* + z$, $z \sim N(0, \sigma^2 I)$, and only unlabeled samples $X_U$ from the target domain. Our goal is to find the minimax linear estimator $\hat\beta_{MM}(y_S) = A y_S$ with some linear mapping $A$ that attains $R_L(B)$.[^4] Following our meta-algorithm, let $\hat\beta_{SS} = \frac{1}{n_S}\hat\Sigma_S^{-1}X_S^\top y_S$[^5] be an unbiased sufficient statistic for $\beta^*$:
$$\hat\beta_{SS} = \frac{1}{n_S}\hat\Sigma_S^{-1}X_S^\top X_S\beta^* + \frac{1}{n_S}\hat\Sigma_S^{-1}X_S^\top z = \beta^* + \frac{1}{n_S}\hat\Sigma_S^{-1}X_S^\top z \;\sim\; N\Big(\beta^*,\ \frac{\sigma^2}{n_S}\hat\Sigma_S^{-1}\Big).$$
The fact that $\hat\beta_{SS}(y_S)$ is a sufficient statistic for $\beta^*$ is proven in Claim 3.9 for a more general case, using the Fisher–Neyman factorization theorem. We prove that the minimax linear estimator is of the form $\hat\beta_{MM} = C\hat\beta_{SS}$ and then design algorithms that compute the optimal $C$.

**Claim 3.1.** *The minimax linear estimator is of the form $\hat\beta_{MM} = C\hat\beta_{SS}$ for some $C \in \mathbb{R}^{d\times d}$.*

[^4]: For a linear estimator $\hat\beta = Ay_S$, $y_S$ is the only source of randomness; $A$ depends on $X_S$ and $n_S$, which are considered fixed.
[^5]: Throughout the paper $\hat\Sigma_S^{-1}$ can be replaced by the pseudo-inverse, and our algorithm also applies when $n < d$.

**Warm-up: commutative second-moment matrices.** In order to derive the minimax linear estimator, we first consider the simple case when $\Sigma_T$ and $\hat\Sigma_S$ are simultaneously diagonalizable. We note that under this setting, minimax estimation under covariate shift reduces to the well-studied problem of finding a minimax linear estimator under weighted square loss (see, e.g., Blaker (2000)). One can apply Pinsker's Theorem (Johnstone, 2011) to obtain the estimator and the minimax risk in closed form:

**Theorem 3.2** (Linear Minimax Risk with Covariate Shift). *Suppose the observations follow the sequence model $y_S = X_S\beta^* + z$, $z \sim N(0, \sigma^2 I_n)$, and that $\Sigma_T = U\,\mathrm{diag}(t)\,U^\top$ and $\hat\Sigma_S \equiv X_S^\top X_S/n_S = U\,\mathrm{diag}(s)\,U^\top$ share the eigenbasis $U$. Then the minimax linear risk $R_L(B) \equiv \min_{\hat\beta = Ay_S}\max_{\beta^*\in B}\mathbb{E}\,\|\Sigma_T^{1/2}(\hat\beta - \beta^*)\|^2$ over $B = \{\beta \mid \|\beta\| \le r\}$ is attained by the linear minimax estimator*
$$\hat\beta_{MM} = \Sigma_T^{-1/2}\,U\big(I - \mathrm{diag}(\lambda/\sqrt{t})\big)_+\,U^\top\,\Sigma_T^{1/2}\,\hat\beta_{SS}, \tag{2}$$
*where $\hat\beta_{SS} = \frac{1}{n_S}\hat\Sigma_S^{-1}X_S^\top y_S$ and $\lambda = \lambda(r)$ is determined by*
$$\frac{\sigma^2}{n_S}\sum_{i=1}^d \frac{1}{s_i}\Big(\frac{\sqrt{t_i}}{\lambda} - 1\Big)_+ = r^2.$$

Since $r$ is unknown in practice, we can simply treat either $r$, or $\lambda$ directly, as the tuning parameter. We compare the role of $\lambda$ with that of ridge regression:
$$\hat\beta^\lambda_{RR} = \arg\min_{\hat\beta}\ \frac{1}{2n_S}\|X_S\hat\beta - y_S\|^2 + \frac{\lambda}{2}\|\hat\beta\|^2 = (\hat\Sigma_S + \lambda I)^{-1}X_S^\top y_S/n_S.$$
For both algorithms, $\lambda$ balances bias and variance: $\lambda = 0$ gives an unbiased estimator, and a large $\lambda$ gives a (near-)zero estimator with no variance. The difference is that the minimax linear estimator shrinks signal directions based on the value of $t_i$, since the risk in those directions is downweighted in the target loss: the estimator tends to sacrifice the signal directions where $t_i$ is smaller. Ridge regression, however, responds to the value of $s_i$. A natural counterpart is for ridge to also regularize based on $t$:
$$\hat\beta^\lambda_{RR,T} = \arg\min_{\beta}\ \|\Sigma_T^{1/2}(\beta - \hat\beta_{SS})\|^2 + \lambda\|\beta\|^2 = (\Sigma_T + \lambda I)^{-1}\Sigma_T\hat\beta_{SS}.$$
We will compare their performances in the experimental section.
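As a concrete illustration of Theorem 3.2, the following sketch computes the estimator (2) in the commutative case. It finds $\lambda(r)$ by bisection on the calibration equation and uses the fact that, in the shared eigenbasis, (2) reduces to coordinate-wise shrinkage by $(1 - \lambda/\sqrt{t_i})_+$. The function and variable names are our own, not from the paper's code.

```python
import numpy as np

def minimax_linear_commutative(beta_ss, U, t, s, sigma, n_S, r):
    """Minimax linear estimator of Theorem 3.2 when Sigma_T = U diag(t) U^T and
    hat-Sigma_S = U diag(s) U^T share the eigenbasis U (all eigenvalues > 0, r > 0).

    In the eigenbasis, Eq. (2) is coordinate-wise shrinkage:
        beta_MM = U diag((1 - lam/sqrt(t_i))_+) U^T beta_SS,
    with lam chosen so that (sigma^2/n_S) * sum_i (1/s_i)(sqrt(t_i)/lam - 1)_+ = r^2.
    """
    def bias_budget(lam):
        return sigma**2 / n_S * np.sum(np.maximum(np.sqrt(t) / lam - 1.0, 0.0) / s)

    # bias_budget is decreasing in lam and vanishes at lam = max_i sqrt(t_i);
    # bisect for the root of bias_budget(lam) = r^2.
    lo, hi = 1e-12, float(np.max(np.sqrt(t)))
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if bias_budget(mid) > r**2:
            lo = mid          # lam too small: too much of the budget used
        else:
            hi = mid
    lam = hi
    shrink = np.maximum(1.0 - lam / np.sqrt(t), 0.0)   # (1 - lam/sqrt(t_i))_+
    return U @ (shrink * (U.T @ beta_ss))
```

In practice one would treat `lam` (or `r`) as a tuning parameter and select it on held-out target data, as discussed above.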
**Non-commutative second-moment matrices.** For non-commutative second-moment shift, we follow the same procedure. Our estimator is obtained by optimizing over $C$ in $\hat\beta_{MM} = C\hat\beta_{SS}$:
$$R_L(B) \equiv \min_{\hat\beta = Ay_S}\ \max_{\beta^*\in B}\ \mathbb{E}\,\|\Sigma_T^{1/2}(\hat\beta - \beta^*)\|_2^2 \overset{\text{(Claim 3.1)}}{=} \min_{\hat\beta = C\hat\beta_{SS}}\ \max_{\|\beta^*\|\le r}\ \Big\{\|\Sigma_T^{1/2}(C - I)\beta^*\|_2^2 + \frac{\sigma^2}{n_S}\mathrm{Tr}\big(\Sigma_T^{1/2}C\hat\Sigma_S^{-1}C^\top\Sigma_T^{1/2}\big)\Big\}$$
$$= \min_{C,\,\tau}\ \Big\{ r^2\tau + \frac{\sigma^2}{n_S}\mathrm{Tr}\big(\Sigma_T^{1/2}C\hat\Sigma_S^{-1}C^\top\Sigma_T^{1/2}\big)\Big\}, \qquad \text{s.t. } (C - I)^\top\Sigma_T(C - I) \preceq \tau I. \tag{3}$$
Unlike the commutative case, this problem does not have a closed-form solution, but it is still computable:

**Proposition 3.3.** *Problem (3) is a convex program and computable in polynomial time.*

We achieve a near-optimal minimax risk among all estimators under some conditions:

**Theorem 3.4** (Near-minimaxity of linear estimators). *When $\Sigma_S$ and $\Sigma_T$ commute, or $\Sigma_T$ is rank 1, the best linear estimator from (2) or (3) achieves near-optimal minimax risk: $L_B(\hat\beta_{MM}) = R_L(B) \le 1.25\,R_N(B)$.*

Note that $R_N \le R_L$ by definition. Therefore 1) our estimator $\hat\beta_{MM}$ is near-optimal, and 2) our lower bound for $R_N$ is tight. Lower bounds (without matching upper bounds) for the general non-commutative problem are presented in Kalan et al. (2020); we improve their result for the commutative case and provide a matching algorithm. Their lower bound scales with $\frac{d}{n_S}\min_i \frac{t_i}{s_i}$ for large $r$, while ours becomes $\frac{1}{n_S}\sum_i \frac{t_i}{s_i}$. Our lower bound is always larger and thus tighter, and potentially arbitrarily larger when $\max_i \frac{t_i}{s_i}$ and $\min_i \frac{t_i}{s_i}$ are very different. We defer our proof to the appendix.

### 3.2. Connection to ridge regression

From a probabilistic perspective, ridge regression is equivalent to maximum a posteriori (MAP) inference with a Gaussian prior $\beta^* \sim N(0, r^2 I)$ (see, e.g., Murphy (2012)). Similarly, instead of the worst-case risk $L_B(\hat\beta) := \max_{\beta^*\in B}\mathbb{E}_{y_S}\|\Sigma_T^{1/2}(\hat\beta(y_S) - \beta^*)\|^2$, one could also study the average-case objective $\bar L_B(\hat\beta) := \mathbb{E}_{\beta^*\sim N(0, r^2 I)}\mathbb{E}_{y_S}\|\Sigma_T^{1/2}(\hat\beta(y_S) - \beta^*)\|^2$ instead. With distribution shift, the performance is evaluated on $\Sigma_T$ instead of $\Sigma_S$. Interestingly, with a Gaussian prior this does not yield an algorithm different from the original ridge regression.

**Proposition 3.5.** *The optimal estimator under the Gaussian prior $\beta^* \sim N(0, r^2 I)$ evaluated on $p_T$ is*
$$\hat\beta \in \arg\min_{\hat\beta = Ay_S}\ \mathbb{E}_{\beta^*\sim N(0, r^2 I)}\,\mathbb{E}_{y_S}\,\mathbb{E}_{x\sim p_T}\big(x^\top(\hat\beta - \beta^*)\big)^2 = \Big(\frac{\sigma^2}{r^2 n_S}I + \hat\Sigma_S\Big)^{-1}X_S^\top y_S/n_S$$
$$\in \arg\min_{\hat\beta}\ \frac{1}{2n_S}\|X_S\hat\beta - y_S\|^2 + \frac{\lambda}{2}\|\hat\beta\|^2 = (\hat\Sigma_S + \lambda I)^{-1}\hat\Sigma_S\hat\beta_{SS} =: \hat\beta^\lambda_{RR}, \qquad \text{when } \lambda = \sigma^2/(n_S r^2).$$
*Namely, the average-case best linear estimator with a Gaussian prior is equivalent to ridge regression with regularization strength $\lambda = \frac{\sigma^2}{n_S r^2}$: the variance ratio between the noise distribution and the prior distribution.*

Even though ridge regression achieves the optimal risk in the average sense, it can be much worse than the minimax linear estimator in the worst case. We prove a separation result on a specific example (deferred to the appendix).

**Remark 3.1** (Benefit of the minimax linear estimator). *There is an example where $R_L(B) \le O\big(d^{-1/4}\,L_B(\hat\beta^\lambda_{RR})\big)$ even with the optimal hyperparameter $\lambda$.*[^6]

[^6]: Note this also means our method can be order-wise better than ordinary least squares, which is the special case of ridge regression with $\lambda = 0$.

**Adaptation of the prior distribution.** For specific problems, one should adjust the prior distribution instead of simply assuming $\beta^* \sim N(0, r^2 I)$. If one replaces the prior by $\beta^* \sim N(\hat\beta_{SS}, r^2 I)$, one obtains another heuristic method:

**Proposition 3.6.** *Let $\hat\beta_{SS}$ be the ordinary least squares estimator $\hat\beta_{SS} = \hat\Sigma_S^{-1}X_S^\top y_S/n_S$. The optimal estimator under the Gaussian prior $\beta^* \sim N(\hat\beta_{SS}, r^2 I)$ evaluated on $p_T$ is*
$$\hat\beta \in \arg\min_{\hat\beta = Ay_S}\ \mathbb{E}_{\beta^*\sim N(\hat\beta_{SS}, r^2 I)}\,\mathbb{E}_{y_S}\,\mathbb{E}_{x\sim p_T}\big(x^\top(\hat\beta - \beta^*)\big)^2 = \Big(\frac{\sigma^2}{r^2 n_S}I + \Sigma_T\Big)^{-1}\Sigma_T\,\hat\Sigma_S^{-1}X_S^\top y_S/n_S$$
$$\in \arg\min_{\beta}\ \|\Sigma_T^{1/2}(\beta - \hat\beta_{SS})\|^2 + \lambda\|\beta\|^2 = (\Sigma_T + \lambda I)^{-1}\Sigma_T\hat\beta_{SS} =: \hat\beta^\lambda_{RR,T}, \qquad \text{when } \lambda = \sigma^2/(n_S r^2).$$
Comparing the closed-form estimator $\hat\beta^\lambda_{RR,T} := (\Sigma_T + \lambda I)^{-1}\Sigma_T\hat\beta_{SS}$ to the original ridge regression $\hat\beta^\lambda_{RR} := (\hat\Sigma_S + \lambda I)^{-1}\hat\Sigma_S\hat\beta_{SS}$, we see that this algorithm regularizes $\hat\beta$ based on the signal strength of the target distribution, and that it is equivalent to ridge regression with the prior distribution re-centered at $\hat\beta_{SS}$, the unbiased estimator of the ground truth $\beta^*$. We will compare both methods with our minimax estimator in the experimental section.

### 3.3. Minimax linear estimator with finite unlabeled samples from the target domain

In practice, we have finite unlabeled samples $X_U \in \mathbb{R}^{n_U\times d}$, with empirical second-moment matrix $\hat\Sigma_U = X_U^\top X_U/n_U$. Let $\hat L_B$ denote the worst-case excess risk measured on the observed target samples: $\hat L_B(\hat\beta) = \max_{\beta^*\in B}\mathbb{E}_{y_S}\frac{1}{n_U}\|X_U(\hat\beta(y_S) - \beta^*)\|^2$. To find the best linear estimator that minimizes $\hat L_B$, our proposed algorithm becomes
$$\min_{C,\,\tau}\ \Big\{ r^2\tau + \frac{\sigma^2}{n_S}\mathrm{Tr}\big(C\hat\Sigma_S^{-1}C^\top\hat\Sigma_U\big)\Big\}, \qquad \text{s.t. } (C - I)^\top\hat\Sigma_U(C - I) \preceq \tau I. \tag{4}$$
Let $\hat\beta = \hat C\hat\Sigma_S^{-1}X_S^\top y_S/n_S$. We want to show that, despite the estimation error caused by replacing $\Sigma_T$ with $\hat\Sigma_U$, our estimator $\hat\beta$ still achieves the minimax linear risk (up to a constant multiplicative factor). For simplicity, in this section we assume the input samples are centered: $\mathbb{E}_{p_S}[x] = \mathbb{E}_{p_T}[x] = 0$. This assumption results in no loss of generality: since the sample mean is more sample-efficient to estimate than the covariance matrix, one can first estimate the mean and center the data. We assume a standard light-tail property on the target samples:

**Definition 3.7** ($\rho^2$-subgaussian distribution). *We call a zero-mean distribution $p$ $\rho^2$-subgaussian when there exists $\rho > 0$ such that the random vector $\bar x \sim \bar p$ is $\rho^2$-subgaussian,[^7] where $\bar p$ is the whitening of $p$, i.e., $x \sim p$ is equivalent to $x = \Sigma^{1/2}\bar x$ with $\bar x \sim \bar p$ and $\Sigma = \mathbb{E}_p[xx^\top]$.*

Note that $\rho$ is defined on the whitened data. It does not scale with $\|\Sigma\|_{\mathrm{op}}$ and should be viewed as a universal constant.

**Theorem 3.8.** *Fix a failure probability $\delta \in (0, 1)$. Suppose the target distribution $p_T$ is $\rho^2$-subgaussian and the target sample size satisfies $n_U \gtrsim \rho^4(d + \log\frac{1}{\delta})$. Let $\hat\beta: y_S \mapsto \hat C\hat\Sigma_S^{-1}X_S^\top y_S/n_S$, where $\hat C$ is defined by Eqn. (4). Then with probability at least $1 - \delta$ over the unlabeled samples from the target domain, and for each fixed $X_S$ from the source domain, our learned estimator $\hat\beta(y_S)$ satisfies*
$$L_B(\hat\beta) \le \bigg(1 + O\Big(\sqrt{\tfrac{\rho^4(d + \log(1/\delta))}{n_U}}\Big)\bigg)R_L(B). \tag{5}$$
*When $\Sigma_T$ commutes with $\hat\Sigma_S$ or is rank 1, we have*
$$L_B(\hat\beta) \le \bigg(1.25 + O\Big(\sqrt{\tfrac{\rho^4(d + \log(1/\delta))}{n_U}}\Big)\bigg)R_N(B). \tag{6}$$

Similarly, all other results in the paper can be extended to $\hat\beta \in \arg\min \hat L_B(\cdot)$, the estimator obtained with finite target samples $X_U$.

**Remark 3.2** (Incorporating the randomness of the source data). The definition of a linear estimator naturally treats $X_S$ as fixed, and Theorem 3.4 compares our estimator with the optimal nonlinear estimator using the same data $X_S$ from the source domain. In Appendix D, we compare our estimator with an even stronger class of estimators with infinite access to $p_S$ and show that our estimator is still within a multiplicative constant factor of it.

[^7]: A random vector $x$ is called $\rho^2$-subgaussian if for any fixed unit vector $v$ of the same dimension, the random variable $v^\top x$ is $\rho^2$-subgaussian, i.e., $\mathbb{E}[e^{s\,v^\top(x - \mathbb{E}[x])}] \le e^{s^2\rho^2/2}$ for all $s \in \mathbb{R}$.
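For the non-commutative programs (3) and (4) there is no closed form, but they are small convex programs. The following CVXPY sketch solves (4) (replacing $\hat\Sigma_U$ by $\Sigma_T$ gives (3)); it rewrites the PSD constraint as the equivalent operator-norm constraint $\|\hat\Sigma_U^{1/2}(C - I)\|_{\mathrm{op}} \le \sqrt{\tau}$, which keeps the problem DCP-compliant. This is our own illustrative formulation, not the authors' released code.

```python
import numpy as np
import cvxpy as cp

def psd_sqrt(M):
    """Symmetric PSD square root via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def minimax_linear_C(Sigma_S_hat, Sigma_U_hat, sigma, n_S, r, solver=cp.SCS):
    """Solve program (4): min_{C,tau} r^2 tau + (sigma^2/n_S) Tr(C Sigma_S^{-1} C^T Sigma_U)
    s.t. (C-I)^T Sigma_U (C-I) <= tau I, reparametrized with s = sqrt(tau)."""
    d = Sigma_S_hat.shape[0]
    Su_half = psd_sqrt(Sigma_U_hat)
    Ssinv_half = psd_sqrt(np.linalg.inv(Sigma_S_hat))

    C = cp.Variable((d, d))
    s = cp.Variable(nonneg=True)  # s = sqrt(tau): worst-case bias amplitude
    # Tr(C Sigma_S^{-1} C^T Sigma_U) = ||Sigma_U^{1/2} C Sigma_S^{-1/2}||_F^2
    variance = cp.sum_squares(Su_half @ C @ Ssinv_half)
    objective = cp.Minimize(r**2 * cp.square(s) + sigma**2 / n_S * variance)
    constraints = [cp.sigma_max(Su_half @ (C - np.eye(d))) <= s]
    cp.Problem(objective, constraints).solve(solver=solver)
    return C.value

# Usage: beta_MM = C_hat @ beta_SS, with beta_SS = Sigma_S_hat^{-1} X_S^T y_S / n_S.
```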
### 3.4. Utilizing source and target labeled data jointly

In some scenarios, we have a moderate amount of labeled data from the target domain as well. In such cases, it is important to utilize the source and target labeled data jointly. Let $y_S = X_S\beta^* + z_S$ and $y_T = X_T\beta^* + z_T$. We consider $X_S, X_T$ as deterministic, so $\hat\Sigma_S^{-1}X_S^\top y_S/n_S \sim N(\beta^*, \frac{\sigma^2}{n_S}\hat\Sigma_S^{-1})$ and $\hat\Sigma_T^{-1}X_T^\top y_T/n_T \sim N(\beta^*, \frac{\sigma^2}{n_T}\hat\Sigma_T^{-1})$. Therefore, conditioned on the observations $y_S, y_T$, a sufficient statistic for $\beta^*$ is $\hat\beta_{SS} := (n_S\hat\Sigma_S + n_T\hat\Sigma_T)^{-1}(X_S^\top y_S + X_T^\top y_T)$.

**Claim 3.9.** *$\hat\beta_{SS}$ is an unbiased sufficient statistic of $\beta^*$ given the samples $y_S, y_T$, and $\hat\beta_{SS} \sim N\big(\beta^*, \sigma^2(n_S\hat\Sigma_S + n_T\hat\Sigma_T)^{-1}\big)$.*

**Algorithm:** First consider the estimator $\hat\beta_{SS} = (n_S\hat\Sigma_S + n_T\hat\Sigma_T)^{-1}(X_S^\top y_S + X_T^\top y_T)$. Next find the best linear function of $\hat\beta_{SS}$:
$$\hat\beta_{MM} = \arg\min_{C,\,\tau}\ r^2\tau + \sigma^2\,\mathrm{Tr}\big((n_S\hat\Sigma_S + n_T\hat\Sigma_T)^{-1}C^\top\Sigma_T C\big), \qquad \text{s.t. } (C - I)^\top\Sigma_T(C - I) \preceq \tau I.$$

**Proposition 3.10.** *The minimax estimator $\hat\beta_{MM}$ is of the form $C\hat\beta_{SS}$ for some $C$. When $C$ is chosen by our proposed algorithm and $\hat\Sigma_S$ commutes with $\hat\Sigma_T$ and $\Sigma_T$, we achieve the minimax risk: $R_L(B) \le 1.25\,R_N(B)$.*

## 4. Covariate shift with approximation error

Now we consider observations coming from nonlinear models: $y_S = f^*(X_S) + z$. Let $\beta^*_S = \arg\min_\beta \mathbb{E}_{x\sim p_S,\, z\sim N(0,\sigma^2)}[(f^*(x) + z - \beta^\top x)^2]$, and similarly for $\beta^*_T$. Notice that now, even with $f^*$ unchanged across domains, the input distribution affects the best linear model. The approximation error on the source domain is $a_S(x) := f^*(x) - x^\top\beta^*_S$, and similarly for $a_T$. Define the reweighting vector $w \in \mathbb{R}^{n_S}$ by $w_i = p_T(x_i)/p_S(x_i)$. We form an unbiased estimator via
$$\hat\beta_{LS} = \arg\min_{\beta}\ \Big\{\sum_i \frac{p_T(x_i)}{p_S(x_i)}(\beta^\top x_i - y_i)^2\Big\} = \big(X_S^\top\mathrm{diag}(w)X_S\big)^{-1}\big(X_S^\top\mathrm{diag}(w)y_S\big).$$

**Claim 4.1.** *$\hat\beta_{LS}$ is asymptotically unbiased and normally distributed with covariance matrix $M := \Sigma_T^{-1}\,\mathbb{E}_{x\sim p_T}\big[\frac{p_T(x)}{p_S(x)}(a_T(x)^2 + \sigma^2)\,xx^\top\big]\,\Sigma_T^{-1}$:*
$$\sqrt{n_S}\,(\hat\beta_{LS} - \beta^*_T) \xrightarrow{d} N(0, M).$$

Note that large importance weights greatly inflate the variance of the estimator, especially when $p_T/p_S$ blows up somewhere. Therefore, we design an algorithm to cope with the inflated variance. Again we want to minimize the worst-case risk, which asymptotically (using Claim 4.1) becomes
$$\min_{\hat\beta = C\hat\beta_{LS}}\ \max_{\beta^*_T\in B}\ \mathbb{E}\,\|\Sigma_T^{1/2}(\hat\beta - \beta^*_T)\|^2 \;\approx\; \min_{C}\ \max_{\|\beta^*_T\|\le r}\ \|\Sigma_T^{1/2}(C - I)\beta^*_T\|_2^2 + \frac{1}{n_S}\mathrm{Tr}(CMC^\top\Sigma_T)$$
$$= \min_{C}\ \big\|(C - I)^\top\Sigma_T(C - I)\big\|_2\, r^2 + \frac{1}{n_S}\mathrm{Tr}(CMC^\top\Sigma_T).$$
With $\hat\beta_{LS}$ computed beforehand, one can first estimate $M$ by
$$\hat M := \frac{1}{n_S}\sum_i \Sigma_T^{-1}\,\frac{p_T^2(x_i)}{p_S^2(x_i)}\,(y_i - x_i^\top\hat\beta_{LS})^2\,x_i x_i^\top\,\Sigma_T^{-1}.$$
Our estimator is then $\hat\beta_{MM} \equiv \hat C\hat\beta_{LS}$, where $\hat C$ solves
$$\hat C \in \arg\min_{\tau,\,C}\ \Big\{ r^2\tau + \frac{1}{n_S}\mathrm{Tr}(C\hat MC^\top\Sigma_T)\Big\}, \qquad \text{s.t. } (C - I)^\top\Sigma_T(C - I) \preceq \tau I. \tag{7}$$

**Claim 4.2.** *Let $B = \{\beta \mid \|\beta\| \le r\}$, and let $f^* \in \mathcal{F}$ for some compact symmetric function class $\mathcal{F}$ ($f \in \mathcal{F} \Rightarrow -f \in \mathcal{F}$). Then the linear minimax estimator is of the form $C\hat\beta_{LS}$ for some $C$. When $\hat C$ solves Eqn. (7), $L_B(\hat\beta_{MM})$ asymptotically matches $R_L(B)$, the linear minimax risk.*

By reducing from $y_S$ to $\hat\beta_{LS}$ we eliminate $n_S - d$ dimensions, and this claim says that $X_S^\top y_S$ is sufficient for predicting $\beta^*_T$. We note that $f^*$ is more general than a linear function, and therefore the lower bound can only be larger than the $R_N(B)$ defined in the previous section.
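A minimal numpy sketch of this pipeline, assuming the importance weights $w_i \approx p_T(x_i)/p_S(x_i)$ are already available (their estimation is discussed in Section 4.1 below): it forms the reweighted estimator $\hat\beta_{LS}$ and the plug-in covariance $\hat M$ that program (7) consumes. The function name and interface are ours.

```python
import numpy as np

def weighted_ls_and_M(X_S, y_S, w, Sigma_T):
    """Importance-weighted least squares beta_LS and plug-in asymptotic covariance M_hat.

    w[i] approximates p_T(x_i)/p_S(x_i); Sigma_T is the (estimated) target
    second-moment matrix. M_hat estimates
        Sigma_T^{-1} E_{p_T}[(p_T/p_S)(a_T^2 + sigma^2) x x^T] Sigma_T^{-1}
    via the residual-based formula of Section 4.
    """
    n_S = X_S.shape[0]
    W = np.diag(w)
    beta_LS = np.linalg.solve(X_S.T @ W @ X_S, X_S.T @ W @ y_S)

    Sigma_T_inv = np.linalg.inv(Sigma_T)
    resid2 = (y_S - X_S @ beta_LS) ** 2                # (y_i - x_i^T beta_LS)^2
    scaled = (w ** 2 * resid2)[:, None] * X_S          # rows: w_i^2 r_i^2 x_i
    M_hat = Sigma_T_inv @ (scaled.T @ X_S / n_S) @ Sigma_T_inv
    return beta_LS, M_hat

# beta_MM = C_hat @ beta_LS, with C_hat from program (7) (solved as in the earlier
# CVXPY sketch, replacing the variance term by Tr(C M_hat C^T Sigma_T)/n_S).
```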
### 4.1. Estimating $p_T(x)/p_S(x)$

Even though estimating $p_T(x)/p_S(x)$ can be sample-inefficient, it only involves unlabeled data, and therefore instance-weighting algorithms have attracted prior study, as discussed in the related work section. Practical ways to estimate the density ratio include separately estimating $p_T$ and $p_S$ (Lin et al., 2002; Zadrozny, 2004), kernel mean matching (KMM) (Huang et al., 2006), and minimizing some divergence between the weighted source distribution and the target distribution (Sugiyama et al., 2008; 2012; Uehara et al., 2016; Menon & Ong, 2016; Kanamori et al., 2011). We propose another simple algorithm that is very convenient to use. We conduct regression on data samples $(x, y) \sim q(x, y)$, where $q_Y(y)$ is Bernoulli$(\frac{1}{2})$[^8] and $q_{X\mid Y}(x\mid y=1) = p_T$, $q_{X\mid Y}(x\mid y=0) = p_S$. Empirically, we concatenate $X_S$ and $X_U$ to form the input data and stack $\mathbf{0} \in \mathbb{R}^{n_S}$ and $\mathbf{1} \in \mathbb{R}^{n_U}$ as the target vector $y$.

**Proposition 4.3.** *The optimal function $\alpha^* \in \arg\min_f \mathbb{E}_{x,y\sim q}(f(x) - y)^2$ satisfies $\alpha^*(x) = \frac{p_T(x)}{p_S(x) + p_T(x)}$.*

[^8]: The probability $1/2$ should be adjusted based on the numbers of unlabeled samples from the source and target domains.

Therefore, with a proper transformation[^9] of $\alpha^*$, one can recover the importance weights. In practice, one is free to choose the function class $\mathcal{F}$ used to estimate $\alpha^*$, and the sample complexity is bounded by a standard complexity measure of $\mathcal{F}$, e.g., the Rademacher or Gaussian complexity (Bartlett & Mendelson, 2002). Unlike KMM, this parametrized estimator applies to unseen data $x$, which makes cross-validation possible.

[^9]: Apply $f(x) \mapsto \frac{1}{1/f(x) - 1}$.

## 5. Near-minimax estimator with model shift

The general setting of transfer learning in linear regression involves both model shift and covariate shift. Namely, the generative models of the labels might differ: $y_S = X_S\beta^*_S + z_S$ and $y_T = X_T\beta^*_T + z_T$. Denote the model shift by $\delta := \beta^*_S - \beta^*_T$. We are interested in the minimax linear estimator when $\|\delta\| \le \gamma$ and $\|\beta^*_T\| \le r$; thus the problem becomes finding a minimax estimator for $\beta^*_T \in B = \{\beta \mid \|\beta\| \le r\}$ from $y_S, y_T$.

**Algorithm:** First consider a sufficient statistic $(\tilde\beta_S, \tilde\beta_T)$ for $(\beta^*_T, \delta)$. Here $\tilde\beta_S = \hat\Sigma_S^{-1}X_S^\top y_S/n_S \sim N(\beta^*_T + \delta, \frac{\sigma^2}{n_S}\hat\Sigma_S^{-1})$ and $\tilde\beta_T = \hat\Sigma_T^{-1}X_T^\top y_T/n_T \sim N(\beta^*_T, \frac{\sigma^2}{n_T}\hat\Sigma_T^{-1})$. Then consider the best linear estimator on top of it: $\hat\beta = A_1\tilde\beta_S + A_2\tilde\beta_T$. Write $\Delta = \{\delta \mid \|\delta\| \le \gamma\}$ and $L_{B,\Delta}(\hat\beta) := \max_{\beta^*_T\in B,\,\delta\in\Delta}\mathbb{E}\,\|\Sigma_T^{1/2}(\hat\beta - \beta^*_T)\|^2$. Then, by AM-GM,
$$R_L(B,\Delta) := \min_{\hat\beta = A_1\tilde\beta_S + A_2\tilde\beta_T} L_{B,\Delta}(\hat\beta) \le \min_{A_1, A_2}\ \max_{\|\beta^*_T\|\le r,\ \|\delta\|\le\gamma}\Big\{ 2\|\Sigma_T^{1/2}(A_1 + A_2 - I)\beta^*_T\|^2 + 2\|\Sigma_T^{1/2}A_1\delta\|^2 + \frac{\sigma^2}{n_S}\mathrm{Tr}(A_1\hat\Sigma_S^{-1}A_1^\top\Sigma_T) + \frac{\sigma^2}{n_T}\mathrm{Tr}(A_2\hat\Sigma_T^{-1}A_2^\top\Sigma_T)\Big\} \tag{8}$$
$$= \min_{A_1, A_2}\ r_{B,\Delta}(A_1, A_2), \qquad r_{B,\Delta}(A_1, A_2) := 2\|\Sigma_T^{1/2}(A_1 + A_2 - I)\|_2^2\, r^2 + 2\|\Sigma_T^{1/2}A_1\|_2^2\,\gamma^2 + \frac{\sigma^2}{n_S}\mathrm{Tr}(A_1\hat\Sigma_S^{-1}A_1^\top\Sigma_T) + \frac{\sigma^2}{n_T}\mathrm{Tr}(A_2\hat\Sigma_T^{-1}A_2^\top\Sigma_T). \tag{9}$$
Therefore we optimize over this upper bound and reformulate the problem as a convex program:
$$(\hat A_1, \hat A_2) \in \arg\min_{A_1, A_2, a, b}\ 2ar^2 + 2b\gamma^2 + \frac{\sigma^2}{n_S}\mathrm{Tr}(A_1\hat\Sigma_S^{-1}A_1^\top\Sigma_T) + \frac{\sigma^2}{n_T}\mathrm{Tr}(A_2\hat\Sigma_T^{-1}A_2^\top\Sigma_T)$$
$$\text{s.t. } (A_1 + A_2 - I)^\top\Sigma_T(A_1 + A_2 - I) \preceq a I, \qquad A_1^\top\Sigma_T A_1 \preceq b I. \tag{10}$$
Our estimator is given by $\hat\beta_{MM} = \hat A_1\tilde\beta_S + \hat A_2\tilde\beta_T$. Since $\hat\beta_{MM}$ comes from a relaxation of the linear minimax problem, it is important to understand how well $\hat\beta_{MM}$ performs on the original objective:

**Claim 5.1.** $R_L(B,\Delta) \le L_{B,\Delta}(\hat\beta_{MM}) \le 2\,R_L(B,\Delta)$.
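A CVXPY sketch of program (10), in the same spirit as the earlier covariate-shift sketch and again with the PSD constraints rewritten as operator-norm constraints via $s_a = \sqrt{a}$, $s_b = \sqrt{b}$; this is our illustrative formulation rather than the authors' implementation.

```python
import numpy as np
import cvxpy as cp

def psd_sqrt(M):
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def model_shift_estimator(Sigma_S_hat, Sigma_T_hat, Sigma_T, sigma, n_S, n_T,
                          r, gamma, solver=cp.SCS):
    """Sketch of program (10): find (A1, A2) that combine the source and target
    sufficient statistics under model shift ||delta|| <= gamma, ||beta_T|| <= r."""
    d = Sigma_T.shape[0]
    T_half = psd_sqrt(Sigma_T)
    Ssinv_half = psd_sqrt(np.linalg.inv(Sigma_S_hat))
    Stinv_half = psd_sqrt(np.linalg.inv(Sigma_T_hat))

    A1, A2 = cp.Variable((d, d)), cp.Variable((d, d))
    s_a, s_b = cp.Variable(nonneg=True), cp.Variable(nonneg=True)
    var_S = cp.sum_squares(T_half @ A1 @ Ssinv_half)   # Tr(A1 Sigma_S_hat^{-1} A1^T Sigma_T)
    var_T = cp.sum_squares(T_half @ A2 @ Stinv_half)   # Tr(A2 Sigma_T_hat^{-1} A2^T Sigma_T)
    objective = cp.Minimize(2 * r**2 * cp.square(s_a) + 2 * gamma**2 * cp.square(s_b)
                            + sigma**2 / n_S * var_S + sigma**2 / n_T * var_T)
    constraints = [cp.sigma_max(T_half @ (A1 + A2 - np.eye(d))) <= s_a,
                   cp.sigma_max(T_half @ A1) <= s_b]
    cp.Problem(objective, constraints).solve(solver=solver)
    return A1.value, A2.value

# beta_MM = A1_hat @ beta_tilde_S + A2_hat @ beta_tilde_T.
```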
Finally, we show that with this relaxation we still achieve a near-optimal estimator even among all nonlinear rules.

**Theorem 5.2.** *When $\Sigma_T$ commutes with $\hat\Sigma_S$, $\hat\beta_{MM}$ is near-minimax among all (possibly nonlinear) estimators:*
$$L_{B,\Delta}(\hat\beta_{MM}) := \max_{\beta^*_T\in B,\,\delta\in\Delta}\ \mathbb{E}\,\|\Sigma_T^{1/2}(\hat\beta_{MM} - \beta^*_T)\|^2 \le 27\,R_N(B,\Delta),$$
*where $R_N(B,\Delta) := \min_{\hat\beta(y_S,y_T)}\max_{\beta^*_T\in B,\,\delta\in\Delta}\mathbb{E}\,\|\Sigma_T^{1/2}(\hat\beta - \beta^*_T)\|^2$ is the minimax risk.*

We defer the complete proof to the appendix. The main proof technique is to decompose the problem into 2-d subproblems that have closed-form solutions and are amenable to Le Cam's two-point lemma. We include the proof sketch here.

**Proof sketch of Theorem 5.2.** For ease of understanding, we provide a simple proof sketch for the case where $\Sigma_S = \Sigma_T$ are diagonal. We first define the hardest hyperrectangular subproblem. Let $B(\tau) = \{\beta : |\beta_i| \le \tau_i\ \forall i\}$ be a subset of $B$, and similarly $\Delta(\zeta)$. We show that $R_L(B,\Delta) = \max_{\tau\in B,\,\zeta\in\Delta} R_L(B(\tau), \Delta(\zeta))$, and clearly $R_N(B,\Delta) \ge \max_{\tau\in B,\,\zeta\in\Delta} R_N(B(\tau), \Delta(\zeta))$. Meanwhile, we show that when the sets are hyperrectangles the minimax (linear) risk decomposes into 2-d problems: $R_L(B(\tau), \Delta(\zeta)) = \sum_i R_L(\tau_i, \zeta_i)$. Each $R_L(\tau_i, \zeta_i)$ is the linear minimax risk of estimating $\beta_i$ from $x \sim N(\beta_i + \delta_i, 1)$ and $y \sim N(\beta_i, 1)$, where $|\beta_i| \le \tau_i$ and $|\delta_i| \le \zeta_i$. This 2-d problem has a closed-form solution for the linear risk, and its minimax risk can be lower bounded using Le Cam's two-point lemma. We show $R_L(\tau_i, \zeta_i) \le 13.5\,R_N(\tau_i, \zeta_i)$, and therefore
$$\frac{1}{2}L_{B,\Delta}(\hat\beta_{MM}) \overset{\text{Claim 5.1}}{\le} R_L(B,\Delta) \overset{\text{Lemma C.2}}{=} \max_{\tau\in B,\,\zeta\in\Delta} R_L(B(\tau), \Delta(\zeta)) \overset{\text{Prop C.4.a}}{=} \max_{\tau\in B,\,\zeta\in\Delta} \sum_i R_L(\tau_i, \zeta_i)$$
$$\le 13.5\,\max_{\tau\in B,\,\zeta\in\Delta} \sum_i R_N(\tau_i, \zeta_i) \overset{\text{Prop C.4.b}}{=} 13.5\,\max_{\tau\in B,\,\zeta\in\Delta} R_N(B(\tau), \Delta(\zeta)) \le 13.5\,R_N(B,\Delta).$$

## 6. Experiments

Our estimators are provably near-optimal for the worst-case $\beta^*$. However, it remains unknown whether on average they outperform other baselines. With synthetic data we explore the performance with random $\beta^*$. We are also interested in investigating the conditions under which our gains are larger.

**Setup.** We set $n_S = 2000$, $d = 50$, $\sigma = 1$, $r = d$. For each setting, we sample $\beta^*_T$ from the standard normal distribution and rescale it to have norm $r$. We estimate $\Sigma_T$ using $n_U = 2000$ unlabeled samples. We compare our estimator with ridge regression (S-ridge) and a variant of ridge regression transformed to the target domain (T-ridge): $\hat\beta^\lambda_{RR,T} = \arg\min_\beta \|\Sigma_T^{1/2}(\beta - \hat\beta_{SS})\|^2 + \lambda\|\beta\|^2 = (\Sigma_T + \lambda I)^{-1}\Sigma_T\hat\beta_{SS}$.

**Covariate shift.** In order to understand the effect of covariate shift on our algorithm, we consider three types of settings, each with a unique varying factor that influences the performance: 1) covariate eigenvalue shift with shared eigenspace; 2) covariate eigenspace shift with fixed eigenvalues;[^10] 3) signal strength change. We also have an additional 200 labeled samples from the target domain as a validation set, used only for hyper-parameter tuning.

[^10]: We leave this result to the appendix since the performance appears invariant to this factor.

**Model shift.** Next we consider the problem with model shift. We sample a random $\delta$ with norm $\gamma$ varying from 0 to $r = d$ and observe data generated by $y_S = X_S(\beta^*_T + \delta) + z_S \in \mathbb{R}^{2000}$, $z_S \sim N(0, I)$, and $y_T = X_T\beta^*_T + z_T \in \mathbb{R}^{500}$, $z_T \sim N(0, I)$. We compare our estimator with two baselines: "ridge-source" denotes ridge regression using only source data, and "ridge-target" denotes ridge regression with target data.

Figure 1 demonstrates the better performance of our estimator in all circumstances. From (a) we see that with more discrepancy between $\Sigma_S$ and $\Sigma_T$, our estimator tends to perform better. (b) shows our estimator is better when the signal is relatively stronger. From (c) we can see that with increasing model shift, measured by $\gamma/r$, S-ridge becomes worse and is outperformed by T-ridge, which remains unchanged. Our estimator also becomes slightly worse due to the reduced utility of the source data, but remains the best among all methods. When $\gamma/r \approx 0.2$, our method has the largest percentage improvement over the better of ridge-source and ridge-target.

*Figure 1: Performance comparisons. (a) Covariate eigen-spectrum: the x-axis $\alpha$ defines the spread of the eigen-spectrum of $\Sigma_S$: $s_i \propto 1/i^\alpha$, $t_i \propto 1/i$. (b) Signal strength: the x-axis is the normalized signal strength $\|\Sigma_T\beta^*\|/r$. (c) Model shift: the x-axis is the model shift measured by $\gamma/r$. Performance with standard error bars is over 40 runs.*
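For reference, the synthetic covariate-shift setup above can be generated along the following lines (a sketch; choices not pinned down in the text, such as the eigenvalue normalization and the random seed, are our own).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_S, n_U, sigma, alpha = 50, 2000, 2000, 1.0, 2.0

# Source/target spectra as in Figure 1(a): s_i ~ 1/i^alpha, t_i ~ 1/i (shared eigenbasis here).
i = np.arange(1, d + 1)
Sigma_S = np.diag(1.0 / i**alpha)
Sigma_T = np.diag(1.0 / i)

r = float(d)                                     # radius r as in the Setup paragraph
beta_T = rng.normal(size=d)
beta_T *= r / np.linalg.norm(beta_T)             # rescale beta*_T to norm r

X_S = rng.multivariate_normal(np.zeros(d), Sigma_S, size=n_S)
y_S = X_S @ beta_T + sigma * rng.normal(size=n_S)
X_U = rng.multivariate_normal(np.zeros(d), Sigma_T, size=n_U)   # unlabeled target samples
Sigma_U_hat = X_U.T @ X_U / n_U                  # plug-in estimate of Sigma_T used by (4)
```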
### 6.1. Experiments with approximation error

Finally, we conduct empirical studies with nonlinear models. We maintain the same setting as before. We also generate a small validation dataset from the target domain: $X_{CV} \in \mathbb{R}^{500\times 50}$ sampled from $N(0, \Sigma_T)$, and $y_{CV} = f^*(X_{CV}) + z_{CV}$ with $z_{CV} \sim N(0, \sigma^2 I)$. We choose $\lambda_i(\Sigma_S) \propto i$, $\lambda_i(\Sigma_T) \propto 1/i$, and the eigenspaces of both $\Sigma_S$ and $\Sigma_T$ are random orthonormal matrices ($\|\Sigma_S\|_F^2 = \|\Sigma_T\|_F^2 = d$). The ground-truth model is a one-hidden-layer ReLU network, $f^*(x) = \frac{1}{d}a^\top(Wx)_+$, where $W$ and $a$ are randomly generated from the standard Gaussian distribution. We observe noisy labels: $y_S = f^*(x) + z$, where $z_i \sim N(0, \sigma^2)$.

**Estimating the weights $p_T(x)/p_S(x)$.** Since the generated data samples are Gaussian, the absolute weights are $p_T(x)/p_S(x) = \sqrt{\frac{|\Sigma_S|}{|\Sigma_T|}}\exp\big(\frac{1}{2}x^\top(\Sigma_S^{-1} - \Sigma_T^{-1})x\big)$. However, this absolute value scales exponentially with the norm of $x$ and can amplify the variance. Moreover, when one multiplies both $X_S$ and $y_S$ by 10, the ground truth $\beta^*$ does not change, but the absolute value of $p_T(x)/p_S(x)$ changes drastically. This discrepancy highlights the importance of relative magnitudes (among samples) instead of absolute values, as noted by Kanamori et al. (2009). To obtain a relative score, we first estimate the absolute density ratio $\alpha(x) \approx p_T(x)/(p_S(x) + p_T(x))$ following our algorithm in Section 4.1 with linear regression. We then assign each sample a weight $w_i$ from 10 discrete values $1, 2, 3, \ldots, 10$ based on the value of $\alpha(x_i)$, and then rescale the reweighting vector properly. We use the conventional way to adjust the reweighting strength, $w_i^c$ with $c \in [0, 1]$, and choose $c$ by cross-validation.

We implement our method (Eqn. 7) using the estimated weights as above and plot the excess risk comparisons in Figure 2. The baselines we choose are ordinary least squares ("OLS" in Figure 2), ridge regression ("Ridge"), and weighted least squares (Kanamori et al., 2009) ("Reweighting"; $\hat\beta_{LS}$ in our main text). For ridge regression, reweighting, and our method, we tune hyperparameters through cross-validation. All results are presented over 40 runs, where the randomness comes from $f^*$ and the eigenspaces of $\Sigma_S$, $\Sigma_T$. From Figure 2 we see that the reweighting algorithm improves over ordinary least squares but is outperformed by ridge regression due to its large variance. Our algorithm achieves the best performance by appropriately reweighting and then reducing the variance.

*Figure 2: The x-axis is the noise level $\sigma$ and the y-axis is the excess risk (with approximation error). Performance with standard error bars is over 40 runs.*

**Experiments on the Berkeley Yearbook dataset.** To verify the performance of our algorithm on real-world data, we conduct an experiment on the Berkeley Yearbook dataset (Ginosar et al., 2015). We randomly split the data to form source and target tasks, where the source has 63.2% male photos and the target has 43.4% male photos. The input $X$ consists of gray-scale portraits, and $Y$ is the year the photo was taken (ranging from 1905 to 2013). We implement our algorithms together with the baselines and estimate the density ratio from the data. We demonstrate the performance improvement in Figure 3. The x-axis is the scalar $c$ that adjusts the reweighting strength, as defined in the previous paragraph.

*Figure 3: Comparisons on the Yearbook dataset (Ginosar et al., 2015).*
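A sketch of the weight-estimation procedure described above: a regression-based estimate of $\alpha(x) \approx p_T(x)/(p_S(x) + p_T(x))$ as in Section 4.1, followed by bucketing into 10 relative levels and tempering by the exponent $c$. The bucketing rule (quantile-based) and the use of scikit-learn's `LinearRegression` are our own illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def relative_importance_weights(X_S, X_U, n_buckets=10, c=0.5):
    """Estimate alpha(x) ~= p_T(x)/(p_S(x)+p_T(x)) by regressing 0/1 domain labels on the
    concatenated unlabeled data (Section 4.1), then map alpha to a relative weight in
    {1, ..., n_buckets} and temper it by the exponent c in [0, 1]."""
    X = np.vstack([X_S, X_U])
    y = np.concatenate([np.zeros(len(X_S)), np.ones(len(X_U))])  # 0 = source, 1 = target
    alpha = LinearRegression().fit(X, y).predict(X_S)
    alpha = np.clip(alpha, 1e-3, 1 - 1e-3)

    # Bucket alpha into n_buckets relative levels instead of using the exponentially
    # varying absolute ratio alpha / (1 - alpha).
    edges = np.quantile(alpha, np.linspace(0, 1, n_buckets + 1)[1:-1])
    w = 1.0 + np.digitize(alpha, edges)          # values in {1, ..., n_buckets}
    w = w ** c                                   # reweighting strength, tuned by CV
    return w * len(w) / w.sum()                  # rescale so weights average to 1
```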
## 7. Conclusion

We study in depth the minimax linear estimator for linear regression under various distribution shift settings. We investigate the optimal linear estimators under covariate shift for linear models in unsupervised and supervised domain adaptation, with no or scarce labeled data from the target distribution. For nonlinear models with approximation error, we also introduce the minimax linear estimator together with an easy-to-use density ratio estimation method. We further explore moderate model shift in the linear setting. Our estimators achieve near-optimal worst-case excess risk measured on the target domain and, in some circumstances, are within a constant factor of the minimax risk among all nonlinear rules. The significant improvement of our estimators over ridge regression is demonstrated by a theoretical separation result and by empirical validation, even in the average case with random parameters. In future work, we will extend our algorithm to classification problems under distribution shift and apply the algorithms to fine-tuning the last layer of a deep network.

## Acknowledgements

QL was supported by NSF #2030859 and the Computing Research Association for the CIFellows Project. WH was supported by NSF, ONR, Simons Foundation, Schmidt Foundation, Amazon Research, DARPA and SRC. JDL was supported by ARO under MURI Award W911NF-11-1-0303, the Sloan Research Fellowship, and NSF CCF 2002272.

## References

Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Baktashmotlagh, M., Harandi, M. T., Lovell, B. C., and Salzmann, M. Unsupervised domain adaptation by domain invariant projection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 769–776, 2013.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.

Blaker, H. Minimax estimation in linear regression under restrictions. Journal of Statistical Planning and Inference, 90(1):35–55, 2000.

Cortes, C., Mansour, Y., and Mohri, M. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems, pp. 442–450, 2010.

David, S. B., Lu, T., Luu, T., and Pál, D. Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 129–136, 2010.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Donoho, D. L. Statistical estimation and optimal recovery. The Annals of Statistics, pp. 238–270, 1994.

Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.

Ginosar, S., Rakelly, K., Sachs, S., Yin, B., and Efros, A. A. A century of portraits: A visual historical record of American high school yearbooks.
In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 1–7, 2015.

Glorot, X., Bordes, A., and Bengio, Y. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, 2011.

Gong, B., Shi, Y., Sha, F., and Grauman, K. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2066–2073. IEEE, 2012.

Gong, B., Grauman, K., and Sha, F. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In International Conference on Machine Learning, pp. 222–230, 2013.

Gopalan, R., Li, R., and Chellappa, R. Domain adaptation for object recognition: An unsupervised approach. In 2011 International Conference on Computer Vision, pp. 999–1006. IEEE, 2011.

Hanneke, S. and Kpotufe, S. On the value of target data in transfer learning. In Advances in Neural Information Processing Systems, pp. 9871–9881, 2019.

Heckman, J. J. Sample selection bias as a specification error. Econometrica: Journal of the Econometric Society, pp. 153–161, 1979.

Huang, J., Gretton, A., Borgwardt, K., Schölkopf, B., and Smola, A. Correcting sample selection bias by unlabeled data. Advances in Neural Information Processing Systems, 19:601–608, 2006.

Huang, J., Gretton, A., Borgwardt, K., Schölkopf, B., and Smola, A. J. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, pp. 601–608, 2007.

Jiang, J. and Zhai, C. Instance weighting for domain adaptation in NLP. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 264–271, 2007.

Johnstone, I. M. Gaussian estimation: Sequence and wavelet models. Unpublished manuscript, 2011.

Juditsky, A. B., Nemirovski, A. S., et al. Nonparametric estimation by convex programming. The Annals of Statistics, 37(5A):2278–2300, 2009.

Kalan, S. M. M., Fabian, Z., Avestimehr, A. S., and Soltanolkotabi, M. Minimax lower bounds for transfer learning with linear and one-hidden layer neural networks. arXiv preprint arXiv:2006.10581, 2020.

Kanamori, T., Hido, S., and Sugiyama, M. A least-squares approach to direct importance estimation. The Journal of Machine Learning Research, 10:1391–1445, 2009.

Kanamori, T., Suzuki, T., and Sugiyama, M. f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models. IEEE Transactions on Information Theory, 58(2):708–720, 2011.

Kumar, A., Ma, T., and Liang, P. Understanding self-training for gradual domain adaptation. arXiv preprint arXiv:2002.11361, 2020.

Lin, Y., Lee, Y., and Wahba, G. Support vector machines for classification in nonstandard situations. Machine Learning, 46(1):191–202, 2002.

Lipton, Z., Wang, Y.-X., and Smola, A. Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning, pp. 3122–3130. PMLR, 2018.

Long, M., Wang, J., Ding, G., Sun, J., and Yu, P. S. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2200–2207, 2013.

Long, M., Zhu, H., Wang, J., and Jordan, M. I. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning, pp. 2208–2217. PMLR, 2017.

Lu, S., Nott, B., Olson, A., Todeschini, A., Vahabi, H., Carmon, Y., and Schmidt, L. Harder or different?
A closer look at distribution shift in dataset reproduction. In ICML Workshop on Uncertainty and Robustness in Deep Learning, 2020.

Menon, A. and Ong, C. S. Linking losses for density ratio and class-probability estimation. In International Conference on Machine Learning, pp. 304–313. PMLR, 2016.

Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2009.

Pan, S. J., Tsang, I. W., Kwok, J. T., and Yang, Q. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2010.

Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset Shift in Machine Learning. 2009.

Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do CIFAR-10 classifiers generalize to CIFAR-10? arXiv preprint arXiv:1806.00451, 2018.

Saerens, M., Latinne, P., and Decaestecker, C. Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation, 14(1):21–41, 2002.

Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.

Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

Srivastava, M., Hashimoto, T., and Liang, P. Robustness to spurious correlations via human annotations. arXiv preprint arXiv:2007.06661, 2020.

Storkey, A. When training and test sets are different: Characterizing learning transfer. Dataset Shift in Machine Learning, pp. 3–28, 2009.

Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., and Kawanabe, M. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008.

Sugiyama, M., Suzuki, T., and Kanamori, T. Density-ratio matching under the Bregman divergence: A unified framework of density-ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5):1009–1044, 2012.

Sun, Q., Chattopadhyay, R., Panchanathan, S., and Ye, J. A two-stage weighting framework for multi-source domain adaptation. In Advances in Neural Information Processing Systems, pp. 505–513, 2011.

Uehara, M., Sato, I., Suzuki, M., Nakayama, K., and Matsuo, Y. Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920, 2016.

Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

Wang, X. and Schneider, J. Flexible transfer learning under support and model shift. In Advances in Neural Information Processing Systems, pp. 1898–1906, 2014.

Wang, X. and Schneider, J. G. Generalization bounds for transfer learning under model shift. In UAI, pp. 922–931, 2015.

Wang, X., Huang, T.-K., and Schneider, J. Active transfer learning under model shift. In International Conference on Machine Learning, pp. 1305–1313, 2014.

Weiss, K., Khoshgoftaar, T. M., and Wang, D. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016.

Wu, Y., Winston, E., Kaushik, D., and Lipton, Z. Domain adaptation with asymmetrically-relaxed distribution alignment. In International Conference on Machine Learning, pp. 6872–6881. PMLR, 2019.

Zadrozny, B.
Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 114, 2004.

Zhang, K., Schölkopf, B., Muandet, K., and Wang, Z. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pp. 819–827, 2013.

Zhao, H., Combes, R. T. d., Zhang, K., and Gordon, G. J. On learning invariant representation for domain adaptation. arXiv preprint arXiv:1901.09453, 2019.