# Single Point Transductive Prediction

Nilesh Tripuraneni (Department of EECS, University of California, Berkeley) and Lester Mackey (Microsoft Research, New England). Correspondence to: Nilesh Tripuraneni. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020.

**Abstract.** Standard methods in supervised learning separate training and prediction: the model is fit independently of any test points it may encounter. However, can knowledge of the next test point $x_\star$ be exploited to improve prediction accuracy? We address this question in the context of linear prediction, showing how techniques from semiparametric inference can be used transductively to combat regularization bias. We first lower bound the $x_\star$ prediction error of ridge regression and the Lasso, showing that they must incur significant bias in certain test directions. We then provide non-asymptotic upper bounds on the $x_\star$ prediction error of two transductive prediction rules. We conclude by showing the efficacy of our methods on both synthetic and real data, highlighting the improvements single point transductive prediction can provide in settings with distribution shift.

## 1. Introduction

We consider the task of prediction given independent datapoints $((y_i, x_i))_{i=1}^n$ from a linear model,

$$y_i = x_i^\top \beta_0 + \epsilon_i, \quad \mathbb{E}[\epsilon_i] = 0, \quad \epsilon_i \perp\!\!\!\perp x_i, \qquad (1)$$

in which our observed targets $y = (y_1, \dots, y_n) \in \mathbb{R}^n$ and covariates $X = [x_1, \dots, x_n]^\top \in \mathbb{R}^{n \times p}$ are related by an unobserved parameter vector $\beta_0 \in \mathbb{R}^p$ and noise vector $\epsilon = (\epsilon_1, \dots, \epsilon_n) \in \mathbb{R}^n$. Most approaches to linear model prediction are inductive, divorcing the steps of training and prediction; for example, regularized least squares methods like ridge regression (Hoerl & Kennard, 1970) and the Lasso (Tibshirani, 1996) are fit independently of any knowledge of the next test point $x_\star$. This suggests a tantalizing transductive question: can knowledge of a single test point $x_\star$ be leveraged to improve prediction for $x_\star$?

In the random design linear model setting (1), we answer this question in the affirmative. Specifically, in Section 2 we establish out-of-sample prediction lower bounds for the popular ridge and Lasso estimators, highlighting the significant dimension-dependent bias introduced by regularization. In Section 3 we demonstrate how this bias can be mitigated by presenting two classes of transductive estimators that exploit explicit knowledge of the test point $x_\star$. We provide non-asymptotic risk bounds for these estimators in the random design setting, proving that they achieve dimension-free $O(1/n)$ $x_\star$-prediction risk for $n$ sufficiently large. In Section 4, we first validate our theory in simulation, demonstrating that transduction improves the prediction accuracy of the Lasso with fixed regularization even when $x_\star$ is drawn from the training distribution. We then demonstrate that under distribution shift, our transductive methods outperform even the popular cross-validated Lasso, cross-validated ridge, and cross-validated elastic net estimators (which attempt to find an optimal data-dependent trade-off between bias and variance) on both synthetic data and a suite of five real datasets.

### 1.1. Related Work

Our work is inspired by two approaches to semiparametric inference: the debiased Lasso approach introduced in (Zhang & Zhang, 2014; Van de Geer et al., 2014; Javanmard & Montanari, 2014) and the orthogonal machine learning approach of Chernozhukov et al. (2017).
The works (Zhang & Zhang, 2014; Van de Geer et al., 2014; Javanmard & Montanari, 2014) obtain small-width and asymptotically valid confidence intervals (CIs) for individual model parameters $(\beta_0)_j = \langle \beta_0, e_j \rangle$ by debiasing an initial Lasso estimator (Tibshirani, 1996). The works (Chao et al., 2014; Cai & Guo, 2017; Athey et al., 2018) each consider the more closely related problem of obtaining prediction confidence intervals using a generalization of the debiased Lasso estimator of Javanmard & Montanari (2014). The work of Chernozhukov et al. (2017) describes a general-purpose procedure for extracting $\sqrt{n}$-consistent and asymptotically normal target parameter estimates in the presence of nuisance parameters. Specifically, Chernozhukov et al. (2017) construct a two-stage estimator where one initially fits first-stage estimates of nuisance parameters using arbitrary ML estimators on a first-stage data sample. In the second stage, these first-stage estimates are used to provide estimates of the relevant model parameters via an orthogonalized method of moments. Wager et al. (2016) also use generic ML procedures as regression adjustments to form efficient confidence intervals for treatment effects.

These pioneering works all focus on improved CI construction. Here we show that the semiparametric techniques developed for hypothesis testing can be adapted to provide practical improvements in mean-squared prediction error. Our resulting mean-squared error bounds complement the in-probability bounds of the aforementioned literature by controlling prediction performance across all events.

While past work on transductive regression has demonstrated both empirical and theoretical benefits over induction when many unlabeled test points are simultaneously available (Belkin et al., 2006; Alquier & Hebiri, 2012; Bellec et al., 2018; Chapelle et al., 2000; Cortes & Mohri, 2007; Cortes et al., 2008), none of these works have demonstrated a significant benefit, either empirical or theoretical, from transduction given access to only a single test point. For example, the works (Belkin et al., 2006; Chapelle et al., 2000), while theoretically motivated, provide no formal guarantees on transductive predictive performance and only show empirical benefits for large unlabeled test sets. The transductive Lasso analyses of Alquier & Hebiri (2012) and Bellec et al. (2018) provide prediction error bounds identical to those of the inductive Lasso, where only the restricted eigenvalue constant is potentially improved by transduction. Neither analysis improves the dimension dependence of Lasso prediction in the single point setting to provide $O(1/n)$ rates. The formal analyses of Cortes & Mohri (2007) and Cortes et al. (2008) only guarantee small error when the number of unlabeled test points is large. Our aim is to develop single point transductive prediction procedures that improve upon the standard inductive approaches both in theory and in practice.

Our approach also bears some resemblance to semi-supervised learning (SSL), which improves the predictive power of an inductive learner by observing additional unlabeled examples (see, e.g., Zhu, 2005; Bellec et al., 2018). Conventionally, SSL benefits from access to a large pool of unlabeled points drawn from the same distribution as the training data. In contrast, our procedures receive access to only a single arbitrary test point $x_\star$ (we make no assumption about its distribution), and our aim is accurate prediction for that point.
We are unaware of SSL results that benefit significantly from access to a single unlabeled point $x_\star$.

### 1.2. Problem Setup

Our principal aim in this work is to understand the $x_\star$ prediction risk,

$$R(x_\star, \hat{y}) = \mathbb{E}[(y_\star - \hat{y})^2] - \sigma^2 = \mathbb{E}[(\hat{y} - \langle x_\star, \beta_0 \rangle)^2], \qquad (2)$$

of an estimator $\hat{y}$ of the unobserved test response $y_\star = x_\star^\top \beta_0 + \epsilon_\star$. Here, $\epsilon_\star$ is independent of $x_\star$ with variance $\sigma^2$. We exclude the additive noise $\sigma^2$ from our risk definition, as it is irreducible for any estimator. Importantly, to accommodate non-stationary learning settings, we consider $x_\star$ to be fixed and arbitrary; in particular, $x_\star$ need not be drawn from the training distribution. Hereafter, we will make use of several assumptions which are standard in the random design linear regression literature.

**Assumption 1 (Well-specified Model).** The data $(X, y)$ is generated from the model (1).

**Assumption 2 (Bounded Covariance).** The covariate vectors have common covariance $\Sigma = \mathbb{E}[x_i x_i^\top]$ with $\Sigma_{ii} \le 1$, $\sigma_{\max}(\Sigma) \le C_{\max}$, and $\sigma_{\min}(\Sigma) \ge C_{\min}$. We further define the precision matrix $\Omega = \Sigma^{-1}$ and condition number $C_{\mathrm{cond}} = C_{\max}/C_{\min}$.

**Assumption 3 (Sub-Gaussian Design).** Each whitened covariate vector $\Sigma^{-1/2}x_i$ is sub-Gaussian with parameter $\nu$, in the sense that $\mathbb{E}[\exp(\langle v, \Sigma^{-1/2}x_i\rangle)] \le \exp(\nu^2\|v\|_2^2/2)$ for all $v \in \mathbb{R}^p$.

**Assumption 4 (Sub-Gaussian Noise).** The noise $\epsilon_i$ is sub-Gaussian with variance parameter $\sigma^2$.

Throughout, we use bold lower-case letters (e.g., $x$) to refer to vectors and bold upper-case letters (e.g., $X$) to refer to matrices. We define $[p] = \{1, \dots, p\}$ and $p \vee n = \max(p, n)$. Vectors or matrices subscripted with an index set $S$ indicate the subvector or submatrix supported on $S$. The expression $s_{\beta_0}$ indicates the number of non-zero elements in $\beta_0$, $\mathrm{supp}(\beta_0) = \{j : (\beta_0)_j \ne 0\}$, and $B_0(s)$ refers to the set of $s$-sparse vectors in $\mathbb{R}^p$. We use $\gtrsim$, $\lesssim$, and $\asymp$ to denote greater than, less than, and equal to up to a constant that is independent of $p$ and $n$.

## 2. Lower Bounds for Regularized Prediction

We begin by providing lower bounds on the $x_\star$ prediction risk of Lasso and ridge regression; the corresponding predictions take the form $\hat{y} = \langle x_\star, \hat\beta \rangle$ for a regularized estimate $\hat\beta$ of the unknown vector $\beta_0$.

### 2.1. Lower Bounds for Ridge Regression Prediction

We first consider the $x_\star$ prediction risk of the ridge estimator $\hat\beta_R(\lambda) \triangleq \operatorname{argmin}_\beta \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2$ with regularization parameter $\lambda > 0$. In the asymptotic high-dimensional limit (with $n, p \to \infty$) and assuming the training distribution equals the test distribution, Dobriban et al. (2018) compute the predictive risk of the ridge estimator in a dense random effects model. By contrast, we provide a non-asymptotic lower bound which does not impose any distributional assumptions on $x_\star$ or on the underlying parameter vector $\beta_0$. Theorem 1, proved in Appendix B.1, isolates the error in the ridge estimator due to bias for any choice of regularizer $\lambda$.

**Theorem 1.** Under Assumption 1, suppose $x_i \overset{\mathrm{i.i.d.}}{\sim} N(0, I_p)$ with independent noise $\epsilon \sim N(0, \sigma^2 I_n)$. If $n \ge p \ge 20$, then
$$\mathbb{E}[\langle x_\star, \hat\beta_R(\lambda) - \beta_0\rangle^2] \;\ge\; c(\lambda)\cdot\frac{\sigma^2\|x_\star\|_2^2}{n}\,\cos(x_\star, \beta_0)^2,$$
for a multiplicative factor $c(\lambda)$ that grows with the regularization strength $\lambda$ and whose explicit form is given with the proof in Appendix B.1.

Notably, the dimension-free term $\sigma^2\|x_\star\|_2^2/n$ in this bound coincides with the $x_\star$ risk of the ordinary least squares (OLS) estimator in this setting. The remaining multiplicative factor indicates that the ridge risk can be substantially larger if the regularization strength $\lambda$ is too large. In fact, our next result shows that, surprisingly, over-regularization can result even when $\lambda$ is tuned to minimize held-out prediction error over the training population. The same undesirable outcome results when $\lambda$ is selected to minimize $\ell_2$ estimation error; the proof can be found in Appendix B.2.

**Corollary 1.**
Under the conditions of Theorem 1, if $\tilde{x} \overset{d}{=} x_1$ and $\tilde{x}$ is independent of $(X, y)$, then for $\mathrm{SNR} \triangleq \|\beta_0\|_2^2/\sigma^2$,
$$\lambda_\star \triangleq \operatorname{argmin}_\lambda\, \mathbb{E}[\langle \tilde{x}, \hat\beta_R(\lambda) - \beta_0\rangle^2] = \operatorname{argmin}_\lambda\, \mathbb{E}[\|\hat\beta_R(\lambda) - \beta_0\|_2^2] = \frac{p}{\mathrm{SNR}},$$
and
$$\mathbb{E}[\langle x_\star, \hat\beta_R(\lambda_\star) - \beta_0\rangle^2] \;\gtrsim\; \frac{p^2}{n\,\mathrm{SNR}}\cdot\frac{\sigma^2\|x_\star\|_2^2}{n}\,\cos(x_\star, \beta_0)^2.$$

Several insights can be gathered from the previous results. First, the expression $\mathbb{E}[\langle \tilde{x}, \hat\beta_R(\lambda) - \beta_0\rangle^2]$ minimized in Corollary 1 is the expected prediction risk $\mathbb{E}[(\tilde{y} - \tilde{x}^\top\hat\beta_R(\lambda))^2] - \sigma^2$ for a new datapoint $(\tilde{x}, \tilde{y})$ drawn from the training distribution. This is the population analog of the held-out validation error or cross-validation error that is often minimized to select $\lambda$ in practice. Second, in the setting of Corollary 1, taking $\mathrm{SNR} = 1$ already yields a lower bound of order $\frac{p^2}{n}\cdot\frac{\sigma^2\|x_\star\|_2^2}{n}\cos(x_\star,\beta_0)^2$. More generally, if we take $\cos(x_\star, \beta_0)^2 = \Omega(1)$, $\mathrm{SNR} = o(p^2/n)$, and $\mathrm{SNR} \gtrsim 1$, then $\mathbb{E}[\langle x_\star, \hat\beta_R(\lambda_\star) - \beta_0\rangle^2] = \omega(\sigma^2\|x_\star\|_2^2/n)$: if $\lambda$ is optimized for estimation error or for prediction error with respect to the training distribution, the ridge estimator must incur much larger test error than the OLS estimator in some test directions. Such behavior can be viewed as a symptom of over-regularization: the choice $\lambda_\star$ is optimized for the training distribution and cannot be targeted to provide uniformly good performance over all $x_\star$. In Section 3 we show how transductive techniques can improve prediction in this regime.

The chief difficulty in lower-bounding the $x_\star$ prediction risk in Theorem 1 lies in controlling the expectation over the design $X$, which enters nonlinearly into the prediction risk. Our proof circumvents this difficulty in two steps. First, the isotropy and independence properties of Wishart matrices are used to reduce the computation to that of a one-dimensional expectation with respect to the unordered eigenvalues of $X$. Second, in the regime $n \gtrsim p$, the sharp concentration of Gaussian random matrices in spectral norm is exploited to essentially approximate $\frac{1}{n}X^\top X$ by the identity.

### 2.2. Lower Bounds for Lasso Prediction

We next provide a strong lower bound on the out-of-sample prediction error of the Lasso estimator $\hat\beta_L(\lambda) \triangleq \operatorname{argmin}_\beta \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ with regularization parameter $\lambda > 0$. There has been extensive work (see, e.g., Raskutti et al., 2011) establishing minimax lower bounds for the in-sample prediction error and parameter estimation error of any procedure given data from a sparse linear model. However, our focus is on out-of-sample prediction risk for a specific procedure, the Lasso. The point $x_\star$ need not be one of the training points (in-sample) nor even be drawn from the same distribution as the covariates. Theorem 2, proved in Appendix C.1, establishes that a well-regularized Lasso program suffers significant biases even in a simple problem setting with i.i.d. Gaussian covariates and noise. (A yet tighter lower bound is available if, instead of being fixed, $x_\star$ follows an arbitrary distribution and the expectation is taken over $x_\star$ as well; see the proof for details.)

**Theorem 2.** Under Assumption 1, fix $s \ge 0$, and let $x_i \overset{\mathrm{i.i.d.}}{\sim} N(0, I_p)$ with independent noise $\epsilon \sim N(0, \sigma^2 I_n)$. If $\lambda \gtrsim \sigma\sqrt{\log(2ep)/n}$ and $p \ge 20$ (the cutoff at 20 is arbitrary and can be decreased), then there exist universal constants $c_{1:3}$ such that for all $n \ge c_1 s^2\log(2ep)$,
$$\sup_{\beta_0 \in B_0(s)} \mathbb{E}[\langle x_\star, \hat\beta_L(\lambda) - \beta_0\rangle^2] \;\ge\; \sup_{\beta_0 \in B_0(s),\, \|\beta_0\|_1 \le \lambda} \mathbb{E}[\langle x_\star, \hat\beta_L(\lambda) - \beta_0\rangle^2] \;\ge\; c_2\,\lambda^2\,\|x_\star\|_{(s)}^2,$$
where the trimmed norm $\|x_\star\|_{(s)}$ is the sum of the magnitudes of the $s$ largest magnitude entries of $x_\star$.

In practice we will always be interested in a known $x_\star$ direction, but the next result clarifies the dependence of our Lasso lower bound on sparsity for worst-case test directions $x_\star$ (see Appendix C.2 for the proof):

**Corollary 2.** In the setting of Theorem 2, for $q \in [1, \infty]$,
$$\sup_{\|x_\star\|_q = 1}\; \sup_{\beta_0 \in B_0(s)} \mathbb{E}[\langle x_\star, \hat\beta_L(\lambda) - \beta_0\rangle^2] \;\ge\; c_2\,\lambda^2\, s^{2 - 2/q}.$$
We make several comments regarding these results. First, Theorem 2 yields an $x_\star$-specific lower bound showing that, given any potential direction $x_\star$, there will exist an underlying $s$-sparse parameter $\beta_0$ for which the Lasso performs poorly. Moreover, the magnitude of error suffered by the Lasso scales both with the regularization strength $\lambda$ and the norm of $x_\star$ along its top $s$ coordinates. Second, the constraint on the regularization parameter in Theorem 2, $\lambda \gtrsim \sigma\sqrt{\log p/n}$, is a sufficient and standard choice to obtain consistent estimates with the Lasso (see Wainwright (2019, Ch. 7) for example). Third, simplifying to the case of $q = 2$, we see that Corollary 2 implies the Lasso must incur worst-case $x_\star$ prediction error $\gtrsim \frac{\sigma^2 s\log p}{n}$, matching upper bounds for Lasso prediction error (Wainwright, 2019, Example 7.14). In particular, such a bound is not dimension-free, possessing a dependence on $s\log p$, even though the Lasso is only required to predict well along a single direction.

The proof of Theorem 2 uses two key ideas. First, in this benign setting, we can show that $\hat\beta_L(\lambda)$ has support strictly contained in the support of $\beta_0$ with at least constant probability. We then adapt ideas from the study of debiased Lasso estimation in (Javanmard & Montanari, 2014) to sharply characterize the coordinate-wise bias of the Lasso estimator along the support of $\beta_0$; in particular, we show that a worst-case $\beta_0$ can match the signs of the $s$ largest elements of $x_\star$ and have magnitude $\lambda$ on each non-zero coordinate. Thus the bias induced by regularization can coherently sum across the $s$ coordinates in the support of $\beta_0$. A similar lower bound follows by choosing $\beta_0$ to match the signs of $x_\star$ on any subset of size $s$. This sign alignment between $x_\star$ and $\beta_0$ is also explored in the independent and concurrent work of Bellec & Zhang (2019, Thm. 2.2).

## 3. Upper Bounds for Transductive Prediction

Having established that regularization can lead to excessive prediction bias, we now introduce two classes of estimators which can mitigate this bias using knowledge of the single test direction $x_\star$. While our presentation focuses on the prediction risk (2), which features an expectation over $\hat y$, our proofs in the appendix also provide identical high-probability upper bounds on $(\hat y - \langle x_\star, \beta_0\rangle)^2$. Throughout this section, the $O(\cdot)$ masks constants depending only on $\nu$, $C_{\min}$, $C_{\max}$, $C_{\mathrm{cond}}$.

### 3.1. Javanmard-Montanari (JM)-style Estimator

Our first approach to single point transductive prediction is inspired by the debiased Lasso estimator of Javanmard & Montanari (2014), which was designed to construct confidence intervals for individual model parameters $(\beta_0)_j$. For prediction in the $x_\star$ direction, we will consider the following generalization of the Javanmard-Montanari (JM) debiasing construction:

$$\hat y^{\mathrm{JM}} = \langle x_\star, \hat\beta\rangle + \tfrac{1}{n}\,\hat w^\top X^\top(y - X\hat\beta) \quad \text{for} \qquad (3)$$
$$\hat w = \operatorname{argmin}_{w}\; w^\top \hat\Sigma_n w \quad \text{s.t.}\quad \|\hat\Sigma_n w - x_\star\|_\infty \le \lambda_w, \qquad (4)$$

where $\hat\Sigma_n = \tfrac{1}{n}X^\top X$ denotes the sample covariance. (In the event the constraints in (4) are not feasible, we define $\hat w = 0$.) Here, $\hat\beta$ is any (ideally $\ell_1$-consistent) initial pilot estimate of $\beta_0$, like the estimate $\hat\beta_L(\lambda)$ returned by the Lasso. When $x_\star = e_j$ the estimator (3) reduces exactly to the program in (Javanmard & Montanari, 2014), and equivalent generalizations have been used in (Chao et al., 2014; Athey et al., 2018; Cai & Guo, 2017) to construct prediction intervals and to estimate treatment effects.
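For concreteness, the debiasing program (4) is a small quadratic program that can be handed to an off-the-shelf convex solver such as CVXPY (Diamond & Boyd, 2016), which the experiments also rely on. The sketch below is our own illustration rather than a transcription of the released code; the function and variable names are ours, and $\lambda_w$ is left as an input to be set by theory (Theorem 3 below) or by cross-validation.

```python
import numpy as np
import cvxpy as cp

def jm_transductive_predict(X, y, x_star, beta_pilot, lam_w):
    """JM-style transductive prediction, following (3)-(4).

    Solves  min_w  w' S w   s.t.  || S w - x_star ||_inf <= lam_w,
    with S = X'X / n, then returns
        <x_star, beta_pilot> + w' X' (y - X beta_pilot) / n.
    Falls back to the pilot prediction (w = 0) when the program is infeasible.
    """
    n = X.shape[0]
    S = X.T @ X / n                                     # sample covariance Sigma_hat_n
    w = cp.Variable(X.shape[1])
    objective = cp.Minimize(cp.sum_squares(X @ w) / n)  # equals w' S w
    constraints = [cp.norm(S @ w - x_star, "inf") <= lam_w]
    cp.Problem(objective, constraints).solve()
    pilot_pred = float(x_star @ beta_pilot)
    if w.value is None:                                 # infeasible constraint set
        return pilot_pred
    correction = float(w.value @ X.T @ (y - X @ beta_pilot)) / n
    return pilot_pred + correction
```

In practice `beta_pilot` would be a Lasso (or other) fit on the training data, and the program is re-solved for each new test direction `x_star`, reflecting the computational-statistical tradeoff discussed below.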
Intuitively, $\hat w$ approximately inverts the population covariance matrix along the direction defined by $x_\star$ (i.e., $\Sigma\hat w \approx x_\star$). The second term in (3) can be thought of as a high-dimensional one-step correction designed to remove bias from the initial prediction $\langle x_\star, \hat\beta\rangle$; see (Javanmard & Montanari, 2014) for more intuition on this construction. We can now state our primary guarantee for the JM-style estimator (3); the proof is given in Appendix D.1.

**Theorem 3.** Suppose Assumptions 1, 2, 3 and 4 hold and that the transductive estimator $\hat y^{\mathrm{JM}}$ of (3) is fit with regularization parameter $\lambda_w = 8a\sqrt{C_{\mathrm{cond}}}\,\nu^2\,\|x_\star\|_2\sqrt{\log(2e(p\vee n))/n}$ for some $a > 0$. Then there is a universal constant $c_1$ such that if $n \ge c_1 a^2\log(2e(p\vee n))$,
$$\mathbb{E}[(\hat y^{\mathrm{JM}} - \langle\beta_0, x_\star\rangle)^2] \;\le\; O\!\left(\frac{\sigma^2\|x_\star\|_2^2}{n}\right) + O\!\left(\lambda_w^2\, r_{\beta,1}^2\right) + O\!\left(\frac{1}{(n\vee p)^{c_3}}\right) \qquad (5)$$
for $c_3 = a^2/2$ and $r_{\beta,1} = (\mathbb{E}[\|\hat\beta - \beta_0\|_1^4])^{1/4}$, the $\ell_1$ error of the initial estimate. Moreover, if $\lambda_w \ge \|x_\star\|_\infty$, then $\mathbb{E}[(\hat y^{\mathrm{JM}} - \langle\beta_0, x_\star\rangle)^2] = \mathbb{E}[\langle x_\star, \hat\beta - \beta_0\rangle^2]$.

Intuitively, the first term in our bound (5) can be viewed as the variance of the estimator's prediction along the direction of $x_\star$, while the second term can be thought of as the (reduced) bias of the estimator. We consider the third term to be of higher order since $a$ (and in turn $c_3$) can be chosen to be a large constant. Finally, when $\lambda_w \ge \|x_\star\|_\infty$ the error of the transductive procedure reduces to that of the pilot regression procedure. When the Lasso is used as the pilot regression procedure we can derive the following corollary to Theorem 3, also proved in Appendix D.3.

**Corollary 3.** Under the conditions of Theorem 3, consider the JM-style estimator (3) with pilot estimate $\hat\beta = \hat\beta_L(\lambda)$ for $\lambda \gtrsim \sigma\sqrt{\log(2ep/s_{\beta_0})/n}$. If $p \ge 20$, then there exist universal constants $c_1, c_2$ such that if $\|\beta_0\|_1/\sigma = o(e^{c_1 n})$ and $n \ge c_2\max\{s_{\beta_0}\log(2ep)/C_{\min},\, a^2\}\,\log(2e(p\vee n))$,
$$\mathbb{E}[(\hat y^{\mathrm{JM}} - \langle\beta_0, x_\star\rangle)^2] \;\le\; O\!\left(\frac{\sigma^2\|x_\star\|_2^2}{n}\right) + O\!\left(\lambda_w^2\, s_{\beta_0}^2\,\lambda^2\right) + O\!\left(\frac{1}{(n\vee p)^{c_3}}\right).$$

We make several remarks to further interpret this result. First, to simplify the presentation of the results (and match the lower bound setting of Theorem 2), consider the setting in Corollary 3 with $a \asymp 1$, $\lambda \asymp \sigma\sqrt{\log p/n}$, and $n \gtrsim s_{\beta_0}^2\log p\,\log(p\vee n)$. Then the upper bound in Theorem 3 can be succinctly stated as $O(\sigma^2\|x_\star\|_2^2/n)$. In short, the transductive estimator attains a dimension-free rate for sufficiently large $n$. Under the same conditions the Lasso estimator suffers a prediction error of $\Omega\big(\tfrac{\sigma^2\log p}{n}\|x_\star\|_{(s)}^2\big)$, as Theorem 2 and Corollary 2 establish. Thus transduction guarantees improvement over the Lasso lower bound whenever $x_\star$ satisfies the soft sparsity condition $\|x_\star\|_2/\|x_\star\|_{(s)} \lesssim \sqrt{\log p}$. Since $x_\star$ is observable, one can selectively deploy transduction based on the soft sparsity level $\|x_\star\|_2/\|x_\star\|_{(s)}$ or on bounds thereof (see the short sketch below).

Second, the estimator described in (3) and (4) is transductive in that it is tailored to an individual test point $x_\star$. The corresponding guarantees in Theorem 3 and Corollary 3 embody a computational-statistical tradeoff. In our setting, the detrimental effects of regularization can be mitigated at the cost of extra computation: the convex program in (4) must be solved for each new $x_\star$.

Third, the condition $\|\beta_0\|_1/\sigma = o(e^{c_1 n})$ is not used for our high-probability error bound and is only used to control the prediction risk (2) on the low-probability event that the (random) design matrix $X$ does not satisfy a restricted eigenvalue-like condition. For comparison, note that our Theorem 2 lower bound establishes substantial excess Lasso bias even when $\|\beta_0\|_1 = \lambda = o(1)$.
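The soft-sparsity deployment rule from the first remark only involves the observed $x_\star$, so it is cheap to check before paying for the extra convex program. The following small sketch is our own illustration (the sparsity level `s` and the exact threshold are user choices, not quantities prescribed by the theory above):

```python
import numpy as np

def soft_sparsity_ratio(x_star, s):
    """||x_star||_2 divided by the trimmed norm ||x_star||_(s),
    the sum of the s largest-magnitude entries of x_star."""
    trimmed = np.sort(np.abs(x_star))[-s:].sum()
    return np.linalg.norm(x_star) / trimmed

def deploy_transduction(x_star, s):
    """Heuristic check of the soft sparsity condition ||x||_2 / ||x||_(s) <~ sqrt(log p)."""
    p = x_star.shape[0]
    return soft_sparsity_ratio(x_star, s) <= np.sqrt(np.log(p))
```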
Finally, we highlight that Cai & Guo (2017) have shown that the JM-style estimator with a scaled Lasso base procedure can produce CIs for $x_\star^\top\beta_0$ with minimax rate optimal length when $x_\star$ is sparsely loaded. Although our primary focus is on improving the mean-squared prediction risk (2), we conclude this section by showing that a different setting of $\lambda_w$ yields minimax rate optimal CIs for dense $x_\star$, and simultaneously minimax rate optimal CIs for sparse and dense $x_\star$ when $\beta_0$ is sufficiently sparse:

**Proposition 4.** Under the conditions of Theorem 3 with $\sigma = 1$, consider the JM-style estimator (3) with pilot estimate $\hat\beta = \hat\beta_L(\lambda)$ and $\lambda = 80\sqrt{\log(2ep/s_{\beta_0})/n}$. Fix any $C_1, C_2, C_3 > 0$, and instate the assumptions of Cai & Guo (2017), namely that the vector $x_\star$ satisfies $\max_j |(x_\star)_j| \le C_1 \min_{j \in \mathrm{supp}(x_\star)} |(x_\star)_j|$ and $s_{\beta_0} \le p^\gamma$ for $0 \le \gamma < \tfrac{1}{2}$. Then for $n \gtrsim s_{\beta_0}\log p$ the estimator $\hat y^{\mathrm{JM}}$ of (3) with $\lambda_w = 8\sqrt{C_{\mathrm{cond}}}\,\nu^2\,\frac{\|x_\star\|_2}{s_{\beta_0}\sqrt{\log p}}$ yields (minimax rate optimal) confidence intervals for $x_\star^\top\beta_0$ of expected length

- $O\big(\|x_\star\|_\infty\, s_{\beta_0}\sqrt{\log p/n}\big)$ in the dense $x_\star$ regime where $\|x_\star\|_0 = C_3 p^{\gamma_q}$ with $2\gamma < \gamma_q < 1$ (matching the result of Cai & Guo, 2017, Thm. 4), and
- $O\big(\|x_\star\|_2/\sqrt{n}\big)$ in the sparse $x_\star$ regime of (Cai & Guo, 2017, Thm. 1) where $\|x_\star\|_0 \le C_2 s_{\beta_0}$, provided $n \gtrsim s_{\beta_0}^2(\log p)^2$.

Here the $O(\cdot)$ masks constants depending only on $\nu$, $C_1$, $C_2$, $C_3$, $C_{\min}$, $C_{\max}$, $C_{\mathrm{cond}}$. The proof can be found in Appendix D.2.

### 3.2. Orthogonal Moment (OM) Estimators

Our second approach to single point transductive prediction is inspired by orthogonal moment (OM) estimation (Chernozhukov et al., 2017). OM estimators are commonly used to estimate single parameters of interest (like a treatment effect) in the presence of high-dimensional or nonparametric nuisance. To connect our problem to this semiparametric world, we first frame the task of prediction in the $x_\star$ direction as one of estimating a single parameter, $\theta_0 = x_\star^\top\beta_0$. Consider the linear model equation (1), $y_i = x_i^\top\beta_0 + \epsilon_i = ((U^{-1})^\top x_i)^\top U\beta_0 + \epsilon_i$, with a data reparametrization defined by the matrix $U$ whose first row is $u_1^\top$ for $u_1 = x_\star/\|x_\star\|_2$, so that $e_1^\top U = u_1^\top$. Here, the matrix $R \in \mathbb{R}^{(p-1)\times p}$ collecting the remaining rows of $U$ has orthonormal rows which span the subspace orthogonal to $u_1$; these are obtained as the non-$u_1$ eigenvectors of the projector matrix $I_p - u_1u_1^\top$. This induces the data reparametrization $x_i' = [t_i, z_i^\top]^\top = (U^{-1})^\top x_i$. In the reparametrized basis, the linear model becomes

$$y_i = \theta_0\, t_i + z_i^\top f_0 + \epsilon_i, \qquad t_i = g_0(z_i) + \eta_i, \qquad q_0(z_i) \triangleq \theta_0\, g_0(z_i) + z_i^\top f_0, \qquad (6)$$

where we have introduced convenient auxiliary equations in terms of $g_0(z_i) \triangleq \mathbb{E}[t_i \mid z_i]$. To estimate $\theta_0 = x_\star^\top\beta_0$ in the presence of the unknown nuisance parameters $f_0, g_0, q_0$, we introduce a thresholded variant of the two-stage method of moments estimator proposed in (Chernozhukov et al., 2017). The method of moments takes as input a moment function $m$ of both data and parameters that uniquely identifies the target parameter of interest. Our reparametrized model form (6) gives us access to two different Neyman orthogonal moment functions described in (Chernozhukov et al., 2017):

$$\text{f moments:}\quad m(t_i, y_i, \theta, z_i^\top f, g(z_i)) = (y_i - \theta t_i - z_i^\top f)\,(t_i - g(z_i)), \qquad (7)$$
$$\text{q moments:}\quad m(t_i, y_i, \theta, q(z_i), g(z_i)) = (y_i - q(z_i) - \theta(t_i - g(z_i)))\,(t_i - g(z_i)).$$

These orthogonal moment equations enable the accurate estimation of a target parameter $\theta_0$ in the presence of high-dimensional or nonparametric nuisance parameters (in this case $f_0$ and $g_0$). We focus our theoretical analysis and presentation on the f moments since the analysis is similar for the q moments, although we investigate the practical utility of both in Section 4. Our OM proposal to estimate $\theta_0$ now proceeds as follows.
We first split our original dataset of $n$ points into two disjoint, equal-sized folds $(X^{(1)}, y^{(1)}) = \{(x_i, y_i) : i \in \{1, \dots, n/2\}\}$ and $(X^{(2)}, y^{(2)}) = \{(x_i, y_i) : i \in \{n/2+1, \dots, n\}\}$. (In practice, we use $K$-fold cross-fitting to increase the sample efficiency of the scheme as in Chernozhukov et al. (2017); for simplicity of presentation, we defer the description of this slight modification to Appendix G.4.) Then:

1. The first fold $(X^{(1)}, y^{(1)})$ is used to run two first-stage regressions. First, we estimate $\beta_0$ by linearly regressing $y^{(1)}$ onto $X^{(1)}$ to produce $\hat\beta$; this provides an estimate of $f_0$ as the last $p-1$ coordinates of $U\hat\beta$, namely $\hat f = R\hat\beta$. Second, we estimate $g_0$ by regressing $t^{(1)}$ onto $z^{(1)}$ to produce a regression model $\hat g(\cdot): \mathbb{R}^{p-1} \to \mathbb{R}$. Any arbitrary linear or non-linear regression procedure can be used to fit $\hat g(\cdot)$.
2. Then, we estimate $\mathbb{E}[\eta_1^2]$ as $\hat\mu^2 = \frac{1}{n/2}\sum_{i=n/2+1}^{n} t_i(t_i - \hat g(z_i))$, where the sum is taken over the second fold of data $(X^{(2)}, y^{(2)})$; crucially, $(t_i, z_i)$ are independent of $\hat g(\cdot)$ in this expression.
3. If $\hat\mu^2 < \tau$ for a threshold $\tau$, we simply output $\hat y^{\mathrm{OM}} = x_\star^\top\hat\beta$. If $\hat\mu^2 \ge \tau$, we estimate $\theta_0$ by solving the empirical moment equation
$$\frac{1}{n/2}\sum_{i=n/2+1}^{n} m(t_i, y_i, \hat y^{\mathrm{OM}}, z_i^\top\hat f, \hat g(z_i)) = 0 \;\;\Longrightarrow\;\; \hat y^{\mathrm{OM}} = \frac{\sum_{i=n/2+1}^{n}(y_i - z_i^\top\hat f)(t_i - \hat g(z_i))}{\sum_{i=n/2+1}^{n} t_i\,(t_i - \hat g(z_i))}, \qquad (8)$$
where the sum is taken over the second fold of data $(X^{(2)}, y^{(2)})$ and $m$ is defined in (7).

If we had oracle access to the underlying $f_0$ and $g_0$, solving the population moment condition $\mathbb{E}_{t_1, y_1, z_1}[m(t_1, y_1, \theta, z_1^\top f_0, g_0(z_1))] = 0$ for $\theta$ would exactly yield $\theta_0 = x_\star^\top\beta_0$. In practice, we first construct estimates $\hat f$ and $\hat g$ of the unknown nuisance parameters to serve as surrogates for $f_0$ and $g_0$ and then solve an empirical version of the aforementioned moment condition to extract $\hat y^{\mathrm{OM}}$.

A key property of the moments in (7) is their Neyman orthogonality: they satisfy $\mathbb{E}[\nabla_{z_1^\top f}\, m(t_1, y_1, \theta_0, z_1^\top f_0, g_0(z_1))] = 0$ and $\mathbb{E}[\nabla_{g(z_1)}\, m(t_1, y_1, \theta_0, z_1^\top f_0, g_0(z_1))] = 0$. Thus the solution of the empirical moment equations is first-order insensitive to errors arising from using $\hat f, \hat g$ in place of $f_0$ and $g_0$. Data splitting is further used to create independence across the two stages of the procedure. In the context of testing linearly-constrained hypotheses on the parameter $\beta_0$, Zhu & Bradic (2018) propose a two-stage OM test statistic based on the transformed f moments introduced above; they do not use cross-fitting and specifically employ adaptive Dantzig-like selectors to estimate $f_0$ and $g_0$. Finally, the thresholding step allows us to control the variance increase that might arise from $\hat\mu^2$ being too small and thereby enables our non-asymptotic prediction risk bounds.

Before presenting the analysis of the OM estimator (8) we introduce another condition:

**Assumption 5.** The noise $\eta_i = t_i - g_0(z_i)$ is independent of $z_i$. (This assumption is not essential to our result and could be replaced by assuming $\eta_i$ satisfies $\mathbb{E}[\eta_i \mid z_i] = 0$ and is almost surely (w.r.t. $z_i$) sub-Gaussian with a uniformly (w.r.t. $z_i$) bounded variance parameter.)

Recall that $\hat g$ is evaluated on the (independent) second-fold data $z$. We now obtain our central guarantee for the OM estimator (proved in Appendix E.1).

**Theorem 5.** Let Assumptions 1, 2, 3, 4 and 5 hold, and assume that $g_0(z_i) = g_0^\top z_i$ in (6) for $g_0 = \operatorname{argmin}_g \mathbb{E}[(t_1 - z_1^\top g)^2]$. Then the thresholded orthogonal ML estimator $\hat y^{\mathrm{OM}}$ of (8), with a suitable threshold $\tau > 0$ specified in the proof in Appendix E.1, satisfies
$$\mathbb{E}[(\hat y^{\mathrm{OM}} - x_\star^\top\beta_0)^2] \;\le\; O\!\left(\frac{\sigma^2\|x_\star\|_2^2}{n}\right) + O\!\left(\frac{r_{\beta,2}^2\, r_{g,2}^2}{(\mathbb{E}[\eta_1^2])^2}\right) + (\text{lower-order terms}), \qquad (9)$$
where $r_{\beta,2} = (\mathbb{E}[\|\hat\beta - \beta_0\|_2^4])^{1/4}$ and $r_{g,2} = (\mathbb{E}[(\hat g(z_n) - g_0(z_n))^4])^{1/4}$ denote the expected prediction errors of the first-stage estimators.

Since we are interested in the case where $\hat\beta$ and $\hat g(\cdot)$ have small error (i.e., $r_{\beta,2} = r_{g,2} = o(1)$), the first term in (9) can be interpreted as the variance of the estimator's prediction along the direction of $x_\star$, while the remaining terms represent the reduced bias of the estimator.
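To make the two-stage construction above concrete, here is a minimal sketch of the thresholded f-moment estimator under the assumptions of this section. It is our own illustration rather than the released implementation: it uses a single two-fold split instead of the $K$-fold cross-fitting of Appendix G.4, cross-validated Lasso first stages as a stand-in for whatever base procedure is preferred, an orthonormal reparametrization that folds $\|x_\star\|_2$ into a final rescaling, and an arbitrary illustrative threshold `tau`.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def om_f_transductive_predict(X, y, x_star, tau=1e-3):
    """Thresholded OM prediction with f-moments, Lasso first stages,
    and a single 2-fold split (the experiments use K-fold cross-fitting)."""
    n, p = X.shape
    # Reparametrize: t is the coordinate along u1 = x_star / ||x_star||_2,
    # z collects the p-1 coordinates orthogonal to u1.
    u1 = x_star / np.linalg.norm(x_star)
    _, _, Vt = np.linalg.svd(u1.reshape(1, -1))
    R = Vt[1:]                              # (p-1) x p orthonormal rows, orthogonal to u1
    T, Z = X @ u1, X @ R.T
    half = n // 2
    i1, i2 = np.arange(half), np.arange(half, n)
    # First stage (fold 1): pilot beta-hat, nuisance estimates f-hat and g-hat.
    beta_hat = LassoCV(cv=5).fit(X[i1], y[i1]).coef_
    f_hat = R @ beta_hat                    # estimate of f0 in the rotated basis
    g_model = LassoCV(cv=5).fit(Z[i1], T[i1])
    # Second stage (fold 2): thresholding and the empirical f-moment equation (8).
    t2, z2, y2 = T[i2], Z[i2], y[i2]
    resid_t = t2 - g_model.predict(z2)
    mu2 = np.mean(t2 * resid_t)             # estimate of E[eta^2]
    if mu2 < tau:                           # thresholding step: fall back to the pilot
        return float(x_star @ beta_hat)
    theta_hat = np.mean((y2 - z2 @ f_hat) * resid_t) / mu2
    return float(theta_hat * np.linalg.norm(x_star))   # undo the normalization of u1
```

Any regression procedure can replace `LassoCV` in either first-stage fit, which is the flexibility of the OM framework emphasized below and exploited in Section 4.3.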
We first instantiate this result in the setting where both $\beta_0$ and $g_0$ are estimated using ridge regression (see Appendix E.2 for the corresponding proof).

**Corollary 4 (OM Ridge).** Assume $\|\beta_0\|_1/\sigma = O(1)$. In the setting of Theorem 5, suppose $\hat\beta$ and $\hat g(z_i) = \hat g^\top z_i$ are fit with the ridge estimator with regularization parameters $\lambda_\beta$ and $\lambda_g$ respectively. Then there exist universal constants $c_{1:5}$ such that if $p \ge 20$ and $\lambda_\beta$, $\lambda_g$, and $n$ satisfy the conditions detailed in Appendix E.2 (in particular, $\lambda_\beta \le c_3(C_{\mathrm{cond}}C_{\max}n)^{1/3}$ and $n \ge c_5\nu^4 C_{\mathrm{cond}}^2$), then
$$\mathbb{E}[(\hat y^{\mathrm{OM}} - x_\star^\top\beta_0)^2] \;\le\; O\!\left(\frac{\sigma^2\|x_\star\|_2^2}{n}\right) + O\!\left(\frac{p^2}{n^2}\right),$$
where the second, bias-induced term hides problem-dependent factors (including $(\mathbb{E}[\eta_1^2])^{-2}$) that are made explicit in Appendix E.2.

Similarly, when $\beta_0$ and $g_0$ are estimated using the Lasso we conclude the following (proved in Appendix E.2).

**Corollary 5 (OM Lasso).** In the setting of Theorem 5, suppose $\hat\beta$ and $\hat g(z_i) = \hat g^\top z_i$ are fit with the Lasso with regularization parameters $\lambda_\beta \ge 80\sigma\sqrt{\log(2ep/s_{\beta_0})/n}$ and $\lambda_g \ge 80\sigma\sqrt{\log(2ep/s_{g_0})/n}$ respectively. If $p \ge 20$, $s_{\beta_0} = \|\beta_0\|_0$, and $s_{g_0} = \|g_0\|_0$, then there exist universal constants $c_1, c_2$ such that if $\|\beta_0\|_1/\sigma = o(e^{c_1 n})$, then for $n \ge c_1\frac{\nu^4}{C_{\min}}\max\{s_{\beta_0}, s_{g_0}\}\log(2ep)$,
$$\mathbb{E}[(\hat y^{\mathrm{OM}} - x_\star^\top\beta_0)^2] \;\le\; O\!\left(\frac{\sigma^2\|x_\star\|_2^2}{n}\right) + O\!\left(\frac{\lambda_\beta^2\lambda_g^2\, s_{\beta_0} s_{g_0}}{(\mathbb{E}[\eta_1^2])^2}\right) + (\text{lower-order terms}).$$

We make several comments regarding the aforementioned results. First, Theorem 5 possesses a double-robustness property. In order for the dominant bias term $O(r_{\beta,2}^2\, r_{g,2}^2)$ to be small, it is sufficient for either $\beta_0$ or $g_0$ to be estimated at a fast rate, or for both to be estimated at a slow rate. As before, the estimator is transductive and adapted to predicting along the direction $x_\star$.

Second, in the case of ridge regression, to match the lower bound of Corollary 1, consider the setting where $n = \Omega(p^2)$, $\mathrm{SNR} = o(p^2/n)$, $\cos(x_\star, \beta_0)^2 = \Omega(1)$, and $\mathrm{SNR} \gtrsim p/n$. Then the upper bound can be simplified to $O(\sigma^2\|x_\star\|_2^2/n)$. (Note that in this regime, $\sqrt{\mathrm{SNR}} = \|\beta_0\|_2/\sigma = o(1)$ and hence the condition $\|\beta_0\|_1/\sigma = O(1)$ in Corollary 4 is satisfied.) By contrast, Corollary 1 shows the error of the optimally-tuned ridge estimator is lower bounded by $\omega(\sigma^2\|x_\star\|_2^2/n)$; for example, the error is $\Omega(p\,\sigma^2\|x_\star\|_2^2/n)$ when $\mathrm{SNR} \asymp p/n$. Hence, the performance of the ridge estimator can be significantly worse than that of its transductive counterpart.

Third, if we consider the setting of Corollary 5 where $n \gtrsim s_{\beta_0} s_{g_0}(\log p)^2$ while we take $\lambda_\beta \asymp \sigma\sqrt{\log p/n}$ and $\lambda_g \asymp \sigma\sqrt{\log p/n}$, the error of the OM estimator attains the fast, dimension-free $O(\sigma^2\|x_\star\|_2^2/n)$ rate. On the other hand, Corollary 2 shows the Lasso suffers prediction error $\Omega\big(\tfrac{\sigma^2\log p}{n}\|x_\star\|_{(s)}^2\big)$, and hence again strict improvement is possible over the baseline when $\|x_\star\|_2/\|x_\star\|_{(s)} \lesssim \sqrt{\log p}$.

Finally, although Theorem 5 makes stronger assumptions on the design of $X$ than the JM-style estimator introduced in (3) and (4), one of the primary benefits of the OM framework is its flexibility. All that is required for the algorithm are black-box estimates of $g_0$ and $\beta_0$, which can be obtained from more general ML procedures than the Lasso.

## 4. Experiments

We complement our theoretical analysis with a series of numerical experiments highlighting the failure modes of standard inductive prediction. In Sections 4.1 and 4.2, error bars represent one standard error of the mean computed over 20 independent problem instances. We provide complete experimental set-up details in Appendix G and code replicating all experiments at https://github.com/nileshtrip/SPTransducPredCode.
### 4.1. Excess Lasso Bias without Distribution Shift

We construct problem instances for Lasso estimation by independently generating $x_i \sim N(0, I_p)$, $\epsilon_i \sim N(0, 1)$, and $(\beta_0)_j \sim N(0, 1)$ for $j$ less than the desired sparsity level $s_{\beta_0}$, while $(\beta_0)_j = 0$ otherwise. We fit the Lasso estimator, the JM-style estimator with a Lasso pilot, and the OM f-moment estimator with Lasso first-stage estimators. We set all hyperparameters to their theoretically-motivated values. As Figure 1 demonstrates, both transductive methods significantly reduce the prediction risk of the Lasso estimator when the hyperparameters are calibrated to their theoretical values, even for a dense $\beta_0$ (where $p/s_{\beta_0} = 2$).

*Figure 1. Lasso vs. OM and JM Lasso prediction without distribution shift (log average prediction RMSRE vs. number of sample points $n$). Hyperparameters are set according to theory (see Section 4.1). Left: $p = 200$, $s_{\beta_0} = 20$. Right: $p = 200$, $s_{\beta_0} = 100$.*

### 4.2. Benefits of Transduction under Distribution Shift

The no-distribution-shift simulations of Section 4.1 corroborate the theoretical results of Corollaries 3 and 5. However, since our transductive estimators are tailored to each individual test point $x_\star$, we expect these methods to provide an even greater gain when the test distribution deviates from the training distribution. In Figure 2, we consider two cases where the test distribution is either mean-shifted or covariance-shifted from the training distribution and evaluate the ridge estimator with the optimal regularization parameter for the training distribution, $\lambda_\star = p\sigma^2/\|\beta_0\|_2^2$. We independently generated $x_i \sim N(0, I_p)$, $\epsilon_i \sim N(0, 1)$, and $\beta_0 \sim N(0, \tfrac{1}{\sqrt{p}} I_p)$. In the case with a mean-shifted test distribution, we generated $x_\star \sim N(10\beta_0, I_p)$ for each problem instance, while the covariance-shifted test distribution was generated by taking $x_\star \sim N(0, 100\,\beta_0\beta_0^\top)$. The plots in Figure 2 show the OM estimator with $\lambda_\star$-ridge pilot provides significant gains over the baseline $\lambda_\star$-ridge estimator.

*Figure 2. Ridge vs. OM ridge prediction ($p = 200$) under train-test distribution shift (log average prediction RMSRE vs. number of sample points $n$; panels: distribution shift in covariance matrix, distribution shift in mean vector). Hyperparameters are set according to theory.*

*Figure 3. Ridge vs. OM ridge prediction ($p = 200$) under train-test distribution shift (same panels and axes as Figure 2). Hyperparameters are set according to CV.*

In Figure 4 we also consider two cases where the test distribution is shifted for Lasso estimation but the set-up is otherwise identical to that of Section 4.1. For covariance shifting, we generated $(x_\star)_i \overset{\mathrm{indep}}{\sim} N(0, 100)$ for $i \in \mathrm{supp}(\beta_0)$ and $(x_\star)_i = 0$ otherwise for each problem instance.
For mean shifting, we generated $x_\star \sim N(10\beta_0, I_p)$ for each problem instance.

*Figure 4. Lasso vs. OM and JM Lasso prediction ($p = 200$) under mean ($s_{\beta_0} = 100$) or covariance ($s_{\beta_0} = 20$) train-test distribution shifts. Hyperparameters are set according to theory.*

*Figure 5. Lasso vs. OM and JM Lasso prediction ($p = 200$) under mean ($s_{\beta_0} = 100$) or covariance ($s_{\beta_0} = 20$) train-test distribution shifts. Hyperparameters are set according to CV.*

The first and second plots in Figure 4 show the transductive effect of the OM and JM estimators improves prediction risk with respect to the Lasso when the regularization hyperparameters are selected via theory. We also note that Figures 3 and 5 compare CV-tuned ridge or Lasso to OM and JM with CV-tuned base procedures, showing the benefit of transduction in this practical setting where regularization hyperparameters are chosen by CV. As the first and second plots in Figure 3 show, selecting $\lambda$ via CV leads to over-regularization of the ridge estimator, and the transductive methods provide substantial gains over the base ridge estimator. In the case of the Lasso, the first and second plots in Figure 5 show that the residual bias of the CV Lasso also causes it to incur significant error in its test predictions, while the transductive methods provide substantial gains by adapting to each $x_\star$.

### 4.3. Improving Cross-validated Prediction

Motivated by our findings on synthetic data, we next report the performance of our methods on five real datasets with and without distribution shift. We also include the popular elastic net estimator as a base regression procedure alongside ridge and the Lasso. All hyperparameters are selected by CV. For the OM estimators we exploited the flexibility of the framework by including a suite of methods for the auxiliary $g$ regressions: Lasso estimation, random forest regression, and a $g = 0$ baseline. Amongst these, we select the method with the least estimated asymptotic variance, which can be done in a data-dependent way without introducing any extra hyperparameters into the implementation. The f and q regressions were always fit with Lasso, ridge, or elastic net estimation. See Appendix G for further details on the methodology and datasets from the UCI dataset repository (Dua & Graff, 2017).

In Table 1 we see that the OM estimators generically provide gains over the CV Lasso, CV ridge, and CV elastic net on datasets with intrinsic distribution shift and perform comparably on a dataset without explicit distribution shift. On Wine, we see a substantial performance gain from 0.96-0.99 RMSE without transduction to 0.77 with OM q transduction. The gains on other datasets are smaller but notable, as they represent consistent improvements over the de facto standard of CV prediction. We also report the performance of ordinary least squares (OLS), which produces an unbiased estimate of the entire parameter vector $\beta_0$. OLS fares worse than most methods on each dataset due to an increase in variance. In contrast, our proposed transductive procedures limit the variance
In contrast, our proposed transductive procedures limit the variance intro- Single Point Transductive Prediction Table 1. Test set RMSE of OLS; CV-tuned ridge, Lasso, and elastic net; OM and JM transductive CV-tuned ridge, Lasso, and elastic net; and prior transductive approaches (TD Lasso, Ridge, and KNN) on real-world datasets. All hyperparameters are set via CV. Error bars represent a delta method interval based on 1 standard error of the mean squared error over the test set. Method Wine Parkinson Fire Fertility Triazines (no shift) OLS 1.0118 0.0156 12.7916 0.1486 82.7147 35.5141 0.3988 0.0657 0.1716 0.037 Ridge 0.9936 0.0155 12.5267 0.1448 82.3462 35.5955 0.399 0.0665 0.1469 0.0285 OM f (Ridge) 0.9883 0.0154 12.4686 0.1439 82.3522 35.5519 0.3987 0.0655 0.1446 0.029 OM q (Ridge) 0.7696 0.0145 12.0891 0.1366 81.9794 35.7872 0.3977 0.0653 0.1507 0.0242 Lasso 0.9812 0.0155 12.2535 0.1356 82.0656 36.0321 0.4092 0.0716 0.1482 0.0237 JM (Lasso) 1.0118 0.0156 12.7916 0.1486 82.7147 35.5141 0.3988 0.0657 0.173 0.0367 OM f (Lasso) 0.9473 0.0152 11.869 0.1339 81.794 35.5699 0.398 0.0665 0.1444 0.0239 OM q (Lasso) 0.7691 0.0144 11.8692 0.1339 81.811 35.5637 0.3976 0.0656 0.1479 0.0226 Elastic 0.9652 0.0154 12.2535 0.1356 81.8428 35.8333 0.4092 0.0716 0.1495 0.0238 OM f (Elastic) 0.9507 0.0152 11.8369 0.1338 81.7719 35.6166 0.398 0.0655 0.1445 0.024 OM q (Elastic) 0.7693 0.0145 11.8658 0.1341 81.803 35.6485 0.3976 0.0657 0.147 0.0228 TD Lasso (Alquier & Hebiri, 2012) 0.9813 0.0154 12.2535 0.1358 82.0657 36.0320 0.4092 0.0716 0.1483 0.0237 TD Ridge (Chapelle et al., 2000) 0.8411 0.0004 12.2534 0.0021 82.0664 2.567 0.4089 0.0128 0.1735 0.0004 TD KNN (Cortes & Mohri, 2007) 0.8345 0.0153 12.3326 0.1447 81.9467 35.8340 0.3845 0.0760 0.1510 0.0240 duced by targeting a single parameter of interest, hx?, β0i. Finally, we evaluated three existing transductive prediction methods the transductive Lasso (TD Lasso) of (Alquier & Hebiri, 2012; Bellec et al., 2018), transductive ridge regression (TD Ridge) (Chapelle et al., 2000), and transductive ridge regression with local (kernel) neighbor labelling (TD KNN) (Cortes & Mohri, 2007) on each dataset, tuning all hyperparameters via CV. TD Lasso does not significantly improve upon the Lasso baseline on any dataset. TD Ridge only improves upon the baselines on Wine but is outperformed by OM q. TD KNN also underperforms OM q on every dataset except Fertility. 5. Discussion and Future Work We presented two single point transductive prediction procedures that, given advanced knowledge of a test point, can significantly improve the prediction error of an inductive learner. We provided theoretical guarantees for these procedures and demonstrated their practical utility, especially under distribution shift, on synthetic and real data. Promising directions for future work include improving our OM debiasing techniques using higher-order orthogonal moments (Mackey et al., 2017) and exploring the utility of these debiasing techniques for other regularizers (e.g., group Lasso (Yuan & Lin, 2006) penalties) and models such as generalized linear models and kernel machines. Alquier, P. and Hebiri, M. Transductive versions of the lasso and the dantzig selector. Journal of Statistical Planning and Inference, 142(9):2485 2500, 2012. Athey, S., Imbens, G. W., and Wager, S. Approximate resid- ual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4): 597 623, 2018. 
Belkin, M., Niyogi, P., and Sindhwani, V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399-2434, 2006.

Bellec, P. C. and Zhang, C.-H. De-biasing the lasso with degrees-of-freedom adjustment. arXiv preprint arXiv:1902.08885, 2019.

Bellec, P. C., Lecué, G., and Tsybakov, A. B. Slope meets Lasso: improved oracle bounds and optimality. arXiv preprint arXiv:1605.08651, 2016.

Bellec, P. C., Dalalyan, A. S., Grappin, E., and Paris, Q. On the prediction loss of the lasso in the partially labeled setting. Electronic Journal of Statistics, 12(2):3443-3472, 2018.

Bickel, P. J., Ritov, Y., and Tsybakov, A. B. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37(4):1705-1732, 2009.

Bishop, A. N., Del Moral, P., and Niclas, A. An introduction to Wishart matrix moments. Foundations and Trends in Machine Learning, 11(2):97-218, 2018. doi: 10.1561/2200000072.

Cai, T. T. and Guo, Z. Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. The Annals of Statistics, 45(2):615-646, 2017.

Chao, S.-K., Ning, Y., and Liu, H. On high dimensional post-regularization prediction intervals, 2014.

Chapelle, O., Vapnik, V., and Weston, J. Transductive inference for estimating values of functions. In Advances in Neural Information Processing Systems, pp. 421-427, 2000.

Chen, X., Monfort, M., Liu, A., and Ziebart, B. D. Robust covariate shift regression. In Artificial Intelligence and Statistics, pp. 1270-1279, 2016.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and causal parameters. Technical report, 2017.

Cortes, C. and Mohri, M. On transductive regression. In Advances in Neural Information Processing Systems, pp. 305-312, 2007.

Cortes, C., Mohri, M., Pechyony, D., and Rastogi, A. Stability of transductive regression algorithms. In Proceedings of the 25th International Conference on Machine Learning, pp. 176-183, 2008.

Diamond, S. and Boyd, S. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1-5, 2016.

Dobriban, E. and Wager, S. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247-279, 2018.

Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Hoerl, A. E. and Kennard, R. W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55-67, 1970.

Hsu, D., Kakade, S. M., and Zhang, T. Random design analysis of ridge regression. In Conference on Learning Theory, 2012.

Javanmard, A. and Montanari, A. Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869-2909, 2014.

Mackey, L., Syrgkanis, V., and Zadik, I. Orthogonal machine learning: Power and limitations. arXiv preprint arXiv:1711.00342, 2017.

Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan, M. I., et al. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 561-577, 2018.

Raskutti, G., Wainwright, M. J., and Yu, B.
Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Transactions on Information Theory, 57(10):6976-6994, 2011.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267-288, 1996.

van de Geer, S. Statistical theory for high-dimensional models, 2014a.

van de Geer, S. Statistical theory for high-dimensional models. arXiv preprint arXiv:1409.8557, 2014b.

Van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166-1202, 2014.

Wager, S., Du, W., Taylor, J., and Tibshirani, R. J. High-dimensional regression adjustments in randomized experiments. Proceedings of the National Academy of Sciences, 113(45):12673-12678, 2016.

Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. 2019.

Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49-67, 2006.

Zhang, C.-H. and Zhang, S. S. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217-242, 2014.

Zhu, X. J. Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2005.

Zhu, Y. and Bradic, J. Linear hypothesis testing in dense high-dimensional linear models. Journal of the American Statistical Association, 113(524):1583-1600, 2018.