# Accurate Inference for Adaptive Linear Models

Yash Deshpande¹, Lester Mackey², Vasilis Syrgkanis², Matt Taddy²,³

¹Department of Mathematics, MIT. ²Microsoft Research New England. ³Booth School of Business, University of Chicago. Correspondence to: Yash Deshpande. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Estimators computed from adaptively collected data do not behave like their non-adaptive brethren. Rather, the sequential dependence of the collection policy can lead to severe distributional biases that persist even in the infinite data limit. We develop a general method, W-decorrelation, for transforming the bias of adaptive linear regression estimators into variance. The method uses only coarse-grained information about the data collection policy and does not need access to propensity scores or exact knowledge of the policy. We bound the finite-sample bias and variance of the W-estimator and develop asymptotically correct confidence intervals based on a novel martingale central limit theorem. We then demonstrate the empirical benefits of the generic W-decorrelation procedure in two different adaptive data settings: the multi-armed bandit and the autoregressive time series.

## 1. Introduction

Consider a dataset of $n$ sample points $(y_i, x_i)_{i \le n}$, where $y_i$ represents an observed outcome and $x_i \in \mathbb{R}^p$ an associated vector of covariates. In the standard linear model, the outcomes and covariates are related through a parameter $\beta$:

$$y_i = \langle x_i, \beta \rangle + \varepsilon_i. \tag{1}$$

In this model, the noise term $\varepsilon_i$ represents inherent variation in the sample, i.e., the variation that is not captured by the model. Parametric models of the type (1) are a fundamental building block in many machine learning problems. A common additional assumption is that the covariate vector $x_i$ for a given data point $i$ is independent of the other sample outcomes $(y_j)_{j \ne i}$ and of the inherent variation $(\varepsilon_j)_{j \in [n]}$. This paper is motivated by experiments where the sample $(y_i, x_i)_{i \le n}$ is not completely randomized but rather adaptively chosen. By adaptive, we mean that the choice of the data point $(y_i, x_i)$ is guided by inferences on the past data $(y_j, x_j)_{j < i}$.

For the resulting W-decorrelated estimate $\hat\beta^d$, we obtain asymptotically valid confidence intervals:

**Corollary.** *Fix $\alpha > 0$. Suppose that the data collection process satisfies the assumptions of Theorems 4 and 6. Set $\lambda = \lambda(n)$ as in Theorem 4, and let $\hat\sigma$ be a consistent estimate of $\sigma$ as in Theorem 1. Define $Q = \hat\sigma^2 W_n W_n^{\mathsf{T}}$ and the interval*

$$I(a, \alpha) = \Big[\hat\beta^d_a - \sqrt{Q_{aa}}\,\Phi^{-1}(1 - \alpha/2),\ \hat\beta^d_a + \sqrt{Q_{aa}}\,\Phi^{-1}(1 - \alpha/2)\Big].$$

*Then $\limsup_{n} \mathbb{P}\{\beta_a \notin I(a, \alpha)\} \le \alpha$.*
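In practice, the corollary translates directly into a few lines of code. The following is a minimal sketch, assuming the decorrelated estimate `beta_d`, the decorrelation matrix `W` (of shape $p \times n$), and a consistent noise estimate `sigma_hat` have already been produced by the W-decorrelation procedure (defined earlier in the paper and not reproduced here); the function and variable names are illustrative, not the paper's.

```python
import numpy as np
from scipy.stats import norm


def w_confidence_interval(beta_d, W, sigma_hat, a, alpha=0.05):
    """Two-sided level-(1 - alpha) interval for coordinate a of beta."""
    Q = sigma_hat ** 2 * (W @ W.T)                     # Q = sigma_hat^2 W_n W_n^T
    half_width = np.sqrt(Q[a, a]) * norm.ppf(1 - alpha / 2)
    return beta_d[a] - half_width, beta_d[a] + half_width
```

For example, a two-sided 95% interval for the first coordinate would be `w_confidence_interval(beta_d, W, sigma_hat, 0, alpha=0.05)`.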
## 3. Related work

There is extensive work in statistics and econometrics on stochastic regression models (Wei, 1985; Lai, 1994; Chen et al., 1999; Heyde, 2008) and non-stationary time series (Shumway & Stoffer, 2006; Enders, 2008; Phillips & Perron, 1988). This line of work is analogous to Theorem 1 or restricted to specific time series models. We instead focus on the literature from sequential decision-making, policy learning, and causal inference that most closely resembles our work in terms of goals, techniques, and applicability. The seminal work of Lai and Robbins (Robbins, 1985; Lai & Robbins, 1985) has spurred a vast literature on multi-armed bandit problems and sequential experiments that propose allocation algorithms based on confidence bounds (see (Bubeck et al., 2012) and references therein). A variety of confidence bounds and corresponding rules have been proposed (Auer, 2002; Dani et al., 2008; Rusmevichientong & Tsitsiklis, 2010; Abbasi-Yadkori et al., 2011; Jamieson et al., 2014) based on martingale concentration and the law of the iterated logarithm. While these results can certainly be used to compute valid confidence intervals, they are conservative for a few reasons. First, they do not explicitly account for the bias of OLS estimates and, correspondingly, must be wider to account for it. Second, obtaining optimal constants in the concentration inequalities can require sophisticated tools even for non-adaptive data (Ledoux, 1996; 2005). This is evidenced in all of our experiments, which show that concentration inequalities yield valid but conservative intervals.

A closely related line of work is that of learning from logged data (Li et al., 2011; Dudík et al., 2011; Swaminathan & Joachims, 2015) and policy learning (Athey & Wager, 2017; Kallus, 2017). The focus there is efficiently estimating the reward (or value) of a test policy using data collected from a different policy. For linear models, this reduces to accurate prediction, which is directly related to the estimation error on the parameters $\beta$. While our work shares some features, we focus on unbiased estimation of the parameters and on obtaining accurate confidence intervals for linear functions of the parameters. Some of the work on learning from logged data also builds on propensity scores and their estimation (Imbens, 2000; Lunceford & Davidian, 2004).

Villar et al. (2015) empirically demonstrate the presence of bias for a number of multi-armed bandit algorithms. Recent work by Dimakopoulou et al. (2017) shows a similar effect in contextual bandits. Along with a result on the sign of the bias, Nie et al. (2017) propose conditional likelihood optimization methods to estimate parameters of the linear model. Through the lens of selective inference, they also propose methods to randomize the data collection process that simultaneously lower bias and reduce the MSE. Their techniques rely on considerable information about (and control over) the data generating process, in particular the probabilities of choosing a specific action at each point in the data selection. This can be viewed as lying on the opposite end of the spectrum from our work, which attempts to use only the data at hand, along with coarse aggregate information on the exploration inherent in the data generating process. It is an interesting, and open, direction to consider approaches that combine the strengths of our approach and that of Nie et al. (2017).

## 4. Experiments

In this section we empirically validate the decorrelated estimators in two scenarios that involve sequential dependence in the covariates. The first scenario is a simple multi-armed bandit experiment; the second is autoregressive time series data. In each case, we compare the empirical coverage and typical widths of confidence intervals for the parameters obtained via three methods: (i) classical OLS theory, (ii) concentration inequalities, and (iii) decorrelated estimates. Jupyter notebooks reproducing our experiments are available on the first author's GitHub (Deshpande et al., 2018).
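Throughout, "empirical coverage" and "typical width" are simple Monte Carlo summaries. A minimal sketch of how such summaries can be tabulated, assuming one interval per trial for a fixed coordinate of $\beta$ (the helper name is ours, not from the paper's notebooks):

```python
import numpy as np


def coverage_and_width(intervals, beta_true):
    """intervals: (n_trials, 2) array of [lower, upper] endpoints for one
    coordinate of beta; returns empirical coverage and mean width."""
    lo, hi = intervals[:, 0], intervals[:, 1]
    coverage = np.mean((lo <= beta_true) & (beta_true <= hi))
    return coverage, np.mean(hi - lo)
```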
### 4.1. Multi-armed bandits

In this section, we demonstrate the utility of the W-estimator in a stochastic multi-armed bandit setting. Villar et al. (2015) studied this problem in the context of patient allocation in clinical trials. Here the trial proceeds in a sequential fashion, with the $i$th patient given one of $p$ treatments, encoded as $x_i = e_a$ with $a \in [p]$, and $y_i$ denoting the observed outcome. We model the outcome as $y_i = \langle x_i, \beta \rangle + \varepsilon_i$, where $\varepsilon_i \sim \mathrm{Unif}([-1, 1])$ and $\beta$ is the vector of mean outcomes of the $p$ treatments.

We sequentially assign one of $p = 2$ treatments to each of $n = 444$ patients using one of three policies: (i) an $\varepsilon$-greedy policy (called ECB, for Epsilon Current Belief), (ii) a practical UCB strategy based on the law of the iterated logarithm (UCB) (Jamieson et al., 2014), and (iii) Thompson sampling (TS) (Thompson, 1933). The ECB and TS sampling strategies are Bayesian: they place an independent Gaussian prior (with mean $\mu_0 = 0.3$ and variance $\sigma_0^2 = 0.33$) on each unknown mean outcome parameter, the true means being $\beta = (0.3, 0.31)$, and form an updated posterior belief concerning $\beta$ following each treatment administration $x_i$ and observation $y_i$. For ECB, the treatment administered to patient $i$ is, with probability $1 - \varepsilon = 0.9$, the treatment with the largest posterior mean; with probability $\varepsilon = 0.1$, a uniformly random treatment is administered instead, to ensure sufficient exploration of all treatments. Note that this strategy satisfies condition (6) with $\mu_n(i) = \varepsilon/p$. For TS, at each patient $i$, a sample $\hat\beta$ of the mean treatment effects is drawn from the posterior belief, and the treatment assigned to patient $i$ is the one maximizing the sampled mean effect, i.e., $a^*(i) = \arg\max_{a \in [p]} \hat\beta_a$. In UCB, the algorithm maintains a score for each arm $a \in [p]$ that combines the mean reward the arm has achieved with the empirical uncertainty of that reward. For each patient $i$, the UCB algorithm chooses the arm maximizing this score and updates the score according to a fixed rule; for details of the specific implementation, see Jamieson et al. (2014).

Our goal is to produce confidence intervals for the mean effect $\beta_a$ of each treatment based on the data adaptively collected from these standard bandit algorithms. We repeat this simulation 4000 times. From each trial simulation, we estimate the parameters $\beta$ using both OLS and the W-estimator with $\lambda = \hat\lambda_{10\%,\pi}$, the 10th percentile of $\lambda_{\min}(n)$ achieved by the policy $\pi \in \{\text{ECB}, \text{UCB}, \text{TS}\}$. This choice is guided by Corollary 4.
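To make the data-collection loop concrete, here is a hedged sketch of the ECB policy and of the $\lambda$ choice, under the stated prior. The conjugate Gaussian posterior update treats the $\mathrm{Unif}([-1,1])$ noise as Gaussian with the matching variance $1/3$, the trial count is reduced from the paper's 4000 for brevity, and all helper names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, eps = 2, 444, 0.1                        # two arms, n patients, exploration rate
beta = np.array([0.3, 0.31])                   # true mean outcomes
mu0, var0, noise_var = 0.3, 0.33, 1.0 / 3.0    # prior mean/variance; Unif([-1,1]) variance


def run_ecb_trial():
    """One ECB trial; returns per-arm pull counts (the diagonal of X^T X)."""
    mu, prec = np.full(p, mu0), np.full(p, 1.0 / var0)
    counts = np.zeros(p)
    for _ in range(n):
        # Exploit the posterior mean w.p. 1 - eps; explore uniformly w.p. eps.
        a = int(np.argmax(mu)) if rng.random() > eps else int(rng.integers(p))
        y = beta[a] + rng.uniform(-1.0, 1.0)
        # Conjugate Gaussian posterior update for arm a.
        prec[a] += 1.0 / noise_var
        mu[a] += (y - mu[a]) / (noise_var * prec[a])
        counts[a] += 1
    return counts


# With x_i = e_a, X^T X is diagonal, so lambda_min(n) is the smaller arm count;
# lambda is set to its 10th percentile across trials.
lam_min = [run_ecb_trial().min() for _ in range(200)]
lam = np.percentile(lam_min, 10)
```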
We compare the quality of W-decorrelated estimator confidence regions, OLS Gaussian confidence regions (OLSgsn), and OLS-based concentration-inequality regions (OLSconc) (Abbasi-Yadkori et al., 2011, Sec. 4).

*Figure 2. Left: One-sided confidence region coverage for OLS and W-decorrelated estimator trials. Right: Quantile-quantile (QQ) plots and empirical excess kurtosis (inset) for the OLS and W-decorrelated estimator errors.*

Figure 2 (left column) shows that the OLS Gaussian intervals have lower tail regions that typically overestimate coverage and upper tail regions that typically underestimate coverage. This is consistent with the observation that the sample means are biased negatively (Nie et al., 2017). The concentration OLS tail bounds are all conservative, producing nearly 100% coverage irrespective of the nominal level. Meanwhile, the decorrelated intervals improve coverage uniformly over the OLS intervals, often achieving the nominal coverage. Figure 2 (right column) shows the QQ plots of the OLS and W-estimator errors for each parameter $\beta_a$. The distribution of OLS errors is distinctly non-Gaussian, with considerable excess kurtosis for every policy. Conversely, for the W-estimator the excess kurtosis is reduced by an order of magnitude for every policy (0 for ECB and UCB).

*Figure 3. Mean 2-sided confidence interval widths (error bars show 1 standard deviation) for the 2 arms in the MAB experiment.*

Figure 3 summarizes the distribution of 2-sided interval widths produced by each method for each arm. As expected, the W-decorrelation intervals are wider than those of OLSgsn but compare favorably with those provided by OLSconc. For UCB, and for arm 0 under all policies, the mean OLSconc widths are the largest. For arm 1 under the ECB and TS policies, W-decorrelation yields smaller intervals than OLSconc at moderate confidence levels and larger ones at high confidence levels.

### 4.2. Autoregressive time series

In this section, we consider the classical AR($p$) model, in which $y_i = \sum_{\ell \le p} \beta_\ell y_{i-\ell} + \varepsilon_i$. We generate data from the model with parameters $p = 2$, $n = 50$, $\beta = (0.95, 0.2)$, $y_0 = 0$, and $\varepsilon_i \sim \mathrm{Unif}([-1, 1])$; all estimates are computed over 4000 Monte Carlo iterations.

*Figure 4. Lower (top left) and upper (top right) coverage probabilities for OLS with Gaussian intervals, OLS with concentration-inequality intervals, and W-decorrelated estimate intervals. QQ plot with kurtosis inset (bottom right) of errors in the OLS and W-decorrelated estimates. Mean confidence widths (bottom left) for OLS, concentration, and W-decorrelated estimates. Error bars show one standard deviation.*

We plot the empirical coverage for various nominal levels in the top panels of Figure 4. The QQ plot of the error distributions in the bottom right panel of Figure 4 shows that the OLS errors are skewed downwards, while the W-estimate errors are nearly Gaussian. We obtain the following improvements over the comparison methods, OLS standard errors (OLSgsn) and concentration-inequality widths (OLSconc) (Abbasi-Yadkori et al., 2011): the Gaussian OLS confidence regions systematically give incorrect empirical coverage, and the concentration inequalities provide very conservative intervals, with nearly 100% coverage irrespective of the nominal level. In contrast, our decorrelated intervals achieve empirical coverage that closely approximates the nominal levels. These coverage improvements are enabled by an increase in width over that of OLSgsn, but the W-estimate widths are systematically smaller than those of the concentration inequalities.
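The data-generating process for this experiment is simple enough to state in a few lines. Below is a hedged sketch of the simulation and the OLS fit, taking the parameter values as printed in the text, setting both initial lags to zero (the text only specifies $y_0 = 0$), and omitting the subsequent W-decorrelation step; names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 2, 50
beta = np.array([0.95, 0.2])          # AR(2) coefficients as printed in the text


def simulate_ar(n, beta, rng):
    """Generate y_1..y_n from y_i = sum_l beta_l * y_{i-l} + eps_i."""
    y = np.zeros(n + p)               # first p entries are the zero initial lags
    for i in range(p, n + p):
        y[i] = beta @ y[i - p:i][::-1] + rng.uniform(-1.0, 1.0)
    return y


y = simulate_ar(n, beta, rng)
# Lag matrix: row i holds (y_{i-1}, ..., y_{i-p}); OLS regresses y_i on it.
X = np.column_stack([y[p - k - 1 : n + p - k - 1] for k in range(p)])
beta_ols, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
```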
## Acknowledgements

Y.D. would like to thank Adel Javanmard for feedback on an earlier draft of this paper.

## References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Online least squares estimation with self-normalized processes: An application to bandit problems. arXiv preprint arXiv:1102.2670, 2011.

Athey, S. and Wager, S. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.

Audibert, J.-Y. and Bubeck, S. Minimax policies for adversarial and stochastic bandits. In COLT, pp. 217–226, 2009.

Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

Bertsekas, D. P. Incremental proximal methods for large scale convex optimization. Mathematical Programming, 129(2):163, 2011.

Billingsley, P. Probability and Measure. John Wiley & Sons, 2008.

Bubeck, S., Cesa-Bianchi, N., et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

Castro, R. M. and Nowak, R. D. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.

Chan, N. H. and Wei, C.-Z. Asymptotic inference for nearly nonstationary AR(1) processes. The Annals of Statistics, pp. 1050–1063, 1987.

Chen, K., Hu, I., Ying, Z., et al. Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs. The Annals of Statistics, 27(4):1155–1163, 1999.

Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. In COLT, pp. 355–366, 2008.

Deshpande, Y. and Montanari, A. Linear bandits in high dimension and recommendation systems. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pp. 1750–1754. IEEE, 2012.

Deshpande, Y., Mackey, L., Syrgkanis, V., and Taddy, M. Accurate inference for adaptive linear models, 2018. URL https://bit.ly/2mJxULg.

Dimakopoulou, M., Athey, S., and Imbens, G. Estimation considerations in contextual bandits. arXiv preprint arXiv:1711.07077, 2017.

Dudík, M., Langford, J., and Li, L. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.

Dvoretzky, A. Asymptotic normality for sums of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., volume 2, pp. 513–535, 1972.

Enders, W. Applied Econometric Time Series. John Wiley & Sons, 2008.

Garivier, A. and Cappé, O. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 359–376, 2011.

Heyde, C. C. Quasi-likelihood and Its Application: A General Approach to Optimal Parameter Estimation. Springer Science & Business Media, 2008.

Imbens, G. W. The role of the propensity score in estimating dose-response functions. Biometrika, 87(3):706–710, 2000.

Jamieson, K., Malloy, M., Nowak, R., and Bubeck, S. lil'UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pp. 423–439, 2014.

Javanmard, A. and Montanari, A. Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15(1):2869–2909, 2014a.

Javanmard, A. and Montanari, A. Hypothesis testing in high-dimensional regression under the Gaussian random design model: Asymptotic theory. IEEE Transactions on Information Theory, 60(10):6522–6554, 2014b.

Kallus, N. Balanced policy evaluation and learning. arXiv preprint arXiv:1705.07384, 2017.

Kulis, B. and Bartlett, P. L. Implicit online learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 575–582, 2010.

Lai, T. and Siegmund, D. Fixed accuracy estimation of an autoregressive parameter. The Annals of Statistics, pp. 478–485, 1983.

Lai, T. L. Asymptotic properties of nonlinear least squares estimates in stochastic regression models. The Annals of Statistics, pp. 1917–1930, 1994.

Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Lai, T. L. and Wei, C. Z. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, pp. 154–166, 1982.

Ledoux, M. Isoperimetry and Gaussian Analysis, volume 1648. Springer, Providence, 1996.

Ledoux, M. The Concentration of Measure Phenomenon. Number 89. American Mathematical Society, 2005.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670. ACM, 2010.
Li, L., Chu, W., Langford, J., and Wang, X. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 297–306. ACM, 2011.

Lunceford, J. K. and Davidian, M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine, 23(19):2937–2960, 2004.

Nagumo, J.-I. and Noda, A. A learning method for system identification. IEEE Transactions on Automatic Control, 12(3):282–287, 1967.

Nie, X., Tian, X., Taylor, J., and Zou, J. Why adaptively collected data have negative bias and how to correct for it. 2017.

Phillips, P. C. and Perron, P. Testing for a unit root in time series regression. Biometrika, 75(2):335–346, 1988.

Robbins, H. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pp. 169–177. Springer, 1985.

Rusmevichientong, P. and Tsitsiklis, J. N. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

Russo, D. Simple Bayesian algorithms for best arm identification. In Conference on Learning Theory, pp. 1417–1418, 2016.

Shumway, R. H. and Stoffer, D. S. Time Series Analysis and Its Applications: With R Examples. Springer Science & Business Media, 2006.

Swaminathan, A. and Joachims, T. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16:1731–1755, 2015.

Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

Van de Geer, S., Bühlmann, P., Ritov, Y., Dezeure, R., et al. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.

Villar, S., Bowden, J., and Wason, J. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 30(2):199, 2015.

Wei, C.-Z. Asymptotic properties of least-squares estimates in stochastic regression models. The Annals of Statistics, pp. 1498–1508, 1985.

Zhang, C.-H. and Zhang, S. S. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242, 2014.