# Online A-Optimal Design and Active Linear Regression

Xavier Fontaine¹, Pierre Perrault², Michal Valko³, Vianney Perchet⁴⁵

We consider in this paper the problem of optimal experiment design, where a decision maker can choose which points to sample to obtain an estimate $\hat{\beta}$ of the hidden parameter $\beta^\star$ of an underlying linear model. The key challenge of this work lies in the heteroscedasticity assumption that we make, meaning that each covariate has a different and unknown variance. The goal of the decision maker is then to figure out on the fly the optimal way to allocate the total budget of $T$ samples between covariates, as sampling a specific one several times will reduce the variance of the estimated model around it (but at the cost of a possibly higher variance elsewhere). By minimizing the $\ell^2$-loss $\mathbb{E}[\|\hat{\beta}-\beta^\star\|^2]$ the decision maker is actually minimizing the trace of the covariance matrix of the problem, which corresponds to online A-optimal design. Combining techniques from the bandit and convex optimization literatures, we propose a new active sampling algorithm and compare it with existing ones. We provide theoretical guarantees for this algorithm in different settings, including a $O(T^{-2})$ regret bound in the case where the covariates form a basis of the feature space, generalizing and improving existing results. Numerical experiments validate our theoretical findings.

1. Introduction and related work

A classical problem in statistics consists in estimating an unknown quantity, for example the mean of a random variable, the parameters of a model, poll results or the efficiency of a medical treatment. In order to do that, statisticians usually build estimators, which are random variables based on the data that are supposed to approximate the quantity to estimate.
¹Centre Borelli, ENS Paris-Saclay, Palaiseau, France ²Idemia, Courbevoie, France ³Google DeepMind, Paris, France ⁴CREST, ENSAE, Palaiseau, France ⁵Criteo AI Lab, Paris, France. Correspondence to: Xavier Fontaine. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

A way to construct an estimator is to run experiments and to gather data on the estimand. In the polling context, an experiment consists for example in interviewing people in order to know their voting intentions. However, if one wants to obtain a good estimator, typically an unbiased estimator with low variance, the choice of which experiments to run has to be made carefully: interviewing similar people might indeed lead to a poor prediction. In this work we are interested in the problem of optimal design of experiments, which consists in choosing the experiments to run so as to obtain an estimator with small variance. We focus here on the case of heteroscedastic linear models, with the goal of actively constructing the design matrix. Linear models, though sometimes too simple, have been widely studied and used in practice due to their interpretability, and they can be a good first approximation of a complex problem. The original motivation of this problem comes from use cases where obtaining the label of a sample is costly, so that choosing carefully which points to sample in a regression task is crucial. Consider for example the problem of controlling the wear of manufacturing machines in a factory (Antos et al., 2010), which requires a long and manual process. The wear can be modeled as a linear function of some features of the machine (age, number of times it has been used, average temperature, etc.), so that two machines with the same parameters will have similar wear.
Since the inspection process is manual and complicated, results are noisy, and this noise depends on the machine: a new, slightly worn machine will often be in a good state, while the state of heavily worn machines can vary a lot. Evaluating the linear model for the wear thus requires additional examinations of some machines and fewer inspections of others. Another motivating example comes from econometrics, typically in income forecasting. It is usually assumed that the annual income is influenced by the individual's education level, age, gender, occupation, etc., through a linear model. Polling is also an issue in this context: what kind of individual should one poll to gain as much information as possible about an explanatory variable? The field of optimal experiment design (Pukelsheim, 2006) aims precisely at choosing which experiments to perform in order to minimize an objective function within a budget constraint. In experiment design, the distance of the produced hypothesis to the true one is measured through the covariance matrix of the error (Boyd & Vandenberghe, 2004). There are several criteria that can be used to minimize a covariance matrix, the most popular being A-, D- and E-optimality. In this paper we focus on A-optimal design, whose goal is to minimize the trace of the covariance matrix. Contrary to several existing works which solve the A-optimal design problem offline in the homoscedastic setting (Sagnol, 2010; Yang et al., 2013; Gao et al., 2014), we are interested here in proposing an algorithm which solves this problem sequentially, with the additional challenge that each experiment has an unknown and different variance. Our problem is therefore close to active learning, which is increasingly popular nowadays because of the exponential growth of datasets and the cost of labeling data. Indeed, the latter may be tedious and require expert knowledge, as in the domain of medical imaging.
It is therefore essential to choose wisely which data to collect and label, based on the information gathered so far. Usually, machine learning agents are assumed to be passive, in the sense that the data is seen as a fixed and given input that cannot be modified or optimized. In many cases, however, the agent may be able to select the data appropriately (Goos & Jones, 2011). Active learning specifically studies the optimal ways to perform data selection (Cohn et al., 1996). This is crucial, as one of the current limiting factors of machine learning algorithms is computing cost, which can be reduced since all examples in a dataset do not have equal importance (Freund et al., 1997). This approach has many practical applications: in online marketing, where one wants to estimate the potential impact of new products on customers, or in online polling, where the different options do not have the same variance (Atkeson & Alvarez, 2018). In this paper we therefore consider a decision maker who has a limited experimental budget and aims at learning some latent linear model. The goal is to build a predictor $\hat\beta$ that estimates the unknown parameter $\beta^\star$ of the linear model and minimizes $\mathbb{E}[\|\hat\beta - \beta^\star\|^2]$. The key point here is that the design matrix is constructed sequentially and actively by the agent: at each time step, the decision maker chooses a covariate $X_k \in \mathbb{R}^d$ and receives a noisy output $X_k^\top\beta^\star + \varepsilon$. The quality of the predictor is measured through its variance. The agent will repeatedly query the different available covariates in order to obtain more precise estimates of their values. Instinctively, a covariate with small variance should not be sampled too often, since its value is already quite precise; on the other hand, a noisy covariate will be sampled more often. The major issue lies in the heteroscedasticity assumption: the unknown variances must be learned in order to sample the points wisely. Antos et al.
(2010) introduced a specific variant of our setting where the environment providing the data is assumed to be stochastic and i.i.d. across rounds. More precisely, they studied this problem within the framework of stochastic multi-armed bandits (MAB), considering a set of $K$ probability distributions (or arms) associated with $K$ variances. Their objective is to define an allocation strategy over the arms to estimate their expected values uniformly well. Later, the analysis and results were improved by Carpentier et al. (2011). However, this line of work focuses on the case where the covariates are the vectors of the canonical basis of $\mathbb{R}^d$, which gives a simpler, closed-form linear regression problem. There have been some recent works on MAB with heteroscedastic noise (Cowan et al., 2015; Kirschner & Krause, 2018) with natural connections to this paper; indeed, covariates can somehow be interpreted as contexts in contextual bandits. The most related setting might be that of Soare (2015). However, they are mostly concerned with best-arm identification, while recovering the latent parameter $\beta^\star$ of the linear model is a more challenging task (as each decision has an impact on the loss). In that sense we improve the results of Soare (2015) by proving a bound on the regret of our algorithm. Other works, such as Chen & Price (2019), propose active learning algorithms aiming at finding a constant-factor approximation of the classification loss, while we focus on the statistical problem of recovering $\beta^\star$. Yet another similar setting was introduced in (Riquelme et al., 2017a), where the agent has to estimate several linear models in parallel and, for each covariate (which appears randomly), has to decide which model to estimate.
Other works have studied the problem of active linear regression. For example, Sugiyama & Rubens (2008) proposed an algorithm conducting active learning and model selection simultaneously, but without theoretical guarantees. More recently, Riquelme et al. (2017b) studied active linear regression with thresholding techniques in the homoscedastic case. An active line of research has also been conducted in the domain of random-design linear regression (Hsu et al., 2011; Sabato & Munos, 2014; Dereziński et al., 2019). In these works the authors aim at controlling the mean-squared regression error $\mathbb{E}[(X^\top\beta - Y)^2]$ with a minimum number of random samples $X_k$. Apart from the loss function they consider, these works differ from ours in several points: they generally do not consider the heteroscedastic case, and their goal is to minimize the number of samples needed to reach an $\varepsilon$-estimator, while in our setting the total number of covariates $K$ is fixed. Allen-Zhu et al. (2020) provide a similar analysis, but from the viewpoint of optimal experiment design. Another setting similar to ours is introduced in (Hazan & Karnin, 2014), where active linear regression with a hard-margin criterion is studied. However, minimizing the classical $\ell^2$-norm of the difference between the true parameter of the linear model and its estimator seems to be a more natural criterion, which justifies our investigations. In this work we adopt a different point of view from the aforementioned existing works: we consider A-optimal design under the heteroscedasticity assumption, and we generalize MAB results to the non-coordinate-basis setting with two different algorithms taking inspiration from the convex optimization and bandit literatures. We prove optimal $\tilde{O}(T^{-2})$ regret bounds for $d$ covariates and provide a weaker guarantee for more than $d$ covariates.
Our work emphasizes the connection between MAB and optimal design, closing open questions in A-optimal design. Finally, we corroborate our theoretical findings with numerical experiments.

2. Setting and description of the problem

2.1. Motivations and description of the setting

Let $X_1,\dots,X_K \in \mathbb{R}^d$ be $K$ covariates available to an agent, who can successively sample each of them (several times if needed). Observations $Y$ are generated by a standard linear model, i.e., $Y = X^\top\beta^\star + \varepsilon$ with $\beta^\star \in \mathbb{R}^d$. Each of these covariates corresponds to an experiment that can be run by the decision maker to gain information about the unknown vector $\beta^\star$. The goal of optimal experiment design is to choose the experiments to perform from a pool of possible design points $\{X_1,\dots,X_K\}$ in order to obtain the best estimate $\hat\beta$ of $\beta^\star$ within a fixed budget of $T \in \mathbb{N}$ samples. In classical experiment design problems the variances of the different experiments are supposed to be equal. Here we consider the more challenging setting where each covariate has a specific and unknown variance $\sigma_k^2$: we suppose that when $X_k$ is queried for the $i$-th time, the decision maker observes
$$Y_k^{(i)} = X_k^\top \beta^\star + \varepsilon_k^{(i)}, \quad \text{where } \mathbb{E}[\varepsilon_k^{(i)}] = 0,\ \mathrm{Var}[\varepsilon_k^{(i)}] = \sigma_k^2 > 0,$$
and $\varepsilon_k^{(i)}$ is $\kappa^2$-subgaussian. We also assume that the $\varepsilon_k^{(i)}$ are independent of each other. This setting corresponds to online optimal experiment design, since the decision maker has to design the sampling policy sequentially, in an adaptive manner. A naive sampling strategy is to sample each covariate $X_k$ equally. In our heteroscedastic setting, this will not produce the most precise estimate of $\beta^\star$ because of the different variances $\sigma_k^2$. Intuitively, a point $X_k$ with a low variance will provide very precise information on the value $X_k^\top\beta^\star$, while a point with a high variance will not give much information (up to the converse effect of the norm $\|X_k\|$). This indicates that a point with high variance should be sampled more often than a point with low variance.
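To make the observation model concrete, here is a minimal simulation sketch (a hypothetical instance with made-up covariates and noise levels, NumPy assumed): each query of a covariate returns its true response corrupted by covariate-specific noise, and averaging $n$ repeated queries shrinks the noise as $1/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)

d = K = 3
# Hypothetical problem instance: unit-norm covariates, hidden beta*, and
# per-covariate noise levels sigma_k (all values are illustrative).
X = rng.normal(size=(K, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalization: ||X_k||_2 = 1
beta_star = rng.normal(size=d)
sigma = np.array([0.5, 1.0, 2.0])

def sample(k, n):
    """Query covariate X_k n times: Y_k^(i) = X_k' beta* + eps, Var(eps) = sigma_k^2."""
    return X[k] @ beta_star + sigma[k] * rng.normal(size=n)

# Averaging n repeated queries estimates X_k' beta* with standard deviation
# sigma_k / sqrt(n), and the empirical std recovers sigma_k.
ys = sample(2, 100_000)
print(abs(ys.mean() - X[2] @ beta_star))
print(abs(ys.std() - sigma[2]))
```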
Since the variances $\sigma_k^2$ are unknown, we need at the same time to estimate the $\sigma_k^2$ (which might require many samples of $X_k$ to be precise) and to minimize the estimation error (which might require only a few samples of some covariate $X_k$). There is then a tradeoff between gathering information on the values of the $\sigma_k^2$ and using it to optimize the loss; the fact that this loss is global, and not cumulative, makes this exploration vs. exploitation tradeoff much more intricate than in standard multi-armed bandits. Usual algorithms handling global losses are rather slow (Agrawal & Devanur, 2014; Mannor et al., 2014) or dedicated to specific well-posed problems with closed-form losses (Antos et al., 2010; Carpentier et al., 2011). Our setting can be seen as an extension of the two aforementioned works, which aim at estimating the means of a set of $K$ distributions. Writing $\mu = (\mu_1,\dots,\mu_K)$ for the vector of means of those distributions and $X_i = e_i$ the $i$-th vector of the canonical basis of $\mathbb{R}^K$, we see (since $X_i^\top \mu = \mu_i$) that their objective is actually to estimate the parameter $\mu$ of a linear model. This setting is a particular case of ours, since the vectors $X_i$ form the canonical basis of $\mathbb{R}^K$.

2.2. Definition of the loss function

As mentioned before, the decision maker may sample the same design point $X_k$ several times in order to obtain a more precise estimate of its response $X_k^\top\beta^\star$. We therefore denote by $T_k \geq 0$ the number of samples of $X_k$, so that $T = \sum_{k=1}^K T_k$. For each $k \in [K]$ (where $[K] = \{1,\dots,K\}$), averaging the corresponding observations in the linear model yields
$$\frac{1}{T_k}\sum_{i=1}^{T_k} Y_k^{(i)} = X_k^\top \beta^\star + \frac{1}{T_k}\sum_{i=1}^{T_k} \varepsilon_k^{(i)}.$$
We define $\tilde{Y}_k = \sum_{i=1}^{T_k} Y_k^{(i)}/(\sigma_k\sqrt{T_k})$, $\tilde{X}_k = \sqrt{T_k}\,X_k/\sigma_k$ and $\tilde{\varepsilon}_k = \sum_{i=1}^{T_k}\varepsilon_k^{(i)}/(\sigma_k\sqrt{T_k})$, so that for all $k \in [K]$,
$$\tilde{Y}_k = \tilde{X}_k^\top \beta^\star + \tilde{\varepsilon}_k, \quad \text{where } \mathbb{E}[\tilde{\varepsilon}_k] = 0 \text{ and } \mathrm{Var}[\tilde{\varepsilon}_k] = 1.$$
We denote by $\tilde{X} = (\tilde{X}_1, \dots, \tilde{X}_K)^\top \in \mathbb{R}^{K\times d}$ the induced design matrix of the policy. Under the assumption that $\tilde{X}$ has full rank, the above Ordinary Least Squares (OLS) problem has an optimal unbiased estimator $\hat{\beta} = (\tilde{X}^\top \tilde{X})^{-1}\tilde{X}^\top \tilde{Y}$.
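The rescaling above can be checked numerically. The sketch below (an illustrative instance with canonical-basis covariates for readability, and noise levels assumed known for the purpose of the check) builds $\tilde{X}$ and $\tilde{Y}$ from the aggregated observations and recovers $\beta^\star$ by OLS.

```python
import numpy as np

rng = np.random.default_rng(1)
d = K = 3
X = np.eye(d)                           # hypothetical: canonical-basis covariates
beta_star = np.array([1.0, -2.0, 0.5])
sigma = np.array([0.5, 1.0, 2.0])       # noise levels (known here, for the check)
T_k = np.array([2_000, 4_000, 8_000])   # how often each covariate was queried

# Sum of the T_k noisy responses collected for each covariate.
Y_sum = np.array([T_k[k] * (X[k] @ beta_star)
                  + sigma[k] * rng.normal(size=T_k[k]).sum() for k in range(K)])

# Rescaled responses and rows, so every aggregated observation has unit variance:
# Ytil_k = sum_i Y_k^(i) / (sigma_k sqrt(T_k)),  Xtil_k = sqrt(T_k) X_k / sigma_k.
Y_til = Y_sum / (sigma * np.sqrt(T_k))
X_til = (np.sqrt(T_k) / sigma)[:, None] * X

beta_hat = np.linalg.solve(X_til.T @ X_til, X_til.T @ Y_til)   # OLS estimator
print(np.linalg.norm(beta_hat - beta_star))   # small for large T_k
```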
The overarching objective is to upper-bound $\mathbb{E}\|\hat\beta - \beta^\star\|^2$, which can be easily rewritten as follows:
$$\mathbb{E}\big[\|\hat{\beta}-\beta^\star\|^2\big] = \mathrm{Tr}\big((\tilde{X}^\top\tilde{X})^{-1}\big) = \frac{1}{T}\,\mathrm{Tr}\Big(\Big(\sum_{k=1}^K \frac{p_k}{\sigma_k^2}\, X_k X_k^\top\Big)^{-1}\Big),$$
where we have denoted, for every $k\in[K]$, $p_k = T_k/T$ the proportion of times the covariate $X_k$ has been sampled. By definition, $p = (p_1,\dots,p_K) \in \Delta_K$, the simplex of dimension $K-1$. We emphasize here that minimizing $\mathbb{E}\|\hat\beta-\beta^\star\|^2$ is equivalent to minimizing the trace of the inverse of the covariance matrix $\tilde{X}^\top\tilde{X}$, which corresponds to A-optimal design (Pukelsheim, 2006). Denote now by $\Omega(p)$ the following weighted covariance matrix:
$$\Omega(p) = \sum_{k=1}^K \frac{p_k}{\sigma_k^2}\, X_k X_k^\top = \frac{1}{T}\,\tilde{X}^\top\tilde{X}.$$
The objective is to minimize over $p \in \Delta_K$ the loss function $L(p) = \mathrm{Tr}\big(\Omega(p)^{-1}\big)$, with $L(p) = +\infty$ if $\Omega(p)$ is not invertible, so that
$$\mathbb{E}\big[\|\hat\beta-\beta^\star\|^2\big] = \frac{1}{T}\,\mathrm{Tr}\big(\Omega(p)^{-1}\big) = \frac{1}{T}\, L(p).$$
For the problem to be non-trivial, we require that the covariates span $\mathbb{R}^d$. If this is not the case, then there exists a direction along which one cannot get information about the parameter $\beta^\star$; the best algorithm we can compare against can then only estimate the projection of $\beta^\star$ on the subspace spanned by the covariates, and we can work in this subspace. The rest of this work is devoted to designing an algorithm minimizing $\mathrm{Tr}(\Omega(p)^{-1})$, with the difficulty that the variances $\sigma_k^2$ are unknown. In order to do that, we will sequentially and adaptively choose which point to sample so as to minimize $\mathrm{Tr}(\Omega(p)^{-1})$. This consequently corresponds to online A-optimal design. As developed above, the norms of the covariates play a scaling role, and the covariates can be renormalized to lie on the unit sphere at no cost, which is thus an assumption from now on: $\forall k \in [K],\ \|X_k\|_2 = 1$. The following proposition shows that the problem we are considering is convex.

Proposition 1. $L$ is strictly convex on $\Delta_d$ and continuous on its relative interior $\mathring{\Delta}_d$.

The proof is deferred to Appendix C. Proposition 1 implies that $L$ has a unique minimum $p^\star$ in $\Delta_d$, and we write $p^\star = \arg\min_{p\in\Delta_d} L(p)$.
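As a quick illustration, the loss $L(p) = \mathrm{Tr}(\Omega(p)^{-1})$ and its convexity (Proposition 1) can be checked numerically on a toy instance (a NumPy sketch with made-up covariates and variances):

```python
import numpy as np

rng = np.random.default_rng(2)
d = K = 3
# Illustrative instance: unit-norm covariates and made-up variances sigma_k^2.
X = rng.normal(size=(K, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
sigma2 = np.array([0.25, 1.0, 4.0])

def loss(p):
    """L(p) = Tr(Omega(p)^{-1}), with Omega(p) = sum_k (p_k/sigma_k^2) X_k X_k'."""
    Omega = (X.T * (p / sigma2)) @ X     # weighted covariance matrix
    return np.trace(np.linalg.inv(Omega))

# Strict convexity: the loss at a midpoint is below the average of the
# endpoint losses, for two distinct points of the simplex.
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.5, 0.4, 0.1])
print(loss((p + q) / 2) < (loss(p) + loss(q)) / 2)
```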
Finally, we evaluate the performance of a sampling policy in terms of regret, i.e., the difference in loss between the optimal sampling policy and the policy in question.

Definition 1. Let $p_T$ denote the sampling proportions after $T$ samples of a policy. Its regret is then
$$R(T) = \frac{1}{T}\big(L(p_T) - L(p^\star)\big).$$

We will construct active sampling algorithms that minimize $R(T)$. A key step is the following computation of the gradient of $L$. Since $\partial_{p_k}\Omega(p) = X_k X_k^\top/\sigma_k^2$, it follows that
$$\frac{\partial L}{\partial p_k}(p) = -\frac{1}{\sigma_k^2}\,\mathrm{Tr}\big(\Omega(p)^{-2} X_k X_k^\top\big) = -\frac{1}{\sigma_k^2}\big\|\Omega(p)^{-1} X_k\big\|_2^2.$$
As in several works (Hsu et al., 2011; Allen-Zhu et al., 2020), we will have to study different cases depending on the values of $K$ and $d$. The first one corresponds to the case $K \leq d$. As explained above, if $K < d$ the matrix $\Omega(p)$ is not invertible and it is impossible to obtain a sublinear regret, which makes us work in the subspace spanned by the covariates $X_k$; this reduces to the case $K = d$, which we treat in Sections 3 and 4. The case $K > d$ is considered in Section 5.

3. A naive randomized algorithm

We begin by proposing an obvious baseline for the problem at hand. One naive algorithm would be to estimate the variances of the covariates by sampling each of them a fixed number of times. Sampling each arm $cT$ times (with $c < 1/K$) would give an approximation $\hat\sigma_k$ of $\sigma_k$ of order $1/\sqrt{T}$. We could then use these values to construct an approximation $\hat\Omega(p)$ of $\Omega(p)$ and derive the optimal proportions $\hat{p}_k$ minimizing $\mathrm{Tr}(\hat\Omega(p)^{-1})$. Finally, the algorithm would use the remainder of the budget to sample the arms according to those proportions. However, such a trivial algorithm would not provide good regret guarantees. Indeed, the constant fraction $c$ of the samples used to estimate the variances has to be chosen carefully: it will lead to a $1/T$ regret if $c$ is too big (namely if $c > p_k^\star$ for some $k$). That is why we need to design an algorithm that first roughly estimates the $p_k^\star$.
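The closed-form gradient $\partial L/\partial p_k = -\|\Omega(p)^{-1}X_k\|_2^2/\sigma_k^2$ can be sanity-checked against finite differences on a random instance (an illustrative NumPy sketch; the covariates and variances below are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
d = K = 3
X = rng.normal(size=(K, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
sigma2 = np.array([0.25, 1.0, 4.0])

def Omega(p):
    return (X.T * (p / sigma2)) @ X

def loss(p):
    return np.trace(np.linalg.inv(Omega(p)))

def grad(p):
    """Closed form: dL/dp_k = -||Omega(p)^{-1} X_k||_2^2 / sigma_k^2."""
    Oinv = np.linalg.inv(Omega(p))
    return -np.sum((X @ Oinv) ** 2, axis=1) / sigma2

# Central finite differences agree with the closed form.
p = np.array([0.3, 0.3, 0.4])
eps = 1e-6
fd = np.array([(loss(p + eps * np.eye(K)[k]) - loss(p - eps * np.eye(K)[k]))
               / (2 * eps) for k in range(K)])
print(np.allclose(fd, grad(p), rtol=1e-4))
```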
In order to improve this algorithm, it is also useful to refine the estimates $\hat{p}_k$ at each iteration. Following these ideas we propose Algorithm 1, which uses a pre-sampling phase (see Lemma 3 for further details) and constructs at each iteration lower confidence estimates of the variances, providing an optimistic estimate $\underline{L}$ of the objective function $L$. The algorithm then minimizes this estimate (with an offline A-optimal design algorithm, see e.g., Gao et al. (2014)). Finally, the covariate $X_k$ is sampled with probability $\hat{p}_{t,k}$; feedback is then collected and the estimates are updated.

Proposition 2. For $T \geq 1$ samples, running Algorithm 1 with $N_i = p_i^o T/2$ (with $p^o$ defined by (2)) for all $i \in [K]$ gives final sampling proportions $p_T$ such that $R(T) = O_{\Gamma,\sigma_k}(\cdot)$, where $\Gamma$ is the Gram matrix of $X_1,\dots,X_K$.

The proof is postponed to Appendix D. Notice that we avoid the problem discussed by Erraqabi et al. (2017) (due to infinite gradients on the simplex boundary) thanks to pre-sampling, which allows us to have positive empirical variance estimates with high probability.

Algorithm 1 Naive randomized algorithm
Require: $d$, $T$, confidence parameter $\delta$
Require: $N_1, \dots, N_d$ of sum $N$
1: Sample each covariate $X_k$ $N_k$ times
2: $p_N \leftarrow (N_1/N, \dots, N_d/N)$
3: Compute the empirical variances $\hat\sigma_1^2, \dots, \hat\sigma_d^2$
4: for $N+1 \leq t \leq T$ do
5: Compute $\hat{p}_t \in \arg\min \underline{L}$, where $\underline{L}$ is the same function as $L$ but with the variances replaced by their lower confidence estimates (from Theorem 1)
6: Draw $\pi(t)$ randomly according to the probabilities $\hat{p}_t$ and sample covariate $X_{\pi(t)}$
7: Update $p_{t+1} = p_t + \frac{1}{t+1}(e_{\pi(t+1)} - p_t)$ and $\hat\sigma^2_{\pi(t)}$, where $(e_1,\dots,e_d)$ is the canonical basis of $\mathbb{R}^d$
8: end for

4.
A faster first-order algorithm

We now improve the relatively slow dependency on $T$ in the rates of Algorithm 1, which is due to its naive reduction to a MAB problem and to the fact that it does not use any estimate of the gradient of $L$. We adopt a different approach based on convex optimization techniques, which we can leverage to gain an order in the rates of convergence.

4.1. Description of the algorithm

The main algorithm is described in Algorithm 2 and is built following the work of Berthet & Perchet (2017). The idea is to sample the arm which minimizes a proxy of the corresponding coordinate of the gradient of $L$, corrected by an error term, as in the UCB algorithm (Auer et al., 2002).

Algorithm 2 Bandit algorithm
Require: $d$, $T$
Require: $N_1, \dots, N_d$ of sum $N$
1: Sample each covariate $X_k$ $N_k$ times
2: $p_N \leftarrow (N_1/N, \dots, N_d/N)$
3: Compute the empirical variances $\hat\sigma_1^2, \dots, \hat\sigma_d^2$
4: for $N+1 \leq t \leq T$ do
5: Compute $\nabla\hat{L}(p_t)$, where $\hat{L}$ is the same function as $L$ but with the variances replaced by the empirical variances
6: for $k \in [d]$ do
7: $\hat{g}_k \leftarrow \partial_k \hat{L}(p_t) - \alpha_{t,k}$, where $\alpha_{t,k}$ is a confidence width of order $1/\sqrt{T_k}$ (cf. Proposition 3)
8: end for
9: $\pi(t) \leftarrow \arg\min_{k\in[d]} \hat{g}_k$ and sample covariate $X_{\pi(t)}$
10: Update $p_{t+1} = p_t + \frac{1}{t+1}(e_{\pi(t+1)} - p_t)$ and update $\hat\sigma^2_{\pi(t)}$
11: end for

$N_1, \dots, N_d$ are the numbers of times each covariate is sampled at the beginning of the algorithm. This stage is needed to ensure that $L$ is smooth; more details are given with Lemma 3.

4.2. Concentration of the gradient of the loss

The cornerstone of the algorithm is to guarantee that the estimates of the gradient concentrate around their true values. To simplify notations, we denote by $G_k = \partial_{p_k} L(p)$ the true $k$-th derivative of $L$ and by $\hat{G}_k$ its estimate. More precisely, writing $\hat\Omega(p) = \sum_{k=1}^K (p_k/\hat\sigma_k^2)\, X_k X_k^\top$, we have
$$G_k = -\sigma_k^{-2}\big\|\Omega(p)^{-1}X_k\big\|_2^2, \qquad \hat{G}_k := -\hat\sigma_k^{-2}\big\|\hat\Omega(p)^{-1}X_k\big\|_2^2.$$
Since $\hat{G}_k$ depends on the $\hat\sigma_k^2$, we need a concentration bound on the empirical variances $\hat\sigma_k^2$.
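Algorithm 2's main loop can be sketched as follows. This is a minimal NumPy sketch on a hypothetical canonical-basis instance; the confidence width $2\sqrt{\log t / T_k}$ is an assumed stand-in for the paper's exact bonus (which involves constants from Proposition 3), in the spirit of UCB-Frank-Wolfe. For canonical-basis covariates the optimal proportions are $p_k^\star \propto \sigma_k$, which the sketch approximately recovers.

```python
import numpy as np

rng = np.random.default_rng(4)
d = K = 3
X = np.eye(d)                                  # hypothetical instance
beta_star = np.zeros(d)
sigma = np.array([0.5, 1.0, 2.0])              # unknown to the algorithm
T = 5_000

def grad_hat(p, s2):
    """Plug-in gradient estimate: -||Omega_hat^{-1} X_k||^2 / s2_k."""
    Oinv = np.linalg.inv((X.T * (p / s2)) @ X)
    return -np.sum((X @ Oinv) ** 2, axis=1) / s2

# Pre-sampling phase: a few pulls per arm so empirical variances are positive.
n = np.full(K, 50)
obs = [list(X[k] @ beta_star + sigma[k] * rng.normal(size=50)) for k in range(K)]

for t in range(int(n.sum()), T):
    s2_hat = np.array([np.var(o, ddof=1) for o in obs])
    p = n / n.sum()
    # Optimistic index: gradient estimate minus an assumed confidence bonus.
    g = grad_hat(p, s2_hat) - 2.0 * np.sqrt(np.log(t + 1) / n)
    k = int(np.argmin(g))                      # pull the most promising arm
    obs[k].append(X[k] @ beta_star + sigma[k] * rng.normal())
    n[k] += 1

p_T = n / T
p_star = sigma / sigma.sum()                   # canonical basis: p*_k ∝ sigma_k
print(np.abs(p_T - p_star).max())              # small: proportions track p*
```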
As traditional results on the concentration of empirical variances (Maurer & Pontil, 2009; Carpentier et al., 2011) are generally obtained in the bounded setting, we prove in Appendix A the following bound for subgaussian random variables.

Theorem 1. Let $X$ be a centered and $\kappa^2$-subgaussian random variable sampled $n \geq 2$ times. Let $\delta \in (0,1)$ and let $c = (e-1)(2e(2e-1))^{-1} \approx 0.07$. With probability at least $1-\delta$, the following concentration bound on its empirical variance holds:
$$\big|\hat\sigma_n^2 - \sigma^2\big| \leq 3\kappa^2 \max\left(\sqrt{\frac{\log(2/\delta)}{cn}},\ \frac{\log(2/\delta)}{cn}\right).$$

Using Theorem 1 we obtain the following concentration argument, which is the main ingredient of the analysis of Algorithm 2.

Proposition 3. For every $k \in [K]$, after having gathered $T_k \leq T$ samples of covariate $X_k$, there exists a constant $C > 0$ (explicit and given in the proof) such that, with probability at least $1-\delta$,
$$\big|G_k - \hat{G}_k\big| \leq C\,\sigma_k^{-1} \max_{i\in[K]} \frac{\sigma_i^2}{p_i}\,\sqrt{\frac{\log(2/\delta)}{T_k}}.$$

For clarity reasons we postpone the proof to Appendix B. Proving this proposition was one of the main technical challenges of our analysis. With it in hand, we can turn to the analysis of Algorithm 2.

4.3. Analysis of the convergence of the algorithm

In convex optimization, several classical assumptions can be leveraged to derive fast convergence rates: typically strong convexity, positive distance from the boundary of the constraint set, and smoothness of the objective function, i.e., Lipschitz continuity of its gradient. We prove in the following that the loss $L$ satisfies them, except smoothness, because its gradient explodes on the boundary of $\Delta_d$. However, $L$ is smooth on the relative interior of the simplex. Consequently, we circumvent this smoothness issue using a technique from (Fontaine et al., 2019), consisting in pre-sampling every arm a linear number of times in order to force $p$ to stay away from the boundary of $\Delta_d$. Using the notations $X_0 := (X_1^\top, \dots, X_d^\top)^\top$ and $\Gamma := X_0 X_0^\top = \mathrm{Gram}(X_1,\dots,X_d)$, we prove the following lemma in Appendix E.1.
Lemma 1. The loss function $L$ satisfies, for all $p \in \mathring{\Delta}_d$,
$$L(p) = \frac{1}{\det(X_0^\top X_0)} \sum_{k=1}^d \frac{\sigma_k^2}{p_k}\,\mathrm{Cof}(X_0 X_0^\top)_{kk}.$$

With this expression, the optimal proportions $p^\star$ can be easily computed using the KKT theorem, in the following closed form:
$$p_i^\star = \frac{\sigma_i \sqrt{\mathrm{Cof}(\Gamma)_{ii}}}{\sum_{k=1}^d \sigma_k \sqrt{\mathrm{Cof}(\Gamma)_{kk}}}. \qquad (1)$$
This yields that $L$ is $\mu$-strongly convex on $\Delta_d$, with $\mu = 2\det(\Gamma)^{-1}\min_i \mathrm{Cof}(\Gamma)_{ii}\,\sigma_i^2$. Moreover, this also implies that $p^\star$ is far away from the boundary of $\Delta_d$.

Lemma 2. Let $\eta := \mathrm{dist}(p^\star, \partial\Delta_d)$ be the distance from $p^\star$ to the boundary of the simplex. Then
$$\eta \geq \sqrt{\frac{K}{K-1}}\;\frac{\min_i \sigma_i\sqrt{\mathrm{Cof}(\Gamma)_{ii}}}{\sum_{k=1}^d \sigma_k\sqrt{\mathrm{Cof}(\Gamma)_{kk}}}.$$

Proof. This is immediate from (1), since $\eta = \sqrt{\frac{K}{K-1}}\,\min_i p_i^\star$.

It remains to recover the smoothness of $L$. This is done using a pre-sampling phase.

Lemma 3 (see (Fontaine et al., 2019)). If there exist $\alpha \in (0,1/2)$ and $p^o \in \Delta_d$ such that $p^\star \geq \alpha p^o$ (component-wise), then sampling arm $i$ at most $\alpha p_i^o T$ times (for all $i \in [d]$) at the beginning of the algorithm and then running Algorithm 2 is equivalent to running Algorithm 2 with budget $(1-\alpha)T$ on the smooth function $p \mapsto L(\alpha p^o + (1-\alpha)p)$.

We have proved that $p_k^\star$ is bounded away from $0$, so such a pre-sampling is possible. However, this requires some estimate of each $\sigma_k^2$. The upside is that those estimates only need to be accurate up to a multiplicative factor (and not an additive one), so that a logarithmic number of samples of each arm is enough to get valid lower/upper bounds (see Corollary 1). Indeed, the estimate $\bar\sigma_k^2$ obtained satisfies, for each $k \in [d]$, $\bar\sigma_k^2 \in [\sigma_k^2/2,\ 3\sigma_k^2/2]$. Consequently, we know that
$$\forall k \in [d], \quad p_k^\star \geq \frac{1}{\sqrt{3}}\, p_k^o, \quad \text{where} \quad p_k^o = \frac{\bar\sigma_k \sqrt{\mathrm{Cof}(\Gamma)_{kk}}}{\sum_{i=1}^d \bar\sigma_i \sqrt{\mathrm{Cof}(\Gamma)_{ii}}}. \qquad (2)$$
This lets us use Lemma 3: with a pre-sampling stage as prescribed, $p$ is forced to remain far away from the boundary of the simplex, in the sense that $p_{t,i} \geq p_i^o/2$ at each stage $t$ subsequent to the pre-sampling, for all $i \in [d]$. Consequently, this logarithmic phase of estimation plus the linear phase of pre-sampling ensures that in the remainder of the process, $L$ is actually smooth.

Lemma 4.
With the pre-sampling of Lemma 3, $L$ is smooth with constant $C_S$, where
$$C_S \leq 432\,\frac{\sigma_{\max}^2 \sum_{k=1}^d \sigma_k \sqrt{\mathrm{Cof}(\Gamma)_{kk}}}{\det(\Gamma)\,\sigma_{\min}^3 \sqrt{\min_k \mathrm{Cof}(\Gamma)_{kk}}}.$$

The proof is deferred to Appendix E.2. We can now state our main theorem, which is proved in Appendix E.3.

Theorem 2. Applying Algorithm 2 with $T \geq 1$ samples, after having pre-sampled each arm $k \in [d]$ at most $p_k^o T/2$ times, gives the following bound:²
$$R(T) = \tilde{O}_{\Gamma,\sigma_k}\big(T^{-2}\big).$$

This theorem provides a fast convergence rate for the regret $R$ and emphasizes the importance of using the gradient information in Algorithm 2 compared to Algorithm 1.

²The notation $O_{\Gamma,\sigma_k}$ means that there is a hidden constant depending on $\Gamma$ and on the $\sigma_k$. The explicit dependency on these parameters is given in the proof.

5. Discussion and generalization to K > d

We discuss in this section the case where the number $K$ of covariate vectors is greater than $d$.

5.1. Discussion of the case K > d

When $K > d$, the optimal $p^\star$ may lie on the boundary of the simplex $\Delta_K$, meaning that some arms should not be sampled at all. This happens for instance as soon as two covariate points are exactly equal but have different variances: the point with the lowest variance should be sampled, while the point with the highest one should not. The whole difficulty of an algorithm for the case $K > d$ is to be able to detect which covariates should be sampled and which should not. In order to adopt another point of view on this problem, it is interesting to go back to the field of optimal design of experiments. Indeed, setting $v_k = X_k/\sigma_k$, our problem consists exactly in the following constrained minimization problem, given $v_1,\dots,v_K \in \mathbb{R}^d$:
$$\min_{p}\; \mathrm{Tr}\Big(\Big(\sum_{j=1}^K p_j v_j v_j^\top\Big)^{-1}\Big) \quad \text{subject to} \quad p \in \Delta_K. \qquad (P)$$
It is known (Pukelsheim, 2006) that the dual problem of A-optimal design consists in finding the smallest ellipsoid, in some sense, containing all the points $v_j$: maximize over $W$ subject to $W \succeq 0$ and $v_j^\top W v_j \leq 1$ for all $1 \leq j \leq K$.
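For $K = d$, both the closed form (1) and the KKT picture behind the duality above can be verified numerically. The sketch below (random unit-norm covariates, illustrative $\sigma_k$) computes $p^\star$ via cofactors and checks two facts that follow from the KKT conditions at an interior optimum: all partial derivatives of $L$ coincide at $p^\star$, and the rescaled points $v_k = X_k/\sigma_k$ all lie on the ellipsoid $\{v : v^\top W v = 1\}$ with $W = \Omega(p^\star)^{-2}/L(p^\star)$; this specific normalization of $W$ is our reading of the geometry, stated here as an assumption.

```python
import numpy as np

rng = np.random.default_rng(5)
d = K = 3
X = rng.normal(size=(K, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
sigma = np.array([0.5, 1.0, 2.0])             # illustrative noise levels

Gamma = X @ X.T                                # Gram matrix of the covariates
cof = np.linalg.det(Gamma) * np.diag(np.linalg.inv(Gamma))   # Cof(Gamma)_kk
w = sigma * np.sqrt(cof)
p_star = w / w.sum()                           # closed form (1)

Omega = (X.T * (p_star / sigma**2)) @ X
Oinv = np.linalg.inv(Omega)
L_star = np.trace(Oinv)

# Check 1 (KKT): ||Omega^{-1} X_k||^2 / sigma_k^2 is the same for every k.
g = np.sum((X @ Oinv) ** 2, axis=1) / sigma**2
print(np.allclose(g, g[0]))

# Check 2: v_k = X_k/sigma_k lies ON the ellipsoid {v : v' W v = 1},
# with the assumed normalization W = Omega(p*)^{-2} / L(p*).
V = X / sigma[:, None]
W = Oinv @ Oinv / L_star
print(np.allclose([v @ W @ v for v in V], 1.0))
```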
In our case, the role of the ellipsoid can be easily seen through the KKT conditions. We obtain the following proposition, proved in Appendix G.1.

Proposition 4. The points $X_k/\sigma_k$ lie within the ellipsoid defined by the matrix $\Omega(p^\star)^{-2}$.

This geometric interpretation shows that a point $X_k$ with high variance is likely to be in the interior of the ellipsoid (because $X_k/\sigma_k$ is close to the origin), meaning that the associated multiplier $\mu_k$ is positive and therefore that $p_k^\star = 0$, i.e., that $X_k$ should not be sampled. Nevertheless, since the variances are unknown, one cannot easily find which points have to be sampled. Figures illustrating this geometric interpretation can be found in Appendix G.2.

5.2. A theoretical upper bound and a lower bound

We now derive a bound on the convergence rate of Algorithm 2 in the case where $K > d$.

Theorem 3. Applying Algorithm 2 with $K > d$ covariate points gives the following bound on the regret: $R(T) = O\big(\log(T)\, T^{-5/4}\big)$.

The proof is postponed to Appendix F.1. One can ask whether this result is optimal, and whether it is possible to reach the bound of Theorem 2. The following theorem provides a lower bound showing that this is impossible when there are more than $d$ covariates. However, the upper and lower bounds of Theorems 3 and 4 do not match, and it is still an open question whether rates better than $T^{-5/4}$ can be obtained.

Theorem 4. In the case where $K > d$, for any algorithm on our problem, there exists a set of parameters such that $R(T) \gtrsim T^{-3/2}$.

We prove Theorem 4 in Appendix F.2.

6. Numerical simulations

We now present numerical experiments to validate our results and claims. We compare several algorithms for active matrix design: a very naive algorithm that samples each covariate equally, Algorithm 1, Algorithm 2 and a Thompson Sampling (TS) algorithm (Thompson, 1933). We run our experiments on synthetic data with a time horizon $T$ between $10^4$ and $10^6$, averaging the results over 25 runs. We consider covariate vectors of unit norm in $\mathbb{R}^K$ for values of $K$ ranging from 3 to 100.
All the experiments ran in less than 15 minutes on a standard laptop.

Figure 1. Regret as a function of $T$ in log-log scale for $K=3$ covariates in $\mathbb{R}^3$ (fitted slopes: naive $-1.0$; Alg. 2 $-2.0$; TS $-2.0$; Alg. 1 $-1.9$).

Figure 2. Regret as a function of $T$ in log-log scale for $K=4$ covariates in $\mathbb{R}^3$ (fitted slopes: naive $-1.0$; Alg. 2 $-1.9$; TS $-1.9$).

Let us quickly describe the Thompson Sampling algorithm. We choose Normal-Inverse-Gamma distributions as priors for the mean and variance of each arm, as they are the conjugate priors for a Gaussian likelihood with unknown mean and variance. At each time step $t$, for each arm $k \in [K]$, a value of $\hat\sigma_k$ is sampled from the current posterior distribution. An approximate value of $\partial_k L(p)$ is computed with these $\hat\sigma_k$ values. The arm with the lowest gradient value is chosen and sampled, and the obtained observation updates the hyperparameters of its distribution. In our first experiment we consider only 3 covariate vectors. We plot the results in log-log scale, so that the convergence speed is given by the slope of the plot. Results in Figure 1 show that Algorithms 1 and 2, as well as Thompson Sampling, have regret $O(1/T^2)$, as expected.

Figure 3. Regret as a function of $T$ in log-log scale for $K=4$ covariates in $\mathbb{R}^3$ in a challenging setting (fitted slopes: Alg. 1 $-1.0$; Alg. 2 $-1.36$).

Figure 4. Regret of Algorithm 2 as a function of $T$ in log-log scale for different values of $K$ (fitted slopes: $K=5$: $-1.98$; $K=10$: $-2.11$; $K=20$: $-2.23$; $K=50$: $-2.15$; $K=100$: $-2.06$).

We see that Thompson Sampling performs well on low-dimensional data. However, it is approximately 200 times slower than Algorithm 2, due to the sampling of complex Normal-Inverse-Gamma distributions, and is therefore inefficient in practice. On the contrary, Algorithm 2 is very practical.
Indeed, its computational complexity is linear in the horizon T and its main computational cost comes from the computation of the gradient ∇L̂. This relies on inverting Ω̂ ∈ R^{d×d}, whose complexity is O(d^3) (or even O(d^{2.807}) with Strassen's algorithm). Thus the overall complexity of Algorithm 2 is O(T(d^{2.807} + K)), hence polynomial. This computational complexity shows that Algorithm 2 is practical for moderate values of d, as is common in linear regression problems.

Figure 1 shows that Algorithm 1 performs nearly as well as Algorithm 2. However, the minimization step of L̂ is time-consuming when K > d, since there is no closed form for p*, which leads to approximate results. Therefore Algorithm 1 is not adapted to the case K > d. We have also conducted similar experiments in this case, with K = d + 1. The offline solution of the problem indicates that one covariate should not be sampled, i.e., p*_K = 0. Results presented in Figure 2 demonstrate the performance of Algorithm 2.

One might argue that the positive results of Figure 2 are due to the fact that it is easy for the algorithm to detect that one covariate should not be sampled, in the sense that this covariate clearly lies in the interior of the ellipsoid mentioned in Section 5.1. In the very challenging case where two covariates are equal but with variances separated by only 1/√T, we obtain the results described on Figure 3. The observed experimental convergence rate is of order T^{-1.36}, which is much slower than the rates of Figure 2 and falls between the rates proved in Theorems 3 and 4.

Finally, we run a last experiment with larger values of K = d. We plot the convergence rate of Algorithm 2 for values of K ranging from 5 to 100 in log-log scale on Figure 4. The slope is again approximately 2, which is consistent with Theorem 2. We note furthermore that larger values of d do not make Algorithm 2 impracticable, thanks to its at most cubic complexity in d.

7.
Conclusion

We have proposed an algorithm mixing bandit and convex optimization techniques to solve the problem of online A-optimal design, which is related to active linear regression with repeated queries. This algorithm enjoys fast and optimal rates Õ(T^{-2}) in the case of d covariates that can be sampled in R^d. One cannot obtain such fast rates in the more general case of K > d covariates. We have therefore provided weaker results in this very challenging setting and conducted more experiments showing that the problem is indeed more difficult.

Acknowledgements

Xavier Fontaine was supported by grants from Région Île-de-France.

References

Agrawal, S. and Devanur, N. R. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, pp. 989–1006, 2014.

Allen-Zhu, Z., Li, Y., Singh, A., and Wang, Y. Near-optimal discrete optimization for experimental design: A regret minimization approach. Mathematical Programming, pp. 1–40, 2020.

Antos, A., Grover, V., and Szepesvári, C. Active learning in heteroscedastic noise. Theoretical Computer Science, 411(29-30):2712–2728, 2010.

Atkeson, L. R. and Alvarez, R. M. The Oxford Handbook of Polling and Survey Methods. Oxford University Press, 2018.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, May 2002. ISSN 0885-6125.

Berthet, Q. and Perchet, V. Fast rates for bandit optimization with upper-confidence Frank-Wolfe. In Advances in Neural Information Processing Systems, pp. 2225–2234, 2017.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Carpentier, A., Lazaric, A., Ghavamzadeh, M., Munos, R., and Auer, P. Upper-confidence-bound algorithms for active learning in multi-armed bandits. In International Conference on Algorithmic Learning Theory, pp. 189–203. Springer, 2011.

Chen, X. and Price, E.
Active regression via linear-sample sparsification. In Beygelzimer, A. and Hsu, D. (eds.), Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pp. 663–695, Phoenix, USA, 25–28 Jun 2019. PMLR. URL http://proceedings.mlr.press/v99/chen19a.html.

Cohn, D. A., Ghahramani, Z., and Jordan, M. I. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.

Cowan, W., Honda, J., and Katehakis, M. N. Normal bandits of unknown means and variances: Asymptotic optimality, finite horizon regret bounds, and a solution to an open problem. arXiv preprint arXiv:1504.05823, 2015.

Dereziński, M., Warmuth, M. K., and Hsu, D. Unbiased estimators for random design regression. arXiv preprint arXiv:1907.03411, 2019.

Erraqabi, A., Lazaric, A., Valko, M., Brunskill, E., and Liu, Y.-E. Trading off rewards and errors in multi-armed bandits. In Singh, A. and Zhu, J. (eds.), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 709–717, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR. URL http://proceedings.mlr.press/v54/erraqabi17a.html.

Fontaine, X., Berthet, Q., and Perchet, V. Regularized contextual bandits. In Chaudhuri, K. and Sugiyama, M. (eds.), Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pp. 2144–2153. PMLR, 16–18 Apr 2019. URL http://proceedings.mlr.press/v89/fontaine19a.html.

Freund, Y., Seung, H. S., Shamir, E., and Tishby, N. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.

Gao, W., Chan, P. S., Ng, H. K. T., and Lu, X. Efficient computational algorithm for optimal allocation in regression models. Journal of Computational and Applied Mathematics, 261:118–126, 2014.

Goos, P. and Jones, B. Optimal Design of Experiments: A Case Study Approach. John Wiley & Sons, 2011.
Hazan, E. and Karnin, Z. Hard-margin active linear regression. In International Conference on Machine Learning, pp. 883–891, 2014.

Hsu, D., Kakade, S. M., and Zhang, T. An analysis of random design linear regression. arXiv preprint arXiv:1106.2363, 2011.

Kirschner, J. and Krause, A. Information directed sampling and bandits with heteroscedastic noise. In Bubeck, S., Perchet, V., and Rigollet, P. (eds.), Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 358–384. PMLR, 06–09 Jul 2018. URL http://proceedings.mlr.press/v75/kirschner18a.html.

Mannor, S., Perchet, V., and Stoltz, G. Approachability in unknown games: Online learning meets multi-objective optimization. In Conference on Learning Theory, pp. 339–355, 2014.

Maurer, A. and Pontil, M. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.

Pukelsheim, F. Optimal Design of Experiments. SIAM, 2006.

Riquelme, C., Ghavamzadeh, M., and Lazaric, A. Active learning for accurate estimation of linear models. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2931–2939, International Convention Centre, Sydney, Australia, 06–11 Aug 2017a. PMLR. URL http://proceedings.mlr.press/v70/riquelme17a.html.

Riquelme, C., Johari, R., and Zhang, B. Online active linear regression via thresholding. In Thirty-First AAAI Conference on Artificial Intelligence, 2017b.

Sabato, S. and Munos, R. Active regression by stratification. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 469–477. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5468-active-regression-by-stratification.pdf.

Sagnol, G.
Optimal design of experiments with application to the inference of traffic matrices in large networks: second order cone programming and submodularity. PhD thesis, École Nationale Supérieure des Mines de Paris, 2010.

Soare, M. Sequential Resource Allocation in Linear Stochastic Bandits. PhD thesis, Université Lille 1 – Sciences et Technologies, December 2015. URL https://hal.archives-ouvertes.fr/tel-01249224.

Sugiyama, M. and Rubens, N. Active learning with model selection in linear regression. In Proceedings of the 2008 SIAM International Conference on Data Mining, pp. 518–529. SIAM, 2008.

Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

Yang, M., Biedermann, S., and Tang, E. On optimal designs for nonlinear models: a general and efficient algorithm. Journal of the American Statistical Association, 108(504):1411–1420, 2013.