# Adaptive Linear Estimating Equations

Mufang Ying, Department of Statistics, Rutgers University - New Brunswick, my426@scarletmail.rutgers.edu
Koulik Khamaru, Department of Statistics, Rutgers University - New Brunswick, k1241@stat.rutgers.edu
Cun-Hui Zhang, Department of Statistics, Rutgers University - New Brunswick, czhang@stat.rutgers.edu

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Sequential data collection has emerged as a widely adopted technique for enhancing the efficiency of data gathering processes. Despite its advantages, such a data collection mechanism often introduces complexities into the statistical inference procedure. For instance, the ordinary least squares (OLS) estimator in an adaptive linear regression model can exhibit non-normal asymptotic behavior, posing challenges for accurate inference and interpretation. In this paper, we propose a general method for constructing a debiased estimator that remedies this issue. It makes use of the idea of adaptive linear estimating equations, and we establish theoretical guarantees of asymptotic normality, supplemented by discussions on achieving near-optimal asymptotic variance. A salient feature of our estimator is that, in the context of multi-armed bandits, it retains the non-asymptotic performance of the least squares estimator while attaining the asymptotic normality property. Consequently, this work helps connect two fruitful paradigms of adaptive inference: a) non-asymptotic inference using concentration inequalities and b) asymptotic inference via asymptotic normality.

1 Introduction

Adaptive data collection arises as a common practice in various scenarios, with a notable example being the use of (contextual) bandit algorithms. Algorithms like these aid in striking a balance between exploration and exploitation within decision-making processes, encompassing domains such as personalized healthcare and web-based services [35, 24, 3, 22]. For instance, in personalized healthcare, the primary objective is to choose the most effective treatment for each patient based on their individual characteristics, such as medical history, genetic profile, and living environment. Bandit algorithms can be used to allocate treatments based on observed responses, and the algorithm updates its probability distribution to incorporate new information as patients receive treatment and their responses are observed. Over time, the algorithm can learn which treatments are the most effective for different types of patients.

Although the adaptivity in data collection improves the quality of data, the sequential (non-i.i.d.) nature of the data makes the inference procedure quite challenging [34, 26, 5, 28, 27, 10, 30, 29]. There is a lengthy literature on the problem of parameter estimation in the adaptive design setting. In a series of works [15, 19, 17], the authors studied the consistency of the least squares estimator for an adaptive linear model. In a later work, Lai [14] studied the consistency of the least squares estimator in a nonlinear regression model. The collective wisdom of these papers is that, for adaptive data collection methods, standard estimators are consistent under a mild condition on the maximum and minimum eigenvalues of the covariance matrix [19, 14]. In a more recent line of work [1, 2], the authors provide a high probability upper bound on the $\ell_2$-error of the least squares estimator for a linear model.
We point out that, while the high probability bounds provide a quantitative understanding of the OLS estimator, these results assume a stronger sub-Gaussian assumption on the noise variables. The problem of inference, i.e. constructing valid confidence intervals, with adaptively collected data is much more delicate. Lai and Wei [19] demonstrated that for a unit root autoregressive model, which is an example of an adaptive linear regression model, the least squares estimator does not achieve asymptotic normality. Furthermore, the authors showed that for a linear regression model, the least squares estimator is asymptotically normal when the data collection procedure satisfies a stability condition. Concretely, letting $x_i$ denote the covariate associated with the $i$-th sample, the authors require
$$B_n^{-1} S_n \overset{p}{\to} I, \qquad (1)$$
where $S_n = \sum_{i=1}^n x_i x_i^\top$ and $\{B_n\}_{n \ge 1}$ is a sequence of non-random positive definite matrices. Unfortunately, in many scenarios, the stability condition (1) is violated [38, 19]. Moreover, in practice, it might be difficult to verify whether the stability condition (1) holds or not. In another line of research [10, 36, 37, 4, 9, 28, 31, 25, 38], the authors assume knowledge of the underlying data collection algorithm and provide asymptotically valid confidence intervals. While this approach offers intervals under a much weaker assumption on the underlying model, full knowledge of the data collection algorithm is often unavailable in practice.

Online debiasing based methods: In order to produce valid statistical inference when the stability condition (1) does not hold, some authors [8, 7, 13] utilize the idea of online debiasing. At a high level, the online debiased estimator reduces the bias of an initial estimate (usually the least squares estimate) by adding correction terms, and the online debiasing procedure does not require knowledge of the data generating process. Although this procedure guarantees that the bias asymptotically vanishes, the bias term's convergence rate can be quite slow. In this work, we consider estimating the unknown parameter in an adaptive linear model by using a set of adaptive linear estimating equations (ALEE). We show that our proposed ALEE estimator achieves asymptotic normality without knowledge of the exact data collection algorithm, while addressing the slowly decaying bias problem of the online debiasing procedure.

2 Background and problem set-up

In this section, we provide the background for our problem and set up a few notations. We begin by defining the adaptive data collection mechanism for linear models.

2.1 Adaptive linear model

Suppose a scalar response variable $y_t$ is linked to a covariate vector $x_t \in \mathbb{R}^d$ at time $t$ via the linear model
$$y_t = x_t^\top \theta^* + \epsilon_t \quad \text{for } t \in [n], \qquad (2)$$
where $\theta^* \in \mathbb{R}^d$ is the unknown parameter of interest. In an adaptive linear model, the regressor $x_t$ at time $t$ is assumed to be an (unknown) function of the prior data points $\{x_1, y_1, \ldots, x_{t-1}, y_{t-1}\}$ as well as any additional source of randomness that may be present in the data collection process. Formally, we assume there is an increasing sequence of $\sigma$-fields $\{\mathcal{F}_t\}_{t \ge 0}$ such that $\sigma(x_1, y_1, \ldots, x_{t-1}, y_{t-1}, x_t) \subseteq \mathcal{F}_{t-1}$ for $t \in [n]$. For the noise variables $\{\epsilon_t\}_{t \ge 1}$ appearing in equation (2), we impose the conditions
$$\mathbb{E}[\epsilon_t \mid \mathcal{F}_{t-1}] = 0, \quad \mathbb{E}[\epsilon_t^2 \mid \mathcal{F}_{t-1}] = \sigma^2, \quad \text{and} \quad \sup_{t \ge 1} \mathbb{E}\big[|\epsilon_t/\sigma|^{2+\delta} \mid \mathcal{F}_{t-1}\big] < \infty, \qquad (3)$$
for some $\delta > 0$. The above condition is relatively mild compared to a sub-Gaussian condition.
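To make the data collection mechanism concrete, here is a minimal simulation sketch (our own illustration, not the authors' code; the function name and the exploration schedule are assumptions chosen to mirror the ε-greedy rule used later in Section 4). It generates data from the adaptive linear model (2) with a two-armed ε-greedy rule, so that each covariate $x_t$ is a function of the past observations plus an independent randomization, and the Gaussian noise trivially satisfies condition (3).

```python
import numpy as np

def simulate_epsilon_greedy(n=1000, theta=(0.3, 0.3), sigma=1.0, seed=0):
    """Two-armed bandit data from model (2): each x_t is a standard basis vector."""
    rng = np.random.default_rng(seed)
    d = len(theta)
    X = np.zeros((n, d))                      # rows are the covariates x_t
    y = np.zeros(n)
    counts, sums = np.ones(d), np.zeros(d)    # running pull counts and reward sums
    for t in range(n):
        eps_t = np.sqrt(np.log(t + 2) / (t + 2))      # decaying exploration rate
        if rng.random() < eps_t:
            arm = rng.integers(d)                     # explore uniformly
        else:
            arm = int(np.argmax(sums / counts))       # exploit the empirical best arm
        X[t, arm] = 1.0                               # x_t depends only on the past
        y[t] = theta[arm] + sigma * rng.standard_normal()   # y_t = x_t' theta* + eps_t
        counts[arm] += 1
        sums[arm] += y[t]
    return X, y
```

When the two arm means are equal, the number of times each arm is pulled is highly random, the stability condition (1) fails, and the OLS estimates of the arm means need not be asymptotically normal; this is the regime illustrated in Figure 1 below.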
Examples of the adaptive linear model arise in various problems, including multi-armed and contextual bandit problems, dynamical input-output systems, adaptive approximation schemes and time series models. For instance, in the context of the multi-armed bandit problem, the design vector $x_t$ is one of the basis vectors $\{e_k\}_{k \in [d]}$, representing the arm being pulled, while $\theta^*$ and $y_t$ represent the true mean reward vector and the reward at time $t$, respectively.

2.2 Adaptive linear estimating equations

As we mentioned earlier, the OLS estimator can fail to achieve asymptotic normality due to the instability of the covariance matrix with adaptively collected data. To get around this issue, we consider a different approach, ALEE (adaptive linear estimating equations). Namely, we obtain an estimate by solving a system of linear estimating equations with adaptive weights,
$$\text{ALEE:} \quad \sum_{t=1}^n w_t \big(y_t - x_t^\top \hat\theta_{\mathrm{ALEE}}\big) = 0. \qquad (4)$$
Here the weight $w_t \in \mathbb{R}^d$ is chosen so that $w_t \in \mathcal{F}_{t-1}$ for $t \in [n]$. Let us now try to gain some intuition behind the construction of ALEE. Rewriting equation (4), we have
$$\Big\{\sum_{t=1}^n w_t x_t^\top\Big\}\big(\hat\theta_{\mathrm{ALEE}} - \theta^*\big) = \sum_{t=1}^n w_t \epsilon_t. \qquad (5)$$
Notably, the choice $w_t \in \mathcal{F}_{t-1}$ makes $\sum_{t=1}^n w_t \epsilon_t$ the sum of a martingale difference sequence. Our first result postulates conditions on the weight vectors $\{w_t\}_{t \ge 1}$ under which the right-hand side of (5) converges to a normal distribution asymptotically. Throughout the paper, we use the shorthand $W_t = (w_1, \ldots, w_t)^\top \in \mathbb{R}^{t \times d}$ and $X_t = (x_1, \ldots, x_t)^\top \in \mathbb{R}^{t \times d}$.

Proposition 2.1. Suppose condition (3) holds and the predictable sequence $\{w_t\}_{1 \le t \le n}$ satisfies
$$\max_{1 \le t \le n} \|w_t\|_2 = o_p(1) \quad \text{and} \quad \big\|I_d - W_n^\top W_n\big\|_{\mathrm{op}} = o_p(1). \qquad (6)$$
Let $A_w = V_w U_w^\top X_n$ with $W_n = U_w \Lambda_w V_w^\top$ being the SVD of $W_n$. Then,
$$A_w\big(\hat\theta_{\mathrm{ALEE}} - \theta^*\big)/\hat\sigma \overset{d}{\to} N(0, I_d), \qquad (7)$$
where $\hat\sigma$ is any consistent estimator of $\sigma$.

Proof. Invoking the second part of condition (6), we have that $\Lambda_w$ is invertible for large $n$ and $\|V_w \Lambda_w^{-1} V_w^\top - I_d\|_{\mathrm{op}} = o_p(1)$. Utilizing the expression (5), we have
$$A_w\big(\hat\theta_{\mathrm{ALEE}} - \theta^*\big)/\sigma = V_w \Lambda_w^{-1} V_w^\top W_n^\top X_n \big(\hat\theta_{\mathrm{ALEE}} - \theta^*\big)/\sigma = V_w \Lambda_w^{-1} V_w^\top \sum_{t=1}^n w_t \epsilon_t/\sigma.$$
Invoking the stability condition on the weights $\{w_t\}$ and using the fact that $\{w_t \epsilon_t\}$ is a martingale difference sequence, we conclude from the martingale central limit theorem [11, Theorem 2.1] that $\sum_{t=1}^n w_t \epsilon_t/\sigma \overset{d}{\to} N(0, I_d)$. Combining the last display with $\|V_w \Lambda_w^{-1} V_w^\top - I_d\|_{\mathrm{op}} = o_p(1)$ and using Slutsky's theorem yields $A_w(\hat\theta_{\mathrm{ALEE}} - \theta^*)/\sigma \overset{d}{\to} N(0, I_d)$. The claim of Proposition 2.1, with $\sigma$ replaced by the consistent estimate $\hat\sigma$, now follows from another application of Slutsky's theorem.

A few comments regarding Proposition 2.1 are in order. A straightforward calculation shows
$$A_w^\top A_w = X_n^\top P_w X_n \preceq S_n, \quad \text{where } P_w = W_n\big(W_n^\top W_n\big)^{-1} W_n^\top. \qquad (8)$$
In words, the volume of the confidence region based on (7) is always larger than that of the confidence region generated by the least squares estimate. Therefore, the ALEE-based inference, which is consistently valid, exhibits a reduced efficiency in cases where both types of confidence regions are valid. Compared with the confidence regions based on OLS, the advantage of the ALEE approach is the flexibility in the choice of weights, which guarantees the validity of the CLT conditions (6).

Next, note that the matrix $A_w$ is asymptotically equivalent to the matrix $W_n^\top X_n$ (see equation (5)) under the stability condition (6). The benefit of this reformulation is that it helps us better understand the efficiency of ALEE compared with the OLS. This has led us to define a notion of affinity between the weights $\{w_t\}_{t \ge 1}$ and the covariates $\{x_t\}_{t \ge 1}$, both for understanding the efficiency of ALEE and for designing nearly optimal weights, as will become clear in the next section. A small numerical sketch of how the ALEE estimate and the matrix $A_w$ are computed is given below.
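For concreteness, the following small sketch (our own illustration under the assumption of a numpy-based workflow, not the authors' implementation) shows that solving the estimating equation (4) amounts to solving a d x d linear system, and how the normalizing matrix $A_w$ of Proposition 2.1 is formed from the SVD of $W_n$.

```python
# Given a weight matrix W (n x d) with predictable rows w_t and data (X, y),
# equation (4), sum_t w_t (y_t - x_t' theta) = 0, is the linear system
# (W' X) theta = W' y.  The matrix A_w is V_w U_w' X_n from the SVD of W.
import numpy as np

def alee_estimate(W, X, y):
    """Solve equation (4) and return (theta_hat, A_w)."""
    lhs = W.T @ X                        # sum_t w_t x_t^T, a d x d matrix
    rhs = W.T @ y                        # sum_t w_t y_t
    theta_hat = np.linalg.solve(lhs, rhs)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)   # W = U diag(s) V^T
    A_w = Vt.T @ U.T @ X                 # A_w = V_w U_w^T X_n as in Proposition 2.1
    return theta_hat, A_w
```

Given a consistent estimate $\hat\sigma$, Proposition 2.1 says that $A_w(\hat\theta_{\mathrm{ALEE}} - \theta^*)/\hat\sigma$ is approximately $N(0, I_d)$ once the weights satisfy (6), so elliptical confidence regions can be read off from $\chi^2_d$ quantiles.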
Finally, it is straightforward to obtain a consistent estimate of $\sigma$. For instance, assuming $\log(\lambda_{\max}(X_n^\top X_n))/n \overset{a.s.}{\to} 0$ and the noise condition (3), we have
$$\frac{1}{n}\sum_{t=1}^n \big(y_t - x_t^\top \hat\theta_{\mathrm{LS}}\big)^2 \overset{a.s.}{\to} \sigma^2. \qquad (9)$$
Here, $\hat\theta_{\mathrm{LS}}$ refers to the least squares estimate. See [19, Lemma 3] for a detailed proof of equation (9).

3 Main results

In this section, we propose methods to construct weights $\{w_t\}_{t \ge 1}$ which satisfy the stability property (6), and we study the resulting ALEE. Section 3.1 is devoted to the multi-armed bandit case, Section 3.2 to an autoregressive model, and Section 3.3 to the contextual bandit case.

Before delving into the details, let us try to understand intuitively how to construct weights with the desirable properties. The expression (8) reveals that the efficiency of ALEE depends on the projection of the data matrix $X_n$ onto $W_n$. Thus, the efficiency of the ALEE approach can be measured by the principal angles between the random projections $P_w$ in (8) and $P_x = X_n S_n^{-1} X_n^\top$. Accordingly, we define the affinity $\mathcal{A}(W_n, X_n)$ of the weights $\{w_t\}_{t \ge 1}$ as the cosine of the largest principal angle, or equivalently
$$\mathcal{A}(W_n, X_n) = \sigma_d(P_w P_x) = \sigma_d\big(U_w^\top X_n S_n^{-1/2}\big), \qquad (10)$$
the $d$-th largest singular value of $P_w P_x$. Formally, the above definition captures the cosine of the angle between the two subspaces spanned by the columns of $X_n$ and $W_n$, respectively [12]. Good weights $\{w_t\}_{t \ge 1}$ are those with relatively large affinity, or
$$U_w \approx X_n S_n^{-1/2} \quad \text{(approximately)}. \qquad (11)$$

3.1 Multi-armed bandits

In the context of the K-armed bandit problem, the Gram matrix has a diagonal structure, which means that we can construct the weights $\{w_t\}_{t \ge 1}$ for each coordinate independently. For an arm $k \in [K]$ and round $t \ge 1$, define
$$s_{t,k} = s_0 + \sum_{i=1}^t x_{i,k}^2 \quad \text{for some positive } s_0 \in \mathcal{F}_0. \qquad (12)$$
Define the $k$-th coordinate of the weight $w_t$ as
$$w_{t,k} = f\Big(\frac{s_{t,k}}{s_0}\Big)\frac{x_{t,k}}{\sqrt{s_0}} \quad \text{with} \quad f(x) = \sqrt{\frac{\log 2}{x \log(e^2 x)\,(\log\log(e^2 x))^2}}. \qquad (13)$$
The intuition behind the above construction is as follows. The discussion near equation (11) indicates that the $k$-th coordinate of $w_t$ should be proportional to $x_{t,k}/(\sum_{i \le n} x_{i,k}^2)^{1/2}$. However, the weight $w_t$ is required to be predictable, so it can only depend on the data points up to time $t$.¹ Consequently, we approximate the sum $\sum_{i \le n} x_{i,k}^2$ by the partial sum $s_{t,k}$ in (12). Finally, note that
$$w_{t,k} = f\Big(\frac{s_{t,k}}{s_0}\Big)\frac{x_{t,k}}{\sqrt{s_0}} \approx \frac{x_{t,k}}{\sqrt{s_{t,k}}} \quad \text{up to logarithmic factors}. \qquad (14)$$
The logarithmic factors in (13) ensure that the stability conditions (6) hold. In the following, we generalize the above method into a general strategy for constructing weights $\{w_t\}_{t \ge 1}$ satisfying the stability condition (6); the code sketch after Figure 1 gives a concrete computation of the weights in (13).

¹ Note that $x_{t,k} \in \mathcal{F}_{t-1}$ can be used to construct $w_t$.

Figure 1: Empirical distribution of the standardized estimation errors from the OLS and ALEE approaches, compared against the $N(0,1)$ density. Results are obtained with a dataset of size n = 1000 and 3000 independent replications. Left: AR(1) model $y_t = y_{t-1} + \epsilon_t$ with independent errors $\epsilon_t \sim N(0, 1)$. Right: Two-armed bandit problem with equal arm means $\theta^*_1 = \theta^*_2 = 0.3$ and independent noise $\epsilon_t \sim N(0, 1)$. Figure 4 in Section C.3 considers the same setting with centered Poisson noise, which is not sub-Gaussian.
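As a concrete illustration of the construction (12)-(13), the following sketch (our own naming conventions, not the authors' code) computes the per-arm weights $w_{t,k} = f(s_{t,k}/s_0)\,x_{t,k}/\sqrt{s_0}$ from an n x K indicator design matrix; the function f with beta = 1 reproduces equation (13) and corresponds to Corollary 3.2 below.

```python
import numpy as np

def f(x, beta=1.0):
    # f(x) = sqrt(beta * (log 2)^beta / (x * log(e^2 x) * (log log(e^2 x))^{1+beta}))
    return np.sqrt(beta * np.log(2.0) ** beta
                   / (x * np.log(np.e ** 2 * x)
                      * np.log(np.log(np.e ** 2 * x)) ** (1.0 + beta)))

def bandit_weights(X, s0, beta=1.0):
    """Return the n x K matrix of weights w_{t,k} from equations (13)/(16)."""
    n, K = X.shape
    W = np.zeros((n, K))
    s = np.full(K, float(s0))            # running sums s_{t,k}, initialized at s_0
    for t in range(n):
        s += X[t] ** 2                   # s_{t,k} = s_0 + sum_{i <= t} x_{i,k}^2
        W[t] = f(s / s0, beta) * X[t] / np.sqrt(s0)
    return W
```

For instance, with data from the ε-greedy sketch of Section 2.1, calling bandit_weights(X, s0=np.e**2 * np.log(len(X))) mirrors the choice $s_0 = e^2 \log n$ used in the experiments of Section 4, and the $k$-th ALEE estimate then solves $\sum_t w_{t,k}(y_t - x_{t,k}\theta_k) = 0$.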
3.1.1 Stable weight construction strategy

Consider a positive decreasing function $f(x)$ on the interval $[1, \infty)$ with an increasing derivative $f'(x)$. Let $f$ satisfy the conditions:
$$f'/f \text{ is increasing}, \quad \int_1^\infty f^2(x)\,dx = 1, \quad \text{and} \quad \int_1^\infty f(x)\,dx = \infty. \qquad (15)$$
With $s_0 \in \mathcal{F}_0$, we define the weight $w_{t,k}$ as
$$w_{t,k} = f\Big(\frac{s_{t,k}}{s_0}\Big)\frac{x_{t,k}}{\sqrt{s_0}} \quad \text{with} \quad s_{t,k} = s_0 + \sum_{i=1}^t x_{i,k}^2. \qquad (16)$$
A key condition that ensures the weights $\{w_{t,k}\}_{t \ge 1}$ satisfy the desirable stability property (6) is
$$\max_{1 \le t \le n} f^2\Big(\frac{s_{t,k}}{s_0}\Big)\frac{x_{t,k}^2}{s_0} + \max_{1 \le t \le n}\Big|1 - \frac{f(s_{t,k}/s_0)}{f(s_{t-1,k}/s_0)}\Big| + \int_{s_{n,k}/s_0}^{\infty} f^2(x)\,dx = o_p(1). \qquad (17)$$
For multi-armed bandits, this condition is automatically satisfied when both quantities $1/s_0$ and $s_0/s_{n,k}$ converge to zero in probability. Putting together the pieces, we have the following result for multi-armed bandits.

Theorem 3.1. Suppose condition (3) holds and $1/s_0 + s_0/s_{n,k} = o_p(1)$ for some $k \in [K]$. Then, the $k$-th coordinate $\hat\theta_{\mathrm{ALEE},k}$, obtained using the weights from equation (16), satisfies
$$\big(\hat\theta_{\mathrm{ALEE},k} - \theta^*_k\big)\,\frac{\sqrt{s_0}}{\hat\sigma}\int_1^{s_{n,k}/s_0} f(x)\,dx \overset{d}{\to} N(0,1), \qquad (18)$$
where $\hat\sigma$ is a consistent estimate of $\sigma$. Equivalently,
$$\frac{\hat\theta_{\mathrm{ALEE},k} - \theta^*_k}{\hat\sigma\,\big(\sum_{1 \le t \le n} w_{t,k}^2\big)^{1/2}}\sum_{t=1}^n w_{t,k} x_{t,k} \overset{d}{\to} N(0,1). \qquad (19)$$
The proof of Theorem 3.1 can be found in Section A.1 of the Appendix. A few comments regarding Theorem 3.1 are in order.

First, the above theorem enables us to construct valid CIs for the mean $\theta^*_k$ of a sub-optimal arm $k$ when employing an asymptotically optimal allocation rule achieving the optimal regret of [18], in which case the sample size $\sum_{t \le n} x_{t,k}$ is of order $\log n$, or when using a sub-optimal rule for which it is of order $\mathrm{polylog}(n)$. On the other hand, the classical martingale CLT is applicable to the optimal arm (if unique) under such asymptotically optimal or sub-optimal allocation rules. Consequently, one may obtain a valid CI for the optimal arm from the standard OLS estimate [19]. However, it is important to note that such CIs are not guaranteed for sub-optimal arms.

Next, while Theorem 3.1 holds for any $s_0$ diverging to infinity but of smaller order than $s_{n,k}$ (which may depend on $k$), the convergence rate of $\sum_{1 \le t \le n} w_{t,k}\epsilon_t$ to normality is enhanced by choosing a large value of $s_0$. In practical terms, it is advisable to choose an $s_0$ that is slightly smaller than the best-known lower bound for $s_{n,k}$.

Finally, the choice of the function $f$ determines the efficiency of the ALEE estimator. For instance, taking $f(x) = 1/x$, we obtain an estimator with asymptotic variance of order $1/\{s_0 \log^2(s_{n,k}/s_0)\}$, which is only better than what one would get using stopping time results by a logarithmic factor. In the next corollary, an improved choice of $f$ yields near-optimal variance up to logarithmic terms.

Corollary 3.2. Consider the same set of assumptions as stated in Theorem 3.1. The ALEE estimator $\hat\theta_{\mathrm{ALEE},k}$, obtained by using $f(x) = (\beta \log^\beta 2)^{1/2}\{x(\log e^2 x)(\log\log e^2 x)^{1+\beta}\}^{-1/2}$ for any $\beta > 0$, satisfies
$$2\big(\beta\log^\beta 2\big)^{1/2}\sqrt{\frac{s_{n,k}}{\log(s_{n,k}/s_0)\{\log\log(s_{n,k}/s_0)\}^{1+\beta}}}\;\frac{\hat\theta_{\mathrm{ALEE},k}-\theta^*_k}{\hat\sigma} \overset{d}{\to} N(0,1).$$
The proof of this corollary follows directly from Theorem 3.1. For $s_0 = \log n/\log\log n$ in multi-armed bandits with asymptotically optimal allocations, $\log(s_{n,k}/s_0) = (1 + o(1)) \log\log s_{n,k}$.

3.1.2 Finite sample bounds for ALEE estimators

One may also construct finite sample confidence intervals for each arm by applying concentration bounds. Indeed, for any arm $k \in [K]$, we have
$$\Big\{\sum_{t=1}^n w_{t,k} x_{t,k}\Big\}\big(\hat\theta_{\mathrm{ALEE},k} - \theta^*_k\big) = \sum_{t=1}^n w_{t,k}\epsilon_t. \qquad (20)$$
Following the construction $w_{t,k} \in \mathcal{F}_{t-1}$, the term $\sum_{t=1}^n w_{t,k}\epsilon_t$ is amenable to concentration inequalities if we assume that the noise $\epsilon_t$ is sub-Gaussian conditioned on $\mathcal{F}_{t-1}$, i.e.,
$$\mathbb{E}\big[e^{\lambda\epsilon_t} \mid \mathcal{F}_{t-1}\big] \le e^{\sigma_g^2\lambda^2/2} \quad \text{for all } \lambda \in \mathbb{R}. \qquad (21)$$

Corollary 3.3 (Theorem 1 in [1]). Suppose the sub-Gaussian noise condition (21) is in force. Then for any $\delta > 0$ and $\lambda_0 > 0$, the following bound holds with probability at least $1 - \delta$:
$$\Big(\sum_{t=1}^n w_{t,k}x_{t,k}\Big)\,\big|\hat\theta_{\mathrm{ALEE},k}-\theta^*_k\big| \le \sigma_g\sqrt{\Big(\lambda_0 + \sum_{t=1}^n w_{t,k}^2\Big)\log\Big(\frac{\lambda_0 + \sum_{t=1}^n w_{t,k}^2}{\delta^2\lambda_0}\Big)}. \qquad (22)$$

Remark 3.4.
In the context of multi-armed bandits, by considering the function $f$ in Corollary 3.2 with $\beta = 1$ and Corollary 3.3 with $\lambda_0 = 1$, we derive that with probability at least $1 - \delta$,
$$\big|\hat\theta_{\mathrm{ALEE},k}-\theta^*_k\big| \le \sigma_g\sqrt{2\log(2/\delta^2)}\;\frac{\sqrt{2+\log(s_{n,k}/s_0)}\,\log\{2+\log(s_{n,k}/s_0)\}}{\sqrt{s_{n,k}}-\sqrt{s_0}}, \qquad (23)$$
provided $s_0 > 1$. See Section A.2 of the Appendix for a proof of this argument. Recalling that $\sqrt{s_{n,k}} = (s_0 + \sum_{i \le n} x_{i,k}^2)^{1/2}$, the bound is in the same spirit as existing finite sample bounds for the OLS estimator of arm means [1, 21]. In simple terms, the ALEE estimator behaves similarly to the OLS estimator in a non-asymptotic setting while still maintaining asymptotic normality.

3.2 Autoregressive time series

Next, we focus on the autoregressive time series model
$$y_t = \theta^* y_{t-1} + \epsilon_t \quad \text{for } t \in [n], \qquad (24)$$
where $y_0 = 0$. Note that the above model is a special case of the adaptive linear model (2). It is well known that when $\theta^* \in (-1, 1)$, the time series model (24) satisfies the stability assumption (1). Consequently, one might use the OLS-based confidence intervals [19] for $\theta^*$. However, when $\theta^* = 1$, also known as the unit root case, the stability condition (1) does not hold, and the least squares estimator is not asymptotically normal [19]. In other words, when $\theta^* = 1$, the least squares based intervals do not provide correct coverage. In this section, we apply the ALEE-based approach to construct confidence intervals that are valid for $\theta^* \in [-1, 1]$. Similar to the previous sections, let $s_0 \in \mathcal{F}_0$ and denote $s_t = s_0 + \sum_{1 \le i \le t} y_{i-1}^2$. Following a construction similar to the last section, we have the following corollary.

Corollary 3.5. Assume the noise variables $\{\epsilon_t\}_t$ are i.i.d. with mean zero, variance $\sigma^2$ and sub-Gaussian parameter $\sigma_g^2$. Then, for any $\theta^* \in [-1, 1]$, the ALEE estimator, obtained using $w_t = f(s_t/s_0)\, y_{t-1}/\sqrt{s_0}$ with the function $f$ from Corollary 3.2 (for a given $\beta > 0$) and $s_0 = n/\log\log(n)$, satisfies
$$2\big(\beta\log^\beta 2\big)^{1/2}\sqrt{\frac{s_n}{\log(s_n/s_0)\{\log\log(s_n/s_0)\}^{1+\beta}}}\;\frac{\hat\theta_{\mathrm{ALEE}}-\theta^*}{\hat\sigma}\overset{d}{\to} N(0, 1). \qquad (25)$$
The proof of Corollary 3.5 can be found in Section A.3 of the Appendix.

3.3 Contextual bandits

In contextual bandit problems, the task of defining adaptive weights that satisfy the stability condition (6) while maintaining a large affinity is challenging. Without loss of generality, we assume that $\|x_t\|_2 \le 1$. Following the discussion around (11) and using $S_t$ as an approximation of $S_n$, we see that a good choice for the weight is $w_t \propto S_t^{-1/2} x_t$. However, it is not at all clear at the moment why the above choice produces $d$-dimensional weights $w_t$ satisfying the stability condition (6). It turns out that the success of our construction is based on the variability of a certain matrix $V_t$. For an $\mathcal{F}_0$-measurable $d \times d$ symmetric matrix $\Sigma_0 \succeq I_d$ and $t \in [n]$, we define
$$\Sigma_t = \Sigma_0 + \sum_{i=1}^t x_i x_i^\top \quad \text{and} \quad z_t = \Sigma_{t-1}^{-1/2} x_t. \qquad (26)$$
For $t \in [n]$, we define the variability matrix $V_t$ as
$$V_t = \Big(I_d + \sum_{i=1}^t z_i z_i^\top\Big)^{-1} \quad \text{(Variability)}. \qquad (27)$$
The variability matrix $V_t$ comes up frequently in the finite sample analysis of the least squares estimator [16, 19, 2], of generalized linear models with adaptive data [23], and in online optimization [6]; see the comments after Theorem 3.6 for a more detailed discussion of the matrix $V_t$. Now, we define the weights $\{w_t\}_{t \ge 1}$ as
$$w_t = \sqrt{1 + z_t^\top V_{t-1} z_t}\; V_t z_t. \qquad (28)$$

Theorem 3.6. Suppose condition (3) holds and $\|\Sigma_0^{-1}\|_{\mathrm{op}} + \|V_n\|_{\mathrm{op}} = o_p(1)$. Then, the ALEE estimator $\hat\theta_{\mathrm{ALEE}}$, obtained using the weights $\{w_t\}_{1 \le t \le n}$ from (28), satisfies
$$\Big(\sum_{t=1}^n w_t x_t^\top\Big)\big(\hat\theta_{\mathrm{ALEE}} - \theta^*\big) \overset{d}{\to} N(0, \sigma^2 I_d).$$
The proof of Theorem 3.6 can be found in Section A.4 of the Appendix.
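The recursion behind (26)-(28) is easy to implement. The following sketch (illustrative only, with our own function names; the eigendecomposition-based inverse square root is an implementation choice) maintains $\Sigma_t$, applies the rank-one update of $V_t$ from Lemma 3.7 below, and forms the weights $w_t$ of equation (28).

```python
import numpy as np

def inv_sqrt(M):
    # symmetric inverse square root via eigendecomposition
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def contextual_weights(X, Sigma0):
    """Rows of X are contexts x_t with ||x_t|| <= 1; Sigma0 is the F_0-measurable prior matrix."""
    n, d = X.shape
    Sigma = Sigma0.astype(float).copy()   # Sigma_{t-1}, starts at Sigma_0
    V = np.eye(d)                         # V_0 = I_d
    W = np.zeros((n, d))
    for t in range(n):
        z = inv_sqrt(Sigma) @ X[t]                 # z_t = Sigma_{t-1}^{-1/2} x_t
        Vz = V @ z                                 # V_{t-1} z_t
        c = 1.0 + z @ Vz                           # 1 + z_t' V_{t-1} z_t
        V = V - np.outer(Vz, Vz) / c               # rank-one update giving V_t
        W[t] = np.sqrt(c) * (V @ z)                # w_t = sqrt(c) V_t z_t, equation (28)
        Sigma = Sigma + np.outer(X[t], X[t])       # Sigma_t
    return W, V
```

With $\Sigma_0 = \log(n)\,I_d$ as in the experiments of Section 4, the returned V equals $V_n$; by Lemma 3.7, W.T @ W equals $I_d - V_n$, and a small operator norm of $V_n$ indicates that the stability condition (6) is in force.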
In Theorem B.4 in the appendix, we establish the asymptotic normality of a modified version of the ALEE estimator, which has the same asymptotic variance as the one in Theorem 3.6, under the assumption $\|\Sigma_0^{-1}\|_{\mathrm{op}} = o_p(1)$ alone. In other words, the modified Theorem B.4 does not assume any condition on $\|V_n\|_{\mathrm{op}}$. To better convey the idea of our construction, we provide a lemma that may be of independent interest. This lemma applies to weights $w_t$ generated by (27) and (28) with general $z_t$.

Lemma 3.7. Let $w_t$ be as in (28) with the variability matrix $V_t$ in (27). Then,
$$\sum_{t=1}^n w_t w_t^\top = I_d - V_n, \qquad \max_{1 \le t \le n}\|w_t\|_2 = \max_{1 \le t \le n}\frac{\|V_{t-1}z_t\|_2}{(1 + z_t^\top V_{t-1} z_t)^{1/2}}. \qquad (29)$$
For $z_t \in \mathcal{F}_{t-1}$, the stability condition (6) holds when $\max_{1 \le t \le n} z_t^\top V_t z_t + \|V_n\|_{\mathrm{op}} = o_p(1)$.

Proof. For any $t \ge 1$, $V_t = V_{t-1} - V_{t-1}z_t z_t^\top V_{t-1}/(1 + z_t^\top V_{t-1}z_t)$. It follows that $V_t z_t = V_{t-1}z_t/(1 + z_t^\top V_{t-1}z_t)$ and $\sum_{t=1}^n w_t w_t^\top = \sum_{t=1}^n V_{t-1}(V_t^{-1} - V_{t-1}^{-1})V_t = I_d - V_n$.

Comments on the conditions of Theorem 3.6: It is instructive to compare the conditions of Theorem 3.1 and Theorem 3.6. The condition $\|\Sigma_0^{-1}\|_{\mathrm{op}} = o_p(1)$ is an analogue of the condition $1/s_0 = o_p(1)$. The condition $\|V_n\|_{\mathrm{op}} = o_p(1)$ is a bit more subtle; it is an analogue of the condition $s_0/s_{n,k} = o_p(1)$. Indeed, applying the elliptical potential lemma [2, Lemma 4] yields
$$\log\det(\Sigma_0 + S_n) - \log\det(\Sigma_0) \;\le\; \operatorname{trace}\big(V_n^{-1}\big) - d = \sum_{t=1}^n x_t^\top \Sigma_{t-1}^{-1} x_t \;\le\; 2\big\{\log\det(\Sigma_0 + S_n) - \log\det(\Sigma_0)\big\}, \qquad (30)$$
where $S_n = \sum_{i=1}^n x_i x_i^\top$ is the Gram matrix. We see that for $\|V_n\|_{\mathrm{op}} = o_p(1)$, it is necessary that the eigenvalues of $S_n$ grow to infinity at a faster rate than the eigenvalues of $\Sigma_0$. Moreover, in the case of dimension $d = 1$, the condition $\|V_n\|_{\mathrm{op}} = o_p(1)$ is equivalent to $s_0/s_n = o_p(1)$.

4 Numerical experiments

In this section, we consider three settings: a two-armed bandit setting, a first-order autoregressive model setting and a contextual bandit setting. In the two-armed bandit setting, the rewards are generated with the same arm means $(\theta^*_1, \theta^*_2) = (0.3, 0.3)$, and the noise is generated from a normal distribution with mean 0 and variance 1. To collect the two-armed bandit data, we use the ε-greedy algorithm with decaying exploration rate $\sqrt{\log(t)/t}$. The rate is designed to make sure the number of times each arm is pulled is of order greater than $\log(n)$ up to time $n$. In the second setting, we consider the time series model
$$y_t = \theta^* y_{t-1} + \epsilon_t, \qquad (31)$$
where $\theta^* = 1$ and the noise $\epsilon_t$ is drawn from a normal distribution with mean 0 and variance 1. In the contextual bandit setting, we consider the true parameter $\theta^*$ to be 0.3 times the all-one vector. In the initial iterations, random contexts $x_t$ are generated from the uniform distribution on $\mathbb{S}^{d-1}$. Then, we apply the ε-greedy algorithm to these pre-selected contexts with decaying exploration rate $\log^2(t)/t$. For all of the above three settings, we run 1000 independent replications.

[Figure 2 panels: empirical versus target coverage probability for the lower and upper tails of $\theta^*_1$, and the average two-sided CI width for $\theta^*_1$, comparing OLS, ALEE, Concentration, and W-decorrelation.]

Figure 2: Two-armed bandit problem with equal arm means $\theta^*_1 = \theta^*_2 = 0.3$. Error bars plotted are standard errors.

To analyze the data collected in these settings, we apply the ALEE approach with the weights specified in Corollary 3.2, Corollary 3.5 and Theorem B.4, respectively; a compact end-to-end sketch of the two-armed bandit pipeline is given below.
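The two-armed bandit pipeline can be summarized in a few lines. The sketch below (our own illustration with hypothetical function names, reusing the kind of data produced by the simulation sketch in Section 2.1) forms the ALEE point estimate and a normal confidence interval for a chosen arm via equation (19), with $\hat\sigma$ taken from the OLS residuals as in equation (9) and the weights of Corollary 3.2 with $\beta = 1$ and $s_0 = e^2\log n$.

```python
import numpy as np
from scipy import stats

def alee_arm_ci(X, y, k, alpha=0.1, beta=1.0):
    """ALEE estimate and (1 - alpha) CI for arm k from n x 2 indicator design X and rewards y."""
    n = len(y)
    s0 = np.e ** 2 * np.log(n)
    # sigma-hat from OLS residuals, equation (9)
    theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma_hat = np.sqrt(np.mean((y - X @ theta_ls) ** 2))
    # weights w_{t,k} of equation (13)/(16)
    s = s0 + np.cumsum(X[:, k] ** 2)
    u = s / s0
    f = np.sqrt(beta * np.log(2.0) ** beta
                / (u * np.log(np.e ** 2 * u) * np.log(np.log(np.e ** 2 * u)) ** (1 + beta)))
    w = f * X[:, k] / np.sqrt(s0)
    # ALEE estimate solves sum_t w_t (y_t - x_{t,k} * theta_k) = 0
    theta_alee = np.sum(w * y) / np.sum(w * X[:, k])
    # CI half-width from the normal approximation in equation (19)
    z = stats.norm.ppf(1 - alpha / 2)
    half = z * sigma_hat * np.sqrt(np.sum(w ** 2)) / np.sum(w * X[:, k])
    return theta_alee, (theta_alee - half, theta_alee + half)
```

This is only a sketch under the stated assumptions; the released code at https://github.com/mufangying/ALEE contains the implementations actually used for the experiments.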
More specifically, in the first two settings, we take $\beta = 1$ in Corollary 3.2. For the two-armed bandit example, we set $s_0 = e^2\log(n)$, which is known to be a lower bound for $s_{n,1}$. For the AR(1) model, we consider $s_0 = e^2 n/\log\log(n)$. For the contextual bandit example, we consider $\Sigma_0 = \log(n)\, I_d$. In the simulations, we also compare the ALEE approach to the normality-based confidence interval for the OLS estimator [19] (which may be incorrect), the concentration bounds for the OLS estimator based on self-normalized martingale sequences [1], and W-decorrelation [8]. Detailed implementations of these methods can be found in Appendix C.1.

In Figure 2, we display results for the two-armed bandit example, providing the empirical coverage plots for the first arm mean $\theta^*_1$ as well as the average width of the two-sided CIs. We observe that CIs based on OLS undercover $\theta^*_1$ while the other methods provide satisfactory coverage. Notably, from the average CI width plot, we can see that the W-decorrelation and concentration methods have relatively large CI widths. On the contrary, ALEE-based CIs achieve the target coverage while keeping the width of the CIs relatively small. For the AR(1) model, we display the results in Figure 3.

[Figure 3 panels: empirical versus target coverage probability for the lower and upper tails of $\theta^*$, and the average two-sided CI width for $\theta^*$, comparing OLS, ALEE, Concentration, and W-decorrelation.]

Figure 3: AR(1) with model coefficient $\theta^* = 1$ and $s_0 = e^2 n/\log\log(n)$. Error bars plotted are standard errors.

For the contextual bandit example, we consider d = 20 and summarize the empirical coverage probability and the logarithm of the volume of the confidence regions in Table 1, along with the corresponding standard deviations. See Appendix C.2 for experiments with dimension d = 10 and d = 50.

Table 1: Contextual bandit, d = 20. Cells show the average coverage and average log(Volume) of the confidence regions at each confidence level, with standard deviations in parentheses.

| Method | Avg. Coverage (0.8) | Avg. log(Volume) (0.8) | Avg. Coverage (0.85) | Avg. log(Volume) (0.85) | Avg. Coverage (0.9) | Avg. log(Volume) (0.9) |
| --- | --- | --- | --- | --- | --- | --- |
| ALEE | 0.805 (0.396) | 6.541 (0.528) | 0.861 (0.346) | 7.108 (0.528) | 0.910 (0.286) | 7.806 (0.528) |
| OLS | 0.776 (0.417) | -2.079 (0.525) | 0.830 (0.376) | -1.513 (0.525) | 0.881 (0.324) | -0.815 (0.525) |
| W-Decorrelation | 0.777 (0.416) | 25.727 (0.518) | 0.829 (0.377) | 26.294 (0.518) | 0.870 (0.336) | 26.992 (0.518) |
| Concentration | 1.000 (0.000) | 17.374 (0.506) | 1.000 (0.000) | 17.408 (0.506) | 1.000 (0.000) | 17.455 (0.506) |

5 Discussion

In this paper, we study the parameter estimation problem in an adaptive linear model. We propose to use ALEE (adaptive linear estimating equations) to obtain point and interval estimates. Our main contribution is an estimator that is asymptotically normal without requiring any stability condition on the sample covariance matrix. Unlike concentration-based confidence regions, our proposed confidence regions allow for heavy-tailed noise variables. We demonstrate the utility of our method by comparing it with existing methods. Our work leaves several questions open for future research. For example, it would be interesting to characterize the variance of the ALEE estimator compared to the best possible variance [13, 20] for d > 1.
It would also be interesting to know if such results can be extended to non-linear adaptive models, e.g., to an adaptive generalized linear model [23]. Furthermore, our paper assumes a fixed dimension d for the problem while letting n . It would be interesting to explore whether we can allow the dimension to grow with the number of samples at a specific rate. Acknowledgments This work was partially supported by the National Science Foundation Grants DMS-2311304, CCF1934924, DMS-2052949 and DMS-2210850. [1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011. [2] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Online least squares estimation with selfnormalized processes: An application to bandit problems. ar Xiv preprint ar Xiv:1102.2670, 2011. [3] Deepak Agarwal, Bee-Chung Chen, Pradheep Elango, Nitin Motgi, Seung-Taek Park, Raghu Ramakrishnan, Scott Roy, and Joe Zachariah. Online models for content optimization. Advances in Neural Information Processing Systems, 21, 2008. [4] Aurélien Bibaut, Maria Dimakopoulou, Nathan Kallus, Antoine Chambaz, and Mark van Der Laan. Post-contextual-bandit inference. Advances in neural information processing systems, 34:28548 28559, 2021. [5] Jack Bowden and Lorenzo Trippa. Unbiased estimation for response adaptive clinical trials. Statistical methods in medical research, 26(5):2376 2388, 2017. [6] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. 2008. [7] Yash Deshpande, Adel Javanmard, and Mohammad Mehrabi. Online debiasing for adaptively collected high-dimensional data with applications to time series analysis. Journal of the American Statistical Association, pages 1 14, 2021. [8] Yash Deshpande, Lester Mackey, Vasilis Syrgkanis, and Matt Taddy. Accurate inference for adaptive linear models. In International Conference on Machine Learning, pages 1194 1203. PMLR, 2018. [9] Maria Dimakopoulou, Zhimei Ren, and Zhengyuan Zhou. Online multi-armed bandits with adaptive inference. Advances in Neural Information Processing Systems, 34:1939 1951, 2021. [10] Vitor Hadad, David A Hirshberg, Ruohan Zhan, Stefan Wager, and Susan Athey. Confidence intervals for policy evaluation in adaptive experiments. Proceedings of the National Academy of Sciences, 118(15):e2014602118, 2021. [11] Inge S Helland. Central limit theorems for martingales with discrete or continuous time. Scandinavian Journal of Statistics, pages 79 94, 1982. [12] Ilse CF Ipsen and Carl D Meyer. The angle between complementary subspaces. The American mathematical monthly, 102(10):904 911, 1995. [13] Koulik Khamaru, Yash Deshpande, Lester Mackey, and Martin J Wainwright. Near-optimal inference in adaptive linear regression. ar Xiv preprint ar Xiv:2107.02266, 2021. [14] Tze Leung Lai. Asymptotic properties of nonlinear least squares estimates in stochastic regression models. The Annals of Statistics, pages 1917 1930, 1994. [15] Tze Leung Lai and Herbert Robbins. Adaptive design and stochastic approximation. The annals of Statistics, pages 1196 1221, 1979. [16] Tze Leung Lai and Herbert Robbins. Adaptive design and stochastic approximation. The annals of Statistics, pages 1196 1221, 1979. [17] Tze Leung Lai and Herbert Robbins. Consistency and asymptotic efficiency of slope estimates in stochastic approximation schemes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 56(3):329 360, 1981. 
[18] Tze Leung Lai, Herbert Robbins, et al. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4 22, 1985. [19] Tze Leung Lai and Ching Zong Wei. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154 166, 1982. [20] Tor Lattimore. A lower bound for linear and kernel regression with adaptive covariates. In The Thirty Sixth Annual Conference on Learning Theory, pages 2095 2113. PMLR, 2023. [21] Tor Lattimore and Csaba Szepesvari. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pages 728 737. PMLR, 2017. [22] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661 670, 2010. [23] Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pages 2071 2080. PMLR, 2017. [24] Peng Liao, Kristjan Greenewald, Predrag Klasnja, and Susan Murphy. Personalized heartsteps: A reinforcement learning algorithm for optimizing physical activity. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1):1 22, 2020. [25] Licong Lin, Koulik Khamaru, and Martin J Wainwright. Semi-parametric inference based on adaptively collected data. ar Xiv preprint ar Xiv:2303.02534, 2023. [26] Alexander R Luedtke and Mark J Van Der Laan. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of statistics, 44(2):713, 2016. [27] Seth Neel and Aaron Roth. Mitigating bias in adaptive data gathering via differential privacy. In International Conference on Machine Learning, pages 3720 3729. PMLR, 2018. [28] Xinkun Nie, Xiaoying Tian, Jonathan Taylor, and James Zou. Why adaptively collected data have negative bias and how to correct for it. In International Conference on Artificial Intelligence and Statistics, pages 1261 1269. PMLR, 2018. [29] Jaehyeok Shin, Aaditya Ramdas, and Alessandro Rinaldo. Are sample means in multi-armed bandits positively or negatively biased? Advances in Neural Information Processing Systems, 32, 2019. [30] Jaehyeok Shin, Aaditya Ramdas, and Alessandro Rinaldo. On the bias, risk and consistency of sample means in multi-armed bandits. ar Xiv preprint ar Xiv:1902.00746, 2019. [31] Krishna Kumar Singh, Dhruv Mahajan, Kristen Grauman, Yong Jae Lee, Matt Feiszli, and Deepti Ghadiyaram. Don t judge an object by its context: Learning to overcome contextual bias. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11070 11078, 2020. [32] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. [33] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019. [34] Min Xu, Tao Qin, and Tie-Yan Liu. Estimation bias in multi-armed bandit algorithms for search advertising. Advances in Neural Information Processing Systems, 26, 2013. [35] Elad Yom-Tov, Guy Feraru, Mark Kozdoba, Shie Mannor, Moshe Tennenholtz, and Irit Hochberg. Encouraging physical activity in patients with diabetes: intervention using a reinforcement learning system. Journal of medical Internet research, 19(10):e338, 2017. 
[36] Ruohan Zhan, Vitor Hadad, David A Hirshberg, and Susan Athey. Off-policy evaluation via adaptive weighting with data from contextual bandits. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2125 2135, 2021. [37] Ruohan Zhan, Zhimei Ren, Susan Athey, and Zhengyuan Zhou. Policy learning with adaptively collected data. Management Science, 2023. [38] Kelly Zhang, Lucas Janson, and Susan Murphy. Inference for batched bandits. Advances in neural information processing systems, 33:9818 9829, 2020. Table of Contents A Proof 12 A.1 Proof of Theorem 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 A.2 Proof of Remark 3.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 A.3 Proof of Corollary 3.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 A.4 Proof of Theorem 3.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 B Generalized Theorem 3.6 18 C Simulation 21 C.1 Simulation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 C.2 Tables for contextual bandits . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 C.3 Asymptotic normality with centered Poisson noise variables . . . . . . . . . . . 22 In Theorem 3.1, Corollary 3.3 and Remark 3.4, we deal with an arm with index k [K]. To simplify notations, we drop the subscript k in st,k, wt,k, xt,k, bθALEE,k and θ k throughout the proof, and use st, wt, xt, bθALEE and θ , respectively. A.1 Proof of Theorem 3.1 Condition (17) serves as an important role in proving (18). Therefore, we start our proof by verifying the condition (17). Since function f is a positive decreasing function, we first have max 1 t n f 2( st s0 )x2 t s0 f 2(1) 1 Furthermore, since function f /f is increasing, we have 1 f(st/s0) f(st 1/s0) = max 1 t n f(st 1/s0) f(st/s0) s0 max 1 t n f (st 1/s0) f(st 1/s0) = 1 f(1) , (33) where inequality (i) follows from mean value theorem and the monotonicity of the function f /f. Thus, by assuming 1/s0 = op(1) and s0/sn = op(1), condition (17) follows directly from equation (32) and equation (33). By the construction of ALEE estimator, we have n X (bθALEE θ ) = t=1 wtϵt. (34) t=1 wtxt = s0 t=1 f(st/s0)x2 t s0 = s0 Z sn/s0 t n f(st/s0)x2 t/s0 R sn/s0 1 f(x)dx . (35) By the mean value theorem, we have that for t [n], ξt [st 1, st] Z st/s0 st 1/s0 f(x)dx = f(ξt/s0)x2 t s0 . Therefore, we have t n f(st/s0)x2 t/s0 R sn/s0 1 f(x)dx = 1 + t n( f(st/s0) f(ξt/s0) 1)f(ξt/s0)x2 t/s0 P t n f(ξt/s0)x2 t/s0 | {z } Observe that t n | f(st/s0) f(ξt/s0) 1|f(ξt/s0)x2 t/s0 P t n f(ξt/s0)x2 t/s0 t n | f(st/s0) f(st 1/s0) 1|f(ξt/s0)x2 t/s0 P t n f(ξt/s0)x2 t/s0 1 f(st/s0) f(st 1/s0) (ii) = op(1). Equality (ii) follows from condition (17). Consequently, applying Slutsky s theorem yields Pn i=1 wtxt s0 R sn/s0 1 f(x)dx Similarly, we can derive t=1 f 2(st/s0)x2 t s0 = (1 + op(1)) Z sn/s0 1 f 2(x)dx = 1 + op(1). (36) Knowing max1 t n w2 t = max1 t n f 2(st/s0)x2 t/s0 = op(1), which is a consequence of equation (17), martingale central limit theorem together with an application of Slutsky s theorem yields (bθALEE θ ) Z sn/s0 bσ f(x)dx d N(0, 1). Lastly, we recall that t n w2 t n X Therefore, equation (19) follows from martingale central limit theorem and Slutsky s theorem. Remark A.1. Equation (18) sheds light on the asymptotic variance of the ALEE estimator, thereby aiding in the selection of a suitable function f to improve the efficiency of ALEE estimator. 
On the other hand, equation (19) offers a practical approach to obtaining an asymptotically precise confidence interval. Remark A.2. Condition (17) is a general requirement that governs equation (18), and is not specific to bandit problems. However, the difficulty in verifying (17) can vary depending on the problem at hand. A.2 Proof of Remark 3.4 Corollary 3.3 follows directly from Theorem 1 in [1]. In this section, we provide a proof of Remark 3.4. By considering λ0 = 1 in Corollary 3.3, we have with probability at least 1 δ |bθALEE θ | σg v u u t(1 + t=1 w2 t ) log 1 + Pn t=1 w2 t δ2 By the construction of the weights in Corollary 3.2, we have t=1 f 2( st s0 )x2 t s0 Z 1 f 2(x)dx = 1. (38) Therefore, to complete the proof, it suffices to characterize a lower bound for P 1 t n wtxt. By definition, we have t=1 f(st/s0) x2 t s0 x2 t (st log(e2st/s0))1/2 log log(e2st/s0) 1 (2 + log(sn/s0))1/2 log(2 + log(sn/s0)) (ii) 1 (2 + log(sn/s0))1/2 log(2 + log(sn/s0)) 2( sn s0) r s0 1 + s0 (iii) 1 (2 + log(sn/s0))1/2 log(2 + log(sn/s0)) In equation (i), we plug in the expression of function f and hence s0 cancels out. Since xt is either 0 or 1, inequality (ii) follows from the integration of the function h(x) = 1/ x. Inequality (iii) follows from s0 > 1. Putting things together, we have |bθALEE θ | σg 2 log(2/δ2) P 2 + log(sn/s0) log{2 + log(sn/s0)} sn s0 . This completes our proof of Remark 3.4. A.3 Proof of Corollary 3.5 Note that it suffices to verify the following condition (41) max 1 t n f 2 st y2 t 1 s0 + max 1 t n 1 f(st/s0) f(st 1/s0) sn/s0 f 2(x)dx = op(1) (41) for θ [ 1, 1] in order to complete the proof of Corollary 3.5. The other part of the proof can be adapted from the proof of Theorem 3.1. To simplify notations, we let T1 = max 1 t n f 2 st y2 t 1 s0 , T2 = max 1 t n 1 f(st/s0) f(st 1/s0) , and T3 = Z sn/s0 f 2(x)dx. Therefore, proving equation (41) is equivalent to showing that T1, T2, and T3 converge to zero in probability. We will now demonstrate the convergence of each of these three terms to zero in probability. T1 with θ = 1: To prove T1 = op(1), we make use of a result in [19, Equation 3.23], which is lim inf n n 2(log log n) t=1 y2 t 1 = σ2/4 Observe that T1 = max 1 t n f 2( st s0 )y2 t 1 s0 = max 1 t n y2 t 1 st log(e2st/s0){log log(e2st/s0)}1+β max 1 t n y2 t 1 st log(e2){log log(e2)}1+β = 1 2(log 2)1+β max 1 t n y2 t 1 st 1 2(log 2)1+β max{ max 1 t n2/3 y2 t 1 s0 , max n2/3 +1 t n y2 t 1 s n2/3 s0 } In equation (43), we split the sequence into two parts and set different lower bounds for st. The major benefit of this step is to help us derive a better choice of s0. Now we bound max1 t n2/3 y2 t 1 and max n2/3 +1 t n y2 t 1. Note that P max 1 t n2/3 y2 t 1 ϵ =P max{ max 1 t n2/3 yt 1, max 1 t n2/3 yt 1} ϵ E h max{max1 t n2/3 yt 1, max1 t n2/3 yt 1} i 2n2/3σ2g log(2n2/3) where inequality is derived from [33, Exercise 2.12] and the fact that yi is sub-Gaussian with sub-Gaussian parameter σ2 gn2/3 for i n2/3 . Therefore, we conclude that max 1 t n2/3 y2 t 1 = Op(n2/3 log n). Consequently, we have max 1 t n2/3 y2 t 1 s0 = n2/3 log n n/ log log n Op(1) = op(1). (45) By applying the same trick to max n2/3 +1 t n y2 t 1, we can derive max n2/3 +1 t n y2 t 1 = Op(n log n). Hence we have max n2/3 +1 t n y2 t 1 s n2/3 s0 = Op(n log n) n4/3/ log log n2/3 n4/3/ log log n2/3 (ii) = op(1) Op(1) = op(1). (46) Equality (ii) makes use of equation (42). Combining equation (44) with equations (45) and (46), we conclude that T1 = op(1). 
T1 with θ = 1: When θ = 1, the proof is essentially the same as the case when θ = 1. The only difference lies in the order of P 1 i n y2 i 1. However, by pairing ϵ2t 1 with ϵ2t for t N+, we can arrive at the same result. Specifically, for t N+, we let ϵ t = ϵ2t ϵ2t 1 and define where y 0 = 0 and {ϵ t}t 1 are random variables with mean zero, variance 2σ2 and sub-Gaussian parameter 2σ2 g. Therefore, applying equation (44) yields lim inf n n 2(log log n) t=1 (y t 1)2 = σ2. (47) Setting n0 = ( n2/3 1)/2 , we have s n2/3 s0 = t=1 (y t)2 = t=1 (y t 1)2. (48) According to equation (47) and equation (48), we have max n2/3 +1 t n y2 t 1 s n2/3 s0 max n2/3 +1 t n y2 t 1 P 1 t n0+1(y t 1)2 = max n2/3 +1 t n y2 t 1 (n0 + 1)2/ log log(n0 + 1) (n0 + 1)2/ log log(n0 + 1) P 1 t n0+1(y t 1)2 = op(1) Op(1) = op(1), which completes the proof of T1 = op(1) for the case when θ = 1. T1 with θ ( 1, 1): Given θ ( 1, 1), we observe that yt is a sub-Gaussian random variable with sub-Gaussian parameter σ2 g 1 (θ )2 for any t N+. Therefore, following equation (43), we have T1 1 2(log 2)1+β max 1 t n y2 t 1 s0 (50) where in the above inequality we use s0 as a lower bound for st. By applying [33, Exercise 2.12], we have max 1 t n y2 t 1 = Op(log n), (51) leading to the conclusion that T1 = op(1). T2 with θ [ 1, 1]: Similar to equation (33), we have T2 = max 1 t n 1 f(st/s0) f(st 1/s0) = max 1 t n f(st 1/s0) f(st/s0) max 1 t n f (st 1/s0) f(st 1/s0) y2 t 1 s0 . Define g(x) = f (x)/f(x) and we can compute that Z g(x)dx = Z f (x) f(x) dx = Z 1 f df = log f + C, where C is some constant. Doing some calculation yields dx log f = d 2(log(x) + log log(e2x)) + (1 + β) log log log(e2x)) 1 + 1 log(e2x) + 1 + β log(e2x) 1 log log(e2x) Therefore, we have T2 max 1 t n f (st 1/s0) f(st 1/s0) y2 t 1 s0 = max 1 t n g(st 1/s0)y2 t 1 s0 max 1 t n y2 t 1 st 1 . We note that demonstrating max1 t n y2 t 1/st 1 = op(1) follows the same approach as the proof of max1 t n y2 t 1/st = op(1). Hence, we omit it. To conclude, we show that T2 = op(1) for θ [ 1, 1]. T3 with θ [ 1, 1]: To prove T3 = op(1), it suffices to verify that 1 t n y2 t = op(1). (52) For convenience, in equation (52) we use yt instead of yt 1. Note that when θ = 1 or θ = 1, we have provided almost sure lower bounds for P 1 t n y2 t in the proof of T1 = op(1). Therefore, equation (52) follows from these lower bounds. To prove equation (52) when θ ( 1, 1), we begin by rewriting P 1 t n y2 t in quadratic form. Without confusion and loss of generality, we replace θ by θ, consider Var(ϵt) = 1, and set εn = (ϵ1, ϵ2, . . . , ϵn) . For t [n], we have k=1 θt kϵk = a t εn, where at Rn and at,j = θt j for j t and at,j = 0 for j > t. Therefore, P 1 t n y2 t can be written as X 1 t n y2 t = X 1 t n ε n ata t εn = ε n Aεn, (53) where A = P 1 t n ata t . Applying Hanson-Wright inequality (e.g. see [32]), we have P |ε n Aεn Eε n Aεn| > t 2 exp c min t2 K4|||A|||2 F , t K2|||A|||F where c and K are some universal constants. Observe that Eε n Aεn = trace(A) = trace( X 1 t n ata t ) = trace( X 1 t n a t at) 1 t n (1 + θ2 + + θ2(t 1)) = n 1 θ2 θ2(1 θ2n) Furthermore, we have |||A|||2 F = trace(A A) = trace( X 1 i n aia i X 1 j n aja j ) 1 j n (a i aj)2 1 i n ai 4 2 + 2 X 1 i 0, consider the following probability lim sup n P s0 ε n Aεn > ϵ lim sup n P s0 ε n Aεn > ϵ, ε n Aεn Eε n Aεn 1 c K2|||A|||F log(2 + lim sup n P ε n Aεn < Eε n Aεn 1 c K2|||A|||F log(2 lim sup n P s0 Eε n Aεn 1 c K2|||A|||F log( 2 δ ) > ϵ + δ. 
By fixing δ and comparing the order of s0 with the order of ε n Aεn 1 c K2|||A|||F log( 2 δ ), we have lim sup n P s0 Eε n Aεn 1 c K2|||A|||F log( 2 δ ) > ϵ = 0. Since δ can be arbitrarily small, we conclude that s0 ε n Aεn = op(1), (59) which completes the proof of T3 = op(1). A.4 Proof of Theorem 3.6 Note that for any t 1, we have Vt op 1 and Vt = Vt 1 Vt 1ztz t Vt 1/(1 + z t Vt 1zt). (60) The second part of equation (60) follows from the Sherman Morrison formula. Let ut = Vtzt and we adopt the notation V0 = Id. By multiplying zt on the right hand side of Vt, we have Vtzt = Vt 1zt Vt 1ztz t Vt 1zt/(1 + z t Vt 1zt) 1 z t Vt 1zt 1 + z t Vt 1zt = Vt 1zt 1 + z t Vt 1zt . (61) Therefore, following the definition of ut, we have (1 + z t Vt 1zt)ut = Vt 1zt. Consequently, t=1 (1 + z t Vt 1zt)utu t = t=1 Vt 1(V 1 t V 1 t 1)Vt = Id Vn. (62) By recognizing wt = p 1 + z t Vt 1zt ut, we come to t=1 wtw t = t=1 Vt 1(V 1 t V 1 t 1)Vt = Id Vn. What remains now is to verify conditions in (6). Notably, assumption Vn op = op(1) implies t=1 wtw t p Id. (63) Since Σ 1 0 op = op(1), Vt op 1 and xt 2 1, we can show max 1 t n z t Vtzt = max 1 t n x t Σ 1 2 t 1xt = op(1). (64) Besides, equation (61) together with equation (64) implies max 1 t n z t Vt 1zt = max 1 t n z t Vtzt 1 z t Vtzt = op(1). (65) Thus, it follows that max 1 t n wt 2 = max 1 t n 1 + z t Vt 1zt Vtzt 1 + z t Vt 1zt V s 1 + max 1 t n z t Vt 1zt max 1 t n z t Vtzt = op(1). Combining equations (66) and (63) yields (6). Hence we complete the proof by applying Proposition 2.1. Remark A.3. The detailed proof of Lemma 3.7 can be found in the proof of Theorem 3.6. B Generalized Theorem 3.6 In Theorem 3.6, we impose the following condition (67) so that the ALEE estimator with weights specified in equation (28) achieves asymptotic normality: Vn op = op(1). (67) However, it is typically difficult to directly verify the above condition in practice. To tackle this problem, in this section, we provide a modified version of ALEE estimator which achieves asymptotic normality without requiring condition (67). In this section, we use the same notations Σt, zt, Vt, and wt as defined in equations (26), (27) and (28), respectively. Furthermore, we let λ1 . . . λn be the eigenvalues of the matrix V 1 n and a1, . . . , an be the corresponding eigenvectors. At a high level, we construct additional mn vectors {zt}n+1 t n+mn so that the minimum eigenvalue of the resulting matrix V 1 n+mn is greater than a pre-specified constant κn, which satisfies limn κn = . It is easy to see that by construction (see Algorithm 1), the matrix Vn+mn satisfies p 0 where mn = k=1 nk. (69) Remark B.1. Parameter κn is set to ensure condition (69) holds. In practice, we set κn = d log(n). Remark B.2. It s worth mentioning that the number of extra {zt}t>n is a random variable. Therefore, in order to prove a similar asymptotic normality theorem to Theorem 3.6, we have to apply martingale central limit theorem with stopping times [11, Theorem 2.1]. Theorem B.3 (Theorem 2.1 in [11]). Let {ξn,k}k 1,n 1 be an array of random variables defined on a probability space (Ω, F, P) and let {Fn,k}n 1,k 0 be an array of σ-fields such that ξn,k is Fn,k-measurable and Fn,k 1 Fn,k F for each n and k 1. For each n, let kn be a stopping time with respect to {Fn,k}k 0. 
Suppose that Algorithm 1: Modified ALEE estimate 1: Input:{(xt, yt)}n t=1 and tuning parameter κn 2: Compute {(zt, Vt, wt)}n t=1, {(λk, ak)}d k=1, and obtain a consistent estimate bσ2 of σ2 3: Initiate t = n and set τn = 1/ Σ 1/2 0 op 4: for k = 1, . . . , d do 5: Compute nk = max{κn λk, 0} τn 6: if nk > 0 then 7: for i = 1, . . . , nk do 8: Set t = t + 1 9: Simulate ϵt N(0, bσ2) 10: Define zt = ak/τn 11: Compute Vt = Vt 1 Vt 1ztz t Vt 1 1 + z t Vt 1zt and wt = q 1 + z t Vt 1zt Vtzt 12: end for 13: end if 14: end for 15: Obtain bθALEE from equation i=1 wi(yi x i bθALEE) + i=n+1 wiϵi = 0 (68) 16: Output: bθALEE k=1 E [ξn,k | Fn,k 1] p 0, (70a) k=1 Var [ξn,k | Fn,k 1] p 1, (70b) k=1 E h |ξn,k|2+δ | Fn,k 1 i p 0 for some δ > 0, (70c) then Pkn k=1 ξn,k d N(0, 1). With this setup, we are now ready to prove the asymptotic normality of bθALEE from (68). Theorem B.4 (Generalized Theorem 3.6). Suppose condition (3) holds. Then, for any tuning parameters Σ0 and κn that satisfy Σ 1 0 op = op(1) and limn κn = , the ALEE estimator bθALEE obtained from equation (68) satisfies n X d N 0, Id , where bσ is a consistent estimator of σ. Remark B.5. We would like to reiterate that the asymptotic variance of of the modified ALEE estimator obtained from (68) is the same as the one mentioned in Theorem 3.6. Additionally, this modified version does not require the condition Vn op = op(1) hold and hence is more applicable in practice with theoretical guarantee. Proof. Rewriting equation (68), we have t=1 wtx t (bθALEE θ ) = t=1 wtϵt. (71) Therefore, by Cramér Wold theorem, it suffices to show that for any unit vector v, t=1 v wtϵt d N(0, σ2). (72) The proof now follows by verifying the conditions (70a)-(70c) of Theorem B.3 with ξn,k = v wkϵk. We begin by verifying conditions (70a)-(70c). By Lemma 3.7, we have t=1 wtw t = Id Vn+mn. (73) Note that n+mn X t=1 Var[wtϵt | Ft 1] = t=1 σ2wtw t + σ2 bσ2 σ2 1 n+mn X t=n+1 wtw t . (74) By equation (73) and the fact that bσ2 is consistent, we have t=n+1 wtw t Id and bσ2 σ2 1 p 0. (75) Combining equations (69), (73), (74) and (75), we conclude t=1 Var[wtϵt | Ft 1] p σ2Id. (76) On the other hand, we have max 1 t n+mn wt 2 (i) max 1 t n+mn 1 + z t Vt 1zt Vt op zt 2 (ii) max 1 t n+mn 2 Σ 1/2 0 op. Inequality (i) follows from the definition of wt. In inequality (ii), we use the assumption that Σ0 Id and the fact that zt 2 1 and Vt op 1. The last inequality (iii) follows from the definition of zt and the condition that Σ 1 0 op = op(1). Hence, we can see that max 1 t n+mn wt 2 p 0. (77) Therefore, we have max 1 t n+mn |v wt| p 0 and t=1 Var[v wtϵt | Ft 1] p σ2. (78) Note that condition (70a) holds because {v wkϵk}k 1 is a martingale difference sequence by construction. Condition (70b) follows from statement (78). It remains to verify condition (70c). Observe that t=1 E[|v wtϵt|2+δ | Ft 1] = t=1 |v wt|2+δE[|ϵt|2+δ | Ft 1] max 1 t n+mn |v wt|δ sup t 1 E[|ϵt|2+δ | Ft 1] max{ 1 t=1 Var[v wtϵt | Ft 1] (iv) = op(1) Op(1) Op(1) = op(1). Equation (iv) follows from condition (3), equation (78) and the fact that bσ2 is a consistent estimator. Lastly, by applying Slutsky s theorem, we prove that t=1 wtx t (bθALEE θ ) d N(0, Id). (79) C Simulation In this section, we provide additional comparisons among the ALEE method, the OLS, the W-decorrelation [8], and the concentration inequality based bounds [1]. The code can be found at https://github.com/ mufangying/ALEE. C.1 Simulation details Throughout our experiments, we utilize bσ2 from equation (9) as an (consistent) estimate of of σ2 [19]. 
OLS: When data are i.i.d, the least squares estimator satisfies the following condition 1 σ2 (bθLS θ ) Sn(bθLS θ ) d χ2 d. Therefore, we consider 1 α confidence region to be CLS = θ Rd : 1 bσ2 (bθLS θ) Sn(bθLS θ) χ2 d,1 α We point out that the above confidence region is not guaranteed to be accurate when the data is collected in an adaptive manner, as will also be highlighted in our experiments. W-decorrelation: The W-decorrelation method is borrowed from Algorithm 1 in [8]. Specifically, the estimator takes the form bθW = bθLS + t=1 wt(yt x t bθLS). (81) Given a parameter λ, weights {wt}1 t n are set as follows xt/(λ + xt 2 2). (82) Following the recommendations from the paper [8], in order to set λ appropriately, we first run the bandit algorithm or time series with N replications and record the corresponding minimum eigenvalues {λmin(S(1) n ), . . . , λmin(S(N) n )}. We choose λ to be the 0.1-quantile of {λmin(S(1) n ), . . . , λmin(S(N) n )}. Finally, we obtain a 1 α confidence region for θ as CW = θ Rd : 1 bσ2 (bθW θ) W W(bθW θ) χ2 d,1 α where W = (w1, . . . , wn) . Concentration based on self-normalized martingales: We consider [1, Theorem 1] for a single coordinate in two-armed bandit problem and AR(1) model. For contextual bandits, we apply [1, Theorem 2]. Applying concentration bounds requires a sub-Gaussian parameter, for which we use bσ from equation (9) as an estimate. We point out that this estimate of the sub-Gaussian parameter is conservative, as the sub-Gaussian parameter of a sub-Gaussian random variable is always lower bounded by its variance [33, Chapter 2]. This variance estimate is accurate for Gaussian noise random variables. For one dimensional examples, we have that for any λ > 0, with probability at least 1 α: |bθLS θ | bσ p λ + Pn t=1 x2 t Pn t=1 x2 t log λ + Pn t=1 x2 t λα2 In two-armed bandit problem, xt is simply xt,1 for θ 1 or xt,2 for θ 2. Here we consider λ = 1. For the contextual bandit examples, we apply Theorem 2 from [1], and set S = d; we set a small value of λ = 0.01 to mimic the performance of an OLS estimators. Specifically, we utilize the following 1 α confidence region Ccon = θ Rd : (bθr θ) (λId + Sn)(bθr θ) bσ log det(λId + Sn) + λ 1 2 S 2 , (85) where bθr = (X n Xn + λId) 1X n Yn and Yn = (y1, . . . , yn) . C.2 Tables for contextual bandits In all the contextual bandit simulations, we consider noises that are generated from a centered Poisson distribution (i.e. Poisson(1) 1). We would like to highlight that the centered Poisson random variable is not sub-Gaussian. Therefore, it is important to note that concentration inequality-based bounds [1] may not be guaranteed to work. In the simulations of this section, we set the number of samples as n = 1000, and the tables below show results over 1000 replications. The tables below clearly show that the average log-volume of the confidence regions are smallest for ALEE among methods which yield valid confidence regions (empirical coverage is more than the target coverage). The volume of the confidence region obtained from the OLS estimate is the smallest, but they under-cover the true parameter. The confidence regions for ALEE are obtained from Theorem B.4 with Σ0 = log(n) Id and κn = d log(n). Table 2: Contextual bandit: d = 10 Method Level of confidence 0.8 0.85 0.9 Avg. Coverage Avg. log(Volumn) Avg. Coverage Avg. log(Volumn) Avg. Coverage Avg. 
log(Volumn) ALEE 0.819 ( 0.385) -2.761 ( 0.263) 0.872 ( 0.334) -2.370 ( 0.263) 0.920 ( 0.271) -1.894 ( 0.263) OLS 0.807 ( 0.395) -7.306 ( 0.262) 0.863 ( 0.344) -6.915 ( 0.262) 0.905 ( 0.293) -6.439 ( 0.262) W-Decorrelation 0.785 ( 0.411) 8.382 ( 0.252) 0.827 ( 0.378) 8.773 ( 0.252) 0.868 ( 0.338) 9.249 ( 0.252) Concentration 1.000 ( 0.000) 2.517 ( 0.252) 1.000 ( 0.000) 2.548 ( 0.252) 1.000 ( 0.000) 2.591 ( 0.252) Table 3: Contextual bandit: d = 50 Method Level of confidence 0.8 0.85 0.9 Avg. Coverage Avg. log(Volumn) Avg. Coverage Avg. log(Volumn) Avg. Coverage Avg. log(Volumn) ALEE 0.744 ( 0.436) 72.759 ( 1.403) 0.809 ( 0.393) 73.680 ( 1.403) 0.875 ( 0.331) 74.822 ( 1.403) OLS 0.730 ( 0.444) 44.640 ( 1.370) 0.791 ( 0.407) 45.560 ( 1.370) 0.847 ( 0.360) 46.703 ( 1.370) W-Decorrelation 0.192 ( 0.394) 97.559 ( 1.337) 0.225 ( 0.418) 98.479 ( 1.337) 0.276 ( 0.447) 99.622 ( 1.337) Concentration 1.000 ( 0.000) 90.964 ( 1.312) 1.000 ( 0.000) 91.004 ( 1.312) 1.000 ( 0.000) 91.060 ( 1.312) C.3 Asymptotic normality with centered Poisson noise variables 4 3 2 1 0 1 2 3 4 Distribution of errors in AR(1) (0, 1) OLS ALEE 4 3 2 1 0 1 2 3 4 Distribution of errors with -greedy (0, 1) OLS ALEE Figure 4: Same setting as Figure 1 but with noise variables {ϵt} distributed as centered Poisson(1). We set n = 3000 and the number of replications is set to 1000. The simulations show that the asymptotic distribution of ALEE is in good accordance with the asymptotic normality proved in Corollary 3.5 and Theorem 3.1.