# Experimental Designs for Heteroskedastic Variance

Justin Weltz (Department of Statistical Science, Duke University, justin.weltz@duke.edu; work conducted at Amazon), Tanner Fiez (Amazon.com, USA, fieztann@amazon.com), Eric Laber (Department of Statistical Science, Duke University, eric.laber@duke.edu), Alexander Volfovsky (Department of Statistical Science, Duke University, alexander.volfovsky@duke.edu), Blake Mason (Amazon.com, USA, bjmason@amazon.com), Houssam Nassif (Meta, USA, houssamn@meta.com), Lalit Jain (Michael G. Foster School of Business, University of Washington, lalitj@uw.edu; work conducted at Amazon)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Most linear experimental design problems assume homogeneous variance, even though heteroskedastic noise is present in many realistic settings. Let a learner have access to a finite set of measurement vectors $\mathcal{X} \subset \mathbb{R}^d$ that can be probed to receive noisy linear responses of the form $y = x^\top \theta^* + \eta$. Here $\theta^* \in \mathbb{R}^d$ is an unknown parameter vector, and $\eta$ is independent mean-zero $\sigma_x^2$-strictly-sub-Gaussian noise defined by a flexible heteroskedastic variance model, $\sigma_x^2 = x^\top \Sigma^* x$. Assuming that $\Sigma^* \in \mathbb{R}^{d \times d}$ is an unknown matrix, we propose, analyze, and empirically evaluate a novel design for uniformly bounding the estimation error of the variance parameters, $\sigma_x^2$. We demonstrate the benefits of this method on two adaptive experimental design problems under heteroskedastic noise, fixed confidence transductive best-arm identification and level-set identification, proving the first instance-dependent lower bounds in these settings. Lastly, we construct near-optimal algorithms and empirically demonstrate the large improvements in sample complexity gained from accounting for heteroskedastic variance in these designs.

1 Introduction

Accurate data analysis pipelines require decomposing observed data into signal and noise components. To facilitate this, the noise is commonly assumed to be additive and homogeneous. However, in applied fields this assumption is often untenable, necessitating the development of novel design and analysis tools. Specific examples of deviations from homogeneity include the variability in sales figures of different products in e-commerce [24, 31], the volatility of stocks across sectors [5], spatial data [58, 43], and genomics [48, 10]. This heteroskedasticity is frequently structured as a function of observable features. In this work, we develop two near-optimal adaptive experimental designs that take advantage of such structure.

There are many approaches, primarily developed in the statistics and econometrics literature, for efficient mean estimation in the presence of heteroskedastic noise (see [56, 16] for a survey). These methods concentrate on settings where data have already been collected and do not consider sampling to better understand the underlying heteroskedasticity. Some early work exists on experimental design in the presence of heteroskedastic noise; however, these methods are non-adaptive (see Section 2 for details). Importantly, using naive data collection procedures designed for homoskedastic environments may lead to sub-optimal inference downstream (see Section 1.2 for examples). A scientist may fail to take samples that allow them to observe the effect of interest with a fixed sampling budget, thus wasting time and money. This work demonstrates how to efficiently and adaptively collect data in the presence of heteroskedastic noise.
It bridges the gap between heteroskedasticity-robust (but potentially inefficient) passive data collection techniques and powerful (but at times brittle) active algorithms that largely ignore the presence of differing variances.

1.1 Problem Setting

To ground our discussion of heteroskedasticity in active data collection, we focus on adaptive experimentation with covariates, also known as pure-exploration linear bandits [7, 45]. Let a learner be given a finite set of measurement vectors $\mathcal{X} \subset \mathbb{R}^d$ with $\mathrm{span}(\mathcal{X}) = \mathbb{R}^d$, which can be probed to receive linear responses with additive Gaussian noise of the form

$$y = x^\top \theta^* + \eta \quad \text{with} \quad \eta \sim \mathcal{N}(0, \sigma_x^2), \qquad \sigma_x^2 := x^\top \Sigma^* x. \tag{1}$$

Here, $\theta^* \in \mathbb{R}^d$ is an unknown parameter vector, $\eta$ is independent mean-zero Gaussian noise with variance $\sigma_x^2 = x^\top \Sigma^* x$, and $\Sigma^* \in \mathbb{R}^{d \times d}$ is an unknown positive semi-definite matrix with the assumption that $\sigma_{\min}^2 \leq \sigma_x^2 \leq \sigma_{\max}^2$ for all $x \in \mathcal{X}$, where $\sigma_{\max}^2, \sigma_{\min}^2 > 0$ are known. We focus on Gaussian errors for ease of exposition but extend to strictly sub-Gaussian noise in the Appendix. The linear model in Eq. (1) with unknown noise parameter $\Sigma^*$ directly generalizes the standard multi-armed bandit to the case where the arms have different variances [4]. Additionally, this heteroskedastic variance structure arises naturally in linear mixed effects models [20], which are used in a variety of experimental settings to account for multi-level heterogeneity (patient, spatial, and block design variation [41]). Throughout, $\mathcal{Z} \subset \mathbb{R}^d$ will be a set (that need not be equal to $\mathcal{X}$) over which our adaptive designs will be evaluated. While the theory holds for general $\sigma_{\min}^2 \leq \sigma_x^2 \leq \sigma_{\max}^2$, we consider the case where $\sigma_{\max}^2 = \max_{x \in \mathcal{X}} \sigma_x^2$ and $\sigma_{\min}^2 = \min_{x \in \mathcal{X}} \sigma_x^2$, so that $\kappa := \sigma_{\max}^2/\sigma_{\min}^2$ summarizes the level of heteroskedasticity.

1.2 Heteroskedasticity Changes Optimal Adaptive Experimentation

To underscore the importance of adapting to heteroskedasticity, we consider the performance of RAGE [12], a best-arm identification algorithm designed for homoskedastic settings. This algorithm does not account for differing variances, so it must upper bound $\sigma_x^2 \leq \sigma_{\max}^2$ for each $x$ to identify the best arm with probability $1 - \delta$ for $\delta \in (0, 1)$. In Fig. 1, we compare this approach to the optimal sampling allocation accounting for heteroskedasticity in a setting that is a twist on a standard benchmark [46] for best-arm identification algorithms. Let $\mathcal{X} = \mathcal{Z} \subset \mathbb{R}^2$ with $|\mathcal{X}| = 3$. We take $x_1 = e_1$ and $x_2 = e_2$ to be the first and second standard basis vectors and set $x_3 = (\cos(0.5), \sin(0.5))$. Furthermore, we take $\theta^* = e_1$, so that $x_1$ has the highest reward and $x_3$ has the second highest. In the homoskedastic case, an optimal algorithm will focus on sampling $x_2$ because it is highly informative for distinguishing between $x_1$ and $x_3$, as shown in Fig. 1b. However, if the errors are heteroskedastic, such a sampling strategy may not be optimal. If $\sigma_2 \gg \sigma_1, \sigma_3$, the optimal allocation for heteroskedastic noise places nearly all samples on $x_1$ and $x_3$ (as shown in Fig. 1a), which are more informative than $x_2$ because they have less noise. Hence, ignoring heteroskedasticity and upper bounding the variances leads to a suboptimal sampling allocation and inefficient performance. This effect worsens as the amount of heteroskedasticity increases, leading to an additional multiplicative factor of $\kappa$ (cf. [12], Thm. 1), as demonstrated in Fig. 1c. In this important example, the sample complexity of an algorithm that ignores heteroskedasticity suffers a linear dependence on $\kappa$, while a method that adapts to the heteroskedasticity in the data does not.
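To make the observation model concrete, the following minimal Python sketch simulates Eq. (1) on the three-arm instance above. The particular value of $\Sigma^*$ is an illustrative assumption chosen so that $\sigma_2 \gg \sigma_1, \sigma_3$, mirroring the regime of Fig. 1; the function and variable names are our own.

```python
import numpy as np

# Minimal simulation of the model y = x^T theta* + eta, eta ~ N(0, x^T Sigma* x)
# on the three-arm benchmark of Section 1.2 (Sigma* below is illustrative).
rng = np.random.default_rng(0)

X = np.array([
    [1.0, 0.0],                    # x1 = e1 (highest reward)
    [0.0, 1.0],                    # x2 = e2
    [np.cos(0.5), np.sin(0.5)],    # x3 (second-highest reward)
])
theta_star = np.array([1.0, 0.0])  # theta* = e1
Sigma_star = np.diag([0.05, 5.0])  # makes sigma_2 much larger than sigma_1, sigma_3

def pull(x, n=1):
    """Draw n noisy responses from arm x under the heteroskedastic model."""
    var = x @ Sigma_star @ x
    return x @ theta_star + rng.normal(0.0, np.sqrt(var), size=n)

for i, x in enumerate(X, start=1):
    print(f"x{i}: mean reward = {x @ theta_star:.3f}, "
          f"noise variance = {x @ Sigma_star @ x:.3f}, sample pull = {pull(x)[0]:.3f}")
```

Under this model, $x_2$ carries almost no usable signal per sample, which is exactly why the heteroskedastic oracle of Fig. 1a shifts its budget to $x_1$ and $x_3$.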
1.3 Paper Contributions

1. Finite Sample Noise Estimation Guarantees. We use an experimental design technique to construct estimates of the heteroskedastic noise variances with bounded maximum absolute error. This adaptive experimental design is a two-phase sample splitting procedure based on a pair of distinct G-optimal experimental designs [28], with bounds that scale favorably with the underlying problem dimension.

2. Near-Optimal Experimental Design Algorithms. We propose adaptive experimental design algorithms for transductive linear best-arm and level-set identification with heteroskedastic noise that combine our variance estimation method with optimal experimental design techniques. We prove sample complexity upper bounds for unknown $\Sigma^*$ that nearly match instance-dependent lower bounds for known $\Sigma^*$. Importantly, we show that in contrast to naive methods, which have a multiplicative dependence on $\kappa$, we only suffer an additive dependence. Lastly, we show that the theoretical improvements translate to stronger empirical performance.

Figure 1 (panels: (a) Heteroskedastic design; (b) Homoskedastic design; (c) Sample complexity as $\kappa$ increases): We compare the sampling strategies of algorithms that do and do not account for heteroskedastic variance in Figs. 1a and 1b, respectively, when $\kappa = 20$. Figure 1c shows that the gap in the optimal sample complexity between the two algorithms becomes linear as $\kappa$ increases.

2 Related Work

Experimental Design and Active Learning for Heteroskedastic Regression. Many experimental design methods for heteroskedastic regression assume known weights (or efficiency functions) [55, 54]. Procedures that estimate the heteroskedastic variance focus on optimizing the design for the weighted least squares estimator given a burn-in period used to estimate the noise parameters [37, 53, 52]. These contributions assume that the variance is unstructured, which leads to sample complexities that scale with the number of arms. Similarly, past active learning work on linear bandits with heteroskedastic noise either assumes the variances are known [30, 62] or that the variances are unstructured [14, 8, 4]. The works that assume structured heteroskedastic noise in a linear bandit framework focus on active maximum likelihood estimation and G-optimal design [7, 45], but do not exploit the structure of the heteroskedastic noise to improve performance, leading to complexities that scale poorly in the problem dimension. Lastly, Mukherjee et al. [38] study policy value estimation under the reward model described in Eq. (1); however, their variance estimator is different from ours.

Best-Arm and Level Set Identification in Linear and Logistic Bandits. Best-arm identification is a fundamental problem in the bandit literature [13, 46, 18]. Soare [44] constructed the first passive and adaptive algorithms for pure exploration in linear bandits using G-optimal and XY-allocation experimental designs. Fiez et al. [12] built on these results, developing the first pure exploration algorithm (RAGE) for the transductive linear bandit setting with nearly matching upper and lower bounds.
Recent work specializes these methods to different settings [47, 57, 25, 26], improves time complexity [60, 50, 33], and targets different (minimax, asymptotically optimal, etc.) lower bound guarantees [59, 22, 9]. In the level set literature, Mason et al. [35] provide the first instance-optimal algorithm with matching upper and lower bounds. Critically, none of these contributions account for heteroskedastic variance. Finally, we note the connection between this work and the logistic bandits literature [23, 11, 1, 36]. In that setting, the observations are Bernoulli random variables with success probability $P(y = 1 \mid x) = (1 + \exp(-x^\top \theta^*))^{-1}$ for $x \in \mathbb{R}^d$, $\theta^* \in \mathbb{R}^d$, and $y \in \{0, 1\}$. Because of the mean-variance relationship for Bernoulli random variables, points $x$ for which $P(y = 1 \mid x)$ is near 0 or 1 have lower variance than other $x$'s. While the tools are different in the two cases, we highlight two common ideas. First, Mason et al. [36] present a method that explicitly handles the heteroskedasticity in logistic bandits. Second, a core focus of optimal logistic bandit papers is reducing the dependence on the worst-case variance to only an additive penalty, similar to the guarantee in $\kappa$ herein, which has been shown to be unavoidable in general [23].

3 Heteroskedastic Variance Estimation

In this section, we describe an algorithm for heteroskedastic noise variance estimation with error bounds that scale favorably in the problem dimension. We then evaluate this method empirically. Given a sampling budget $\Gamma$, our goal is to choose $\{x_t\}_{t=1}^{\Gamma} \subset \mathcal{X}$ to control the tail of the maximum absolute error (MAE), $\max_{x \in \mathcal{X}} |\hat{\sigma}_x^2 - \sigma_x^2|$, by exploiting the heteroskedastic variance model. Alg. 1, HEAD (Heteroskedasticity Estimation by Adaptive Designs), provides an experimental design for finite-time bounds on structured heteroskedastic variance estimation.

Notation. Let $M := d(d+1)/2$. Define $\mathcal{W} = \{\phi_x : x \in \mathcal{X}\} \subset \mathbb{R}^M$, where $\phi_x^\top = \mathrm{vec}(xx^\top)^\top G$ and $G \in \mathbb{R}^{d^2 \times M}$ is the duplication matrix such that $G\,\mathrm{vech}(xx^\top) = \mathrm{vec}(xx^\top)$, with $\mathrm{vech}(\cdot)$ denoting the half-vectorization. For a set $\mathcal{V}$, let $\mathcal{P}_{\mathcal{V}} := \{\lambda \in \mathbb{R}^{|\mathcal{V}|} : \sum_{v \in \mathcal{V}} \lambda_v = 1,\ \lambda_v \geq 0\ \forall v\}$ denote the set of distributions over $\mathcal{V}$, and let $\mathcal{Y}(\mathcal{V}) := \{v - v' : v, v' \in \mathcal{V},\ v \neq v'\}$ denote the set of differences between members of $\mathcal{V}$.

General Approach. Given observations $y_t = x_t^\top \theta^* + \eta_t$ for $t \in \{1, 2, \ldots, \Gamma\}$ and the true value of $\theta^*$, we could estimate $\sigma_x^2$ for each $x \in \mathcal{X}$ from the squared residuals $\eta_t^2 = (y_t - x_t^\top \theta^*)^2$ (e.g., using the method of moments or another estimating equation approach [15]). However, since $\theta^*$ is not known, we cannot directly observe $\eta_t^2$, making estimation of $\sigma_x^2$ more difficult. Consider a general procedure that first uses $\Gamma/2$ samples from the budget to construct a least-squares estimator $\hat{\theta}_{\Gamma/2}$ of $\theta^*$, and then uses the final $\Gamma/2$ samples to collect error estimates of the form $\hat{\eta}_t^2 = (y_t - x_t^\top \hat{\theta}_{\Gamma/2})^2$. Define $\Phi_\Gamma \in \mathbb{R}^{\Gamma/2 \times M}$ as a matrix with rows $\{\phi_{x_t}^\top\}_{t=\Gamma/2+1}^{\Gamma}$ and $\hat{\eta}^2 \in \mathbb{R}^{\Gamma/2}$ as a vector with entries $\{\hat{\eta}_t^2\}_{t=\Gamma/2+1}^{\Gamma}$. With $\hat{\eta}^2$, we can estimate $\Sigma^*$ by least squares as $\mathrm{vech}(\hat{\Sigma}_\Gamma) = \arg\min_{s \in \mathbb{R}^M} \|\Phi_\Gamma s - \hat{\eta}^2\|_2^2$, where we minimize over the space of symmetric matrices. Splitting samples ensures that the errors in estimating $\theta^*$ and $\Sigma^*$ are independent. Note that $(\hat{\Sigma}_\Gamma, \hat{\theta}_{\Gamma/2})$ is an M-estimator of $(\Sigma^*, \theta^*)$ derived from unbiased estimating functions, a common approach in heteroskedastic regression [21, 51, 34]. Finally, by plugging in the estimate of $\Sigma^*$, one can obtain an estimate of the variance, $\hat{\sigma}_x^2 = \min\{\max\{x^\top \hat{\Sigma}_\Gamma x, \sigma_{\min}^2\}, \sigma_{\max}^2\}$.
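To make this general approach concrete, the sketch below implements the feature map $\phi_x$ (so that $\phi_x^\top \mathrm{vech}(\Sigma) = x^\top \Sigma x$) and the two-stage residual regression just described. It is a minimal illustration under uniform sampling in both stages; the explicit construction of $\phi$ without the duplication matrix, the function names, and the clipping interface are our own choices, and HEAD (Alg. 1) replaces the uniform allocations with the two G-optimal designs described next.

```python
import numpy as np
from itertools import combinations_with_replacement

def phi(x):
    """Feature map with phi(x) @ vech(Sigma) = x^T Sigma x.
    Off-diagonal terms are doubled because each pair (i, j), i < j,
    appears once in the half-vectorization."""
    d = len(x)
    return np.array([x[i] * x[j] * (1.0 if i == j else 2.0)
                     for i, j in combinations_with_replacement(range(d), 2)])

def estimate_variances(X_first, y_first, X_second, y_second,
                       sigma2_min, sigma2_max):
    """Two-stage residual regression sketch for (theta*, Sigma*)."""
    # Stage 1: least squares for theta* on the first half of the budget.
    theta_hat, *_ = np.linalg.lstsq(X_first, y_first, rcond=None)
    # Stage 2: regress squared residuals on phi(x) to recover vech(Sigma*).
    Phi = np.array([phi(x) for x in X_second])
    resid2 = (y_second - X_second @ theta_hat) ** 2
    vech_Sigma_hat, *_ = np.linalg.lstsq(Phi, resid2, rcond=None)

    def sigma2_hat(x):
        """Plug-in variance estimate, clipped to [sigma2_min, sigma2_max]."""
        return float(np.clip(phi(x) @ vech_Sigma_hat, sigma2_min, sigma2_max))

    return theta_hat, sigma2_hat
```

Sample splitting is reflected in the two disjoint data sets passed to the routine, which keeps the errors in $\hat{\theta}_{\Gamma/2}$ and $\hat{\Sigma}_\Gamma$ independent as noted above.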
We show that this approach leads to the following general error decomposition for any $x \in \mathcal{X}$, with $z := \phi_x^\top (\Phi_\Gamma^\top \Phi_\Gamma)^{-1} \Phi_\Gamma^\top$:

$$|x^\top(\Sigma^* - \hat{\Sigma}_\Gamma)x| \;\leq\; \underbrace{\bigl\{x^\top(\hat{\theta}_{\Gamma/2} - \theta^*)\bigr\}^2}_{A} \;+\; \underbrace{\Bigl|\sum_{t=\Gamma/2+1}^{\Gamma} z_t\bigl(\eta_t^2 - x_t^\top \Sigma^* x_t\bigr)\Bigr|}_{B} \;+\; \underbrace{\Bigl|\sum_{t=\Gamma/2+1}^{\Gamma} 2 z_t\, x_t^\top(\hat{\theta}_{\Gamma/2} - \theta^*)\,\eta_t\Bigr|}_{C}.$$

Experimental Design for Precise Estimation. The analysis above gives an error decomposition for any phased non-adaptive data collection method that splits samples between estimation of $\theta^*$ and $\Sigma^*$. Intuitively, we wish to sample $\{x_1, \ldots, x_{\Gamma/2}\}$ and $\{x_{\Gamma/2+1}, \ldots, x_\Gamma\}$ to control the maximum absolute error by minimizing the expression above. We achieve this via two phases of G-optimal experimental design [28]: first over the $\mathcal{X}$ space to estimate $\theta^*$, and then over the $\mathcal{W}$ space to estimate $\Sigma^*$. Precisely, we control quantity $A$ via Stage 1 of Alg. 1, which draws from a G-optimal design $\lambda^*$ to minimize the maximum predictive variance $\max_{x \in \mathcal{X}} \mathbb{E}\bigl[\{x^\top(\hat{\theta}_{\Gamma/2} - \theta^*)\}^2\bigr]$. Quantity $B$ is another sub-exponential quantity (Definition B.2 in Appendix B), which we control by drawing from a G-optimal design $\tilde{\lambda}$ over $\mathcal{W}$ to minimize $\max_{\phi \in \mathcal{W}} \phi^\top(\Phi_\Gamma^\top \Phi_\Gamma)^{-1}\phi$. Finally, $C$ is controlled by a combination of the guarantees associated with $\lambda^*$ and $\tilde{\lambda}$. This leads to the following error bound on $\hat{\sigma}_x^2 = \min\{\max\{x^\top \hat{\Sigma}_\Gamma x, \sigma_{\min}^2\}, \sigma_{\max}^2\}$.

Theorem 3.1. Assume $\Gamma = \Omega\bigl(\max\{\sigma_{\max}^2 \log(|\mathcal{X}|/\delta)\, d^2,\ d^2\}\bigr)$. For any $x \in \mathcal{X}$ and $\delta \in (0, 1)$, Alg. 1 guarantees that $P\bigl(|\hat{\sigma}_x^2 - \sigma_x^2| \leq C_{\Gamma,\delta}\bigr) \geq 1 - \delta/2$, where $C_{\Gamma,\delta} = O\bigl(\sqrt{\log(|\mathcal{X}|/\delta)}\,\sigma_{\max}^2\, d^2 / \sqrt{\Gamma}\bigr)$.

The proof of Theorem 3.1 is in Appendix D and follows from the error decomposition above, treating $A$, $B$, and $C$ as sub-exponential random variables with bounds that leverage the experimental designs of Alg. 1. The details on the constants involved in Theorem 3.1 are also included in Appendix D.

Algorithm 1: HEAD (Heteroskedasticity Estimation by Adaptive Designs)
Result: Find $\hat{\Sigma}_\Gamma$
1 Input: Arms $\mathcal{X} \subset \mathbb{R}^d$, $\Gamma \in \mathbb{N}$
2 // Stage 1: Take half the samples to estimate $\theta^*$
3 Determine $\lambda^* = \arg\min_{\lambda \in \mathcal{P}_{\mathcal{X}}} \max_{x \in \mathcal{X}} x^\top \bigl(\sum_{x' \in \mathcal{X}} \lambda_{x'} x' x'^\top\bigr)^{-1} x$
4 Pull each arm $x \in \mathcal{X}$ $\lambda^*_x \Gamma/2$ times and collect observations $\{x_t, y_t\}_{t=1}^{\Gamma/2}$
5 Define $A = \sum_{t=1}^{\Gamma/2} x_t x_t^\top$ and $b = \sum_{t=1}^{\Gamma/2} x_t y_t$ and estimate $\hat{\theta}_{\Gamma/2} = A^{-1} b$
6 // Stage 2: Take half the samples to estimate $\Sigma^*$ given $\hat{\theta}_{\Gamma/2}$
7 Determine $\tilde{\lambda} = \arg\min_{\lambda \in \mathcal{P}_{\mathcal{X}}} \max_{x \in \mathcal{X}} \phi_x^\top \bigl(\sum_{x' \in \mathcal{X}} \lambda_{x'} \phi_{x'} \phi_{x'}^\top\bigr)^{-1} \phi_x$
8 Pull each arm $x \in \mathcal{X}$ $\tilde{\lambda}_x \Gamma/2$ times and collect observations $\{x_t, y_t\}_{t=\Gamma/2+1}^{\Gamma}$
9 Let $\tilde{A} = \sum_{t=\Gamma/2+1}^{\Gamma} \phi_{x_t} \phi_{x_t}^\top$ and $\tilde{b} = \sum_{t=\Gamma/2+1}^{\Gamma} \phi_{x_t}\,(y_t - x_t^\top \hat{\theta}_{\Gamma/2})^2$
10 Output: $\mathrm{vech}(\hat{\Sigma}_\Gamma) = \tilde{A}^{-1} \tilde{b}$

(As $\lambda_x \Gamma/2$ need not be an integer, it is not possible in practice to take exactly $\lambda_x \Gamma/2$ samples. This problem is easily handled by the rounding procedures detailed in Appendix A.)

Note that we design to directly control the MAE of $\hat{\sigma}_x^2$. This is in contrast to an inefficient alternative procedure that allocates samples to minimize the error of $\hat{\Sigma}_\Gamma$ in general and then extends this bound to $x^\top \hat{\Sigma}_\Gamma x$ for $x \in \mathcal{X}$.

Theoretical Comparison of Variance Estimation Methods. In contrast to previous approaches that naively scale with the size of $\mathcal{X}$ [14], the above result has a dimension dependence of $d^2$, which intuitively scales with the degrees of freedom of $\Sigma^*$. To highlight the tightness of this result, consider estimating the noise parameters of each arm $x \in \mathcal{X}$ using only the samples of $x$ (like the method in [14]). We improve this approach by adapting it to our heteroskedastic variance structure and strategically picking $d^2$ points to avoid the dependence on $|\mathcal{X}|$. We refer to this method as the Separate Arm Estimator, and in Appendix D we show that it suffers a dependence of $d^4$. This comparison point, along with our empirical results below, suggests that the error bounds established in Theorem 3.1 scale favorably with the problem dimension.
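Both stages of Alg. 1 require solving a G-optimal design problem: $\lambda^*$ over $\mathcal{X}$ in Stage 1 and $\tilde{\lambda}$ over $\mathcal{W}$ in Stage 2. As noted in Section 5, we compute designs with the Frank-Wolfe method; the routine below is a minimal sketch of that computation, where the step-size schedule, iteration count, and regularization are illustrative choices rather than the paper's exact implementation.

```python
import numpy as np

def g_optimal_design(V, iters=500, reg=1e-9):
    """Frank-Wolfe sketch for the G-optimal design
    lambda* = argmin_lambda max_{v in V} v^T (sum_v lambda_v v v^T)^{-1} v.
    V: (n, d) array of measurement vectors spanning R^d."""
    n, d = V.shape
    lam = np.full(n, 1.0 / n)              # start from the uniform design
    for k in range(iters):
        A = V.T @ (lam[:, None] * V) + reg * np.eye(d)
        A_inv = np.linalg.inv(A)
        # Predictive variance v^T A^{-1} v for every row of V.
        scores = np.einsum("ij,jk,ik->i", V, A_inv, V)
        i = int(np.argmax(scores))         # direction with the largest variance
        step = 2.0 / (k + 3.0)             # standard Frank-Wolfe step size
        lam = (1.0 - step) * lam
        lam[i] += step
    return lam
```

Stage 1 would call this routine on the arm vectors in $\mathcal{X}$ and Stage 2 on the rows $\{\phi_x\}_{x \in \mathcal{X}}$, followed by a rounding procedure (Appendix A) to convert the design weights into integer sample counts.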
Empirical Comparison of Variance Estimation Methods. Define two sets of arms $\mathcal{X}_1, \mathcal{X}_2$ such that each $x \in \mathcal{X}_1$ is drawn uniformly from the unit sphere and each $x \in \mathcal{X}_2$ is drawn uniformly from a sphere of radius 0.1, with $|\mathcal{X}_1| = 200$ and $|\mathcal{X}_2| = 1800$. Let $\Sigma^* = \mathrm{diag}(1, 0.1, 1, 0.1, \ldots) \in \mathbb{R}^{d \times d}$ and $\theta^* = (1, 1, \ldots, 1) \in \mathbb{R}^d$. This setting provides the heteroskedastic variation needed to illustrate the advantages of using HEAD (Alg. 1). An optimal algorithm will tend to target orthogonal vectors in $\mathcal{X}_1$ because these arms have higher magnitudes in informative directions. Defining $\mathcal{X} = \mathcal{X}_1 \cup \mathcal{X}_2$, we perform 32 simulations in which we randomly sample the arm set and construct estimators of $\sigma_x^2$ for each $x \in \mathcal{X}$. We compare HEAD, the Separate Arm Estimator, and the Uniform Estimator (see Alg. 3 in Appendix D). The Uniform Estimator is based on Alg. 1 but samples arms uniformly from $\mathcal{X}$ and does not split samples between estimation of $\theta^*$ and $\Sigma^*$. HEAD should outperform its competitors in two ways: 1) optimally allocating samples, and 2) efficiently sharing information across arms in estimation. The Uniform Estimator exploits the problem structure for estimation but not when sampling. In contrast, the Separate Arm Estimator samples optimally but does not use the relationships between arms for estimation. Fig. 2 depicts the average MAE and standard errors for each estimator. In Fig. 2a, we see that HEAD outperforms its competitors over a series of sample sizes for $d = 15$. Fig. 2b compares the estimators over a range of dimensions at a sample size of 95k. While HEAD continues to accurately estimate the heteroskedastic noise parameters at high dimensions, the Separate Arm Estimator's error scales poorly with dimension, as the analysis suggests.

Figure 2: The proposed experimental design estimator outperforms competitors in terms of the maximum absolute error $\max_{x \in \mathcal{X}} |\sigma_x^2 - \hat{\sigma}_x^2|$ over a range of sample sizes (a) and dimensions (b).

4 Adaptive Experimentation with Covariates in Heteroskedastic Settings

As an application of the novel sampling procedure and resulting variance estimation bounds developed in Section 3, we now study adaptive experimentation with covariates in the presence of heteroskedastic noise through the lens of transductive linear identification problems [12]. Recalling the problem setting from Section 1.1, we consider the following fixed confidence identification objectives:

1. Best-Arm Identification (BAI). Identify the singleton set $\{z^*\}$ where $z^* = \arg\max_{z \in \mathcal{Z}} z^\top \theta^*$.

2. Level-Set Identification (LS). Identify the set $G_\alpha = \{z : z^\top \theta^* > \alpha\}$ given a threshold $\alpha \in \mathbb{R}$.

Both target sets are sketched in code below. This set of objectives, together with a learning protocol, allows us to capture a number of practical problems of interest in which heteroskedastic noise naturally arises in the assumed form. Consider multivariate testing problems in e-commerce advertisement applications [17, 39]. The expected feedback is modeled through a linear combination of primary effects from content variation dimensions (e.g., features or locations that partition the advertisement) and interaction effects between content variations in pairs of dimensions. The noise structure is naturally heteroskedastic and depends on the combination of variation choices. Moreover, the transductive objectives allow for flexible experimentation, such as identifying the optimal combination of variations or the primary effects that exceed some threshold. We return to this setting in Experiment 3 of Section 5.
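For concreteness, the two identification targets above can be written down directly. The sketch below uses index sets over $\mathcal{Z}$ and is purely illustrative; the algorithms in Section 4.3 must of course recover these sets without access to $\theta^*$.

```python
import numpy as np

def bai_target(Z, theta):
    """H_BAI: index of the single best item z* = argmax_{z in Z} z^T theta."""
    return {int(np.argmax(Z @ theta))}

def level_set_target(Z, theta, alpha):
    """G_alpha: indices of items whose mean response exceeds the threshold alpha."""
    return {i for i, z in enumerate(Z) if float(z @ theta) > alpha}
```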
In the remainder of this section, we characterize the optimal sample complexities of the heteroskedastic transductive linear identification problems and design algorithms that nearly match these lower bounds.

4.1 Linear Estimation Methods with Heteroskedastic Noise

To draw inferences in the experimentation problems of interest, it is necessary to consider estimation methods for the unknown parameter vector $\theta^* \in \mathbb{R}^d$. Given a matrix of covariates $X_T \in \mathbb{R}^{T \times d}$, a vector of observations $Y_T \in \mathbb{R}^T$, and a user-defined weighting matrix $W_T = \mathrm{diag}(w_1, \ldots, w_T) \in \mathbb{R}^{T \times T}$, the weighted least squares (WLS) estimator is defined as

$$\hat{\theta}_{\mathrm{WLS}} = (X_T^\top W_T X_T)^{-1} X_T^\top W_T Y_T = \arg\min_{\theta \in \mathbb{R}^d} \|Y_T - X_T \theta\|_{W_T}^2.$$

Let $\Sigma_T = \mathrm{diag}(\sigma_{x_1}^2, \sigma_{x_2}^2, \ldots, \sigma_{x_T}^2) \in \mathbb{R}^{T \times T}$ denote the variance parameters of the noise observations. The WLS estimator is unbiased regardless of the weight selection, with variance

$$V\bigl(\hat{\theta}_{\mathrm{WLS}}\bigr) = (X_T^\top W_T X_T)^{-1} X_T^\top W_T \Sigma_T W_T^\top X_T (X_T^\top W_T X_T)^{-1} \;\overset{W_T = \Sigma_T^{-1}}{=}\; (X_T^\top \Sigma_T^{-1} X_T)^{-1}. \tag{2}$$

The WLS estimator with $W_T = \Sigma_T^{-1}$ is the minimum variance linear unbiased estimator [2]. In contrast, the ordinary least squares (OLS) estimator, obtained by taking $W_T = I$ in the WLS estimator, is unbiased but has higher variance, $V(\hat{\theta}_{\mathrm{OLS}}) = (X_T^\top X_T)^{-1} X_T^\top \Sigma_T X_T (X_T^\top X_T)^{-1} \preceq \sigma_{\max}^2 (X_T^\top X_T)^{-1}$. Consequently, WLS is a natural choice for sample-efficient estimation with heteroskedastic noise.

4.2 Optimal Sample Complexities

We now motivate the optimal sample complexity of any algorithm that returns the correct quantity with a fixed probability greater than $1 - \delta$, $\delta \in (0, 1)$, in transductive identification problems with heteroskedastic noise. We seek to identify a set of vectors $H_{\mathrm{OBJ}}$, where $H_{\mathrm{BAI}} = \{z^*\} = \{\arg\max_{z \in \mathcal{Z}} z^\top \theta^*\}$ and $H_{\mathrm{LS}} = \{z \in \mathcal{Z} : z^\top \theta^* - \alpha > 0\}$. Algorithms consist of a sampling strategy and a rule for returning a set $\hat{H}$ at time $\tau$.

Definition 4.1. An algorithm is $\delta$-PAC for $\mathrm{OBJ} \in \{\mathrm{BAI}, \mathrm{LS}\}$ if, for all $\theta^* \in \Theta$, $P_{\theta^*}(\hat{H} = H_{\mathrm{OBJ}}) \geq 1 - \delta$.

Throughout, we only consider algorithms that are $\delta$-PAC. To draw parallels between best-arm and level set identification, it is useful to define a set of values $Q_{\mathrm{OBJ}}$, where $Q_{\mathrm{BAI}} = \{z : z \in \mathcal{Z}, z \neq z^*\}$ and $Q_{\mathrm{LS}} = \{0\}$, and a constant $b_{\mathrm{OBJ}}$ such that $b_{\mathrm{BAI}} = 0$ and $b_{\mathrm{LS}} = \alpha$. Both problems amount to verifying that for all $h \in H_{\mathrm{OBJ}}$ and $q \in Q_{\mathrm{OBJ}}$, $(h - q)^\top \hat{\theta}_{\mathrm{WLS}} > b_{\mathrm{OBJ}}$, or equivalently $(h - q)^\top \theta^* - b_{\mathrm{OBJ}} - (h - q)^\top (\theta^* - \hat{\theta}_{\mathrm{WLS}}) > 0$. Using a sub-Gaussian tail bound and a union bound [32], this verification requires the following to hold for all $h \in H_{\mathrm{OBJ}}$ and $q \in Q_{\mathrm{OBJ}}$ for the data collected:

$$|(h - q)^\top (\theta^* - \hat{\theta}_{\mathrm{WLS}})| \;\leq\; \sqrt{2\,\|h - q\|^2_{V(\hat{\theta}_{\mathrm{WLS}})} \log(2|\mathcal{Z}|/\delta)} \;\leq\; (h - q)^\top \theta^* - b_{\mathrm{OBJ}}. \tag{3}$$

Let $\lambda_x$ denote the proportion of the $T$ samples given to a measurement vector $x \in \mathcal{X}$, so that when $W_T = \Sigma_T^{-1}$ in the WLS estimator of Section 4.1, Eq. (2) gives $V(\hat{\theta}_{\mathrm{WLS}}) = (X_T^\top \Sigma_T^{-1} X_T)^{-1} = \bigl(T \sum_{x \in \mathcal{X}} \lambda_x \sigma_x^{-2} x x^\top\bigr)^{-1}$. Now, we reformulate the condition in Eq. (3) and minimize over designs $\lambda \in \mathcal{P}_{\mathcal{X}}$ to see that $\delta$-PAC verification requires $T \geq 2\,\psi^*_{\mathrm{OBJ}} \log(2|\mathcal{Z}|/\delta)$, where

$$\psi^*_{\mathrm{OBJ}} = \min_{\lambda \in \mathcal{P}_{\mathcal{X}}}\; \max_{h \in H_{\mathrm{OBJ}},\, q \in Q_{\mathrm{OBJ}}} \frac{\|h - q\|^2_{(\sum_{x \in \mathcal{X}} \lambda_x \sigma_x^{-2} x x^\top)^{-1}}}{\{(h - q)^\top \theta^* - b_{\mathrm{OBJ}}\}^2}.$$

This motivating analysis gives rise to the following sample complexity lower bound.

Theorem 4.1. Consider an objective, OBJ, of best-arm identification (BAI) or level-set identification (LS). For any $\delta \in (0, 1)$, a $\delta$-PAC algorithm must satisfy $\mathbb{E}_{\theta^*}[\tau] \geq 2 \log(1/(2.4\delta))\, \psi^*_{\mathrm{OBJ}}$.

Remark 4.1. This lower bound assumes that the noise variance parameters are known. We compare to the existing lower bounds for the identification objectives with homoskedastic noise [12, 44, 35] in Appendix G. These apply in this setting by taking the variance parameter $\sigma^2 = \sigma_{\max}^2$ and are characterized by an instance-dependent parameter $\rho^*_{\mathrm{OBJ}}$ such that $\kappa^{-1} \leq \psi^*_{\mathrm{OBJ}}/\rho^*_{\mathrm{OBJ}} \leq 1$ (recall that $\kappa := \sigma_{\max}^2/\sigma_{\min}^2$). In general, $\kappa$ can be quite large in many problems with heteroskedastic noise. Consequently, near-optimal algorithms for linear identification with homoskedastic noise can be highly suboptimal (by up to a multiplicative factor of $\kappa$) in this setting.
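The functional $\psi^*_{\mathrm{OBJ}}$ can be evaluated numerically for any candidate design. The sketch below evaluates the BAI objective (with $b_{\mathrm{BAI}} = 0$) at a given design $\lambda$ and assumes the variances are known, as in the oracle lower bound; an oracle allocation would minimize this quantity over the simplex, e.g., with a Frank-Wolfe-type routine like the one sketched in Section 3.

```python
import numpy as np

def psi_bai(lam, X, Z, theta, sigma2):
    """Evaluate the BAI allocation objective of Section 4.2 for a design lam:
    max over suboptimal z of ||z* - z||^2_{A(lam)^{-1}} / ((z* - z)^T theta)^2,
    where A(lam) = sum_x lam_x x x^T / sigma_x^2.  Minimizing over lam gives
    psi*_BAI, and 2 * psi*_BAI * log(2|Z|/delta) lower-bounds the sample count."""
    A = sum(l / s2 * np.outer(x, x) for l, s2, x in zip(lam, sigma2, X))
    A_inv = np.linalg.inv(A)
    z_star = Z[np.argmax(Z @ theta)]
    vals = []
    for z in Z:
        gap = (z_star - z) @ theta
        if gap <= 0:                      # skip z* itself (and exact ties)
            continue
        y = z_star - z
        vals.append((y @ A_inv @ y) / gap**2)
    return max(vals)
```

Comparing the minimized value of this objective with and without the per-arm weights $1/\sigma_x^2$ reproduces, for the instance of Section 1.2, the gap between the heteroskedastic and homoskedastic oracles shown in Fig. 1c.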
4.3 Adaptive Designs with Unknown Heteroskedastic Variance

We now present Alg. 2, H-RAGE (Heteroskedastic Randomized Adaptive Gap Elimination), as a solution for adaptive experimentation with covariates in heteroskedastic settings. The approach is composed of two components: 1) obtain accurate estimates of the unknown variance parameters $\{\sigma_x^2\}_{x \in \mathcal{X}}$ using HEAD; 2) use adaptive experimental design methods to minimize the uncertainty in directions of interest given the estimated variance parameters. Denote by $\Delta = \min_{h \in H_{\mathrm{OBJ}}} \min_{q \in Q_{\mathrm{OBJ}}} (h - q)^\top \theta^* - b_{\mathrm{OBJ}}$ the minimum gap for the objective and assume $\max_{z \in \mathcal{Z}} |\langle z^* - z, \theta^* \rangle| \leq 2$. We prove the following guarantees for Alg. 2.

Theorem 4.2. Consider an objective, OBJ, of best-arm identification (BAI) or level-set identification (LS). The set returned by Alg. 2 at time $\tau$ is $\delta$-PAC and

$$\tau = O\Bigl(\psi^*_{\mathrm{OBJ}} \log(\Delta^{-1})\bigl\{\log(|\mathcal{Z}|/\delta) + \log\log(\Delta^{-1})\bigr\} + \log(\Delta^{-1})\, d^2 + \log(|\mathcal{X}|/\delta)\, \kappa^2 d^2\Bigr).$$

Algorithm 2: H-RAGE (Heteroskedastic Randomized Adaptive Gap Elimination)
Result: Find $z^* := \arg\max_{z \in \mathcal{Z}} z^\top \theta^*$ for BAI or $G_\alpha := \{z \in \mathcal{Z} : z^\top \theta^* > \alpha\}$ for LS
1 Input: $\mathcal{X} \subset \mathbb{R}^d$, $\mathcal{Z} \subset \mathbb{R}^d$, confidence $\delta \in (0, 1)$, OBJ $\in$ {BAI, LS}, threshold $\alpha \in \mathbb{R}$
2 Initialize: $\ell \leftarrow 1$, $\mathcal{Z}_1 \leftarrow \mathcal{Z}$, $G_1 \leftarrow \emptyset$, $B_1 \leftarrow \emptyset$
3 // Variance estimation
4 Call Alg. 1 so that $\hat{\sigma}_x^2 = \min\{\max\{x^\top \hat{\Sigma}_\Gamma x, \sigma_{\min}^2\}, \sigma_{\max}^2\}$ satisfies $|\hat{\sigma}_x^2 - \sigma_x^2| \leq \sigma_x^2/2$
5 while ($|\mathcal{Z}_\ell| > 1$ and OBJ = BAI) or ($|\mathcal{Z}_\ell| > 0$ and OBJ = LS) do
6 // Determine the design
7 Let $\hat{\lambda}_\ell \in \mathcal{P}_{\mathcal{X}}$ be a minimizer of $q(\lambda; \mathcal{Y}(\mathcal{Z}_\ell))$ if OBJ = BAI and of $q(\lambda; \mathcal{Z}_\ell)$ if OBJ = LS, where $q(\mathcal{V}) = \inf_{\lambda \in \mathcal{P}_{\mathcal{X}}} q(\lambda; \mathcal{V}) = \inf_{\lambda \in \mathcal{P}_{\mathcal{X}}} \max_{z \in \mathcal{V}} \|z\|^2_{(\sum_{x \in \mathcal{X}} \hat{\sigma}_x^{-2} \lambda_x x x^\top)^{-1}}$
8 Set $\epsilon_\ell = 2^{-\ell}$ and $\tau_\ell = 3 \epsilon_\ell^{-2} q(\mathcal{Z}_\ell) \log(8 \ell^2 |\mathcal{Z}|/\delta)$ // Determine step size
9 Pull each arm $x \in \mathcal{X}$ exactly $\lceil \tau_\ell \hat{\lambda}_{\ell,x} \rceil$ times, for $n_\ell$ samples in total, and collect $\{x_{\ell,i}, y_{\ell,i}\}_{i=1}^{n_\ell}$
10 Define $A_\ell := \sum_{i=1}^{n_\ell} \hat{\sigma}_i^{-2} x_{\ell,i} x_{\ell,i}^\top$ and $b_\ell := \sum_{i=1}^{n_\ell} \hat{\sigma}_i^{-2} x_{\ell,i} y_{\ell,i}$ and construct $\hat{\theta}_\ell = A_\ell^{-1} b_\ell$
11 // Eliminate arms
12 if OBJ is BAI then
13 $\mathcal{Z}_{\ell+1} \leftarrow \mathcal{Z}_\ell \setminus \{z \in \mathcal{Z}_\ell : \max_{z' \in \mathcal{Z}_\ell} \langle z' - z, \hat{\theta}_\ell \rangle > \epsilon_\ell\}$
14 else (OBJ is LS)
15 $G_{\ell+1} \leftarrow G_\ell \cup \{z \in \mathcal{Z}_\ell : \langle z, \hat{\theta}_\ell \rangle - \epsilon_\ell > \alpha\}$
16 $B_{\ell+1} \leftarrow B_\ell \cup \{z \in \mathcal{Z}_\ell : \langle z, \hat{\theta}_\ell \rangle + \epsilon_\ell < \alpha\}$
17 $\mathcal{Z}_{\ell+1} \leftarrow \mathcal{Z}_\ell \setminus \{\{G_{\ell+1} \setminus G_\ell\} \cup \{B_{\ell+1} \setminus B_\ell\}\}$
18 $\ell \leftarrow \ell + 1$
21 Output: $\mathcal{Z}_\ell$ for BAI or $G_\ell$ for LS

Remark 4.2. This result shows that Alg. 2 is nearly instance-optimal. Critically, not knowing the noise variances ahead of time has only an additive impact on the sample complexity relative to the lower bound. In contrast, existing near-optimal methods for identification problems with homoskedastic noise would be suboptimal by a multiplicative factor depending on $\kappa$ relative to the lower bound. Given that the lower bound assumes the variances are known, an interesting question for future work is a tighter lower bound assuming the variance parameters are unknown.

Algorithm Overview. Observe that for $\psi^*_{\mathrm{OBJ}}$, the optimal allocation depends on both the unknown gaps $(h - q)^\top \theta^* - b_{\mathrm{OBJ}}$ for $h \in H_{\mathrm{OBJ}}$ and $q \in Q_{\mathrm{OBJ}}$, and the unknown noise variance parameters $\{\sigma_x^2\}_{x \in \mathcal{X}}$. To handle this, the algorithm begins with an initial burn-in phase using the procedure from Alg. 1 to produce estimates $\{\hat{\sigma}_x^2\}_{x \in \mathcal{X}}$ of the unknown parameters $\{\sigma_x^2\}_{x \in \mathcal{X}}$, achieving a multiplicative error bound of the form $|\sigma_x^2 - \hat{\sigma}_x^2| \leq \sigma_x^2/2$ for all $x \in \mathcal{X}$ (the plug-in weighted least squares update of line 10 is sketched below).
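The per-round estimator in line 10 of Alg. 2 is ordinary weighted least squares with the estimated variances used as plug-in weights. A minimal sketch, with illustrative names:

```python
import numpy as np

def plugin_wls(X_pulled, y, sigma2_hat):
    """Plug-in weighted least squares (cf. line 10 of Alg. 2):
    theta_hat = A^{-1} b with A = sum_i x_i x_i^T / s2_i and
    b = sum_i x_i y_i / s2_i, where s2_i is the estimated variance
    of the arm pulled at step i."""
    w = 1.0 / np.asarray(sigma2_hat, dtype=float)
    A = (X_pulled * w[:, None]).T @ X_pulled
    b = X_pulled.T @ (w * y)
    return np.linalg.solve(A, b)
```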
From this point, the estimates {bσ2 x}x X are used as plug-ins to the experimental designs and the weighted least squares estimates. In each round ℓ N of the procedure, we maintain a set of uncertain items Zℓ, and obtain the sampling distribution bλℓ, by minimizing the maximum predictive variance of the WLS estimator over all directions y = h q for h HOBJ and q QOBJ. This is known as an XY-allocation and G-optimal-allocation in the best-arm and level set identification problems respectively. Critically, the number of samples taken in round ℓguarantees that the error in all estimates is bounded by ϵℓ, such that for any h HOBJ and q QOBJ, (h q) θ > 2ϵℓis eliminated from the active set at the end of the round. By progressively honing in on the optimal item set HOBJ, the sampling allocation gets closer to approximating the oracle allocation. 5 Experiments We now present experiments focusing on the transductive linear bandit setting. We compare H-RAGE and its corresponding oracle to their homoskedastic counterparts, RAGE [12] and the oracle allocation with noise equal to σ2 max. RAGE is provably near-optimal in the homoskedastic setting and known to perform well empirically. The algorithm is similar in concept to Alg. 2, but lacks the initial exploratory phase to estimate the variance parameters. We include the pseudo-code of RAGE in Appendix H. The homoskedastic and heteroskedastic oracles sample with respect to the optimal design with a sub-Gaussian interval stopping condition. All algorithms are run at a confidence level of δ = 0.05, and we use the Franke-Wolfe method to compute the designs. We demonstrate that H-RAGE has superior empirical performance over a range of settings. Linear bandits: Experiment 1 (Signal-to-Noise). We begin by presenting an experiment that is a variation of the standard benchmark problem for BAI in linear bandits [46] introduced in Sec. 1.2. Define ei Rd as the standard basis vectors in d dimensions. In the standard example, X = Z = {e1, e2, x } where x = {e1 cos(ω) + e2 sin(ω)} for ω 0, and θ = e1. Thus x1 = e1 is optimal, x3 = x is near-optimal, and x2 = e2 is informative to discriminate between x1 and x3. We consider an extension of this benchmark in multiple dimensions that highlights the performance gains from accounting for heteroskedastic variance. First, define an arm set H that is comprised of the standard example but with varied magnitudes. For q (0, 1), H = {e1} {e2} {q ei}d i=3 {e1 cos(ω) + ei sin(ω)}d i=2. Then define G to be a series of vectors such that the rank of span {ϕx : x G H} = RM. G allows for variance estimation. Finally, let the arm and item set be X = Z = G H and θ = e1. We define Σ as the d-dimensional identity matrix. Observe that x1 = e1 is optimal, the d 1 arms given by {e1 cos(ω) + ei sin(ω)}d i=2 are near-optimal with equal gaps = 1 cos(ω), and the d 1 arms given by {e2} q {ei}d i=3 are informative to sample for discriminating between x1 = e1 and xi = e1 cos(ω) + ei sin(ω) for each i {2, . . . , d}, respectively. To identify x1, we must estimate the gap adequately in every dimension with arms {e2} q {ei}d i=3. Therefore, performing BAI without accounting for heteroskedastic variances will tend toward a design that prioritizes sampling informative arms {q ei}d i=3, since these have smaller magnitudes and seem less informative about the near-optimal arms in dimensions i {3, 4 . . . d}. 
However, the signal to noise ratio is actually constant across arms (because the variance of q {ei}d i=3 is equal to q2) leading to an oracle distribution that equally samples across {e2} q {ei}d i=3 when we account for heteroskedastic variances. This change in allocation is illustrated in Appendix H for d = 4, ω = 0.01, q = 0.4 and G = 0.5(e1 + e2) + 0.1e3, 0.5(e1 + e2) + 0.1(e3 + e4) . In the same setting, Fig. 3a shows that the heteroskedastic variance experimental design and the tighter predictive confidence intervals of the weighted least squares estimator result in a large decrease in the sample size needed to identify the best arm. Experiment 2: Benchmark. This experiment is also similar to the benchmark example introduced in Sec. 1.2. In this variation, the informative arms have the same magnitude but are bent at a 45 angle, X = {e1} {e1 cos(ω) + e2 sin(ω)} {ei}d i=3 {ei/ 2}d i=1}d j>i. Let the arm and item set be X = Z and θ = e1. In Experiment 1, we have many near-optimal arms and H-RAGE reveals that each of these are equally difficult to distinguish from the best. In contrast, Experiment 2 contains one near-optimal arm along with many potentially informative arms, and we are interested in identifying the one that is most helpful. Intuitively, Experiment 1 uses heteroskedastic variance estimates to assess the difficulty of many problems, while Experiment 2 uses the same information to assess the benefits of many solutions. Let the unknown noise matrix be given by Σ = diag(σ2 1, σ2 2, . . . , σ2 d) where σ2 1 = α2, (σ2 2, σ2 3) = β2 and (σ2 4, σ2 5 . . . , σ2 d) = α2. x1 = e1 is again optimal and x2 = {e1 cos(ω) + e2 sin(ω)} is a near optimal arm. In this case, {e2/ 2}d i=3 are informative for distinguishing between x1 and x2, and the oracle allocation assuming homoskedastic noise (Homoskedastic Oracle) picks three of these vectors to sample. However, if α2 β2, then it is optimal to sample {e2/ 2, e3} over other potential informative arm combinations. Appendix H shows this contrast in allocations for d = 3, ω = 0.01, α2 = 1 and β2 = 0.2. In the same setting, Fig. 3b shows that estimating and accounting for heteroskedastic noise contributes to a reduction in sample complexity. Experiment 3: Multivariate Testing. In this experiment, we return to the motivating example of Section 4, multivariate testing in e-commerce advertising [17, 39]. We divide an advertisement into natural locations or features called dimensions, each of which has different content options or variations. We induce heteroskedastic variance by allowing both the expected value and the variance (a) Experiment 1 (b) Experiment 2 (c) Experiment 3 Figure 3: Experimental results show the mean sample complexity of each algorithm over 100 simulations with 95% Monte Carlo confidence intervals. of an advertisement s reward to depend on the n variations in each of the m dimensions. Let X = Z denote the set of layouts corresponding to the combinations of variant choices in the dimensions so that |X| = nm. The multivariate testing problem is often modeled in the linear bandit framework using the expected feedback for any layout x X {0, 1}d, where d = 1 + mn + n2m(m + 1)/2, given by x θ := θ 0 + Pm i=1 Pn j=1 θ i,jxi,j + Pm i=1 Pm i =i+1 Pn j=1 Pn j =1 θ i,i ,j,j xi,jxi ,j . In this model, xi,j {0, 1} is an indicator for variation j {1, . . . , n} being placed in dimension i {1, . . . , m}, and Pn j=1 xi,j = 1 for all i {1, . . . , n} for any layout x X. 
Observe that the expected feedback for a layout is modeled as a linear combination of a common bias term, primary effects from the variation choices within each dimension, and secondary interaction effects from the combinations of variation choices between dimensions. We consider an environment where the effect of variation changes in one dimension, call it dimension j, has much higher variance than others; resulting in the best variation for dimension j being harder to identify. An algorithm that accounts for heteroskedastic variance will devote a greater number of samples to compare variations in dimension j, whereas an algorithm that upper bounds the variance will split samples equally between dimensions. For a simulation setting with 2 variations in 3 dimensions, we define Σ = diag(0.3, 0.7, 10 3, 10 3, . . .) R7 7 and θ = (0, 0.005, 0.0075, 0.01, 0.1, 0.1, . . .) R7. This simulation setting implies that the second variation in each of the dimensions is better than the first, but the positive effect in the first dimension is hardest to identify. In Appendix H we can see that this causes the heteroskedastic oracle to devote more weight than the homoskedastic oracle to vectors that include the second variation in the first dimension. Fig. 3c depicts the improvement in sample complexity resulting from accounting for the heteroskedastic variances. 6 Conclusion This paper presents an investigation of online linear experimental design problems with heteroskedastic noise. We propose a two-phase sample splitting procedure for estimating the unknown noise variance parameters based on G-optimal experimental designs and show error bounds that scale efficiently with the dimension of the problem. The proposed approach is then applied to fixed confidence transductive best-arm and level set identification with heteroskedastic noise, where we present instance-dependent lower bounds with provably near-optimal algorithms. This paper leads to a number of interesting future research directions. For example, there may be situations where the noise structure is not fully characterized by Σ . Future work can explore the development of methods that handle more general noise structures (including correlated or heavytailed noise) while remaining near-optimal. Furthermore, it is possible that an integrated approach to variance estimation and adaptive experimentation with covariates would be more sample efficient than our two-step method. Lastly, it would be interesting to extend our algorithm to nonlinear response models. [1] Marc Abeille, Louis Faury, and Clément Calauzènes. Instance-wise minimax-optimal algorithms for logistic bandits. In International Conference on Artificial Intelligence and Statistics, pages 3691 3699. PMLR, 2021. [2] Alexander C Aitken. IV. On least squares and linear combination of observations. Proceedings of the Royal Society of Edinburgh, 55:42 48, 1936. [3] Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, and Yining Wang. Near-optimal discrete optimization for experimental design: A regret minimization approach. Mathematical Programming, 186:439 478, 2021. [4] András Antos, Varun Grover, and Csaba Szepesvári. Active learning in heteroscedastic noise. Theoretical Computer Science, 411(29-30):2712 2728, 2010. [5] David Blitz and Pim Van Vliet. The volatility effect: Lower risk without lower return. Journal of portfolio management, pages 102 113, 2007. [6] Romain Camilleri, Kevin Jamieson, and Julian Katz-Samuels. High-dimensional experimental design and kernel bandits. 
In International Conference on Machine Learning, pages 1227 1237. PMLR, 2021. [7] Kamalika Chaudhuri, Prateek Jain, and Nagarajan Natarajan. Active heteroscedastic regression. In International Conference on Machine Learning, pages 694 702. PMLR, 2017. [8] Wesley Cowan, Junya Honda, and Michael N Katehakis. Normal bandits of unknown means and variances. Journal of Machine Learning Research, 18(154):1 28, 2018. [9] Rémy Degenne, Pierre Ménard, Xuedong Shang, and Michal Valko. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, pages 2432 2442. PMLR, 2020. [10] Bianca Dumitrascu, Gregory Darnell, Julien Ayroles, and Barbara E Engelhardt. Statistical tests for detecting variance effects in quantitative trait studies. Bioinformatics, 35(2):200 210, 2019. [11] Louis Faury, Marc Abeille, Clément Calauzènes, and Olivier Fercoq. Improved optimistic algorithms for logistic bandits. In International Conference on Machine Learning, pages 3052 3060. PMLR, 2020. [12] Tanner Fiez, Lalit Jain, Kevin G Jamieson, and Lillian Ratliff. Sequential experimental design for transductive linear bandits. Advances in neural information processing systems, 32:10667 -10677, 2019. [13] Tanner Fiez, Sergio Gamez, Arick Chen, Houssam Nassif, and Lalit Jain. Adaptive experimental design and counterfactual inference. In Workshops of Conference on Recommender Systems, 2022. [14] Xavier Fontaine, Pierre Perrault, Michal Valko, and Vianney Perchet. Online a-optimal design and active linear regression. In International Conference on Machine Learning, pages 3374 3383. PMLR, 2021. [15] VP Godambe and ME Thompson. Some aspects of the theory of estimating equations. Journal of Statistical Planning and Inference, 2(1):95 104, 1978. [16] William H Greene. Econometric analysis. Pearson Education India, 2003. [17] Daniel N Hill, Houssam Nassif, Yi Liu, Anand Iyer, and SVN Vishwanathan. An efficient bandit algorithm for realtime multivariate optimization. In International Conference on Knowledge Discovery and Data Mining, pages 1813 1821, 2017. [18] Matthew Hoffman, Bobak Shahriari, and Nando Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In Artificial Intelligence and Statistics, pages 365 374. PMLR, 2014. [19] Jean Honorio and Tommi Jaakkola. Tight bounds for the expected risk of linear classifiers and PAC-Bayes finite-sample guarantees. In Artificial Intelligence and Statistics, pages 384 392. PMLR, 2014. [20] Francis L Huang, Wolfgang Wiedermann, and Bixi Zhang. Accounting for heteroskedasticity resulting from between-group differences in multilevel models. Multivariate Behavioral Research, pages 1 21, 2022. [21] Peter J Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Berkeley Symposium on Mathematical Statistics and Probability, pages 221 233. University of California Press, 1967. [22] Yassir Jedra and Alexandre Proutiere. Optimal best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 33:10007 10017, 2020. [23] Kwang-Sung Jun, Lalit Jain, Blake Mason, and Houssam Nassif. Improved confidence bounds for the linear logistic model and applications to bandits. In International Conference on Machine Learning, pages 5148 5157. PMLR, 2021. [24] Maurits Kaptein and Petri Parvinen. Advancing e-commerce personalization: Process framework and case study. International Journal of Electronic Commerce, 19(3):7 33, 2015. [25] Zohar S Karnin. 
Verification based solution for structured mab problems. Advances in Neural Information Processing Systems, 29, 2016. [26] Julian Katz-Samuels, Lalit Jain, Kevin G Jamieson, et al. An empirical process approach to the union bound: Practical algorithms for combinatorial and linear bandits. Advances in Neural Information Processing Systems, 33:10371 10382, 2020. [27] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1): 1 42, 2016. [28] Jack Kiefer. Optimum experimental designs. Journal of the Royal Statistical Society: Series B (Methodological), 21(2):272 304, 1959. [29] Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363 366, 1960. [30] Johannes Kirschner and Andreas Krause. Information directed sampling and bandits with heteroscedastic noise. In Conference On Learning Theory, pages 358 384. PMLR, 2018. [31] Fanjie Kong, Yuan Li, Houssam Nassif, Tanner Fiez, Shreya Chakrabarti, and Ricardo Henao. Neural insights for digital marketing content design. In International Conference on Knowledge Discovery and Data Mining, pages 4320 4332, 2023. [32] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020. [33] Zhaoqi Li, Lillian Ratliff, Houssam Nassif, Kevin Jamieson, and Lalit Jain. Instance-optimal pac algorithms for contextual bandits. In Conference on Neural Information Processing Systems, pages 37590 37603, 2022. [34] James G Mac Kinnon and Halbert White. Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties. Journal of econometrics, 29(3):305 325, 1985. [35] Blake Mason, Lalit Jain, Subhojyoti Mukherjee, Romain Camilleri, Kevin Jamieson, and Robert Nowak. Nearly optimal algorithms for level set estimation. In International Conference on Artificial Intelligence and Statistics, pages 7625 7658. PMLR, 2022. [36] Blake Mason, Kwang-Sung Jun, and Lalit Jain. An experimental design approach for regret minimization in logistic bandits. In AAAI Conference on Artificial Intelligence, volume 36, pages 7736 7743, 2022. [37] Darcy P Mays and Raymond H Myers. Design and analysis for a two-level factorial experiment in the presence of variance heterogeneity. Computational statistics & data analysis, 26(2): 219 233, 1997. [38] Subhojyoti Mukherjee, Qiaomin Xie, Josiah Hanna, and Robert Nowak. SPEED: Experimental design for policy evaluation in linear heteroscedastic bandits. ar Xiv preprint ar Xiv:2301.12357, 2023. [39] Sareh Nabi, Houssam Nassif, Joseph Hong, Hamed Mamani, and Guido Imbens. Bayesian meta-prior learning using Empirical Bayes. Management Science, 68(3):1737 1755, 2022. [40] H Neudecker. Symmetry, 0-1 matrices and jacobians. Econometric Theory, 2:157 190, 1986. [41] Ann L Oberg and Douglas W Mahoney. Linear mixed effects models. Topics in biostatistics, pages 213 234, 2007. [42] Friedrich Pukelsheim. Optimal design of experiments. SIAM, 2006. [43] Takaki Sato and Yasumasa Matsuda. Spatial extension of generalized autoregressive conditional heteroskedasticity models. Spatial Economic Analysis, 16(2):148 160, 2021. [44] Marta Soare. Sequential resource allocation in linear stochastic bandits. Ph D thesis, Université Lille 1-Sciences et Technologies, 2015. [45] Marta Soare, Alessandro Lazaric, and Rémi Munos. Active learning in linear stochastic bandits. Bayesian Optimization in Theory and Practice, 2013. 
[46] Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, 2014. [47] Chao Tao, Saúl Blanco, and Yuan Zhou. Best arm identification in linear bandits with linear dimension dependency. In International Conference on Machine Learning, pages 4877 4886. PMLR, 2018. [48] Bryan W Ting, Fred A Wright, and Yi-Hui Zhou. Simultaneous modeling of multivariate heterogeneous responses and heteroskedasticity via a two-stage composite likelihood. Biometrical Journal, page 2200029, 2023. [49] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019. [50] Po-An Wang, Ruo-Chun Tzeng, and Alexandre Proutiere. Fast pure exploration via Frank-Wolfe. Advances in Neural Information Processing Systems, 34:5810 5821, 2021. [51] Halbert White. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica: journal of the Econometric Society, pages 817 838, 1980. [52] Douglas P Wiens. Designs for weighted least squares regression, with estimated weights. Statistics and Computing, 23:391 401, 2013. [53] Douglas P Wiens and Pengfei Li. V-optimal designs for heteroscedastic regression. Journal of Statistical Planning and Inference, 145:125 138, 2014. [54] Weng Kee Wong and R Dennis Cook. Heteroscedastic g-optimal designs. Journal of the Royal Statistical Society: Series B (Methodological), 55(4):871 880, 1993. [55] Weng Kee Wong and Wei Zhu. Optimum treatment allocation rules under a variance heterogeneity model. Statistics in Medicine, 27(22):4581 4595, 2008. [56] Jeffrey M Wooldridge. Econometric analysis of cross section and panel data. MIT press, 2010. [57] Liyuan Xu, Junya Honda, and Masashi Sugiyama. A fully adaptive algorithm for pure exploration in linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 843 851. PMLR, 2018. [58] Dan Yan, Xiaohang Ren, Wanli Zhang, Yiying Li, and Yang Miao. Exploring the real contribution of socioeconomic variation to urban PM2.5 pollution: New evidence from spatial heteroscedasticity. Science of the Total Environment, 806:150929, 2022. [59] Junwen Yang and Vincent Tan. Minimax optimal fixed-budget best arm identification in linear bandits. Advances in Neural Information Processing Systems, 35:12253 12266, 2022. [60] Mohammadi Zaki, Avinash Mohan, and Aditya Gopalan. Towards optimal and efficient best arm identification in linear bandits. ar Xiv preprint ar Xiv:1911.01695, 2019. [61] Huiming Zhang and Song Xi Chen. Concentration inequalities for statistical inference. Communications in Mathematical Research, 37(1):1 85, 2021. [62] Heyang Zhao, Dongruo Zhou, Jiafan He, and Quanquan Gu. Bandit learning with general function classes: Heteroscedastic noise and variance-dependent regret bounds. ar Xiv preprint ar Xiv:2202.13603, 2022. Appendices 14 A Background on Experimental Designs and Rounding 14 B Technical Preliminaries 15 C Baseline Variance Estimation Procedures 16 D Proofs of Variance Estimation 17 D.1 Analysis of the Separate Arm Estimator . . . . . . . . . . . . . . . . . . . . . . . 17 D.2 Analysis of the HEAD Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 E Proofs of Best-Arm Identification 25 E.1 Lower Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 E.2 Upper Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 E.2.1 Proof of Correctness . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . 26 E.2.2 Proof of Sample Complexity . . . . . . . . . . . . . . . . . . . . . . . . . 28 F Proofs of Level Set Identification 33 F.1 Lower Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 F.2 Upper Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 F.2.1 Proof of Correctness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 F.2.2 Proof of Upper Bound on Sample Complexity . . . . . . . . . . . . . . . . 35 G Comparing Identification Lower Bounds 37 H Experiment Details 38 H.1 Oracle Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 H.2 Assuming Homosekedastic Noise . . . . . . . . . . . . . . . . . . . . . . . . . . 38 A Background on Experimental Designs and Rounding We use optimal experiment design [29, 42] to minimize uncertainty in the estimator of interest. Consider the WLS estimator with the weight matrix WT = Σ 1 T so that V(bθWLS) = (X T Σ 1 T XT ) 1 = (P t=1 σ 2 xt xtx t ) 1. Let λx denote the proportion of the total samples T given to the measurement vector x X so that V(bθWLS) = (P x X λxσ 2 x xx ) 1/T. We wish to take T samples so as to construct the design P x X Tλxσ 2 x xx , however Tλx is often not an integer. As has been discussed at depth [46, 12, 35, 42, 3], efficient rounding schemes can solve this problem. Given some constant tolerance threshold ϵ, a design λ PX , and a minimum sample size r(ϵ), efficient rounding procedures return a fixed allocation {xt}T t=1 that yields a (1 + ϵ) approximation to the intended design. A simple, well known procedure with r(ϵ) {d(d + 1) + 2} /ϵ comes from Pukelsheim [42], while the scheme of Allen-Zhu et al. [3] yields r(ϵ) O(d/ϵ2). In our experimental design algorithms, we leverage the aforementioned rounding schemes. We adapt the guarantees of [3] for Alg. 1 as an example in the following proposition. For sample size N, define X as a set composed of the elements of X Rd duplicated N times and SN = n s {0, 1, . . . , N}N|X| : PN|X| t=1 si N o as the set of discrete sample allocations. Proposition A.1. Suppose ϵ (0, 1/3] and N 5d/ϵ2. Let λ PX , then in polynomial-time (in n and d), we can round λN to an integral solution bλ SN satisfying max x X x X x X bλx x x ! 1 x (1 + 6ϵ) min λ PX max x X x X x X Nλx x x ! 1 Proof. This follows directly from Theorem 2.1 in [3] letting CN = n s [0, N]N|X| : PN|X| i=1 si N o be a continuous relaxation of SN and setting k = N and n = N|X|. Alternatively, one could use the robust inverse propensity score (RIPS) estimator [6], which avoids the need for rounding schemes via robust mean estimation. B Technical Preliminaries For our analysis, we do not assume that the error, η, in Eq. 1 is Gaussian. We generalize by extending to strictly sub-Gaussian noise, in which the variance σ2 x is equal to the optimal variance proxy for ηt, t N. This paradigm allows for noise distributions such as the symmetrized Beta, symmetrized Gamma, and Uniform. The following definitions and propositions will be used in the proofs of Appendices D-F. Definition B.1. A real-valued random variable X is sub-Gaussian if there exists a positive constant σ2 such that E eλX eλ2σ2/2 for all λ 0. Proposition B.1 (Equation 2.9, [49]). Let X be a sub-Gaussian random variable with sub-Gaussian parameter σ2 and mean E(X) = µ. Then, for all t > 0, we have P (|X µ| t) 2e t2 Definition B.2. 
A real-valued random variable X with mean E(X) = µ is sub-exponential if there are non-negative parameters (ν, α) such that E n eλ(X µ)o e ν2λ2 2 , |λ| < 1 Proposition B.2 (Proposition 2.9, [49]). Suppose that X is sub-exponential with parameters (ν, α) and mean E(X) = µ. Then, for t > 0, ν2 , if 0 t ν2 α , if t > ν2 This implies that for δ (0, 1), P h |X µ| < max np 2 log(2/δ)ν2, 2 log(2/δ)α oi = 1 δ. Proposition B.3 (Equation 37 in Appendix B, [19]). Let X be a sub-Gaussian random variable with sub-Gaussian parameter σ2. Then X2 is sub-exponential with parameters (ν = 4σ2 2, α = 4σ2). Proposition B.4 (Equation 2.18, [49]). Suppose that Xi for i {1, 2, . . . , n} are sub-exponential with parameters (νi, αi) such that E(Xi) = µi. Then, Pn i=1(Xi µi) is sub-exponential with parameters (ν , α ) such that α = max i=1,2,...,n αi and ν := Definition B.3. A G-optimal design [28], λ PX , for a set of arms X Rd is such that λ = arg min λ PX max x X x X x X λxxx ! 1 Proposition B.5 (Lemma 4.6, [28]). For any finite X Rd with span(X) = Rd, d = min λ PX max x X x X x X λxxx ! 1 This implies that if we sample each arm x X τλ x times, then max x X x X x X τλ x xx ! 1 τ max x X x X x X λ xxx ! 1 Definition B.4. The Forbenius norm of A Rn m is Definition B.5. The Spectral norm of A Rn m is A 2 = sup x =0 Proposition B.6 (Matrix Norm Properties). For A Rn m, we have sub-multiplicativity for the Frobenius norm: A A F A 2 F , and a bound on the Spectral norm: A 2 A F . Proposition B.7 (Corollary 4.7, [61]). Let ξ1, . . . , ξn be zero-mean σ2-sub-Gaussian and Rn n. For any t > 0, P h ξ ξ σ2 n tr( ) + 2tr( 2t)1/2 + 2 2t oi e t, which implies that with probability 1 δ for δ (0, 1), ξ Σξ σ2 h tr( ) + 2tr 2 log(1/δ) 1/2 + 2 2 log(1/δ) i . Proposition B.8 (Equation 20.3, [32]). Assume the setup of Eq. 1 and that a data set of size Γ, D = {xt, yt}Γ t=1, has been collected through a fixed design. Define VΓ = PΓ t=1 xtx t and construct the least squares estimator for D, bθΓ = V 1 Γ PΓ t=1 xtyt . Then, with probability 1 δ for δ (0, 1), bθΓ θ VΓ 2 p 2 {d log(6) + log(1/δ)}. C Baseline Variance Estimation Procedures These are the algorithms we reference in the theoretical and empirical comparisons of Section 3. The Uniform Estimator. The Uniform Estimator draws arms uniformly at random and uses all samples to construct estimators bθΓ of θ and bΣΓ of Σ . The Separate Arm Estimator. Define U as the set of subsets of W of size M = d(d + 1)/2, and for U U construct ΦU such that its rows are composed of ϕx U. Let ζmin(A) be the minimum singular value of matrix A Rd d. The Seperate Arm Estimator picks a set of M arms such that U = arg max U U ζmin(Φ 1 U ). It then splits Γ samples evenly between these arms, estimates the sample variance for each and finds the least squares estimator bΣSA Γ . Algorithm 3: Uniform Estimator of Heteroskedastic Variance Result: Find bΣUE Γ 1 Input: Arms X Rd and Γ N 2 Sample arms uniformly at random from X 3 Define A = PΓ t=1 xtx t and b = PΓ t=1 xtyt and estimate bθΓ = A 1b . 4 Define A = PΓ t=1 ϕxtϕ xt and b = PΓ t=1 ϕxt yt x t bθΓ 2 . 5 Output: bΣUE Γ = A 1b . Algorithm 4: Seperate Arm Estimator of Heteroskedastic Variance Result: Find bΣSA Γ 1 Input: Arms X Rd and Γ N 2 Define U = arg max U U ζmin(Φ 1 U ). 
3 Equally distribute Γ samples among arms XU = {m : ϕm U } and observe rewards {ym,t}Γ/M t=1 for m XU 4 Define the sample average for each arm ym = (M/Γ) PΓ/M t=1 ym,t 5 Output: bΣSA Γ arg mins RM P m XU PΓ/M t=1 n ϕm s ( ym ym,t)2o2 D Proofs of Variance Estimation In this appendix, we analyze the maximum absolute error of estimates associated with the Seperate Arm Estimator and the HEAD Estimator. Recall that M := d(d+1)/2. Define W = {ϕx, x X} RM where ϕx = vec(xx )G and G Rd2 M is the duplication matrix such that Gvech(xx ) = vec(xx ) with vech( ) denoting the half-vectorization. D.1 Analysis of the Separate Arm Estimator We begin by proving a general result bounding the absolute error of the sample variance for any sub-Gaussian random variable with parameter γ2. Proposition D.1. Let X be a γ2-sub-Gaussian random variable with mean µ. Assume that we have collected n independent copies of X, {X1, X2, . . . , Xn}, and label the sample mean as X = Pn i=1 Xi/n. Defining the sample variance as bσ2 = Pn i=1(Xi X)2/n, the true variance as σ2 = E(X2 i ) µ2 and δ (0, 1), we find that with probability greater than 1 δ, 2γ2 log(4/δ) 32γ4 log(4/δ) Proof. We begin by upper bounding the absolute error using the triangle inequality, σ2 bσ2 = σ2 Pn i=1(Xi X)2 = E(X2 i ) µ2 Pn i=1 X2 i 2Xi X + X2 = E(X2 i ) µ2 Pn i=1 X2 i n + 2 X2 X2 = E(X2 i ) Pn i=1 X2 i n + X2 σ2 E(X2 i ) Pn i=1 X2 i n n µ2 | {z } B We first bound quantity A. Leveraging Proposition B.3, we establish that X2 i E(X2 i ) is subexponential with parameters (4γ2 2, 4γ2). We can then use Proposition B.4 to analyze the sum of independent sub-exponential random variables, finding that Pn i=1 X2 i E(X2 i ) /n is subexponential with parameters (4γ2 2/ n, 4γ2/n). Finally, we invoke Proposition B.2 to bound quantity A with probability 1 δ/2, E(X2 i ) Pn i=1 X2 i n ( 4γ2 log(4/δ) 32γ4 log(4/δ) Now we bound quantity B. Appealing to Proposition B.3, we find that X2 σ2 n µ2 is subexponential with parameters (4 2γ2/n, 4γ2/n). We then use Proposition B.2 to bound B with probability 1 δ/2, ( 4γ2 log(4/δ) 32γ4 log(4/δ) ( 4γ2 log(4/δ) n , 4γ2 log(4/δ) 2γ2 log(4/δ) where (a) follows because 2. Finally, we combine quantities A, B and C, apply a union bound, and conclude that with probability 1 δ for δ (0, 1), σ2 bσ2 E(X2 i ) Pn i=1 X2 i n ( 4γ2 log(4/δ) 32γ4 log(4/δ) 2γ2 log(4/δ) 2γ2 log(4/δ) 32γ4 log(4/δ) We can now use Proposition D.1 to control the sample variance estimates of the arms chosen by the Separate Arm Estimator. Theorem D.1 derives a concentration bound of the maximum absolute error of eσ2 x = min n max n σ2 min, x bΣSA Γ x o , σ2 max o . This bound, which is O(d4), scales unfavorably with dimension as highlighted in the theoretical and empirical comparisons conducted in Section 3. Theorem D.1. Set bΣSA Γ as the output of Alg. 4 and define B such that for all x X, x 2 B. Defining eσ2 x = min n max n σ2 min, x bΣSA Γ x o , σ2 max o , σ2 x = x Σ x and δ [0, 1], for any x X, |eσ2 x σ2 x| 5B2 2 ζmin(ΦU ) max 2σ2 maxd3 log(8M/δ) 32σ4maxd4 log(8M/δ) Note that we set the probability bound to 1 δ/2 instead of 1 δ because we want to be able to leverage these variance estimation bounds with the identification algorithms of Section 4 (and so are anticipating another union bound). Proof. Label M = d(d + 1)/2 and XU = {m : ϕm U }, where U is defined in Alg. 4. Additionally, in Alg. 4 we observe {ym,t}Γ/M t=1 for each arm m XU . 
D Proofs of Variance Estimation

In this appendix, we analyze the maximum absolute error of the estimates produced by the Separate Arm Estimator and the HEAD Estimator. Recall that $M := d(d+1)/2$. Define $\mathcal{W} = \{\phi_x : x \in \mathcal{X}\} \subset \mathbb{R}^M$, where $\phi_x = G^\top\mathrm{vec}(xx^\top)$ and $G \in \mathbb{R}^{d^2 \times M}$ is the duplication matrix such that $G\,\mathrm{vech}(xx^\top) = \mathrm{vec}(xx^\top)$, with $\mathrm{vech}(\cdot)$ denoting the half-vectorization.

D.1 Analysis of the Separate Arm Estimator

We begin by proving a general result bounding the absolute error of the sample variance of any sub-Gaussian random variable with parameter $\gamma^2$.

Proposition D.1. Let $X$ be a $\gamma^2$-sub-Gaussian random variable with mean $\mu$. Assume that we have collected $n$ independent copies of $X$, $\{X_1, X_2, \ldots, X_n\}$, and write the sample mean as $\bar X = \sum_{i=1}^n X_i/n$. Defining the sample variance as $\hat\sigma^2 = \sum_{i=1}^n(X_i - \bar X)^2/n$, the true variance as $\sigma^2 = \mathbb{E}(X_i^2) - \mu^2$ and $\delta \in (0,1)$, we find that with probability greater than $1-\delta$,
\[
|\sigma^2 - \hat\sigma^2| \le 5\max\Big\{\sqrt{\tfrac{32\gamma^4\log(4/\delta)}{n}},\; \tfrac{2\gamma^2\log(4/\delta)}{n}\Big\}.
\]

Proof. We begin by upper bounding the absolute error using the triangle inequality. Since $\hat\sigma^2 = \sum_{i=1}^n X_i^2/n - \bar X^2$,
\[
|\sigma^2 - \hat\sigma^2| \le \underbrace{\Big|\mathbb{E}(X_i^2) - \tfrac{1}{n}\sum_{i=1}^n X_i^2\Big|}_{A} + \underbrace{\Big|\bar X^2 - \tfrac{\sigma^2}{n} - \mu^2\Big|}_{B} + \underbrace{\tfrac{\sigma^2}{n}}_{C}.
\]
We first bound quantity $A$. Leveraging Proposition B.3, we establish that $X_i^2 - \mathbb{E}(X_i^2)$ is sub-exponential with parameters $(4\sqrt{2}\gamma^2, 4\gamma^2)$. We can then use Proposition B.4 to analyze the sum of independent sub-exponential random variables, finding that $\sum_{i=1}^n\{X_i^2 - \mathbb{E}(X_i^2)\}/n$ is sub-exponential with parameters $(4\sqrt{2}\gamma^2/\sqrt{n}, 4\gamma^2/n)$. Finally, we invoke Proposition B.2 to bound quantity $A$ with probability $1-\delta/2$ by a term of order $\max\{\sqrt{32\gamma^4\log(4/\delta)/n},\, 4\gamma^2\log(4/\delta)/n\}$.

Now we bound quantity $B$. Appealing to Proposition B.3, we find that $\bar X^2 - \sigma^2/n - \mu^2$ is sub-exponential with parameters $(4\sqrt{2}\gamma^2/n, 4\gamma^2/n)$, and Proposition B.2 bounds $B$ with probability $1-\delta/2$ by a term of order $\max\{\sqrt{32\gamma^4\log(4/\delta)}/n,\, 8\gamma^2\log(4/\delta)/n\}$. Quantity $C$ satisfies $C \le 2\gamma^2\log(4/\delta)/n$ since the variance of a $\gamma^2$-sub-Gaussian random variable is at most $\gamma^2$ and $\log(4/\delta) \ge 1$. Combining quantities $A$, $B$ and $C$ and applying a union bound yields the stated bound with probability $1-\delta$.

We can now use Proposition D.1 to control the sample variance estimates of the arms chosen by the Separate Arm Estimator. Theorem D.1 derives a concentration bound on the maximum absolute error of $\tilde\sigma^2_x = \min\{\max\{\sigma^2_{\min}, x^\top\hat\Sigma^{SA}_\Gamma x\}, \sigma^2_{\max}\}$. This bound, which is $O(d^4)$ inside the square root, scales unfavorably with dimension, as highlighted in the theoretical and empirical comparisons conducted in Section 3.

Theorem D.1. Set $\hat\Sigma^{SA}_\Gamma$ as the output of Alg. 4 and define $B$ such that $\|x\|_2 \le B$ for all $x \in \mathcal{X}$. Defining $\tilde\sigma^2_x = \min\{\max\{\sigma^2_{\min}, x^\top\hat\Sigma^{SA}_\Gamma x\}, \sigma^2_{\max}\}$, $\sigma^2_x = x^\top\Sigma^* x$ and $\delta \in (0,1)$, with probability at least $1-\delta/2$, for any $x \in \mathcal{X}$,
\[
|\tilde\sigma^2_x - \sigma^2_x| \le \frac{10\sqrt{2}B^2}{\zeta_{\min}(\Phi_{U^*})}\max\Big\{\sqrt{\tfrac{32\sigma^4_{\max}d^4\log(8M/\delta)}{\Gamma}},\; \tfrac{2\sigma^2_{\max}d^3\log(8M/\delta)}{\Gamma}\Big\}.
\]
Note that we set the probability bound to $1-\delta/2$ instead of $1-\delta$ because we want to be able to leverage these variance estimation bounds with the identification algorithms of Section 4 (and so are anticipating another union bound).

Proof. Label $M = d(d+1)/2$ and $\mathcal{X}_{U^*} = \{m : \phi_m \in U^*\}$, where $U^*$ is defined in Alg. 4. Additionally, in Alg. 4 we observe $\{y_{m,t}\}_{t=1}^{\Gamma/M}$ for each arm $m \in \mathcal{X}_{U^*}$, and define the sample average for $m \in \mathcal{X}_{U^*}$ as $\bar y_m = (M/\Gamma)\sum_{t=1}^{\Gamma/M}y_{m,t}$. We obtain an estimate of $\Sigma^*$ via
\[
\mathrm{vech}(\hat\Sigma^{SA}_\Gamma) = \arg\min_{s\in\mathbb{R}^M}\sum_{m\in\mathcal{X}_{U^*}}\sum_{t=1}^{\Gamma/M}\big\{\langle\phi_m, s\rangle - (\bar y_m - y_{m,t})^2\big\}^2. \tag{5}
\]
Label $z_{m,t} = (\bar y_m - y_{m,t})^2 \in \mathbb{R}$, $\Psi^* = \mathrm{vech}(\Sigma^*) \in \mathbb{R}^M$, $\tilde\Psi_\Gamma = \mathrm{vech}(\hat\Sigma^{SA}_\Gamma) \in \mathbb{R}^M$ and $\bar z_m = (M/\Gamma)\sum_{t=1}^{\Gamma/M}z_{m,t}$. We reformulate Eq. 5 as follows:
\[
\begin{aligned}
\tilde\Psi_\Gamma &= \arg\min_{s}\sum_{m\in\mathcal{X}_{U^*}}\sum_{t=1}^{\Gamma/M}(\langle\phi_m,s\rangle - z_{m,t})^2
= \arg\min_{s}\sum_{m}\sum_{t}\big\{\langle\phi_m,s\rangle^2 - 2z_{m,t}\langle\phi_m,s\rangle + z_{m,t}^2\big\}\\
&= \arg\min_{s}\sum_{m}(\Gamma/M)\big\{\langle\phi_m,s\rangle^2 - 2\langle\phi_m,s\rangle\bar z_m\big\}
\overset{(a)}{=} \arg\min_{s}\sum_{m}\big\{\langle\phi_m,s\rangle^2 - 2\langle\phi_m,s\rangle\bar z_m + \bar z_m^2\big\}
= \arg\min_{s}\sum_{m}(\langle\phi_m,s\rangle - \bar z_m)^2,
\end{aligned}
\]
where (a) follows because $\sum_t z_{m,t}^2/(\Gamma/M)$ and $\bar z_m^2$ do not depend on $s$. Defining $\bar z = [\bar z_m]_{m\in\mathcal{X}_{U^*}} \in \mathbb{R}^M$, Eq. 5 becomes $\tilde\Psi_\Gamma = \arg\min_s \|\Phi_{U^*}s - \bar z\|_2^2$. We find that
\[
\|\Phi_{U^*}(\tilde\Psi_\Gamma - \Psi^*)\|_2 \le \|\Phi_{U^*}\tilde\Psi_\Gamma - \bar z\|_2 + \|\bar z - \Phi_{U^*}\Psi^*\|_2 \le 2\|\Phi_{U^*}\Psi^* - \bar z\|_2.
\]
Consequently, we can focus on bounding $\|\Phi_{U^*}\Psi^* - \bar z\|_2$ in order to control $\|\Phi_{U^*}(\tilde\Psi_\Gamma - \Psi^*)\|_2$. Note that the $m$-th entry of $\Phi_{U^*}\Psi^* - \bar z$ is $\phi_m^\top\Psi^* - \bar z_m$, and that $\bar z_m$ is the sample variance of the observations $\{y_{m,t}\}_{t=1}^{\Gamma/M}$. Using Proposition D.1 and the fact that the $y_{m,t}$ are $\sigma^2_{\max}$-sub-Gaussian, with probability $1-\delta/(2M)$,
\[
|\phi_m^\top\Psi^* - \bar z_m| \le 5\max\Big\{\sqrt{\tfrac{32\sigma^4_{\max}M\log(8M/\delta)}{\Gamma}},\; \tfrac{2\sigma^2_{\max}M\log(8M/\delta)}{\Gamma}\Big\}. \tag{6}
\]
Applying Eq. 6 to each of the $M$ entries and a union bound, with probability $1-\delta/2$,
\[
\|\Phi_{U^*}\Psi^* - \bar z\|_2 \le 5\max\Big\{\sqrt{\tfrac{32\sigma^4_{\max}M^2\log(8M/\delta)}{\Gamma}},\; \tfrac{2\sigma^2_{\max}M^{3/2}\log(8M/\delta)}{\Gamma}\Big\},
\]
and consequently $\|\Phi_{U^*}(\tilde\Psi_\Gamma - \Psi^*)\|_2 \le 10\max\big\{\sqrt{32\sigma^4_{\max}M^2\log(8M/\delta)/\Gamma},\; 2\sigma^2_{\max}M^{3/2}\log(8M/\delta)/\Gamma\big\}$.

We now analyze the absolute error of the estimate $\tilde\sigma^2_x$ for an arbitrary action $x \in \mathcal{X}$. Since $\Phi_{U^*}$ is non-singular by construction, for $q := (\Phi_{U^*}^{-1})^\top\phi_x \in \mathbb{R}^M$,
\[
|\tilde\Psi_\Gamma^\top\phi_x - \sigma^2_x| = |(\tilde\Psi_\Gamma - \Psi^*)^\top\phi_x| = |(\tilde\Psi_\Gamma - \Psi^*)^\top\Phi_{U^*}^\top q| \le \|\Phi_{U^*}(\tilde\Psi_\Gamma - \Psi^*)\|_2\,\|(\Phi_{U^*}^{-1})^\top\|_2\,\|\phi_x\|_2 \overset{(a)}{=} \frac{\|\Phi_{U^*}(\tilde\Psi_\Gamma - \Psi^*)\|_2\,\|\phi_x\|_2}{\zeta_{\min}(\Phi_{U^*})},
\]
where (a) follows from the definition of the spectral norm. We now leverage the structure of the duplication matrix $G$ and the fact that $\|x\|_2 \le B$ to bound $\|\phi_x\|_2^2$:
\[
\|\phi_x\|_2^2 = \mathrm{vec}(xx^\top)^\top GG^\top\mathrm{vec}(xx^\top) = \mathrm{vec}(xx^\top)^\top G(G^\top G)^{-1/2}(G^\top G)(G^\top G)^{-1/2}G^\top\mathrm{vec}(xx^\top) \overset{(a)}{\le} 2\,\mathrm{vec}(xx^\top)^\top G(G^\top G)^{-1}G^\top\mathrm{vec}(xx^\top) \overset{(b)}{=} 2\,\|\mathrm{vec}(xx^\top)\|_2^2 \le 2B^4,
\]
where (a) follows from the fact that $G^\top G \preceq 2I$ as shown in Equation (61) of [40], and (b) uses the fact that $G(G^\top G)^{-1}G^\top\mathrm{vec}(xx^\top) = \mathrm{vec}(xx^\top)$ as shown in Equations (54) and (52) of [40]. Since $\sigma^2_x \in [\sigma^2_{\min}, \sigma^2_{\max}]$, clipping $x^\top\hat\Sigma^{SA}_\Gamma x$ to this interval can only decrease the error, so $|\tilde\sigma^2_x - \sigma^2_x| \le |\tilde\Psi_\Gamma^\top\phi_x - \sigma^2_x|$. Combining the pieces above with $\|\phi_x\|_2 \le \sqrt{2}B^2$ and $M \le d^2$, we conclude that with probability greater than $1-\delta/2$, for any $x \in \mathcal{X}$,
\[
|\tilde\sigma^2_x - \sigma^2_x| \le \frac{10\sqrt{2}B^2}{\zeta_{\min}(\Phi_{U^*})}\max\Big\{\sqrt{\tfrac{32\sigma^4_{\max}d^4\log(8M/\delta)}{\Gamma}},\; \tfrac{2\sigma^2_{\max}d^3\log(8M/\delta)}{\Gamma}\Big\}.
\]

D.2 Analysis of the HEAD Estimator

We now present the maximum absolute error guarantees of the HEAD procedure, Alg. 1.

Proof of Theorem 3.1. In this proof, we bound the estimation error $|\hat\sigma^2_x - \sigma^2_x|$ for every $x \in \mathcal{X}$. Define $\sigma^2_{\max} = \max_{x\in\mathcal{X}}\sigma^2_x$, $\Psi^* = \mathrm{vech}(\Sigma^*)$, $N_1 = \{1, \ldots, \Gamma/2\}$, $N_2 = \{\Gamma/2+1, \ldots, \Gamma\}$ and
\[
\hat\Psi_\Gamma = \arg\min_{s\in\mathbb{R}^M}\sum_{t\in N_2}\big\{s^\top\phi_{x_t} - (y_t - x_t^\top\hat\theta_{\Gamma/2})^2\big\}^2,
\]
where $\hat\Psi_\Gamma = \mathrm{vech}(\hat\Sigma_\Gamma)$. Furthermore, define the vector $e$ with entries $e_t = (y_t - x_t^\top\hat\theta_{\Gamma/2})^2$ and $\Phi$ such that the rows of $\Phi$ are composed of $\{\phi_{x_t}\}_{t\in N_2}$. We find that $\hat\Psi_\Gamma = (\Phi^\top\Phi)^{-1}\Phi^\top e$. For any $\phi_x \in \mathcal{W}$, we are interested in bounding $|\phi_x^\top(\hat\Psi_\Gamma - \Psi^*)|$:
\[
|\phi_x^\top(\hat\Psi_\Gamma - \Psi^*)| = \big|\phi_x^\top\big\{(\Phi^\top\Phi)^{-1}\Phi^\top e - \Psi^*\big\}\big| = |\underbrace{\phi_x^\top(\Phi^\top\Phi)^{-1}\Phi^\top}_{z^\top}(e - \Phi\Psi^*)|.
\]
Define $z^\top = \phi_x^\top(\Phi^\top\Phi)^{-1}\Phi^\top$ and the matrix $X$ whose rows are composed of $\{x_t\}_{t\in N_2}$.
We can bound this expression with the triangle inequality. Writing $y_t - x_t^\top\hat\theta_{\Gamma/2} = \eta_t - x_t^\top(\hat\theta_{\Gamma/2} - \theta^*)$, so that $e_t = \{x_t^\top(\hat\theta_{\Gamma/2}-\theta^*)\}^2 + \eta_t^2 - 2x_t^\top(\hat\theta_{\Gamma/2}-\theta^*)\eta_t$, we obtain
\[
|\phi_x^\top(\hat\Psi_\Gamma - \Psi^*)| \le \Big|\sum_{t\in N_2}z_t\{x_t^\top(\hat\theta_{\Gamma/2}-\theta^*)\}^2\Big| + \Big|\sum_{t\in N_2}z_t(\eta_t^2 - \phi_{x_t}^\top\Psi^*)\Big| + \Big|\sum_{t\in N_2}2z_t x_t^\top(\hat\theta_{\Gamma/2}-\theta^*)\eta_t\Big|.
\]
We can simplify the first term as
\[
\sum_{t\in N_2}z_t\{x_t^\top(\hat\theta_{\Gamma/2}-\theta^*)\}^2 = \sum_{t\in N_2}z_t\phi_{x_t}^\top\mathrm{vech}\{(\hat\theta_{\Gamma/2}-\theta^*)(\hat\theta_{\Gamma/2}-\theta^*)^\top\} = \phi_x^\top(\Phi^\top\Phi)^{-1}\Phi^\top\Phi\,\mathrm{vech}\{(\hat\theta_{\Gamma/2}-\theta^*)(\hat\theta_{\Gamma/2}-\theta^*)^\top\} = \{x^\top(\hat\theta_{\Gamma/2}-\theta^*)\}^2.
\]
Additionally, the third term is equivalent to $2|(\eta\odot z)^\top X(\hat\theta_{\Gamma/2}-\theta^*)|$, where $\odot$ is element-wise multiplication. Consequently, the absolute error satisfies
\[
|\phi_x^\top(\hat\Psi_\Gamma - \Psi^*)| \le \underbrace{\{x^\top(\hat\theta_{\Gamma/2}-\theta^*)\}^2}_{A} + \underbrace{\Big|\sum_{t\in N_2}z_t(\eta_t^2 - \phi_{x_t}^\top\Psi^*)\Big|}_{B} + \underbrace{2|(\eta\odot z)^\top X(\hat\theta_{\Gamma/2}-\theta^*)|}_{C}.
\]
Note that to bound $|\phi_x^\top(\hat\Psi_\Gamma - \Psi^*)|$ with probability $1-\delta/2$, it suffices to bound quantities $A$, $B$ and $C$ each with probability $1-\delta/6$ in anticipation of a union bound.

Quantity $A$. We sample $\{x_t : t \in N_1\}$ according to the rounded G-optimal design over $\mathcal{X}$ constructed in Stage 1 of Alg. 1. We use the concentration bound of Proposition B.1 together with Proposition B.5 and the rounding procedure guarantee of Proposition A.1 to bound the design-dependent factor: for $\epsilon \in (0, 1/3]$, with probability $1-\delta/6$,
\[
A \le \frac{4d(1+6\epsilon)\sigma^2_{\max}\log(12|\mathcal{X}|/\delta)}{\Gamma}.
\]

Quantity $B$. According to Proposition B.3, $\eta_t^2$ is sub-exponential with parameters $(4\sqrt{2}\sigma^2_{\max}, 4\sigma^2_{\max})$. Consequently, by Proposition B.4, $B$ is sub-exponential with parameters
\[
\Big(\sqrt{\textstyle\sum_{t\in N_2}z_t^2\cdot 32\sigma^4_{\max}},\; \max_{t\in N_2}|z_t|\cdot 4\sigma^2_{\max}\Big).
\]
We now invoke the G-optimal design over $\mathcal{W}$-space conducted in Stage 2 of Alg. 1 and Propositions B.5 and A.1 to find that for $\epsilon \in (0,1/3]$ (using $\dim(\mathcal{W}) = M \le d^2$ and $|N_2| = \Gamma/2$),
\[
\sum_{t\in N_2}z_t^2 = \sum_{t\in N_2}\phi_x^\top(\Phi^\top\Phi)^{-1}\phi_{x_t}\phi_{x_t}^\top(\Phi^\top\Phi)^{-1}\phi_x = \phi_x^\top(\Phi^\top\Phi)^{-1}\phi_x \le \frac{2d^2(1+6\epsilon)}{\Gamma}, \tag{7}
\]
and similarly
\[
|z_t| = |\phi_x^\top(\Phi^\top\Phi)^{-1}\phi_{x_t}| \le \sqrt{\phi_x^\top(\Phi^\top\Phi)^{-1}\phi_x}\,\sqrt{\phi_{x_t}^\top(\Phi^\top\Phi)^{-1}\phi_{x_t}} \le \frac{2d^2(1+6\epsilon)}{\Gamma}.
\]
Therefore, $B$ has sub-exponential parameters $\big(\sqrt{64\sigma^4_{\max}d^2(1+6\epsilon)/\Gamma},\; 8\sigma^2_{\max}d^2(1+6\epsilon)/\Gamma\big)$. Using Proposition B.2, with probability $1-\delta/6$,
\[
B \le \max\Big\{\sqrt{\tfrac{128\log(12/\delta)\sigma^4_{\max}d^2(1+6\epsilon)}{\Gamma}},\; \tfrac{16\log(12/\delta)\sigma^2_{\max}d^2(1+6\epsilon)}{\Gamma}\Big\}.
\]

Quantity $C$. Construct $V \in \mathbb{R}^{\Gamma/2\times d}$ such that the rows of $V$ correspond to $\{x_t : t \in N_1\}$. We can upper bound quantity $C$ using the Cauchy-Schwarz inequality in the following manner:
\[
|(\eta\odot z)^\top X(\hat\theta_{\Gamma/2}-\theta^*)| = |(\eta\odot z)^\top X(V^\top V)^{-1/2}(V^\top V)^{1/2}(\hat\theta_{\Gamma/2}-\theta^*)| \le \|(\eta\odot z)^\top X(V^\top V)^{-1/2}\|_2\,\|(V^\top V)^{1/2}(\hat\theta_{\Gamma/2}-\theta^*)\|_2.
\]
By Proposition B.8, with probability $1-\delta/12$,
\[
\|(V^\top V)^{1/2}(\hat\theta_{\Gamma/2}-\theta^*)\|_2 \le 2\sqrt{2\sigma^2_{\max}\{d\log(6) + \log(12/\delta)\}}.
\]
Now construct $D_z = \mathrm{diag}(\{z_t : t\in N_2\})$ and express $\|(\eta\odot z)^\top X(V^\top V)^{-1/2}\|_2 = \sqrt{\eta^\top D_z X(V^\top V)^{-1}X^\top D_z\eta}$. We first establish that, by the G-optimal designs in both $\mathcal{X}$- and $\mathcal{W}$-space along with Propositions B.5 and A.1,
\[
\mathrm{tr}\big(D_zX(V^\top V)^{-1}X^\top D_z\big) = \sum_{t\in N_2}z_t^2\,x_t^\top(V^\top V)^{-1}x_t \overset{(a)}{\le} \frac{4d^3(1+6\epsilon)^2}{\Gamma^2}, \tag{8}
\]
where $\epsilon \in (0,1/3]$ and (a) follows from Eq. 7 together with $\max_{t}x_t^\top(V^\top V)^{-1}x_t \le 2d(1+6\epsilon)/\Gamma$. Labeling $\Lambda = D_zX(V^\top V)^{-1}X^\top D_z$, we use Proposition B.7 to bound $\eta^\top\Lambda\eta$ with probability $1-\delta/12$:
\[
\begin{aligned}
\eta^\top\Lambda\eta &< \sigma^2_{\max}\big\{\mathrm{tr}(\Lambda) + 2\{\mathrm{tr}(\Lambda^2)\log(12/\delta)\}^{1/2} + 2\|\Lambda\|_2\log(12/\delta)\big\} && (9)\\
&\le \sigma^2_{\max}\Big\{\tfrac{4d^3(1+6\epsilon)^2}{\Gamma^2} + 2\|\Lambda\|_F\log(12/\delta)^{1/2} + 2\|\Lambda\|_2\log(12/\delta)\Big\} && (10)\\
&\le \sigma^2_{\max}\Big\{\tfrac{4d^3(1+6\epsilon)^2}{\Gamma^2} + 2\|\Lambda\|_F\log(12/\delta)^{1/2} + 2\|\Lambda\|_F\log(12/\delta)\Big\} && (11)\\
&\le \sigma^2_{\max}\Big\{\tfrac{4d^3(1+6\epsilon)^2}{\Gamma^2} + 2\|\Lambda^{1/2}\|_F^2\log(12/\delta)^{1/2} + 2\|\Lambda^{1/2}\|_F^2\log(12/\delta)\Big\} && (12)\\
&= \sigma^2_{\max}\Big\{\tfrac{4d^3(1+6\epsilon)^2}{\Gamma^2} + 2\,\mathrm{tr}(\Lambda)\log(12/\delta)^{1/2} + 2\,\mathrm{tr}(\Lambda)\log(12/\delta)\Big\} && (13)\\
&\le \sigma^2_{\max}\Big\{\tfrac{4d^3(1+6\epsilon)^2}{\Gamma^2} + 2\tfrac{4d^3(1+6\epsilon)^2}{\Gamma^2}\log(12/\delta)^{1/2} + 2\tfrac{4d^3(1+6\epsilon)^2}{\Gamma^2}\log(12/\delta)\Big\} && (14)\\
&\le 5\sigma^2_{\max}\tfrac{4d^3(1+6\epsilon)^2}{\Gamma^2}\log(12/\delta), && (15)
\end{aligned}
\]
where line (9) follows from Proposition B.7, line (10) from Eq. 8 and $\mathrm{tr}(\Lambda^2)^{1/2} = \|\Lambda\|_F$, lines (11) and (12) use Proposition B.6 ($\|\Lambda\|_2 \le \|\Lambda\|_F$ and $\|\Lambda\|_F = \|\Lambda^{1/2}\Lambda^{1/2}\|_F \le \|\Lambda^{1/2}\|_F^2$), line (13) uses the definition of the Frobenius norm ($\|\Lambda^{1/2}\|_F^2 = \mathrm{tr}(\Lambda)$), line (14) applies Eq. 8 again, and line (15) follows since $1 + 2\log(12/\delta)^{1/2} + 2\log(12/\delta) \le 5\log(12/\delta)$ for $\delta \in (0,1)$.

Consequently, with probability $1-\delta/6$,
\[
C = 2|(\eta\odot z)^\top X(\hat\theta_{\Gamma/2}-\theta^*)| \le 2\|(\eta\odot z)^\top X(V^\top V)^{-1/2}\|_2\,\|(V^\top V)^{1/2}(\hat\theta_{\Gamma/2}-\theta^*)\|_2 \le 4\sqrt{10\sigma^4_{\max}\tfrac{4d^3(1+6\epsilon)^2}{\Gamma^2}\log(12/\delta)\{d\log(6)+\log(12/\delta)\}}.
\]
Combining quantities $A$, $B$ and $C$ and applying a union bound, we have with probability $1-\delta/2$,
\[
|\phi_x^\top(\hat\Psi_\Gamma - \Psi^*)| \le \max\Big\{\sqrt{\tfrac{128\log(12/\delta)\sigma^4_{\max}d^2(1+6\epsilon)}{\Gamma}},\, \tfrac{16\log(12/\delta)\sigma^2_{\max}d^2(1+6\epsilon)}{\Gamma}\Big\} + \tfrac{4d\sigma^2_{\max}\log(12|\mathcal{X}|/\delta)(1+6\epsilon)}{\Gamma} + 4\sqrt{10\sigma^4_{\max}\tfrac{4d^3(1+6\epsilon)^2}{\Gamma^2}\log(12/\delta)\{d\log(6)+\log(12/\delta)\}}.
\]
We can exploit the dependencies among these terms to simplify this bound in two ways under different conditions. Note that we only have dependence on $|\mathcal{X}|$ through the quantity-$A$ term. If $d - 1 > \log(|\mathcal{X}|)/\log(1/\delta)$ and $\Gamma > 16\log(12/\delta)\sigma^2_{\max}d^2(1+6\epsilon)$, then with probability $1-\delta/2$,
\[
|\phi_x^\top(\hat\Psi_\Gamma - \Psi^*)| \le \sqrt{\tfrac{256\log(12/\delta)\sigma^4_{\max}d^2(1+6\epsilon)}{\Gamma}}.
\]
Alternatively, if just $\Gamma > 16\log(12|\mathcal{X}|/\delta)\sigma^2_{\max}d^2(1+6\epsilon)$, then
\[
|\phi_x^\top(\hat\Psi_\Gamma - \Psi^*)| \le \sqrt{\tfrac{256\log(12|\mathcal{X}|/\delta)\sigma^4_{\max}d^2(1+6\epsilon)}{\Gamma}}.
\]
Since $\sigma^2_{\min} \le \sigma^2_x \le \sigma^2_{\max}$ for all $x \in \mathcal{X}$,
\[
|\phi_x^\top(\hat\Psi_\Gamma - \Psi^*)| = |x^\top\hat\Sigma_\Gamma x - x^\top\Sigma^*x| \ge \big|\min\{\max\{x^\top\hat\Sigma_\Gamma x, \sigma^2_{\min}\}, \sigma^2_{\max}\} - \sigma^2_x\big|.
\]
In summary, if $\Gamma > 16\log(12|\mathcal{X}|/\delta)\sigma^2_{\max}d^2(1+6\epsilon)$,
\[
|\hat\sigma^2_x - \sigma^2_x| \le \sqrt{\frac{C\log(|\mathcal{X}|/\delta)\sigma^4_{\max}d^2}{\Gamma}}, \quad \text{where } C = 256(1+6\epsilon).
\]
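The analysis above repeatedly invokes G-optimal designs over $\mathcal{X}$-space and over the lifted $\mathcal{W}$-space. As a complement, here is a minimal Frank-Wolfe sketch for computing such designs numerically. The function name `fw_design`, the fixed iteration count, the step size schedule, and the small ridge added for numerical stability are our choices, and no rounding procedure (Proposition A.1) is implemented; setting `targets = features` recovers the G-optimal design of Definition B.3, while other target sets give the transductive designs used later in the identification proofs.

```python
# Sketch of a Frank-Wolfe solver for min_lambda max_{v in targets} v' A(lambda)^{-1} v,
# with A(lambda) = sum_x lambda_x f_x f_x'. Assumes `features` spans R^d.
import numpy as np

def fw_design(features, targets, iters=2000):
    n, d = features.shape
    lam = np.ones(n) / n
    for k in range(iters):
        A = features.T @ (lam[:, None] * features) + 1e-10 * np.eye(d)
        Ainv = np.linalg.inv(A)
        # worst-case target direction under the current design
        vals = np.einsum('ij,jk,ik->i', targets, Ainv, targets)
        v = targets[np.argmax(vals)]
        # Frank-Wolfe step: the gradient of ||v||^2_{A^{-1}} w.r.t. lambda_x is
        # -(f_x' A^{-1} v)^2, so mass goes to the most aligned feature.
        scores = (features @ (Ainv @ v)) ** 2
        i = np.argmax(scores)
        step = 2.0 / (k + 3)
        lam = (1 - step) * lam
        lam[i] += step
    return lam
```

For example, `fw_design(X, X)` approximates the design whose value Proposition B.5 guarantees to be $d$, and `fw_design(W, W)` does the same over the feature set $\mathcal{W}$.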
We now prove a simple lemma relating the additive bound on $|\hat\sigma^2_x - \sigma^2_x|$ to a multiplicative bound $1 - C_{\Gamma,\delta} < \sigma^2_x/\hat\sigma^2_x < 1 + C_{\Gamma,\delta}$.

Lemma D.1. Define $\hat\sigma^2_x = \min\{\max\{x^\top\hat\Sigma_\Gamma x, \sigma^2_{\min}\}, \sigma^2_{\max}\}$. Letting $\sigma^2_{\min} = \min_{x\in\mathcal{X}}\sigma^2_x$ and $\kappa = \sigma^2_{\max}/\sigma^2_{\min}$, we have
\[
\mathbb{P}\Big(\forall x \in \mathcal{X}:\; 1 - C_{\Gamma,\delta} < \frac{\sigma^2_x}{\hat\sigma^2_x} < 1 + C_{\Gamma,\delta}\Big) \ge 1 - \delta/2, \quad \text{where } C_{\Gamma,\delta} = \sqrt{\frac{C\log(|\mathcal{X}|/\delta)\kappa^2d^2}{\Gamma}}.
\]

Proof. Using the additive bound established above, with probability greater than $1-\delta/2$ the following holds for all $x \in \mathcal{X}$:
\[
|\sigma^2_x - \hat\sigma^2_x| \le \sqrt{\frac{C\log(|\mathcal{X}|/\delta)\sigma^4_{\max}d^2}{\Gamma}} \;\;\Longrightarrow\;\; \hat\sigma^2_x - \sqrt{\frac{C\log(|\mathcal{X}|/\delta)\sigma^4_{\max}d^2}{\Gamma}} \le \sigma^2_x \le \hat\sigma^2_x + \sqrt{\frac{C\log(|\mathcal{X}|/\delta)\sigma^4_{\max}d^2}{\Gamma}}.
\]
Dividing by $\hat\sigma^2_x$ and using $\hat\sigma^2_x \ge \sigma^2_{\min}$ gives
\[
1 - \sqrt{\frac{C\log(|\mathcal{X}|/\delta)\sigma^4_{\max}d^2}{\sigma^4_{\min}\Gamma}} \le \frac{\sigma^2_x}{\hat\sigma^2_x} \le 1 + \sqrt{\frac{C\log(|\mathcal{X}|/\delta)\sigma^4_{\max}d^2}{\sigma^4_{\min}\Gamma}},
\]
that is, $1 - C_{\Gamma,\delta} \le \sigma^2_x/\hat\sigma^2_x \le 1 + C_{\Gamma,\delta}$ with $C_{\Gamma,\delta} = \sqrt{C\log(|\mathcal{X}|/\delta)\kappa^2d^2/\Gamma}$.
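The proof of Theorem 3.1 describes a two-stage procedure: half the budget on a G-optimal design over $\mathcal{X}$ to estimate $\theta^*$, and half on a G-optimal design over $\mathcal{W}$ to regress the squared residuals. The following schematic NumPy sketch mirrors that structure under simplifying assumptions: it reuses the hypothetical helpers `fw_design`, `phi`, and `pull` introduced in the earlier sketches, and it replaces the exact rounding procedure of Proposition A.1 with a multinomial draw, which is our simplification and not the paper's Algorithm 1.

```python
# Schematic sketch of the two-stage HEAD procedure analyzed in Theorem 3.1.
import numpy as np

def head_estimator(X, pull, Gamma, rng):
    n1 = Gamma // 2
    # Stage 1: design over X, ordinary least squares for theta.
    lam1 = fw_design(X, X)
    lam1 = lam1 / lam1.sum()
    counts1 = rng.multinomial(n1, lam1)          # crude stand-in for rounding
    xs, ys = [], []
    for i, c in enumerate(counts1):
        for _ in range(c):
            xs.append(X[i]); ys.append(pull(X[i]))
    xs, ys = np.array(xs), np.array(ys)
    theta_hat, *_ = np.linalg.lstsq(xs, ys, rcond=None)
    # Stage 2: design over W = {phi_x}, regress squared Stage-1 residuals.
    W = np.array([phi(x) for x in X])
    lam2 = fw_design(W, W)
    lam2 = lam2 / lam2.sum()
    counts2 = rng.multinomial(Gamma - n1, lam2)
    feats, targets = [], []
    for i, c in enumerate(counts2):
        for _ in range(c):
            y = pull(X[i])
            feats.append(W[i]); targets.append((y - X[i] @ theta_hat) ** 2)
    vech_sigma, *_ = np.linalg.lstsq(np.array(feats), np.array(targets), rcond=None)
    return theta_hat, vech_sigma
```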
E Proofs of Best-Arm Identification

We divide the proofs of Theorems 4.1 and 4.2 between Appendices E and F for ease of exposition. In Appendix E, we consider the transductive linear best-arm identification (BAI) problem, in which we are interested in identifying $z^* = \arg\max_{z\in\mathcal{Z}} z^\top\theta^*$ with probability $1-\delta$ for $\delta \in (0,1)$. In Appendix F, we consider the level set (LS) counterpart. Recall that for a set $V$, $\mathcal{P}_V := \{\lambda \in \mathbb{R}^{|V|} : \sum_{v\in V}\lambda_v = 1, \lambda_v \ge 0\;\forall v\}$ denotes the set of distributions over $V$ and $\mathcal{Y}(V) := \{v - v' : v, v' \in V, v \ne v'\}$ is the set of differences. We begin with the lower bound.

E.1 Lower Bound

Proof of Theorem 4.1 for transductive best-arm identification. This proof is similar to the lower bound proofs of [35] and [12]. Let $\mathcal{C} := \{\tilde\theta \in \mathbb{R}^d : \exists\, z_i \in \mathcal{Z}\setminus\{z^*\} \text{ s.t. } \tilde\theta^\top(z^* - z_i) \le 0\}$, i.e., $\tilde\theta \in \mathcal{C}$ if and only if $z^*$ is not the best arm in the linear bandit instance $(\mathcal{X}, \mathcal{Z}, \tilde\theta)$. Additionally, label $K = |\mathcal{X}|$.

We first invoke Lemma 1 of [27]. Under a $\delta$-PAC strategy for finding the best arm of the bandit instance $(\mathcal{X}, \mathcal{Z}, \theta^*)$, let $T_i$ denote the random variable counting the number of times arm $i$ is pulled. Defining $\sigma^2_x = x^\top\Sigma^*x$ for all $x \in \mathcal{X}$, we let $\nu_{\theta,i}$ denote the reward distribution of the $i$-th arm of $\mathcal{X}$, i.e., $\nu_{\theta,i} = \mathcal{N}(x_i^\top\theta, \sigma^2_{x_i})$. For any $\tilde\theta \in \mathcal{C}$ we know from [27] that
\[
\sum_{i=1}^K \mathbb{E}(T_i)\,\mathrm{KL}(\nu_{\theta^*,i}\|\nu_{\tilde\theta,i}) \ge \log(1/2.4\delta).
\]
Following the steps of [12], we find that
\[
\sum_{i=1}^K \mathbb{E}(T_i) \ge \log(1/2.4\delta)\,\min_{\lambda\in\mathcal{P}_{\mathcal{X}}}\max_{\tilde\theta\in\mathcal{C}} \frac{1}{\sum_{i=1}^K\lambda_i\,\mathrm{KL}(\nu_{\theta^*,i}\|\nu_{\tilde\theta,i})}. \tag{17}
\]
Label $\mathcal{Z} = \{z_1, z_2, \ldots, z_n\}$, where $z_1 = z^*$ without loss of generality. For $j \in \{2, \ldots, n\}$, $\lambda \in \mathcal{P}_{\mathcal{X}}$ and $\epsilon > 0$, define $A(\lambda) := \sum_{i=1}^K\lambda_i x_ix_i^\top/\sigma^2_{x_i}$, $w_j = z^* - z_j$ and
\[
\theta_j(\epsilon, \lambda) = \theta^* - (w_j^\top\theta^* + \epsilon)\frac{A(\lambda)^{-1}w_j}{w_j^\top A(\lambda)^{-1}w_j}.
\]
Note that $w_j^\top\theta_j(\epsilon,\lambda) = -\epsilon < 0$, implying that $\theta_j \in \mathcal{C}$. We find that the KL divergence between $\nu_{\theta^*,i}$ and $\nu_{\theta_j(\epsilon,\lambda),i}$ is given by
\[
\mathrm{KL}(\nu_{\theta^*,i}\|\nu_{\theta_j(\epsilon,\lambda),i}) = \frac{[x_i^\top\{\theta^* - \theta_j(\epsilon,\lambda)\}]^2}{2\sigma^2_{x_i}} = \frac{(w_j^\top\theta^* + \epsilon)^2}{\{w_j^\top A(\lambda)^{-1}w_j\}^2}\, w_j^\top A(\lambda)^{-1}\frac{x_ix_i^\top}{2\sigma^2_{x_i}}A(\lambda)^{-1}w_j.
\]
Returning to Eq. 17,
\[
\begin{aligned}
\sum_{i=1}^K\mathbb{E}(T_i) &\ge \log(1/2.4\delta)\,\min_{\lambda}\max_{\tilde\theta\in\mathcal{C}}\frac{1}{\sum_i\lambda_i\mathrm{KL}(\nu_{\theta^*,i}\|\nu_{\tilde\theta,i})}
\ge \log(1/2.4\delta)\,\min_{\lambda}\max_{j=2,\ldots,n}\frac{1}{\sum_i\lambda_i\mathrm{KL}(\nu_{\theta^*,i}\|\nu_{\theta_j(\epsilon,\lambda),i})}\\
&\ge \log(1/2.4\delta)\,\min_{\lambda}\max_{j=2,\ldots,n}\frac{\{w_j^\top A(\lambda)^{-1}w_j\}^2}{(w_j^\top\theta^*+\epsilon)^2\,w_j^\top A(\lambda)^{-1}\big(\sum_i\lambda_i\frac{x_ix_i^\top}{2\sigma^2_{x_i}}\big)A(\lambda)^{-1}w_j}
\overset{(a)}{=} 2\log(1/2.4\delta)\,\min_{\lambda}\max_{j=2,\ldots,n}\frac{w_j^\top A(\lambda)^{-1}w_j}{(w_j^\top\theta^*+\epsilon)^2},
\end{aligned}
\]
where (a) uses the fact that $\sum_{i=1}^K\lambda_i x_ix_i^\top/\sigma^2_{x_i} = A(\lambda)$. Letting $\epsilon \to 0$ establishes the result.

E.2 Upper Bound

We now prove Theorem 4.2 by first establishing that Alg. 2 is $\delta$-PAC for the transductive linear bandit best-arm identification problem.

E.2.1 Proof of Correctness

Overview of Correctness Proof. Lemma D.1 along with Lemmas E.1 through E.3 are combined to prove that Algorithm 2 is $\delta$-PAC for the best-arm identification problem. Lemma E.1 leverages Lemma D.1 and establishes that in any round $\ell \in \mathbb{N}$, the estimation error of the difference in mean rewards between $z \in \mathcal{Z}$ and $z^* = \arg\max_{z\in\mathcal{Z}}z^\top\theta^*$ is bounded by $\epsilon_\ell = 2^{-\ell}$ with probability $1-\delta$ for $\delta \in (0,1)$. Lemma E.2 uses this result to prove that $z^* \in \mathcal{Z}_{\ell+1}$ for every round $\ell$ with probability $1-\delta$. This establishes that $z^*$ is not eliminated from the set of items under consideration at any step with high probability. Lastly, Lemma E.3 uses Lemma E.1 to show that Algorithm 2 will eventually identify $z^*$. This is done by proving that Alg. 2 eliminates items whose mean rewards lie outside a shrinking radius (which halves at every step of the algorithm) of the highest reward $z^{*\top}\theta^*$. In other words, with probability $1-\delta$ for $\delta \in (0,1)$, items in $\mathcal{Z}_\ell$ with expected rewards that are more than $2\epsilon_\ell$ away from $z^{*\top}\theta^*$ will not be in $\mathcal{Z}_{\ell+1}$. Since $\epsilon_\ell \to 0$ as $\ell \to \infty$, we conclude that $\lim_{\ell\to\infty}\mathbb{P}(\mathcal{Z}_\ell = \{z^*\}) \ge 1-\delta$.

Lemma E.1. With probability at least $1-\delta$ for $\delta \in (0,1)$, $|\langle z - z^*, \hat\theta_\ell - \theta^*\rangle| \le \epsilon_\ell$ for all $\ell \in \mathbb{N}$ and $z \in \mathcal{Z}$.

Proof. Define $\sigma^2_x = x^\top\Sigma^*x$, $\hat\theta_\ell$ as the weighted least squares estimator constructed in Algorithm 2 at stage $\ell$, and the events
\[
\mathcal{E}_{z,\ell} = \big\{|\langle z - z^*, \hat\theta_\ell - \theta^*\rangle| \le \epsilon_\ell\big\}, \qquad \mathcal{J} = \big\{\sigma^2_x \le \hat\sigma^2_x(1 + C_{\Gamma,\delta}) \le \tfrac{3}{2}\hat\sigma^2_x,\; \forall x \in \mathcal{X}\big\}.
\]
Label $n_\ell$ as the number of data points collected during stage $\ell$, $\Upsilon = \mathrm{diag}(\{\sigma^2_i\}_{i=1}^{n_\ell})$, $\hat\Upsilon = \mathrm{diag}(\{\hat\sigma^2_i\}_{i=1}^{n_\ell})$, $X_\ell$ as the matrix whose rows are composed of the arms $x_{\ell,i}$ drawn during round $\ell$, $Y_\ell$ as the vector of rewards observed during round $\ell$, $A_\ell := \sum_{i=1}^{n_\ell}\hat\sigma_i^{-2}x_{\ell,i}x_{\ell,i}^\top$, and $\tilde\Omega = A_\ell^{-1}X_\ell^\top\hat\Upsilon^{-1}\Upsilon\hat\Upsilon^{-1}X_\ell A_\ell^{-1}$ as the weighted least squares estimator's covariance. We first establish that for $z \in \mathbb{R}^d$,
\[
\mathbb{E}\{z^\top(\hat\theta_\ell - \theta^*)\} = z^\top\{A_\ell^{-1}X_\ell^\top\hat\Upsilon^{-1}\mathbb{E}(Y_\ell) - \theta^*\} = z^\top(A_\ell^{-1}X_\ell^\top\hat\Upsilon^{-1}X_\ell\theta^* - \theta^*) = 0, \qquad \mathbb{V}\{z^\top(\hat\theta_\ell - \theta^*)\} = z^\top\tilde\Omega z.
\]
Furthermore, defining $u = \hat\Upsilon^{-1/2}X_\ell A_\ell^{-1}z \in \mathbb{R}^{n_\ell}$ and conditioning on event $\mathcal{J}$,
\[
z^\top\tilde\Omega z = (z^\top A_\ell^{-1}X_\ell^\top\hat\Upsilon^{-1/2})\hat\Upsilon^{-1/2}\Upsilon\hat\Upsilon^{-1/2}(\hat\Upsilon^{-1/2}X_\ell A_\ell^{-1}z) = u^\top\hat\Upsilon^{-1/2}\Upsilon\hat\Upsilon^{-1/2}u = \sum_{i=1}^{n_\ell}u_i^2\frac{\sigma^2_{x_i}}{\hat\sigma^2_i} \overset{(a)}{\le} \frac{3}{2}\,z^\top A_\ell^{-1}X_\ell^\top\hat\Upsilon^{-1}X_\ell A_\ell^{-1}z = \frac{3}{2}\,z^\top A_\ell^{-1}z,
\]
where (a) is the result of event $\mathcal{J}$. Using this result and Proposition B.1, with probability at least $1 - \frac{\delta}{4\ell^2|\mathcal{Z}|}$,
\[
|\langle z - z^*, \hat\theta_\ell - \theta^*\rangle| \le \|z - z^*\|_{\tilde\Omega}\sqrt{2\log(8\ell^2|\mathcal{Z}|/\delta)} \le \sqrt{\tfrac{3}{2}}\,\|z - z^*\|_{A_\ell^{-1}}\sqrt{2\log(8\ell^2|\mathcal{Z}|/\delta)}.
\]
Conditioned on event $\mathcal{J}$, we have with probability $1 - \frac{\delta}{4\ell^2|\mathcal{Z}|}$,
\[
\begin{aligned}
|\langle z - z^*, \hat\theta_\ell - \theta^*\rangle| &\le \sqrt{\tfrac{3}{2}}\,\|z-z^*\|_{(\sum_{x\in\mathcal{X}}\lceil\tau_\ell\hat\lambda_{\ell,x}\rceil xx^\top/\hat\sigma^2_x)^{-1}}\sqrt{2\log(8\ell^2|\mathcal{Z}|/\delta)}
\le \sqrt{\tfrac{3}{2}}\,\|z-z^*\|_{(\tau_\ell\sum_{x\in\mathcal{X}}\hat\lambda_{\ell,x}xx^\top/\hat\sigma^2_x)^{-1}}\sqrt{2\log(8\ell^2|\mathcal{Z}|/\delta)}\\
&= \sqrt{\frac{3\,\|z-z^*\|^2_{(\sum_{x\in\mathcal{X}}\hat\lambda_{\ell,x}xx^\top/\hat\sigma^2_x)^{-1}}\log(8\ell^2|\mathcal{Z}|/\delta)}{\tau_\ell}}
\le \sqrt{\frac{\epsilon_\ell^2\,\|z-z^*\|^2_{(\sum_{x\in\mathcal{X}}\hat\lambda_{\ell,x}xx^\top/\hat\sigma^2_x)^{-1}}}{q\{\mathcal{Y}(\mathcal{Z}_\ell)\}}} \le \epsilon_\ell,
\end{aligned}
\]
where the last two steps use the choice $\tau_\ell = 3\epsilon_\ell^{-2}q\{\mathcal{Y}(\mathcal{Z}_\ell)\}\log(8\ell^2|\mathcal{Z}|/\delta)$ in Algorithm 2 and the fact that $\hat\lambda_\ell$ attains $q\{\mathcal{Y}(\mathcal{Z}_\ell)\}$, so that $\|z-z^*\|^2_{(\sum_x\hat\lambda_{\ell,x}xx^\top/\hat\sigma^2_x)^{-1}} \le q\{\mathcal{Y}(\mathcal{Z}_\ell)\}$ for $z, z^* \in \mathcal{Z}_\ell$. Applying a union bound, we find that
\[
\mathbb{P}\Big\{\Big(\bigcap_{\ell=1}^\infty\bigcap_{z\in\mathcal{Z}_\ell}\mathcal{E}_{z,\ell}\,\Big|\,\mathcal{J}\Big)\cap\mathcal{J}\Big\} \ge 1 - \sum_{\ell=1}^\infty\sum_{z\in\mathcal{Z}_\ell}\mathbb{P}(\mathcal{E}^c_{z,\ell}|\mathcal{J}) - \mathbb{P}(\mathcal{J}^c) \ge 1 - \sum_{\ell=1}^\infty\frac{\delta}{4\ell^2} - \frac{\delta}{2} \ge 1 - \delta,
\]
which proves the result.

Now we establish that, with probability $1-\delta$ for $\delta \in (0,1)$, the best arm $z^*$ is not eliminated from $\mathcal{Z}_\ell$ in any round $\ell \in \mathbb{N}$.

Lemma E.2. With probability at least $1-\delta$ for $\delta \in (0,1)$, for any $\ell \in \mathbb{N}$ and $z \in \mathcal{Z}_\ell$, $\langle z - z^*, \hat\theta_\ell\rangle \le \epsilon_\ell$.

Proof. We know that with probability $1-\delta$, for all $\ell \in \mathbb{N}$ and $z \in \mathcal{Z}_\ell$,
\[
\langle z - z^*, \hat\theta_\ell\rangle = \langle z - z^*, \hat\theta_\ell - \theta^*\rangle + \langle z - z^*, \theta^*\rangle \le \langle z - z^*, \hat\theta_\ell - \theta^*\rangle \le \epsilon_\ell,
\]
where the first inequality uses $\langle z - z^*, \theta^*\rangle \le 0$ and the last inequality follows from Lemma E.1.

Then, we establish that in round $\ell$, items in $\mathcal{Z}_\ell$ whose mean rewards are farther than $2\epsilon_\ell$ from the highest reward are eliminated.

Lemma E.3. With probability at least $1-\delta$ for $\delta \in (0,1)$, for any $\ell \in \mathbb{N}$ and $z \in \mathcal{Z}_\ell$ such that $\langle z^* - z, \theta^*\rangle > 2\epsilon_\ell$, $\max_{z'\in\mathcal{Z}_\ell}\langle z' - z, \hat\theta_\ell\rangle > \epsilon_\ell$.

Proof. We know that with probability $1-\delta$, for all $\ell \in \mathbb{N}$,
\[
\max_{z'\in\mathcal{Z}_\ell}\langle z' - z, \hat\theta_\ell\rangle \ge \langle z^* - z, \hat\theta_\ell\rangle = \langle z^* - z, \hat\theta_\ell - \theta^*\rangle + \langle z^* - z, \theta^*\rangle \overset{(a)}{>} -\epsilon_\ell + 2\epsilon_\ell = \epsilon_\ell,
\]
where (a) follows from Lemma E.1.

E.2.2 Proof of Sample Complexity

Overview of Sample Complexity Proof. In this section, we prove an upper bound on the sample complexity of Algorithm 2 for transductive best-arm identification with linear bandits. We begin by establishing some basic facts about the partial ordering of Hermitian positive definite matrices in Lemmas E.4 and E.5. Let $A$ and $B$ be Hermitian positive definite matrices in $\mathbb{R}^{d\times d}$ and define $A \succeq B$ if $A - B$ is positive definite.

Lemma E.4. Let $A$ and $B$ be Hermitian positive definite matrices in $\mathbb{R}^{d\times d}$. Then $A \succeq B \Longleftrightarrow I \succeq A^{-1/2}BA^{-1/2}$.

Proof. For any $x \in \mathbb{R}^d$ define $y = A^{1/2}x$. Then
\[
x^\top Ax \ge x^\top Bx \Longleftrightarrow y^\top y \ge y^\top A^{-1/2}BA^{-1/2}y,
\]
and since $y$ ranges over all of $\mathbb{R}^d$ as $x$ does, $A \succeq B$ if and only if $I \succeq A^{-1/2}BA^{-1/2}$.

Lemma E.5. Let $A$ and $B$ be Hermitian positive definite matrices in $\mathbb{R}^{d\times d}$. Then $A \succeq B \Longrightarrow B^{-1} \succeq A^{-1}$.

Proof. We use Lemma E.4 twice:
\[
A \succeq B \Longleftrightarrow I \succeq A^{-1/2}BA^{-1/2} \Longrightarrow I \succeq B^{1/2}A^{-1/2}A^{-1/2}B^{1/2} = B^{1/2}A^{-1}B^{1/2} \Longleftrightarrow B^{-1} \succeq A^{-1},
\]
where the second step follows because $A^{-1/2}BA^{-1/2}$ is similar to $B^{1/2}A^{-1}B^{1/2}$, both are Hermitian positive definite, and they therefore share the same eigenvalues.

We now use Lemmas E.4 and E.5 in our problem context. For $V \subset \mathbb{R}^d$, define $g : (V, \mathcal{P}_{\mathcal{X}}) \to \mathbb{R}$ by $g(v, \lambda) = \|v\|^2_{(\sum_x\lambda_xxx^\top/\hat\sigma^2_x)^{-1}}$, $f : (V, \mathcal{P}_{\mathcal{X}}) \to \mathbb{R}$ by $f(v, \lambda) = \|v\|^2_{(\sum_x\lambda_xxx^\top/\sigma^2_x)^{-1}}$, and
\[
q^*(V) = \min_{\lambda\in\mathcal{P}_{\mathcal{X}}}\max_{v\in V}f(v,\lambda), \qquad q(V) = \min_{\lambda\in\mathcal{P}_{\mathcal{X}}}\max_{v\in V}g(v,\lambda).
\]
Note that the function $q$ is also defined and leveraged in Algorithm 2. Lemma E.6 establishes a multiplicative bound of the form $(1 - C_{\Gamma,\delta})g(v,\lambda) \le f(v,\lambda) \le (1 + C_{\Gamma,\delta})g(v,\lambda)$ using Lemmas D.1 and E.5.

Lemma E.6. Define $\hat\sigma^2_x = \min\{\max\{x^\top\hat\Sigma_\Gamma x, \sigma^2_{\min}\}, \sigma^2_{\max}\}$. For a given $\lambda$, define $\Upsilon = \mathrm{diag}(\{\sigma^2_x/\lambda_x\}_{x\in\mathcal{X}})$ and $\hat\Upsilon = \mathrm{diag}(\{\hat\sigma^2_x/\lambda_x\}_{x\in\mathcal{X}})$, and let $X$ be the matrix whose rows are the arms in $\mathcal{X}$. Assume $C_{\Gamma,\delta} < 1$ and $v \in \mathbb{R}^d$. Then with probability $1-\delta/2$ for $\delta \in (0,1)$,
\[
(1 - C_{\Gamma,\delta})\,v^\top(X^\top\hat\Upsilon^{-1}X)^{-1}v \le v^\top(X^\top\Upsilon^{-1}X)^{-1}v \le (1 + C_{\Gamma,\delta})\,v^\top(X^\top\hat\Upsilon^{-1}X)^{-1}v,
\]
or equivalently $(1 - C_{\Gamma,\delta})g(v,\lambda) \le f(v,\lambda) \le (1 + C_{\Gamma,\delta})g(v,\lambda)$.

Proof. Using Lemma D.1, which gives $(1-C_{\Gamma,\delta})I \preceq \Upsilon^{1/2}\hat\Upsilon^{-1}\Upsilon^{1/2} \preceq (1+C_{\Gamma,\delta})I$, we establish that
\[
v^\top X^\top\hat\Upsilon^{-1}Xv = v^\top X^\top\Upsilon^{-1/2}\big(\Upsilon^{1/2}\hat\Upsilon^{-1}\Upsilon^{1/2}\big)\Upsilon^{-1/2}Xv \le (1 + C_{\Gamma,\delta})\,v^\top X^\top\Upsilon^{-1}Xv, \qquad v^\top X^\top\hat\Upsilon^{-1}Xv \ge (1 - C_{\Gamma,\delta})\,v^\top X^\top\Upsilon^{-1}Xv.
\]
Leveraging the partial ordering of Hermitian positive definite matrices and Lemma E.5, we can then express these inequalities as
\[
(1 - C_{\Gamma,\delta})X^\top\Upsilon^{-1}X \preceq X^\top\hat\Upsilon^{-1}X \preceq (1 + C_{\Gamma,\delta})X^\top\Upsilon^{-1}X \;\Longrightarrow\; \frac{1}{1+C_{\Gamma,\delta}}(X^\top\Upsilon^{-1}X)^{-1} \preceq (X^\top\hat\Upsilon^{-1}X)^{-1} \preceq \frac{1}{1-C_{\Gamma,\delta}}(X^\top\Upsilon^{-1}X)^{-1}.
\]
This implies that $(X^\top\Upsilon^{-1}X)^{-1} \succeq (1 - C_{\Gamma,\delta})(X^\top\hat\Upsilon^{-1}X)^{-1}$ and $(X^\top\Upsilon^{-1}X)^{-1} \preceq (1 + C_{\Gamma,\delta})(X^\top\hat\Upsilon^{-1}X)^{-1}$. By combining these inequalities we arrive at the result.
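The quantities manipulated in Lemmas E.1 and E.6 correspond to concrete per-round computations in Algorithm 2: rescale the arms by the estimated standard deviations, solve the transductive design over the differences $\mathcal{Y}(\mathcal{Z}_\ell)$, take $\lceil\tau_\ell\hat\lambda_{\ell,x}\rceil$ samples, and form the variance-weighted least squares estimate. The sketch below illustrates one such round. It reuses the hypothetical `fw_design` and `pull` helpers from the earlier sketches, adds a tiny ridge term for numerical safety, and is a simplification under those assumptions rather than the paper's exact procedure.

```python
# Sketch of one elimination round of the heteroskedastic BAI algorithm.
import numpy as np

def bai_round(X, Z_active, n_Z, sigma2_hat, eps_l, delta, l, pull):
    d = X.shape[1]
    scaled = X / np.sqrt(sigma2_hat)[:, None]              # rows x / sigma_hat_x
    diffs = np.array([z - zp for z in Z_active for zp in Z_active
                      if not np.allclose(z, zp)])           # Y(Z_l)
    lam = fw_design(scaled, diffs)
    A_lam = scaled.T @ (lam[:, None] * scaled)
    q_val = max(y @ np.linalg.solve(A_lam, y) for y in diffs)
    tau = 3 * eps_l ** -2 * q_val * np.log(8 * l ** 2 * n_Z / delta)
    counts = np.ceil(tau * lam).astype(int)                 # ceil(tau * lambda_x) pulls
    A = 1e-9 * np.eye(d)                                    # weighted Gram matrix A_l
    b = np.zeros(d)
    for i, c in enumerate(counts):
        for _ in range(c):
            y = pull(X[i])
            A += np.outer(X[i], X[i]) / sigma2_hat[i]
            b += X[i] * y / sigma2_hat[i]
    theta_hat = np.linalg.solve(A, b)                       # weighted least squares
    # eliminate z if some z' beats it by more than eps_l under theta_hat
    keep = [z for z in Z_active
            if max((zp - z) @ theta_hat for zp in Z_active) <= eps_l]
    return theta_hat, keep
```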
We now want to use the multiplicative bound of Lemma E.6 to relate $q(V)$ and $q^*(V)$, and eventually relate the sample complexity upper bound of Alg. 2 to the lower bound established in Theorem 4.1. In order to accomplish this goal, we prove some general results concerning minimax problems in Lemmas E.7 and E.8.

Lemma E.7. For $c > 0$ and $\lambda \in \mathcal{P}_{\mathcal{X}}$, if $(1-c)g(v,\lambda) \le f(v,\lambda) \le (1+c)g(v,\lambda)$ for all $v \in V$, then
\[
(1-c)\max_{v\in V}g(v,\lambda) \le \max_{v\in V}f(v,\lambda) \le (1+c)\max_{v\in V}g(v,\lambda).
\]
Proof. For a given $\lambda \in \mathcal{P}_{\mathcal{X}}$, define $v_g(\lambda) = \arg\max_{v\in V}g(v,\lambda)$ and $v_f(\lambda) = \arg\max_{v\in V}f(v,\lambda)$. Then
\[
(1-c)\,g\{v_g(\lambda),\lambda\} \le f\{v_g(\lambda),\lambda\} \le f\{v_f(\lambda),\lambda\} \le (1+c)\,g\{v_f(\lambda),\lambda\} \le (1+c)\,g\{v_g(\lambda),\lambda\}.
\]

Lemma E.8. For $c > 0$, if $(1-c)g(v,\lambda) \le f(v,\lambda) \le (1+c)g(v,\lambda)$ for all $v \in V$ and $\lambda \in \mathcal{P}_{\mathcal{X}}$, then
\[
(1-c)\min_{\lambda\in\mathcal{P}_{\mathcal{X}}}\max_{v\in V}g(v,\lambda) \le \min_{\lambda\in\mathcal{P}_{\mathcal{X}}}\max_{v\in V}f(v,\lambda) \le (1+c)\min_{\lambda\in\mathcal{P}_{\mathcal{X}}}\max_{v\in V}g(v,\lambda).
\]
Proof. For a given $\lambda$, define $v_g(\lambda) = \arg\max_v g(v,\lambda)$ and $v_f(\lambda) = \arg\max_v f(v,\lambda)$, and define $\lambda_f = \arg\min_\lambda\max_v f(v,\lambda)$ and $\lambda_g = \arg\min_\lambda\max_v g(v,\lambda)$. Then
\[
(1-c)\,g\{v_g(\lambda_g),\lambda_g\} \le (1-c)\,g\{v_g(\lambda_f),\lambda_f\} \overset{(a)}{\le} f\{v_f(\lambda_f),\lambda_f\} \le f\{v_f(\lambda_g),\lambda_g\} \overset{(b)}{\le} (1+c)\,g\{v_g(\lambda_g),\lambda_g\},
\]
where steps (a) and (b) follow from Lemma E.7, and the middle quantity is exactly $\min_\lambda\max_v f(v,\lambda)$.

Leveraging Lemma E.8, we bound $q^*(V)$ in terms of $q(V)$ in Lemma E.9.

Lemma E.9. With probability $1-\delta/2$ for $\delta \in (0,1)$, $(1 - C_{\Gamma,\delta})\,q(V) \le q^*(V) \le (1 + C_{\Gamma,\delta})\,q(V)$.

Proof. From Lemma E.6, we know that $(1-C_{\Gamma,\delta})g(v,\lambda) \le f(v,\lambda) \le (1+C_{\Gamma,\delta})g(v,\lambda)$ for all $v$ and $\lambda$ with probability $1-\delta/2$. According to Lemma E.8, this implies
\[
(1-C_{\Gamma,\delta})\min_\lambda\max_v g(v,\lambda) \le \min_\lambda\max_v f(v,\lambda) \le (1+C_{\Gamma,\delta})\min_\lambda\max_v g(v,\lambda).
\]

In round $\ell$ of Algorithm 2, we sample arms in $\mathcal{X}$ according to a given design $\hat\lambda_\ell$. Although other rounding procedures are possible (see Proposition A.1), Theorem 4.2 assumes that we take $\lceil\tau_\ell\hat\lambda_{\ell,x}\rceil$ samples of arm $x \in \mathcal{X}$ to accomplish this goal. This results in more than $\tau_\ell$ samples being taken in round $\ell$. We characterize the sparsity of any design $\hat\lambda_\ell$ using Caratheodory's theorem to bound the number of extra samples incurred by sampling in this manner.

Lemma E.10 (Appendix 8, [44]). In Algorithm 2, the support of $\hat\lambda_\ell$ for all $\ell \in \mathbb{N}$ has size at most $d(d+1)/2 + 1$.

Proof. Any design matrix in $\mathbb{R}^{d\times d}$ is symmetric and can consequently be expressed as a point in $\mathbb{R}^M$, $M = d(d+1)/2$. The design matrices leveraged in this paper belong to the convex hull of a subset of such points, since they are convex combinations of $xx^\top$ for $x \in \mathcal{X}$. According to Caratheodory's theorem, any point in the convex hull of a subset of points in $\mathbb{R}^D$ can be expressed as a convex combination of at most $D+1$ of those points. The optimal experimental designs in this paper can therefore be represented using only $M+1$ points [44]. This sparsity holds regardless of the design's form, and so applies to both the homoskedastic and heteroskedastic variance settings.

Finally, we connect the sample complexity upper bound of Algorithm 2 to the lower bound for $\delta$-PAC algorithms established in Theorem 4.1 for transductive best-arm identification.

Proof of Theorem 4.2 for transductive best-arm identification. Assume without loss of generality that $\max_{z\in\mathcal{Z}}\langle z^* - z, \theta^*\rangle \le 2$. Define $\mathcal{S}_\ell = \{z \in \mathcal{Z}_\ell : \langle z^* - z, \theta^*\rangle \le 4\epsilon_\ell\}$. Note that by this assumption $\mathcal{Z} = \mathcal{Z}_1 = \mathcal{S}_1$. Lemma E.3 implies that with probability at least $1-\delta$ for $\delta \in (0,1)$, $\mathcal{Z}_\ell \subseteq \mathcal{S}_\ell$ for all $\ell \in \mathbb{N}$. This means that
\[
q\{\mathcal{Y}(\mathcal{Z}_\ell)\} = \inf_{\lambda\in\mathcal{P}_{\mathcal{X}}}\max_{z,z'\in\mathcal{Z}_\ell}\|z - z'\|^2_{(\sum_x\lambda_xxx^\top/\hat\sigma^2_x)^{-1}} \le \inf_{\lambda\in\mathcal{P}_{\mathcal{X}}}\max_{z,z'\in\mathcal{S}_\ell}\|z - z'\|^2_{(\sum_x\lambda_xxx^\top/\hat\sigma^2_x)^{-1}} = q\{\mathcal{Y}(\mathcal{S}_\ell)\}.
\]
For $\ell \ge \lceil\log_2(4\Delta^{-1})\rceil$, where $\Delta := \min_{z\in\mathcal{Z}\setminus\{z^*\}}\langle z^* - z, \theta^*\rangle$ denotes the minimum gap, we know that $\mathcal{S}_\ell = \{z^*\}$. Thus, the sample size needed to identify $z^*$ is upper bounded as follows.
Write $L = \lceil\log_2(4\Delta^{-1})\rceil$ for brevity. Then
\[
\sum_{\ell=1}^{L}\sum_{x\in\mathcal{X}}\lceil\tau_\ell\hat\lambda_{\ell,x}\rceil \overset{(a)}{\le} \sum_{\ell=1}^{L}\Big\{3\epsilon_\ell^{-2}q\{\mathcal{Y}(\mathcal{Z}_\ell)\}\log(8\ell^2|\mathcal{Z}|/\delta) + \tfrac{d(d+1)}{2} + 1\Big\} \le \Big(\tfrac{d(d+1)}{2}+1\Big)L + \sum_{\ell=1}^{L}3\cdot 2^{2\ell}q\{\mathcal{Y}(\mathcal{S}_\ell)\}\log(8\ell^2|\mathcal{Z}|/\delta) \le \Big(\tfrac{d(d+1)}{2}+1\Big)L + 3\log\big(8L^2|\mathcal{Z}|/\delta\big)\sum_{\ell=1}^{L}2^{2\ell}q\{\mathcal{Y}(\mathcal{S}_\ell)\},
\]
where (a) follows from Lemma E.10 and the definition of $\tau_\ell$. We relate this quantity to the lower bound in Theorem 4.1 below. With probability $1-\delta$,
\[
\begin{aligned}
\psi^*_{\mathrm{BAI}} &= \inf_{\lambda\in\mathcal{P}_{\mathcal{X}}}\max_{z\in\mathcal{Z}\setminus\{z^*\}}\frac{\|z^*-z\|^2_{A(\lambda)^{-1}}}{\{(z^*-z)^\top\theta^*\}^2}
= \inf_{\lambda}\max_{\ell\le L}\max_{z\in\mathcal{S}_\ell\setminus\{z^*\}}\frac{\|z^*-z\|^2_{A(\lambda)^{-1}}}{\{(z^*-z)^\top\theta^*\}^2}
\ge \inf_{\lambda}\max_{\ell\le L}\max_{z\in\mathcal{S}_\ell\setminus\{z^*\}}\frac{\|z^*-z\|^2_{A(\lambda)^{-1}}}{(4\epsilon_\ell)^2}\\
&\overset{(a)}{\ge} \frac{1}{L}\inf_{\lambda}\sum_{\ell=1}^{L}\max_{z\in\mathcal{S}_\ell\setminus\{z^*\}}\frac{2^{2\ell}\|z^*-z\|^2_{A(\lambda)^{-1}}}{16}
\overset{(b)}{\ge} \frac{1}{16L}\sum_{\ell=1}^{L}2^{2\ell}\inf_{\lambda}\max_{z\in\mathcal{S}_\ell\setminus\{z^*\}}\|z^*-z\|^2_{A(\lambda)^{-1}}\\
&\overset{(c)}{\ge} \frac{1}{64L}\sum_{\ell=1}^{L}2^{2\ell}\inf_{\lambda}\max_{z,z'\in\mathcal{S}_\ell}\|z-z'\|^2_{A(\lambda)^{-1}}
\overset{(d)}{\ge} \frac{1}{128L}\sum_{\ell=1}^{L}2^{2\ell}q\{\mathcal{Y}(\mathcal{S}_\ell)\},
\end{aligned}
\]
where $A(\lambda) = \sum_{x\in\mathcal{X}}\lambda_xxx^\top/\sigma^2_x$; here (a) follows from the fact that the maximum of a set of numbers is always greater than the average, (b) from the fact that the minimum of a sum is greater than the sum of the minimums, (c) from the triangle inequality, which gives $\tfrac{1}{4}\max_{z,z'\in\mathcal{S}_\ell}\|z-z'\|^2_{A(\lambda)^{-1}} \le \max_{z\in\mathcal{S}_\ell}\|z^*-z\|^2_{A(\lambda)^{-1}}$, and (d) from Lemma E.9 together with $C_{\Gamma,\delta} \le 1/2$. Consequently, putting the previous pieces together, we find that the sample complexity can be upper bounded as
\[
\sum_{\ell=1}^{L}\sum_{x\in\mathcal{X}}\lceil\tau_\ell\hat\lambda_{\ell,x}\rceil \le \Big(\tfrac{d(d+1)}{2}+1\Big)L + 3\log(8L^2|\mathcal{Z}|/\delta)\sum_{\ell=1}^{L}2^{2\ell}q\{\mathcal{Y}(\mathcal{S}_\ell)\} \le \Big(\tfrac{d(d+1)}{2}+1\Big)L + 384\,L\log(8L^2|\mathcal{Z}|/\delta)\,\psi^*_{\mathrm{BAI}},
\]
where the dependence on $d(d+1)/2$ can be reduced to $d$ by Proposition A.1. This proves the result.

F Proofs of Level Set Identification

This appendix contains the proofs of the results on transductive linear bandit level set identification. In this problem, we are interested in identifying the set $\mathcal{G}_\alpha(\theta^*) = \{z \in \mathcal{Z} : z^\top\theta^* > \alpha\}$, i.e., the items with mean rewards that exceed the threshold $\alpha$. We prove the lower bound on the sample complexity from Theorem 4.1 in App. F.1. The proof of correctness and the sample complexity upper bound for Alg. 2 are presented in App. F.2. These proofs resemble the transductive best-arm identification proofs, with slight modifications accounting for the problem structure. Recalling $\mathcal{B}_\alpha(\theta^*) = \{z \in \mathcal{Z} : z^\top\theta^* < \alpha\}$, we assume that $\mathcal{Z} = \mathcal{G}_\alpha(\theta^*) \cup \mathcal{B}_\alpha(\theta^*)$, which is equivalent to the condition $\{z \in \mathcal{Z} : z^\top\theta^* = \alpha\} = \emptyset$. Let $\Delta = \min_{z\in\mathcal{Z}}|z^\top\theta^* - \alpha|$ denote the minimum absolute gap with respect to the threshold $\alpha$.

F.1 Lower Bound

Proof of Theorem 4.1 for transductive level set identification. Label $\mathcal{X} \subset \mathbb{R}^d$ as the set of measurement vectors, with $K = |\mathcal{X}|$, and $\mathcal{Z}$ as the decision item set. This proof is similar to the lower bound proofs of [35] and [12]. Let $\mathcal{C}$ be the set of alternatives such that for $\tilde\theta \in \mathcal{C}$, $\mathcal{G}_\alpha(\theta^*) \ne \mathcal{G}_\alpha(\tilde\theta)$. This set of alternatives is decomposed by [35] as
\[
\mathcal{C} = \Big(\bigcup_{z\in\mathcal{G}_\alpha(\theta^*)}\{\tilde\theta : z^\top\tilde\theta < \alpha\}\Big) \cup \Big(\bigcup_{z\notin\mathcal{G}_\alpha(\theta^*)}\{\tilde\theta : z^\top\tilde\theta > \alpha\}\Big).
\]
We now recall Lemma 1 of [27]. Under a $\delta$-PAC strategy for the bandit instance $(\mathcal{X}, \mathcal{Z}, \theta^*)$, let $T_i$ denote the random variable counting the number of times arm $i$ is pulled. Defining $\sigma^2_x = x^\top\Sigma^*x$ for all $x \in \mathcal{X}$, we denote by $\nu_{\theta,i}$ the reward distribution of the $i$-th arm of $\mathcal{X}$, i.e., $\nu_{\theta,i} = \mathcal{N}(x_i^\top\theta, \sigma^2_{x_i})$. For any $\tilde\theta \in \mathcal{C}$, we know from [27] that
\[
\sum_{i=1}^K\mathbb{E}(T_i)\,\mathrm{KL}(\nu_{\theta^*,i}\|\nu_{\tilde\theta,i}) \ge \log(1/2.4\delta).
\]
Following the steps of [12], we find that
\[
\sum_{i=1}^K\mathbb{E}(T_i) \ge \log(1/2.4\delta)\min_{\lambda\in\mathcal{P}_{\mathcal{X}}}\max_{\tilde\theta\in\mathcal{C}}\frac{1}{\sum_{i=1}^K\lambda_i\,\mathrm{KL}(\nu_{\theta^*,i}\|\nu_{\tilde\theta,i})}. \tag{18}
\]
For $\lambda \in \mathcal{P}_{\mathcal{X}}$, define $A(\lambda) := \sum_{i=1}^K\lambda_i x_ix_i^\top/\sigma^2_{x_i}$. For $\epsilon > 0$ and $z \in \mathcal{G}_\alpha(\theta^*)$, define
\[
\theta_z(\epsilon, \lambda) := \theta^* - (\theta^{*\top}z - \alpha + \epsilon)\frac{A(\lambda)^{-1}z}{\|z\|^2_{A(\lambda)^{-1}}},
\]
where $z^\top\theta_z(\epsilon,\lambda) = \alpha - \epsilon$, implying that $\theta_z(\epsilon,\lambda) \in \mathcal{C}$. For $z \in \mathcal{G}^c_\alpha(\theta^*)$, define
\[
\theta_z(\epsilon,\lambda) := \theta^* - (\theta^{*\top}z - \alpha - \epsilon)\frac{A(\lambda)^{-1}z}{\|z\|^2_{A(\lambda)^{-1}}},
\]
where $z^\top\theta_z(\epsilon,\lambda) = \alpha + \epsilon$, implying that $\theta_z(\epsilon,\lambda) \in \mathcal{C}$.
Defining $\epsilon_z = \epsilon$ if $z \in \mathcal{G}_\alpha(\theta^*)$ and $\epsilon_z = -\epsilon$ if $z \in \mathcal{G}^c_\alpha(\theta^*)$, we find that the KL divergence between $\nu_{\theta^*,i}$ and $\nu_{\theta_z(\epsilon,\lambda),i}$ is given by
\[
\mathrm{KL}(\nu_{\theta^*,i}\|\nu_{\theta_z(\epsilon,\lambda),i}) = \frac{[x_i^\top\{\theta^* - \theta_z(\epsilon,\lambda)\}]^2}{2\sigma^2_{x_i}} = \frac{(z^\top\theta^* - \alpha + \epsilon_z)^2}{\{z^\top A(\lambda)^{-1}z\}^2}\,z^\top A(\lambda)^{-1}\frac{x_ix_i^\top}{2\sigma^2_{x_i}}A(\lambda)^{-1}z.
\]
Returning to Eq. 18,
\[
\begin{aligned}
\sum_{i=1}^K\mathbb{E}(T_i) &\ge \log(1/2.4\delta)\min_\lambda\max_{\tilde\theta\in\mathcal{C}}\frac{1}{\sum_i\lambda_i\mathrm{KL}(\nu_{\theta^*,i}\|\nu_{\tilde\theta,i})}
\ge \log(1/2.4\delta)\min_\lambda\max_{z\in\mathcal{Z}}\frac{1}{\sum_i\lambda_i\mathrm{KL}(\nu_{\theta^*,i}\|\nu_{\theta_z(\epsilon_z,\lambda),i})}\\
&\ge \log(1/2.4\delta)\min_\lambda\max_{z\in\mathcal{Z}}\frac{\{z^\top A(\lambda)^{-1}z\}^2}{(z^\top\theta^*-\alpha+\epsilon_z)^2\,z^\top A(\lambda)^{-1}\big(\sum_i\lambda_i\frac{x_ix_i^\top}{2\sigma^2_{x_i}}\big)A(\lambda)^{-1}z}
\overset{(a)}{=} 2\log(1/2.4\delta)\min_\lambda\max_{z\in\mathcal{Z}}\frac{z^\top A(\lambda)^{-1}z}{(z^\top\theta^*-\alpha+\epsilon_z)^2},
\end{aligned}
\]
where (a) uses the fact that $\sum_{i=1}^K\lambda_ix_ix_i^\top/\sigma^2_{x_i} = A(\lambda)$. Letting $\epsilon_z \to 0$ establishes the result.

F.2 Upper Bound

We now prove the upper bound on the sample complexity of Alg. 2 for level set estimation. This proof follows similarly to the upper bound on the sample complexity of Alg. 2 for best-arm identification shown in App. E.2. Note that in each round $\ell \in \mathbb{N}$, Alg. 2 maintains a set $\mathcal{Z}_\ell$ of undecided items, a set $\mathcal{G}_\ell$ of items estimated to have value exceeding the threshold $\alpha$, and a set $\mathcal{B}_\ell$ of items estimated to have value falling below the threshold $\alpha$.

F.2.1 Proof of Correctness

To begin, we show that for all rounds $\ell \in \mathbb{N}$ of Alg. 2, the error in the estimate of $z^\top\theta^*$ for every $z \in \mathcal{Z}$ is bounded by $\epsilon_\ell$ with probability at least $1-\delta$ for $\delta \in (0,1)$.

Lemma F.1. For all $\ell \in \mathbb{N}$ and $z \in \mathcal{Z}$, with probability at least $1-\delta$ for $\delta \in (0,1)$, $|\langle z, \hat\theta_\ell - \theta^*\rangle| \le \epsilon_\ell$.

Proof. This proof follows identically to Lemma E.1 by replacing $z^*$ with the zero vector.

We now apply the preceding result to show that with probability $1-\delta$ for $\delta \in (0,1)$, Alg. 2 does not place items in the wrong category.

Lemma F.2. With probability at least $1-\delta$ for $\delta \in (0,1)$, for any $z \in \mathcal{Z}_\ell$, $\ell \in \mathbb{N}$, such that $z^\top\theta^* > \alpha$, $\langle z, \hat\theta_\ell\rangle + \epsilon_\ell > \alpha$, and for any $z \in \mathcal{Z}_\ell$ such that $z^\top\theta^* < \alpha$, $\langle z, \hat\theta_\ell\rangle - \epsilon_\ell < \alpha$.

Proof. Consider $z \in \mathcal{G}_\alpha$ so that $z^\top\theta^* > \alpha$. Using Lemma F.1, with probability at least $1-\delta$ we have
\[
\langle z, \hat\theta_\ell\rangle + \epsilon_\ell = \langle z, \hat\theta_\ell - \theta^*\rangle + \epsilon_\ell + \langle z, \theta^*\rangle > \langle z, \hat\theta_\ell - \theta^*\rangle + \epsilon_\ell + \alpha \ge \alpha,
\]
so by the procedure of Alg. 2, $z$ is not placed in $\mathcal{B}_{\ell+1}$. Now, consider $z \in \mathcal{B}_\alpha$ so that $z^\top\theta^* < \alpha$. Using Lemma F.1, with probability at least $1-\delta$ we have
\[
\langle z, \hat\theta_\ell\rangle - \epsilon_\ell = \langle z, \hat\theta_\ell - \theta^*\rangle - \epsilon_\ell + \langle z, \theta^*\rangle < \langle z, \hat\theta_\ell - \theta^*\rangle - \epsilon_\ell + \alpha \le \alpha,
\]
so $z$ is not placed in $\mathcal{G}_{\ell+1}$. Together these guarantee $\mathcal{G}_{\ell+1} \subseteq \mathcal{G}_\alpha$ and $\mathcal{B}_{\ell+1} \subseteq \mathcal{B}_\alpha$.

We now show that any $z \in \mathcal{Z}_\ell$ such that $|z^\top\theta^* - \alpha| > 2\epsilon_\ell$ is removed from the active set $\mathcal{Z}_\ell$ in round $\ell \in \mathbb{N}$ of Alg. 2 with probability at least $1-\delta$ for $\delta \in (0,1)$. Since $\epsilon_\ell \to 0$, this proves that we will eventually identify $\mathcal{G}_\alpha$ correctly.

Lemma F.3. With probability at least $1-\delta$ for $\delta \in (0,1)$, for any $\ell \in \mathbb{N}$ and $z \in \mathcal{Z}_\ell$ such that $|\langle z, \theta^*\rangle - \alpha| > 2\epsilon_\ell$, $|\langle z, \hat\theta_\ell\rangle - \alpha| > \epsilon_\ell$.

Proof. Set $\delta \in (0,1)$. Let us begin by considering any $z \in \mathcal{Z}_\ell$ such that $\langle z, \theta^*\rangle - \alpha > 2\epsilon_\ell$. Using Lemma F.1, with probability at least $1-\delta$,
\[
\langle z, \hat\theta_\ell\rangle - \epsilon_\ell = \langle z, \hat\theta_\ell - \theta^*\rangle + \langle z, \theta^*\rangle - \epsilon_\ell \ge -\epsilon_\ell + \langle z, \theta^*\rangle - \epsilon_\ell > \alpha.
\]
Now, consider any $z \in \mathcal{Z}_\ell$ such that $\langle z, \theta^*\rangle - \alpha < -2\epsilon_\ell$. Using Lemma F.1, with probability at least $1-\delta$,
\[
\langle z, \hat\theta_\ell\rangle + \epsilon_\ell = \langle z, \hat\theta_\ell - \theta^*\rangle + \langle z, \theta^*\rangle + \epsilon_\ell \le \epsilon_\ell + \langle z, \theta^*\rangle + \epsilon_\ell < \alpha.
\]
Hence, for $z \in \mathcal{Z}_\ell$ such that $|\langle z, \theta^*\rangle - \alpha| > 2\epsilon_\ell$, we have $|\langle z, \hat\theta_\ell\rangle - \alpha| > \epsilon_\ell$ with probability at least $1-\delta$. By the elimination conditions of Alg. 2, this guarantees that any $z \in \mathcal{Z}_\ell$ with $|\langle z, \theta^*\rangle - \alpha| > 2\epsilon_\ell$ is not included in $\mathcal{Z}_{\ell+1}$.

F.2.2 Proof of Upper Bound on Sample Complexity

Now we prove the upper bound on the sample complexity of Alg. 2 for the transductive level set identification problem.

Proof of Theorem 4.2 for transductive level set identification. Set $\delta \in (0,1)$. Assume without loss of generality that $\max_{z\in\mathcal{Z}}|\langle z, \theta^*\rangle - \alpha| \le 2$. Define $\mathcal{S}_\ell = \{z \in \mathcal{Z}_\ell : |\langle z, \theta^*\rangle - \alpha| \le 4\epsilon_\ell\}$ for $\ell \in \mathbb{N}$. Note that by this assumption $\mathcal{Z} = \mathcal{S}_1 = \mathcal{Z}_1$.
Lemma F.3 implies that with probability at least $1-\delta$, $\mathcal{Z}_\ell \subseteq \mathcal{S}_\ell$ for all $\ell \in \mathbb{N}$. This means that
\[
q(\mathcal{Z}_\ell) = \inf_{\lambda\in\mathcal{P}_{\mathcal{X}}}\max_{z\in\mathcal{Z}_\ell}\|z\|^2_{(\sum_x\lambda_xxx^\top/\hat\sigma^2_x)^{-1}} \le \inf_{\lambda\in\mathcal{P}_{\mathcal{X}}}\max_{z\in\mathcal{S}_\ell}\|z\|^2_{(\sum_x\lambda_xxx^\top/\hat\sigma^2_x)^{-1}} = q(\mathcal{S}_\ell).
\]
For $\ell \ge \lceil\log_2(4\Delta^{-1})\rceil$, we know that $\mathcal{S}_\ell = \emptyset$; thus the sample complexity to identify the set $\mathcal{G}_\alpha$ is upper bounded as follows. Writing $L = \lceil\log_2(4\Delta^{-1})\rceil$,
\[
\sum_{\ell=1}^{L}\sum_{x\in\mathcal{X}}\lceil\tau_\ell\hat\lambda_{\ell,x}\rceil \overset{(a)}{\le} \sum_{\ell=1}^{L}\Big\{3\epsilon_\ell^{-2}q(\mathcal{Z}_\ell)\log(8\ell^2|\mathcal{Z}|/\delta) + \tfrac{d(d+1)}{2}+1\Big\} \le \Big(\tfrac{d(d+1)}{2}+1\Big)L + 3\log(8L^2|\mathcal{Z}|/\delta)\sum_{\ell=1}^{L}2^{2\ell}q(\mathcal{S}_\ell),
\]
where (a) follows from Lemma E.10 and the definition of $\tau_\ell$. We now relate this quantity to the lower bound. With probability $1-\delta$,
\[
\begin{aligned}
\psi^*_{\mathrm{LS}} &= \inf_{\lambda\in\mathcal{P}_{\mathcal{X}}}\max_{z\in\mathcal{Z}}\frac{\|z\|^2_{A(\lambda)^{-1}}}{(\langle z,\theta^*\rangle - \alpha)^2}
= \inf_{\lambda}\max_{\ell\le L}\max_{z\in\mathcal{S}_\ell}\frac{\|z\|^2_{A(\lambda)^{-1}}}{(\langle z,\theta^*\rangle-\alpha)^2}
\ge \inf_{\lambda}\max_{\ell\le L}\max_{z\in\mathcal{S}_\ell}\frac{\|z\|^2_{A(\lambda)^{-1}}}{(4\epsilon_\ell)^2}\\
&\overset{(a)}{\ge} \frac{1}{16L}\inf_{\lambda}\sum_{\ell=1}^{L}2^{2\ell}\max_{z\in\mathcal{S}_\ell}\|z\|^2_{A(\lambda)^{-1}}
\overset{(b)}{\ge} \frac{1}{16L}\sum_{\ell=1}^{L}2^{2\ell}\inf_{\lambda}\max_{z\in\mathcal{S}_\ell}\|z\|^2_{A(\lambda)^{-1}}
\overset{(c)}{\ge} \frac{1}{32L}\sum_{\ell=1}^{L}2^{2\ell}q(\mathcal{S}_\ell),
\end{aligned}
\]
where (a) follows from the fact that the maximum of a set of numbers is always greater than the average, (b) from the fact that the minimum of a sum is greater than the sum of the minimums, and (c) from Lemma E.9 together with $C_{\Gamma,\delta} \le 1/2$. Consequently, putting the previous pieces together, we find that the sample complexity can be upper bounded as
\[
\sum_{\ell=1}^{L}\sum_{x\in\mathcal{X}}\lceil\tau_\ell\hat\lambda_{\ell,x}\rceil \le \Big(\tfrac{d(d+1)}{2}+1\Big)L + 3\log(8L^2|\mathcal{Z}|/\delta)\sum_{\ell=1}^{L}2^{2\ell}q(\mathcal{S}_\ell) \le \Big(\tfrac{d(d+1)}{2}+1\Big)L + 96\,L\log(8L^2|\mathcal{Z}|/\delta)\,\psi^*_{\mathrm{LS}},
\]
where the dependence on $d(d+1)/2$ can be reduced to $d$ by Proposition A.1. This proves the result.

G Comparing Identification Lower Bounds

In this section, we use the general notation for the level set and best-arm identification problems developed in Section 4. Define
\[
\rho^*_{\mathrm{OBJ}} := \sigma^2_{\max}\,\min_{\lambda\in\mathcal{P}_{\mathcal{X}}}\max_{h\in\mathcal{H}_{\mathrm{OBJ}},\,q\in\mathcal{Q}_{\mathrm{OBJ}}}\frac{\|h-q\|^2_{(\sum_{x\in\mathcal{X}}\lambda_xxx^\top)^{-1}}}{\{(h-q)^\top\theta^* - b_{\mathrm{OBJ}}\}^2},
\]
which is the lower bound for both BAI and LS obtained by [12] and [35] in the homoskedastic case when upper bounding the variance.

Lemma G.1. In the same setting as Theorem 4.1, $\kappa^{-1}\rho^*_{\mathrm{OBJ}} \le \psi^*_{\mathrm{OBJ}} \le \rho^*_{\mathrm{OBJ}}$, and both the upper and lower bounds are tight.

Proof. Consider positive semi-definite matrices $A$, $B$ and $C$ in $\mathbb{R}^{d\times d}$ such that $A \preceq B \preceq C$. Then $C^{-1} \preceq B^{-1} \preceq A^{-1}$. Hence, for $x \in \mathbb{R}^d$, $x^\top C^{-1}x \le x^\top B^{-1}x \le x^\top A^{-1}x$. Defining $A(\lambda) = \sum_{i=1}^K\lambda_ix_ix_i^\top/\sigma^2_{x_i}$, we note that
\[
\frac{1}{\sigma^2_{\max}}\sum_{i=1}^K\lambda_ix_ix_i^\top \preceq A(\lambda) \preceq \frac{1}{\sigma^2_{\min}}\sum_{i=1}^K\lambda_ix_ix_i^\top.
\]
Hence, $\sigma^2_{\min}\|x\|^2_{(\sum_i\lambda_ix_ix_i^\top)^{-1}} \le \|x\|^2_{A(\lambda)^{-1}} \le \sigma^2_{\max}\|x\|^2_{(\sum_i\lambda_ix_ix_i^\top)^{-1}}$. Applying this to the result of Theorem 4.1 and invoking Lemma E.8, we see that
\[
\sigma^2_{\min}\min_{\lambda}\max_{h,q}\frac{\|h-q\|^2_{(\sum_x\lambda_xxx^\top)^{-1}}}{\{(h-q)^\top\theta^*-b_{\mathrm{OBJ}}\}^2} \le \min_{\lambda}\max_{h,q}\frac{\|h-q\|^2_{A(\lambda)^{-1}}}{\{(h-q)^\top\theta^*-b_{\mathrm{OBJ}}\}^2} \le \sigma^2_{\max}\min_{\lambda}\max_{h,q}\frac{\|h-q\|^2_{(\sum_x\lambda_xxx^\top)^{-1}}}{\{(h-q)^\top\theta^*-b_{\mathrm{OBJ}}\}^2}.
\]
Recalling that $\rho^*_{\mathrm{OBJ}}$ as defined above is the lower bound in both the best-arm and level-set problems for the homoskedastic-variance algorithm in a heteroskedastic noise setting (when upper bounding the variance), we see that $\kappa^{-1} \le \psi^*_{\mathrm{OBJ}}/\rho^*_{\mathrm{OBJ}} \le 1$. Moreover, both the upper and lower bounds are tight, by taking $\sigma^2_x = \sigma^2_{\min}$ for all $x \in \mathcal{X}$ (to match the lower bound) and $\sigma^2_x = \sigma^2_{\max}$ for all $x \in \mathcal{X}$ (to match the upper bound).

H Experiment Details

This appendix gives more details on the experiments presented in Section 5.

H.1 Oracle Designs

We illustrate the differences in the oracle allocations referenced by Experiments 1-3 in Section 5. We begin with Experiment 1 in Fig. 4, where the oracle distribution that assumes homoskedastic variance devotes more attention to the informative directions with smaller magnitude. In contrast, the heteroskedastic oracle balances evenly because each informative arm has an equivalent signal-to-noise ratio once the variance is accounted for. A sketch of how such oracle allocations can be computed numerically is given below.
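The oracle allocations of Figures 4-6 correspond to minimizers of the lower-bound objective of Theorem 4.1. The sketch below computes such an allocation by minimizing that objective over the simplex, once with per-arm variances $x^\top\Sigma^* x$ and once with a homoskedastic upper bound $\sigma^2_{\max}$. The softmax parametrization, the Nelder-Mead optimizer, and the small ridge term are our implementation choices for small instances, and the function name `oracle_allocation` is hypothetical; none of the paper's experimental instances are reproduced here.

```python
# Sketch: numerically approximate the BAI oracle allocation of Theorem 4.1.
import numpy as np
from scipy.optimize import minimize

def oracle_allocation(X, Z, theta, arm_vars):
    z_star = Z[np.argmax(Z @ theta)]
    gaps = np.array([(z_star - z) @ theta for z in Z])
    diffs = np.array([z_star - z for z in Z])
    keep = gaps > 0                                   # drop z_star itself
    diffs, gaps = diffs[keep], gaps[keep]

    def objective(w):
        lam = np.exp(w - w.max()); lam /= lam.sum()   # softmax onto the simplex
        A = (X * (lam / arm_vars)[:, None]).T @ X     # sum_x lam_x x x' / sigma_x^2
        Ainv = np.linalg.inv(A + 1e-9 * np.eye(X.shape[1]))
        vals = np.einsum('ij,jk,ik->i', diffs, Ainv, diffs) / gaps ** 2
        return vals.max()

    res = minimize(objective, np.zeros(len(X)), method='Nelder-Mead',
                   options={'maxiter': 20000, 'fatol': 1e-10})
    lam = np.exp(res.x - res.x.max()); lam /= lam.sum()
    return lam, res.fun

# Usage: lam_het, _ = oracle_allocation(X, Z, theta, true_vars)
#        lam_hom, _ = oracle_allocation(X, Z, theta, np.full(len(X), sigma2_max))
```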
Next, we analyze Experiment 2 in Fig. 5. In this setting the oracle distribution that assumes homoskedastic variance allocates evenly between the three informative arms, while the heteroskedastic oracle prioritizes the informative directions with less variance.

Figure 4: In Experiment 1, the Heteroskedastic Oracle balances evenly between the informative arms.

Figure 5: In Experiment 2, the Heteroskedastic Oracle prioritizes informative arms with less variance.

Lastly, we examine the oracle distributions for the multivariate testing experiment in Fig. 6. Note that the first four arms have higher variance than the last four arms. We can see in Fig. 6 that the Heteroskedastic Oracle prioritizes the second group of arms because these will require more samples to identify their effect sizes.

H.2 Assuming Homoskedastic Noise

If we assume homoskedastic noise in a heteroskedastic setting, then we need to upper bound the noise by $\sigma^2_{\max}$ in the sample size calculation of Alg. 2 in order to maintain correctness. In Section 5, we compare H-RAGE to RAGE; the latter uses the $\sigma^2_{\max}$ upper bound and is described in Alg. 5.

Figure 6: In the multivariate testing experiment, the Heteroskedastic Oracle prioritizes the arms that have higher variance, i.e., the arms that include the second variation in the first dimension.

Algorithm 5: (RAGE) Randomized Adaptive Gap Elimination
Result: Find $z^* := \arg\max_{z\in\mathcal{Z}}z^\top\theta^*$ for BAI or $\mathcal{G}_\alpha := \{z \in \mathcal{Z} : z^\top\theta^* > \alpha\}$ for LS
1 Input: $\mathcal{X} \subset \mathbb{R}^d$, $\mathcal{Z} \subset \mathbb{R}^d$, confidence $\delta \in (0,1)$, OBJ $\in$ {BAI, LS}, threshold $\alpha \in \mathbb{R}$
2 Initialize: $\ell \leftarrow 1$, $\mathcal{Z}_1 \leftarrow \mathcal{Z}$, $\mathcal{G}_1 \leftarrow \emptyset$, $\mathcal{B}_1 \leftarrow \emptyset$
3 while ($|\mathcal{Z}_\ell| > 1$ and OBJ=BAI) or ($|\mathcal{Z}_\ell| > 0$ and OBJ=LS) do
4   Let $\hat\lambda_\ell \in \mathcal{P}_{\mathcal{X}}$ be a minimizer of $q\{\lambda; \mathcal{Y}(\mathcal{Z}_\ell)\}$ if OBJ=BAI and of $q(\lambda; \mathcal{Z}_\ell)$ if OBJ=LS, where $q(V) = \inf_{\lambda\in\mathcal{P}_{\mathcal{X}}}q(\lambda; V) = \inf_{\lambda\in\mathcal{P}_{\mathcal{X}}}\max_{z\in V}\|z\|^2_{(\sum_{x\in\mathcal{X}}\lambda_xxx^\top)^{-1}}$
5   Set $\epsilon_\ell = 2^{-\ell}$, $\tau_\ell = 2\epsilon_\ell^{-2}\sigma^2_{\max}q(\mathcal{Z}_\ell)\log(8\ell^2|\mathcal{Z}|/\delta)$
6   Pull arm $x \in \mathcal{X}$ exactly $\lceil\tau_\ell\hat\lambda_{\ell,x}\rceil$ times, for $n_\ell$ samples in total, and collect $\{x_{\ell,i}, y_{\ell,i}\}_{i=1}^{n_\ell}$
7   Define $A_\ell := \sum_{i=1}^{n_\ell}x_{\ell,i}x_{\ell,i}^\top$, $b_\ell = \sum_{i=1}^{n_\ell}x_{\ell,i}y_{\ell,i}$ and construct $\hat\theta_\ell = A_\ell^{-1}b_\ell$
8   if OBJ is BAI then
9     $\mathcal{Z}_{\ell+1} \leftarrow \mathcal{Z}_\ell\setminus\{z \in \mathcal{Z}_\ell : \max_{z'\in\mathcal{Z}_\ell}\langle z' - z, \hat\theta_\ell\rangle > \epsilon_\ell\}$
10  else
11    $\mathcal{G}_{\ell+1} \leftarrow \mathcal{G}_\ell \cup \{z \in \mathcal{Z}_\ell : \langle z, \hat\theta_\ell\rangle - \epsilon_\ell > \alpha\}$
12    $\mathcal{B}_{\ell+1} \leftarrow \mathcal{B}_\ell \cup \{z \in \mathcal{Z}_\ell : \langle z, \hat\theta_\ell\rangle + \epsilon_\ell < \alpha\}$
13    $\mathcal{Z}_{\ell+1} \leftarrow \mathcal{Z}_\ell\setminus\{(\mathcal{G}_{\ell+1}\setminus\mathcal{G}_\ell) \cup (\mathcal{B}_{\ell+1}\setminus\mathcal{B}_\ell)\}$
14  end
15  $\ell \leftarrow \ell + 1$
16 end
17 Output: $\mathcal{Z}_\ell$ for BAI or $\mathcal{G}_\ell$ for LS
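For completeness, the following short driver strings the earlier sketches together into the elimination loop shared by Algorithms 2 and 5 for the BAI objective. Stopping once a single arm remains mirrors line 3 of Algorithm 5; passing `sigma2_hat = sigma2_max * np.ones(len(X))` recovers the homoskedastic upper-bounding variant. This is an illustrative skeleton built on the hypothetical helpers above, not the paper's reference implementation.

```python
# Sketch: outer elimination loop for best-arm identification.
import numpy as np

def identify_best_arm(X, Z, sigma2_hat, delta, pull, max_rounds=20):
    Z_active = list(Z)
    for l in range(1, max_rounds + 1):
        if len(Z_active) <= 1:
            break
        eps_l = 2.0 ** -l
        _, Z_active = bai_round(X, np.array(Z_active), len(Z),
                                sigma2_hat, eps_l, delta, l, pull)
    return Z_active[0] if Z_active else None
```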