# High-Dimensional Experimental Design and Kernel Bandits

Romain Camilleri (1), Julian Katz-Samuels (2), Kevin Jamieson (1)

(1) Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA. (2) University of Wisconsin, Madison, WI. Correspondence to: Romain Camilleri.

**Abstract.** In recent years, methods from optimal linear experimental design have been leveraged to obtain state-of-the-art results for linear bandits. A design returned from an objective such as G-optimal design is actually a probability distribution over a pool of potential measurement vectors, so one nuisance of the approach is the task of converting this continuous probability distribution into a discrete assignment of $N$ measurements. While sophisticated rounding techniques have been proposed, in $d$ dimensions they require $N$ to be at least $d$, $d \log\log d$, or $d^2$, depending on the sub-optimality of the solution. In this paper we are interested in settings where $N$ may be much less than $d$, such as experimental design in an RKHS where $d$ may be effectively infinite. We propose a rounding procedure that frees $N$ of any dependence on the dimension $d$ while achieving nearly the same performance guarantees as existing rounding procedures. We evaluate the procedure against a baseline that projects the problem to a lower-dimensional space and performs rounding there, which requires $N$ to be at least a notion of the effective dimension. We also leverage our new approach in a new algorithm for kernelized bandits to obtain state-of-the-art results for regret minimization and pure exploration. An advantage of our approach over existing UCB-like approaches is that our kernel bandit algorithms are also robust to model misspecification.

## 1. Introduction

This work studies a non-parametric multi-armed bandit game through the lens of experimental design. Fix a finite set of measurements $\mathcal{X} \subset \mathbb{R}^d$ and a function $\mu : \mathcal{X} \to \mathbb{R}$.
Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

We consider the following game between a learner and nature: at each time $t = 1, \dots, T$, the learner requests $x_t \in \mathcal{X}$ and nature immediately reveals $y_t = \mu_{x_t} + \xi_t$, where $\{\xi_t\}_{t=1}^T$ is a sequence of independent, mean-zero random variables with bounded variance. We are interested in two objectives.

**Regret minimization.** In this setting, we evaluate the performance of an algorithm choosing actions $\{x_t\}_{t=1}^T$ by its cumulative regret $R_T = \max_{x \in \mathcal{X}} \sum_{t=1}^T (\mu_x - \mu_{x_t})$.

**Pure exploration in the PAC setting.** For a tolerance $\epsilon \geq 0$ and confidence level $\delta \in (0,1)$, the aim of the learner in pure exploration is to sequentially take samples until a learner-defined stopping criterion is met, at which time the learner outputs an arm $\widehat{x} \in \mathcal{X}$ such that $\mu_{\widehat{x}} \geq \max_{x \in \mathcal{X}} \mu_x - \epsilon$ with probability at least $1 - \delta$.

To aid us in our objectives, we assume some structure on the reward function $\mu$.

**Assumption 1.** There exists a known feature map $\phi : \mathbb{R}^d \to \mathcal{H}$ that maps each $x \in \mathcal{X}$ to a (possibly infinite-dimensional) Hilbert space $\mathcal{H}$, and moreover there exist $\theta_* \in \mathcal{H}$ and $h \geq 0$ such that $\max_{x \in \mathcal{X}} |\mu_x - \langle \theta_*, \phi(x) \rangle_{\mathcal{H}}| \leq h$.

Consequently, if $h$ is not too big, the expected value of each observation $y_t$ is nearly a linear function of its associated features $\phi(x_t)$. We say the model is misspecified when $h > 0$; otherwise the setting is well-specified and reduces to the classical stochastic setting when $h = 0$.

**Assumption 2.** Rewards are bounded: $\max_{x \in \mathcal{X}} |\mu_x| \leq B$.

**Assumption 3.** For every time $t$, the additive stochastic noise $\xi_t$ is independent and mean-zero with $\mathbb{E}[\xi_t^2] \leq \sigma^2$.

While we assume the learner knows $B$ and $\sigma^2$, we do not assume the learner knows the extent of the model misspecification $h \geq 0$. Note that we do not assume $\xi_t$ is bounded; indeed, it can even be heavy-tailed.
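The observation model above can be sketched in a few lines. This is a minimal simulation under the stated assumptions, not the authors' code; the names `make_bandit`, `pull`, and `misspec` are illustrative.

```python
import numpy as np

def make_bandit(theta, features, misspec=None, sigma=1.0, rng=None):
    """Simulate the game above: mu_x = <theta, phi(x)> (plus an optional
    misspecification vector bounded by h, per Assumption 1), and each pull
    returns y_t = mu_{x_t} + xi_t with mean-zero noise of variance sigma^2."""
    rng = rng if rng is not None else np.random.default_rng(0)
    mu = features @ theta
    if misspec is not None:
        mu = mu + misspec          # |mu_x - <theta, phi(x)>| <= h entrywise
    def pull(i):
        # nature reveals y_t = mu_{x_t} + xi_t
        return mu[i] + sigma * rng.standard_normal()
    return mu, pull
```

With `sigma=0` and no misspecification, each pull returns the linear reward exactly, matching the well-specified case $h = 0$.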
### 1.1. Elimination algorithms and experimental design

Whether the model is misspecified ($h > 0$) or not ($h = 0$), a popular class of algorithms for both regret minimization and pure exploration is known as elimination algorithms. Elimination algorithms proceed in stages, maintaining a set $\widehat{\mathcal{X}} \subseteq \mathcal{X}$ of candidates that may achieve $\max_{x \in \mathcal{X}} \mu_x$ given all previous observations. At the beginning of stage $\ell \geq 1$ the algorithm decides which measurements to take, nature reveals the observations, and the stage ends by constructing an estimate $\widehat{\mu}(\cdot)$ of $\mu(\cdot)$ and removing from $\widehat{\mathcal{X}}$ all elements $x \in \widehat{\mathcal{X}}$ with $\max_{x' \in \widehat{\mathcal{X}}} \widehat{\mu}_{x'} - \widehat{\mu}_x > \epsilon_\ell$. This process is repeated indefinitely in the case of regret minimization, or until $\widehat{\mathcal{X}}$ contains a single element in the case of pure exploration.

To discard as many candidates as possible in the elimination step (without discarding the best arm), a natural strategy for selecting how many and which measurements to take at the beginning of the round is to choose $x_1, \dots, x_n \in \mathcal{X}$ to accurately estimate the differences of the estimates:

$$\max_{x, x' \in \widehat{\mathcal{X}}} \big| (\widehat{\mu}_x - \widehat{\mu}_{x'}) - (\mu_x - \mu_{x'}) \big| \leq \epsilon_\ell. \tag{1}$$

If $x_* := \arg\max_{x \in \mathcal{X}} \mu_x$ and $x_* \in \widehat{\mathcal{X}}$ at the start of the round, then $x_*$ will not be eliminated at the end, since

$$\max_{x \in \widehat{\mathcal{X}}} \widehat{\mu}_x - \widehat{\mu}_{x_*} \leq \max_{x \in \widehat{\mathcal{X}}} \mu_x - \mu_{x_*} + \epsilon_\ell \leq \epsilon_\ell.$$

Moreover, it is straightforward to show that after the discarding step of stage $\ell$, $\max_{x \in \widehat{\mathcal{X}}} \mu_{x_*} - \mu_x \leq 2\epsilon_\ell$. To guide our choice of $x_1, \dots, x_n \in \mathcal{X}$ to achieve (1), we exploit the (nearly) linear model assumed above.

### 1.2. Optimal experimental design and the problem of rounding continuous designs

This section introduces the method of experimental design with the goal of achieving (1) while taking as few total samples as possible. Shortly, we will consider the case when $h > 0$ and $\phi$ is an arbitrary feature map. But for now, let us make the simplifying assumptions that $h = 0$, that $\phi$ is the identity map so that $\mu_x = \langle \theta_*, x \rangle$, and that $\xi_t \sim \mathcal{N}(0, \sigma^2)$.
Thus, if at time $t$ we select $x_t \in \mathcal{X} \subset \mathbb{R}^d$, we observe $\langle \theta_*, x_t \rangle + \xi_t$. Suppose we observed pairs $\{(x_t, y_t)\}_{t=1}^T$ where each $x_t \in \mathcal{X}$ was chosen independently of any $y_s$ for $s \leq t$. If we wished to achieve (1) for $\widehat{\mathcal{X}} \subseteq \mathcal{X}$ with $\mu_x = \langle x, \theta_* \rangle$, perhaps the most natural way forward would be to compute the least squares estimator $\widehat{\theta}_{LS} = \arg\min_\theta \sum_{t=1}^T (y_t - \langle x_t, \theta \rangle)^2$ and set $\widehat{\mu}_x = \langle \widehat{\theta}_{LS}, x \rangle$. Then (1) is equivalent to $\max_{v \in \mathcal{V}} \langle \widehat{\theta}_{LS} - \theta_*, v \rangle \leq \epsilon_\ell$ with $\mathcal{V} = \widehat{\mathcal{X}} - \widehat{\mathcal{X}}$. By a standard sub-Gaussian tail bound (Lattimore & Szepesvári, 2020), with probability at least $1 - \delta$, for all $v \in \mathcal{V} \subset \mathbb{R}^d$,

$$|\langle v, \widehat{\theta}_{LS} - \theta_* \rangle| \leq \|v\|_{(\sum_{t=1}^T x_t x_t^\top)^{-1}} \sqrt{2\sigma^2 \log(2|\mathcal{V}|/\delta)}, \tag{2}$$

where we adopt the notation $\|z\|_A = \sqrt{z^\top A z}$ for any $z \in \mathbb{R}^d$ and symmetric positive semi-definite $A$. Note that this error bound depends only on the measurements $x_t$, which are chosen before any responses $y_t$ are observed. This allows us to plan, that is, to choose the $T$ measurement vectors to minimize the RHS of (2). Unfortunately, this minimization problem is known to be NP-hard (Pukelsheim, 2006; Allen-Zhu et al., 2017). As a consequence, approximation algorithms based on the relaxation

$$\bar{\lambda} = \arg\min_{\lambda \in \triangle_{\mathcal{X}}} \max_{v \in \mathcal{V}} \|v\|^2_{(\sum_{x \in \mathcal{X}} \lambda_x x x^\top)^{-1}} \tag{3}$$

have been proposed. These first solve for $\bar{\lambda}$ and then round it to a discrete allocation of measurements.

**Deterministic rounding.** Perhaps the simplest scheme is to obtain a solution $\bar{\lambda}$ of (3) and then sample each $x \in \mathcal{X}$ exactly $\lceil \bar{\lambda}_x T \rceil$ times. In the worst case, this results in $|\mathrm{support}(\bar{\lambda})|$ more measurements than the intended $T$. Carathéodory's theorem provides a polynomial-time algorithm for constructing $\tilde{\lambda} \in \triangle_{\mathcal{X}}$ such that $\sum_{x \in \mathcal{X}} \tilde{\lambda}_x x x^\top = \sum_{x \in \mathcal{X}} \bar{\lambda}_x x x^\top$ and $|\mathrm{support}(\tilde{\lambda})| \leq (d+1)d/2$. However, more sophisticated rounding procedures exist: (Allen-Zhu et al., 2017) inflate the RHS of (2) by only a constant factor while requiring only $T = \Omega(d)$.
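The continuous relaxation (3) with $\mathcal{V} = \mathcal{X}$ is a convex minimax problem that can be solved by a standard Frank-Wolfe scheme: repeatedly find the worst-covered direction and shift mass toward it. This is a generic sketch of that scheme (not the authors' implementation); by the Kiefer-Wolfowitz theorem the optimal value is $d$ for full-rank $\mathcal{X}$.

```python
import numpy as np

def g_optimal_design(X, iters=3000):
    """Frank-Wolfe on relaxation (3) with V = X: minimize
    f(lam) = max_x x^T A(lam)^{-1} x over the simplex, where
    A(lam) = sum_x lam_x x x^T. X is an (n, d) array of arms."""
    n, d = X.shape
    lam = np.full(n, 1.0 / n)              # start from the uniform design
    for t in range(iters):
        A = X.T @ (lam[:, None] * X)
        # leverage-like scores x^T A(lam)^{-1} x for every arm at once
        scores = np.einsum('ij,ji->i', X, np.linalg.solve(A, X.T))
        i = int(np.argmax(scores))         # most under-explored arm
        eta = 1.0 / (t + 2)                # standard FW step size
        lam *= (1 - eta)
        lam[i] += eta
    return lam
```

After enough iterations the design value $\max_x x^\top A(\lambda)^{-1} x$ approaches $d$, and the ceiling-rounding discussion above then applies to the returned $\lambda$.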
When $\mathcal{V} = \mathcal{X}$, another strategy is to solve the optimization problem (3) with a Frank-Wolfe style algorithm that is terminated after $O(d \log\log d)$ iterations, so that rounding via the naive ceiling operation only inflates $T$ by the number of iterations, which is $O(d \log\log d)$ (Todd, 2016).

**Stochastic rounding.** Another basic rounding algorithm simply samples $x_1, \dots, x_T \sim \bar{\lambda}$. Unfortunately, using the least squares estimator $\widehat{\theta}_{LS}$, the matrix $\sum_{t=1}^T x_t x_t^\top$ may deviate dramatically from $T \sum_{x \in \mathcal{X}} \bar{\lambda}_x x x^\top$ for moderate $T$; thus any guarantees require $T$ to be $\mathrm{poly}(d)$, and moreover performance relies on the spectrum of $\sum_{x \in \mathcal{X}} \bar{\lambda}_x x x^\top$ (Rizk et al., 2020). As a consequence, (Tao et al., 2018) proposed using the inverse propensity score (IPS) estimator $\widehat{\theta}_{IPS} := (\sum_{x \in \mathcal{X}} \bar{\lambda}_x x x^\top)^{-1} (\frac{1}{T} \sum_{t=1}^T x_t y_t)$. From (Tao et al., 2018), with probability at least $1 - \delta$, for all $v \in \mathcal{V}$ simultaneously,

$$|\langle v, \widehat{\theta}_{IPS} - \theta_* \rangle| \lesssim \sqrt{\frac{2\sigma^2 \|v\|^2_{A(\bar{\lambda})^{-1}} \log(2|\mathcal{V}|/\delta)}{T}} + \frac{\log(2|\mathcal{V}|/\delta)}{T}\Big(1 + \max_{x \in \mathcal{X}} |v^\top A(\bar{\lambda})^{-1} x|\Big), \tag{4}$$

where $A(\lambda) := \sum_{x \in \mathcal{X}} \lambda_x x x^\top$. The second term of (4) accounts for potentially rare but large deviations of size $\max_{x \in \mathcal{X}} |v^\top A(\bar{\lambda})^{-1} x|$. Sadly, this second term is cumbersome in analyses since it can dominate the first term, and it cannot be removed in the worst case. A final class of algorithms relies on proportional volume sampling, or sampling from a determinantal point process (DPP), but these are limited to specific optimality criteria (Nikolov et al., 2019; Derezinski et al., 2020).

### 1.3. Main contributions

The main contributions of this paper include a novel scheme for experimental design and its application to kernel bandits. We propose an estimator $\widehat{\theta}_{RIPS}$ that overcomes many of the shortcomings of the prior art reviewed in Section 1.2 for $h = 0$ and $\phi$ the identity.
For any fixed $\theta_* \in \mathbb{R}^d$, $\mathcal{V} \subset \mathbb{R}^d$, $\mathcal{X} \subset \mathbb{R}^d$, $\lambda \in \triangle_{\mathcal{X}}$, and $T \in \mathbb{N}$, if $T$ samples are drawn randomly according to $\lambda$ to construct $\widehat{\theta}_{RIPS}$, then with probability at least $1 - \delta$, for all $v \in \mathcal{V}$,

$$|\langle v, \widehat{\theta}_{RIPS} - \theta_* \rangle| \leq \|v\|_{(\sum_{x \in \mathcal{X}} T\lambda_x x x^\top)^{-1}} \sqrt{c(\sigma^2 + B^2) \log(2|\mathcal{V}|/\delta)}$$

for an absolute constant $c$. Note that our method puts no restrictions on $T$, yet matches the ideal discrete allocation of (2) up to a constant, by the observation that

$$\inf_{\lambda \in \triangle_{\mathcal{X}}} \max_{v \in \mathcal{V}} \|v\|^2_{(\sum_{x \in \mathcal{X}} T\lambda_x x x^\top)^{-1}} \Big/ \min_{\{x_t\}_{t=1}^T \subset \mathcal{X}} \max_{v \in \mathcal{V}} \|v\|^2_{(\sum_{t=1}^T x_t x_t^\top)^{-1}} \leq 1.$$

We also note that we only assume the stochastic noise has bounded variance and do not rule out heavy-tailed distributions. The estimator $\widehat{\theta}_{RIPS}$ is a special case of our more general estimator, which we extend to the misspecified setting where $h \geq 0$ and to feature maps $\phi : \mathbb{R}^d \to \mathcal{H}$ for an RKHS $\mathcal{H}$. When $\mathcal{H}$ is high- or even infinite-dimensional, restrictions on $T$ based on the dimension become paramount. For any fixed $\theta_* \in \mathcal{H}$, $\mathcal{V} \subset \mathcal{H}$, $\mathcal{X} \subset \mathbb{R}^d$, $\lambda \in \triangle_{\mathcal{X}}$, $T \in \mathbb{N}$, and $\gamma \geq 0$, if $T$ samples are drawn randomly according to $\lambda$ to construct $\widehat{\theta}_{RIPS}(\gamma)$, then with probability at least $1 - \delta$, for all $v \in \mathcal{V}$,

$$|\langle v, \widehat{\theta}_{RIPS}(\gamma) - \theta_* \rangle| \lesssim \|v\|_{(\sum_{x \in \mathcal{X}} T\lambda_x \phi(x)\phi(x)^\top + T\gamma I)^{-1}} \sqrt{c(\sigma^2 + B^2) \log(2|\mathcal{V}|/\delta)}.$$

Note that since $\mathcal{H}$ may be infinite-dimensional, the estimator $\widehat{\theta}_{RIPS}(\gamma)$ is constructed implicitly and is implemented through kernel evaluations only. We empirically compare $\widehat{\theta}_{RIPS}(\gamma)$ to the sampling and estimator pairs of Section 1.2 and show that it is competitive on both finite-dimensional G-optimal design and its regularized RKHS variant, sometimes called Bayesian experimental design. We employ $\widehat{\theta}_{RIPS}(\gamma)$ in a novel elimination-style algorithm for kernel bandits. Our regret bounds match state-of-the-art results in the well-specified setting and are the first that we are aware of for the misspecified setting. In addition, we state an instance-dependent pure-exploration result for identifying an $\epsilon$-good arm with probability at least $1 - \delta$ that compares favorably to known lower bounds.
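The finite-dimensional special case described above ($h = 0$, $\phi$ the identity) can be sketched end-to-end: draw $T$ indices from $\lambda$, then for each direction $v$ robustly aggregate the reweighted observations $v^\top A(\lambda)^{-1} x_t y_t$. This sketch uses median-of-means as the robust mean (one of the $\delta$-robust estimators the paper allows, introduced in Section 2); the function names and the block count `k` are illustrative, not the paper's code.

```python
import numpy as np

def median_of_means(x, k):
    """Split the samples into k blocks, average each block, and return the
    median of the block means (a standard robust estimator with k ~ log(1/delta))."""
    blocks = np.array_split(np.asarray(x, dtype=float), k)
    return float(np.median([b.mean() for b in blocks]))

def rips_directions(X, V, lam, idx, y, k=5):
    """For each direction v in V, robustly estimate <v, theta> from samples
    x_t = X[idx[t]] with responses y[t], drawn according to the design lam."""
    A = X.T @ (lam[:, None] * X)               # A(lam) = sum_x lam_x x x^T
    W = {}
    for j, v in enumerate(V):
        w = np.linalg.solve(A, v)              # A(lam)^{-1} v
        # robust mean of the per-sample terms v^T A(lam)^{-1} x_t y_t
        W[j] = median_of_means((X[idx] @ w) * np.asarray(y), k)
    return W
```

Replacing the plain average of IPS with a robust mean is exactly what removes the large-deviations term of (4) in this per-direction view.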
One advantage of our algorithm over prior kernel bandit and Bayesian optimization algorithms (Srinivas et al., 2009; Valko et al., 2013; Frazier, 2018) is that our approach naturally allows taking batches of pulls per round.

## 2. Robust Inverse Propensity Score (RIPS) estimator

In this section we introduce the $\widehat{\theta}_{RIPS}$ estimator. In finite dimensions, our estimator first constructs $\widehat{\theta}_{IPS}$, but then, to avoid the large-deviations term of (4), applies robust mean estimation to each $\langle v, \theta \rangle$ to obtain a $\widehat{\theta}_{RIPS}$ that is consistent with all of these estimates. When we move to an RKHS setting, we add regularization to avoid vacuous bounds and account for the introduced bias; the bias of misspecification is handled similarly. We begin with robust mean estimation.

**Definition 1.** Let $X_1, \dots, X_n$ be i.i.d. random variables with mean $\bar{x}$ and variance $\nu^2$, and let $\delta \in (0,1)$. We say that $\widehat{\mu}(X_1, \dots, X_n)$ is a $\delta$-robust estimator if there exist universal constants $c_1, c_0 > 0$ such that if $n \geq c_1 \log(1/\delta)$, then with probability at least $1 - \delta$,

$$|\widehat{\mu}(\{X_t\}_{t=1}^n) - \bar{x}| \leq c_0 \sqrt{\frac{\nu^2 \log(1/\delta)}{n}}.$$

Examples of $\delta$-robust estimators include the median-of-means estimator and Catoni's estimator (Lugosi & Mendelson, 2019). This work employs the Catoni estimator, which satisfies $|\widehat{\mu}(\{X_t\}_{t=1}^n) - \bar{x}| \leq \sqrt{\frac{2\nu^2 \log(1/\delta)}{n - 2\log(1/\delta)}}$ for $n > 2\log(1/\delta)$, leading to an optimal leading constant as $n \to \infty$. We will use a separate robust mean estimate for each $v \in \mathcal{V}$. In particular, to estimate $\langle v, \theta_* \rangle$ we use $\widehat{\mu}(\{v^\top A^{(\gamma)}(\lambda)^{-1} \phi(x_t) y_t\}_{t=1}^T)$, where

$$A^{(\gamma)}(\lambda) := \sum_{x \in \mathcal{X}} \lambda_x \phi(x) \phi(x)^\top + \gamma I. \tag{5}$$

Our RIPS procedure for experimental design in an RKHS is presented in Figure 1. It has the following guarantee.

**Theorem 1.** Fix any finite sets $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{V} \subset \mathcal{H}$, feature map $\phi : \mathbb{R}^d \to \mathcal{H}$, number of samples $\tau$, and regularization $\gamma > 0$.
If the RIPS procedure of Figure 1 is run with a $\delta/|\mathcal{V}|$-robust mean estimator $\widehat{\mu}(\cdot)$ and if $\tau \geq c_1 \log(|\mathcal{V}|/\delta)$, then with probability at least $1 - \delta$,

$$\max_{v \in \mathcal{V}} \frac{|W(v) - \langle \theta_*, v \rangle|}{\|v\|_{(\sum_{x \in \mathcal{X}} \lambda_x \phi(x)\phi(x)^\top + \gamma I)^{-1}}} \leq \sqrt{\gamma}\,\|\theta_*\|_2 + h + c_0 \sqrt{\frac{(\sigma^2 + B^2)\log(2|\mathcal{V}|/\delta)}{\tau}}.$$

Moreover, $W(v) = \widehat{\mu}(\{v^\top A^{(\gamma)}(\lambda)^{-1} \phi(x_t) y_t\}_{t=1}^\tau)$ can be replaced by $\langle \widehat{\theta}, v \rangle$ by multiplying the RHS by a factor of 2.

**Algorithm 1** RIPS for Experimental Design in an RKHS

**Input:** finite sets $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{V} \subset \mathcal{H}$, feature map $\phi : \mathbb{R}^d \to \mathcal{H}$, number of samples $\tau$, regularization $\gamma > 0$, robust mean estimator $\widehat{\mu}$.

$$\lambda^* := \arg\min_{\lambda \in \triangle_{\mathcal{X}}} \max_{v \in \mathcal{V}} \|v\|^2_{(\sum_x \lambda_x \phi(x)\phi(x)^\top + \gamma I)^{-1}} \tag{6}$$

Randomly draw $\tilde{x}_1, \dots, \tilde{x}_\tau$ from $\mathcal{X}$ according to $\lambda^*$.
Set $W(v) = \widehat{\mu}(\{v^\top A^{(\gamma)}(\lambda^*)^{-1} \phi(\tilde{x}_t) \tilde{y}_t\}_{t=1}^\tau)$.
Set $\widehat{\theta} := \arg\min_\theta \max_{v \in \mathcal{V}} \frac{|\langle \theta, v \rangle - W(v)|}{\|v\|_{(\sum_{x \in \mathcal{X}} \lambda^*_x \phi(x)\phi(x)^\top + \gamma I)^{-1}}}$.
**Return:** $\{W(v)\}_{v \in \mathcal{V}}$, $\widehat{\theta}$.

*Figure 1.* In this work, we assume each element of $\mathcal{V}$ is a linear combination of $\phi(\mathcal{X})$, which makes all quantities well-defined and computable using kernel evaluations $k(x, x') := \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$. Moreover, Equation (6) is convex, with gradients that can be computed using kernel evaluations (see Section 2.3).

*Proof sketch.* Due to the regularization, and to potential misspecification if $h > 0$, each $v^\top A^{(\gamma)}(\lambda)^{-1} \phi(x_t) y_t$ is biased. Thus, we apply the guarantee of $W(v) = \widehat{\mu}(\{v^\top A^{(\gamma)}(\lambda)^{-1} \phi(x_t) y_t\}_{t=1}^\tau)$ relative to the expectation of its arguments. The triangle inequality followed by repeated applications of Cauchy-Schwarz yields

$$|W(v) - \langle v, \theta_* \rangle| \leq |W(v) - \mathbb{E}[v^\top A^{(\gamma)}(\lambda)^{-1}\phi(x_1)y_1]| + |\mathbb{E}[v^\top A^{(\gamma)}(\lambda)^{-1}\phi(x_1)y_1] - \langle v, \theta_* \rangle| \lesssim \sqrt{\frac{\nu^2 \log(1/\delta)}{\tau}} + \big(\sqrt{\gamma}\,\|\theta_*\|_2 + h\big)\|v\|_{A^{(\gamma)}(\lambda)^{-1}},$$

where we obtain an upper bound on the variance $\nu^2$ via

$$\mathrm{Var}\big(v^\top A^{(\gamma)}(\lambda)^{-1}\phi(x_1)y_1\big) \leq \mathbb{E}\big[(v^\top A^{(\gamma)}(\lambda)^{-1}\phi(x_1)y_1)^2\big] = \mathbb{E}\big[(v^\top A^{(\gamma)}(\lambda)^{-1}\phi(x_1))^2 \mu_{x_1}^2\big] + \mathbb{E}\big[(v^\top A^{(\gamma)}(\lambda)^{-1}\phi(x_1))^2 \xi_1^2\big] \leq (B^2 + \sigma^2)\,\|v\|^2_{A^{(\gamma)}(\lambda)^{-1}}.$$

### 2.1. Practical implementation of the algorithms

The construction of $\widehat{\theta}$ in the algorithms may at first glance look confusing in the infinite-dimensional case. In actuality, the equivalent dual representation $\widehat{\theta} = \sum_{i=1}^{|\mathcal{X}|} \alpha_i \phi(x_i)$ would be used.
That is, the potentially infinite-dimensional object $\widehat{\theta}$ is represented by a finite-dimensional weight vector $\alpha \in \mathbb{R}^{|\mathcal{X}|}$. With that, the optimizations in the algorithms (e.g., to compute the RIPS estimator) are over the dual vector $\alpha \in \mathbb{R}^{|\mathcal{X}|}$, and inner products $\langle \widehat{\theta}, v \rangle = \sum_{i=1}^{|\mathcal{X}|} \alpha_i \langle \phi(x_i), v \rangle$ are computed using the kernel matrix of $\mathcal{X}$, since every instance of $v$ used in the algorithms is a linear combination of $\{\phi(x)\}_{x \in \mathcal{X}}$.

### 2.2. Comparison to the IPS estimator

Note the difference between the bound of RIPS in Theorem 1 and the bound of the IPS estimator stated in equation (4). Consider the setting of equation (4). Ignoring log factors and constants, the confidence bound of the IPS estimator essentially scales as $\sqrt{\sigma^2\|v\|^2_{A(\bar{\lambda})^{-1}}/T} + \max_{x \in \mathcal{X}} |v^\top A(\bar{\lambda})^{-1} x|/T$, while the confidence bound of RIPS essentially scales as $\sqrt{\sigma^2\|v\|^2_{A(\bar{\lambda})^{-1}}/T}$. It can be shown that in the instance of the experiment corresponding to Figure 4, the term $\max_{x \in \mathcal{X}} |v^\top A(\bar{\lambda})^{-1} x|/T$ scales like $d/T$ while $\sqrt{\sigma^2\|v\|^2_{A(\bar{\lambda})^{-1}}/T}$ scales like $\sqrt{d/T}$. Thus, the first term dominates by a polynomial factor in the dimension until $T \gtrsim d$, and the experiment shows that the IPS estimator indeed has larger deviations than RIPS, as suggested by these upper bounds.

### 2.3. Experimental design optimization in an RKHS

We now discuss how to actually compute an allocation in a potentially infinite-dimensional RKHS $\mathcal{H}$. The following lemma will be helpful; it is proved in the appendix.

**Lemma 1.** If $A_\lambda = \sum_{x \in \mathcal{X}} \lambda_x \phi(x)\phi(x)^\top$, then for $a, b \in \mathcal{H}$,

$$a^\top (A_\lambda + \gamma I)^{-1} b = \frac{1}{\gamma}\Big( a^\top b - k_\lambda(a)^\top (K_\lambda + \gamma I_{|\mathcal{X}|})^{-1} k_\lambda(b) \Big),$$

with $k_\lambda(\cdot) \in \mathbb{R}^{|\mathcal{X}|}$ such that for any $c \in \mathcal{H}$, $[k_\lambda(c)]_i = \sqrt{\lambda_i}\,\phi(x_i)^\top c$, and $K_\lambda \in \mathbb{R}^{|\mathcal{X}| \times |\mathcal{X}|}$ such that $[K_\lambda]_{i,j} = \sqrt{\lambda_i \lambda_j}\,\phi(x_i)^\top \phi(x_j) =: \sqrt{\lambda_i \lambda_j}\, k(x_i, x_j)$. For $x \in \mathcal{X}$, $[k_\lambda(x)]_i = \sqrt{\lambda_i}\,\phi(x_i)^\top \phi(x) = \sqrt{\lambda_i}\, k(x_i, x)$.

If we call $f(\lambda)$ the objective of Equation (6) in Figure 1, and let $v^* \in \arg\max_{v \in \mathcal{V}} \|v\|^2_{(\sum_{x \in \mathcal{X}} \lambda_x \phi(x)\phi(x)^\top + \gamma I)^{-1}}$, then the (sub)gradient of $\lambda \mapsto f(\lambda)$ is given by

$$[\nabla_\lambda f(\lambda)]_i = -\Big( (v^*)^\top \Big(\sum_{x \in \mathcal{X}} \lambda_x \phi(x)\phi(x)^\top + \gamma I\Big)^{-1} \phi(x_i) \Big)^2.$$
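The kernel-only computation that Lemma 1 enables can be sketched directly via the Woodbury identity. This is an illustrative implementation under the conventions stated above (weighted kernel vector $k_\lambda$ and weighted kernel matrix $K_\lambda$), not the authors' code; the caller supplies the inner products `k_a[i] = phi(x_i)^T a`, `k_b`, and `a_dot_b = <a, b>`, all of which are kernel evaluations when `a` and `b` are combinations of features of points in $\mathcal{X}$.

```python
import numpy as np

def quad_form_kernel(k_a, k_b, a_dot_b, K, lam, gamma):
    """Compute a^T (A_lam + gamma I)^{-1} b using only kernel quantities,
    where A_lam = sum_i lam_i phi(x_i) phi(x_i)^T lives in a possibly
    infinite-dimensional space. Uses the Woodbury identity:
    (U U^T + g I)^{-1} = (I - U (U^T U + g I)^{-1} U^T) / g."""
    s = np.sqrt(np.asarray(lam, dtype=float))
    ka = s * np.asarray(k_a)                 # [k_lam(a)]_i = sqrt(lam_i) phi(x_i)^T a
    kb = s * np.asarray(k_b)
    Klam = (s[:, None] * K) * s[None, :]     # [K_lam]_ij = sqrt(lam_i lam_j) k(x_i, x_j)
    n = len(s)
    inner = np.linalg.solve(Klam + gamma * np.eye(n), kb)
    return (a_dot_b - ka @ inner) / gamma
```

In finite dimensions this agrees with forming $A_\lambda$ explicitly and solving, which gives a cheap way to sanity-check the identity.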
Importantly, in this work $\mathcal{V}$ is always a set of linear combinations of $\{\phi(x)\}_{x \in \mathcal{X}}$ (e.g., $\mathcal{V} = \phi(\mathcal{X}) - \phi(\mathcal{X})$), so the last quantity can be computed using only kernel evaluations thanks to Lemma 1. We use first-order optimization methods to minimize $\lambda \mapsto f(\lambda)$, since it is convex.

### 2.4. Project-Then-Round (PTR) for RKHS designs

To the best of our knowledge, the RIPS procedure of Figure 1 is novel and should be benchmarked. To design a baseline, we take inspiration from previous works on experimental design in an RKHS. For instance, (Alaoui & Mahoney, 2015) employ a sampling distribution related to statistical leverage scores to construct a sketch of the kernel matrix using a Nyström approximation. The objective in that problem is closest to V-optimal design, which aims to minimize the sum-squared error $\sum_{x \in \mathcal{X}} \mathbb{E}[\langle x, \widehat{\theta} - \theta_* \rangle^2]$ (note that our work is concerned with G-optimal-like objectives, i.e., worst-case error over $\mathcal{X}$). The Nyström approximation to the kernel matrix effectively projects the problem to a low-dimensional subspace where finite-dimensional rounding techniques like those reviewed in Section 1.2 can be applied. (Bach, 2015) also relies on a sampling distribution to approximate integrals using kernels, with an objective similar to V-optimal design. We describe in Algorithm 2 the baseline procedure we call Project-Then-Round (PTR), which employs the finite-dimensional rounding technique of (Allen-Zhu et al., 2017) described in Section 1.2. This procedure enjoys the following guarantee.

**Theorem 2.** Consider the procedure of Algorithm 2. If the number of measurements $\tau$ satisfies $\tau = \Omega(\tilde{d}(\gamma, \lambda))$, then

$$\max_{v \in \mathcal{V}} \|v\|^2_{(\sum_{i=1}^\tau \phi(\tilde{x}_i)\phi(\tilde{x}_i)^\top + \tau\gamma I)^{-1}} \leq \max(2, 1 + \epsilon) \max_{v \in \mathcal{V}} \|v\|^2_{(\sum_{x \in \mathcal{X}} \tau\lambda_x \phi(x)\phi(x)^\top + \tau\gamma I)^{-1}},$$

where $\tilde{d}(\gamma, \lambda)$ is defined in the algorithm. We refer the reader to the appendix for the proof of Theorem 2.
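The effective dimension $\tilde{d}(\gamma, \lambda)$ appearing in Theorem 2 can be computed from the kernel matrix alone, since the nonzero eigenvalues of $A(\lambda) = \sum_i \lambda_i \phi(x_i)\phi(x_i)^\top$ coincide with those of $\mathrm{diag}(\sqrt{\lambda})\, K\, \mathrm{diag}(\sqrt{\lambda})$. A sketch (illustrative, not the paper's code):

```python
import numpy as np

def effective_dimension(K, lam, gamma):
    """~d(lam, gamma): the number of eigenvalues of A(lam) that are >= gamma,
    computed via the weighted kernel matrix diag(sqrt(lam)) K diag(sqrt(lam)),
    whose spectrum matches the nonzero spectrum of A(lam)."""
    s = np.sqrt(np.asarray(lam, dtype=float))
    M = (s[:, None] * K) * s[None, :]
    return int((np.linalg.eigvalsh(M) >= gamma).sum())
```

For a rapidly decaying kernel spectrum this count can be far smaller than $|\mathcal{X}|$, which is what makes projecting and then rounding viable.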
This procedure performs rounding in a finite-dimensional subspace obtained by projecting the initial, potentially infinite-dimensional feature space. With Theorem 2, one can obtain a guarantee similar to that of Theorem 1 up to a constant whenever $\tau = \Omega(\tilde{d}(\lambda, \gamma))$. Though this effective dimension is rarely the dominating factor in analyses, it is cumbersome to keep around and bound.

### 2.5. Empirical evaluation of allocation methods

We briefly describe illustrative experiments (see the supplementary material for more details).

**G-optimal design experiment.** We generate $x_1, \dots, x_n$ by sampling $\bar{x}_i \sim \mathcal{N}(0, \Sigma)$ with $\Sigma_{i,i} = 1$ if $i \leq d/10$, $\Sigma_{i,i} = 0.1$ if $i > d/10$, and all other entries of $\Sigma$ set to 0; we then set $x_i = \bar{x}_i / \|\bar{x}_i\|$. We use $\theta_* = \mathbf{1}$, $d = 50$, and $n = d(d+1)/2$, and use mirror descent to solve the G-optimal design problem. We compare RIPS with IPS, Carathéodory's algorithm with the ceiling rounding technique (LS Cara), the rounding technique of (Allen-Zhu et al., 2017) (LS Regsel), and the random sampling approach of (Rizk et al., 2020) (LS Sampling). Figure 2 depicts the results and shows that RIPS performs comparably to these other approaches. It also illustrates the shortcomings of the Carathéodory rounding algorithm, which does not return an estimate for $T \leq 1275$, while the other algorithms have already learned nontrivial estimates of $\theta_*$ for much smaller values of $T$.

**G-optimal design in an RKHS.** We let $\mathcal{X} = \{0, (1/m)^2, \dots, ((m-1)/m)^2, 1\}$ with $m = 500$ and use the RBF kernel $k(x, x') = \exp(-\|x - x'\|^2/(2\phi^2))$ with bandwidth parameter $\phi = 0.025$. Because this kernel is infinite-dimensional, the ambient dimension for $m$ points is effectively $m$. We focus on the regime $T < m$, where standard rounding schemes do not apply, and compare PTR with regularization $\gamma$, $\widehat{\theta}_{RIPS}(\gamma)$, and $\widehat{\theta}_{IPS}(\gamma) := A^{(\gamma)}(\lambda)^{-1}(\frac{1}{T}\sum_{t=1}^T x_t y_t)$, where we set $\gamma = 0.005$.
Figure 3 depicts the results, showing that PTR($\gamma$) does slightly better than IPS($\gamma$) and RIPS($\gamma$), and that all three algorithms have learned nontrivial estimates of $\theta_*$ using hundreds of samples fewer than standard rounding algorithms require to even output an estimate.

**RIPS vs. IPS.** While IPS has performance similar to RIPS in the two previous experiments, RIPS performs dramatically better in some settings. Let $m \in \mathbb{N}$ and $d = m^2 + m$. Inspired by combinatorial bandits, we consider a setting where the measurement vectors $\mathcal{X} = \{e_1, \dots, e_d\}$ are the standard basis vectors, $\theta_* = \mathbf{1}$, and the performance metric for an estimator $\widehat{\theta}$ is $\mathbb{E}[\sup_{i \in [m^2]} |v_i^\top(\widehat{\theta} - \theta_*)|]$, where $v_i = \sum_{j=1}^m e_j + e_{i+m}$. We compare the performance of IPS against RIPS for $m \in \{12, 14, 16\}$ and estimate the expected maximum deviation at $T = 4m$. Figure 4 shows that as $m$ grows, the performance of IPS degrades relative to RIPS, reflecting that IPS has large deviations in comparison to our proposed estimator.

(Figures 2-4: the G-optimal design experiment, the kernel experiment, and the RIPS vs. IPS experiment.)

**Algorithm 2** PTR for Experimental Designs in an RKHS

**Input:** finite sets $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{V} \subset \mathcal{H}$, feature map $\phi : \mathbb{R}^d \to \mathcal{H}$, regularization $\gamma > 0$.
Fix any $\lambda \in \triangle_{\mathcal{X}}$. Compute the kernel matrix $[K]_{i,j} = k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}}$ of the set of points $\mathcal{X} = \{x_1, \dots, x_n\}$.
Consider a decomposition $K = \widehat{\Phi}\widehat{\Phi}^\top$ with $\widehat{\Phi} \in \mathbb{R}^{n \times n}$, whose rows $\widehat{\phi}(x_i) \in \mathbb{R}^n$ are used to compute $\widehat{A}(\lambda) = \sum_{i=1}^n \lambda_i \widehat{\phi}(x_i)\widehat{\phi}(x_i)^\top$.
Diagonalize $\widehat{A}(\lambda)$ as $\widehat{A}(\lambda) = VDV^\top$ with $D$ a diagonal matrix with coefficients $d_1 \geq d_2 \geq \dots \geq d_n$.
Define the effective dimension as $\tilde{d}(\lambda, \gamma) = \max\{i \in [n] : d_i \geq \gamma\}$.
Choose $k = \tilde{d}(\gamma, \lambda) \in [n]$ and denote by $V_k$ the top $k$ eigenvectors of $\widehat{A}(\lambda)$.
Compute the projections $V_k^\top \widehat{\phi}(x_1), \dots, V_k^\top \widehat{\phi}(x_n) \in \mathbb{R}^k$.
Use the rounding procedure of (Allen-Zhu et al., 2017) to obtain the desired sparse allocation $\{\tilde{x}_i\}_{i=1}^\tau$.
**Return:** $\{\tilde{x}_i\}_{i=1}^\tau$.

## 3. Algorithms for Kernelized Bandits

We now leverage our proposed RIPS estimator of Figure 1 for the kernel bandits problem in an elimination-style algorithm, as introduced in Section 1.1. In this section we provide algorithms for the regret minimization and pure exploration problems, illustrating the benefits of the RIPS estimator. In particular, the estimator enables us to design a regret minimization algorithm that trivially supports batching while enjoying state-of-the-art performance in the well-specified setting. In addition, this same algorithm is robust to model misspecification, suffering regret only linear in the approximation error without any prior knowledge of that error (guarantees that, to the best of our knowledge, are novel). Last but not least, applying our RIPS estimator to pure exploration tasks leads to the first best-arm identification algorithm provably robust to misspecification.

### 3.1. RIPS for regret minimization

As introduced in Section 1, our objective is to develop an algorithm that minimizes regret under the general stochastic and misspecified setting (Assumptions 1-3). Specifically, when pulling arm $x \in \mathcal{X}$ at time $t$ we observe a random variable $\mu_x + \xi_t$, where $\xi_t$ is independent, mean-zero noise with variance $\sigma^2$. We assume there exist a $\theta_* \in \mathcal{H}$ and a known feature map $\phi : \mathcal{X} \to \mathcal{H}$ such that $\max_{x \in \mathcal{X}} |\mu_x - \langle \theta_*, \phi(x) \rangle| \leq h$, where $h \geq 0$ is unknown to the learner. That is, $\mu_x$ is well-approximated by the linear function $\langle \theta_*, \phi(x) \rangle$ but may deviate from it by an amount up to $h$. Because of model misspecification when $h > 0$, we should not hope to obtain sub-linear regret if we seek a regret bound that grows only logarithmically in $|\mathcal{X}|$ and polynomially in $d$.
Algorithm 3 is a phased elimination strategy: at each round, a (regularized) G-optimal design is computed to minimize the variances of the estimates of all the arms, and then arms are discarded if their sub-optimality gap is deemed too large (under the assumed linear model). Due to model misspecification, we should only expect this approach to work until hitting a kind of noise floor defined by the level of misspecification $h$, as suggested by the guarantee of Theorem 1. The algorithm is a combination of our RIPS estimator for the RKHS setting and the robust algorithm of (Lattimore et al., 2020).

**Theorem 3.** With probability at least $1 - \delta$, the regret of Algorithm 3 satisfies

$$\sum_{t=1}^T (\mu_{x_*} - \mu_{x_t}) \lesssim c_1 \log(|\mathcal{X}|/\delta) + \sqrt{\max_{\mathcal{V} \subseteq \mathcal{X}} f(\mathcal{V}, \gamma)}\,\Big[\, T\big(h + \sqrt{\gamma}\,\|\theta_*\|\big) + \sqrt{c_0^2(\sigma^2 + B^2)\, T \log(|\mathcal{X}|\log(T)/\delta)}\,\Big], \tag{7}$$

where $f(\mathcal{V}, \gamma) = \inf_{\lambda \in \triangle_{\mathcal{V}}} \sup_{y \in \mathcal{V}} \|\phi(y)\|^2_{(\sum_{y' \in \mathcal{V}} \lambda_{y'} \phi(y')\phi(y')^\top + \gamma I)^{-1}}$. Choosing $\gamma = 1/T$ and $\delta = 1/T$ yields an expected regret of

$$\mathbb{E}\Big[\sum_{t=1}^T \mu_{x_*} - \mu_{x_t}\Big] \lesssim c\,\sqrt{\max_{\mathcal{V} \subseteq \mathcal{X}} f(\mathcal{V}, \tfrac{1}{T})\, T \log(|\mathcal{X}|T)} + hT\sqrt{\max_{\mathcal{V} \subseteq \mathcal{X}} f(\mathcal{V}, \tfrac{1}{T})},$$

where $c = O(\sqrt{\|\theta_*\|^2 + \sigma^2 + B^2})$. Note that the $hT$ term due to model misspecification is comparable to the one in (Lattimore et al., 2020). Prior works such as (Srinivas et al., 2009; Valko et al., 2013) have demonstrated expected regret bounds in the well-specified ($h = 0$) setting that scale like $\sqrt{\gamma_T T \log(|\mathcal{X}|)}$, where

$$\gamma_T := \max_{\lambda \in \triangle_{\mathcal{X}}} \log\det\big(T A^{(0)}(\lambda) + \gamma I\big). \tag{8}$$

**Algorithm 3** RIPS for Regret Minimization

**Input:** finite set $\mathcal{X} \subset \mathbb{R}^d$ ($|\mathcal{X}| = n$), feature map $\phi$, confidence level $\delta \in (0,1)$, regularization $\gamma$, sub-Gaussian parameter $\sigma$, bound on maximum reward $B$.
Set $\mathcal{X}_1 \leftarrow \mathcal{X}$, $\ell \leftarrow 1$.
**while** $|\mathcal{X}_\ell| > 1$ **do**
&nbsp;&nbsp;Let $\lambda_\ell \in \triangle_{\mathcal{X}_\ell}$ be a minimizer of $f(\lambda; \mathcal{X}_\ell, \gamma)$, where $f(\mathcal{V}, \gamma) = \inf_{\lambda \in \triangle_{\mathcal{V}}} f(\lambda; \mathcal{V}, \gamma) = \inf_{\lambda \in \triangle_{\mathcal{V}}} \max_{y \in \mathcal{V}} \|\phi(y)\|^2_{(\sum_{y' \in \mathcal{V}} \lambda_{y'}\phi(y')\phi(y')^\top + \gamma I)^{-1}}$.
&nbsp;&nbsp;Set $\epsilon_\ell \leftarrow 2^{-\ell}$, $q^{(1)}_\ell \leftarrow c_1 \log(|\mathcal{X}|/\delta)$.
&nbsp;&nbsp;Set $q^{(2)}_\ell \leftarrow c_0^2(B^2 + \sigma^2)\,\epsilon_\ell^{-2} f(\mathcal{X}_\ell, \gamma) \log(4\ell^2|\mathcal{X}|/\delta)$.
&nbsp;&nbsp;Set $\tau_\ell \leftarrow \lceil \max\{q^{(1)}_\ell, q^{(2)}_\ell\} \rceil$.
&nbsp;&nbsp;Use Algorithm 1 with sets $\mathcal{X}_\ell$, $\mathcal{V}_\ell = \phi(\mathcal{X}_\ell)$, sampling $\tau_\ell$ measurements $x_1, \dots, x_{\tau_\ell}$ to get $\{W(v)\}_{v \in \mathcal{V}_\ell}$.
&nbsp;&nbsp;Set $\widehat{\theta}_\ell := \arg\min_\theta \max_{v \in \mathcal{V}_\ell} \frac{|\langle \theta, v \rangle - W(v)|}{\|v\|_{(\sum_{x \in \mathcal{X}} \lambda_{\ell,x}\phi(x)\phi(x)^\top + \gamma I)^{-1}}}$.
&nbsp;&nbsp;Update the active set: $\mathcal{X}_{\ell+1} = \big\{x \in \mathcal{X}_\ell : \max_{x' \in \mathcal{X}_\ell} \langle \phi(x') - \phi(x), \widehat{\theta}_\ell \rangle < 4\epsilon_\ell \big\}$.
&nbsp;&nbsp;$\ell \leftarrow \ell + 1$
**end while**
Play the unique element of $\mathcal{X}_\ell$ indefinitely.

Here $A^{(0)}(\lambda)$ is defined as in (5). The following lemma shows that our regret bound is never worse than these results.

**Lemma 2.** Let $\gamma_T$ be defined as in (8). Then

$$\max_{\mathcal{V} \subseteq \mathcal{X}} f(\mathcal{V}, \tfrac{\gamma}{T}) = \max_{\mathcal{V} \subseteq \mathcal{X}} \inf_{\lambda \in \triangle_{\mathcal{V}}} \sup_{y \in \mathcal{V}} \|\phi(y)\|^2_{A^{(\gamma/T)}(\lambda)^{-1}} \leq 3\gamma_T.$$

The quantity $f(\mathcal{X}, \gamma)$ can also be bounded by a more interpretable form.

**Lemma 3.** If $f(\mathcal{X}, \gamma) = \inf_{\lambda \in \triangle_{\mathcal{X}}} \max_{y \in \mathcal{X}} \|\phi(y)\|^2_{(A(\lambda) + \gamma I)^{-1}}$, then

$$f(\mathcal{X}, \gamma) \leq \mathrm{Trace}\big(A(\lambda^D)(A(\lambda^D) + \gamma I)^{-1}\big) = \mathrm{Trace}\big(K_{\lambda^D}(K_{\lambda^D} + \gamma I)^{-1}\big),$$

where $\lambda^D \in \arg\max_{\lambda \in \triangle_{\mathcal{X}}} \log\det A^{(\gamma)}(\lambda)$. Notably, the RHS of Lemma 3 is the notion of effective dimension that appears in (Alaoui & Mahoney, 2015; Derezinski et al., 2020).

### 3.2. RIPS for pure exploration

We consider a slight generalization of the pure exploration setting introduced in Section 1. Fix finite sets $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Z} \subset \mathbb{R}^d$.

**Algorithm 4** RIPS for Pure Exploration

**Input:** finite sets $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Z} \subset \mathbb{R}^d$, feature map $\phi$, confidence level $\delta \in (0,1)$, regularization $\gamma$, sub-Gaussian parameter $\sigma$, bound on maximum reward $B$, bound on the misspecification noise $h$.
Let $\mathcal{Z}_1 \leftarrow \mathcal{Z}$, $\ell \leftarrow 1$.
**while** $|\mathcal{Z}_\ell| > 1$ **do**
&nbsp;&nbsp;Let $\lambda_\ell \in \triangle_{\mathcal{X}}$ be a minimizer of $f(\lambda; \mathcal{Z}_\ell; \gamma)$, where $f(\mathcal{V}; \gamma) = \inf_{\lambda \in \triangle_{\mathcal{X}}} f(\lambda; \mathcal{V}; \gamma) = \inf_{\lambda \in \triangle_{\mathcal{X}}} \max_{v, v' \in \mathcal{V}} \|\phi(v) - \phi(v')\|^2_{(\sum_{x \in \mathcal{X}} \lambda_x \phi(x)\phi(x)^\top + \gamma I)^{-1}}$.
&nbsp;&nbsp;Set $\epsilon_\ell \leftarrow 2^{-\ell}$, $q^{(1)}_\ell \leftarrow c_1 \log(|\mathcal{Z}|/\delta)$.
&nbsp;&nbsp;Set $q^{(2)}_\ell \leftarrow c_0^2\,\epsilon_\ell^{-2} f(\mathcal{Z}_\ell; \gamma)(B^2 + \sigma^2)\log(2\ell^2|\mathcal{Z}|^2/\delta)$.
&nbsp;&nbsp;Set $\tau_\ell \leftarrow \lceil \max\{q^{(1)}_\ell, q^{(2)}_\ell\} \rceil$.
&nbsp;&nbsp;Use Algorithm 1 with sets $\mathcal{X}$, $\mathcal{V}_\ell = \phi(\mathcal{Z}_\ell) - \phi(\mathcal{Z}_\ell)$, sampling $\tau_\ell$ measurements $x_1, \dots, x_{\tau_\ell}$ to get $\{W(v)\}_{v \in \mathcal{V}_\ell}$.
&nbsp;&nbsp;Set $\widehat{\theta}_\ell := \arg\min_\theta \max_{v \in \mathcal{V}_\ell} \frac{|\langle \theta, v \rangle - W(v)|}{\|v\|_{(\sum_{x \in \mathcal{X}} \lambda_{\ell,x}\phi(x)\phi(x)^\top + \gamma I)^{-1}}}$.
&nbsp;&nbsp;$\mathcal{Z}_{\ell+1} = \big\{z \in \mathcal{Z}_\ell : \max_{z' \in \mathcal{Z}_\ell} \langle \phi(z') - \phi(z), \widehat{\theta}_\ell \rangle \leq 2\epsilon_\ell \big\}$.
&nbsp;&nbsp;$\ell \leftarrow \ell + 1$
**end while**
**Output:** $\mathcal{Z}_\ell$.

We may have $\mathcal{X} = \mathcal{Z}$, but there are interesting cases in which $\mathcal{X} \neq \mathcal{Z}$, including combinatorial bandits and recommendation tasks (Fiez et al., 2019). We say a $z \in \mathcal{Z}$ is $\epsilon$-good if $\mu_z \geq \max_{z' \in \mathcal{Z}} \mu_{z'} - \epsilon$.
In the pure exploration game, for $\epsilon > 0$ and $\delta \in (0,1)$, the player seeks to identify an $\epsilon$-good arm by taking as few measurements in $\mathcal{X}$ as possible. Just as in regret minimization games, we assume that when the player plays $x_t \in \mathcal{X}$ at time $t$, she observes $y_t = \mu_{x_t} + \xi_t$, where $\xi_t$ is independent, mean-zero noise with variance $\sigma^2$. Finally, we assume the existence of a $\theta_* \in \mathcal{H}$ such that

$$\max\Big\{ \max_{z \in \mathcal{Z}} |\mu_z - \langle \theta_*, \phi(z) \rangle|,\ \max_{x \in \mathcal{X}} |\mu_x - \langle \theta_*, \phi(x) \rangle| \Big\} \leq h$$

for some $h \geq 0$ that is unknown to the player. Consider the elimination-style Algorithm 4, which combines our RIPS procedure with the algorithm of (Fiez et al., 2019). While the algorithm is inspired by (Fiez et al., 2019), their analysis only holds in the well-specified setting ($h = 0$); hence a new proof technique was necessary to achieve the following result for general $h \geq 0$.

**Theorem 4.** With $z_* \in \arg\max_{z \in \mathcal{Z}} \langle \phi(z), \theta_* \rangle$, fix any $\epsilon \geq \bar{\epsilon}$, where

$$\bar{\epsilon} = 8\min\Big\{\epsilon' \geq 0 : 4\big(\sqrt{\gamma}\,\|\theta_*\|_2 + h\big)\big(2 + \sqrt{g(\epsilon')}\big) \leq \epsilon'\Big\}, \qquad g(\epsilon) = \inf_{\lambda \in \triangle_{\mathcal{X}}} \sup_{z \in \mathcal{Z} : \langle \theta_*, \phi(z_*) - \phi(z) \rangle \leq \epsilon} \|\phi(z_*) - \phi(z)\|^2_{A^{(\gamma)}(\lambda)^{-1}}.$$

Then with probability at least $1 - \delta$, once the algorithm has taken at least $\tau$ samples, where

$$\tau = \widetilde{O}\Big(c_1 \log(|\mathcal{Z}|/\delta) + \log(\epsilon^{-1})\,c_0^2(B^2 + \sigma^2)\log(|\mathcal{Z}|/\delta)\,\rho^*(\gamma, \epsilon)\Big),$$

we have that $\mu_{\widehat{z}} \geq \max_{z \in \mathcal{Z}} \mu_z - \epsilon$, where $\widehat{z}$ is any arm in the set $\mathcal{Z}_\ell$ under consideration after $\tau$ pulls and

$$\rho^*(\gamma, \epsilon) = \inf_{\lambda \in \triangle_{\mathcal{X}}} \sup_{z \in \mathcal{Z}} \frac{\|\phi(z_*) - \phi(z)\|^2_{A^{(\gamma)}(\lambda)^{-1}}}{\max\{\epsilon^2, \langle \theta_*, \phi(z_*) - \phi(z) \rangle^2\}}. \tag{9}$$

Note that if $\mathcal{X} = \mathcal{Z}$ we have

$$g(\epsilon) = \inf_{\lambda \in \triangle_{\mathcal{X}}} \sup_{z \in \mathcal{X} : \langle \theta_*, \phi(z_*) - \phi(z) \rangle \leq \epsilon} \|\phi(z_*) - \phi(z)\|^2_{A^{(\gamma)}(\lambda)^{-1}} \leq 4\inf_{\lambda \in \triangle_{\mathcal{X}}} \sup_{x \in \mathcal{X}} \|\phi(x)\|^2_{A^{(\gamma)}(\lambda)^{-1}} \leq 4\,\mathrm{Trace}\big((A(\lambda^D) + \gamma I)^{-1} A(\lambda^D)\big),$$

where the last inequality follows from Lemma 3. This means that $\bar{\epsilon}$, the limit on how well one can estimate the maximizing arm, satisfies $\bar{\epsilon} \lesssim (\sqrt{\gamma}\,\|\theta_*\| + h)\,\mathrm{Trace}\big((A(\lambda^D) + \gamma I)^{-1} A(\lambda^D)\big)^{1/2}$. Thus, if we seek an $\epsilon$-good arm, we should choose $\gamma$ to make this right-hand side less than $\epsilon$. Note that $\gamma = 0$ and $h = 0$ imply $\bar{\epsilon} = 0$.
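The trace quantity in the bound on $\bar{\epsilon}$ above (the effective dimension of Lemma 3) can be evaluated from the kernel matrix alone, since the nonzero spectrum of $A(\lambda)$ matches that of the weighted kernel matrix. A sketch (illustrative code, with `lam` standing in for the D-optimal design $\lambda^D$):

```python
import numpy as np

def trace_effective_dimension(K, lam, gamma):
    """d_eff = Trace(A(lam)(A(lam) + gamma I)^{-1}), computed in kernel form
    as Trace(K_lam (K_lam + gamma I)^{-1}) with
    K_lam = diag(sqrt(lam)) K diag(sqrt(lam)), cf. Lemma 3."""
    s = np.sqrt(np.asarray(lam, dtype=float))
    Klam = (s[:, None] * K) * s[None, :]
    n = K.shape[0]
    return float(np.trace(np.linalg.solve(Klam + gamma * np.eye(n), Klam)))
```

Each eigenvalue $d_i$ of $A(\lambda)$ contributes $d_i/(d_i + \gamma) \in [0, 1)$, so this soft count shrinks as $\gamma$ grows, which is exactly the knob the discussion above suggests turning to drive $\bar{\epsilon}$ below the target $\epsilon$.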
If $\phi$ is the identity so that $\mathcal{H} = \mathbb{R}^d$, $h = 0$, and $\gamma = 0$, then the sample complexity of Theorem 4 is known to be optimal up to log factors for identifying the very best arm (assuming it is unique), relative to any $\delta$-correct algorithm over $\theta_* \in \mathbb{R}^d$ (Soare et al., 2014; Fiez et al., 2019).

### 3.3. Comparing to the alternative baseline procedure

In Section 2.4 we proposed a natural alternative to our RIPS procedure for experimental design in an RKHS. This PTR baseline leverages the fact that the added regularization $\gamma > 0$ effectively makes many directions irrelevant. Thus, it projects the problem to a low-dimensional subspace where it can apply any of the standard rounding techniques for finite dimensions described in Section 1.2. The dimension of this subspace, denoted $\tilde{d}$, scales like the number of eigenvalues of $\sum_{x \in \mathcal{X}} \lambda^*_x \phi(x)\phi(x)^\top$ that are greater than $\gamma$, where $\lambda^* \in \arg\min_{\lambda \in \triangle_{\mathcal{X}}} \max_{v \in \mathcal{V}} \|v\|^2_{(\sum_{x \in \mathcal{X}} \lambda_x \phi(x)\phi(x)^\top + \gamma I)^{-1}}$. Any standard rounding algorithm would then require the number of samples taken from the design to be at least $\tilde{d}$. Relative to our results, this inflates the regret bound and sample complexity by an additive factor of $\tilde{d}$ scaled by some problem-dependent log factors. Algebra shows that $\tilde{d} \leq 2\,\mathrm{Trace}\big((A(\lambda^D) + \gamma I)^{-1} A(\lambda^D)\big)$. Though for regret this is a lower-order term, for pure exploration with $\mathcal{X} \neq \mathcal{Z}$ this term may dominate the sample complexity, because it does not capture the interplay between the geometry of $\mathcal{X}$ and $\mathcal{Z}$. Fortunately, our RIPS procedure demonstrates that this term is unnecessary and avoids it.

## 4. Related work

There exist excellent surveys of experimental design from both statistical and computational perspectives (Pukelsheim, 2006; Atkinson et al., 2007; Todd, 2016). Our work is particularly interested in the task of converting a continuous design into a discrete allocation of $T$ measurements; we reviewed a number of works for completing this task in finite dimensions in Section 1.2.
To move to an RKHS setting, we considered a regularized design objective, also known as Bayesian experimental design (Chaloner & Verdinelli, 1995; Allen-Zhu et al., 2017; Derezinski et al., 2020). While most Bayesian experimental design works assume a low-dimensional ambient space and use simple rounding, one exception is the work of (Alaoui & Mahoney, 2015), which performs experimental design in an RKHS for a different design objective and inspired our Project-Then-Round procedure of Section 2.4. Very recently, (Derezinski et al., 2020) proposed a method of sampling from a determinantal point process (DPP) and showed that it can approximate many continuous experimental design objectives up to a constant factor if $T \gtrsim d_{\mathrm{eff}} := \mathrm{Trace}\big((A(\lambda^D) + \gamma I)^{-1} A(\lambda^D)\big)$, with $\lambda^D$ defined as in Lemma 3. However, according to Table 1 of (Derezinski et al., 2020), the method may not apply to G-optimal-like objectives, which are the primary focus of our work. To our knowledge, our proposed RIPS method is novel in that its performance is directly comparable to the continuous design without requiring a minimum number of measurements depending on the (effective) dimension. Our method does, however, require the number of measurements to exceed $\log(|\mathcal{V}|)$. While we leveraged experimental design techniques for kernel bandits, many prior works obtained regret bounds and pure-exploration results using other methods.

**Kernel bandits.** In the well-specified setting ($h = 0$), (Srinivas et al., 2009) propose a UCB-style algorithm (Auer et al., 2002) for the RKHS setting. Independently, (Grünewälder et al., 2010) developed similar methods for minimizing simple regret. (Srinivas et al., 2009) established a regret bound of $\sqrt{T}\big(\|\theta_*\|\sqrt{\gamma_T} + \gamma_T\big)$, where $\gamma_T$ is defined in (8).
(Valko et al., 2013) proposed another UCB variant that obtains a regret bound scaling as $\|\theta_*\|\sqrt{\tilde{d}_T\, T}$, where $\tilde{d}_T$ is an algorithm-dependent constant that can be upper bounded by $\gamma_T$, thus improving upon (Srinivas et al., 2009). We recall that our own regret bound of Theorem 3 scales no worse than $\|\theta_*\|\sqrt{\gamma_T T}$ by Lemma 2, thus matching the state of the art. (Chowdhury & Gopalan, 2017) offer improvements in regret over GP-UCB when the action space is infinite. We also note that our algorithm naturally allows batch querying, a property that UCB-like algorithms achieve only through inelegant means (Desautels et al., 2012; Wu & Frazier, 2018).

¹ Our Theorem 2 with the fact that $\tilde{d} \le 2 d_{\mathrm{eff}}$ suggests k only needs to be at least $\tilde{d}$ for G-optimal-like objectives, which adds to their table.

Misspecified models Our approach to misspecified models draws inspiration from (Lattimore et al., 2020), which addresses linear bandits in finite dimensions. Their regret bound scales quadratically in the ambient dimension due to rounding effects; our RIPS procedure extends this work to an RKHS. The misspecified model setting is related to the corrupted setting in which an adversary may corrupt the observed reward by $c_t$ in each round $t$. Any algorithm for this adversarial setting can also be used to solve kernelized multi-armed bandits in the misspecified setting with total corruption at most $C_T = \sum_{t=1}^T c_t = hT$. Using this reduction, the regret bound for the corrupted setting of (Bogunovic et al., 2020) scales like $C_T \sqrt{\gamma_T T}$. Unfortunately, if we take $C_T = hT$ this bound is vacuous, since it then grows superlinearly in T. Whether robust algorithms like (Gupta et al., 2019) can be extended to our kernel bandit setting is an open question. Concurrently, (Lee et al., 2021) independently proposed a very similar estimator and algorithm for the related task of solving adversarial bandits.
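To see concretely why the corruption reduction yields a vacuous bound, the quick check below plugs $C_T = hT$ into a bound of order $C_T\sqrt{\gamma_T T}$ and compares it to trivial linear regret. The misspecification level $h$ and the logarithmic growth rate for $\gamma_T$ are assumed values for illustration only.

```python
import math

# Illustrative check: with total corruption C_T = h*T, a regret bound of
# order C_T * sqrt(gamma_T * T) eventually exceeds linear regret.
# h and gamma_T below are assumed values, not quantities from the paper.
h = 0.1
ratios = []
for T in [10**3, 10**5, 10**7]:
    gamma_T = math.log(T)                    # assumed polylog information gain
    bound = (h * T) * math.sqrt(gamma_T * T)
    ratios.append(bound / T)                 # > 1 means worse than linear regret

print(ratios)  # strictly increasing: the bound is vacuous for large T
```

The ratio bound/T equals $h\sqrt{\gamma_T T}$, which grows without limit in $T$ for any fixed $h > 0$, regardless of the polylog factor assumed for $\gamma_T$.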
Constrained linear bandits If we assume that $\|\theta_*\|_2 \le R$ for some explicit, known $R > 0$, then this setting is known as constrained linear bandits, tackled in (Degenne et al., 2020) for pure exploration and in (Tirinzoni et al., 2020) for regret minimization. There, a lower bound on the sample complexity of identifying the best arm can be computed. The lower bound is $\inf_{\lambda \in \triangle_{\mathcal{X}}} \sup_{x \neq x_*} \inf_{\gamma \ge 0} G^{-1}(\lambda, x, \gamma)$ where
$$G(\lambda, x, \gamma) = \frac{\max\{(x_* - x)^\top (A(\lambda) + \gamma I)^{-1} A(\lambda)\, \theta_*,\ 0\}^2}{2\, \|x_* - x\|^2_{(A(\lambda) + \gamma I)^{-1}}} + \frac{\gamma}{2}\big(\|\theta_*\|^2_{(A(\lambda)+\gamma I)^{-1} A(\lambda)} - R^2\big),$$
which is close to our upper bound $\rho^*$ from equation (9). See Corollary 1 for the proof of this lower bound. (Degenne et al., 2020) propose an algorithm with an asymptotic upper bound in the sense that as $\delta \to 0$ the dominant term matches the lower bound. However, while (Degenne et al., 2020) and (Tirinzoni et al., 2020) are tight asymptotically, they suffer from large sub-optimal dependencies on problem-specific parameters.

5. Conclusion

In this paper, we have brought to the non-parametric learning setting an estimator that relies on continuous designs while enjoying state-of-the-art theoretical and experimental guarantees in both the well-specified and misspecified settings. We leveraged this estimator in a novel elimination-style algorithm for kernel bandits. For the most part we have ignored computation. However, the computational cost of the RIPS estimator scales linearly in $|\mathcal{V}|$. An interesting avenue of research is designing an estimator that leverages multi-dimensional robust mean estimation and has the same properties as RIPS but no dependence on $|\mathcal{V}|$. Such an estimator would be of considerable interest in problems such as combinatorial bandits, where $|\mathcal{V}|$ is potentially exponential in the dimension (e.g., see (Katz-Samuels et al., 2020; Wagenmaker et al., 2021)).

Acknowledgments

The work of RC and KJ is supported in part by grants NSF RI 1907907 and NSF CCF 2007036.

References

Alaoui, A. E. and Mahoney, M.
W. Fast randomized kernel methods with statistical guarantees, 2015.

Allen-Zhu, Z., Li, Y., Singh, A., and Wang, Y. Near-optimal design of experiments via regret minimization. In International Conference on Machine Learning, pp. 126-135. PMLR, 2017.

Atkinson, A., Donev, A., and Tobias, R. Optimum experimental designs, with SAS, volume 34. Oxford University Press, 2007.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235-256, 2002. doi: 10.1023/A:1013689704352.

Bach, F. On the equivalence between kernel quadrature rules and random feature expansions, 2015.

Bogunovic, I., Krause, A., and Scarlett, J. Corruption-tolerant Gaussian process bandit optimization, 2020.

Chaloner, K. and Verdinelli, I. Bayesian experimental design: A review. Statistical Science, pp. 273-304, 1995.

Chowdhury, S. R. and Gopalan, A. On kernelized multi-armed bandits, 2017.

Degenne, R., Ménard, P., Shang, X., and Valko, M. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, pp. 2432-2442. PMLR, 2020.

Derezinski, M., Liang, F., and Mahoney, M. Bayesian experimental design using regularized determinantal point processes. In International Conference on Artificial Intelligence and Statistics, pp. 3197-3207. PMLR, 2020.

Desautels, T., Krause, A., and Burdick, J. Parallelizing exploration-exploitation tradeoffs with Gaussian process bandit optimization, 2012.

Fiez, T., Jain, L., Jamieson, K., and Ratliff, L. Sequential experimental design for transductive linear bandits. NeurIPS, 2019.

Frazier, P. I. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.

Grünewälder, S., Audibert, J.-Y., Opper, M., and Shawe-Taylor, J. Regret bounds for Gaussian process bandit problems. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 273-280. JMLR Workshop and Conference Proceedings, 2010.
Gupta, A., Koren, T., and Talwar, K. Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pp. 1562-1578. PMLR, 2019.

Katz-Samuels, J., Jain, L., Jamieson, K. G., et al. An empirical process approach to the union bound: Practical algorithms for combinatorial and linear bandits. Advances in Neural Information Processing Systems, 33, 2020.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Lattimore, T., Szepesvári, C., and Weisz, G. Learning with good feature representations in bandits and in RL with a generative model, 2020.

Lee, C.-W., Luo, H., Wei, C.-Y., Zhang, M., and Zhang, X. Achieving near instance-optimality and minimax-optimality in stochastic and adversarial linear bandits simultaneously, 2021.

Lugosi, G. and Mendelson, S. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145-1190, 2019.

Nikolov, A., Singh, M., and Tantipongpipat, U. T. Proportional volume sampling and approximation algorithms for A-optimal design. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1369-1386. SIAM, 2019.

Pukelsheim, F. Optimal Design of Experiments. SIAM, 2006.

Rizk, G., Colin, I., Thomas, A., and Draief, M. Refined bounds for randomized experimental design. arXiv preprint arXiv:2012.15726, 2020.

Soare, M., Lazaric, A., and Munos, R. Best-arm identification in linear bandits. arXiv preprint arXiv:1409.6110, 2014.

Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.

Tao, C., Blanco, S., and Zhou, Y. Best arm identification in linear bandits with linear dimension dependency. In International Conference on Machine Learning, pp. 4877-4886, 2018.

Tirinzoni, A., Pirotta, M., Restelli, M., and Lazaric, A.
An asymptotically optimal primal-dual incremental algorithm for contextual linear bandits. arXiv preprint arXiv:2010.12247, 2020.

Todd, M. J. Minimum-Volume Ellipsoids: Theory and Algorithms. SIAM, 2016.

Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. Finite-time analysis of kernelised contextual bandits, 2013.

Wagenmaker, A., Katz-Samuels, J., and Jamieson, K. Experimental design for regret minimization in linear bandits. In International Conference on Artificial Intelligence and Statistics, pp. 3088-3096. PMLR, 2021.

Wu, J. and Frazier, P. I. The parallel knowledge gradient method for batch Bayesian optimization, 2018.