# Nonparametric Stochastic Contextual Bandits

Melody Y. Guan (Stanford University, 450 Serra Mall, Stanford, California 94305; mguan@stanford.edu) and Heinrich Jiang (Google, 1600 Amphitheatre Pwky, Mountain View, California 94043; heinrich.jiang@gmail.com). Equal contribution.

We analyze the K-armed bandit problem where the reward for each arm is a noisy realization based on an observed context under mild nonparametric assumptions. We attain tight results for top-arm identification and a sublinear regret of $O(T^{\frac{1+D}{2+D}})$, where $D$ is the context dimension, for a modified UCB algorithm that is simple to implement. We then give intrinsic dimension dependent and ambient dimension independent regret bounds. We also discuss recovering topological structures within the context space based on expected bandit performance and provide an extension to infinite-armed contextual bandits. Finally, we experimentally show the improvement of our algorithm over existing approaches for both simulated tasks and MNIST image classification.

## Introduction

Multi-armed bandits (MABs) are an important sequential optimization problem introduced by Robbins (1985). These models have been used extensively in a wide variety of fields related to statistics and machine learning. The classical MAB consists of $K$ arms where at each point in time the learner can sample (or pull) one of them and observe a reward. Various objectives can then be established, such as finding the best arm (top-arm identification) or minimizing some regret over time. For contextual bandits (also referred to as bandits with side information or covariates), the learner has access to a context on which the payoffs depend. Then, based on the observations, we aim to determine the best policy (or context-to-arm mapping) and to optimize some notion of regret.

Most approaches to stochastic contextual bandits make strong assumptions on the payoffs. A popular approach models the mean reward for each arm as being linear in the context space (Chu et al. 2011; Li et al. 2010). However, this is rarely the case in real data. In this paper, we take a more general approach and allow the reward functions to be non-linear and of arbitrary shape. Using recent developments in nonparametric statistics (Jiang 2017b), we show that with simple and easily implementable techniques, we can construct bandit algorithms which can learn over the entire context space with strong guarantees, despite the difficulty that arises from allowing a wide variety of reward functions. While this is not the first work which blends nonparametric statistics with bandits, we are the first to show simple and practical methods while still maintaining strong theoretical guarantees. We reanalyze the uniform and upper confidence bound sampling strategies and demonstrate what nonparametric approaches can offer to contextual bandit learning. No other technique can adapt to the inherently difficult and complex real-world reward functions while allowing such a strong theoretical understanding of the underlying algorithms.

While nonparametric models are powerful in their ability to learn arbitrary functions free of distributional assumptions, a major weakness is the curse of dimensionality: in order to have any theoretical guarantees, they require an exponential-in-dimension number of samples.
However, when the data lies on an unknown low-dimensional structure such as a manifold, we show that our algorithms converge as if the data were in that lower dimension and not in the potentially much larger ambient dimension. Another striking fact is that no preprocessing of the data is required. This is of practical importance because modern data has increasingly many features, but the underlying degrees of freedom often remain small.

We then discuss recovering geometric structures in the context space based on bandit performance. Specifically, we recover the connected components of the context space in which a particular bandit is the top arm. Although learning a context-to-arm mapping gives us the estimated top arm at each point in the context space, this alone does not tell us the space's topological structure, such as the number and shapes of connected components. We recover these structures with uniform consistency guarantees under mild assumptions, where the shapes and relative positions of the components can be arbitrary and the number of such components is recovered automatically. We then provide an extension to infinite-armed bandits and conclude with empirical results from simulations and image classification on the MNIST dataset.

## Setup

Suppose there are $K$ bandit arms indexed by $[K]$. At each time-step $t$, the learner observes a context $x_t \in \mathbb{R}^D$, where $x_t$ is drawn i.i.d. from a context density $p_X$ with compact support $\mathcal{X}$ bounded below away from zero (i.e. $\inf_{x \in \mathcal{X}} p_X(x) \ge p_0$ for some $p_0 > 0$). Then the learner chooses an arm $I_t \in [K]$ and observes reward $r_t = f_{I_t}(x_t) + \xi_t$, where $\xi_t$ is drawn according to a white noise random variable $\xi$ and $f_i : \mathcal{X} \to \mathbb{R}$ is the $i$-th arm's mean reward. We make the following assumptions.

**Assumption 1** (Lipschitz mean reward). There exists $L$ such that $|f_i(x) - f_i(x')| \le L|x - x'|$ for all $x, x' \in \mathcal{X}$ and $i \in [K]$.

**Assumption 2** (Sub-Gaussian white noise). $\xi$ satisfies $\mathbb{E}[\xi] = 0$ and is sub-Gaussian with parameter $\sigma^2$ (i.e. $\mathbb{E}[\exp(\lambda \xi)] \le \exp(\sigma^2 \lambda^2 / 2)$ for all $\lambda \in \mathbb{R}$).

We require the finite-sample strong uniform consistency result (Theorem 1) for k-NN regression, defined as follows.

**Definition 1** (k-NN). Given samples $(x_1, y_1), \dots, (x_n, y_n)$, let the k-NN radius of $x \in \mathcal{X}$ be $r_k(x) := \inf\{r : |B(x, r) \cap \{x_1, \dots, x_n\}| \ge k\}$, where $B(x, r) := \{x' \in \mathcal{X} : |x - x'| \le r\}$, and let the k-NN set of $x \in \mathcal{X}$ be $N_k(x) := \{x_1, \dots, x_n\} \cap B(x, r_k(x))$. Then for $x \in \mathcal{X}$,
$$\hat f_{k\text{-NN}}(x) := \frac{1}{|N_k(x)|} \sum_{i=1}^{n} y_i \, \mathbb{1}\left[x_i \in N_k(x)\right].$$

**Theorem 1** (Rate for k-NN (Jiang 2017b)). Let $\delta > 0$. There exist $N_0$ and a universal constant $C$ such that if $n \ge N_0$ and $k = \lceil n^{2/(2+D)} \rceil$, then with probability at least $1 - \delta$,
$$\sup_{x \in \mathcal{X}} |f(x) - \hat f_{k\text{-NN}}(x)| \le C \log n \sqrt{\log(1/\delta)} \; n^{-1/(2+D)}.$$

It will be implicitly understood from here on that $\hat f_i$ denotes the k-NN regression estimate of $f_i$ under the settings of Theorem 1.

## Top-Arm Identification

**Algorithm 1** (Uniform Sampling)

1. Parameters: $T$, total number of time steps.
2. For each arm $i$ of the $K$ arms:
   1. For each time step $t \in \{(i-1)T/K + 1, \dots, iT/K\}$:
      1. Pull arm $I_t := i$.
3. Define $\hat f_i : \mathcal{X} \to \mathbb{R}$ to be the k-NN regression estimator from the sampled context and reward observations of arm $i$, for each $i \in [K]$.

**Definition 2** (ε-optimal arm). Arm $i$ is ε-optimal at context $x \in \mathcal{X}$ if $\max_{j \in [K]} f_j(x) - f_i(x) \le \epsilon$.

We now show a uniform (over context) result about ε-optimal arm recovery.

**Theorem 2** (ε-optimal arm recovery). Let $\delta > 0$. For Algorithm 1, with probability at least $1 - \delta/K$, if
$$T \ge K \max\left\{ N_0, \; (2+D)\,(2C)^{2+D}\, \log(1/\delta)^{1+D/2}\, \epsilon^{-(2+D)} \right\},$$
then $\hat\pi(x) := \operatorname{argmax}_{i \in [K]} \hat f_i(x)$ is ε-optimal at context $x$ uniformly for all $x \in \mathcal{X}$.

**Remark 1.** This result shows that with $O(\epsilon^{-(2+D)})$ samples, we can determine an ε-approximate best arm. Known lower bounds in nonparametric regression stipulate that we need $\Omega(\epsilon^{-(2+D)})$ samples to identify differences between functions of size $\epsilon$, so our result matches lower bounds up to logarithmic factors.

*Proof.* By Theorem 1, it follows that, based on the choice of $T$, each arm has enough samples so that $\sup_{x \in \mathcal{X}} |\hat f_i(x) - f_i(x)| \le \epsilon/2$. Thus, for all $x \in \mathcal{X}$, defining $\pi(x) = \operatorname{argmax}_{j \in [K]} f_j(x)$ and using $\hat f_{\hat\pi(x)}(x) \ge \hat f_{\pi(x)}(x)$,
$$f_{\pi(x)}(x) - f_{\hat\pi(x)}(x) \le \left(\hat f_{\pi(x)}(x) + \frac{\epsilon}{2}\right) - \left(\hat f_{\hat\pi(x)}(x) - \frac{\epsilon}{2}\right) \le \epsilon,$$
as desired.
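To make the procedure concrete, the following is a minimal Python sketch of Algorithm 1 under stated assumptions: a hypothetical environment object exposing `draw_context()` and `pull(arm, context)`, and scikit-learn's `KNeighborsRegressor` as the k-NN estimator. It illustrates the sample-then-regress pattern; it is not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def uniform_sampling(env, K, T, D):
    """Sketch of Algorithm 1: pull every arm T/K times, then fit one k-NN
    regressor per arm with k = ceil(n^(2/(2+D))) as in Theorem 1.
    `env.draw_context()` and `env.pull(arm, context)` are assumed interfaces."""
    n = T // K
    k = max(1, int(np.ceil(n ** (2.0 / (2 + D)))))
    models = []
    for i in range(K):
        contexts = np.array([env.draw_context() for _ in range(n)])
        rewards = np.array([env.pull(i, x) for x in contexts])
        models.append(KNeighborsRegressor(n_neighbors=min(k, n)).fit(contexts, rewards))

    def policy(x):
        # Estimated policy pi_hat(x): the arm with the highest estimated mean reward.
        x = np.asarray(x).reshape(1, -1)
        return int(np.argmax([m.predict(x)[0] for m in models]))

    return policy, models
```

The returned `policy` corresponds to $\hat\pi(x) = \operatorname{argmax}_{i\in[K]} \hat f_i(x)$ from Theorem 2.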
## Regret Analysis for UCB Strategy

Define $T_i(t)$ to be the number of times arm $i$ has been pulled by time $t$.

**Algorithm 2** (Upper Confidence Bound (UCB))

1. Parameters: $M_0$, $M_1$, $\delta$, $T$.
2. Define $\sigma(n) := M_1 \log n \sqrt{\log(nK/\delta)}\; n^{-1/(2+D)}$.
3. Pull each of the $K$ arms $M_0$ times.
4. For each round $t = KM_0 + 1, \dots, T$: pull $I_t := \operatorname{argmax}_{i \in [K]} \big[\hat f_{i, T_i(t-1)}(x_t) + \sigma(T_i(t-1))\big]$.

We use the following notion of regret:
$$R_T := \sum_{t=1}^{T} \left[ \max_{i \in [K]} f_i(x_t) - f_{I_t}(x_t) \right].$$

**Remark 2.** Note that this notion of regret is different from those studied in classical MABs as well as in other works on nonparametric contextual bandits. Usually the expected form $\mathbb{E}[R_T]$ is bounded. Here, our regret analysis is not under this expectation and hence is a stronger notion of regret.

**Theorem 3.** Let $\delta > 0$. Suppose that $M_0 \ge N_0$ and $M_1 > C$ in Algorithm 2. Then with probability at least $1 - \delta$,
$$R_T \le M_1\, \frac{2+D}{1+D}\, \log T \sqrt{\log(TK/\delta)}\; T^{\frac{1+D}{2+D}} + K M_0 \max_{i \in [K]} \|f_i\|_\infty.$$

**Remark 3.** This shows a sublinear regret of $O(T^{\frac{1+D}{2+D}})$, up to logarithmic factors.

*Proof.* Denote by $\hat f_{i, T_i(t)}$ the k-NN regression estimate of $f_i$ at time $t$. Letting $C_0 = K M_0 \max_i \|f_i\|_\infty$, we have
$$R_T \le \sum_{t = KM_0 + 1}^{T} \sigma\big(T_{I_t}(t-1)\big) + C_0 \le \sum_{i=1}^{K} \sum_{s=1}^{T_i(T)} \sigma(s) + C_0 \le M_1 \log T \sqrt{\log(TK/\delta)} \sum_{t=1}^{T} t^{-1/(2+D)} + C_0$$
$$\le M_1 \log T \sqrt{\log(TK/\delta)} \int_{t=0}^{T} (1+t)^{-1/(2+D)} \, dt + C_0 \le M_1\, \frac{2+D}{1+D}\, \log T \sqrt{\log(TK/\delta)}\; T^{\frac{1+D}{2+D}} + C_0.$$
The first inequality holds because the confidence bound of a sub-optimal arm must be higher than that of the optimal arm at $x_t$ in order for that arm to be chosen, and the regret at that time-step is bounded by the confidence bound. The second inequality holds because of the following simple combinatorial argument: each time a sub-optimal arm is chosen, its count increments, or otherwise no regret is incurred.

## Contextual Bandits on Manifolds

**Assumption 3** (Manifold assumption). $p_X$ and the family of $f_i$ are supported on $M$, where $M$ is a $d$-dimensional smooth compact Riemannian manifold without boundary embedded in a compact subset $\mathcal{X} \subseteq \mathbb{R}^D$. The volume of $M$ is bounded above by a constant, and $M$ has condition number $1/\tau$, which controls the curvature and prevents self-intersection. Let $p_X$ be the density of $P_X$ with respect to the uniform measure on $M$.

**Theorem 4** (Manifold rate for k-NN (Jiang 2017b)). Let $\delta > 0$. There exist $N_0$ and a universal constant $C$ such that if $n \ge N_0$ and $k = \lceil n^{2/(2+d)} \rceil$, then with probability at least $1 - \delta$,
$$\sup_{x \in \mathcal{X}} |f(x) - \hat f_{k\text{-NN}}(x)| \le C \log n \sqrt{\log(1/\delta)} \; n^{-1/(2+d)}.$$

Then, simply by using Theorem 4 instead of Theorem 1, we automatically enjoy faster rates in Theorems 2 and 3.

**Theorem 5** (ε-optimal arm recovery on manifolds). Let $\delta > 0$. For Algorithm 1, with probability at least $1 - \delta/K$, if
$$T \ge K \max\left\{ N_0, \; (2+d)\,(2C)^{2+d}\, \log(1/\delta)^{1+d/2}\, \epsilon^{-(2+d)} \right\},$$
then $\hat\pi(x) := \operatorname{argmax}_{i \in [K]} \hat f_i(x)$ is ε-optimal at context $x$ uniformly for all $x \in \mathcal{X}$.

**Remark 4.** The sample complexity is now $O(\epsilon^{-(2+d)})$ instead of $O(\epsilon^{-(2+D)})$.

**Theorem 6** (UCB regret analysis on manifolds). Let $\delta > 0$. Suppose that $M_0 \ge N_0$ and $M_1 > C$ in Algorithm 2. Then with probability at least $1 - \delta$,
$$R_T \le M_1\, \frac{2+d}{1+d}\, \log T \sqrt{\log(TK/\delta)}\; T^{\frac{1+d}{2+d}} + K M_0 \max_{i \in [K]} \|f_i\|_\infty.$$
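Before moving on, here is a minimal sketch of the UCB strategy of Algorithm 2, under the same illustrative assumptions as the earlier sketch (a hypothetical `env` interface and scikit-learn's `KNeighborsRegressor`). The estimators are refit from scratch every round for simplicity rather than efficiency.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def knn_ucb(env, K, T, D, M0, M1, delta):
    """Sketch of Algorithm 2: per-arm k-NN estimates plus a shrinking
    confidence width sigma(n) evaluated at the observed context."""
    contexts = [[] for _ in range(K)]
    rewards = [[] for _ in range(K)]

    def sigma(n):
        # Confidence width mirroring the uniform k-NN rate, up to constants.
        return M1 * np.log(max(n, 2)) * np.sqrt(np.log(max(n, 2) * K / delta)) * n ** (-1.0 / (2 + D))

    def estimate(i, x):
        n = len(contexts[i])
        k = max(1, int(np.ceil(n ** (2.0 / (2 + D)))))
        model = KNeighborsRegressor(n_neighbors=min(k, n))
        model.fit(np.array(contexts[i]), np.array(rewards[i]))
        return model.predict(np.asarray(x).reshape(1, -1))[0]

    for t in range(T):
        x = env.draw_context()
        if t < K * M0:
            i = t % K  # pull each arm M0 times before using the confidence bounds
        else:
            i = int(np.argmax([estimate(j, x) + sigma(len(contexts[j])) for j in range(K)]))
        contexts[i].append(np.asarray(x))
        rewards[i].append(env.pull(i, x))
    return contexts, rewards
```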
## Topological Analysis

In this section, we discuss how topological features of the bandit arms can be recovered. This is similar to recovering the Hartigan notion of clusters as level-sets of density functions from a finite sample (Chaudhuri and Dasgupta 2010; Jiang 2017a), but here we find similar structures in the reward functions based on noisy observations of them. We give procedures which can estimate, with consistency guarantees, the following structure: the maximal connected regions in $\mathcal{X}$ where a particular arm is the top arm.

From the uniform sampling strategy earlier, we obtained an estimated policy $\hat\pi$ which is ε-optimal uniformly in $\mathcal{X}$ with high probability. Although this is already powerful in giving us the mapping between the context space and the corresponding top arm, it does not immediately tell us the topological features of this mapping. Here we discuss how to recover the connected components of $\{x \in \mathcal{X} : f_i(x) = \max_{j \in [K]} f_j(x)\}$, the region where arm $i$ is the top arm. We give the following simple procedure.

**Algorithm 3** (Recovering the regions where the $i$-th arm is the top arm)

1. Given: bandit arm $i$ and $R > 0$.
2. Pull each of the $K$ arms $T/K$ times.
3. Let $G$ be the graph with vertex set $\{x_t : t \in [T],\ \hat f_i(x_t) = \max_{j \in [K]} \hat f_j(x_t)\}$ and an edge between each pair of vertices whose Euclidean distance is at most $R$.
4. Return the connected components of $G$.

We now give a consistency result for Algorithm 3. First, we require the following regularity assumption, which ensures that there are no full-dimensional regions where the top arm is not unique. This ensures that it is possible to unambiguously recover the regions where a particular arm is top.

**Assumption 4.** The region in $\mathcal{X}$ where the top arm is not unique has measure 0, and for each arm $i$, the region $X_i$ where it is unique can be partitioned into full-dimensional connected components.

Our rates will be in terms of the Hausdorff distance.

**Definition 3.** $d_H(A, B) := \inf\{\epsilon \ge 0 : A \subseteq B_\epsilon,\ B \subseteq A_\epsilon\}$, where $A_r := \{x \in \mathcal{X} : \inf_{a \in A} d(x, a) \le r\}$.

**Theorem 7.** Suppose that $X_i := \{x \in \mathcal{X} : f_i(x) = \max_{j \in [K]} f_j(x)\}$ and let $C_1, \dots, C_l$ be the maximal connected components of $X_i$. Define the minimum distance between two connected components,
$$R_0 := \min_{p \neq q} \inf_{x \in C_p,\, y \in C_q} d(x, y),$$
and the minimum separation in the reward functions,
$$D_0 := \inf_{x \in \mathcal{X} \setminus (X_i)_{R_0/4}} \left( \max_{j \in [K]} f_j(x) - f_i(x) \right).$$
Then the following holds simultaneously for all $C_1, \dots, C_l$. Let Algorithm 3 with setting $0 < R < R_0/4$ return $\hat C_1, \dots, \hat C_{\hat l}$. Then for $n$ sufficiently large, $\hat l = l$ and there exists a permutation $\gamma$ of $[l]$ such that $d_H(C_j, \hat C_{\gamma(j)}) \le \xi(n)$ for some $\xi$ satisfying $\xi(n) \to 0$ as $n \to \infty$.

*Proof.* We first show that no two connected components can appear in the same returned component in Algorithm 3. Choose $n$ sufficiently large such that, in light of Theorem 1,
$$\sup_{x \in \mathcal{X}} \max_{j \in [K]} |\hat f_j(x) - f_j(x)| \le \frac{D_0}{3}.$$
Then, uniformly for any $x \in \mathcal{X} \setminus (X_i)_{R_0/4}$, we have
$$\hat f_i(x) \le f_i(x) + \frac{D_0}{3} \le \max_{j \in [K]} f_j(x) - \frac{2 D_0}{3} \le \max_{j \in [K]} \hat f_j(x) - \frac{D_0}{3} < \max_{j \in [K]} \hat f_j(x).$$
Thus, $\mathcal{X} \setminus (X_i)_{R_0/4}$ is disjoint from the returned points. Since $R < R_0/4$, it follows that points belonging to two distinct connected components cannot appear in the same returned connected component of Algorithm 3.

Next, we show that for each connected component $C_p$, there exists $\hat C_q$ for some $q \in [\hat l]$ such that $d_H(\hat C_q, C_p) \to 0$. It suffices to show that for each $r > 0$ and $n$ sufficiently large, $d_H(\hat C_q, C_p) < r$. There are thus two directions to show: that $C_p \subseteq (\hat C_q)_r$ and that $\hat C_q \subseteq (C_p)_r$. To show the first, define
$$D_1 := \inf_{x \in (C_p)_r \setminus (C_p)_{r/2}} \left( \max_{j \in [K]} f_j(x) - f_i(x) \right).$$
Then choose $n$ sufficiently large such that, in light of Theorem 1,
$$\sup_{x \in \mathcal{X}} \max_{j \in [K]} |\hat f_j(x) - f_j(x)| \le \frac{D_1}{3}.$$
Then, for all $x \in (C_p)_r \setminus (C_p)_{r/2}$, we have
$$\hat f_i(x) \le f_i(x) + \frac{D_1}{3} \le \max_{j \in [K]} f_j(x) - \frac{2 D_1}{3} < \max_{j \in [K]} \hat f_j(x),$$
so no returned point lies in $(C_p)_r \setminus (C_p)_{r/2}$; thus the returned points associated with $C_p$ lie in $(C_p)_{r/2} \subseteq (C_p)_r$. The other direction follows from a similar argument. All that remains is to show that such points appear in the same connected component of the graph computed by Algorithm 3, which follows from uniform concentration bounds on balls (e.g. Chaudhuri and Dasgupta (2010)).
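Algorithm 3 is straightforward to implement. The sketch below assumes per-arm k-NN estimators such as those produced by the earlier uniform-sampling sketch and uses SciPy for the graph components; it is an illustration, not the authors' code.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

def recover_top_arm_regions(contexts, models, arm, R):
    """Sketch of Algorithm 3: keep the observed contexts at which `arm` is the
    estimated top arm, connect pairs within Euclidean distance R, and return
    the connected components of the resulting graph."""
    contexts = np.asarray(contexts)
    estimates = np.stack([m.predict(contexts) for m in models], axis=1)  # shape (n, K)
    vertices = contexts[estimates.argmax(axis=1) == arm]
    if len(vertices) == 0:
        return []
    adjacency = csr_matrix(cdist(vertices, vertices) <= R)
    n_components, labels = connected_components(adjacency, directed=False)
    return [vertices[labels == c] for c in range(n_components)]
```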
## Infinite-Armed Bandits

In this section, we consider the setting where the action space $\mathcal{A}$ is no longer a finite set of bandit arms but a compact subset of $\mathbb{R}^{D'}$ for some $D' > 0$. We give analogous results for uniform-sampling top-arm identification and regret bounds for a UCB-type strategy.

**Definition 4** (Mean reward function). Let $f : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$, where $f(x, a)$ is the expected reward of action $a \in \mathcal{A}$ at context $x \in \mathcal{X}$.

**Assumption 5** (Lipschitz reward). There exists $L > 0$ such that for all $x, x' \in \mathcal{X}$ and $a, a' \in \mathcal{A}$, $|f(x, a) - f(x', a')| \le L|(x, a) - (x', a')|$, where $(x, a)$ represents the $(D + D')$-dimensional concatenation of $x$ and $a$.

Then at each time $t$, the learner chooses arm $a_t \in \mathcal{A}$ and observes context $x_t \in \mathcal{X}$ and a stochastic reward $R_t = f(x_t, a_t) + \xi_t$, where $\xi_1, \xi_2, \dots$ are i.i.d. white noise with mean 0 and variance $\sigma^2$.

**Algorithm 4** (Infinite-Armed Uniform Sampling)

1. Parameters: $T$, total number of time steps.
2. For $t = 1, \dots, T$:
   1. Pull $a_t$, sampled uniformly from $\mathcal{A}$.
   2. Observe context $x_t$ and reward $R_t$.
3. Define $\hat f$ to be the k-NN regression estimate from the samples $((x_1, a_1), R_1), \dots, ((x_T, a_T), R_T)$ with setting $k = \lceil T^{2/(2+D+D')} \rceil$.

**Definition 5** (ε-optimal arm). Arm $a \in \mathcal{A}$ is ε-optimal at context $x \in \mathcal{X}$ if $\sup_{a' \in \mathcal{A}} f(x, a') - f(x, a) \le \epsilon$.

The following is a uniform (over the context and action spaces) result about ε-optimal arm recovery.

**Theorem 8** (ε-optimal arm recovery). There exist constants $C_1, C_2$ such that the following holds. Let $\delta > 0$. For Algorithm 4, with probability at least $1 - \delta$, if
$$T \ge C_1 \, \log(1/\delta)^{1 + (D+D')/2} \, \epsilon^{-(D + D' + 2)} + C_2,$$
then $\hat\pi(x) := \operatorname{argmax}_{a \in \mathcal{A}} \hat f(x, a)$ is ε-optimal at context $x$ uniformly for all $x \in \mathcal{X}$.

*Proof.* By Theorem 1 applied on the joint space $\mathcal{X} \times \mathcal{A}$, it follows that, based on the choice of $T$, we have $\sup_{a \in \mathcal{A},\, x \in \mathcal{X}} |\hat f(x, a) - f(x, a)| \le \epsilon/2$. Thus, for all $x \in \mathcal{X}$, defining $\pi(x) = \operatorname{argmax}_{a \in \mathcal{A}} f(x, a)$,
$$f(x, \pi(x)) - f(x, \hat\pi(x)) \le \left(\hat f(x, \pi(x)) + \frac{\epsilon}{2}\right) - \left(\hat f(x, \hat\pi(x)) - \frac{\epsilon}{2}\right) \le \epsilon,$$
as desired.

**Algorithm 5** (Infinite-Armed Upper Confidence Bound (UCB))

1. Parameters: $M$, $M_1$, $T$.
2. Define $\sigma(n) := M_1 n^{-1/(2+D+D')}$.
3. For $t = 1, \dots, M$:
   1. Sample $a_t$ uniformly from $\mathcal{A}$.
   2. Observe context $x_t$ and reward $R_t$.
4. For $t = M+1, \dots, T$: choose $a_t := \operatorname{argmax}_{a \in \mathcal{A}} \big[\hat f(x_t, a) + \sigma(t)\big]$.

Finally, using the notion of regret
$$R_T := \sum_{t=1}^{T} \left[ \sup_{a \in \mathcal{A}} f(x_t, a) - f(x_t, a_t) \right],$$
we give the following result. The proof idea is similar to that of Theorem 3 and is omitted here.

**Theorem 9.** There exist $C_1$ and $C_2$ such that the following holds. Let $\delta > 0$. Suppose that $M$ and $M_1$ are chosen sufficiently large in Algorithm 5, depending on $f$ and $\sigma$. Then with probability at least $1 - \delta$,
$$R_T \le C_1 \log T \sqrt{\log(T/\delta)} \; T^{\frac{1 + D + D'}{2 + D + D'}} + C_2.$$

**Remark 5.** This shows a sublinear regret of $O(T^{\frac{1 + D + D'}{2 + D + D'}})$.
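A minimal sketch of Algorithm 5 follows, under illustrative assumptions: the compact action set is approximated by a finite grid `action_grid` of candidate actions (the paper leaves the argmax over $\mathcal{A}$ abstract), a single k-NN regressor is fit on concatenated (context, action) vectors, and `env.draw_context()` / `env.pull(action, context)` are hypothetical interfaces.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def infinite_armed_ucb(env, action_grid, T, M, M1, D, D_prime, seed=0):
    """Sketch of Algorithm 5: one k-NN regressor over concatenated
    (context, action) vectors; the argmax over the compact action set is
    approximated on the finite grid `action_grid` of shape (m, D')."""
    rng = np.random.default_rng(seed)
    joint, rewards = [], []
    for t in range(1, T + 1):
        x = np.asarray(env.draw_context())
        if t <= M:
            a = action_grid[rng.integers(len(action_grid))]  # uniform exploration phase
        else:
            n = len(joint)
            k = max(1, int(np.ceil(n ** (2.0 / (2 + D + D_prime)))))
            model = KNeighborsRegressor(n_neighbors=min(k, n)).fit(np.array(joint), np.array(rewards))
            candidates = np.array([np.concatenate([x, a]) for a in action_grid])
            bonus = M1 * t ** (-1.0 / (2 + D + D_prime))  # sigma(t), constant over candidates
            a = action_grid[int(np.argmax(model.predict(candidates) + bonus))]
        joint.append(np.concatenate([x, a]))
        rewards.append(env.pull(a, x))
    return joint, rewards
```

Since $\sigma(t)$ does not depend on the candidate action, it does not change the argmax in this sketch; it is kept only to mirror the pseudo-code of Algorithm 5.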
## Related Works

Canonical works for the standard bandit problem are Lai and Robbins (1985); Berry and Fristedt (1985); Gittins (1989); Auer et al. (2002); Cesa-Bianchi and Lugosi (2006); Bubeck and Cesa-Bianchi (2012). Work on contextual bandits can be roughly classified into adversarial and stochastic approaches. Much of the former, initiated by Auer et al. (2002), assumes that there is an adversarial game between nature and the learner where, based on a context seen by both players, nature generates rewards for each arm at the same time the learner chooses an arm. Solutions typically involve game-theoretic methods.

In the stochastic approach, one assumes that the rewards for the arms are generated by a context-dependent distribution. Approaches to modeling the arm rewards as a function of context are most commonly parametric. One of the most popular is that of linear payoffs, studied under a minimax framework (Goldenshluger and Zeevi 2009; 2013), with UCB-type algorithms (Chu et al. 2011; Li et al. 2010; Auer et al. 2002), or with Thompson sampling (Agrawal and Goyal 2013). However, it is often the case that the dependency between the payoffs and the contexts is complex and therefore difficult to capture with models such as linear payoffs, many of which require strong assumptions on the data.

To alleviate this, we can go beyond parametric modeling and blend nonparametric statistics with contextual bandits. Despite the advantage of learning much more general context-payoff dependencies, this line of work has received far less attention. To the best of our knowledge, the first such work appeared in Yang and Zhu (2002), who used histogram, k-NN, and kernel methods and showed asymptotic convergence rates. Rigollet and Zeevi (2010) and Perchet and Rigollet (2013) then combined histogram-type binning techniques from nonparametric statistics with bandits to obtain strong regret guarantees with optimality guarantees. Lu, Pál, and Pál (2009) study an interesting setting where the reward is Lipschitz jointly in the context and the action space. They provide upper and lower regret bounds based on a covering argument and give results in terms of the packing dimension. This is highly related to the infinite-armed bandit setting in the present work; we provide similar regret guarantees but with a simple and practical procedure. More recently, Qian and Yang (2016b; 2016a) use the strong uniform consistency properties of kernel smoothing regression to establish regret guarantees. Langford and Zhang (2008) and Dudik et al. (2011) alternatively impose neither linear nor smoothness assumptions on the mean reward function: the former propose a modification of an ε-greedy policy and show that expected regret converges to 0, while the latter consider a finite class of policies.

In this paper, using recent finite-sample results about k-NN regression established in Jiang (2017b), we show that simple k-NN regression is an effective alternative approach. Moreover, unlike many other nonparametric techniques, k-NN adapts to a lower intrinsic dimension (Kpotufe 2011), and thus our regret bounds adapt to a lower intrinsic dimension automatically and perform as if we were operating in that lower-dimensional space.

## Experiments

### Simulations

We consider three two-arm bandit scenarios in the two-dimensional unit square, where $p_X$ is uniform. We set arm $i \in \{1, 2\}$ to be top in region $R_i$ respectively. Figure 1 illustrates the regions for the different scenarios.

- Scenario 1 (Quintic Function): We define two regions above and below a quintic function.
- Scenario 2 (Smiley): We use two circles and a semicircle to demarcate the regions in a smiley face pattern.
- Scenario 3 (Bullseye): We define the regions using the alternating regions of four concentric circles centered in the support.

The true reward functions of the two arms are
$$f_i(x) = \begin{cases} 1, & x \in R_i, \\ 0.5, & x \in R_{j \neq i}. \end{cases}$$
The learner observes the rewards with white noise random variable $\xi \sim N(\mu = 0, \sigma = 0.5)$.
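For concreteness, here is a small sketch of how such a scenario can be simulated. The bullseye radii below are illustrative placeholders (the text does not list them); only the reward structure (mean 1 for the top arm, 0.5 otherwise, Gaussian noise with $\sigma = 0.5$) follows the description above.

```python
import numpy as np

def bullseye_top_arm(x, radii=(0.1, 0.2, 0.3, 0.4), center=(0.5, 0.5)):
    """Scenario 3 sketch: arms 1 and 2 alternate as the top arm across the
    rings of four concentric circles. The radii are illustrative placeholders."""
    distance = np.linalg.norm(np.asarray(x) - np.asarray(center))
    ring = int(np.searchsorted(radii, distance))
    return 1 if ring % 2 == 0 else 2

def sample_round(rng, top_arm_fn, sigma=0.5):
    """Draw a uniform context on the unit square and return noisy rewards for
    both arms: mean 1 for the top arm, 0.5 for the other, noise N(0, sigma)."""
    x = rng.uniform(0.0, 1.0, size=2)
    top = top_arm_fn(x)
    means = np.array([1.0 if top == arm else 0.5 for arm in (1, 2)])
    return x, means + rng.normal(0.0, sigma, size=2)

# Example: rng = np.random.default_rng(0); x, rewards = sample_round(rng, bullseye_top_arm)
```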
We compare the performance of k-NN regression (nonparametric) and Ridge regression at top-arm identification and regret minimization in the three scenarios. Mirroring our theoretical discussion, we use uniform sampling for top-arm identification and the UCB strategy for regret analysis. Note that Ridge regression with UCB is the LinUCB algorithm.

Table 1: Top-arm identification and regret results from Ridge and k-NN regressors. Each model was tuned individually and the optimal hyperparameters are shown. k-NN performs better on both metrics for all three scenarios.

| | Quintic: Ridge | Quintic: k-NN | Smiley: Ridge | Smiley: k-NN | Bullseye: Ridge | Bullseye: k-NN |
|---|---|---|---|---|---|---|
| Top-arm test error (uniform sampling) | 0.065 | 0.002 | 0.080 | 0.000 | 0.335 | 0.005 |
| Number of samples | 500k | 500k | 2k | 5000k | 100k | 500k |
| Number of neighbors | – | 100 | – | 50 | – | 20 |
| Test regret (UCB sampling) | 0.0315 | 0.001 | 0.0375 | 0.0135 | 0.161 | 0.004 |
| Number of samples | 1k | 500k | 5k | 1000k | 50k | 1000k |
| Number of neighbors | – | 100 | – | 20 | – | 100 |

Figure 1: Top-arm boundaries for the (a) Quintic, (b) Smiley, and (c) Bullseye scenarios. Red and blue regions correspond to where the top arm is arm 1 and arm 2 respectively.

Figure 2: Observed reward density plots from 10k uniform samples illustrating the pseudo-randomness of the training data. In the colormap, warmer colors correspond to higher values, normalized on the range of the observed rewards.

Figure 3: Test results on top-arm identification using Ridge regression and 25-NN regression on the three scenarios. Contexts are labeled in red and blue if arms 1 and 2 are estimated to be top respectively.

### Qualitative Analysis

We first qualitatively show that k-NN regression can successfully model the bandits whereas the linear method cannot. The difficulty of the task is illustrated by Figure 2, which plots 10k uniformly drawn samples from each scenario with a colormap. A human would have a hard time recovering the regions where each arm is top due to the randomness in the observed rewards. This randomness is considerable since we set $\sigma = 0.5$ to be the same as $|f_i(x \in R_i) - f_i(x \notin R_i)|$.

We fix the number of training samples $N$ to 10k and the number of nearest neighbors to $k = 25$, and we evaluate on 10k random test samples. Figure 3 shows that k-NN regression does an excellent job of reproducing the region boundaries. Ridge regression does a poor job in the Quintic Function case, making a linear approximation to the quintic curve, and completely fails in the Smiley and Bullseye cases, simply choosing the arm whose top-arm region is larger.

### Quantitative Analysis

We report numerical results and optimal hyperparameters in Table 1. We tuned the remaining hyperparameters using grid search on a validation set of size 1k and evaluate the performance of our models on a test set of size 1k. We use the UCB strategy in Auer et al. (2002) (a simplified version of the UCB of Agrawal and Goyal (2013)). We found that a confidence level of 0.1 worked well for all settings. We see that k-NN significantly outperforms Ridge regression for both top-arm identification and regret minimization in all three scenarios (Table 1).

### Image Classification Experiments

We extend our experiments to image classification on the canonical MNIST dataset, which consists of 60k training images and 10k test images of isolated, normalized, handwritten digits. The task is to classify each 28 × 28 image into one of ten classes. We reframe this as a contextual MAB problem by treating the classes as arms and the images as the contexts. Note that for every context, the payoff of every arm is known: 1 if the class is the true label and 0 otherwise.
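The reframing can be sketched as a reward oracle over image contexts. Loading MNIST via scikit-learn's `fetch_openml` is just one convenient assumption here, not the paper's pipeline.

```python
import numpy as np
from sklearn.datasets import fetch_openml

def mnist_bandit_stream(seed=0):
    """Sketch of the reframing: contexts are flattened images, the ten digit
    classes are the arms, and pulling arm i at context x pays 1 if i equals
    the true label and 0 otherwise."""
    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    y = y.astype(int)
    rng = np.random.default_rng(seed)
    for idx in rng.permutation(len(X)):
        context, label = X[idx] / 255.0, y[idx]
        yield context, (lambda arm, label=label: float(arm == label))

# Example: for context, pull in mnist_bandit_stream(): reward = pull(chosen_arm)
```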
We compare k-NN and Ridge regression at regret minimization using the UCB strategy. As before, we use the UCB strategy in Auer et al. (2002) and fix the confidence level to 0.1. We do not employ any data augmentation. We obtain a test regret of 17.5% from LinUCB with $\alpha = 5$, where $\alpha$ is the coefficient of L2 regularization, and a significantly lower test regret of 5.8% from 4-NN UCB. Figure 4 shows that k-NN regression maintains lower regret than Ridge regression over a range of values of $k$ and $\alpha$. We note that Ridge regression working well at relatively large values of $\alpha$ itself suggests that it is a poor model for the task.

Figure 4: Test results on regret minimization for MNIST image classification over varied values of $\alpha$ (for LinUCB) and $k$ (for k-NN UCB). The nonparametric approach achieves significantly lower regret over a range of hyperparameters.

## Conclusion

For the multi-armed bandit setting, we use nonparametric regression to attain tight results for top-arm identification and a sublinear regret of $O(T^{\frac{1+D}{2+D}})$, where $D$ is the dimension of the context. We also show that if the underlying context space has a lower intrinsic dimension $d < D$, then our algorithm automatically adapts to the lower dimension and attains the faster rate $O(T^{\frac{1+d}{2+d}})$. We also provide a procedure for recovering the maximal connected regions in the support where a particular arm is the top arm, together with a consistency analysis, and we give a natural extension to infinite-armed contextual bandits. Our simulations confirm that our method is able to learn in the contextual setting with arbitrary decision boundaries, even in the presence of significant noise, and our experiments on classification of MNIST images demonstrate superior performance of our method over LinUCB on a real-world task.

## References

Agrawal, S., and Goyal, N. 2013. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, 127-135.

Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 2002. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32(1):48-77.

Berry, D. A., and Fristedt, B. 1985. Bandit Problems: Sequential Allocation of Experiments (Monographs on Statistics and Applied Probability), volume 12. Springer.

Bubeck, S., and Cesa-Bianchi, N. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5(1):1-122.

Cesa-Bianchi, N., and Lugosi, G. 2006. Prediction, Learning, and Games. Cambridge University Press.

Chaudhuri, K., and Dasgupta, S. 2010. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, 343-351.

Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. E. 2011. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, 208-214.

Dudik, M.; Hsu, D.; Kale, S.; Karampatziakis, N.; Langford, J.; Reyzin, L.; and Zhang, T. 2011. Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369.

Gittins, J. 1989. Multi-Armed Bandit Allocation Indices. Wiley-Interscience Series in Systems and Optimization.

Goldenshluger, A., and Zeevi, A. 2009. Woodroofe's one-armed bandit problem revisited. The Annals of Applied Probability 19(4):1603-1633.
Goldenshluger, A., and Zeevi, A. 2013. A linear response bandit problem. Stochastic Systems 3(1):230-261.

Jiang, H. 2017a. Density level set estimation on manifolds with DBSCAN. arXiv preprint arXiv:1703.03503.

Jiang, H. 2017b. Rates of uniform consistency for k-NN regression. arXiv preprint arXiv:1707.06261.

Kpotufe, S. 2011. k-NN regression adapts to local intrinsic dimension. In Advances in Neural Information Processing Systems, 729-737.

Lai, T. L., and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1):4-22.

Langford, J., and Zhang, T. 2008. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, 817-824.

Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, 661-670. ACM.

Lu, T.; Pál, D.; and Pál, M. 2009. Showing relevant ads via context multi-armed bandits. Technical report.

Perchet, V., and Rigollet, P. 2013. The multi-armed bandit problem with covariates. The Annals of Statistics 41(2):693-721.

Qian, W., and Yang, Y. 2016a. Kernel estimation and model combination in a bandit problem with covariates. Journal of Machine Learning Research.

Qian, W., and Yang, Y. 2016b. Randomized allocation with arm elimination in a bandit problem with covariates. Electronic Journal of Statistics 10(1):242-270.

Rigollet, P., and Zeevi, A. 2010. Nonparametric bandits with covariates. arXiv preprint arXiv:1003.1630.

Robbins, H. 1985. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, 169-177. Springer.

Yang, Y., and Zhu, D. 2002. Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. The Annals of Statistics 30(1):100-121.