# Contextual Bandits for Unbounded Context Distributions

Puning Zhao 1 2, Rongfei Fan 3, Shaowei Wang 4, Li Shen 1 2, Qixin Zhang 5, Zong Ke 6, Tianhang Zheng 7

Abstract

Nonparametric contextual bandit is an important model of sequential decision making problems. Under the α-Tsybakov margin condition, existing research has established a regret bound of $O(T^{1-\frac{\alpha+1}{d+2}})$ for bounded supports. However, the optimal regret with unbounded contexts has not been analyzed. The challenge of solving contextual bandit problems with unbounded support is to achieve both the exploration-exploitation tradeoff and the bias-variance tradeoff simultaneously. In this paper, we solve the nonparametric contextual bandit problem with unbounded contexts. We propose two nearest neighbor methods combined with UCB exploration. The first method uses a fixed k. Our analysis shows that this method achieves minimax optimal regret under a weak margin condition and relatively light-tailed context distributions. The second method uses an adaptive k. By a proper data-driven selection of k, this method achieves an expected regret of $O\big(T^{1-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}} + T^{1-\beta}\big)$, in which β is a parameter describing the tail strength. This bound matches the minimax lower bound up to logarithm factors, indicating that the second method is approximately optimal.

1. Introduction

Multi-armed bandit (Robbins, 1952; Lai & Robbins, 1985) is an important sequential decision problem that has been extensively studied (Agrawal, 1995; Auer et al., 2002; Garivier & Cappé, 2011).
1Shenzhen Campus of Sun Yat-sen University, Shenzhen, China; 2Guangming Laboratory, Shenzhen, China; 3Beijing Institute of Technology, Beijing, China; 4Guangzhou University, Guangzhou, China; 5Nanyang Technological University, Singapore; 6National University of Singapore, Singapore; 7Zhejiang University, Hangzhou, China. Correspondence to: Li Shen.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

In many practical applications such as recommender systems and information retrieval in healthcare and finance (Bouneffouf et al., 2020), decision problems are usually modeled as contextual bandits (Woodroofe, 1979), in which the reward depends on some side information, called contexts. At the t-th iteration, the decision maker observes the context $X_t$, and then pulls an arm $A_t \in \mathcal{A}$ based on $X_t$ and the previous trajectory $(X_i, A_i)$, $i = 1, \ldots, t-1$. Much research assumes linear rewards (Abbasi-Yadkori et al., 2011; Bastani & Bayati, 2020; Bastani et al., 2021; Qian et al., 2023; Langford & Zhang, 2007; Dudik et al., 2011; Chu et al., 2011; Li et al., 2010), which is restrictive and may not fit well into practical scenarios. Consequently, in recent years, nonparametric contextual bandits have received significant attention, which make no parametric assumption about the reward functions (Perchet & Rigollet, 2013; Guan & Jiang, 2018; Gur et al., 2022; Blanchard et al., 2023; Suk & Kpotufe, 2023; Suk, 2024; Cai et al., 2024). Despite significant progress on nonparametric contextual bandits, existing studies focus only on the case with bounded contexts, in which the probability density function (pdf) of the contexts is required to be bounded away from zero. However, many practical applications involve unbounded contexts, such as healthcare (Durand et al., 2018), dynamic pricing (Misra et al., 2019) and recommender systems (Zhou et al., 2017).
In particular, the contexts may follow a heavy-tailed distribution (Zangerle & Bauer, 2022), which is significantly different from bounded contexts. Therefore, to bridge the gap between theoretical studies and practical applications of contextual bandits, an in-depth theoretical study of unbounded contexts is crucially needed. Compared with bounded contexts, a heavy-tailed context distribution requires the learning method to be adaptive to the pdf of the contexts, in order to balance the bias and variance of the estimation of the reward functions. On the other hand, compared with existing works on nonparametric classification and regression with independent and identically distributed (i.i.d.) data, bandit problems require a good balance between exploration and exploitation, so the learning method needs to be adaptive to the suboptimality gap of the reward functions. Therefore, the main challenge of solving nonparametric contextual bandit problems with unbounded contexts is to achieve both the bias-variance tradeoff and the exploration-exploitation tradeoff with a single algorithm.

| Method | Bounded context | Heavy-tailed context |
|---|---|---|
| UCBogram (Rigollet & Zeevi, 2010) | $O(T^{1-\min\{\frac{\alpha+1}{d+2},\frac{2}{d+2}\}})$ | - |
| ABSE (Perchet & Rigollet, 2013) | $O(T^{1-\frac{\alpha+1}{d+2}})$ | - |
| SACB (Gur et al., 2022) | $O(T^{1-\frac{\alpha+1}{d+2}})$ | - |
| kNN-UCB (Guan & Jiang, 2018) | $O(T^{\frac{d+1}{d+2}})$ | - |
| kNN-UCB (Reeve et al., 2018) | $O(T^{1-\frac{\alpha+1}{d+2}})$ | - |
| kNN-UCB¹ (this work) | $O(T^{\max\{1-\frac{\alpha+1}{d+2},\frac{2}{\alpha+3}\}})$ | $O(T^{1-\frac{\beta\min(d,\alpha+1)}{\min(d-1,\alpha)+\max(1,d\beta)+2\beta}})$ |
| Adaptive kNN-UCB (this work) | $O(T^{1-\frac{\alpha+1}{d+2}})$ | $O(T^{1-\min\{\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta},\beta\}})$ |
| Minimax lower bound | $\Omega(T^{1-\frac{\alpha+1}{d+2}})$ | $\Omega(T^{1-\min\{\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta},\beta\}})$ |

Table 1. Comparison of the T-step expected cumulative regrets of learning algorithms for nonparametric contextual bandits with Lipschitz reward functions under the α-Tsybakov margin condition and tail parameter β (Assumptions 1(a) and 3(a)).

¹Although the name kNN-UCB is the same as in (Guan & Jiang, 2018) and (Reeve et al., 2018), the calculations of the UCB differ between these methods; see Section 5 for details.

In this paper, we solve the nonparametric contextual bandit problem with heavy-tailed contexts. To begin with, we derive a minimax lower bound that characterizes the theoretical limits of contextual bandit learning. We then propose a relatively simple method that uses a fixed k combined with upper confidence bound (UCB) exploration and derive a bound on its expected regret. Even for bounded contexts, our method improves over an existing nearest neighbor method (Guan & Jiang, 2018) for a large margin parameter α, since our method uses an improved UCB calculation that is more adaptive to the suboptimality gap of the reward functions. Despite such progress, there is still some gap between the regret bound and the minimax lower bound, indicating room for further improvement. To close this gap, we further propose a new adaptive nearest neighbor approach, which selects k adaptively based on the density of samples and the suboptimality gap of the reward functions. Our analysis shows that the regret bound of this new method matches the minimax lower bound up to logarithmic factors, indicating that this method is approximately minimax optimal.

The general guidelines of our adaptive kNN method are summarized as follows. Firstly, with a higher context pdf, we use a larger k, and vice versa. Secondly, given a specific context, if the value of an action is far from optimal (i.e. a large suboptimality gap), then we use a smaller k, and vice versa. Such a choice of k achieves a good tradeoff between estimation bias and variance. With a lower pdf or a larger suboptimality gap, the samples are relatively sparse. As a result, a large bias may occur due to large kNN distances, so we use a smaller k to control the bias. On the contrary, with a higher pdf or a smaller suboptimality gap, the samples are dense, and thus we can use a larger k to reduce the variance. Note that the pdf and the
suboptimality gap are unknown to the learner. Therefore, we design a method such that the value of k is selected in a data-driven manner, based on the density of the existing samples.

1.1. Contribution

The contributions of this paper are summarized as follows.

- We derive the minimax lower bound of nonparametric contextual bandits with heavy-tailed context distributions.
- We propose a simple kNN method with UCB exploration. The regret bound matches the minimax lower bound for small α and large β.
- We propose an adaptive kNN method, such that k is selected based on previous steps. The regret bound matches the minimax lower bound under all parameter regimes.

Our results and the comparison with related works are summarized in Table 1. To the best of our knowledge, our work is the first attempt to handle heavy-tailed context distributions in contextual bandit problems. In particular, our newly proposed adaptive kNN method achieves the minimax lower bound for the first time. The proofs of all theoretical results in the paper are given in the supplementary material.

2. Related Work

In this section, we briefly review related works on contextual bandits and nearest neighbor methods.

Nonparametric contextual bandits with bounded contexts. (Yang & Zhu, 2002) first introduced the nonparametric contextual bandit problem, proposed an ϵ-greedy approach and proved its consistency. (Rigollet & Zeevi, 2010) derived a minimax lower bound on the regret and showed that this bound is achievable by a UCB method. (Perchet & Rigollet, 2013) proposed Adaptively Binned Successive Elimination (ABSE), which adapts to the unknown margin parameter. (Qian & Yang, 2016) proposed a kernel estimation method. (Hu et al., 2020) analyzed the nonparametric bandit problem under a general Hölder smoothness assumption.
(Gur et al., 2022) proposed Smoothness-Adaptive Contextual Bandits (SACB), which is adaptive to the smoothness parameter. (Slivkins, 2014; Suk & Kpotufe, 2023; Akhavan et al., 2024; Ghosh et al., 2024; Komiyama et al., 2024; Suk, 2024) analyzed the problem of dynamic regret. Furthermore, (Wanigasekara & Yu, 2019; Locatelli & Carpentier, 2018; Krishnamurthy et al., 2020; Zhu et al., 2022) discussed the case with continuous actions.

Nearest neighbor methods. Nearest neighbor classification has been analyzed in (Chaudhuri & Dasgupta, 2014; Döring et al., 2018) for features with bounded support. (Gadat et al., 2016; Kpotufe, 2011; Cannings et al., 2020; Zhao & Lai, 2021b;a) proposed adaptive nearest neighbor methods for heavy-tailed feature distributions. (Guan & Jiang, 2018) proposed the kNN-UCB method for contextual bandits and proved a regret bound of $O(T^{\frac{1+d}{2+d}})$.

Compared with existing methods for nonparametric contextual bandits, with unbounded contexts the method needs to adapt to different density levels of the contexts and achieve a better bias-variance tradeoff in the estimation of the reward functions. Moreover, existing works on nonparametric classification cannot be easily extended here, since the samples are no longer i.i.d. and we now need to bound the regret instead of the estimation error. These factors introduce new technical difficulties into the theoretical analysis. In this work, to address these challenges, we design new algorithms that are adaptive to both the pdf and the suboptimality gap of the reward functions, and provide a corresponding theoretical analysis.

3. Preliminaries

Denote $\mathcal{X}$ as the space of contexts and $\mathcal{A}$ as the space of actions. Throughout this paper, we discuss the case with infinite $\mathcal{X}$ and finite $\mathcal{A}$. At the t-th step, the context $X_t$ is a random variable drawn from a distribution with probability density function (pdf) f.
Then the agent takes an action $A_t \in \mathcal{A}$ and receives a reward $Y_t$:

$$Y_t = \eta_{A_t}(X_t) + W_t, \quad (1)$$

in which $\eta_a(x)$ for $a \in \mathcal{A}$ and $x \in \mathcal{X}$ is an unknown expected reward function, and $W_t$ denotes the noise, with $\mathbb{E}[W_t \mid X_{1:t}, A_{1:t}] = 0$. Throughout this paper, define

$$\eta^*(x) = \max_{a \in \mathcal{A}} \eta_a(x) \quad (2)$$

as the maximum expected reward at context x. For any suboptimal action a, $\eta_a(x) < \eta^*(x)$. Correspondingly, $\eta^*(x) - \eta_a(x)$ is called the suboptimality gap. The performance of an algorithm is evaluated by the expected regret

$$R = \mathbb{E}\left[\sum_{t=1}^{T} \big(\eta^*(X_t) - \eta_{A_t}(X_t)\big)\right]. \quad (3)$$

We then present the assumptions needed for the analysis. To begin with, we state some basic conditions in Assumption 1.

Assumption 1. There exist constants $C_\alpha$, $\sigma$, $L$ such that:

(a) (Tsybakov margin condition) For some $\alpha \le d$, for all $a \in \mathcal{A}$ and $u > 0$, $P(0 < \eta^*(X) - \eta_a(X) < u) \le C_\alpha u^\alpha$;

(b) $W_t$ is subgaussian with parameter $\sigma^2$, i.e. $\mathbb{E}[e^{\lambda W_t}] \le e^{\frac{1}{2}\lambda^2\sigma^2}$;

(c) For all a, $\eta_a$ is Lipschitz with constant L, i.e. for any x and x', $|\eta_a(x) - \eta_a(x')| \le L\|x - x'\|$.

Now we comment on these assumptions. (a) is the Tsybakov margin condition, which was first introduced in (Audibert & Tsybakov, 2007) for classification problems, and was later used in contextual bandit problems (Perchet & Rigollet, 2013). Note that $P(0 < \eta^*(X) - \eta_a(X) < u) \le 1$ always holds, so for any $\eta_a$, (a) holds with $C_\alpha = 1$ and $\alpha = 0$. Therefore, this assumption is nontrivial only if it holds with some $\alpha > 0$. Moreover, we only consider the case with $\alpha \le d$ here. If $\alpha > d$, then an arm is either always or never optimal, and it is easy to achieve logarithmic regret (see (Perchet & Rigollet, 2013), Proposition 3.1). An additional remark is that in (Perchet & Rigollet, 2013; Reeve et al., 2018), the margin assumption is $P(0 < \eta^*(X) - \eta_s(X) < u) \lesssim u^\alpha$, in which $\eta_s(x)$ is the second largest value among $\{\eta_a(x) \mid a \in \mathcal{A}\}$. Our assumption (a) is slightly weaker than the existing ones, since we only impose margin conditions on the suboptimality gap $\eta^*(x) - \eta_a(x)$ for each a separately, instead of on the minimum suboptimality gap.
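To make the model (1)-(2) concrete, the following is a minimal sketch of the reward model and the suboptimality gap in Python. The two reward functions here (sine and cosine, as in the synthetic experiments of Section 7) and the Gaussian noise are illustrative choices, not part of the model itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(a, x):
    """Expected reward eta_a(x) of action a at context x (two actions)."""
    return np.sin(x) if a == 0 else np.cos(x)

def eta_star(x):
    """Maximum expected reward, Eq. (2)."""
    return max(eta(0, x), eta(1, x))

def step(a, x, sigma=0.1):
    """Draw a noisy reward Y = eta_a(x) + W per Eq. (1), W ~ N(0, sigma^2)."""
    return eta(a, x) + sigma * rng.normal()

# Suboptimality gap of action a at context x: eta*(x) - eta_a(x) >= 0.
x = 0.3
gap = eta_star(x) - eta(0, x)
```

The regret (3) of a policy is then the expected sum of such gaps over the T rounds.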
In (b), following existing works (Reeve et al., 2018), we assume that the noise has light tails. (c) is a common assumption in the literature on nonparametric estimation (Mai & Johansson, 2021). It is possible to extend this work to a more general Hölder smoothness assumption via adaptive nearest neighbor weights (Cannings et al., 2020). In this paper, we focus only on Lipschitz continuity for convenience.

Assumption 2 is designed for the case that the contexts have bounded support.

Assumption 2. $f(x) \ge c$ for all $x \in \mathcal{X}$, in which f is the pdf of the contexts.

In Assumption 2, the pdf f is required to be bounded away from zero, which is also assumed in (Perchet & Rigollet, 2013; Guan & Jiang, 2018; Reeve et al., 2018). Note that even for estimation with i.i.d. data, this assumption is common (Audibert & Tsybakov, 2007; Döring et al., 2018; Gao et al., 2018). We then state some assumptions for heavy-tailed distributions.

Assumption 3. (a) For any $u > 0$, $P(f(X) \le u) \le C_\beta u^\beta$ for some constants $C_\beta$ and $\beta$; (b) The difference of the reward function η among all actions is bounded, i.e. $\sup_{x \in \mathcal{X}} (\eta^*(x) - \min_a \eta_a(x)) \le M$ for some constant M.

(a) is a common tail assumption in nonparametric statistics, which has been made in (Gadat et al., 2016; Zhao & Lai, 2021b). β describes the tail strength: a smaller β indicates that the context distribution has heavier tails, and vice versa. To further illustrate this assumption, we show several examples.

Example 1. If f has bounded support $\mathcal{X}$, then Assumption 3(a) holds with $C_\beta = V(\mathcal{X})$ and $\beta = 1$, in which $V(\mathcal{X})$ is the volume of the support set $\mathcal{X}$.

Example 2. If X has a bounded p-th moment, i.e. $\mathbb{E}[\|X\|^p] < \infty$, then for all $\beta < p/(p+d)$, there exists a constant $C_\beta$ such that Assumption 3(a) holds. In particular, for subgaussian or subexponential random variables, Assumption 3(a) holds for all $\beta < 1$.

Proof. The analysis of these examples and other related discussions are shown in the supplementary material.
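Assumption 3(a) is easy to probe numerically. The sketch below checks it by Monte Carlo for a standard Gaussian context in d = 1 (Example 2 suggests any β < 1 works in this case); the particular constants β = 0.9 and C = 5 are our own illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=200_000)                   # standard Gaussian contexts
f = np.exp(-X**2 / 2) / np.sqrt(2 * np.pi)     # pdf f evaluated at the samples

beta, C = 0.9, 5.0                             # illustrative tail constants
us = np.array([1e-3, 1e-2, 1e-1])
tail = np.array([(f <= u).mean() for u in us])  # empirical P(f(X) <= u)
ok = bool(np.all(tail <= C * us**beta))         # check P(f(X) <= u) <= C u^beta
```

The empirical tail probabilities sit well below the bound $C u^\beta$ at each tested level u, consistent with Example 2.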
It is worth mentioning that although the growth rate of the regret is affected by the value of β, our proposed algorithms, including both the fixed-k and the adaptive methods, do not require knowing β. (b) restricts the suboptimality gap of each action. This is not necessary if the support is bounded. However, with unbounded support and without assumption (b), $\eta^*(x) - \eta_a(x)$ can grow as $f(x)$ decreases, such that the regret of a suboptimal action is large and the identification of the best action is hard.

Finally, we clarify notations as follows. Throughout this paper, $\|\cdot\|$ denotes the ℓ2 norm. $a \lesssim b$ denotes $a \le Cb$ for some constant C, which may depend on the constants in the assumptions. The notation $\gtrsim$ is defined conversely.

4. Minimax Analysis

In this section, we show the minimax lower bound, which characterizes the theoretical limit of the regret of contextual bandits. Throughout this section, denote $\pi: \mathcal{X} \times \mathcal{X}^{t-1} \times \mathbb{R}^{t-1} \to \mathcal{A}$ as the policy, such that each action is selected according to π. More precisely,

$$A_t = \pi(X_t; X_{1:t-1}, Y_{1:t-1}), \quad (4)$$

which indicates that the action $A_t$ at time t depends on the current context and the record of contexts and rewards in the previous $t-1$ steps. The minimax lower bound for the case with bounded support has been shown in Theorem 4.1 of (Rigollet & Zeevi, 2010). For completeness and notational consistency, we state the result below and provide a simplified proof.

Theorem 1. ((Rigollet & Zeevi, 2010), Theorem 4.1) Denote $\mathcal{F}_A$ as the set of pairs (f, η) that satisfy Assumptions 1 and 2 (which means that the contexts have bounded support). Then

$$\inf_\pi \sup_{(f,\eta) \in \mathcal{F}_A} R \gtrsim T^{1-\frac{1+\alpha}{d+2}}. \quad (5)$$

We then show the minimax regret bound for unbounded support, which is a new result that has not been obtained before.

Theorem 2. Denote $\mathcal{F}_B$ as the set of pairs (f, η) that satisfy Assumptions 1 and 3 (which means that the contexts can have unbounded support). Then

$$\inf_\pi \sup_{(f,\eta) \in \mathcal{F}_B} R \gtrsim T^{1-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}} + T^{1-\beta}. \quad (6)$$

From the results above, as $\beta \to \infty$, (6) reduces to (5).

Proof.
(Outline) For bounded support, we derive the lower bound of the regret by first analyzing the minimax optimal number of suboptimal actions. Define

$$S = \sum_{t=1}^{T} \mathbf{1}\big(\eta_{A_t}(X_t) < \eta^*(X_t)\big). \quad (7)$$

S can be lower bounded using standard tools in nonparametric statistics (Tsybakov, 2009), namely constructing multiple hypotheses and bounding the minimum error probability. As shown in (Rigollet & Zeevi, 2010), the lower bound on S can then be transformed into a lower bound on R.

The minimax analysis becomes more complex with unbounded support. Firstly, the heavy-tailed context distribution requires a different hypothesis construction. Secondly, the transformation from the lower bound on S to a lower bound on R does not yield tight bounds. We design new approaches that construct a set of candidate functions η and derive lower bounds on R directly.

For bounded context support, the regret comes mainly from the region with $\eta^*(x) - \eta_a(x) \lesssim T^{-1/(d+2)}$ (which is the classical rate for nonparametric estimation (Tsybakov, 2009)), in which the identification of the best action is not guaranteed to be correct. However, with heavy-tailed contexts, regret may also come from the tail, i.e. the region with small f(x), where the number of samples around x is not enough to yield a reliable identification of the best action. For heavy-tailed cases, i.e. small β, the regret caused by the tail region may dominate. This also explains why we need different techniques to derive the minimax lower bound for heavy-tailed contexts.

In the remainder of this paper, we say that a method is nearly minimax optimal if the dependence of the expected regret on T matches (5) or (6). Following the conventions of existing works (Rigollet & Zeevi, 2010; Perchet & Rigollet, 2013; Hu et al., 2020; Gur et al., 2022), the minimax lower bounds are currently derived for contextual bandit problems with only two actions; thus we do not consider the minimax optimality of the regret with respect to the number of actions $|\mathcal{A}|$.
5. Nearest Neighbor Method with Fixed k

To begin with, we propose and analyze a simple nearest neighbor method with fixed k. We make the following definitions first. Denote $n_a(t) = |\{i < t \mid A_i = a\}|$ as the number of steps with action a before time step t. Let $N_t(x, a)$ be the set of the k nearest neighbors of x among $\{i < t \mid A_i = a\}$. Define

$$\rho_{a,t}(x) = \max_{i \in N_t(x,a)} \|X_i - x\| \quad (8)$$

as the k nearest neighbor distance, i.e. the distance from x to its k-th nearest neighbor among all previous steps with action a.

With the above notations, we describe the fixed-k nearest neighbor method as follows. If $n_a(t) \ge k$, then

$$\hat{\eta}_{a,t}(x) = \frac{1}{k}\sum_{i \in N_t(x,a)} Y_i + b + L\rho_{a,t}(x), \quad (9)$$

in which b has the fixed value

$$b = \sigma\sqrt{\frac{2}{k}\ln\big(dT^{2d+2}|\mathcal{A}|\big)}. \quad (10)$$

If $n_a(t) < k$, then

$$\hat{\eta}_{a,t}(x) = \infty. \quad (11)$$

Here we explain our design. If $n_a(t) \ge k$, then it is possible to give a UCB estimate of $\eta_a(x)$, shown in (9). $L\rho_{a,t}(x)$ bounds the estimation bias, while b is an upper bound on the error caused by random noise that holds with high probability. In Lemma 5 in Appendix E, we show that $\hat{\eta}_{a,t}(x)$ is a valid UCB estimate of $\eta_a(x)$, i.e. $\hat{\eta}_{a,t}(x) \ge \eta_a(x)$ holds with high probability. If $n_a(t) < k$, then it is impossible to give a UCB estimate. In this case, we just let $\hat{\eta}_{a,t}(x)$ be infinite. Finally, the algorithm selects the action $A_t$ with the maximum UCB value:

$$A_t = \arg\max_a \hat{\eta}_{a,t}(X_t). \quad (12)$$

According to (11), as long as an action has not been taken at least k times, the UCB estimate of $\eta_a$ will be infinite. Note that the selection rule (12) ensures that actions with infinite UCB values are taken first. Therefore, the first $k|\mathcal{A}|$ steps are used for pure exploration: in this stage, the agent takes each action a exactly k times. After $k|\mathcal{A}|$ steps, the UCB values for all $x \in \mathcal{X}$ and $a \in \mathcal{A}$ become finite. From then on, at each step, the action with the maximum UCB value specified in (9) is selected.

Algorithm 1 Nearest neighbor method with fixed k and UCB exploration
for t = 1, ..., T do
  Receive context $X_t$;
  for $a \in \mathcal{A}$ do
    Calculate $n_a(t) = |\{i < t \mid A_i = a\}|$;
    if $n_a(t) \ge k$ then
      Calculate $\hat{\eta}_{a,t}(X_t)$ using (9);
    else
      Let $\hat{\eta}_{a,t}(X_t) = \infty$;
    end if
  end for
  $A_t = \arg\max_a \hat{\eta}_{a,t}(X_t)$;
  Pull $A_t$;
end for

The procedures above are summarized in Algorithm 1. Compared with (Guan & Jiang, 2018), our method constructs the UCB differently. In (Guan & Jiang, 2018), the UCB is $\frac{1}{k}\sum_{i \in N_t(x,a)} Y_i + \sigma(T_a(t-1))$, in which the bonus $\sigma(T_a(t-1))$ is uniform over all x for a fixed action a. Therefore, the method of (Guan & Jiang, 2018) is not adaptive to the suboptimality gap $\eta^*(x) - \eta_a(x)$. On the contrary, our method has a term $L\rho_{a,t}(x)$ that varies with x, and thus adapts better to the suboptimality gap. The bound on the regret is shown in Theorem 3.

Theorem 3. Under Assumptions 1 and 2, the regret of the simple nearest neighbor method with UCB exploration is bounded as follows:

(1) If $d > \alpha + 1$, then with $k \sim T^{\frac{2}{d+2}}$,

$$R \lesssim T^{1-\frac{\alpha+1}{d+2}}\,|\mathcal{A}|\ln^{\frac{1}{2}}\big(dT^{2d+2}|\mathcal{A}|\big); \quad (13)$$

(2) If $d \le \alpha + 1$, then with $k \sim T^{\frac{2}{\alpha+3}}$,

$$R \lesssim T^{\frac{2}{\alpha+3}}\,|\mathcal{A}|\ln^{\frac{1}{2}}\big(dT^{2d+2}|\mathcal{A}|\big). \quad (14)$$

We compare our result with (Guan & Jiang, 2018), which proposes a similar nearest neighbor method. The analysis in (Guan & Jiang, 2018) does not make the Tsybakov margin assumption (Assumption 1(a)), and the regret bound is $O(T^{\frac{d+1}{d+2}})$. Without any restriction on η, Assumption 1(a) holds with $C_\alpha = 1$ and $\alpha = 0$, under which (13) reduces to $O(T^{\frac{d+1}{d+2}})$. Therefore, our result matches (Guan & Jiang, 2018) with $\alpha = 0$. If $\alpha > 0$, which indicates that a small suboptimality gap only happens with small probability, then the regret of the method in (Guan & Jiang, 2018) is still $O(T^{\frac{d+1}{d+2}})$, while our result improves it to $O(T^{1-\frac{\alpha+1}{d+2}})$. As discussed earlier, compared with (Guan & Jiang, 2018), our method improves the UCB calculation in (9), and is thus more adaptive to the suboptimality gap $\eta^*(x) - \eta_a(x)$. With $\alpha > 0$, our method achieves smaller regret due to a better tradeoff between exploration and exploitation.
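As a concrete sketch, the UCB computation of Eqs. (8)-(12) can be written in a few lines of Python. The helper names and the flat-array representation of the per-action history are our own illustrative choices; the bonus b is passed in as a constant rather than computed from (10).

```python
import numpy as np

def knn_ucb(x, hist_X, hist_Y, k, L, b):
    """UCB estimate of eta_a(x) from the history of one action.

    hist_X: (n, d) past contexts where action a was taken
    hist_Y: (n,)  corresponding rewards
    """
    if len(hist_Y) < k:                       # Eq. (11): too few samples
        return np.inf
    dist = np.linalg.norm(hist_X - x, axis=1)
    idx = np.argsort(dist)[:k]                # indices of the k nearest neighbors
    rho = dist[idx].max()                     # Eq. (8): k-NN distance
    return hist_Y[idx].mean() + b + L * rho   # Eq. (9)

def select_action(x, history, k, L, b):
    """Eq. (12): pull the arm with the largest UCB."""
    ucbs = [knn_ucb(x, hX, hY, k, L, b) for (hX, hY) in history]
    return int(np.argmax(ucbs))
```

Because an arm with fewer than k pulls gets an infinite UCB, `select_action` reproduces the pure-exploration phase of the first $k|\mathcal{A}|$ steps automatically.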
Compared with the minimax lower bound in Theorem 1, it can be seen that the kNN method with fixed k is not completely optimal. With $d > \alpha + 1$, the upper bound matches the lower bound derived in Theorem 1. However, with $d \le \alpha + 1$, the regret is significantly higher than the minimax lower bound, indicating that there is room for further improvement.

We then analyze the performance for heavy-tailed context distributions. The result is shown in the following theorem.

Theorem 4. Under Assumptions 1 and 3, the regret of the simple nearest neighbor method with UCB exploration is bounded as follows:

$$R \lesssim T^{1-\frac{\beta\min(d,\alpha+1)}{\min(d-1,\alpha)+\max(1,d\beta)+2\beta}}\,|\mathcal{A}|\ln^{\frac{1}{2}\max(d,\alpha+1)}\big(dT^{2d+2}|\mathcal{A}|\big). \quad (15)$$

From (15), there are two phase transitions: the first one is at $d = \alpha + 1$, while the second one is at $d\beta = 1$. Intuitively, the phase transitions occur because the regret is dominated by different regions depending on the settings of α and β. Compared with the minimax lower bound in Theorem 2, the kNN method with fixed k achieves nearly minimax optimal regret up to logarithm factors if $d > \alpha + 1$ and $\beta > 1/d$; otherwise the regret bound is suboptimal.

Here we provide an intuition for why the kNN method with fixed k achieves suboptimal regret. In the region where the context pdf f(x) is low, or the suboptimality gap $\eta^*(x) - \eta_a(x)$ is large, the samples with action a are relatively sparse. In this case, with fixed k, the nearest neighbor distances are too large, resulting in a large estimation bias. On the contrary, if f(x) is high or $\eta^*(x) - \eta_a(x)$ is small, then the samples with action a are relatively dense, so the bias is small, and we can increase k to reduce the variance. Therefore, if k is fixed throughout the support set, then the algorithm estimates the reward function $\eta_a(x)$ inefficiently, resulting in suboptimal regret.
Apart from the suboptimal regret, another drawback is that with $d \le \alpha + 1$, the optimal selection of k depends on the margin parameter α, which is usually unknown in practice. In the next section, we propose an adaptive nearest neighbor method to address the issues mentioned above.

6. Nearest Neighbor Method with Adaptive k

In the previous section, we have shown that the standard kNN method with fixed k is suboptimal when $d \le \alpha + 1$ or $\beta \le 1/d$. The intuition is that the standard nearest neighbor method does not adjust k based on the pdf and the suboptimality gap. In this section, we propose an adaptive nearest neighbor approach. To achieve a good exploration-exploitation tradeoff and bias-variance tradeoff, k needs to be smaller for a small pdf f(x) or a large suboptimality gap $\eta^*(x) - \eta_a(x)$, and vice versa. However, as both f(x) and $\eta^*(x) - \eta_a(x)$ are unknown to the learner, we need to decide k based entirely on the existing samples. The guideline of our design is that, given a context $X_t$ at time t, we use a large k if previous samples are relatively dense around $X_t$, and vice versa. To be more precise, for all $x \in \mathcal{X}$, let

$$k_{a,t}(x) = \max\left\{ j \,\middle|\, L\rho_{a,t,j}(x) \le \sqrt{\frac{\ln T}{j}} \right\}, \quad (16)$$

in which $\rho_{a,t,j}(x)$ is the distance from x to its j-th nearest neighbor among the existing samples with action a, i.e. $\{i < t \mid A_i = a\}$. This selection of k makes the bias term $L\rho_{a,t,j}(x)$ match the variance term $\sqrt{\ln T / j}$, so (16) achieves a good tradeoff between bias and variance. The exploration-exploitation tradeoff is also desirable, since $\rho_{a,t,j}(x)$ is large when $\eta^*(x) - \eta_a(x)$ is large, which yields a smaller k. Note that (16) can be calculated only if $L\rho_{a,t,1}(x) \le \sqrt{\ln T}$, which means that the 1-nearest-neighbor distance cannot be too large. At some time step t, for some action a, if there are no existing samples, or $X_t$ is more than $\sqrt{\ln T}/L$ away from all existing samples $X_1, \ldots, X_{t-1}$, then we just let the UCB estimate be infinite, i.e. $\hat{\eta}_{a,t}(x) = \infty$.
Otherwise, we calculate the upper confidence bound as follows:

$$\hat{\eta}_{a,t}(x) = \frac{1}{k_{a,t}(x)}\sum_{i \in N_t(x,a)} Y_i + b_{a,t}(x) + L\rho_{a,t}(x), \quad (17)$$

in which $N_t(x, a)$ is the set of the $k_{a,t}(x)$ nearest neighbors of x among $\{i < t \mid A_i = a\}$, $\rho_{a,t}(x)$ is the corresponding $k_{a,t}(x)$ nearest neighbor distance of x, i.e. $\rho_{a,t}(x) = \rho_{a,t,k_{a,t}(x)}(x)$, and

$$b_{a,t}(x) = \sigma\sqrt{\frac{2}{k_{a,t}(x)}\ln\big(dT^{2d+3}|\mathcal{A}|\big)}. \quad (18)$$

Similar to the fixed-k nearest neighbor method, the last two terms in (17) cover the uncertainty of the reward function estimation: the term $b_{a,t}(x)$ gives a high-probability bound on the random error, and $L\rho_{a,t}(x)$ bounds the bias. With the UCB calculation in (17), $\hat{\eta}_{a,t}(x)$ is an upper bound of $\eta_a(x)$ that holds with high probability, so that exploration and exploitation can be balanced well. The complete description of the newly proposed adaptive nearest neighbor method is shown in Algorithm 2.

Algorithm 2 Adaptive nearest neighbor with UCB exploration
for t = 1, ..., T do
  Receive context $X_t$;
  for $a \in \mathcal{A}$ do
    if $L\rho_{a,t,1}(X_t) > \sqrt{\ln T}$ then
      $\hat{\eta}_{a,t}(X_t) = \infty$;
    else
      Calculate $k_{a,t}(X_t)$ using (16);
      Calculate $\hat{\eta}_{a,t}(X_t)$ using (17);
    end if
  end for
  $A_t = \arg\max_a \hat{\eta}_{a,t}(X_t)$;
  Pull $A_t$;
end for

We then analyze the regret of the adaptive method for both bounded and unbounded supports of the contexts.

Theorem 5. Under Assumptions 1 and 2, the regret of the adaptive nearest neighbor method with UCB exploration is bounded by

$$R \lesssim T^{1-\frac{\alpha+1}{d+2}}\,|\mathcal{A}|\ln T. \quad (19)$$

By comparing Theorem 5 with Theorem 3, it can be seen that for the case with bounded support, the adaptive method improves over the fixed-k nearest neighbor method. From the minimax bound in Theorem 1, the fixed-k method is only optimal for $d > \alpha + 1$, while the adaptive method is also optimal for $d \le \alpha + 1$, up to logarithm factors. An intuitive explanation is that with a large α, the suboptimality gap $\eta^*(x) - \eta_a(x)$ is small only in a small region, and the exploration-exploitation tradeoff becomes harder; thus the advantage of the adaptive method over the fixed one becomes more obvious.
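The data-driven rule (16) can be sketched as follows. Since $L\rho_{a,t,j}(x)$ is nondecreasing in j while $\sqrt{\ln T / j}$ is decreasing, the maximum in (16) can be found by scanning j upward and stopping at the first violation. The function name and interface are illustrative.

```python
import numpy as np

def adaptive_k(x, hist_X, L, T):
    """Return k_{a,t}(x) as in Eq. (16), or 0 if even j = 1 fails the test.

    hist_X: (n, d) contexts of the previous steps where action a was taken.
    """
    # rho_1 <= rho_2 <= ... : sorted nearest neighbor distances from x
    dist = np.sort(np.linalg.norm(hist_X - x, axis=1))
    k = 0
    for j, rho_j in enumerate(dist, start=1):
        if L * rho_j <= np.sqrt(np.log(T) / j):
            k = j            # bias still below the variance term: accept j
        else:
            break            # rho_j is nondecreasing, so no larger j can pass
    return k
```

A return value of 0 corresponds to the case $L\rho_{a,t,1}(x) > \sqrt{\ln T}$, where Algorithm 2 sets the UCB to infinity.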
We then analyze the performance of the adaptive nearest neighbor method for heavy-tailed distributions. The result is shown in the following theorem.

Theorem 6. Under Assumptions 1 and 3, the regret of the adaptive nearest neighbor method with UCB exploration is bounded by

$$R \lesssim \begin{cases} T^{1-\min\{\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta},\,\beta\}}\,|\mathcal{A}|\ln T & \text{if } \beta \neq \frac{1}{d+2}, \\ T^{\frac{d+1}{d+2}}\,|\mathcal{A}|\ln^2 T & \text{if } \beta = \frac{1}{d+2}. \end{cases} \quad (20)$$

Compared with the minimax lower bound in Theorem 2, our method achieves nearly minimax optimal regret up to a logarithm factor. Regarding this result, we have some additional remarks.

Remark 1. With $\beta \to \infty$, the regret bound in (20) reduces to (19). As discussed earlier, the case that the contexts have bounded support and f is bounded away from zero can be viewed as a special case with $\beta \to \infty$.

Remark 2. In (Zhao & Lai, 2021b), it is shown that the optimal rate of the excess risk of nonparametric classification is $O(N^{-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}})$.² From (20), the average regret over all T steps is $O(T^{-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}})$, which has the same rate as in the nonparametric classification problem.

7. Numerical Experiments

To validate our theoretical analysis, we first run experiments on synthesized data. We then move on to experiments with the MNIST dataset (LeCun, 1998).

7.1. Synthesized Data

To begin with, we conduct experiments with d = 1. In each experiment, we run T = 1,000 steps and compare the performance of the adaptive nearest neighbor method with the UCBogram (Rigollet & Zeevi, 2010), ABSE (Perchet & Rigollet, 2013), and the fixed-k nearest neighbor method. For a fair comparison, for UCBogram and ABSE, we try different numbers of bins and only report the one with the best performance. The results are shown in Figure 1. In (a), (b), (c), and (d), the contexts follow the uniform distribution on [−1, 1], the standard Gaussian distribution, the t4 distribution, and the Cauchy distribution, respectively. The uniform distribution is an example of a distribution with bounded support.
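A minimal end-to-end simulation of the uniform-context setting of this subsection (contexts uniform on [−1, 1], two actions with reward functions $\eta_1(x) = x$ and $\eta_2(x) = -x$, as described below) can be run with the fixed-k kNN-UCB rule of Section 5. The horizon, k, noise level, and the constant UCB bonus b are illustrative choices, not the tuned values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k, sigma, L, b = 1000, 10, 0.1, 1.0, 0.3    # illustrative constants

etas = [lambda x: x, lambda x: -x]             # expected rewards of the 2 arms
hist = [([], []) for _ in range(2)]            # per-arm (contexts, rewards)
regret = 0.0

for t in range(T):
    x = rng.uniform(-1.0, 1.0)
    ucbs = []
    for a in range(2):
        cx, cy = hist[a]
        if len(cy) < k:
            ucbs.append(np.inf)                # Eq. (11): force exploration
        else:
            d = np.abs(np.array(cx) - x)
            idx = np.argsort(d)[:k]            # k nearest past contexts
            ucbs.append(np.array(cy)[idx].mean() + b + L * d[idx].max())
    a = int(np.argmax(ucbs))                   # Eq. (12)
    y = etas[a](x) + sigma * rng.normal()      # reward model, Eq. (1)
    hist[a][0].append(x); hist[a][1].append(y)
    regret += max(etas[0](x), etas[1](x)) - etas[a](x)
```

For reference, a uniformly random policy incurs an expected regret of about 0.5 per step on this problem, so its cumulative regret would be near 500; the kNN-UCB loop above stays far below that.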
The Gaussian, t4, and Cauchy distributions satisfy the tail assumption (Assumption 3(a)) with β = 1, 0.8, and 0.5, respectively. In each experiment, there are two actions. For the uniform and Gaussian distributions, we use $\eta_1(x) = x$ and $\eta_2(x) = -x$.

²The analysis in (Zhao & Lai, 2021b) is under a general smoothness assumption with parameter p; p = 1 corresponds to the Lipschitz assumption (Assumption 1(c) in this paper). Therefore, here we state the bounds of (Zhao & Lai, 2021b) with p = 1.

For the t4 and Cauchy distributions, since they are heavy-tailed, to ensure that Assumption 3(b) is satisfied, we do not use linear reward functions. Instead, we let $\eta_1(x) = \sin(x)$ and $\eta_2(x) = \cos(x)$. To make the comparison more reliable, the values in each curve in Figure 1 are averaged over m = 100 random and independent trials.

Figure 1. Comparison of cumulative regrets of different methods for one dimensional distributions: (a) uniform, (b) Gaussian, (c) t4, (d) Cauchy.

We then run experiments on two dimensional distributions. In these experiments, the context distributions are Cartesian products of two one dimensional distributions. The two dimensional Gaussian distribution still satisfies Assumption 3(a) with β = 1, while the two dimensional t4 and Cauchy distributions satisfy Assumption 3(a) with β = 2/3 and 1/3, respectively, which are lower than in the one dimensional case. The results are shown in Figure 2. From these experiments, it can be observed that the adaptive nearest neighbor method significantly outperforms the other baselines.

Figure 2. Comparison of cumulative regrets of different methods for two dimensional distributions: (a) uniform, (b) Gaussian, (c) t4, (d) Cauchy.

7.2.
Real Data

Now we run experiments using the MNIST dataset (LeCun, 1998), which contains 60,000 images of handwritten digits of size $28 \times 28$. Following the settings in (Guan & Jiang, 2018), the images are regarded as contexts, and there are 10 actions, corresponding to the digits 0 to 9. The reward is 1 if the selected action equals the true label, and 0 otherwise. The results are shown in Figure 3. Image data have high extrinsic dimensionality but low intrinsic dimensionality. Compared with bin-splitting based methods (Rigollet & Zeevi, 2010; Perchet & Rigollet, 2013), nearest neighbor methods adapt better to the local intrinsic dimension (Kpotufe, 2011). Therefore, in this experiment, we do not compare with the bin-splitting based methods.

[Figure 2: Comparison of cumulative regrets of different methods for two dimensional distributions. Panels (a)-(d) plot cumulative regret against steps (0 to 2,000) for UCBogram, ABSE, the fixed-$k$ method ($k = 10$ for uniform and Gaussian, $k = 5$ for $t_4$ and Cauchy), and the adaptive method.]

[Figure 3: Cumulative regrets for the MNIST dataset, comparing the fixed-$k$ method with $k = 3, 5, 7$ against the adaptive method over 5,000 steps.]

From Figure 3, the adaptive $k$-NN method performs better than the fixed-$k$ method for all tested values of $k$.

8. Conclusion

This paper analyzes the contextual bandit problem when the context distribution is allowed to be heavy-tailed. First, we derived the minimax lower bound of the expected cumulative regret. We then showed that the expected cumulative regret of the fixed-$k$ nearest neighbor method is suboptimal compared with this lower bound.
To close the gap, we have proposed an adaptive nearest neighbor approach, which significantly improves the performance: its expected regret matches the minimax lower bound up to logarithm factors. Finally, we have conducted numerical experiments to validate our results. In the future, this work can be extended in the following ways. Firstly, following the existing analysis in (Gur et al., 2022), it may be meaningful to design a smoothness-adaptive method that can handle any Hölder smoothness parameter. Secondly, it is worth extending the current work to handle dynamic regret. Finally, the theories developed in this paper can be extended to more complicated tasks, such as reinforcement learning (Zhao & Lai, 2024).

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgements

This work is supported by STI 2030 Major Projects (No. 2021ZD0201405), Shenzhen Basic Research Project (Natural Science Foundation) Basic Research Key Project (No. JCYJ20241202124430041), the 2025 Open Project of the Key Laboratory of Blockchain Technology and Data Security of the Ministry of Industry and Information Technology, the Open Research Fund from Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) (No. GMLKF-24-23), National Natural Science Foundation of China (No. 62372120, 62102108), Guangdong Basic and Applied Basic Research Foundation (No. 2022A1515010061), and Science and Technology Projects in Guangzhou (No. 2025A03J3182).

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011.

Agrawal, R. Sample mean based index policies by O(log n) regret for the multi-armed bandit problem.
Advances in Applied Probability, 27(4):1054–1078, 1995.

Akhavan, A., Lounici, K., Pontil, M., and Tsybakov, A. B. Contextual continuum bandits: Static versus dynamic regret. arXiv preprint arXiv:2406.05714, 2024.

Audibert, J.-Y. and Tsybakov, A. B. Fast learning rates for plug-in classifiers. The Annals of Statistics, pp. 608–633, 2007.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.

Bastani, H. and Bayati, M. Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294, 2020.

Bastani, H., Bayati, M., and Khosravi, K. Mostly exploration-free algorithms for contextual bandits. Management Science, 67(3):1329–1349, 2021.

Blanchard, M., Hanneke, S., and Jaillet, P. Adversarial rewards in universal learning for contextual bandits. arXiv preprint arXiv:2302.07186, 2023.

Bouneffouf, D., Rish, I., and Aggarwal, C. Survey on applications of multi-armed and contextual bandits. In 2020 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8, 2020.

Cai, C., Cai, T. T., and Li, H. Transfer learning for contextual multi-armed bandits. The Annals of Statistics, 52(1):207–232, 2024.

Cannings, T. I., Berrett, T. B., and Samworth, R. J. Local nearest neighbour classification with applications to semi-supervised learning. The Annals of Statistics, 48(3):1789–1814, 2020.

Chaudhuri, K. and Dasgupta, S. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, volume 27, 2014.

Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214, 2011.

Döring, M., Györfi, L., and Walk, H. Rate of convergence of k-nearest-neighbor classification rule. Journal of Machine Learning Research, 18(227):1–16, 2018.
Dudik, M., Hsu, D., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., and Zhang, T. Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369, 2011.

Durand, A., Achilleos, C., Iacovides, D., Strati, K., Mitsis, G. D., and Pineau, J. Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis. In Machine Learning for Healthcare Conference, pp. 67–82, 2018.

Fedotov, A. A., Harremoës, P., and Topsøe, F. Refinements of Pinsker's inequality. IEEE Transactions on Information Theory, 49(6):1491–1498, 2003.

Gadat, S., Klein, T., and Marteau, C. Classification in general finite dimensional spaces with the k-nearest neighbor rule. The Annals of Statistics, pp. 982–1009, 2016.

Gao, W., Oh, S., and Viswanath, P. Demystifying fixed k-nearest neighbor information estimators. IEEE Transactions on Information Theory, 64(8):5629–5661, 2018.

Garivier, A. and Cappé, O. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 359–376, 2011.

Ghosh, A., Sankararaman, A., Ramchandran, K., Javidi, T., and Mazumdar, A. Competing bandits in non-stationary matching markets. IEEE Transactions on Information Theory, 2024.

Guan, M. and Jiang, H. Nonparametric stochastic contextual bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Gur, Y., Momeni, A., and Wager, S. Smoothness-adaptive contextual bandits. Operations Research, 70(6):3198–3216, 2022.

Hu, Y., Kallus, N., and Mao, X. Smooth contextual bandits: Bridging the parametric and non-differentiable regret regimes. In Conference on Learning Theory, pp. 2007–2010, 2020.

Huo, M., Lu, K., Li, Y., and Zhu, Q. CT-PatchTST: Channel-time patch time-series transformer for long-term renewable energy forecasting. arXiv preprint arXiv:2501.08620, 2025a. URL https://arxiv.org/abs/2501.08620.
Huo, M., Lu, K., Zhu, Q., and Chen, Z. Enhancing customer contact efficiency with graph neural networks in credit card fraud detection workflow. arXiv preprint arXiv:2504.02275, 2025b. URL https://arxiv.org/abs/2504.02275.

Jiang, H. Non-asymptotic uniform rates of consistency for k-NN regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3999–4006, 2019.

Komiyama, J., Fouché, E., and Honda, J. Finite-time analysis of globally nonstationary multi-armed bandits. Journal of Machine Learning Research, 25(112):1–56, 2024.

Kpotufe, S. k-NN regression adapts to local intrinsic dimension. In Advances in Neural Information Processing Systems, pp. 729–737, 2011.

Krishnamurthy, A., Langford, J., Slivkins, A., and Zhang, C. Contextual bandits with continuous actions: Smoothing, zooming, and adapting. Journal of Machine Learning Research, 21(137):1–45, 2020.

Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Langford, J. and Zhang, T. The epoch-greedy algorithm for multi-armed bandits with side information. Advances in Neural Information Processing Systems, 20, 2007.

LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670, 2010.

Locatelli, A. and Carpentier, A. Adaptivity to smoothness in X-armed bandits. In Conference on Learning Theory, pp. 1463–1492, 2018.

Mai, V. V. and Johansson, M. Stability and convergence of stochastic gradient clipping: Beyond Lipschitz continuity and smoothness. In International Conference on Machine Learning, pp. 7325–7335, 2021.

Misra, K., Schwartz, E. M., and Abernethy, J. Dynamic online pricing with incomplete information using multiarmed bandit experiments.
Marketing Science, 38(2):226–252, 2019.

Perchet, V. and Rigollet, P. The multi-armed bandit problem with covariates. The Annals of Statistics, 41(2):693–721, 2013.

Qian, W. and Yang, Y. Kernel estimation and model combination in a bandit problem with covariates. Journal of Machine Learning Research, 17(149):1–37, 2016.

Qian, W., Ing, C.-K., and Liu, J. Adaptive algorithm for multi-armed bandit problem with high-dimensional covariates. Journal of the American Statistical Association, pp. 1–13, 2023.

Reeve, H., Mellor, J., and Brown, G. The k-nearest neighbour UCB algorithm for multi-armed bandits with covariates. In Algorithmic Learning Theory, pp. 725–752, 2018.

Rigollet, P. and Zeevi, A. Nonparametric bandits with covariates. In 23rd Annual Conference on Learning Theory, pp. 54, 2010.

Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

Slivkins, A. Contextual bandits with similarity information. Journal of Machine Learning Research, 15:2533–2568, 2014.

Suk, J. Adaptive smooth non-stationary bandits. arXiv preprint arXiv:2407.08654, 2024.

Suk, J. and Kpotufe, S. Tracking most significant shifts in nonparametric contextual bandits. Advances in Neural Information Processing Systems, 36:6202–6241, 2023.

Tsybakov, A. B. Introduction to Nonparametric Estimation. 2009.

Wanigasekara, N. and Yu, C. L. Nonparametric contextual bandits in an unknown metric space. Advances in Neural Information Processing Systems, 32:14684–14694, 2019.

Woodroofe, M. A one-armed bandit problem with a concomitant variable. Journal of the American Statistical Association, 74(368):799–806, 1979.

Yang, Y. and Zhu, D. Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. The Annals of Statistics, 30(1):100–121, 2002.

Zangerle, E. and Bauer, C. Evaluating recommender systems: survey and framework.
ACM Computing Surveys, 55(8):1–38, 2022.

Zhang, Q., Deng, Z., Chen, Z., Hu, H., and Yang, Y. Stochastic continuous submodular maximization: Boosting via non-oblivious function. In International Conference on Machine Learning, pp. 26116–26134. PMLR, 2022.

Zhang, Q., Deng, Z., Chen, Z., Zhou, K., Hu, H., and Yang, Y. Online learning for non-monotone DR-submodular maximization: From full information to bandit feedback. In International Conference on Artificial Intelligence and Statistics, pp. 3515–3537. PMLR, 2023.

Zhao, P. and Lai, L. Efficient classification with adaptive KNN. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 11007–11014, 2021a.

Zhao, P. and Lai, L. Minimax rate optimal adaptive nearest neighbor classification and regression. IEEE Transactions on Information Theory, 67(5):3155–3182, 2021b.

Zhao, P. and Lai, L. Analysis of KNN density estimation. IEEE Transactions on Information Theory, 68(12):7971–7995, 2022.

Zhao, P. and Lai, L. Minimax optimal Q-learning with nearest neighbors. IEEE Transactions on Information Theory, 2024.

Zhao, P. and Wan, Z. Robust nonparametric regression under poisoning attack. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 17007–17015, 2024.

Zhou, Q., Zhang, X., Xu, J., and Liang, B. Large-scale bandit approaches for recommender systems. In Neural Information Processing: 24th International Conference, ICONIP 2017, Guangzhou, China, November 14–18, 2017, Proceedings, Part I, pp. 811–821. Springer, 2017.

Zhu, Y., Foster, D. J., Langford, J., and Mineiro, P. Contextual bandits with large action spaces: Made practical. In International Conference on Machine Learning, pp. 27428–27453. PMLR, 2022.

A. Examples of Heavy-tailed Distributions

This section explains Examples 1 and 2 in the paper. For Example 1,
$$P(f(X) < t) = \int_{\mathcal{X}} f(x)\mathbf{1}(f(x) < t)dx \le tV(\mathcal{X}), \quad (21)$$
in which $V(\mathcal{X})$ denotes the volume of the support.
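As a concrete numerical illustration of the tail condition (our own check, not part of the paper): for the standard Cauchy distribution used in Section 7, $P(f(X) < t)$ has a closed form, and the ratio $P(f(X) < t)/t^\beta$ stays bounded for $\beta = 1/2$, consistent with the limit $\beta < p/(p + d)$ for $d = 1$ and $p$ close to 1.

```python
# Illustration (ours, not from the paper) of Assumption 3(a) for the standard
# Cauchy distribution: f(x) = 1/(pi*(1+x^2)), for which the tail probability
# P(f(X) < t) = 1 - (2/pi)*atan(sqrt(1/(pi*t) - 1)) is available in closed
# form, and P(f(X) < t)/t^beta stays bounded for beta = 1/2.
import math

def tail_prob(t: float) -> float:
    """P(f(X) < t) for X standard Cauchy, valid for 0 < t < 1/pi."""
    s = math.sqrt(1.0 / (math.pi * t) - 1.0)
    return 1.0 - (2.0 / math.pi) * math.atan(s)

beta = 0.5
ratios = [tail_prob(t) / t**beta for t in (1e-2, 1e-4, 1e-6, 1e-8)]
print([round(r, 4) for r in ratios])
```

The ratio stabilizes near $2/\sqrt{\pi}$, so a finite constant $C_\beta$ works for $\beta = 1/2$, while any $\beta > 1/2$ would make the ratio diverge as $t \rightarrow 0$.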
For Example 2, from Hölder's inequality,
$$\int f^{1-\beta}(x)dx = \int \left[f^{1-\beta}(x)(1+\|x\|^\gamma)\right](1+\|x\|^\gamma)^{-1}dx \le \left(\int f(x)(1+\|x\|^\gamma)^{\frac{1}{1-\beta}}dx\right)^{1-\beta}\left(\int (1+\|x\|^\gamma)^{-\frac{1}{\beta}}dx\right)^{\beta}. \quad (22)$$
Let $\gamma = p(1-\beta)$. Then $\left(\int f(x)(1+\|x\|^\gamma)^{\frac{1}{1-\beta}}dx\right)^{1-\beta} < \infty$ since $\mathbb{E}[\|X\|^p] < \infty$. If $\beta < p/(p+d)$, then $\gamma/\beta > d$, thus $\left(\int (1+\|x\|^\gamma)^{-\frac{1}{\beta}}dx\right)^{\beta} < \infty$. Hence $\int f^{1-\beta}(x)dx < \infty$, and
$$P(f(X) < t) = P(f^{-\beta}(X) > t^{-\beta}) \le t^{\beta}\mathbb{E}[f^{-\beta}(X)] = t^{\beta}\int f^{1-\beta}(x)dx. \quad (23)$$
Therefore, for all $\beta < p/(p+d)$, Assumption 3(a) holds with some finite $C_\beta$. For subgaussian or subexponential random variables, $\mathbb{E}[\|X\|^p] < \infty$ holds for any $p$, thus Assumption 3(a) holds for $\beta$ arbitrarily close to 1.

B. Expected Sample Density

In this section, we define the expected sample density, which is then used in the later analysis. Throughout this section, denote $x(j)$ as the value of the $j$-th component of a vector $x$.

Definition 1. (expected sample density) $q_a: \mathcal{X} \rightarrow \mathbb{R}$ is defined as the function such that for all $S \subseteq \mathcal{X}$,
$$\mathbb{E}\left[\sum_{t=1}^T \mathbf{1}(X_t \in S, A_t = a)\right] = \int_S q_a(x)dx. \quad (24)$$
To show the existence of $q_a$, define
$$Q_a(x) = \mathbb{E}\left[\sum_{t=1}^T \mathbf{1}(X_t \in \{u\,|\,u(1) \le x(1), \ldots, u(d) \le x(d)\}, A_t = a)\right], \quad (25)$$
$$q_a(x) = \frac{\partial^d Q_a}{\partial x(1)\cdots\partial x(d)}, \quad (26)$$
and then (24) is satisfied for all $S \subseteq \mathcal{X}$. We then show the following basic lemmas.

Lemma 1. Regardless of $\eta$, $q_a$ satisfies
$$q_a(x) \le Tf(x) \quad (27)$$
for almost all $x \in \mathcal{X}$.

Proof. Note that $P(X_t \in S) = \int_S f(x)dx$. Therefore, for all sets $S$,
$$\mathbb{E}\left[\sum_{t=1}^T \mathbf{1}(X_t \in S, A_t = a)\right] \le \sum_{t=1}^T P(X_t \in S) = T\int_S f(x)dx. \quad (28)$$
From (24) and (28), $\int_S q_a(x)dx \le T\int_S f(x)dx$ for all $S$. Therefore $q_a(x) \le Tf(x)$ for almost all $x \in \mathcal{X}$.

Lemma 2. $R = \sum_{a \in \mathcal{A}} R_a$, in which $R_a$ is defined as
$$R_a = \int_{\mathcal{X}} (\eta^*(x) - \eta_a(x))q_a(x)dx. \quad (29)$$
Proof.
$$R = \mathbb{E}\left[\sum_{t=1}^T (\eta^*(X_t) - \eta_{A_t}(X_t))\right] = \sum_{a \in \mathcal{A}} \mathbb{E}\left[\sum_{t=1}^T (\eta^*(X_t) - \eta_a(X_t))\mathbf{1}(A_t = a)\right] = \sum_{a \in \mathcal{A}} \int_{\mathcal{X}} (\eta^*(x) - \eta_a(x))q_a(x)dx. \quad (30)$$
The proof is complete.

C. Proof of Theorem 1

Recall that
$$R = \mathbb{E}\left[\sum_{t=1}^T (\eta^*(X_t) - \eta_{A_t}(X_t))\right]. \quad (31)$$
Now we define
$$S = \mathbb{E}\left[\sum_{t=1}^T \mathbf{1}(\eta_{A_t}(X_t) < \eta^*(X_t))\right] \quad (32)$$
as the expected number of steps with suboptimal actions. The following lemma characterizes the relationship between $S$ and $R$.

Lemma 3. There exists a constant $C_0$, such that
$$R \ge C_0T^{-\frac{1}{\alpha}}S^{\frac{\alpha+1}{\alpha}}. \quad (33)$$
Proof. The proof of Lemma 3 follows the proof of Lemma 3.1 in (Rigollet & Zeevi, 2010).
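Lemma 3 is the bridge from the number of suboptimal pulls $S$ to the regret $R$ in both lower bound proofs. The following check is ours, assuming Lemma 3 takes the form $R \ge C_0T^{-1/\alpha}S^{(\alpha+1)/\alpha}$ (consistent with Lemma 3.1 of Rigollet & Zeevi, 2010): plugging in $S \sim T^{1-\alpha\beta/(\alpha+\beta(d+2))}$ from Appendix D must reproduce the regret exponent $1 - (\alpha+1)\beta/(\alpha+\beta(d+2))$.

```python
# Symbol-free check (ours) of the exponent algebra: plugging the bound
# S ~ T^(1 - alpha*beta/(alpha+beta*(d+2))) into the Lemma 3 relation
# R >= C0 * T^(-1/alpha) * S^((alpha+1)/alpha) must give the regret exponent
# 1 - (alpha+1)*beta/(alpha+beta*(d+2)) claimed in (68).

def regret_exp_via_lemma3(alpha, beta, d):
    s_exp = 1 - alpha * beta / (alpha + beta * (d + 2))   # exponent of S, eq. (67)
    return s_exp * (alpha + 1) / alpha - 1 / alpha        # apply Lemma 3

def regret_exp_claimed(alpha, beta, d):
    return 1 - (alpha + 1) * beta / (alpha + beta * (d + 2))  # eq. (68)

for alpha in (0.5, 1.0, 2.0):
    for beta in (0.25, 0.5, 1.0):
        for d in (1, 2, 5):
            a = regret_exp_via_lemma3(alpha, beta, d)
            b = regret_exp_claimed(alpha, beta, d)
            assert abs(a - b) < 1e-12, (alpha, beta, d)
print("Lemma 3 transformation matches the claimed exponent on all test points")
```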
For completeness and consistency of notations, we show the proof in Appendix I.10.

From now on, we only discuss the case with only two actions, such that $\mathcal{A} = \{1, -1\}$. Construct $B$ disjoint balls with centers $c_1, \ldots, c_B$ and radius $h$. Let
$$f(x) = \sum_{j=1}^B \mathbf{1}(x \in B_j), \quad (34)$$
in which $B_j = \{x\,|\,\|x - c_j\| \le h\}$ is the $j$-th ball. To ensure that the pdf defined above is normalized (i.e. $\int f(x)dx = 1$), $B$ and $h$ need to satisfy
$$Bv_dh^d = 1, \quad (35)$$
in which $v_d$ is the volume of the $d$ dimensional unit ball. Let $\eta_1(x) = \eta(x)$ and $\eta_2(x) = 0$, with
$$\eta(x) = \sum_{j=1}^K v_jh\mathbf{1}(x \in B(c_j, h)), \quad (36)$$
in which $v_j \in \{-1, 1\}$ for $j = 1, \ldots, K$. To satisfy the margin assumption (Assumption 1(a)), note that
$$P(0 < |\eta(X)| \le t) = \begin{cases} Kv_dh^d & \text{if } t \ge h, \\ 0 & \text{if } t < h. \end{cases} \quad (37)$$
Note that for any suboptimal action $a$, $\eta^*(x) - \eta_a(x) = |\eta(x)|$. Assumption 1(a) requires that $P(0 < |\eta(X)| \le t) \le C_\alpha t^\alpha$. Therefore, it suffices to ensure that
$$Kv_dh^d = C_\alpha h^\alpha. \quad (38)$$
Then
$$S \ge \sum_{t=1}^T \sum_{j=1}^K P(X_t \in B_j, A_t \ne a^*(X_t)) = \sum_{t=1}^T \sum_{j=1}^K \int_{B_j} f(x)P(A_t \ne a^*(x)\,|\,X_t = x)dx = \sum_{t=1}^T \sum_{j=1}^K \int_{B_j} f(x)P(A_t \ne v_j\,|\,X_t = x)dx. \quad (39)$$
Denote $\pi(x; X_{1:t-1}, Y_{1:t-1})$ as the action taken by the policy at context $x$ given the past trajectory, and define
$$\hat{v}_j(t) = \mathrm{sign}\left(\int_{B_j} f(x)\pi(x; X_{1:t-1}, Y_{1:t-1})dx\right). \quad (40)$$
By the definition of $\hat{v}_j(t)$,
$$\int_{B_j} f(x)\mathbf{1}(\pi(x; X_{1:t-1}, Y_{1:t-1}) \ne \hat{v}_j(t))dx \le \int_{B_j} f(x)\mathbf{1}(\pi(x; X_{1:t-1}, Y_{1:t-1}) = \hat{v}_j(t))dx. \quad (41)$$
Since
$$\int_{B_j} f(x)\mathbf{1}(\pi \ne \hat{v}_j(t))dx + \int_{B_j} f(x)\mathbf{1}(\pi = \hat{v}_j(t))dx = \int_{B_j} f(x)dx, \quad (42)$$
we have
$$\int_{B_j} f(x)\mathbf{1}(\pi = \hat{v}_j(t))dx \ge \frac{1}{2}\int_{B_j} f(x)dx. \quad (43)$$
If $\hat{v}_j(t) \ne v_j$, then
$$\int_{B_j} f(x)\mathbf{1}(\pi \ne v_j)dx \ge \int_{B_j} f(x)\mathbf{1}(\pi = \hat{v}_j(t))dx \ge \frac{1}{2}\int_{B_j} f(x)dx. \quad (44)$$
Therefore, from (39),
$$S \ge \sum_{t=1}^T \sum_{j=1}^K \frac{1}{2}P(\hat{v}_j(t) \ne v_j)\int_{B_j} f(x)dx = \sum_{t=1}^T \sum_{j=1}^K \frac{1}{2}v_dh^dP(\hat{v}_j(t) \ne v_j). \quad (45)$$
Note that the error probability of a hypothesis test between distributions $p$ and $q$ is at least $(1 - \mathrm{TV}(p, q))/2$, in which $\mathrm{TV}$ denotes the total variation distance. Let $(V_1, \ldots, V_K)$ be a vector of $K$ random variables taking values in $\{-1, 1\}^K$ uniformly at random. In other words, $P(V_j = 1) = P(V_j = -1) = 1/2$, and the $V_j$ for different $j$ are i.i.d. Denote $P_{XY|V_j=v_j}$ as the distribution of $X$ and $Y$ conditional on $V_j = v_j$.
Moreover, $P^{t-1}_{XY|V_j=v_j}$ denotes the distribution of the $X$ and $Y$ of the first $t-1$ samples conditional on $V_j = v_j$. Then
$$P(\hat{v}_j(t) \ne v_j) \ge \frac{1}{2}\left(1 - \mathrm{TV}\left(P^{t-1}_{XY|V_j=1}, P^{t-1}_{XY|V_j=-1}\right)\right) \ge \frac{1}{2}\left(1 - \sqrt{\frac{1}{2}D\left(P^{t-1}_{XY|V_j=1}\,\Big\|\,P^{t-1}_{XY|V_j=-1}\right)}\right), \quad (46)$$
in which the second step uses Pinsker's inequality (Fedotov et al., 2003), and $D(p\|q)$ denotes the Kullback-Leibler (KL) divergence between distributions $p$ and $q$. Note that the KL divergence between the conditional reward distributions is bounded by
$$D(P_{Y|X,V_j=1}\,\|\,P_{Y|X,V_j=-1}) \le \frac{1}{2}(\eta_1(x) - \eta_2(x))^2 \le \frac{1}{2}h^2 \quad (47)$$
for $x \in B_j$. Therefore
$$D(P_{XY|V_j=1}\,\|\,P_{XY|V_j=-1}) = \int f(x)D(P_{Y|X,V_j=1}\,\|\,P_{Y|X,V_j=-1})dx \le \frac{1}{2}v_dh^{d+2}. \quad (48)$$
Hence, from (46) and the chain rule of the KL divergence,
$$P(\hat{v}_j(t) \ne v_j) \ge \frac{1}{2}\left(1 - \sqrt{\frac{1}{4}(t-1)v_dh^{d+2}}\right) \ge \frac{1}{2}\left(1 - \frac{1}{2}\sqrt{Tv_dh^{d+2}}\right). \quad (49)$$
Recalling (45),
$$S \ge \frac{1}{4}KTv_dh^d\left(1 - \frac{1}{2}\sqrt{Tv_dh^{d+2}}\right) = \frac{1}{4}C_\alpha Th^\alpha\left(1 - \frac{1}{2}\sqrt{Tv_dh^{d+2}}\right), \quad (50)$$
in which the last step comes from (38). Let $h \sim T^{-\frac{1}{d+2}}$; then
$$S \gtrsim T^{1-\frac{\alpha}{d+2}}. \quad (51)$$
From Lemma 3,
$$R \gtrsim T^{1-\frac{\alpha+1}{d+2}}. \quad (52)$$

D. Proof of Theorem 2

In this section, we derive the minimax lower bound of the expected regret with unbounded support. Recall that in the case with bounded support (proof of Theorem 1 in Appendix C), we constructed disjoint balls with pdf $f(x) = 1$ for all $x \in \mathcal{X}$. With unbounded support, the context distribution has tails, on which the pdf $f(x)$ is small. Therefore, we modify the construction of the balls as follows. We now construct $K + 1$ disjoint balls with centers $c_0, c_1, \ldots, c_K$, such that
$$B_0 = \{x\,|\,\|x - c_0\| \le r_0\}, \quad (53)$$
$$B_j = \{x\,|\,\|x - c_j\| \le h\}, \quad j = 1, \ldots, K. \quad (54)$$
Let
$$\eta(x) = \sum_{j=1}^K v_jh\mathbf{1}(x \in B(c_j, h)), \quad (55)$$
in which $v_j \in \{0, 1\}$ is unknown, and
$$f(x) = \mathbf{1}(x \in B_0) + \sum_{j=1}^K m\mathbf{1}(x \in B_j), \quad (56)$$
in which $m \le 1$ will be determined later. Here we construct one ball $B_0$ that denotes the center region, which holds most of the probability mass, as well as $K$ balls that denote the tail region. For simplicity, we let $\eta(x) = 0$ on the largest ball $B_0$, and only let $\eta$ be nonzero on the small balls $B_1, \ldots, B_K$. To satisfy the margin condition (i.e. Assumption 1(a)), note that now
$$P(0 < |\eta(X)| \le t) \le \begin{cases} mKv_dh^d & \text{if } t \ge h, \\ 0 & \text{if } t < h. \end{cases} \quad (57)$$
The right hand side of (57) cannot exceed $C_\alpha t^\alpha$, which requires
$$mKv_dh^d \le C_\alpha h^\alpha. \quad (58)$$
Moreover, to satisfy the tail assumption (Assumption 3(a)), note that
$$P(f(X) < t) = \begin{cases} mKv_dh^d & \text{if } t > m, \\ 0 & \text{if } t \le m. \end{cases} \quad (59)$$
The right hand side of (59) cannot exceed $C_\beta t^\beta$, which requires
$$mKv_dh^d \le C_\beta m^\beta. \quad (60)$$
Following (45), $S$ can be lower bounded by
$$S \ge \sum_{t=1}^T \sum_{j=1}^K \frac{1}{2}mv_dh^dP(\hat{v}_j(t) \ne v_j) \gtrsim KTmv_dh^d\left(1 - \sqrt{Tmv_dh^{d+2}}\right). \quad (61)$$
From (61), we pick $m$ and $h$ to ensure that
$$Tmv_dh^{d+2} < 1. \quad (62)$$
Then, under the three conditions (58), (60) and (62),
$$S \gtrsim KTmh^d. \quad (63)$$
It remains to determine the values of $m$, $h$ and $K$ based on these three conditions. Let
$$h \sim (Tm)^{-\frac{1}{d+2}}, \quad (64)$$
$$m \sim T^{-\frac{\alpha}{\alpha+\beta(d+2)}}, \quad (65)$$
$$K \sim h^{\alpha-d}/m. \quad (66)$$
Then
$$S \gtrsim Th^\alpha \sim T^{1-\frac{\alpha\beta}{\alpha+\beta(d+2)}}. \quad (67)$$
Based on Lemma 3,
$$R \gtrsim T^{1-\frac{(\alpha+1)\beta}{\alpha+\beta(d+2)}}. \quad (68)$$
It remains to show that $R \gtrsim T^{1-\beta}$. Let $h \sim 1$, $K \sim T^{1-\beta}$ and $m \sim 1/T$; the conditions (58), (60) and (62) are still satisfied. In this case,
$$S \gtrsim T^{1-\beta}. \quad (69)$$
A direct transformation using Lemma 3 would yield a suboptimal bound. Intuitively, in the case with heavy tails (i.e. small $\beta$), the regret mainly occurs at the tail of the context distribution. Therefore, we bound the expected regret directly:
$$R = \mathbb{E}\left[\sum_{t=1}^T (\eta^*(X_t) - \eta_{A_t}(X_t))\right] \overset{(a)}{\ge} h\,\mathbb{E}\left[\sum_{t=1}^T \mathbf{1}(\eta_{A_t}(X_t) < \eta^*(X_t))\right] \overset{(b)}{=} S \gtrsim T^{1-\beta}. \quad (70)$$
(a) comes from the construction of $\eta$ in (55); (b) holds since we set $h = 1$ here. Combining (68) and (70),
$$R \gtrsim T^{1-\frac{(\alpha+1)\beta}{\alpha+\beta(d+2)}} + T^{1-\beta}. \quad (71)$$

E. Proof of Theorem 3

To begin with, we show the following lemma.

Lemma 4. For all $u > 0$,
$$P\left(\exists x \in \mathcal{X}, a \in \mathcal{A}, t \le T: \left|\sum_{i \in N_t(x,a)} W_i\right| > ku\right) \le dT^{2d}|\mathcal{A}|e^{-\frac{ku^2}{2\sigma^2}}. \quad (72)$$
Proof. The proof is shown in Appendix I.1.

From Lemma 4 and the definition of $b$ in (10),
$$P\left(\exists x, a, t: \frac{1}{k}\left|\sum_{i \in N_t(x,a)} W_i\right| > b\right) \le \frac{1}{T}. \quad (73)$$
Therefore, with probability $1 - 1/T$, for all $x \in \mathcal{X}$, $a \in \mathcal{A}$ and $t = 1, \ldots, T$, $\frac{1}{k}|\sum_{i \in N_t(x,a)} W_i| \le b$. Denote $E$ as the event that $\frac{1}{k}|\sum_{i \in N_t(x,a)} W_i| \le b$ for all $x, a, t$; then
$$P(E) \ge 1 - \frac{1}{T}. \quad (74)$$
Recall the calculation of the UCB in (9). Based on Lemma 4, we then show some properties of this UCB.

Lemma 5.
Under $E$, if $|\{i < t\,|\,A_i = a\}| \ge k$, then
$$\eta_a(x) \le \hat{\eta}_{a,t}(x) \le \eta_a(x) + 2b + 2L\rho_{a,t}(x). \quad (75)$$
Proof. The proof is shown in Appendix I.2.

We then bound the number of steps with suboptimal action $a$. Define
$$n(x, a, r) := \sum_{t=1}^T \mathbf{1}(\|X_t - x\| < r, A_t = a). \quad (76)$$
Then the following lemma holds.

Lemma 6. Under $E$, for any $x \in \mathcal{X}$ and $a \in \mathcal{A}$, if $\eta^*(x) - \eta_a(x) > 2b$, define
$$r_a(x) = \frac{\eta^*(x) - \eta_a(x) - 2b}{6L}; \quad (77)$$
then
$$n(x, a, r_a(x)) \le k. \quad (78)$$
Proof. The proof is shown in Appendix I.3.

From Lemma 6, the expectation of $n(x, a, r_a(x))$ can be bounded as follows:
$$\mathbb{E}[n(x, a, r_a(x))] = P(E)\mathbb{E}[n(x, a, r_a(x))\,|\,E] + P(E^c)\mathbb{E}[n(x, a, r_a(x))\,|\,E^c] \le k + 1, \quad (79)$$
in which the inequality holds since even if $E$ does not hold, $n(x, a, r_a(x))$ is no more than the total sample size $T$, while $P(E^c) \le 1/T$ by (74). From (79) and the definition of the expected sample density in (24),
$$\int_{B(x, r_a(x))} q_a(u)du \le k + 1. \quad (80)$$
This bounds the average value of $q_a$ over a neighborhood of $x$; however, it does not bound $q_a(x)$ directly. To bound $R_a$, we introduce a new random variable $Z$ with pdf
$$g(z) = \frac{1}{M_Z}\left[(\eta^*(z) - \eta_a(z)) \vee \epsilon\right]^{-d}, \quad (81)$$
in which $\epsilon = 4b$, with $b$ defined in (10), and $M_Z$ is the constant for normalization. We then bound $R_a$ defined in (29). $R_a$ can be split into two terms:
$$R_a = \int_{\mathcal{X}} (\eta^*(x) - \eta_a(x))q_a(x)\mathbf{1}(\eta^*(x) - \eta_a(x) > \epsilon)dx + \int_{\mathcal{X}} (\eta^*(x) - \eta_a(x))q_a(x)\mathbf{1}(\eta^*(x) - \eta_a(x) \le \epsilon)dx. \quad (82)$$
To begin with, we bound the first term in (82). We show the following lemma.

Lemma 7. There exists a constant $C_1$, such that
$$\int_{\mathcal{X}} (\eta^*(x) - \eta_a(x))q_a(x)\mathbf{1}(\eta^*(x) - \eta_a(x) > \epsilon)dx \le C_1M_Z\mathbb{E}\left[\int_{B(Z, r_a(Z))} q_a(u)(\eta^*(u) - \eta_a(u))du\right], \quad (83)$$
in which $\epsilon = 4b$.
Proof. The proof of Lemma 7 is shown in Appendix I.4.

Now we bound the right hand side of (83). We show the following lemma.

Lemma 8.
$$\mathbb{E}\left[\int_{B(Z, r_a(Z))} q_a(u)(\eta^*(u) - \eta_a(u))du\right] \lesssim \begin{cases} \frac{k}{M_Zc}\epsilon^{\alpha+1-d} & \text{if } d > \alpha + 1, \\ \frac{k}{M_Zc}\ln\frac{1}{\epsilon} & \text{if } d = \alpha + 1, \\ \frac{k}{M_Zc} & \text{if } d < \alpha + 1, \end{cases} \quad (84)$$
in which $c$ is the lower bound of the pdf of the contexts, which comes from Assumption 2.
Proof. The proof of Lemma 8 is shown in Appendix I.5.
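The three regimes of Lemma 8 come from integrals of the form $\int_\epsilon^1 t^{\alpha-d}dt$, which scale as $\epsilon^{\alpha+1-d}$, $\ln(1/\epsilon)$, or $O(1)$ depending on the sign of $d - (\alpha + 1)$. A quick numerical illustration (ours, with $\alpha = 1$ and assumed test values of $d$ and $\epsilon$):

```python
# Numeric illustration (ours) of the three regimes in Lemma 8: the epsilon
# dependence comes from integrals of the form int_eps^1 t^(alpha-d) dt.
import math

def integral(eps, alpha, d, n=100000):
    """Midpoint rule for int_eps^1 t^(alpha - d) dt."""
    h = (1 - eps) / n
    return sum((eps + (i + 0.5) * h) ** (alpha - d) * h for i in range(n))

alpha, eps = 1.0, 1e-3
ratios = {
    4: integral(eps, alpha, 4) / eps ** (alpha + 1 - 4),  # d > alpha+1: ~ eps^(alpha+1-d)
    2: integral(eps, alpha, 2) / math.log(1 / eps),       # d = alpha+1: ~ ln(1/eps)
    1: integral(eps, alpha, 1),                           # d < alpha+1: O(1)
}
print({d: round(r, 3) for d, r in ratios.items()})
```

Each ratio stays bounded as $\epsilon \rightarrow 0$, matching the corresponding case of (84).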
From Lemmas 7 and 8,
$$\int_{\mathcal{X}} (\eta^*(x) - \eta_a(x))q_a(x)\mathbf{1}(\eta^*(x) - \eta_a(x) > \epsilon)dx \lesssim \begin{cases} k\epsilon^{\alpha+1-d} & \text{if } d > \alpha + 1, \\ k\ln\frac{1}{\epsilon} & \text{if } d = \alpha + 1, \\ k & \text{if } d < \alpha + 1. \end{cases} \quad (85)$$
Now we bound the second term in (82). From Lemma 1, $q_a(x) \le Tf(x)$ for almost all $x \in \mathcal{X}$. Thus
$$\int (\eta^*(x) - \eta_a(x))q_a(x)\mathbf{1}(\eta^*(x) - \eta_a(x) \le \epsilon)dx \le T\epsilon P(\eta^*(X) - \eta_a(X) \le \epsilon) \le C_\alpha T\epsilon^{\alpha+1}. \quad (86)$$
Therefore, from (82), (85) and (86),
$$R_a \lesssim \begin{cases} T\epsilon^{\alpha+1} + k\epsilon^{\alpha+1-d} & \text{if } d > \alpha + 1, \\ T\epsilon^{\alpha+1} + k\ln\frac{1}{\epsilon} & \text{if } d = \alpha + 1, \\ T\epsilon^{\alpha+1} + k & \text{if } d < \alpha + 1. \end{cases} \quad (87)$$
Recall that
$$\epsilon = 4b \sim \sqrt{\frac{\ln(dT^{2d+2}|\mathcal{A}|)}{k}}. \quad (88)$$
If $d > \alpha + 1$, let $k \sim T^{\frac{2}{d+2}}$; then
$$\epsilon \sim T^{-\frac{1}{d+2}}\sqrt{\ln(dT^{2d+2}|\mathcal{A}|)}, \quad (89)$$
$$R_a \lesssim T^{1-\frac{\alpha+1}{d+2}}\ln^{\frac{\alpha+1}{2}}(dT^{2d+2}|\mathcal{A}|). \quad (90)$$
If $d \le \alpha + 1$, let $k \sim T^{\frac{2}{\alpha+3}}$; then
$$\epsilon \sim T^{-\frac{1}{\alpha+3}}\sqrt{\ln(dT^{2d+2}|\mathcal{A}|)}, \quad (91)$$
$$R_a \lesssim T^{\frac{2}{\alpha+3}}\ln^{\frac{\alpha+1}{2}}(dT^{2d+2}|\mathcal{A}|). \quad (92)$$
Theorem 3 can then be proved using $R = \sum_{a \in \mathcal{A}} R_a$ stated in Lemma 2.

F. Proof of Theorem 4

Recall the expression of the regret shown in Lemma 2. We decompose $R_a$ as follows:
$$R_a = \int_{\mathcal{X}} (\eta^*(x) - \eta_a(x))q_a(x)\mathbf{1}(\eta^*(x) - \eta_a(x) \le \epsilon)dx + \int_{\mathcal{X}} (\eta^*(x) - \eta_a(x))q_a(x)\mathbf{1}\left(\eta^*(x) - \eta_a(x) > \epsilon, f(x) > \frac{k}{T\epsilon^d}\right)dx + \int_{\mathcal{X}} (\eta^*(x) - \eta_a(x))q_a(x)\mathbf{1}\left(\eta^*(x) - \eta_a(x) > \epsilon, \frac{k}{T} < f(x) \le \frac{k}{T\epsilon^d}\right)dx + \int_{\mathcal{X}} (\eta^*(x) - \eta_a(x))q_a(x)\mathbf{1}\left(\eta^*(x) - \eta_a(x) > \epsilon, f(x) \le \frac{k}{T}\right)dx := I_1 + I_2 + I_3 + I_4, \quad (93)$$
in which $\epsilon$ is the same as in the proof of Theorem 3 in Appendix E, i.e. $\epsilon = 4b$.

Bound of $I_1$. From Lemma 1, $q_a(x) \le Tf(x)$ for almost all $x \in \mathcal{X}$. Hence
$$I_1 \le T\int_{\mathcal{X}} (\eta^*(x) - \eta_a(x))f(x)\mathbf{1}(\eta^*(x) - \eta_a(x) \le \epsilon)dx \le T\epsilon P(\eta^*(X) - \eta_a(X) \le \epsilon) \le C_\alpha T\epsilon^{1+\alpha}. \quad (94)$$
Bound of $I_2$. The regret over the high density region can be bounded similarly to the case in which the pdf is bounded away from zero. Following the proof of Theorem 3 in Appendix E, define
$$g(z) = \frac{1}{M_Z}\left[(\eta^*(z) - \eta_a(z)) \vee \epsilon\right]^{-d}\mathbf{1}\left(f(z) > \frac{k}{T\epsilon^d}\right). \quad (95)$$
Similar to Lemma 7,
$$I_2 \lesssim M_Z\mathbb{E}\left[\int_{B(Z, r_a(Z))} q_a(u)(\eta^*(u) - \eta_a(u))du\right]. \quad (96)$$
Similar to Lemma 8, now with $c$ replaced by $k/(T\epsilon^d)$,
$$\mathbb{E}\left[\int_{B(Z, r_a(Z))} q_a(u)(\eta^*(u) - \eta_a(u))du\right] \lesssim \begin{cases} \frac{T\epsilon^d}{M_Z}\epsilon^{\alpha+1-d} & \text{if } d > \alpha + 1, \\ \frac{T\epsilon^d}{M_Z}\ln\frac{1}{\epsilon} & \text{if } d = \alpha + 1, \\ \frac{T\epsilon^d}{M_Z} & \text{if } d < \alpha + 1. \end{cases} \quad (97)$$
Therefore
$$I_2 \lesssim \begin{cases} T\epsilon^{\alpha+1} & \text{if } d > \alpha + 1, \\ T\epsilon^d\ln\frac{1}{\epsilon} & \text{if } d = \alpha + 1, \\ T\epsilon^d & \text{if } d < \alpha + 1. \end{cases} \quad (98)$$
Bound of $I_3$. Here we introduce the following lemma.

Lemma 9.
(Restated from Lemma 6 in (Zhao & Lai, 2021b)) For any $0 < a < b$,
$$\mathbb{E}[f^{-p}(X)\mathbf{1}(a \le f(X) < b)] \lesssim \begin{cases} a^{\beta-p} & \text{if } p > \beta, \\ \ln\frac{b}{a} & \text{if } p = \beta, \\ b^{\beta-p} & \text{if } p < \beta. \end{cases} \quad (99)$$
Proof. The proof of Lemma 9 can follow that of Lemma 6 in (Zhao & Lai, 2021b). For completeness, we show the proof in Appendix I.9.

Based on Lemma 9, $I_3$ can be bounded by
$$I_3 \le T\int (\eta^*(x) - \eta_a(x))f(x)\mathbf{1}\left(\eta^*(x) - \eta_a(x) > \epsilon, \frac{k}{T} < f(x) \le \frac{k}{T\epsilon^d}\right)dx \lesssim \begin{cases} T^{1-\beta}k^\beta & \text{if } \beta \le \frac{1}{d}, \\ T^{1-\beta}k^\beta\epsilon^{1-d\beta} & \text{if } \beta > \frac{1}{d}. \end{cases} \quad (100)$$
Bound of $I_4$. Since $\eta^*(x) - \eta_a(x) \le M$,
$$I_4 \le TM\int f(x)\mathbf{1}\left(f(x) \le \frac{k}{T}\right)dx \lesssim T^{1-\beta}k^\beta.$$
Now we bound $R_a$ by selecting $k$ to minimize the sum of $I_1, I_2, I_3, I_4$. Recall that $\epsilon = 4b$, with $b$ defined in (10); thus $\epsilon \sim \sqrt{\ln(dT^{2d+2}|\mathcal{A}|)/k}$.

(1) If $d > \alpha + 1$ and $\beta > 1/d$, then with $k \sim T^{\frac{2\beta}{\alpha+(d+2)\beta}}$,
$$R_a \lesssim T^{1-\frac{\beta(\alpha+1)}{\alpha+(d+2)\beta}}\ln^{\frac{\alpha+1}{2}}(dT^{2d+2}|\mathcal{A}|). \quad (101)$$
(2) If $d \le \alpha + 1$ and $\beta > 1/d$, then with $k \sim T^{\frac{2\beta}{d-1+\beta(d+2)}}$,
$$R_a \lesssim T^{1-\frac{\beta d}{d-1+(d+2)\beta}}\ln^{\frac{d}{2}}(dT^{2d+2}|\mathcal{A}|). \quad (102)$$
(3) If $d > \alpha + 1$ and $\beta \le 1/d$, then with $k \sim T^{\frac{2\beta}{1+\alpha+2\beta}}$,
$$R_a \lesssim T\epsilon^{1+\alpha} + T^{1-\beta}k^\beta \lesssim T^{1-\frac{\beta(\alpha+1)}{\alpha+1+2\beta}}\ln^{\frac{\alpha+1}{2}}(dT^{2d+2}|\mathcal{A}|). \quad (103)$$
(4) If $d \le \alpha + 1$ and $\beta \le 1/d$, then with $k \sim T^{\frac{2\beta}{d+2\beta}}$,
$$R_a \lesssim T\epsilon^d + T^{1-\beta}k^\beta \lesssim T^{1-\frac{\beta d}{d+2\beta}}\ln^{\frac{d}{2}}(dT^{2d+2}|\mathcal{A}|). \quad (104)$$
Combining all these cases, we conclude that
$$R \lesssim T^{1-\frac{\beta\min(d,\alpha+1)}{\min(d-1,\alpha)+\max(1,d\beta)+2\beta}}|\mathcal{A}|\ln^{\frac{1}{2}\max(d,\alpha+1)}(dT^{2d+2}|\mathcal{A}|). \quad (105)$$
The proof of Theorem 4 is complete.

G. Proof of Theorem 5

To begin with, similar to Lemma 4 and Lemma 5, we show the following lemmas.

Lemma 10. For all $u > 0$,
$$P\left(\exists x, a, k, t: \left|\sum_{i \in N_{t,k}(x,a)} W_i\right| > ku\right) \le dT^{2d+1}|\mathcal{A}|e^{-\frac{ku^2}{2\sigma^2}}, \quad (106)$$
with $N_{t,k}(x, a)$ being the set of the $k$ nearest neighbors of $x$ among $\{X_i\,|\,i < t, A_i = a\}$.
Proof. From Lemma 4, for each fixed $k$,
$$P\left(\exists x, a, t: \left|\sum_{i \in N_{t,k}(x,a)} W_i\right| > ku\right) \le dT^{2d}|\mathcal{A}|e^{-\frac{ku^2}{2\sigma^2}}. \quad (107)$$
Lemma 10 can then be proved by taking a union bound over all $k$.

Lemma 11. Define the event $E$ such that
$$\left|\sum_{i \in N_{t,k}(x,a)} W_i\right| \le \sqrt{2\sigma^2k\ln(dT^{2d+3}|\mathcal{A}|)} \quad (108)$$
for all $x, a, k, t$; then $P(E) \ge 1 - 1/T$. Moreover, under $E$,
$$\eta_a(x) \le \hat{\eta}_{a,t}(x) \le \eta_a(x) + 2b_{a,t}(x) + 2L\rho_{a,t}(x). \quad (109)$$
Proof. The proof is similar to the proof of Lemma 5.
Under $E$,
$$|\hat{\eta}_{a,t}(x) - (\eta_a(x) + b_{a,t}(x) + L\rho_{a,t}(x))| \le \frac{1}{k_{a,t}(x)}\left|\sum_{i \in N_t(x,a)} (Y_i - \eta_a(X_i))\right| + \frac{1}{k_{a,t}(x)}\sum_{i \in N_t(x,a)} |\eta_a(X_i) - \eta_a(x)| \le b_{a,t}(x) + L\rho_{a,t}(x), \quad (110)$$
which yields (109).

With these preparations, we then bound the number of steps around each $x$ in the next lemma, which is crucially different from Lemma 6. Here we keep the definition $n(x, a, r) := \sum_{t=1}^T \mathbf{1}(\|X_t - x\| < r, A_t = a)$ the same as in (76), but change the definitions of $r_a$ and $n_a$ as follows.

Lemma 12. Define
$$r_a(x) = \frac{\eta^*(x) - \eta_a(x)}{2LC_1}, \quad (111)$$
$$n_a(x) = \frac{C_1\ln T}{(\eta^*(x) - \eta_a(x))^2}, \quad (112)$$
$$C_1 = \max\{4, 32\sigma^2(2d + 3 + \ln(d|\mathcal{A}|))\}. \quad (113)$$
Then under $E$,
$$n(x, a, r_a(x)) \le n_a(x). \quad (114)$$
Proof. The proof of Lemma 12 is shown in Appendix I.6.

From Lemma 12,
$$\mathbb{E}[n(x, a, r_a(x))] \le P(E)\mathbb{E}[n(x, a, r_a(x))\,|\,E] + P(E^c)\mathbb{E}[n(x, a, r_a(x))\,|\,E^c] \le n_a(x) + 1. \quad (115)$$
From the definition of $q_a$ in (24),
$$\int_{B(x, r_a(x))} q_a(u)du \le n_a(x) + 1. \quad (116)$$
Now we bound $R_a$. Similar to (81), let the random variable $Z$ follow a distribution with pdf $g$:
$$g(z) = \frac{1}{M_Z}\left[(\eta^*(z) - \eta_a(z)) \vee \epsilon\right]^{-d}. \quad (117)$$
The difference from the fixed-$k$ case is that in (81), $\epsilon = 4b$; however, $b_{a,t}(x)$ now varies with $x$, so we do not determine $\epsilon$ based on $b$. Instead, for the adaptive nearest neighbor method, $\epsilon$ will be determined after we get the final bound of $R_a$. We show the following lemma.

Lemma 13. There exists a constant $C_2$, such that
$$\int_{\mathcal{X}} (\eta^*(x) - \eta_a(x))q_a(x)\mathbf{1}(\eta^*(x) - \eta_a(x) > \epsilon)dx \le C_2M_Z\mathbb{E}\left[\int_{B(Z, r_a(Z))} q_a(u)(\eta^*(u) - \eta_a(u))du\right]. \quad (118)$$
Proof. The proof of Lemma 13 is shown in Appendix I.7.

We then bound the right hand side of (118).

Lemma 14.
$$M_Z\mathbb{E}\left[\int_{B(Z, r_a(Z))} q_a(u)(\eta^*(u) - \eta_a(u))du\right] \lesssim \epsilon^{\alpha-d-1}\ln T + T\epsilon^{1+\alpha}. \quad (119)$$
Proof. The proof of Lemma 14 is shown in Appendix I.8.

From Lemmas 13 and 14,
$$\int_{\mathcal{X}} (\eta^*(x) - \eta_a(x))q_a(x)\mathbf{1}(\eta^*(x) - \eta_a(x) > \epsilon)dx \lesssim \epsilon^{\alpha-d-1}\ln T + T\epsilon^{1+\alpha}. \quad (120)$$
Moreover,
$$\int_{\mathcal{X}} (\eta^*(x) - \eta_a(x))q_a(x)\mathbf{1}(\eta^*(x) - \eta_a(x) \le \epsilon)dx \le T\epsilon\int f(x)\mathbf{1}(\eta^*(x) - \eta_a(x) \le \epsilon)dx \lesssim T\epsilon^{1+\alpha}. \quad (121)$$
Therefore
$$R_a \lesssim \epsilon^{\alpha-d-1}\ln T + T\epsilon^{1+\alpha}. \quad (122)$$
Note that we have not specified the value of $\epsilon$ earlier; therefore, in (122), $\epsilon$ can take any value.
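Equation (122) leaves $\epsilon$ free. A quick numeric check (ours, with assumed test values of $T$, $\alpha$ and $d$) confirms that the two terms of (122) balance exactly at $\epsilon = (\ln T/T)^{1/(d+2)}$:

```python
# Quick numeric check (ours) that the two terms of (122) balance at
# eps = (ln T / T)^(1/(d+2)); both then scale as
# T^(1-(alpha+1)/(d+2)) * (ln T)^((alpha+1)/(d+2)).
import math

def terms(T, alpha, d):
    eps = (math.log(T) / T) ** (1 / (d + 2))
    t1 = eps ** (alpha - d - 1) * math.log(T)   # first term of (122)
    t2 = T * eps ** (1 + alpha)                 # second term of (122)
    return t1, t2

alpha, d = 1.0, 2
for T in (1e4, 1e6, 1e8):
    t1, t2 = terms(T, alpha, d)
    print(f"T={T:.0e}: ratio t1/t2 = {t1/t2:.6f}")
```

The ratio is identically 1 because $\epsilon^{-(d+2)} = T/\ln T$ at this choice, which is exactly the trade-off the minimization exploits.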
We can then select $\epsilon$ to minimize the right hand side of (122). Let
$$\epsilon \sim \left(\frac{\ln T}{T}\right)^{\frac{1}{d+2}}; \quad (123)$$
then
$$R_a \lesssim T^{1-\frac{\alpha+1}{d+2}}(\ln T)^{\frac{\alpha+1}{d+2}}. \quad (124)$$
The overall regret can then be bounded by summation over $a$:
$$R \lesssim |\mathcal{A}|T^{1-\frac{\alpha+1}{d+2}}(\ln T)^{\frac{\alpha+1}{d+2}}. \quad (125)$$

H. Proof of Theorem 6

Let
$$g(z) = \frac{1}{M_Z}(\eta^*(z) - \eta_a(z))^{-d}\mathbf{1}\left(f(z) \ge \frac{1}{T}, \eta^*(z) - \eta_a(z) > \epsilon(z)\right), \quad (126)$$
$$\epsilon(x) = (Tf(x))^{-\frac{1}{d+2}}, \quad (127)$$
and $M_Z$ is the normalization constant, which ensures that $\int g(z)dz = 1$. Let $Z$ be a random variable with pdf $g$. We then bound $R_a$ for the case with unbounded support of the contexts, under Assumptions 2 and 3:
$$R_a = \int_{\mathcal{X}} (\eta^*(x) - \eta_a(x))q_a(x)\mathbf{1}\left(\eta^*(x) - \eta_a(x) > \epsilon(x), f(x) \ge \frac{1}{T}\right)dx + \int_{\mathcal{X}} (\eta^*(x) - \eta_a(x))q_a(x)\mathbf{1}\left(\eta^*(x) - \eta_a(x) \le \epsilon(x), f(x) \ge \frac{1}{T}\right)dx + \int_{\mathcal{X}} (\eta^*(x) - \eta_a(x))q_a(x)\mathbf{1}\left(f(x) < \frac{1}{T}\right)dx := I_1 + I_2 + I_3. \quad (128)$$
Now we bound the three terms in (128) separately.

Bound of $I_1$. Following Lemmas 7 and 13, it can be shown that for some constant $C_3$,
$$I_1 \le C_3M_Z\mathbb{E}\left[\int_{B(Z, r_a(Z))} q_a(u)(\eta^*(u) - \eta_a(u))du\right]. \quad (129)$$
The right hand side of (129) can be bounded as follows:
$$\mathbb{E}\left[\int_{B(Z, r_a(Z))} q_a(u)(\eta^*(u) - \eta_a(u))du\right] \le \frac{4}{3}\mathbb{E}\left[(\eta^*(Z) - \eta_a(Z))\int_{B(Z, r_a(Z))} q_a(u)du\right] \le \frac{4}{3}\mathbb{E}[(n_a(Z) + 1)(\eta^*(Z) - \eta_a(Z))] \lesssim \frac{1}{M_Z}\int \left(\frac{C_1\ln T}{\eta^*(z) - \eta_a(z)} + \eta^*(z) - \eta_a(z)\right)(\eta^*(z) - \eta_a(z))^{-d}\mathbf{1}\left(\eta^*(z) - \eta_a(z) \ge \epsilon(z), f(z) \ge \frac{1}{T}\right)dz \lesssim \frac{\ln T}{M_Z}\int (\eta^*(z) - \eta_a(z))^{-(d+1)}\mathbf{1}\left(\eta^*(z) - \eta_a(z) \ge \epsilon(z), f(z) \ge \frac{1}{T}\right)dz. \quad (130)$$
Therefore
$$I_1 \lesssim \ln T\int (\eta^*(x) - \eta_a(x))^{-(d+1)}\mathbf{1}\left(\eta^*(x) - \eta_a(x) \ge \epsilon(x), f(x) \ge \frac{1}{T}\right)dx. \quad (131)$$
Let $c \in \mathbb{R}$ be a threshold that will be determined later. For $f(x) \ge c$,
$$\int (\eta^*(x) - \eta_a(x))^{-(d+1)}\mathbf{1}(\eta^*(x) - \eta_a(x) \ge \epsilon(x), f(x) \ge c)dx \le \frac{1}{c}\mathbb{E}\left[(\eta^*(X) - \eta_a(X))^{-(d+1)}\mathbf{1}\left(\eta^*(X) - \eta_a(X) \ge (Tc)^{-\frac{1}{d+2}}\right)\right] \lesssim \frac{1}{c}(Tc)^{\frac{d+1}{d+2}\left(1-\frac{\alpha}{d+1}\right)} = T(Tc)^{-\frac{\alpha+1}{d+2}}. \quad (132)$$
To bound the integral over the other region, i.e. $1/T \le f(x) < c$, we use Lemma 9:
$$\int (\eta^*(x) - \eta_a(x))^{-(d+1)}\mathbf{1}\left(\eta^*(x) - \eta_a(x) \ge \epsilon(x), \frac{1}{T} \le f(x) < c\right)dx \le \int \epsilon^{-(d+1)}(x)\mathbf{1}\left(\frac{1}{T} \le f(x) < c\right)dx = T^{\frac{d+1}{d+2}}\mathbb{E}\left[f^{-\frac{1}{d+2}}(X)\mathbf{1}\left(\frac{1}{T} \le f(X) < c\right)\right] \lesssim \begin{cases} T^{\frac{d+1}{d+2}}\left(T^{\frac{1}{d+2}-\beta} + c^{\beta-\frac{1}{d+2}}\right) & \text{if } \beta \neq \frac{1}{d+2}, \\ T^{\frac{d+1}{d+2}}\ln(Tc) & \text{if } \beta = \frac{1}{d+2}. \end{cases} \quad (133)$$
From (132) and (133),
$$I_1 \lesssim \begin{cases} \ln T\left[T(Tc)^{-\frac{\alpha+1}{d+2}} + T^{\frac{d+1}{d+2}}c^{\beta-\frac{1}{d+2}} + T^{1-\beta}\right] & \text{if } \beta \neq \frac{1}{d+2}, \\ \ln T\left[T(Tc)^{-\frac{\alpha+1}{d+2}} + T^{\frac{d+1}{d+2}}\ln(Tc)\right] & \text{if } \beta = \frac{1}{d+2}. \end{cases} \quad (134)$$
To minimize (134), let
$$c = T^{-\frac{\alpha}{\alpha+(d+2)\beta}}; \quad (135)$$
then
$$I_1 \lesssim \begin{cases} T^{1-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}}\ln T + T^{1-\beta}\ln T & \text{if } \beta \neq \frac{1}{d+2}, \\ T^{\frac{d+1}{d+2}}\ln^2 T & \text{if } \beta = \frac{1}{d+2}. \end{cases} \quad (136)$$
Bound of $I_2$. We still discuss $f(x) \ge c$ and $1/T \le f(x) < c$ separately. For $f(x) \ge c$,
$$\int (\eta^*(x) - \eta_a(x))q_a(x)\mathbf{1}(\eta^*(x) - \eta_a(x) \le \epsilon(x), f(x) \ge c)dx \le T(Tc)^{-\frac{1}{d+2}}P\left(\eta^*(X) - \eta_a(X) \le (Tc)^{-\frac{1}{d+2}}\right) \lesssim T(Tc)^{-\frac{\alpha+1}{d+2}}. \quad (137)$$
For $1/T \le f(x) < c$,
$$\int (\eta^*(x) - \eta_a(x))q_a(x)\mathbf{1}\left(\eta^*(x) - \eta_a(x) \le \epsilon(x), \frac{1}{T} \le f(x) < c\right)dx \le T\int \epsilon(x)f(x)\mathbf{1}\left(\frac{1}{T} \le f(x) < c\right)dx = T^{\frac{d+1}{d+2}}\int f^{\frac{d+1}{d+2}}(x)\mathbf{1}\left(\frac{1}{T} \le f(x) < c\right)dx \lesssim \begin{cases} T^{\frac{d+1}{d+2}}\left(T^{\frac{1}{d+2}-\beta} + c^{\beta-\frac{1}{d+2}}\right) & \text{if } \beta \neq \frac{1}{d+2}, \\ T^{\frac{d+1}{d+2}}\ln(Tc) & \text{if } \beta = \frac{1}{d+2}. \end{cases} \quad (138)$$
Similar to $I_1$, pick $c = T^{-\frac{\alpha}{\alpha+(d+2)\beta}}$.

Bound of $I_3$. From Assumption 3(b), $\eta^*(x) - \eta_a(x) \le M$. Moreover, from Lemma 1, $q_a(x) \le Tf(x)$ for almost all $x \in \mathcal{X}$. Hence
$$I_3 \le MT\int f(x)\mathbf{1}\left(f(x) < \frac{1}{T}\right)dx \lesssim T^{1-\beta}. \quad (139)$$
Combining $I_1$, $I_2$ and $I_3$,
$$R_a \lesssim \begin{cases} T^{1-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}}\ln T + T^{1-\beta}\ln T & \text{if } \beta \neq \frac{1}{d+2}, \\ T^{\frac{d+1}{d+2}}\ln^2 T & \text{if } \beta = \frac{1}{d+2}. \end{cases} \quad (140)$$
Theorem 6 can then be proved by $R = \sum_{a \in \mathcal{A}} R_a$.

I. Proof of Lemmas

I.1. Proof of Lemma 4

From Assumption 2(c), $W_i$ is subgaussian with parameter $\sigma^2$. Therefore, for any fixed set $I \subseteq \{1, \ldots, T\}$ with $|I| = k$,
$$P\left(\sum_{i \in I} W_i > ku\right) \le \inf_\lambda e^{-\lambda ku}\mathbb{E}\left[e^{\lambda\sum_{i \in I} W_i}\right] \le \inf_\lambda e^{-\lambda ku}e^{\frac{\lambda^2k\sigma^2}{2}} = e^{-\frac{ku^2}{2\sigma^2}}, \quad (141)$$
and by symmetry,
$$P\left(\left|\sum_{i \in I} W_i\right| > ku\right) \le 2e^{-\frac{ku^2}{2\sigma^2}}. \quad (142)$$
Now we need a union bound.³ Let $A_{ij}$ be the $d-1$ dimensional hyperplane that bisects $X_i$ and $X_j$. The number of such hyperplanes is at most $N_p = T(T-1)/2$. Note that $N_p$ hyperplanes divide a $d$ dimensional space into at most $N_r = \sum_{j=0}^d \binom{N_p}{j}$ regions. Therefore
$$N_r \le \sum_{j=0}^d \binom{N_p}{j} \le (d+1)N_p^d \le dT^{2d}. \quad (143)$$
The $k$ nearest neighbors of all $x$ within the same region are identical. Combining with the action space $\mathcal{A}$, there are at most $N_r|\mathcal{A}|$ distinct neighbor sets. Hence
$$|\{N_t(x, a)\,|\,x \in \mathcal{X}, a \in \mathcal{A}\}| \le dT^{2d}|\mathcal{A}|. \quad (144)$$
Taking a union bound over all such sets,
$$P\left(\exists x, a, t: \left|\sum_{i \in N_t(x,a)} W_i\right| > ku\right) \le dT^{2d}|\mathcal{A}|e^{-\frac{ku^2}{2\sigma^2}}. \quad (145)$$
The proof is complete.

³The construction of hyperplanes follows the proof of Lemma 3 in (Jiang, 2019) and Appendix H.5 in (Zhao & Wan, 2024).
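The Chernoff step in (141) can be verified numerically. The check below is ours: for a $\sigma^2$-subgaussian sum of $k$ terms, the infimum over $\lambda$ of $e^{-\lambda ku + \lambda^2k\sigma^2/2}$ equals $e^{-ku^2/(2\sigma^2)}$, attained at $\lambda = u/\sigma^2$; the values of $k$, $u$ and $\sigma$ are arbitrary test inputs.

```python
# Numerical check (ours) of the Chernoff optimization in (141): the infimum
# of exp(-lambda*k*u + lambda^2*k*sigma^2/2) over lambda > 0 equals
# exp(-k*u^2/(2*sigma^2)), attained at lambda = u/sigma^2.
import math

k, u, sigma = 10, 0.3, 0.5
closed_form = math.exp(-k * u**2 / (2 * sigma**2))

# grid search over lambda in (0, 5], which contains the optimizer u/sigma^2 = 1.2
grid = [i * 1e-4 for i in range(1, 50001)]
grid_min = min(math.exp(-lam * k * u + lam**2 * k * sigma**2 / 2) for lam in grid)

print(closed_form, grid_min)
```

The grid minimum agrees with the closed form to high precision, confirming the last equality of (141).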
I.2. Proof of Lemma 5

From (9), with $|\{i < t\,|\,A_i = a\}| \geq k$,
$$\hat{\eta}_{a,t}(x) = \frac{1}{k}\sum_{i\in N_t(x,a)} Y_i + b + L\rho_{a,t}(x) = \frac{1}{k}\sum_{i\in N_t(x,a)} \eta_a(X_i) + \frac{1}{k}\sum_{i\in N_t(x,a)} W_i + b + L\rho_{a,t}(x).\quad (146)$$
Hence
$$\left|\hat{\eta}_{a,t}(x) - (\eta_a(x) + b + L\rho_{a,t}(x))\right| \leq \frac{1}{k}\sum_{i\in N_t(x,a)}|\eta_a(X_i) - \eta_a(x)| + \left|\frac{1}{k}\sum_{i\in N_t(x,a)} W_i\right| \leq L\rho_{a,t}(x) + b,\quad (147)$$
which comes from Assumption 2(d) and Lemma 4. Lemma 5 can then be proved using (147).

I.3. Proof of Lemma 6

We prove Lemma 6 by contradiction. If $n(x, a, r_a(x)) > k$, then let
$$t = \max\{\tau\,|\,\|X_\tau - x\| \leq r_a(x),\ A_\tau = a\}\quad (148)$$
be the last step falling in $B(x, r_a(x))$ with action $a$. Then $B(x, r_a(x)) \subseteq B(X_t, 2r_a(x))$, and thus there are at least $k$ points in $B(X_t, 2r_a(x))$. Therefore
$$\rho_{a,t}(X_t) < 2r_a(x).\quad (149)$$
Denote
$$a^*(x) = \arg\max_a \eta_a(x)\quad (150)$$
as the best action at context $x$. $A_t = a$ is selected only if the UCB of action $a$ is not less than the UCB of action $a^*(x)$, i.e.
$$\hat{\eta}_{a,t}(X_t) \geq \hat{\eta}_{a^*(x),t}(X_t).\quad (151)$$
From Lemma 5,
$$\hat{\eta}_{a,t}(X_t) \leq \eta_a(X_t) + 2b + 2L\rho_{a,t}(X_t),\quad (152)$$
$$\hat{\eta}_{a^*(x),t}(X_t) \geq \eta_{a^*(x)}(X_t) = \eta^*(X_t).\quad (153)$$
From (151), (152) and (153),
$$\eta_a(X_t) + 2b + 2L\rho_{a,t}(X_t) \geq \eta^*(X_t),\quad (154)$$
which yields
$$\rho_{a,t}(X_t) \geq \frac{\eta^*(X_t) - \eta_a(X_t) - 2b}{2L} \geq \frac{\eta^*(x) - \eta_a(x) - 2b - 2Lr_a(x)}{2L} = 2r_a(x),\quad (155)$$
in which the last step comes from the definition of $r_a$ in (77). Note that (155) contradicts (149). Therefore $n(x, a, r_a(x)) \leq k$. The proof of Lemma 6 is complete.

I.4. Proof of Lemma 7

$$\mathbb{E}\left[\int_{B(Z,r_a(Z))} q_a(u)(\eta^*(u)-\eta_a(u))du\right] = \int g(z)\int_{B(z,r_a(z))} q_a(u)(\eta^*(u)-\eta_a(u))du\,dz$$
$$\overset{(a)}{\geq} \int\int \mathbf{1}\left(\|z-u\| \leq \frac{3}{4}r_a(u)\right)g(z)q_a(u)(\eta^*(u)-\eta_a(u))dz\,du \geq \int \left(\inf_{\|z-u\|\leq \frac{3}{4}r_a(u)} g(z)\right)v_d\left(\frac{3}{4}\right)^d r_a^d(u)q_a(u)(\eta^*(u)-\eta_a(u))du$$
$$\overset{(b)}{\gtrsim} \int_{\mathcal{X}} g(u)r_a^d(u)q_a(u)(\eta^*(u)-\eta_a(u))du \overset{(c)}{=} \frac{1}{M_Z}\int \frac{1}{[(\eta^*(u)-\eta_a(u))\vee\epsilon]^d}\left(\frac{\eta^*(u)-\eta_a(u)-2b}{6L}\right)^d q_a(u)(\eta^*(u)-\eta_a(u))du$$
$$\overset{(d)}{\geq} \frac{1}{M_Z}\int \mathbf{1}(\eta^*(u)-\eta_a(u) > \epsilon)\frac{1}{(\eta^*(u)-\eta_a(u))^d}\cdot\frac{(\eta^*(u)-\eta_a(u))^d}{(12L)^d}\,q_a(u)(\eta^*(u)-\eta_a(u))du$$
$$\gtrsim \frac{1}{(12L)^d M_Z}\int_{\mathcal{X}} q_a(u)(\eta^*(u)-\eta_a(u))\mathbf{1}(\eta^*(u)-\eta_a(u) > \epsilon)du.\quad (156)$$
Based on (156), Lemma 7 holds with
$$C_3 = c_d(12L)^d\quad (157)$$
for some constant $c_d$ depending only on $d$. Now we explain some key steps in (156). In (a), the order of integration is swapped. Note that if $u \in B(z, r_a(z))$, then $\|u - z\| \leq r_a(z)$. From Assumption 2(d), $|\eta_a(u) - \eta_a(z)| \leq Lr_a(z)$.
Then from (77),
$$r_a(u) = \frac{\eta^*(u)-\eta_a(u)-2b}{6L} \leq \frac{\eta^*(z)-\eta_a(z)+2Lr_a(z)-2b}{6L} = r_a(z) + \frac{1}{3}r_a(z) = \frac{4}{3}r_a(z),\quad (158)$$
thus $\|z-u\| \leq 3r_a(u)/4$ implies $\|u-z\| \leq r_a(z)$. Therefore (a) holds. For (b) in (156), note that for $\|z-u\| \leq 3r_a(u)/4$, using Assumption 2(d) again,
$$\eta^*(z)-\eta_a(z) \leq \eta^*(u)-\eta_a(u) + L\frac{3r_a(u)}{4} + L\frac{3r_a(u)}{4} = \eta^*(u)-\eta_a(u) + \frac{3}{2}Lr_a(u).\quad (159)$$
Hence
$$\frac{g(z)}{g(u)} = \frac{[(\eta^*(u)-\eta_a(u))\vee\epsilon]^d}{[(\eta^*(z)-\eta_a(z))\vee\epsilon]^d} \geq \frac{[(\eta^*(u)-\eta_a(u))\vee\epsilon]^d}{\left[\left(\eta^*(u)-\eta_a(u)+\frac{3}{2}Lr_a(u)\right)\vee\epsilon\right]^d} \gtrsim 1,\quad (160)$$
in which the last step comes from the definition of $r_a$ in (77). (c) uses (77) and the definition of $g$ in (81). For (d), recall the statement of Lemma 7, $\epsilon = 4b$. Therefore, if $\eta^*(u)-\eta_a(u) > \epsilon$, then $\eta^*(u)-\eta_a(u)-2b > (\eta^*(u)-\eta_a(u))/2$.

I.5. Proof of Lemma 8

$$\mathbb{E}\left[\int_{B(Z,r_a(Z))} q_a(u)(\eta^*(u)-\eta_a(u))du\right] \overset{(a)}{\leq} \frac{4}{3}\mathbb{E}\left[\int_{B(Z,r_a(Z))} q_a(u)(\eta^*(Z)-\eta_a(Z))du\right] \overset{(b)}{\leq} \frac{4}{3}(k+1)\mathbb{E}\left[\eta^*(Z)-\eta_a(Z)\right]$$
$$= \frac{4}{3}(k+1)\int_{\mathcal{X}} (\eta^*(z)-\eta_a(z))g(z)dz = \frac{4(k+1)}{3M_Z}\int_{\mathcal{X}} \frac{\eta^*(z)-\eta_a(z)}{[(\eta^*(z)-\eta_a(z))\vee\epsilon]^d}dz$$
$$\leq \frac{4(k+1)}{3M_Z}\left(\int_{\mathcal{X}} (\eta^*(z)-\eta_a(z))^{-(d-1)}\mathbf{1}(\eta^*(z)-\eta_a(z) > \epsilon)dz + \frac{1}{\epsilon^d}\int_{\mathcal{X}} (\eta^*(z)-\eta_a(z))\mathbf{1}(\eta^*(z)-\eta_a(z) \leq \epsilon)dz\right).\quad (161)$$
For (a), from Assumption 2(d),
$$\eta^*(u)-\eta_a(u) \leq \eta^*(z)-\eta_a(z) + 2Lr_a(z) = \eta^*(z)-\eta_a(z) + 2L\cdot\frac{\eta^*(z)-\eta_a(z)-2b}{6L} \leq \frac{4}{3}(\eta^*(z)-\eta_a(z)).\quad (162)$$
(b) comes from (80). The first term in the bracket in (161) can be bounded by
$$\int_{\mathcal{X}} (\eta^*(z)-\eta_a(z))^{-(d-1)}\mathbf{1}(\eta^*(z)-\eta_a(z) > \epsilon)dz \overset{(a)}{\leq} \frac{1}{c}\int_{\mathcal{X}} (\eta^*(z)-\eta_a(z))^{-(d-1)}\mathbf{1}(\eta^*(z)-\eta_a(z) > \epsilon)f(z)dz$$
$$\overset{(b)}{=} \frac{1}{c}\mathbb{E}\left[(\eta^*(X)-\eta_a(X))^{-(d-1)}\mathbf{1}(\eta^*(X)-\eta_a(X) > \epsilon)\right] = \frac{1}{c}\int_0^\infty P\left(\epsilon < \eta^*(X)-\eta_a(X) < t^{-\frac{1}{d-1}}\right)dt$$
$$\leq \frac{1}{c}\int_0^{\epsilon^{-(d-1)}} P\left(\eta^*(X)-\eta_a(X) < t^{-\frac{1}{d-1}}\right)dt.\quad (163)$$
(a) comes from Assumption 2, which requires that $f(x) \geq c$ all over the support. In (b), the random variable $X$ follows a distribution with pdf $f$. If $d > \alpha + 1$, then from Assumption 2(b),
$$(163) \leq \frac{C_\alpha}{c}\int_0^{\epsilon^{-(d-1)}} t^{-\frac{\alpha}{d-1}}dt = \frac{C_\alpha(d-1)}{c(d-1-\alpha)}\epsilon^{\alpha+1-d}.\quad (164)$$
If $d = \alpha + 1$, then
$$(163) \leq \frac{1}{c} + \frac{C_\alpha}{c}\int_1^{\epsilon^{-(d-1)}} t^{-1}dt = \frac{1}{c} + \frac{C_\alpha(d-1)}{c}\ln\frac{1}{\epsilon}.\quad (165)$$
If $d < \alpha + 1$, then
$$(163) \leq \frac{1}{c} + \frac{C_\alpha}{c}\int_1^{\infty} t^{-\frac{\alpha}{d-1}}dt \leq \frac{1}{c} + \frac{C_\alpha(d-1)}{c(\alpha+1-d)}.\quad (166)$$
Now it remains to bound the second term in (161):
$$\int_{\mathcal{X}} (\eta^*(z)-\eta_a(z))\mathbf{1}(\eta^*(z)-\eta_a(z) < \epsilon)dz \leq \frac{1}{c}\mathbb{E}\left[(\eta^*(X)-\eta_a(X))\mathbf{1}(\eta^*(X)-\eta_a(X) < \epsilon)\right] \lesssim \frac{\epsilon^{\alpha+1}}{c}.\quad (167)$$
Therefore, from (161), (164), (165), (166) and (167),
$$\mathbb{E}\left[\int_{B(Z,r_a(Z))} q_a(u)(\eta^*(u)-\eta_a(u))du\right] \lesssim \begin{cases} \frac{k}{M_Zc}\epsilon^{\alpha+1-d} & \text{if } d > \alpha+1,\\[2pt] \frac{k}{M_Zc}\ln\frac{1}{\epsilon} & \text{if } d = \alpha+1,\\[2pt] \frac{k}{M_Zc} & \text{if } d < \alpha+1.\end{cases}\quad (168)$$

I.6. Proof of Lemma 12

We prove Lemma 12 by contradiction. Suppose now that $n(x, a, r_a(x)) > n_a(x)$. Let $t$ be the time of the last sample falling in $B(x, r_a(x))$ with action $a$, i.e.
$$t := \max\{j\,|\,\|X_j - x\| \leq r_a(x),\ A_j = a\}.\quad (169)$$
We first show that $\rho_{a,t}(x) \leq 2r_a(x)$. From (169) and the condition $n(x, a, r_a(x)) > n_a(x)$, before time step $t$, there are already at least $n_a(x)$ steps in $B(x, r_a(x))$. Note that $\|X_t - x\| \leq r_a(x)$, thus $B(x, r_a(x)) \subseteq B(X_t, 2r_a(x))$. Therefore, there are already at least $n_a(x)$ samples with action $a$ in $B(X_t, 2r_a(x))$. Recall that $\rho_{a,t}(x) = \rho_{a,t,k_{a,t}(x)}(x)$. If $\rho_{a,t}(x) > 2r_a(x)$, then $k_{a,t}(x) > n_a(x)$. From (16),
$$L\rho_{a,t}(x) = L\rho_{a,t,k_{a,t}(x)}(x) \leq \sqrt{\frac{\ln T}{k_{a,t}(x)}} < \sqrt{\frac{\ln T}{n_a(x)}} = 2Lr_a(x),\quad (170)$$
then contradiction occurs. Therefore $\rho_{a,t}(x) \leq 2r_a(x)$. From Lemma 11, under $E$,
$$\hat{\eta}_{a,t}(x) \leq \eta_a(x) + 2\sqrt{\frac{\ln(2^dT^{2d+3}|\mathcal{A}|)}{n_a(x)}} + 2Lr_a(x).\quad (171)$$
Since action $a$ is selected at time $t$, from Lemma 11,
$$\hat{\eta}_{a,t}(x) \geq \hat{\eta}_{a^*(x),t}(x) \geq \eta^*(x).\quad (172)$$
Combining (171) and (172) yields
$$2\sqrt{\frac{\ln(2^dT^{2d+3}|\mathcal{A}|)}{n_a(x)}} + 2Lr_a(x) \geq \eta^*(x) - \eta_a(x).\quad (173)$$
We now derive an inequality that contradicts (173). From (112) and (113),
$$2\sqrt{\frac{\ln(2^dT^{2d+3}|\mathcal{A}|)}{n_a(x)}} = 2\sqrt{\frac{\ln(2^dT^{2d+3}|\mathcal{A}|)}{C_1\ln T}}(\eta^*(x)-\eta_a(x)) \leq 2\sqrt{\frac{2d+3+\ln(2^d|\mathcal{A}|)}{C_1}}(\eta^*(x)-\eta_a(x)) < \frac{1}{2}(\eta^*(x)-\eta_a(x)).\quad (174)$$
From (111),
$$2Lr_a(x) = \frac{1}{\sqrt{C_1}}(\eta^*(x)-\eta_a(x)) \leq \frac{1}{2}(\eta^*(x)-\eta_a(x)).\quad (175)$$
From (174) and (175),
$$2\sqrt{\frac{\ln(2^dT^{2d+3}|\mathcal{A}|)}{n_a(x)}} + 2Lr_a(x) < \eta^*(x)-\eta_a(x),\quad (176)$$
which contradicts (173). Hence
$$n(x, a, r_a(x)) \leq n_a(x).\quad (177)$$

I.7. Proof of Lemma 13

$$\mathbb{E}\left[\int_{B(Z,r_a(Z))} q_a(u)(\eta^*(u)-\eta_a(u))du\right] \geq \int\int \mathbf{1}\left(\|z-u\| \leq \frac{2}{3}r_a(u)\right)g(z)q_a(u)(\eta^*(u)-\eta_a(u))dz\,du$$
$$\geq \int \left(\inf_{\|z-u\|\leq \frac{2}{3}r_a(u)} g(z)\right)v_d\left(\frac{2}{3}\right)^d r_a^d(u)q_a(u)(\eta^*(u)-\eta_a(u))du \overset{(a)}{\gtrsim} \int_{\mathcal{X}} g(u)r_a^d(u)q_a(u)(\eta^*(u)-\eta_a(u))du$$
$$= \frac{1}{M_Z}\int \frac{r_a^d(u)}{[(\eta^*(u)-\eta_a(u))\vee\epsilon]^d}q_a(u)(\eta^*(u)-\eta_a(u))du \geq \frac{1}{M_Z}\int \mathbf{1}(\eta^*(u)-\eta_a(u) > \epsilon)\frac{1}{(\eta^*(u)-\eta_a(u))^d}\cdot\frac{(\eta^*(u)-\eta_a(u))^d}{(2L\sqrt{C_1})^d}q_a(u)(\eta^*(u)-\eta_a(u))du$$
$$\gtrsim \frac{1}{(2L\sqrt{C_1})^d M_Z}\int_{\mathcal{X}} q_a(u)(\eta^*(u)-\eta_a(u))\mathbf{1}(\eta^*(u)-\eta_a(u) > \epsilon)du.\quad (178)$$
For (a), if $\|u-z\| \leq r_a(z)$, then from the definition of $r_a$ in (111) and Assumption 2(d),
$$\frac{r_a(u)}{r_a(z)} = \frac{\eta^*(u)-\eta_a(u)}{\eta^*(z)-\eta_a(z)} \leq \frac{\eta^*(z)-\eta_a(z)+2Lr_a(z)}{\eta^*(z)-\eta_a(z)} = 1 + \frac{1}{\sqrt{C_1}} \leq \frac{4}{3},\quad (179)$$
and
$$\frac{g(z)}{g(u)} = \frac{[(\eta^*(u)-\eta_a(u))\vee\epsilon]^d}{[(\eta^*(z)-\eta_a(z))\vee\epsilon]^d} \geq \frac{[(\eta^*(u)-\eta_a(u))\vee\epsilon]^d}{\left[\left(\eta^*(u)-\eta_a(u)+\frac{4}{3}Lr_a(u)\right)\vee\epsilon\right]^d} \gtrsim 1.\quad (180)$$

I.8. Proof of Lemma 14

$$\mathbb{E}\left[\int_{B(Z,r_a(Z))} q_a(u)(\eta^*(u)-\eta_a(u))du\right] \overset{(a)}{\leq} \frac{3}{2}\mathbb{E}\left[\int_{B(Z,r_a(Z))} q_a(u)(\eta^*(Z)-\eta_a(Z))du\right]$$
$$\leq \frac{3}{2}\mathbb{E}\left[\left((n_a(Z)+1)\wedge Tf(Z)r_a^d(Z)\right)(\eta^*(Z)-\eta_a(Z))\right] = \frac{3}{2}\int \left((n_a(z)+1)\wedge Tf(z)r_a^d(z)\right)(\eta^*(z)-\eta_a(z))\frac{1}{M_Z[(\eta^*(z)-\eta_a(z))\vee\epsilon]^d}dz$$
$$\lesssim \frac{1}{M_Z}\int \left(\frac{C_1\ln T}{\eta^*(z)-\eta_a(z)} + \eta^*(z)-\eta_a(z)\right)\frac{\mathbf{1}(\eta^*(z)-\eta_a(z) > \epsilon)}{(\eta^*(z)-\eta_a(z))^d}dz + \frac{1}{M_Z\epsilon^d}\int Tf(z)r_a^d(z)(\eta^*(z)-\eta_a(z))\mathbf{1}(\eta^*(z)-\eta_a(z) \leq \epsilon)dz$$
$$\lesssim \frac{\ln T}{M_Z}\int (\eta^*(z)-\eta_a(z))^{-(d+1)}\mathbf{1}(\eta^*(z)-\eta_a(z) > \epsilon)dz + \frac{T}{M_Z\epsilon^d}\mathbb{E}\left[(\eta^*(X)-\eta_a(X))^{d+1}\mathbf{1}(\eta^*(X)-\eta_a(X) \leq \epsilon)\right]$$
$$\lesssim \frac{1}{M_Z}\left(\epsilon^{\alpha-d-1}\ln T + T\epsilon^{1+\alpha}\right).\quad (181)$$
For (a), from Assumption 2(d) and (111),
$$\eta^*(u)-\eta_a(u) \leq \eta^*(z)-\eta_a(z) + 2Lr_a(z) = \eta^*(z)-\eta_a(z) + \frac{1}{\sqrt{C_1}}(\eta^*(z)-\eta_a(z)) \leq \frac{3}{2}(\eta^*(z)-\eta_a(z)).\quad (182)$$

I.9. Proof of Lemma 9

$$\mathbb{E}\left[f^{-p}(X)\mathbf{1}(a \leq f(X) < b)\right] = \int_0^\infty P\left(f(X) < t^{-\frac{1}{p}},\ a \leq f(X) < b\right)dt$$
$$\leq \int_0^{b^{-p}} P(f(X) < b)dt + \int_{b^{-p}}^{a^{-p}} P\left(f(X) < t^{-\frac{1}{p}}\right)dt \leq C_\beta b^{\beta-p} + C_\beta\int_{b^{-p}}^{a^{-p}} t^{-\frac{\beta}{p}}dt.\quad (183)$$
If $p > \beta$, i.e. $\beta/p < 1$, then
$$(183) \leq C_\beta b^{\beta-p} + \frac{C_\beta}{1-\beta/p}(a^{-p})^{1-\frac{\beta}{p}} \lesssim a^{\beta-p}.\quad (184)$$
If $p < \beta$, then
$$(183) \leq C_\beta b^{\beta-p} + C_\beta\int_{b^{-p}}^{\infty} t^{-\frac{\beta}{p}}dt = C_\beta b^{\beta-p} + \frac{C_\beta}{\beta/p-1}b^{\beta-p} \lesssim b^{\beta-p}.\quad (185)$$
If $p = \beta$, then
$$(183) \leq C_\beta b^{\beta-p} + C_\beta\ln\frac{a^{-p}}{b^{-p}} = C_\beta b^{\beta-p} + pC_\beta\ln\frac{b}{a}.\quad (186)$$

I.10. Proof of Lemma 3

Our proof follows the proof of Lemma 3.1 in (Rigollet & Zeevi, 2010). Define
$$U = \sum_{t=1}^T (\eta^*(X_t) - \eta_{A_t}(X_t)),\quad (187)$$
$$V = \sum_{t=1}^T \mathbf{1}(\eta_{A_t}(X_t) < \eta^*(X_t)).\quad (188)$$
Then $R = \mathbb{E}[U]$ and $S = \mathbb{E}[V]$. For any $\delta > 0$,
$$U \geq \delta\sum_{t=1}^T \mathbf{1}(\eta_{A_t}(X_t) < \eta^*(X_t))\mathbf{1}(|\eta^*(X_t) - \eta_{A_t}(X_t)| > \delta) \geq \delta\left(V - \sum_{t=1}^T \mathbf{1}\left(A_t \neq a^*(X_t),\ |\eta^*(X_t) - \eta_{A_t}(X_t)| \leq \delta\right)\right).\quad (189)$$
Taking expectations and applying the margin condition, we have
$$R \geq \delta S - TC_\alpha\delta^{\alpha+1}.\quad (190)$$
Now we maximize the right hand side of (190). Setting the derivative with respect to $\delta$ to zero gives
$$\delta = \left(\frac{S}{(\alpha+1)TC_\alpha}\right)^{\frac{1}{\alpha}},\quad (191)$$
and hence
$$R \geq S\left(\frac{S}{(\alpha+1)TC_\alpha}\right)^{\frac{1}{\alpha}} - TC_\alpha\left(\frac{S}{(\alpha+1)TC_\alpha}\right)^{\frac{\alpha+1}{\alpha}} = \frac{\alpha}{\alpha+1}S\left(\frac{S}{(\alpha+1)TC_\alpha}\right)^{\frac{1}{\alpha}}.\quad (192)$$
The proof is complete.
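The maximization of the right hand side of (190) can be verified directly; the display below is a worked check of that final optimization step.

```latex
% Maximize \phi(\delta) = \delta S - T C_\alpha \delta^{\alpha+1} over \delta > 0:
\phi'(\delta) = S - (\alpha+1) T C_\alpha \delta^{\alpha} = 0
\;\Longrightarrow\;
\delta^* = \left(\frac{S}{(\alpha+1) T C_\alpha}\right)^{1/\alpha}.
% Substituting back, T C_\alpha (\delta^*)^{\alpha+1} = \frac{S\delta^*}{\alpha+1}, so
\phi(\delta^*) = S\delta^* - \frac{S\delta^*}{\alpha+1}
             = \frac{\alpha}{\alpha+1}\, S \left(\frac{S}{(\alpha+1) T C_\alpha}\right)^{1/\alpha},
% which rearranges to S \lesssim T^{\frac{1}{\alpha+1}} R^{\frac{\alpha}{\alpha+1}}.
```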