# Efficient Contextual Bandits with Continuous Actions

Maryam Majzoubi (New York University, mm7918@nyu.edu), Chicheng Zhang (University of Arizona, chichengz@cs.arizona.edu), Rajan Chari (Microsoft Research, rajan.chari@microsoft.com), Akshay Krishnamurthy (Microsoft Research, akshaykr@microsoft.com), John Langford (Microsoft Research, jcl@microsoft.com), Aleksandrs Slivkins (Microsoft Research, slivkins@microsoft.com)

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Abstract

We create a computationally tractable algorithm for contextual bandits with continuous actions having unknown structure. Our reduction-style algorithm composes with most supervised learning representations. We prove that it works in a general sense and verify the new functionality with large-scale experiments.

1 Introduction

In contextual bandit learning [6, 1, 39, 3], an agent repeatedly observes its environment, chooses an action, and receives reward feedback, with the goal of optimizing cumulative reward. When the action space is discrete, there are many solutions to contextual bandit learning with successful deployments in personalized health, content recommendation, and elsewhere [e.g., 42, 54, 2, 44, 25, 43]. However, in many practical settings the action chosen is actually continuous. How then can we efficiently choose the best action given the context? This question is also extremely relevant to reinforcement learning more generally, since contextual bandit learning is one-step reinforcement learning.

There are many concrete examples of reinforcement learning problems with continuous actions. In precision medicine [20, 31], doctors may prescribe to a patient a medication with a continuously valued dosage [32]. In data center optimization, the fan speeds and liquid coolant flow may be controllable continuous values [41]. In operating systems, when a computer makes a connection over the network, we may be able to adjust its packet send rate in response to the current network status [30]. All of these may be optimizable based on feedback and context.

A natural baseline approach here is to posit smoothness assumptions on the world, as in much prior work, e.g., [5, 34, 18, 50, 19]. This approach comes with practical drawbacks. Many applications do not exhibit any smoothness structure. When/if they do, the smoothness parameters (such as a Lipschitz constant) must be known in advance. Unfortunately, discovering the smoothness parameters is challenging, and requires knowing some other parameters and/or extensive exploration.

A recent approach to continuous actions [37] realizes similar performance guarantees without knowing the Lipschitz constant (let alone a more refined smoothness structure), while leveraging any preferred policy representation. Here, each action is smoothed to a distribution over an interval, and the benchmark one competes with is smoothed similarly. Unfortunately, their algorithm is computationally infeasible since it requires enumeration of all possible policy parameter settings.

In this paper, we realize similar benefits with a computationally practical algorithm for contextual bandits with continuous action space [0, 1]. Our algorithms are oracle-efficient [39, 21, 3, 47, 53]: computationally efficient whenever we can solve certain supervised learning problems. Our main algorithm chooses actions by navigating a tree with supervised learners acting as routing functions in each node.
Each leaf corresponds to an action, which is then smoothed to a distribution from which the final action is sampled. We use the reward feedback to update the supervised learners in the nodes to improve the tree policy. Our contributions can be summarized as follows:

- We propose CATS, a new algorithm for contextual bandits with continuous actions (Algorithm 1). It uses ϵ-greedy exploration with tree policy classes (Definition 2) and is implemented in a fully online and oracle-efficient manner. We prove that CATS has prediction and update times scaling logarithmically with the tree size, an exponential improvement over traditional approaches. Assuming realizability, CATS has a sublinear regret guarantee against the tree policy class (Theorem 6).
- We propose CATS_Off, an off-policy optimization version of CATS (Algorithm 3) that can utilize logged data to train and select tree policies of different complexities. We also establish statistical guarantees for this algorithm (Theorem 7).
- We implement our algorithms in Vowpal Wabbit (vowpalwabbit.org), and compare with baselines on real datasets. Experiments demonstrate the efficacy and efficiency of our approach (Section 5).

Discussion. The smoothing approach has several appealing properties. We look for a good interval of actions, which is possible even when the best single action is impossible to find. We need to guess a good width, but the algorithm adjusts to the best location for the interval. This is less guessing compared to uniform discretization (where the width and location are tied to some extent). While the bandwidth controls statistical performance, an algorithm is free to discretize actions for the sake of computational feasibility. An algorithm can improve accuracy by reusing datapoints for overlapping bands. Finally, the approach is principled, leading to specific, easily interpretable guarantees.

The tree-based classifier is a successful approach for supervised learning with a very large number of actions (which we need for computational feasibility). However, adapting it for smoothing runs into some challenges. First, a naive implementation leads to a prohibitively large per-round running time; we obtain an exponential improvement as detailed in Section 3.1. Second, existing statistical guarantees do not carry over to regret in bandits: they merely transfer errors from tree nodes to the root [10, 9], but the former errors could be huge. We posit a realizability assumption; even then, the analysis is non-trivial because the errors accumulate as we move down the tree.

Another key advantage of our approach is that it allows us to use off-policy model selection. For off-policy evaluation, we use smoothing to induce an exploration distribution supported on the entire action space. Hence, we can discover when refinements in tree depth or smoothing parameters result in superior performance. Such model selection is not possible when using discretization approaches. When employed in an offline setup with data collected by a baseline logging policy, our experiments show that off-policy optimization can yield dramatic performance improvements.

Related work. Contextual bandits are quite well-understood for small, discrete action spaces, with rich theoretical results and successful deployments in practice. To handle large or infinite action spaces, most prior work either makes strong parametric assumptions such as linearity, or posits some continuity assumptions such as Lipschitzness. More background can be found in [17, 51, 40].
Bandits with Lipschitz assumptions were introduced in [5], and optimally solved in the worst case by [33]. [34, 35, 18, 50] achieve optimal data-dependent regret bounds, while several papers relax global smoothness assumptions with various local definitions [7, 34, 35, 18, 49, 45, 27]. This literature mainly focuses on the non-contextual version, except for [50, 36, 19, 56] (which only consider a fixed policy set $\Pi$). As argued in [37], the smoothing-based approach is productive in these settings, and extends far beyond, e.g., to instances where the global optimum is a discontinuity.

Most related to this paper is [37], which introduces the smoothness approach to contextual bandits and achieves data-dependent and bandwidth-adaptive regret bounds. Their approach extends to generic smoothing distributions (kernels), as well as to adversarial losses. However, their algorithms are inherently computationally inefficient, because they build on the techniques from [6, 21]. Our smoothing-based reward estimator was used in [37] for contextual bandits, as well as in [31, 20] in the observational setting. The works of [31, 20] learn policies that are linear functions of the context, and perform policy optimization via gradient descent on the IPS loss estimate.

2 Preliminaries

Setting and key definitions. We consider the stochastic (i.i.d.) contextual bandit (CB) setting. At each round $t$, the environment produces a (context, loss) pair $(x_t, \ell_t)$ from a distribution $D$. Here, the context $x_t$ is from the context space $X$, and the loss function $\ell_t$ is a mapping from the action space $A = [0, 1]$ to $[0, 1]$. Then, $x_t$ is revealed to the learner, based on which it chooses an action $a_t \in A$ and observes the loss $\ell_t(a_t)$. The learner's goal is to minimize its cumulative loss, $\sum_{t=1}^{T} \ell_t(a_t)$.

Define a smoothing operator $\mathrm{Smooth}_h : A \to \Delta(A)$ that maps each action $a$ to the uniform distribution over the interval $\{a' \in A : |a - a'| \le h\} = [a - h, a + h] \cap [0, 1]$. As notation, let $\nu$ denote the Lebesgue measure, i.e., the uniform distribution over $[0, 1]$. Denote by $\mathrm{Smooth}_h(a' \mid a)$ the probability density function w.r.t. $\nu$ of $\mathrm{Smooth}_h(a)$ at action $a'$. We define $\mathrm{Smooth}_0(a) \triangleq \delta_a$, where $\delta_a$ is the Dirac point mass at $a$. For a policy $\pi : X \to A$, we define $\pi_h(a' \mid x) \triangleq \mathrm{Smooth}_h(a' \mid \pi(x))$ to be the probability density value at action $a'$ of the smoothed policy $\pi_h$ on context $x$. Similarly, we define the $h$-smoothed loss $\ell_h(a) \triangleq \mathbb{E}_{a' \sim \mathrm{Smooth}_h(\cdot \mid a)}[\ell(a')]$. For a policy $\pi : X \to A$ we define the corresponding $h$-smoothed expected loss as $\lambda_h(\pi) \triangleq \mathbb{E}_{(x,\ell) \sim D}\, \mathbb{E}_{a \sim \mathrm{Smooth}_h(\pi(x))}[\ell(a)]$. This is equivalent to defining $\pi_h : x \mapsto \mathrm{Smooth}_h(\pi(x))$ and evaluating $\pi_h$ on the original loss, i.e., $\lambda_0(\pi_h) = \lambda_h(\pi)$. The bandwidth $h$ governs an essential bias-variance trade-off in the continuous-action setting: with small $h$, the smoothed loss $\lambda_h(\pi)$ closely approximates the true expected loss $\lambda_0(\pi)$, whereas the optimal performance guarantees scale inversely with $h$.

Over the $T$ rounds, the learner accumulates a history of interaction. After round $t$, this is $(x_s, a_s, p_s, \ell_s(a_s))_{s=1}^{t}$, where $x_s$ is the context, $a_s$ is the chosen action, $p_s = P_s(a_s \mid x_s)$ is the value at $a_s$ of the density $P_s(\cdot \mid x_s)$ used at round $s$, and $\ell_s(a_s)$ is the observed loss. From this history, we use an inverse propensity score (IPS) estimator [29] to compute an unbiased estimate of the smoothed loss $\lambda_h(\pi)$:
$\hat{V}_t(\pi_h) = \frac{1}{t} \sum_{s=1}^{t} \frac{\pi_h(a_s \mid x_s)}{P_s(a_s \mid x_s)}\, \ell_s(a_s)$.
A useful primitive in contextual bandits is to find a policy $\pi$ that minimizes $\hat{V}_t(\pi_h)$, which is a surrogate for $\lambda_0(\pi)$.
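To make the smoothing density and the IPS estimate above concrete, here is a minimal Python sketch; it is illustrative only (the function names and the list-of-tuples history format are ours, not part of the paper's Vowpal Wabbit implementation), and it assumes $h > 0$ and actions in $[0, 1]$.

```python
def smooth_density(a_prime, a, h):
    """Density w.r.t. the Lebesgue measure on [0, 1] of Smooth_h(a) at a_prime:
    uniform over [a - h, a + h] intersected with [0, 1] (assumes h > 0)."""
    lo, hi = max(0.0, a - h), min(1.0, a + h)
    return 1.0 / (hi - lo) if lo <= a_prime <= hi else 0.0

def ips_smoothed_loss(history, policy, h):
    """IPS estimate V_t(pi_h) of the h-smoothed loss of `policy`.
    history: list of (x, a, p, loss) tuples, where p = P_s(a | x) is the
    logging density at the chosen action a."""
    total = 0.0
    for x, a, p, loss in history:
        total += smooth_density(a, policy(x), h) / p * loss
    return total / len(history)
```

With a logged history and a candidate policy `pi`, `ips_smoothed_loss(history, pi, h)` is an unbiased estimate of $\lambda_h(\pi)$ provided the logging densities are positive wherever $\pi_h$ places mass.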
A natural approach for policy optimization is to reduce to cost-sensitive multiclass classification (CSMC). We choose a discretization parameter $K$ and instantiate a policy class $\Pi : X \to A_K$, where $A_K \triangleq \{0, \frac{1}{K}, \ldots, \frac{K-1}{K}\}$. Then, since $\pi_h(a_s \mid x_s) = \mathrm{Smooth}_h(a_s \mid i/K)$ whenever $\pi(x_s) = i/K$, policy optimization can be naturally phrased as a CSMC problem. For each round, we create a cost-sensitive example $(x_s, c_s)$ where $c_s(i/K) = \frac{\ell_s(a_s)}{P_s(a_s \mid x_s)}\, \mathrm{Smooth}_h(a_s \mid i/K)$ for all $i/K$ in $A_K$. Then, optimizing $\hat{V}_t(\pi_h)$ is equivalent to computing $\arg\min_{\pi \in \Pi} \sum_{s=1}^{t} c_s(\pi(x_s))$. When working with $h$-smoothed losses, the error incurred by using the discretized action space $A_K$ can be controlled, as we can show that $\ell_h(a)$ is $\frac{1}{h}$-Lipschitz [37].¹ So, this discretization strategy can compete with policies that are not restricted to $A_K$, incurring an additional error of $\frac{1}{hK}$ per round.

¹Although we use the $\frac{1}{h}$-Lipschitz property of $h$-smoothed losses here, in general, $h$-smoothed losses have more structure than $\frac{1}{h}$-Lipschitz losses, which admit better regret guarantees in general.

Tree policies. One challenge with applying the CSMC approach is computational: for general classes $\Pi$, classical methods for CSMC (such as one-versus-all) have $\Omega(K)$ running time. This is particularly problematic since we want $K$ to be quite large in order to compete with policies that are not restricted to $A_K$. To overcome this challenge, we consider a structured policy class induced by a binary tree $T$, where each node $v$ is associated with a binary classifier $f^v$ from some base class $F$.²

²We assume that $F$ is finite for simplicity. This can be extended with empirical process arguments.

Definition 1 (Tree policy). Let $K = 2^D$ for some natural number $D$, and let $F$ be a class of binary classifiers from $X$ to $\{\text{left}, \text{right}\}$. $T$ is said to be a tree policy over action space $A_K = \{i/K\}_{i=0}^{K-1}$ using $F$ if: (1) $T$ is a complete binary tree of depth $D$ with $K = 2^D$ leaves, where each leaf $v$ has a label $\mathrm{label}(v) \in \{0/K, \ldots, (K-1)/K\}$, assigned from left to right; (2) in each internal node $v$ of $T$, there is a classifier $f^v$ in $F$; (3) the prediction of $T$ on an example $x$, $T.\mathrm{get\_action}(x)$, is defined as follows. Starting from the root of $T$, repeatedly route $x$ downward by entering the subtree indicated by the prediction of the classifier at the current node. When a leaf is reached, its label is returned (see Algorithm 4 in Appendix A for a formal description).

In other words, a tree policy over action space $A_K$ can be viewed as a decision tree of depth $D$ whose nodes form a hierarchical partition of the discretized action space $A_K$. For each node in the tree, there is a subset of the context space that gets routed to it; therefore, a tree policy $T$ over $A_K$ also implicitly defines a hierarchical partition of the context space $X$. The crucial difference between a tree policy and a decision tree in the usual sense is that each leaf node corresponds to a distinct action. Our tree policy approach is also fundamentally different from the approach of [50, 56] in contextual bandits, in that they use trees to perform regression of the reward as a function of (context, action) pairs. Our policy classes of interest are tree policy classes:

Definition 2. Let $F_K$ denote the policy class of all tree policies over action space $A_K$ using base class $F$, that is, $F_K = \{T : T \text{ is a tree policy over action space } A_K \text{ using } F\}$. Furthermore, let $F_\infty$ denote the policy class of all tree policies of arbitrary depth using base class $F$; formally, $F_\infty = \bigcup_{D \in \mathbb{N}} F_{2^D}$.
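The routing rule of Definition 1, item (3), is simple enough to state directly in code. The following Python sketch is purely illustrative (the `Node` class and field names are ours, not the paper's implementation):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    classifier: Optional[Callable] = None    # internal node: maps x -> "left" or "right"
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[float] = None            # leaf node: a discretized action in A_K

def get_action(tree: Node, x) -> float:
    """Definition 1, item (3): route x from the root to a leaf, return the leaf's label."""
    v = tree
    while v.label is None:                   # internal nodes carry no label
        v = v.left if v.classifier(x) == "left" else v.right
    return v.label

# A depth-1 tree over A_2 = {0, 1/2}: a single threshold classifier at the root.
root = Node(classifier=lambda x: "left" if x[0] < 0.5 else "right",
            left=Node(label=0.0), right=Node(label=1 / 2))
assert get_action(root, [0.7]) == 0.5
```

The traversal touches one classifier per level, which is the source of the $O(\log K)$ prediction time discussed next.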
As a computational primitive, we assume that we can solve CSMC problems over the base class $F$. Note that, formally, these are binary classification problems. The main advantage of using these structured policy classes is computational efficiency. As we demonstrate in the next section, we can use fast online CSMC algorithms to achieve a running time of $O(\log K)$ per example. At the same time, due to the hierarchical structure, choosing an action using a policy in $F_K$ also takes $O(\log K)$ time. Both of these are exponentially faster than the $O(K)$ running time that typically arises from flat representations. Finally, given a tree policy, we define the tree policy rooted at one of its nodes:

Definition 3. Let $v$ be an internal node in $T$. We define $T_v$ as the tree-based policy with root at $v$. We will abbreviate $T_v.\mathrm{get\_action}(x)$ as $T_v(x)$ or $v.\mathrm{get\_action}(x)$.

The performance benchmark. We define the performance benchmark
$\mathrm{Reg}(T, F_\infty, h) \triangleq \mathbb{E}\!\left[\sum_{t=1}^{T} \ell_t(a_t)\right] - T \cdot \inf_{\pi \in F_\infty} \lambda_h(\pi)$.
In words, we are comparing the cumulative expected loss of our algorithm with the $h$-smoothed cumulative expected loss of the best tree policy of arbitrary depth. We call this the $h$-smoothed regret w.r.t. $F_\infty$. Although the focus of this paper is on contextual bandit algorithms with computational efficiency guarantees, in Appendix D we also present several extensions of our results to general policy classes.

Miscellaneous notation. Given a set of CSMC examples $S = \{(x, c)\}$ of size $n$ and a function $f$, we use $\mathbb{E}_S[f(x, c)] \triangleq \frac{1}{n} \sum_{(x,c) \in S} f(x, c)$ to denote the empirical expectation of $f$ over $S$. Given a function $f$ with domain $Z$, define its range to be the set of values it can take, i.e., $\mathrm{range}(f) \triangleq \{f(z) : z \in Z\}$. Specifically, given a tree $T$ over action space $A_K$ and a node $v$ in $T$, $\mathrm{range}(T_v)$ denotes the actions reachable by $T_v$, i.e., the action labels of the leaves that are descendants of $v$. Given a natural number $n$, we denote $[n] \triangleq \{1, \ldots, n\}$.

3 Algorithm

We describe our main algorithm, CATS, for learning with continuous actions using tree policies in Algorithm 1, and an off-policy version in Algorithm 3 for unknown $h$. In Appendix C, we also present a variant of CATS that works online for unknown $h$.

3.1 Smoothed ϵ-greedy algorithm with trees

We present Algorithm 1 in this section. It consists of two main components: first, a smoothed ϵ-greedy exploration strategy (lines 4 to 6); second, a tree training procedure Train_tree called at line 7, namely Algorithm 2. We discuss each component in detail next.

ϵ-greedy exploration with smoothing. At time step $t$, the algorithm uses the policy $\pi_t$ learned from data collected in previous time steps to perform action selection. Specifically, with probability ϵ, it chooses an action uniformly at random from $A$; otherwise, it chooses an action based on the prediction of $\pi_{t,h}$, the $h$-smoothing of policy $\pi_t$. As we will see, $\pi_{t,h}$ has expected loss competitive with any smoothed policy $\pi_h$ with $\pi$ in $F_K$ (and is therefore competitive with $F_\infty$). This component is similar to the ϵ-greedy algorithm for discrete-action contextual bandits [e.g., 39]; here ϵ is a parameter that trades off between exploration and exploitation: a larger ϵ yields better-quality data for learning, while a smaller ϵ implies that actions with better instantaneous losses are taken.

Algorithm 1 CATS: continuous-action tree with smoothing
Input: exploration and smoothing parameters $(\epsilon, h)$, discretization scale $K = 2^D$, base class $F$.
1: Initialize dataset $S_0 \leftarrow \emptyset$, and tree policy $T$ with classifiers $f^v \in F$ at every internal node $v$.
2: for $t = 1, 2, \ldots, T$ do
3:   Let $\pi_t$ be the tree policy $T$ with $\{f^v\}$ as classifiers.
4:   Define policy $P_t(a \mid x) := (1 - \epsilon)\, \pi_{t,h}(a \mid x) + \epsilon$.
5:   Observe context $x_t$, select action $a_t \sim P_t(\cdot \mid x_t)$, observe loss $\ell_t(a_t)$.
6:   Let $c_t(i/K) \leftarrow \frac{\mathrm{Smooth}_h(a_t \mid i/K)}{P_t(a_t \mid x_t)}\, \ell_t(a_t)$ for all $i \in \{0, \ldots, K-1\}$.
7:   $T \leftarrow \mathrm{Train\_tree}(K, F, \{(x_s, c_s)\}_{s=1}^{t})$.

Algorithm 2 Tree training: Train_tree
Input: $K = 2^D$, base class $F$, data $\{(x_s, c_s)\}_{s=1}^{n}$ with $c_s \in \mathbb{R}^K$.
1: For level $d = 0, \ldots, D-1$: $B_d \leftarrow \{(D-d-1)n' + 1, \ldots, (D-d)n'\}$, where $n' = \lfloor n/D \rfloor$.
2: for level $d$ from $D-1$ down to $0$ do
3:   for nodes $v$ at level $d$ do
4:     For each $(x_s, c_s)$, define the binary cost $c^v_s$ with $c^v_s(\text{left}) = c_s(v.\text{left}.\mathrm{get\_action}(x_s))$ and $c^v_s(\text{right}) = c_s(v.\text{right}.\mathrm{get\_action}(x_s))$.
5:     Train the classifier at node $v$: $f^v \leftarrow \arg\min_{f \in F} \mathbb{E}_{S_v}[c^v(f(x))]$, where $S_v = \{(x_s, c^v_s) : s \in B_d,\ c^v_s(\text{left}) \neq c^v_s(\text{right})\}$.
6: return tree $T$ with $\{f^v\}$ as node classifiers.

Tree training. Given the interaction log collected up to time $t$, $\{(x_s, a_s, P_s(a_s \mid x_s), \ell_s(a_s))\}_{s=1}^{t}$, Algorithm 1 incorporates it to produce a policy $\pi_{t+1}$ for time $t+1$. Specifically, $\pi_{t+1}$ is a tree policy $T$ in $F_K$ that approximately minimizes $\hat{V}_t(T_h)$ over the tree policy class $F_K$. To this end, we use Train_tree (Algorithm 2) over the set of cost-sensitive examples $(x_s, c_s)_{s=1}^{t}$ constructed by IPS. For technical reasons,³ Train_tree differs from the filter tree algorithm [10] in that it partitions the dataset into $D = \log K$ subsets, with their index sets $B_0, \ldots, B_{D-1}$ being disjoint subsets of $[n]$. For every $i$, the examples with indices in $B_i$ are dedicated to training the classifiers in tree nodes at level $i$.

³We need to partition the input CSMC dataset in a delicate manner to ensure Train_tree's theoretical guarantees; see Lemma 9 and its proof in Appendix B for more details. In our implementation we ignore this subtlety; see Algorithm 8 in Appendix G.

Train_tree trains the classifiers in the tree nodes in a bottom-up fashion. At the bottom layer, each node $v$ with two leaves as children seeks a classifier $f^v$ in $F$ that directly classifies the context $x$ to the action in $\mathrm{range}(T_v)$ with the smaller expected cost $\mathbb{E}[\ell(a) \mid x]$. For this, it invokes CSMC learning with class $F$, where the costs are the IPS costs for the two children $v.\text{left}$ and $v.\text{right}$. At other internal nodes $v$, given that all the downstream classifiers in the subtrees rooted at $v.\text{left}$ and $v.\text{right}$ have been trained, it aims to find a classifier $f^v$ in $F$ such that $f^v$, in conjunction with the other classifiers in $T_v$, routes context $x$ to the action in $\mathrm{range}(T_v)$ with the smallest expected cost.
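Below is a simplified, purely offline Python sketch of this bottom-up pass, written for readability rather than speed. It drops the level-wise data partitioning (as the authors note in footnote 3, their own implementation also ignores it) and replaces the online CSMC oracle with a brute-force empirical cost minimizer over a finite list of candidate classifiers; none of the names here come from the paper's Vowpal Wabbit code.

```python
class Node:
    def __init__(self, label=None, left=None, right=None):
        self.label, self.left, self.right, self.clf = label, left, right, None

    def get_action(self, x):
        v = self
        while v.label is None:
            v = v.left if v.clf(x) == "left" else v.right
        return v.label

def build_tree(K):
    """Complete binary tree whose K leaves are labelled 0/K, ..., (K-1)/K (K a power of 2)."""
    level = [Node(label=i / K) for i in range(K)]
    levels = [level]
    while len(level) > 1:
        level = [Node(left=level[i], right=level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return level[0], levels[::-1]            # root, levels ordered root (depth 0) ... leaves

def train_tree(K, base_class, data):
    """Bottom-up pass of Algorithm 2, simplified: every example is used at every level.
    base_class: finite list of candidate binary classifiers x -> 'left'/'right'.
    data: list of (x, c) where c maps each leaf label i/K to its IPS cost."""
    root, levels = build_tree(K)
    for level in reversed(levels[:-1]):                  # internal levels, bottom-up
        for v in level:
            examples = []
            for x, c in data:
                cl = c[v.left.get_action(x)]             # cost of routing x to the left subtree
                cr = c[v.right.get_action(x)]            # cost of routing x to the right subtree
                if cl != cr:                             # ties cannot change the argmin
                    examples.append((x, cl, cr))
            def empirical_cost(f, ex=examples):
                return sum(cl if f(x) == "left" else cr for x, cl, cr in ex)
            v.clf = min(base_class, key=empirical_cost)  # cost-sensitive ERM over base_class
    return root
```

Because the classifiers below a node are already fixed when the node is trained, each binary cost is exactly the cost of the action the relevant subtree would choose, which is the filter-tree-style reduction described above.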
Computational complexity. CATS can be implemented in a fully online and oracle-efficient fashion, using online CSMC learners. Specifically, line 5 of Train_tree can be implemented by maintaining a stateful online learner for each tree node $v$, which at time $t$ maintains $f^v_t$, an approximation of $\arg\min_{f \in F} \sum_{s=1}^{t-1} c^v_s(f(x_s))$. Then, upon seeing a binary CSMC example $(x_t, c^v_t)$, the learner employs incremental update rules such as stochastic gradient descent to update its internal state to $f^v_{t+1}$, an approximate solution to the next CSMC problem.

We now look at the per-example computational cost of CATS using the above online implementation of the CSMC oracle. Naively, if in line 5 of Train_tree we instead defined $S_v = \{(x_s, c^v_s) : s \in B_i\}$ for every node $v$, i.e., if we did not filter out examples with identical costs for the left and right children of each node $v$, the time for processing each example would be $O(K)$, since it contributes a binary CSMC example to $S_v$ for every node $v$. Our first observation is that if, at time $t$, $c^v_t(\text{left}) = c^v_t(\text{right})$, the online CSMC learner can skip processing the example $(x_t, c^v_t)$, as is done in line 5 of Train_tree. This is because adding this example does not change the cost-sensitive ERM from round $t$ to round $t+1$. However, the algorithm still must decide whether this happens for each node $v$, which still requires $O(K)$ time naively. Our second observation is that, by carefully utilizing the piecewise-constant nature of the IPS cost vector $c_t$, we can find the nodes that need to be updated and compute the costs of their left and right children, both in $O(\log K)$ time per example. Specifically, as $c_t$ is piecewise constant with two discontinuities, only two root-to-leaf paths contain nodes that have children with differing costs and must be updated (see Appendix G, specifically Lemma 17 and its proof, for more explanation). Exploiting these observations, we implement CATS to have $O(\log K)$ update time, an exponential improvement over naive implementations. This is summarized in the next theorem.

Theorem 4. CATS with an online learner at each node requires $O(\log K)$ computation per example.

We elaborate on our online implementation and present the proof of the theorem in Appendix G. Our theorem generalizes the computational time analysis of the offset tree algorithm for discrete-action contextual bandits [9], in that we allow the input IPS CSMC examples to have multiple nonzero entries.

3.2 Off-policy optimization

As discussed above, one major advantage of the smoothing approach to contextual bandits with continuous actions is that policy optimization can be easily reduced to a CSMC learning problem via counterfactual techniques. This allows off-policy optimization, in the sense that the logged data can be collected using one policy that takes actions in $A$, while we can optimize over (smoothed) policy classes that take actions in $A_K$. In the special setting where we learn from a tree policy class $F_K$, the CATS_Off algorithm (Algorithm 3) can be used.

Algorithm 3 CATS_Off
Input: logged data $\{(x_t, a_t, P_t(a_t \mid x_t), \ell_t(a_t))\}_{t=1}^{T}$, minimum density of the action distribution $p_{\min}$, set of (bandwidth, discretization level) combinations $\mathcal{J} \subseteq [0, 1] \times 2^{\mathbb{N}}$, base class $F$.
1: for (bandwidth, discretization level) $(h, K)$ in $\mathcal{J}$ do
2:   For every $t$ in $[T]$, let $c^h_t(i/K) \leftarrow \frac{\mathrm{Smooth}_h(a_t \mid i/K)}{P_t(a_t \mid x_t)}\, \ell_t(a_t)$ for all $i \in \{0, \ldots, K-1\}$.
3:   for $t = 1, 2, \ldots, T$ do
4:     $T^{h,K}_t \leftarrow \mathrm{Train\_tree}(K, F, (x_s, c^h_s)_{s=1}^{t-1})$.
5: Let $(\hat{h}, \hat{K}) \leftarrow \arg\min_{(h,K) \in \mathcal{J}} \left[ \frac{1}{T} \sum_{t=1}^{T} c^h_t(T^{h,K}_t(x_t)) + \mathrm{Pen}(h, K) \right]$, where
$\mathrm{Pen}(h, K) = \sqrt{ \frac{1}{T} \sum_{t=1}^{T} c^h_t(T^{h,K}_t(x_t)) \cdot \frac{64 \ln \frac{4T|\mathcal{J}|}{\delta}}{T\, p_{\min}\, h} } + \frac{64 \ln \frac{4T|\mathcal{J}|}{\delta}}{T\, p_{\min}\, h}$.
6: return $\hat{\pi}$ drawn uniformly at random from the set $\{ T^{\hat{h},\hat{K}}_{t,\hat{h}} : t \in [T] \}$.

The algorithm receives as input an interaction log $\{(x_t, a_t, P_t(a_t \mid x_t), \ell_t(a_t))\}_{t=1}^{T}$, collected by another algorithm such that $P_t(a \mid x_t) \ge p_{\min}$ for all $a \in A$; a collection $\mathcal{J}$ of (bandwidth, discretization level) pairs; and a base policy class $F$. It consists of two stages: tree training and policy selection. In the tree training stage (lines 1 to 4), for each $((h, K), t)$ combination in $\mathcal{J} \times [T]$, the algorithm again calls Train_tree over the cost-sensitive examples $(x_s, c^h_s)_{s=1}^{t-1}$ induced by the interaction log and the bandwidth $h$.
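Line 2 of Algorithm 3 simply re-applies the IPS cost construction of Algorithm 1 for each bandwidth and discretization level in the grid. A naive Python sketch follows (variable names are ours, and the log format is the same illustrative list of tuples used earlier); a real implementation would exploit the fact, noted above, that each cost vector is piecewise constant with at most two discontinuities rather than loop over all $K$ entries.

```python
import numpy as np

def smoothed_ips_costs(log, h, K):
    """Line 2 of Algorithm 3 (naive O(K)-per-round sketch): T x K matrix with
    entry [t, i] = Smooth_h(a_t | i/K) / P_t(a_t | x_t) * loss_t.
    log: list of (x, a, p, loss) with p = P_t(a_t | x_t) >= p_min."""
    C = np.zeros((len(log), K))
    for t, (x, a, p, loss) in enumerate(log):
        for i in range(K):
            center = i / K
            lo, hi = max(0.0, center - h), min(1.0, center + h)
            density = 1.0 / (hi - lo) if lo <= a <= hi else 0.0   # Smooth_h(a | i/K)
            C[t, i] = density / p * loss
    return C

# Illustrative use over the grid of (h, K) pairs that CATS_Off searches:
# costs = {(h, K): smoothed_ips_costs(log, h, K) for (h, K) in J}
```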
As a result, we obtain a set of tree policies $\{ T^{h,K}_t : t \in [T],\ (h, K) \in \mathcal{J} \}$. In the policy selection stage (line 5), we choose a pair $(\hat{h}, \hat{K})$ from the set $\mathcal{J}$ using structural risk minimization [55], by trading off $\frac{1}{T} \sum_{t=1}^{T} c^h_t(T^{h,K}_t(x_t))$, the progressive validation loss estimate of the smoothed policies $\{ T^{h,K}_{t,h} : t \in [T] \}$ on the logged data [15], against its deviation bound $\mathrm{Pen}(h, K)$, which depends on $h$ and $K$. A similar procedure has been proposed in the discrete-action contextual bandit learning setting [52]. As we see from Theorem 7 below, the obtained tree policy $\hat{\pi}$ has expected loss competitive with all policies in the set $\bigcup_{(h,K) \in \mathcal{J}} \{\pi_h : \pi \in F_K\}$.

4 Performance guarantees

In this section, we show that CATS and CATS_Off achieve sublinear regret or excess loss guarantees under a realizability assumption on the (context, loss) distribution $D$. We defer the formal statements of our theorems and their proofs to Appendix B. As learning decision trees is computationally hard in general [28], many existing positive results pose strong assumptions on the learning model, such as uniform or product unlabeled distributions [13, 16], separability [23, 48, 14], or allowing membership queries [38, 26]. Our tree policy training guarantee under the following realizability assumption is complementary to these works:

Definition 5. A hypothesis class $F$ and data distribution $D$ are said to be $(h, K)$-realizable if there exists a tree policy $T$ in $F_K$ such that the following holds: for every internal node $v$ in $T$, there exists a classifier $f^{v,\star}$ in $F$ such that
$f^{v,\star}(x) = \text{left} \implies \min_{a \in \mathrm{range}(T_l)} \mathbb{E}[\ell_h(a) \mid x] \le \min_{a \in \mathrm{range}(T_r)} \mathbb{E}[\ell_h(a) \mid x]$,
$f^{v,\star}(x) = \text{right} \implies \min_{a \in \mathrm{range}(T_r)} \mathbb{E}[\ell_h(a) \mid x] \le \min_{a \in \mathrm{range}(T_l)} \mathbb{E}[\ell_h(a) \mid x]$,
where $l = v.\text{left}$ and $r = v.\text{right}$ are $v$'s two children; recall that $\ell_h(a) \triangleq \mathbb{E}_{a' \sim \mathrm{Smooth}_h(\cdot \mid a)}[\ell(a')]$.

Intuitively, the above realizability assumption states that our base class $F$ is expressive enough that, for every discretization parameter $K \in 2^{\mathbb{N}}$, there exists a set of $(K-1)$ classifiers $\{f^{v,\star}\}_{v \in T} \subseteq F$ occupying the internal nodes of a tree $T$ with $K$ leaves such that $T$ routes any context $x$ to its Bayes-optimal discretized action in $A_K$, formally $\arg\min_{a \in A_K} \mathbb{E}[\ell_h(a) \mid x]$. As $\ell_h(\cdot)$ is $\frac{1}{h}$-Lipschitz, $\min_{a \in A_K} \mathbb{E}[\ell_h(a) \mid x] - \min_{a \in A} \mathbb{E}[\ell_h(a) \mid x] \le \frac{1}{hK}$. This implies that, if $K$ is large enough, the Bayes-optimal policy $\pi^\star(x) = \arg\min_{a \in A} \mathbb{E}[\ell_h(a) \mid x]$ can be well-approximated by a tree policy in $F_K$ with little excess loss. Under the above realizability assumption, we now present a theorem that characterizes the regret guarantee when Algorithm 1 uses the policy class $F_K$.

Theorem 6 (Informal). Given $h$, suppose $(F, D)$ is $(h, K)$-realizable for any $K \in 2^{\mathbb{N}}$. Then, with appropriate settings of the greedy parameter ϵ and discretization scale $K$, with high probability, Algorithm 1 run with inputs $\epsilon, h, K, F$ has regret bounded as
$\mathrm{Reg}(T, F_\infty, h) \le O\!\left( \left( T^4 \ln|F| / h^3 \right)^{1/5} \right)$.

We remark that we actually obtain a stronger result: with appropriate tuning of ϵ, the $h$-smoothed regret of CATS against $F_K$ is $O\!\left( \left( K^2 T^2 \ln\frac{|F|}{\delta} / h \right)^{1/3} \right)$, which is similar to the $O\!\left( T^{2/3} |A|^{1/3} \right)$ regret for ϵ-greedy in the discrete-action setting, with $1/h$ serving as the effective number of actions. We also note that if we used exact ERM over $F_K$ instead of the computationally efficient Train_tree procedure, the dependence on $K$ would improve from $K^{2/3}$ to $K^{1/3}$. This $K^{2/3}$ dependence is due to compounding errors accumulating at each node, and we conjecture that it is the price we have to pay for using the computationally efficient Train_tree for approximate ERM.
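To see how the informal $T^{4/5}$ rate of Theorem 6 arises from the stronger $F_K$-bound above, one can balance the $K$-dependent regret against the $\frac{1}{hK}$-per-round discretization bias. The following is only a back-of-the-envelope sketch with constants, $\delta$, and the tuning of ϵ suppressed; the formal argument is in Appendix B.

\[
\underbrace{\Big(\tfrac{K^2 T^2 \ln|F|}{h}\Big)^{1/3}}_{h\text{-smoothed regret vs.\ } F_K}
\;+\;
\underbrace{\tfrac{T}{hK}}_{\text{discretization bias vs.\ } F_\infty}
\quad\text{is balanced by}\quad
K \asymp \Big(\tfrac{T}{h^2 \ln|F|}\Big)^{1/5},
\]
\[
\text{at which point both terms are of order }\;
\Big(\tfrac{T^4 \ln|F|}{h^3}\Big)^{1/5}.
\]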
The aforementioned $h$-smoothed regret bound against $F_K$ reflects a natural bias-variance tradeoff in the choice of $h$ and $K$: for a smaller value of $h$, the $h$-smoothed loss more closely approximates the true loss, while achieving a low $h$-smoothed regret bound is harder. A similar reasoning applies to $K$: for larger $K$, $F_K$ more closely approximates $F_\infty$, while the regret of CATS against $F_K$ can be higher. We now present learning guarantees for CATS_Off under the same realizability assumption.

Theorem 7 (Informal). Suppose $(F, D)$ is $(h, K)$-realizable for all $(h, K) \in \mathcal{J}$. In addition, suppose the logged data $S = \{(x_t, a_t, P_t(a_t \mid x_t), \ell_t(a_t))\}_{t=1}^{T}$ has a sufficient amount of exploration: $P_t(a \mid x_t) \ge p_{\min}$. Then, with high probability, Algorithm 3 run with inputs $S, p_{\min}, \mathcal{J}, F$ outputs a policy $\hat{\pi}$ such that
$\lambda_0(\hat{\pi}) \le \min_{(h,K) \in \mathcal{J},\, \pi \in F_K} \lambda_h(\pi) + O\!\left( K \sqrt{ \frac{\ln \frac{|F|\, T\, |\mathcal{J}|}{\delta}}{p_{\min}\, h\, T} } \right)$.

The above theorem shows the adaptivity of CATS_Off: so long as the logged data is generated by a sufficiently explorative logging policy, its learned policy is competitive with any policy in the set $\bigcup_{(h,K) \in \mathcal{J}} \{\pi_h : \pi \in F_K\}$, under realizability assumptions.

[Figure 1: (left) Best progressive validation losses obtained by parameter search for different online learning algorithms (CATS, dLinear, dTree) on six regression datasets (Wis, Friday, Price, Cpu, Ds, Zurich); y-axis: absolute loss. (right) Test-set absolute losses for the initial online-trained model using CATS with initial discretization and smoothing parameters $(K_{\mathrm{init}}, h_{\mathrm{init}}) = (4, 1/4)$, and for the off-policy optimized models output by CATS_Off. All confidence intervals are calculated from a single run using the Clopper-Pearson interval at the 95% confidence level (note that they are very small for most of the datasets).]

5 Experiments

Following the contextual bandit learning evaluation protocol of [12], we evaluate our approach on six large-scale regression datasets, where regression predictions are treated as continuous actions in $A = [0, 1]$. To simulate contextual bandit learning, we first perform scaling and offsetting to ensure the labels $y_t$ are also in $[0, 1]$. Every regression example $(x_t, y_t)$ is converted to $(x_t, \ell_t)$, where $\ell_t(a) = |a - y_t|$ is the absolute loss induced by $y_t$. When action $a_t$ is taken, the algorithm receives the bandit feedback $\ell_t(a_t)$, as opposed to the usual label $y_t$. Of the six datasets, five are selected from OpenML with the criterion of having millions of samples with unique regression values (see Appendix F for more details). We also include a synthetic dataset, ds, created by a linear regression model with additive Gaussian noise.

Online contextual bandit learning using CATS. We compare CATS with two baselines that perform ϵ-greedy contextual bandit learning [39] over the discretized action space $A_K$. The first baseline, dLinear, reduces policy training to cost-sensitive one-versus-all multiclass classification [11], which takes $O(K)$ time per example. The second baseline, dTree, uses the filter tree algorithm [10] as a cost-sensitive multiclass learner for policy training, which takes $O(\log K)$ time per example but does not perform information sharing among actions through smoothing. We run CATS with $(h, K)$ combinations in the following set:
$\mathcal{J} = \left\{ (h, K) : h \in \{2^{-13}, \ldots, 2^{-1}\},\ K \in \{2^2, \ldots, 2^{13}\},\ hK \in \{2^0, \ldots, 2^{11}\} \right\}$.  (1)
We also run dLinear and dTree with values of $K$ in $\{2^1, 2^2, \ldots, 2^{13}\}$. All algorithms use $\epsilon = 0.05$; see Appendix F for additional experimental details.
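The conversion of a regression dataset into a bandit simulation described above is easy to make concrete. The Python sketch below (hypothetical names; not the paper's harness) plays a fixed policy with ϵ-greedy smoothed exploration and returns the average absolute loss; it omits the online learning update.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_smoothed(action, h):
    """One draw from Smooth_h(action): uniform on [action - h, action + h], clipped to [0, 1]."""
    lo, hi = max(0.0, action - h), min(1.0, action + h)
    return rng.uniform(lo, hi)

def simulate(dataset, policy, h, eps=0.05):
    """Bandit simulation in the style of Section 5: only the absolute loss |a_t - y_t|
    of the chosen action is revealed, never the label y_t.
    dataset: list of (x, y) pairs with y rescaled to [0, 1].
    policy(x): the current greedy action in [0, 1] (e.g., a tree policy's prediction)."""
    total_loss = 0.0
    for x, y in dataset:
        if rng.random() < eps:                  # explore uniformly over A = [0, 1]
            a = rng.uniform(0.0, 1.0)
        else:                                   # exploit the h-smoothed greedy action
            a = sample_smoothed(policy(x), h)
        loss = abs(a - y)                       # bandit feedback l_t(a_t)
        total_loss += loss
        # A learning run would also record P_t(a | x) = (1 - eps) * pi_{t,h}(a | x) + eps,
        # form the IPS costs, and update the tree policy online.
    return total_loss / max(1, len(dataset))
```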
In the left panel of Figure 1 we compare CATS with dLinear and dTree. Using progressive validation [15] for online evaluation, our algorithm (with optimally tuned discretization and bandwidth) achieves performance similar to dLinear, and is better than dTree for most of the datasets. As discussed in Section 3, the time cost of our implementation of CATS is $O(\log K)$ per example. Figure 2 demonstrates that the training time of CATS is constant w.r.t. the bandwidth $h$ and grows logarithmically w.r.t. the discretization scale $K$. This shows that CATS has the same computational complexity as dTree. In contrast, dLinear has $O(K)$ time complexity per example. The time improvement of CATS compared with dLinear becomes more significant as $K$ becomes larger. In summary, CATS outperforms dTree statistically, and has much better scalability than dLinear.

[Figure 2: Online learning time costs of CATS (blue bars): (left) vs. bandwidth $h$ with a fixed discretization scale $K = 2^{13}$; (middle) vs. discretization scale $1/K$ with a fixed $h = 1/4$; (right) vs. discretization scale $1/K$ with a fixed $h = 1/4$, compared against dLinear (orange bars) and dTree (green bars), on the ds dataset. Axes: bandwidth $h$ or discretization scale $1/K$ versus training time (s). Similar figures for the rest of the datasets can be found in Appendix H.]

Off-policy optimization using CATS_Off. A major advantage of the CATS approach over naïve discretization methods is that the interaction log collected by our algorithm with one setting of $(h, K)$ can be used to optimize policies with alternate settings of $(h, K)$. To validate this, we first create an 80-20% split of training and test sets. With the training set, we first collect interaction log tuples $(x_t, a_t, P_t(a_t \mid x_t), \ell_t(a_t))$ using CATS with initial discretization and smoothing parameters $(K_{\mathrm{init}}, h_{\mathrm{init}}) = (4, 1/4)$ and greedy parameter $\epsilon = 0.05$. We then run CATS_Off over the logged data using $\mathcal{J}$, defined in (1), as the set of parameters. Since standard generalization error bounds are loose in practice, we replaced the $64 \ln \frac{2T|\mathcal{J}|}{\delta}$ factor in the penalty term in line 5 with the constant 1. Note that this constant, as well as the learning rate and the greedy parameter, is fixed for all of the datasets in our experiments. The right panel of Figure 1 shows the test losses of the models obtained by CATS after making a pass over the training data, and the test losses of the optimized models obtained through CATS_Off by optimizing counterfactual estimates offline. It can be seen that offline policy training produces tree policies that have dramatically smaller test losses than the original policies.

6 Conclusion

Contextual bandit learning with continuous actions with unknown structure is quite tractable via the CATS algorithm, as we have shown theoretically and empirically. This broadly enables deployment of contextual bandit approaches across a wide range of new applications.

Broader Impact

Our study of efficient contextual bandits with continuous actions can be applied to a wide range of applications, such as precision medicine, personalized recommendations, data center optimization, operating systems, networking, etc. Many of these applications have potential for significant positive impact on society, but these methods can also cause unintended harms, for example by creating filter-bubble effects when deployed in recommendation engines.
More generally, our research belongs to the paradigm of interactive machine learning, which must always be used with care due to the presence of feedback loops. We are certainly mindful of these issues, and encourage practitioners to consider these consequences when deploying interactive learning systems.

Acknowledgments and Disclosure of Funding

We thank the anonymous reviewers for their helpful feedback. Much of this work was done while Maryam Majzoubi and Chicheng Zhang were visiting Microsoft Research NYC. This work was supported by Microsoft.

References

[1] Naoki Abe, Alan W Biermann, and Philip M Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.
[2] Alekh Agarwal, Sarah Bird, Markus Cozowicz, Luong Hoang, John Langford, Stephen Lee, Jiaji Li, Dan Melamed, Gal Oshri, Oswaldo Ribas, Siddhartha Sen, and Alex Slivkins. Making contextual decisions with low technical debt. arXiv:1606.03966, 2017.
[3] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, 2014.
[4] Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, 2017.
[5] Rajeev Agrawal. The continuum-armed bandit problem. SIAM Journal on Control and Optimization, 1995.
[6] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
[7] Peter Auer, Ronald Ortner, and Csaba Szepesvári. Improved rates for the stochastic continuum-armed bandit problem. In Conference on Learning Theory, 2007.
[8] Peter L Bartlett, Varsha Dani, Thomas Hayes, Sham Kakade, Alexander Rakhlin, and Ambuj Tewari. High-probability regret bounds for bandit online linear optimization. 2008.
[9] Alina Beygelzimer and John Langford. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 129–138, 2009.
[10] Alina Beygelzimer, John Langford, and Pradeep Ravikumar. Error-correcting tournaments. In International Conference on Algorithmic Learning Theory, pages 247–262. Springer, 2009.
[11] Alina Beygelzimer, John Langford, and Bianca Zadrozny. Weighted one-against-all. In Proceedings of the 20th International Conference on Machine Learning, 2005.
[12] Alberto Bietti, Alekh Agarwal, and John Langford. A contextual bandit bake-off. arXiv preprint arXiv:1802.04064, 2018.
[13] Guy Blanc, Jane Lange, and Li-Yang Tan. Top-down induction of decision trees: rigorous guarantees and inherent limitations. arXiv preprint arXiv:1911.07375, 2019.
[14] Avrim Blum. Rank-r decision trees are a subclass of r-decision lists. Information Processing Letters, 42(4):183–185, 1992.
[15] Avrim Blum, Adam Kalai, and John Langford. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, COLT, pages 203–208, 1999.
[16] Alon Brutzkus, Amit Daniely, and Eran Malach. On the optimality of trees generated by ID3. arXiv preprint arXiv:1907.05444, 2019.
[17] Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 2012.
[18] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. X-armed bandits. Journal of Machine Learning Research, 2011.
[19] Nicolò Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, and Sébastien Gerchinovitz. Algorithmic chaining and the role of partial feedback in online nonparametric learning. In Conference on Learning Theory, 2017.
[20] Guanhua Chen, Donglin Zeng, and Michael R Kosorok. Personalized dose finding using outcome weighted learning. Journal of the American Statistical Association, 111(516):1509–1521, 2016.
[21] Miroslav Dudík, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. In Uncertainty in Artificial Intelligence, 2011.
[22] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104, 2011.
[23] Andrzej Ehrenfeucht and David Haussler. Learning decision trees from random examples. Information and Computation, 82(3):231–246, 1989.
[24] David A Freedman. On tail probabilities for martingales. The Annals of Probability, pages 100–118, 1975.
[25] Claudio Gentile, Shuai Li, Purushottam Kar, Alexandros Karatzoglou, Giovanni Zappella, and Evans Etrue. On context-dependent clustering of bandits. In Proceedings of the 34th International Conference on Machine Learning, 2017.
[26] Parikshit Gopalan, Adam Tauman Kalai, and Adam R Klivans. Agnostically learning decision trees. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 527–536, 2008.
[27] Jean-Bastien Grill, Michal Valko, and Rémi Munos. Black-box optimization of noisy functions with unknown smoothness. In Advances in Neural Information Processing Systems, 2015.
[28] Thomas Hancock, Tao Jiang, Ming Li, and John Tromp. Lower bounds on learning decision lists and trees. Information and Computation, 126(2):114–122, 1996.
[29] Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
[30] Nathan Jay, Noga Rotman, Brighten Godfrey, Michael Schapira, and Aviv Tamar. A deep reinforcement learning perspective on internet congestion control. In International Conference on Machine Learning, pages 3050–3059, 2019.
[31] Nathan Kallus and Angela Zhou. Policy evaluation and optimization with continuous treatments. In International Conference on Artificial Intelligence and Statistics, pages 1243–1251, 2018.
[32] TE Klein, RB Altman, Niclas Eriksson, BF Gage, SE Kimmel, MT Lee, NA Limdi, D Page, DM Roden, MJ Wagner, et al. Estimation of the warfarin dose with clinical and pharmacogenetic data. New England Journal of Medicine, 360(8):753–764, 2009.
[33] Robert Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, 2004.
[34] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Symposium on Theory of Computing, 2008.
[35] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Bandits and experts in metric spaces. Journal of the ACM, 2019. To appear. Merged and revised version of conference papers in ACM STOC 2008 and ACM-SIAM SODA 2010. Also available at http://arxiv.org/abs/1312.1277.
[36] Andreas Krause and Cheng S. Ong. Contextual gaussian process bandit optimization. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F.
Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2447–2455. Curran Associates, Inc., 2011.
[37] Akshay Krishnamurthy, John Langford, Aleksandrs Slivkins, and Chicheng Zhang. Contextual bandits with continuous actions: smoothing, zooming, and adapting. In Conference on Learning Theory, 2019.
[38] Eyal Kushilevitz and Yishay Mansour. Learning decision trees using the Fourier spectrum. SIAM Journal on Computing, 22(6):1331–1348, 1993.
[39] John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems, 2007.
[40] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Preprint, 2018.
[41] Nevena Lazic, Craig Boutilier, Tyler Lu, Eehern Wong, Binz Roy, MK Ryu, and Greg Imwalle. Data center cooling using model-predictive control. In Advances in Neural Information Processing Systems, pages 3814–3823, 2018.
[42] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.
[43] Shuai Li, Wei Chen, Shuai Li, and Kwong-Sak Leung. Improved algorithm on online clustering of bandits. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 2923–2929. AAAI Press, 2019.
[44] Kanak Mahadik, Qingyun Wu, Shuai Li, and Amit Sabne. Fast distributed bandits for online recommendation systems. In Proceedings of the 34th ACM International Conference on Supercomputing, pages 1–13, 2020.
[45] Stanislav Minsker. Estimation of extreme values and associated level sets of a regression function via selective sampling. In Conference on Learning Theory, 2013.
[46] Francesco Orabona and Dávid Pál. Coin betting and parameter-free online learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 577–585, 2016.
[47] Alexander Rakhlin and Karthik Sridharan. Bistro: An efficient relaxation-based method for contextual bandits. In ICML, pages 1977–1985, 2016.
[48] Ronald L Rivest. Learning decision lists. Machine Learning, 2(3):229–246, 1987.
[49] Aleksandrs Slivkins. Multi-armed bandits on implicit metric spaces. In Advances in Neural Information Processing Systems, 2011.
[50] Aleksandrs Slivkins. Contextual bandits with similarity information. The Journal of Machine Learning Research, 2014.
[51] Aleksandrs Slivkins. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1–286, 2019.
[52] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015.
[53] Vasilis Syrgkanis, Akshay Krishnamurthy, and Robert Schapire. Efficient algorithms for adversarial contextual learning. In International Conference on Machine Learning, pages 2159–2168, 2016.
[54] Ambuj Tewari and Susan A Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517. Springer, 2017.
[55] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 1995.
[56] Tianyu Wang, Weicheng Ye, Dawei Geng, and Cynthia Rudin. Towards practical Lipschitz stochastic bandits. arXiv preprint arXiv:1901.09277, 2019.