# Bandits for BMO Functions

Tianyu Wang, Cynthia Rudin

Department of Computer Science, Duke University, Durham, NC, USA. Correspondence to: Tianyu Wang.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

**Abstract.** We study the bandit problem where the underlying expected reward is a Bounded Mean Oscillation (BMO) function. BMO functions are allowed to be discontinuous and unbounded, and are useful in modeling signals with infinities in the domain. We develop a toolset for BMO bandits and provide an algorithm that achieves poly-log δ-regret: regret measured against an arm that is optimal after removing a δ-sized portion of the arm space.

## 1. Introduction

Multi-Armed Bandit (MAB) problems model sequential decision making under uncertainty. Algorithms for this problem have important real-world applications, including medical trials (Robbins, 1952) and web recommender systems (Li et al., 2010). While bandit methods have been developed for various settings, one problem setting that has not been studied, to the best of our knowledge, is when the expected reward function is a Bounded Mean Oscillation (BMO) function in a metric measure space. Intuitively, a BMO function does not deviate too much from its mean over any ball, and can be discontinuous or unbounded. Such unbounded functions can model many real-world quantities.

Consider the situation in which we are optimizing the parameters of a process (e.g., a physical or biological system) whose behavior can be simulated. The simulator is computationally expensive to run, which is why we cannot exhaustively search the (continuous) parameter space for the optimal parameters. The reward of the system is sensitive to parameter values and can increase very quickly as the parameters change. In this case, by failing to model the infinities, even state-of-the-art continuum-armed bandit methods fail to compute valid confidence bounds, potentially leading to under-exploration of the important part of the parameter space, and they may completely miss the optima. As another example, when we try to determine failure modes of a system or simulation, we might try to locate singularities in the variance of its outputs; these are cases where the variance of the outputs becomes extremely large. In this case, we can use a bandit algorithm for BMO functions to efficiently find where the system is most unstable.

There are several difficulties in handling BMO rewards. First and foremost, due to unboundedness of the expected reward function, traditional regret metrics are doomed to fail. To handle this, we define a new performance measure, called δ-regret. The δ-regret measures regret against an arm that is optimal after removing a δ-sized portion of the arm space. Under this performance measure, and because the reward is a BMO function, our attention is restricted to a subspace on which the expected reward is finite. Subsequently, strategies that conform to the δ-regret are needed.

To develop a strategy that handles δ-regret, we leverage the John-Nirenberg inequality, which plays a crucial role in harmonic analysis. We construct our arm index using the John-Nirenberg inequality, in addition to a traditional UCB index. In each round, we play an arm with the highest index. As we play more and more arms, we focus our attention on regions that contain good arms.
To do this, we discretize the arm space adaptively, and carefully control how the index evolves with the discretization. We provide two algorithms: Bandit-BMO-P and Bandit-BMO-Z. They discretize the arm space in different ways. In Bandit-BMO-P, we keep a strict partitioning of the arm space. In Bandit-BMO-Z, we keep a collection of cubes in which a subset of the cubes forms a discretization. Bandit-BMO-Z achieves poly-log δ-regret with high probability.

## 2. Related Works

Bandit problems in different settings have been actively studied since as far back as Thompson (1933). Upper confidence bound (UCB) algorithms remain popular (Robbins, 1952; Lai and Robbins, 1985; Auer, 2002) among the many approaches for (stochastic) bandit problems (e.g., see Srinivas et al., 2010; Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2012; Bubeck and Slivkins, 2012; Seldin and Slivkins, 2014). Various extensions of upper confidence bound algorithms have been studied. Some works use KL-divergence to construct the confidence bound (Lai and Robbins, 1985; Garivier and Cappé, 2011; Maillard et al., 2011), and some works include variance estimates within the confidence bound (Audibert et al., 2009; Auer and Ortner, 2010). UCB is also used in the contextual setting (e.g., Li et al., 2010; Krause and Ong, 2011; Slivkins, 2014).

Perhaps Lipschitz bandits are closest to BMO bandits. The Lipschitz bandit problem was termed continuum-armed bandits in its early stages (Agrawal, 1995). In continuum-armed bandits, the arm space is continuous, e.g., [0, 1]. Along this line, bandits whose rewards are Lipschitz continuous (or Hölder continuous) have been studied. In particular, Kleinberg (2005) proves an $\Omega(T^{2/3})$ lower bound and proposes an $\widetilde{O}(T^{2/3})$ algorithm. Under other extra conditions on top of Lipschitzness, a regret rate of $\widetilde{O}(T^{1/2})$ was achieved (Cope, 2009; Auer et al., 2007). For general (doubling) metric spaces, the Zooming bandit algorithm (Kleinberg et al., 2008) and the Hierarchical Optimistic Optimization algorithm (Bubeck et al., 2011a) were developed. In more recent years, some attention has been given to Lipschitz bandit problems with certain extra conditions. To name a few, Bubeck et al. (2011b) study Lipschitz bandits for differentiable rewards, which enables algorithms to run without explicitly knowing the Lipschitz constants. The idea of robust mean estimators (Bubeck et al., 2013; Bickel et al., 1965; Alon et al., 1999) was applied to the Lipschitz bandit problem to cope with heavy-tail rewards, leading to the development of a near-optimal algorithm (Lu et al., 2019). Lipschitz bandits with an unknown metric, where a clustering is used to infer the underlying unknown metric, have been studied by Wanigasekara and Yu (2019). Lipschitz bandits with discontinuous but bounded rewards were studied by Krishnamurthy et al. (2019).

An important setting that is beyond the scope of the aforementioned works is when the expected reward is allowed to be unbounded. This setting breaks the previous Lipschitzness assumption or almost-Lipschitzness assumption (Krishnamurthy et al., 2019), which may allow discontinuities but requires boundedness. To the best of our knowledge, this paper is the first work that studies the bandit learning problem for BMO functions.

## 3. Preliminaries

We review the concept of (rectangular) Bounded Mean Oscillation (BMO) in Euclidean space (e.g., Fefferman, 1979; Stein and Murphy, 1993).

**Definition 1** (BMO Functions). Let $(\mathbb{R}^d, \mu)$ be the Euclidean space with the Lebesgue measure.
Let $L^1_{\mathrm{loc}}(\mathbb{R}^d, \mu)$ denote the space of measurable functions (on $\mathbb{R}^d$) that are locally integrable with respect to $\mu$. A function $f \in L^1_{\mathrm{loc}}(\mathbb{R}^d, \mu)$ is said to be a Bounded Mean Oscillation function, $f \in BMO(\mathbb{R}^d, \mu)$, if there exists a constant $C_f$ such that for any hyper-rectangle $Q \subset \mathbb{R}^d$,

$$\frac{1}{\mu(Q)} \int_Q |f - f_Q| \, d\mu \le C_f, \qquad f_Q := \frac{1}{\mu(Q)} \int_Q f \, d\mu. \quad (1)$$

For a given such function $f$, the infimum of the admissible constants $C_f$ is denoted by $\|f\|_{BMO}$, or simply $\|f\|$. We use $\|f\|_{BMO}$ and $\|f\|$ interchangeably in this paper.

A BMO function can be discontinuous and unbounded. The function in Figure 1 illustrates the singularities a BMO function can have over its domain. Our problem is most interesting when multiple singularities of this kind occur. To properly handle the singularities, we will need the John-Nirenberg inequality (Theorem 1), which plays a central role in our paper.

**Theorem 1** (John-Nirenberg inequality). Let $\mu$ be the Lebesgue measure and let $f \in BMO(\mathbb{R}^d, \mu)$. Then there exist constants $C_1$ and $C_2$ such that, for any hypercube $q \subset \mathbb{R}^d$ and any $\lambda > 0$,

$$\mu\left(\left\{x \in q : |f(x) - f_q| > \lambda\right\}\right) \le C_1\, \mu(q)\, \exp\left(-\frac{\lambda}{C_2 \|f\|}\right). \quad (2)$$

The John-Nirenberg inequality dates back to at least John (1961), and a proof is provided in Appendix C. As shown in Appendix C, $C_1 = e$ and $C_2 = e2^d$ provide a pair of legitimate $C_1, C_2$ values. However, this pair of values may be overly conservative. Tight values of $C_1$ and $C_2$ are not known in general (Lerner, 2013; Slavin and Vasyunin, 2017), and it is also conjectured that $C_1$ and $C_2$ might be independent of the dimension (Cwikel et al., 2012). For the rest of the paper, we use $\|f\| = 1$, $C_1 = 1$, and $C_2 = 1$, which permits cleaner proofs. Our results generalize to cases where $C_1$, $C_2$, and $\|f\|$ take other constant values.

In this paper, we will work in Euclidean space with the Lebesgue measure. For our purpose, Euclidean space is as general as doubling spaces, since we can always embed a doubling space into a Euclidean space with some distortion of the metric. This fact is formally stated in Theorem 2.

**Theorem 2** (Assouad, 1983). Let $(X, d)$ be a doubling metric space and $\varsigma \in (0, 1)$. Then $(X, d^{\varsigma})$ admits a bi-Lipschitz embedding into $\mathbb{R}^n$ for some $n \in \mathbb{N}$.

In a doubling space, any ball of radius $\rho$ can be covered by $M_d$ balls of radius $\rho/2$, where $M_d$ is the doubling constant. In the space $(\mathbb{R}^d, \|\cdot\|_\infty)$, the doubling constant $M_d$ is $2^d$. In domains of other geometries, the doubling constant can be much smaller than exponential. Throughout the rest of the paper, we use $M_d$ to denote the doubling constant.

## 4. Problem Setting: BMO Bandits

The goal of a stochastic bandit algorithm is to exploit the current information and explore the space efficiently. In this paper, we focus on the following setting: a payoff function is defined over the arm space $([0, 1)^d, \|\cdot\|_\infty, \mu)$, where $\mu$ is the Lebesgue measure (note that $[0, 1)^d$ is a Lipschitz domain). The payoff function is

$$f : [0, 1)^d \to \mathbb{R}, \quad \text{where } f \in BMO([0, 1)^d, \mu). \quad (3)$$

The actual observations are given by $y(a) = f(a) + E_a$, where $E_a$ is a zero-mean noise random variable whose distribution can change with $a$. We assume that for all $a$, $|E_a| \le D_E$ almost surely for some constant $D_E$ (N1). Our results generalize to the setting with sub-Gaussian noise (Shamir, 2011). We also assume that the expected reward function does not depend on the noise.

In our setting, an agent interacts with this environment in the following fashion. At each round $t$, based on past observations $(a_1, y_1, \ldots, a_{t-1}, y_{t-1})$, the agent makes a query at a point $a_t$ and observes the (noisy) payoff $y_t$, where $y_t$ is revealed only after the agent has made the decision $a_t$.
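To make the setting concrete, below is a minimal sketch (our own illustration, not the authors' code) of such an environment: an unbounded BMO expected reward on $[0,1)^d$ with bounded zero-mean noise. The particular reward $f(a) = -\log\|a\|_\infty$, the noise bound, the dimension, and the random seed are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of the setting above: an unbounded BMO expected reward on
# [0, 1)^d with bounded, zero-mean noise. The specific f, noise bound D_E,
# and dimension are arbitrary illustrative choices.
d, D_E = 2, 0.5
f = lambda a: -np.log(np.maximum(np.max(np.abs(a), axis=-1), 1e-300))  # -log ||a||_inf

def pull(a):
    """One bandit interaction: play arm a, observe y(a) = f(a) + E_a with |E_a| <= D_E."""
    return f(np.asarray(a)) + rng.uniform(-D_E, D_E)

print(pull([0.3, 0.7]))        # ordinary arm: reward near -log 0.7
print(pull([1e-9, 1e-9]))      # arm near the singularity: huge observed reward

# f is unbounded, but its mean oscillation over the unit cube stays small:
A = rng.random((1_000_000, d))
vals = f(A)
print(vals.max(), np.mean(np.abs(vals - vals.mean())))   # large max, modest oscillation
```

Even though single observations near the singularity can be arbitrarily large, the average of $f$ over the whole cube and its mean oscillation remain finite, which is the structure the John-Nirenberg inequality exploits.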
For a payoff function $f$ and an arm sequence $a_1, a_2, \ldots, a_T$, we use the δ-regret incurred up to time $T$ as the performance measure (Definition 2).

**Definition 2** (δ-regret). Let $f \in BMO([0, 1)^d, \mu)$. A number $\delta \ge 0$ is called $f$-admissible if there exists a real number $z_0$ that satisfies

$$\mu\left(\{a \in [0, 1)^d : f(a) > z_0\}\right) = \delta. \quad (4)$$

For an $f$-admissible $\delta$, define the set $F^\delta$ to be

$$F^\delta := \left\{z \in \mathbb{R} : \mu\left(\{a \in [0, 1)^d : f(a) > z\}\right) = \delta\right\}. \quad (5)$$

Define $f^\delta := \inf F^\delta$. For a sequence of arms $A_1, A_2, \ldots$ and σ-algebras $\mathcal{F}_1, \mathcal{F}_2, \ldots$, where $\mathcal{F}_t$ describes all randomness before arm $A_t$, define the δ-regret at time $t$ as

$$r_t^\delta := \max\left\{0,\ f^\delta - \mathbb{E}_t[f(A_t)]\right\}, \quad (6)$$

where $\mathbb{E}_t$ is the expectation conditioned on $\mathcal{F}_t$. The total δ-regret up to time $T$ is then $R_T^\delta := \sum_{t=1}^T r_t^\delta$.

Intuitively, the δ-regret is measured against an amended reward function that is created by chopping off a small portion of the arm space where the reward may become unbounded. As an example, Figure 1 plots a BMO function and its $f^\delta$ value. A problem defined as above, with performance measured by δ-regret, is called a BMO bandit problem.

Figure 1. Graph of $f(x) = -\log(|x|)$, with $\delta$ and $f^\delta$ annotated. This function is an unbounded BMO function.

**Remark 1.** The definition of δ-regret, or a definition of this kind, is needed for a reward function $f \in BMO([0, 1)^d, \mu)$. For an unbounded BMO function $f$, the max value is infinity, while $f^\delta$ is a finite number as long as $\delta$ is $f$-admissible.

**Remark 2** (Connection to bandits with heavy tails). In the definition of bandits with heavy tails (Bubeck et al., 2013; Medina and Yang, 2016; Shao et al., 2018; Lu et al., 2019), the reward distribution at a fixed arm is heavy-tailed, having a bounded expectation and a bounded $(1+\beta)$-moment ($\beta \in (0, 1]$). In the case of BMO rewards, the expected reward itself can be unbounded. Figure 1 gives an instance of an unbounded BMO reward, which means the BMO bandit problem is not covered by the setting of bandits with heavy tails.

A quick consequence of the definition of δ-regret is the following lemma. This lemma is used in the regret analysis when handling the concentration around good arms.

**Lemma 1.** Let $f$ be the reward function. For any $f$-admissible $\delta \ge 0$, let $S^\delta := \{a \in [0, 1)^d : f(a) > f^\delta\}$. Then $S^\delta$ is measurable and $\mu(S^\delta) = \delta$.

Before moving on to the algorithms, we put forward the following assumption.

**Assumption 1.** We assume that the expected reward function $f \in BMO([0, 1)^d, \mu)$ satisfies $f_{[0,1)^d} = 0$.

Assumption 1 does not sacrifice generality. Since $f$ is a BMO function, it is locally integrable. Thus $f_{[0,1)^d}$ is finite, and we can translate the reward function up or down so that $f_{[0,1)^d} = 0$.

## 5. Solve BMO Bandits via Partitioning

BMO bandit problems can be solved by partitioning the arm space and treating the problem as a finite-arm problem among the partitions. For our purpose, we maintain a sequence of partitions using dyadic cubes. By dyadic cubes of $\mathbb{R}^d$, we refer to the collection of all cubes of the form

$$\mathcal{Q}_{\mathbb{R}^d} := \left\{ \prod_{i=1}^d \left[ m_i 2^{-k},\ m_i 2^{-k} + 2^{-k} \right) \ :\ m_1, \ldots, m_d, k \in \mathbb{Z} \right\}, \quad (7)$$

where $\prod$ is the Cartesian product. The dyadic cubes of $[0, 1)^d$ are $\mathcal{Q}_{[0,1)^d} := \{q \in \mathcal{Q}_{\mathbb{R}^d} : q \subseteq [0, 1)^d\}$. For example, dyadic cubes of $[0, 1)^2$ include $[0, 1)^2$, $[0, 0.5)^2$, $[0.5, 1)^2$, $[0.5, 1) \times [0, 0.5)$, and so on. We say a dyadic cube $Q$ is a direct sub-cube of a dyadic cube $Q'$ if $Q \subset Q'$ and the edge length of $Q'$ is twice the edge length of $Q$. By the definition of the doubling constant, any cube $Q$ has $M_d$ direct sub-cubes, and these direct sub-cubes form a partition of $Q$. If $Q$ is a direct sub-cube of $Q'$, then $Q'$ is a direct super-cube of $Q$.
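As a concrete illustration of this dyadic-cube machinery (again our own sketch, with a made-up representation rather than anything from the paper), the following represents a dyadic cube of $[0,1)^d$ by its depth $k$ and integer corner indices, and enumerates its $M_d = 2^d$ direct sub-cubes.

```python
import itertools
from dataclasses import dataclass

@dataclass(frozen=True)
class DyadicCube:
    """Dyadic cube prod_i [m_i * 2^-k, (m_i + 1) * 2^-k) inside [0, 1)^d."""
    k: int            # depth: edge length is 2^-k
    m: tuple          # integer corner indices, 0 <= m_i < 2^k

    @property
    def edge(self):
        return 2.0 ** (-self.k)

    def measure(self):
        return self.edge ** len(self.m)          # Lebesgue measure mu(q)

    def contains(self, point):
        return all(mi * self.edge <= x < (mi + 1) * self.edge
                   for mi, x in zip(self.m, point))

    def direct_subcubes(self):
        """The M_d = 2^d direct sub-cubes, which partition this cube."""
        return [DyadicCube(self.k + 1, tuple(2 * mi + b for mi, b in zip(self.m, bits)))
                for bits in itertools.product((0, 1), repeat=len(self.m))]

root = DyadicCube(k=0, m=(0, 0))                        # the cube [0, 1)^2
children = root.direct_subcubes()
print(len(children), [c.measure() for c in children])   # 4 sub-cubes of measure 0.25
print(children[3].contains((0.75, 0.6)))                # True: the point lies in [0.5,1) x [0.5,1)
```

Representing cubes by integer coordinates keeps splitting and membership tests exact, which is convenient when the partition is refined many times.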
At each step $t$, Bandit-BMO-P treats the problem as a finite-arm bandit problem with respect to the cubes in the dyadic partition at $t$; each cube possesses a confidence bound. The algorithm then chooses a best cube according to UCB, and chooses an arm uniformly at random within the chosen cube.

Before formulating our strategy, we put forward several functions that summarize cube statistics. Let $\mathcal{Q}_t$ be the collection of dyadic cubes of $[0, 1)^d$ at time $t$ ($t \ge 1$). Let $(a_1, y_1, a_2, y_2, \ldots, a_t, y_t)$ be the observations received up to time $t$. We define the cube count $n_t : \mathcal{Q}_t \to \mathbb{R}$ such that for $q \in \mathcal{Q}_t$,

$$n_t(q) := \sum_{i=1}^{t-1} \mathbb{I}[a_i \in q]; \qquad \tilde{n}_t(q) := \max\left(1, n_t(q)\right), \quad (8)$$

and the cube average $m_t : \mathcal{Q}_t \to \mathbb{R}$ such that for $q \in \mathcal{Q}_t$,

$$m_t(q) := \begin{cases} \dfrac{\sum_{i=1}^{t-1} y_i\, \mathbb{I}[a_i \in q]}{n_t(q)}, & \text{if } n_t(q) > 0; \\ 0, & \text{otherwise.} \end{cases} \quad (9)$$

At time $t$, based on the partition $\mathcal{Q}_{t-1}$ and the observations $(a_1, y_1, a_2, y_2, \ldots, a_{t-1}, y_{t-1})$, our bandit algorithm picks a cube (and plays an arm within the cube uniformly at random). More specifically, the algorithm picks

$$Q_t \in \arg\max_{q \in \mathcal{Q}_{t-1}} U_t(q), \quad \text{where} \quad (10)$$

$$U_t(q) := m_t(q) + H_t(q) + J(q), \quad (11)$$

$$H_t(q) := (\Psi + D_E)\sqrt{\frac{2\log(2T^2/\epsilon)}{\tilde{n}_t(q)}}, \qquad J(q) := \left[\log\left(\mu(q)/\eta\right)\right]_+, \qquad \Psi := \max\left\{\log(T^2/\epsilon),\ 2\log^2(1/\eta)\right\}, \quad (12)$$

where $T$ is the time horizon, $D_E$ is the a.s. bound on the noise, $\epsilon$ and $\eta$ are algorithm parameters (to be discussed in more detail later), and $[z]_+ = \max\{0, z\}$. Here $\Psi$ is the Effective Bound of the expected reward, and $\eta$ controls the minimal cube size in the partition $\mathcal{Q}_t$ (Proposition 3 in Appendix A.4). All these quantities will be discussed in more detail as we develop our algorithm. After playing an arm and observing the reward, we update the partition into a finer one if needed. Next, we discuss our partition refinement rule and the tie-breaking mechanism.

**Partition refinement:** We start with $\mathcal{Q}_0 = \{[0, 1)^d\}$. At time $t$, we split cubes in $\mathcal{Q}_{t-1}$ to construct $\mathcal{Q}_t$ so that the following is satisfied for any $q \in \mathcal{Q}_t$:

$$H_t(q) \ge J(q), \quad \text{or equivalently} \quad (\Psi + D_E)\sqrt{\frac{2\log(2T^2/\epsilon)}{\tilde{n}_t(q)}} \ge \left[\log\left(\mu(q)/\eta\right)\right]_+. \quad (13)$$

In (13), the left-hand side does not decrease as we make splits (the numerator remains constant while the denominator can only decrease), while the right-hand side decreases until it hits zero as we make more splits. Thus (13) can always be satisfied with additional splits.

**Tie-breaking:** We break down our tie-breaking mechanism into two steps. In the first step, we choose a cube $Q_t \in \mathcal{Q}_{t-1}$ such that

$$Q_t \in \arg\max_{q \in \mathcal{Q}_{t-1}} U_t(q). \quad (14)$$

After deciding from which cube to choose an arm, we play an arm $A_t$ uniformly at random within the cube $Q_t$. If the measure $\mu$ is non-uniform, we play the arm $A_t$ so that for any subset $S \subseteq Q_t$, $\mathbb{P}(A_t \in S) = \mu(S)/\mu(Q_t)$. The random variables $\{(Q_{t'}, A_{t'}, Y_{t'})\}_{t' \le t}$ (cube selection, arm selection, reward) describe all randomness in the learning process up to time $t$.

We summarize this strategy in Algorithm 1. Analysis of Algorithm 1 is found in Section 5.1, which also provides some tools for handling δ-regret. Then in Section 6, we provide an improved algorithm that exhibits a stronger performance guarantee.
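Before turning to the pseudocode and the analysis, here is a minimal sketch (our own, not the authors' implementation; the function names and the toy example are ours) of the cube index $U_t(q)$ from (11)-(12) and the refinement test (13).

```python
import numpy as np

def effective_bound(T, eps, eta):
    """Psi from (12): max(log(T^2/eps), 2*log^2(1/eta))."""
    return max(np.log(T**2 / eps), 2 * np.log(1.0 / eta) ** 2)

def cube_index(rewards_in_cube, cube_measure, T, eps, eta, D_E):
    """U_t(q) = m_t(q) + H_t(q) + J(q), following (8), (9), (11), (12).

    rewards_in_cube: rewards of past arms that landed in cube q.
    cube_measure:    Lebesgue measure mu(q) of the dyadic cube q.
    """
    n = len(rewards_in_cube)
    n_tilde = max(1, n)                                    # \tilde{n}_t(q)
    m = float(np.mean(rewards_in_cube)) if n > 0 else 0.0  # m_t(q)
    psi = effective_bound(T, eps, eta)
    H = (psi + D_E) * np.sqrt(2 * np.log(2 * T**2 / eps) / n_tilde)  # H_t(q)
    J = max(0.0, np.log(cube_measure / eta))                         # J(q)
    return m + H + J, H, J

def violates_rule_13(rewards_in_cube, cube_measure, T, eps, eta, D_E):
    """Rule (13) keeps H_t(q) >= J(q); a cube with H_t(q) < J(q) must be split."""
    _, H, J = cube_index(rewards_in_cube, cube_measure, T, eps, eta, D_E)
    return H < J

# Example: a cube of measure 1 that has collected many samples eventually
# violates (13) and gets split.
print(violates_rule_13(rewards_in_cube=[0.3] * 10_000, cube_measure=1.0,
                       T=10_000, eps=0.01, eta=0.001, D_E=1.0))   # True
```

The refinement test fires once a cube has collected enough observations relative to its size, which is what drives the partition to refine around frequently played regions.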
**Algorithm 1** Bandit-BMO-Partition (Bandit-BMO-P)
1: Problem intrinsics: $\mu(\cdot)$, $D_E$, $d$, $M_d$.
2: {$\mu(\cdot)$ is the Lebesgue measure. $D_E$ bounds the noise.}
3: {$d$ is the dimension of the arm space.}
4: {$M_d$ is the doubling constant of the arm space.}
5: Algorithm parameters: $\eta > 0$, $\epsilon > 0$, $T$.
6: {$T$ is the time horizon. $\epsilon$ and $\eta$ are parameters.}
7: **for** $t = 1, 2, \ldots, T$ **do**
8: Let $m_t$ and $n_t$ be defined as in (9) and (8).
9: Select a cube $Q_t \in \mathcal{Q}_{t-1}$ such that $Q_t \in \arg\max_{q \in \mathcal{Q}_{t-1}} U_t(q)$, where $U_t$ is defined in (11).
10: Play arm $A_t \in Q_t$ uniformly at random. Observe $Y_t$.
11: Update the partition $\mathcal{Q}_{t-1}$ to $\mathcal{Q}_t$ according to (13).
12: **end for**

### 5.1. Regret Analysis of Bandit-BMO-P

In this section we provide a theoretical guarantee for the algorithm. We will use capital letters (e.g., $Q_t, A_t, Y_t$) to denote random variables, and lower-case letters (e.g., $a, q$) to denote non-random quantities, unless otherwise stated.

**Theorem 3.** Fix any $T$. With probability at least $1 - 2\epsilon$, for any $\delta > |\mathcal{Q}_T|\eta$ such that $\delta$ is $f$-admissible, the total δ-regret for Algorithm 1 up to time $T$ satisfies

$$\sum_{t=1}^{T} r_t^\delta \le_d \widetilde{O}\left(\sqrt{T\,|\mathcal{Q}_T|}\right), \quad (15)$$

where the $\le_d$ sign omits constants that depend on $d$, and $|\mathcal{Q}_T|$ is the cardinality of $\mathcal{Q}_T$.

From uniform tie-breaking, we have

$$\mathbb{E}[f(A_t)\,|\,\mathcal{F}_t] = \frac{1}{\mu(Q_t)} \int_{a \in Q_t} f(a)\, da = f_{Q_t}, \quad (16)$$

$$\mathcal{F}_t = \sigma\left(Q_1, A_1, Y_1, \ldots, Q_{t-1}, A_{t-1}, Y_{t-1}, Q_t\right), \quad (17)$$

where $\mathcal{F}_t$ is the σ-algebra generated by the random variables $Q_1, A_1, Y_1, \ldots, Q_{t-1}, A_{t-1}, Y_{t-1}, Q_t$, that is, all randomness right after selecting the cube $Q_t$. At time $t$, the expected reward is the mean function value of the selected cube.

The proof of the theorem is divided into two parts. In Part I, we show that a good event holds with high probability. In Part II, we bound the δ-regret under the good event.

**Part I:** For $t \le T$ and $q \in \mathcal{Q}_t$, we define

$$E_t(q) := \left\{ \left| f_q - m_t(q) \right| \le H_t(q) \right\}, \quad (18)$$

$$H_t(q) = (\Psi + D_E)\sqrt{\frac{2\log(2T^2/\epsilon)}{\tilde{n}_t(q)}}. \quad (19)$$

In the above, $E_t(q)$ is essentially saying that the empirical mean within a cube $q$ concentrates to $f_q$. Lemma 2 shows that $E_t(q)$ happens with high probability for any $t$ and $q$.

**Lemma 2.** With probability at least $1 - \epsilon/T$, the event $E_t(q)$ holds for any $q \in \mathcal{Q}_t$ at any time $t$.

To prove Lemma 2, we apply a variation of Azuma's inequality (Vu, 2002; Tao and Vu, 2015). We also need some additional effort to handle the case when a cube $q$ contains no observations. The details are in Appendix A.3.

**Part II:** Next, we link the δ-regret to the $J(q)$ term.

**Lemma 3.** Recall $J(q) = \log(\mu(q)/\eta)$. For any partition $\mathcal{Q}$ of $[0, 1)^d$, there exists $q^* \in \mathcal{Q}$ such that

$$f^\delta - f_{q^*} \le J(q^*), \quad (20)$$

for any $f$-admissible $\delta > \eta|\mathcal{Q}|$, where $|\mathcal{Q}|$ is the cardinality of $\mathcal{Q}$.

In the proof of Lemma 3, we suppose, in order to get a contradiction, that there is no such cube. Under this assumption, there is a contradiction with the definition of $f^\delta$.

By Lemma 3, there exists a good cube $\tilde{q}_t$ (at any time $t \le T$) such that (20) is true for $\tilde{q}_t$. Let $\delta$ be an arbitrary number satisfying (1) $\delta > |\mathcal{Q}_T|\eta$ and (2) $\delta$ is $f$-admissible. Then under the event $E_t(\tilde{q}_t)$,

$$f^\delta = \left(f^\delta - f_{\tilde{q}_t}\right) + \left(f_{\tilde{q}_t} - m_t(\tilde{q}_t)\right) + m_t(\tilde{q}_t) \overset{1}{\le} J(\tilde{q}_t) + H_t(\tilde{q}_t) + m_t(\tilde{q}_t), \quad (21)$$

where step 1 uses Lemma 3 for the first bracket and Lemma 2 (with event $E_t(\tilde{q}_t)$) for the second bracket.

The event where all good cubes and all cubes we select (for $t \le T$) have nice estimates, namely $\bigcap_{t=1}^{T} E_t(\tilde{q}_t) \cap \bigcap_{t=1}^{T} E_t(Q_t)$, occurs with probability at least $1 - 2\epsilon$. This result comes from Lemma 2 and a union bound, and we note that $E_t(q)$ depends on $\epsilon$ (and $T$), as in (19). Under this event, from (18) we have $f_{Q_t} \ge m_t(Q_t) - H_t(Q_t)$. This and (16) give us

$$\mathbb{E}[f(A_t)\,|\,\mathcal{F}_t] = f_{Q_t} \ge m_t(Q_t) - H_t(Q_t). \quad (22)$$

We can then use the above to get, under the good event,

$$f^\delta - \mathbb{E}[f(A_t)\,|\,\mathcal{F}_t] \overset{1}{\le} \left(m_t(\tilde{q}_t) + H_t(\tilde{q}_t) + J(\tilde{q}_t)\right) - \left(m_t(Q_t) - H_t(Q_t)\right) \overset{2}{\le} \left(m_t(Q_t) + H_t(Q_t) + J(Q_t)\right) - \left(m_t(Q_t) - H_t(Q_t)\right) = 2H_t(Q_t) + J(Q_t) \le 3H_t(Q_t), \quad (23)$$

where step 1 uses (21) for the first bracket and (22) for the second bracket, step 2 uses $U_t(Q_t) \ge U_t(\tilde{q}_t)$ since $Q_t$ maximizes the index $U_t(\cdot)$ according to (14), and the last inequality uses the rule (13).

Next, we use Lemma 4 to link the number of cubes up to a time $t$ to the Hoeffding-type tail bound in (23).
Intuitively, this bound (Lemma 4) states that the number of points within the cubes grows fast enough for the sum to be bounded by a function of the number of cubes.

**Lemma 4.** We say a partition $\mathcal{Q}'$ is finer than a partition $\mathcal{Q}$ if for any $q' \in \mathcal{Q}'$, there exists $q \in \mathcal{Q}$ such that $q' \subseteq q$. Consider an arbitrary sequence of points $x_1, x_2, \ldots, x_t, \ldots$ in a space $X$, and a sequence of partitions $\mathcal{Q}_1, \mathcal{Q}_2, \ldots$ of $X$ such that $\mathcal{Q}_{t+1}$ is finer than $\mathcal{Q}_t$ for all $t = 1, 2, \ldots, T-1$. Then for any $T$ and $\{q_t \in \mathcal{Q}_t\}_{t=1}^{T}$,

$$\sum_{t=1}^{T} \frac{1}{\tilde{n}_t(q_t)} \le e\,|\mathcal{Q}_T| \log\left(1 + (e-1)\frac{T}{|\mathcal{Q}_T|}\right), \quad (24)$$

where $\tilde{n}_t$ is defined in (8) (using the points $x_1, x_2, \ldots$), and $|\mathcal{Q}_t|$ is the cardinality of the partition $\mathcal{Q}_t$.

A proof of Lemma 4 is in Appendix B. We can apply Lemma 4 and the Cauchy-Schwarz inequality to (23) to prove Theorem 3. The details can be found in Appendix A.7.

## 6. Achieve Poly-log Regret via Zooming

In this section we study an improved version of the algorithm from the previous section, which uses the Zooming machinery (Kleinberg et al., 2008; Slivkins, 2014) and takes inspiration from Bubeck et al. (2011a). Similar to Algorithm 1, this algorithm runs by maintaining a collection of dyadic cubes $\mathcal{Q}_t$. In this setting, we divide the time horizon into episodes. In each episode $t$, we are allowed to play multiple arms, and all arms played can incur regret. This is also a UCB strategy, and the index of $q \in \mathcal{Q}_t$ is defined in the same way as (11):

$$U_t(q) := m_t(q) + H_t(q) + J(q). \quad (25)$$

Before we discuss in more detail how to select cubes and arms based on the index $U_t(\cdot)$, we first describe how we maintain the collection of cubes. Let $\mathcal{Q}_t$ be the collection of dyadic cubes at episode $t$. We first define terminal cubes, which are cubes that do not have sub-cubes in $\mathcal{Q}_t$. More formally, a cube $Q \in \mathcal{Q}_t$ is a terminal cube if there is no other cube $Q' \in \mathcal{Q}_t$ such that $Q' \subset Q$. A pre-parent cube is a cube in $\mathcal{Q}_t$ that directly contains a terminal cube: for a cube $Q \in \mathcal{Q}_t$, if $Q$ is a direct super-cube of any terminal cube, we say $Q$ is a pre-parent cube. Finally, for a cube $Q \in \mathcal{Q}_t$, if $Q$ is a pre-parent cube and no super-cube of $Q$ is a pre-parent cube, we call $Q$ a parent cube. Intuitively, no sibling cube of a parent cube is a terminal cube. As a consequence of this definition, a parent cube cannot contain another parent cube. Note that some cubes are none of these three types. Figure 2 gives examples of terminal cubes, pre-parent cubes, and parent cubes.

**Algorithm description.** Pick a zooming rate $\alpha \in \left(0,\ (\Psi + D_E)\sqrt{2\log(2T^2/\epsilon)}\,/\log(M_d/\eta)\right)$. The collection of cubes grows following the rules below.

(1) Initialize $\mathcal{Q}_0 = \{[0, 1)^d\}$; the root cube $[0, 1)^d$ is initially the only cube. Warm-up: play $n_{warm}$ arms uniformly at random from $[0, 1)^d$, where $n_{warm}$ satisfies

$$(\Psi + D_E)\sqrt{\frac{2\log(2T^2/\epsilon)}{n_{warm}}} \ge \alpha \log\left(\frac{M_d}{\eta}\right) > (\Psi + D_E)\sqrt{\frac{2\log(2T^2/\epsilon)}{n_{warm} + 1}}. \quad (26)$$

(2) After episode $t$ ($t = 1, 2, \ldots, T$), ensure that

$$(\Psi + D_E)\sqrt{\frac{2\log(2T^2/\epsilon)}{\tilde{n}_t(Q_{ter})}} \ge \alpha \log\left(\frac{M_d\, \mu(Q_{ter})}{\eta}\right) \quad (27)$$

for any terminal cube $Q_{ter}$. If (27) is violated for a terminal cube $Q_{ter}$, we include the $M_d$ direct sub-cubes of $Q_{ter}$ into $\mathcal{Q}_t$. Then $Q_{ter}$ is no longer a terminal cube, and the direct sub-cubes of $Q_{ter}$ become terminal cubes. We repeatedly include direct sub-cubes of (what were) terminal cubes into $\mathcal{Q}_t$, until all terminal cubes satisfy (27). We choose $\alpha$ to be smaller than $(\Psi + D_E)\sqrt{2\log(2T^2/\epsilon)}/\log(M_d/\eta)$ so that (27) can be satisfied with $\tilde{n}_t(Q_{ter}) = 1$ and $\mu(Q_{ter}) = 1$. As a consequence, any non-terminal cube $Q_{par}$ (regardless of whether it is a pre-parent or parent cube) satisfies

$$(\Psi + D_E)\sqrt{\frac{2\log(2T^2/\epsilon)}{\tilde{n}_t(Q_{par})}} < \alpha \log\left(\frac{M_d\, \mu(Q_{par})}{\eta}\right). \quad (28)$$

Figure 2. Example of terminal cubes, pre-parent cubes, and parent cubes.
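To illustrate the three cube types, the following sketch (our own; the tuple representation of cubes is a convenience, not the paper's notation) classifies a collection of dyadic cubes into terminal, pre-parent, and parent cubes exactly as defined above.

```python
def subcube_of(q, Q):
    """True if dyadic cube q = (k, m) is strictly contained in Q = (K, M)."""
    (k, m), (K, M) = q, Q
    return k > K and all(mi >> (k - K) == Mi for mi, Mi in zip(m, M))

def classify(cubes):
    """Split a collection of dyadic cubes (depth, corner-index tuple) into
    terminal, pre-parent, and parent cubes, as defined in Section 6."""
    cubes = set(cubes)
    terminal = {q for q in cubes if not any(subcube_of(p, q) for p in cubes)}
    def direct_super(q):
        k, m = q
        return (k - 1, tuple(mi // 2 for mi in m))
    pre_parent = {Q for Q in cubes if any(direct_super(q) == Q for q in terminal)}
    parent = {Q for Q in pre_parent
              if not any(subcube_of(Q, P) for P in pre_parent)}
    return terminal, pre_parent, parent

# Example: the root is split, all four of its children are split, and the
# child [0, 0.25)^2 of [0, 0.5)^2 is split once more.
cubes  = [(0, (0, 0))]                                          # root [0,1)^2
cubes += [(1, (i, j)) for i in range(2) for j in range(2)]      # 4 cubes at depth 1
cubes += [(2, (i, j)) for i in range(4) for j in range(4)]      # 16 cubes at depth 2
cubes += [(3, (i, j)) for i in range(2) for j in range(2)]      # children of (2, (0, 0))
terminal, pre_parent, parent = classify(cubes)
print(len(terminal), len(pre_parent), len(parent))   # 19 terminal, 5 pre-parent, 4 parent
print(sorted(parent))   # the four depth-1 cubes: they partition [0,1)^2
```

In this example the four depth-1 cubes are the parent cubes, and they indeed partition $[0,1)^2$, consistent with Proposition 1 later in this section.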
After the splitting rule is achieved, we select a parent cube. Specifically, $Q_t$ is chosen to maximize the following index:

$$Q_t \in \arg\max_{q \in \mathcal{Q}_t,\ q \text{ is a parent cube}} U_t(q).$$

Within each direct sub-cube of $Q_t$ (either a pre-parent or a terminal cube), we uniformly randomly play one arm. In each episode $t$, $M_d$ arms are played. This algorithm is summarized in Algorithm 2.

**Algorithm 2** Bandit-BMO-Zooming (Bandit-BMO-Z)
1: Problem intrinsics: $\mu(\cdot)$, $D_E$, $d$, $M_d$.
2: {$\mu(\cdot)$, $D_E$, $d$, $M_d$ are the same as those in Algorithm 1.}
3: Algorithm parameters: $\eta, \epsilon, T > 0$, and $\alpha \in \left(0,\ (\Psi + D_E)\sqrt{2\log(2T^2/\epsilon)}/\log(M_d/\eta)\right)$.
4: {$\eta$, $\epsilon$, $T$ are the same as those in Algorithm 1. $\alpha$ is the zooming rate.}
5: Initialize: let $\mathcal{Q}_0 = \{[0, 1)^d\}$. Play the warm-up phase (26).
6: **for** episode $t = 1, 2, \ldots, T$ **do**
7: Let $m_t$, $n_t$, $U_t$ be defined as in (9), (8), and (25).
8: Select a parent cube $Q_t \in \mathcal{Q}_t$ such that $Q_t \in \arg\max_{q \in \mathcal{Q}_t,\ q \text{ is a parent cube}} U_t(q)$.
9: **for** $j = 1, 2, \ldots, M_d$ **do**
10: Locate the $j$-th direct sub-cube of $Q_t$: $Q^{sub}_j$.
11: Play $A_{t,j} \in Q^{sub}_j$ uniformly at random, and observe $Y_{t,j}$.
12: **end for**
13: Update the collection of dyadic cubes $\mathcal{Q}_t$ to $\mathcal{Q}_{t+1}$ according to (27).
14: **end for**

**Regret analysis:** For the rest of the paper, we define

$$\mathcal{F}_t := \sigma\left(\left\{Q_{t'},\ \{A_{t',j}\}_{j=1}^{M_d},\ \{Y_{t',j}\}_{j=1}^{M_d}\right\}_{t' < t},\ Q_t\right),$$

which is the σ-algebra describing all randomness right after selecting the parent cube for episode $t$. We use $\mathbb{E}_t$ to denote the expectation conditioned on $\mathcal{F}_t$. We will show that Algorithm 2 achieves $\widetilde{O}(\mathrm{poly\text{-}log}(T))$ δ-regret with high probability (formally stated in Theorem 4).

Let $A_{t,i}$ be the $i$-th arm played in episode $t$. Let us denote $\Delta^\delta_{t,i} := f^\delta - \mathbb{E}_t[f(A_{t,i})]$. Since each $A_{t,i}$ is selected uniformly at random within a direct sub-cube of $Q_t$, we have

$$\sum_{i=1}^{M_d} \mathbb{E}_t[f(A_{t,i})] = M_d\, f_{Q_t}, \quad (29)$$

where $\mathbb{E}_t$ is the expectation conditioned on all randomness before episode $t$. Using the above equation, for any $t$,

$$\sum_{i=1}^{M_d} \Delta^\delta_{t,i} = M_d\left(f^\delta - f_{Q_t}\right). \quad (30)$$

The quantity $\sum_{i=1}^{M_d} \Delta^\delta_{t,i}$ is the δ-regret incurred during episode $t$. We will bound (30) using tools from Section 5. In order to apply Lemma 3, we need to show that the parent cubes form a partition of the arm space (Proposition 1).

**Proposition 1.** At any episode $t$, the collection of parent cubes forms a partition of the arm space.

Since the parent cubes in $\mathcal{Q}_t$ form a partition of the arm space, we can apply Lemma 3 to get the following. For any episode $t$, there exists a parent cube $q^{\max}_t$ such that

$$f^\delta \le f_{q^{\max}_t} + \log\left(\mu(q^{\max}_t)/\eta\right). \quad (31)$$

Let us define $\widetilde{E}_T := \bigcap_{t=1}^{T} E_t(q^{\max}_t) \cap \bigcap_{t=1}^{T} E_t(Q_t)$, where $E_t(q^{\max}_t)$ and $E_t(Q_t)$ are defined in (18). By Lemma 2 and another union bound, we know that the event $\widetilde{E}_T$ happens with probability at least $1 - 2\epsilon$. Since each episode creates at most a constant number of new cubes, we have $|\mathcal{Q}_t| = O(t)$. Using the argument we used for (23), we have that at any $t \le T$, for any $\delta > \eta|\mathcal{Q}_t|$ that is $f$-admissible, under the event $\widetilde{E}_T$,

$$\sum_{i=1}^{M_d} \Delta^\delta_{t,i} = M_d\left(f^\delta - f_{Q_t}\right) \le M_d\left[2(\Psi + D_E)\sqrt{\frac{2\log(2T^2/\epsilon)}{\tilde{n}_t(Q_t)}} + \log\left(\frac{M_d\,\mu(Q_t)}{\eta}\right)\right] \quad (32)$$

$$\le M_d(1 + 2\alpha)\log\left(\frac{M_d\,\mu(Q_t)}{\eta}\right), \quad (33)$$

where (32) uses (30) and the last inequality uses (28).

Next, we extend some definitions from Kleinberg et al. (2008) to handle the δ-regret setting. First, we define the set of $(\lambda, \delta)$-optimal arms as

$$X_\delta(\lambda) := \bigcup \left\{ Q \subseteq [0, 1)^d : f^\delta - f_Q \le \lambda \right\}. \quad (34)$$

We also need to extend the definition of the zooming number (Kleinberg et al., 2008) to our setting. We denote by $N_\delta(\lambda, \xi)$ the number of cubes of edge length $\xi$ needed to cover the set $X_\delta(\lambda)$. Then we define the $(\delta, \eta)$-Zooming Number with zooming rate $\alpha$ as

$$\widetilde{N}_{\delta,\eta,\alpha} := \sup_{\lambda \in \left(\eta^{1/d},\, 1\right]} N_\delta\left((1 + 2\alpha)\log\left(M_d\lambda^d/\eta\right),\ \lambda\right), \quad (35)$$

where $N_\delta\left((1 + 2\alpha)\log\left(M_d\lambda^d/\eta\right),\ \lambda\right)$ is the number of cubes of edge length $\lambda$ needed to cover $X_\delta\left((1 + 2\alpha)\log\left(M_d\lambda^d/\eta\right)\right)$. The number $\widetilde{N}_{\delta,\eta,\alpha}$ is well-defined: $X_\delta\left((1 + 2\alpha)\log\left(M_d\lambda^d/\eta\right)\right)$ is a subspace of $[0, 1)^d$, and the number of cubes of edge length greater than $\eta^{1/d}$ needed to cover $[0, 1)^d$ is finite.
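The zooming number is an instance-dependent quantity; the following rough sketch (our own, with arbitrary parameter values) illustrates it for the one-dimensional reward $f(x) = -2\log x$. It computes $f^\delta$ from the measure of super-level sets and counts grid intervals of edge length $\lambda$ whose interval average is within $(1+2\alpha)\log(M_d\lambda/\eta)$ of $f^\delta$. This grid count is only a proxy for the covering number $N_\delta(\cdot, \lambda)$, not an exact computation.

```python
import numpy as np

# Rough 1-D illustration of f^delta, X_delta(.), and the covering count N_delta
# for f(x) = -2 log x on (0, 1). All parameter values are arbitrary choices.
f = lambda x: -2.0 * np.log(x)

def f_delta(delta):
    """f^delta for this f: {x : f(x) > z} = (0, exp(-z/2)) has measure exp(-z/2),
    so the level whose super-level set has measure delta is z = -2 log(delta)."""
    return -2.0 * np.log(delta)

def near_optimal_interval_count(delta, lam, eta, alpha, M_d=2):
    """Count intervals of edge length `lam` whose average f_Q is within
    (1 + 2*alpha) * log(M_d * lam / eta) of f^delta. This is only a proxy for
    N_delta((1 + 2*alpha) log(M_d lam^d / eta), lam), the number of cubes of
    edge length lam needed to cover X_delta(.)."""
    fd, tol = f_delta(delta), (1 + 2 * alpha) * np.log(M_d * lam / eta)
    count = 0
    for i in range(int(round(1.0 / lam))):
        a = i * lam
        x = np.linspace(a, a + lam, 1000, endpoint=False) + lam / 2000.0  # midpoints
        if fd - f(x).mean() <= tol:
            count += 1
    return count

for lam in [2**-4, 2**-6, 2**-8]:
    c = near_optimal_interval_count(delta=0.01, lam=lam, eta=1e-3, alpha=0.2)
    print(f"edge length {lam:.4f}: near-optimal intervals = {c}")
```

For this reward the count grows only slowly as $\lambda$ shrinks, since the near-optimal region concentrates around the singularity at $0$; this is the kind of behavior $\widetilde{N}_{\delta,\eta,\alpha}$ is designed to capture.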
Intuitively, the idea of zooming is to use smaller cubes to cover the more nearly optimal arms, and vice versa. BMO properties convert between units of the reward function and units in the arm space. We will regroup the $\Delta^\delta_{t,i}$ terms to bound the regret. To do this, we need the following facts, whose proofs are in Appendix A.9.

**Proposition 2.** Following the Zooming Rule (27), we have:
1. Each parent cube of measure $\mu$ is played for at most $\dfrac{2(\Psi + D_E)^2 \log(2T^2/\epsilon)}{\alpha^2 \left[\log(\mu/\eta)\right]^2}$ episodes.
2. Under the event $\widetilde{E}_T$, each parent cube $Q_t$ selected at episode $t$ is a subset of $X_\delta\left((1 + 2\alpha)\log\left(M_d\,\mu(Q_t)/\eta\right)\right)$.

For cleaner writing, we set $\eta = 2^{-dI}$ for some positive integer $I$, and assume that the event $\widetilde{E}_T$ holds. By Proposition 2, we can regroup the regret in a similar way to that of Kleinberg et al. (2008). Let $K_i$ be the collection of selected parent cubes such that for any $Q \in K_i$, $\mu(Q) = 2^{-di}$ (dyadic cubes are always of these sizes). The sets $K_i$ regroup the selected parent cubes by their size. By Proposition 2 (item 2), we know each parent cube in $K_i$ is a subset of $X_\delta\left((1 + 2\alpha)\log\left(M_d 2^{-di}/\eta\right)\right)$. Since cubes in $K_i$ are subsets of $X_\delta\left((1 + 2\alpha)\log\left(M_d 2^{-di}/\eta\right)\right)$ and cubes in $K_i$ are of measure $2^{-di}$, we have

$$|K_i| \le N_\delta\left((1 + 2\alpha)\log\left(M_d 2^{-di}/\eta\right),\ 2^{-i}\right), \quad (36)$$

where $|K_i|$ is the number of cubes in $K_i$. For a cube $Q$, let $S_Q$ be the set of episodes in which $Q$ is played. With probability at least $1 - 2\epsilon$, we can regroup the regret as

$$\sum_{t=1}^{T} \sum_{i=1}^{M_d} \Delta^\delta_{t,i} \le \sum_{t=1}^{T} (1 + 2\alpha) M_d \log\left(M_d\,\mu(Q_t)/\eta\right) \quad (37)$$

$$= \sum_{i} \sum_{Q \in K_i} \sum_{t \in S_Q} (1 + 2\alpha) M_d \log\left(M_d 2^{-di}/\eta\right), \quad (38)$$

where (37) uses (33), and (38) regroups the sum as argued above. Using Proposition 2, we can bound (38) by

$$\sum_{i} \sum_{Q \in K_i} \sum_{t \in S_Q} (1 + 2\alpha) M_d \log\left(M_d 2^{-di}/\eta\right) = \sum_{i} \sum_{Q \in K_i} |S_Q|\,(1 + 2\alpha) M_d \log\left(M_d 2^{-di}/\eta\right)$$

$$\overset{1}{\le} \sum_{i} \sum_{Q \in K_i} \frac{2(\Psi + D_E)^2 \log(2T^2/\epsilon)}{\alpha^2 \left[\log(2^{-di}/\eta)\right]^2}\,(1 + 2\alpha) M_d \log\left(M_d 2^{-di}/\eta\right) \quad (39)$$

$$\le \sum_{i \ge 0} N_\delta\left((1 + 2\alpha)\log\left(M_d 2^{-di}/\eta\right),\ 2^{-i}\right) \frac{2(\Psi + D_E)^2 \log(2T^2/\epsilon)}{\alpha^2 \left[\log(2^{-di}/\eta)\right]^2}\,(1 + 2\alpha) M_d \log\left(M_d 2^{-di}/\eta\right) \quad (40)$$

$$\le \frac{2(1 + 2\alpha) M_d (\Psi + D_E)^2}{\alpha^2}\, \widetilde{N}_{\delta,\eta,\alpha}\, \log(2T^2/\epsilon) \sum_{i} \frac{\log\left(M_d 2^{-di}/\eta\right)}{\left[\log(2^{-di}/\eta)\right]^2},$$

where step 1 uses item 1 of Proposition 2, and (40) uses (36). Recall that $\eta = 2^{-dI}$ for some positive integer $I$. We can use the above to prove Theorem 4, using $\eta = 2^{-dI}$ and

$$\frac{\log\left(M_d 2^{-di}/\eta\right)}{\left[\log(2^{-di}/\eta)\right]^2} = \frac{\log M_d + d(I - i)\log 2}{d^2 (\log 2)^2 (I - i)^2} = \frac{\log M_d}{d^2 (\log 2)^2 (I - i)^2} + \frac{1}{d(\log 2)(I - i)}, \quad (41)$$

so that summing over $i$ gives $O(1) + O(\log I) = O(\log\log(1/\eta))$, where the first term in (41) contributes $O(1)$ since $\sum_{i \ge 1} 1/i^2 = O(1)$, and the second term contributes $O(\log I)$ by the order of a harmonic sum. The above analysis gives Theorem 4.

**Theorem 4.** Choose a positive integer $I$ and let $\eta = 2^{-Id}$. For $\epsilon > 0$ and $t \le T$, with probability at least $1 - 2\epsilon$, for any $\delta > |\mathcal{Q}_t|\eta$ such that $\delta$ is $f$-admissible, Algorithm 2 (with zooming rate $\alpha$) admits a $t$-episode δ-regret of

$$O\left(\frac{M_d\, \Psi^2}{\alpha^2}\, \widetilde{N}_{\delta,\eta,\alpha}\, \log T\, \log\log(1/\eta)\right), \quad (42)$$

where $\Psi = O\left(\log(T/\epsilon) + \log^2(1/\eta)\right)$, $\widetilde{N}_{\delta,\eta,\alpha}$ is defined in (35), and $O$ omits constants.

Since each episode plays $M_d$ arms, the average δ-regret incurred by each arm is independent of $M_d$.

When proving Theorem 4, the definition of $\widetilde{N}_{\delta,\eta,\alpha}$ is used in (40). For a more refined bound, we can instead use

$$\widetilde{N}'_{\delta,\eta,\alpha} := \sup_{\lambda \in (l_{\min},\, 1]} N_\delta\left((1 + 2\alpha)\log\left(M_d\lambda^d/\eta\right),\ \lambda\right),$$

where $l_{\min}$ is the minimal possible cube edge length during the algorithm run. This replacement does not affect the argument. Some details and an example regarding this refinement are in Appendix A.10. In Remark 3, we give an example of the regret rate on $f(x) = 2\log\frac{1}{x}$, $x \in (0, 1]$, with specific input parameters.

**Remark 3.** Consider the (unbounded, BMO) function $f(x) = 2\log\frac{1}{x}$, $x \in (0, 1]$. Pick $T \ge 20$.
For some $t \le T$, the $t$-step δ-regret of Algorithm 2 is $O(\mathrm{poly\text{-}log}(t))$ while allowing $\delta = O(1/T)$ and $\eta = \Theta\left(1/T^4\right)$. Intuitively, Algorithm 2 gets close to $f^\delta$ even if $f^\delta$ is very large. Details of this example can be found in Appendix A.10.

## 7. Experiments

We deploy Algorithms 1 and 2 on Himmelblau's function and the Styblinski-Tang function (arm space normalized to $[0, 1)^2$, function range rescaled to $[0, 10]$). The results are in Figure 3. We measure performance using traditional regret and δ-regret. Traditional regret can be measured because both functions are continuous, in addition to being BMO.

Figure 3. Algorithms 1 and 2 on Himmelblau's function (left) and the Styblinski-Tang function (right). Each line is averaged over 10 runs. The shaded area represents one variance above and below the average regret. For the Bandit-BMO-Z algorithm, all arms played incur regret, and each episode has 4 arm trials in it. In the figures, Bandit-BMO-Zδ (resp. Bandit-BMO-Pδ) plots the δ-regret (δ = 0.01) for Bandit-BMO-Z (resp. Bandit-BMO-P). Bandit-BMO-Zt (resp. Bandit-BMO-Pt) plots the traditional regret for Bandit-BMO-Z (resp. Bandit-BMO-P). For the Bandit-BMO-P algorithm, we use ϵ = 0.01, η = 0.001, and a total number of trials T = 10000. For the Bandit-BMO-Z algorithm, we use α = 1, ϵ = 0.01, η = 0.001, and T = 2500 episodes, with four arm trials in each episode. Note that we have plotted trials (arm pulls) rather than episodes. The landscapes of the test functions are in Appendix D.

## 8. Discussion on Future Directions

### 8.1. Lower Bound

A classic trick to derive minimax lower bounds for (stochastic) bandit problems is the needle-in-a-haystack. In this argument (Auer, 2002), we construct a hard problem instance where one arm is only slightly better than the rest of the arms, making it hard to distinguish the best arm from the rest. This argument is also used in metric spaces (e.g., Kleinberg et al., 2008; Lu et al., 2019). This argument, however, is forbidden by the definition of δ-regret, since here the set of good arms can have small measure and will be ignored by definition. Hence, we need new insights to derive minimax lower bounds for bandit problems measured by δ-regret.

### 8.2. Singularities of Analytical Forms

In this paper, we investigate the bandit problem where the reward can have singularities in the arm space. A natural problem along this line is when the reward function has specific forms of singularities, for example, when the average reward can be written as

$$f(x) = \sum_{i=1}^{k} \frac{1}{(x - s_i)^{\alpha_i}},$$

where the $s_i$ are the singularities and the $\alpha_i$ are the degrees of the singularities. To continue leveraging the advantages of BMO functions and the John-Nirenberg inequality, one might consider switching away from the Lebesgue measure and using decomposition results from classical analysis (e.g., Rochberg and Semmes, 1986).

## 9. Conclusion

We study the bandit problem when the (expected) reward is a BMO function. We develop tools for BMO bandits, and provide an algorithm that achieves poly-log δ-regret with high probability. Our result suggests that BMO functions can be optimized (with respect to δ-regret) even though they can be discontinuous and unbounded.

## Acknowledgement

The authors thank Weicheng Ye and Jingwei Zhang for insightful discussions. The authors thank the anonymous reviewers for valuable feedback.
## References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312-2320.

Agrawal, R. (1995). The continuum-armed bandit problem. SIAM Journal on Control and Optimization, 33(6):1926-1951.

Agrawal, S. and Goyal, N. (2012). Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39.1.

Alon, N., Matias, Y., and Szegedy, M. (1999). The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137-147.

Assouad, P. (1983). Plongements Lipschitziens dans R^n. Bulletin de la Société Mathématique de France, 111:429-448.

Audibert, J.-Y., Munos, R., and Szepesvári, C. (2009). Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876-1902.

Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397-422.

Auer, P. and Ortner, R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55-65.

Auer, P., Ortner, R., and Szepesvári, C. (2007). Improved rates for the stochastic continuum-armed bandit problem. In Conference on Computational Learning Theory, pages 454-468. Springer.

Bickel, P. J. et al. (1965). On some robust estimates of location. The Annals of Mathematical Statistics, 36(3):847-858.

Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. (2013). Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711-7717.

Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. (2011a). X-armed bandits. Journal of Machine Learning Research, 12(May):1655-1695.

Bubeck, S. and Slivkins, A. (2012). The best of both worlds: stochastic and adversarial bandits. In Conference on Learning Theory, pages 42.1.

Bubeck, S., Stoltz, G., and Yu, J. Y. (2011b). Lipschitz bandits without the Lipschitz constant. In International Conference on Algorithmic Learning Theory, pages 144-158. Springer.

Cope, E. W. (2009). Regret and convergence bounds for a class of continuum-armed bandit problems. IEEE Transactions on Automatic Control, 54(6):1243-1253.

Cwikel, M., Sagher, Y., and Shvartsman, P. (2012). A new look at the John-Nirenberg and John-Strömberg theorems for BMO. Journal of Functional Analysis, 263(1):129-166.

Fefferman, R. (1979). Bounded mean oscillation on the polydisk. Annals of Mathematics, 110(3):395-406.

Garivier, A. and Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond. In Conference on Learning Theory, pages 359-376.

John, F. (1961). Rotation and strain. Communications on Pure and Applied Mathematics, 14(3):391-413.

Kleinberg, R., Slivkins, A., and Upfal, E. (2008). Multi-armed bandits in metric spaces. In ACM Symposium on Theory of Computing, pages 681-690. ACM.

Kleinberg, R. D. (2005). Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pages 697-704.

Krause, A. and Ong, C. S. (2011). Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447-2455.

Krishnamurthy, A., Langford, J., Slivkins, A., and Zhang, C. (2019). Contextual bandits with continuous actions: Smoothing, zooming, and adapting. In Conference on Learning Theory, pages 2025-2027. PMLR.

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22.

Lerner, A. K. (2013). The John-Nirenberg inequality with sharp constants. Comptes Rendus Mathematique, 351(11-12):463-466.
Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In International Conference on World Wide Web, pages 661-670. ACM.

Lu, S., Wang, G., Hu, Y., and Zhang, L. (2019). Optimal algorithms for Lipschitz bandits with heavy-tailed rewards. In International Conference on Machine Learning, pages 4154-4163.

Maillard, O.-A., Munos, R., and Stoltz, G. (2011). A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In Conference on Learning Theory, pages 497-514.

Martell, J. An easy proof of the John-Nirenberg inequality. Math blog of Hyunwoo Will Kwon, http://willkwon.dothome.co.kr/index.php/archives/618, last accessed on 20/06/2020.

Medina, A. M. and Yang, S. (2016). No-regret algorithms for heavy-tailed linear bandits. In International Conference on Machine Learning, pages 1642-1650.

Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527-535.

Rochberg, R. and Semmes, S. (1986). A decomposition theorem for BMO and applications. Journal of Functional Analysis, 67(2):228-263.

Seldin, Y. and Slivkins, A. (2014). One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, pages 1287-1295.

Shamir, O. (2011). A variant of Azuma's inequality for martingales with sub-Gaussian tails. arXiv preprint arXiv:1110.2392.

Shao, H., Yu, X., King, I., and Lyu, M. R. (2018). Almost optimal algorithms for linear stochastic bandits with heavy-tailed payoffs. In Advances in Neural Information Processing Systems, pages 8420-8429.

Slavin, L. and Vasyunin, V. (2017). The John-Nirenberg constant of BMO^p, 1 ≤ p ≤ 2. St. Petersburg Mathematical Journal, 28(2):181-196.

Slivkins, A. (2014). Contextual bandits with similarity information. The Journal of Machine Learning Research, 15(1):2533-2568.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning.

Stein, E. M. and Murphy, T. S. (1993). Harmonic Analysis: Real-Variable Methods, Orthogonality, and Oscillatory Integrals, volume 3. Princeton University Press.

Tao, T. and Vu, V. (2015). Random matrices: universality of local spectral statistics of non-Hermitian matrices. The Annals of Probability, 43(2):782-874.

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294.

Vu, V. H. (2002). Concentration of non-Lipschitz functions and applications. Random Structures & Algorithms, 20(3):262-316.

Wang, T., Ye, W., Geng, D., and Rudin, C. (2019). Towards practical Lipschitz stochastic bandits. arXiv preprint arXiv:1901.09277.

Wanigasekara, N. and Yu, C. (2019). Nonparametric contextual bandits in an unknown metric space. In Advances in Neural Information Processing Systems.