# Robust Bandit Learning with Imperfect Context

Jianyi Yang, Shaolei Ren
University of California, Riverside
{jyang239, shaolei}@ucr.edu

A standard assumption in contextual multi-armed bandits is that the true context is perfectly known before arm selection. Nonetheless, in many practical applications (e.g., cloud resource management), prior to arm selection, the context information can only be acquired by prediction subject to errors or adversarial modification. In this paper, we study a contextual bandit setting in which only imperfect context is available for arm selection, while the true context is revealed at the end of each round. We propose two robust arm selection algorithms: Max Min UCB (Maximize Minimum UCB), which maximizes the worst-case reward, and Min WD (Minimize Worst-case Degradation), which minimizes the worst-case regret. Importantly, we analyze the robustness of Max Min UCB and Min WD by deriving both regret and reward bounds compared to an oracle that knows the true context. Our results show that, as time goes on, Max Min UCB and Min WD both perform asymptotically as well as their optimal counterparts that know the reward function. Finally, we apply Max Min UCB and Min WD to online edge datacenter selection, and run synthetic simulations to validate our theoretical analysis.

Introduction

Contextual bandits (Lu, Pál, and Pál 2010; Chu et al. 2011) concern online learning scenarios such as recommendation systems (Li et al. 2010), mobile health (Lei, Tewari, and Murphy 2014), cloud resource provisioning (Chen and Xu 2019), and wireless communications (Saxena et al. 2019), in which arms (a.k.a. actions) are selected based on the underlying context to balance the tradeoff between exploitation of the already learnt knowledge and exploration of uncertain arms (Auer et al. 2002; Auer, Cesa-Bianchi, and Fischer 2002; Bubeck and Cesa-Bianchi 2012; Dani et al. 2008). The majority of the existing studies on contextual bandits (Chu et al. 2011; Valko et al. 2013; Saxena et al. 2019) assume that a perfectly accurate context is known before each arm selection. Consequently, as long as the agent learns increasingly more knowledge about the reward, it can select arms with lower and lower average regrets. In many cases, however, the perfect (or true) context is not available to the agent prior to arm selection. Instead, the true context is revealed after taking an action at the end of each round (Kirschner and Krause 2019), but can be predicted using predictors, such as time series prediction (Brockwell et al. 2016; Gers, Schmidhuber, and Cummins 2000), to facilitate the agent's arm selection. For example, in wireless communications, the channel condition is subject to various attenuation effects (e.g., path loss and small-scale multi-path fading), and is critical context information for the transmitter configuration such as modulation and rate adaptation (i.e., arm selection) (Goldsmith 2005; Saxena et al. 2019). But the channel condition context is predicted and hence can only be coarsely known until the completion of transmission. For another example, the exact workload arrival rate is crucial context information for cloud resource management, but cannot be known until the workload actually arrives. Naturally, context prediction is subject to prediction errors.
Moreover, it can also open a new attack surface: an outside attacker may adversarially modify the predicted context. For example, a recent study (Chen, Tan, and Zhang 2019) shows that the energy load predictor in a smart grid can be adversarially attacked to produce load estimates with higher-than-usual errors. More motivating examples are provided in (Yang and Ren 2021). In general, imperfectly predicted and even adversarially presented context is very common in practice.

As motivated by practical problems, we consider a bandit setting where the agent receives an imperfectly predicted context and selects an arm at the beginning of each round, and the true context is revealed after arm selection. We focus on robust arm optimization given imperfect context, which is as crucial as robust reward function estimation or exploration in contextual bandits (Dudík, Langford, and Li 2011; Neu and Olkhovskaya 2020; Zhu et al. 2018). Concretely, with imperfect context, our goal is to select arms online in a robust manner to optimize the worst-case performance in a neighborhood domain with the received imperfect context as center and a defense budget as radius. In this way, the robust arm selection can defend against the context error (from either context prediction error or adversarial modification) constrained by the budget. Importantly and interestingly, given imperfect context, maximizing the worst-case reward (referred to as the type-I robustness objective) and minimizing the worst-case regret (referred to as the type-II robustness objective) can lead to different arms, while they are the same under the setting of perfect context (Saxena et al. 2019; Li et al. 2010; Slivkins 2019). Given imperfect context, the strategy for type-I robustness is more conservative than that for type-II robustness in terms of reward. The choice of the robustness objective depends on the application. For example, some safety-aware applications (Sun, Dey, and Kapoor 2017; García and Fernández 2015) intend to avoid extremely low reward, and thus the type-I objective is suitable for them. Other applications (Li et al. 2010; Chen et al. 2018; Guan et al. 2020) focus on preventing large sub-optimality of selected arms, and the type-II objective is more appropriate. As a distinction from other works on robust optimization of bandits (Bogunovic et al. 2018; Kirschner et al. 2020; Nguyen et al. 2020), we highlight the difference between the two types of robustness objectives.

We derive two algorithms: Max Min UCB (Maximize Minimum UCB), which maximizes the worst-case reward for the type-I objective, and Min WD (Minimize Worst-case Degradation), which minimizes the worst-case regret for the type-II objective. The challenge in the algorithm designs is that the agent has no access to the exact reward function, only an estimated counterpart based on historically collected data. Thus, in our design, Max Min UCB maximizes a lower bound of the reward, while Min WD minimizes an upper bound of the regret. We analyze the robustness of Max Min UCB and Min WD by deriving both regret and reward bounds, compared to a strong oracle that knows the true context for arm selection as well as the exact reward function.
Importantly, our results show that, while a linear regret term exists for both Max Min UCB and Min WD due to imperfect context, the added linear regret term is actually the same as the amount of regret incurred by respectively optimizing the type-I and type-II objectives with perfect knowledge of the reward function. This implies that, as time goes on, Max Min UCB and Min WD will asymptotically approach the corresponding optimized objectives from the reward and regret views, respectively. Finally, we apply Max Min UCB and Min WD to the problem of online edge datacenter selection and run synthetic simulations to validate our theoretical analysis.

Related Work

Contextual bandits. Linear contextual bandit learning is considered in Lin UCB by (Li et al. 2010). The study (Abbasi-Yadkori, Pál, and Szepesvári 2011) improves the regret analysis of linear contextual bandit learning, while the studies (Agrawal and Goyal 2012, 2013) solve this problem by Thompson sampling and give a regret bound. There are also studies extending these algorithms to general reward functions such as non-linear functions, for which the kernel method is exploited in GP-UCB (Srinivas et al. 2010), Kernel UCB (Valko et al. 2013), IGP-UCB and GP-TS (Chowdhury and Gopalan 2017; Deshmukh, Dogan, and Scott 2017). Nonetheless, a standard assumption in these studies is that perfect context is available for arm selection, whereas imperfect context is common in many practical applications (Kirschner et al. 2020).

Adversarial bandits and robustness. The prior studies on adversarial bandits (Auer and Chiang 2016; Jun et al. 2018; Altschuler, Brunel, and Malek 2019; Liu and Shroff 2019) have primarily focused on settings where the adversary maliciously presents rewards to the agent or directly injects errors into the rewards. Moreover, many studies (Audibert and Bubeck 2009; Gerchinovitz and Lattimore 2016) consider the best constant policy throughout the entire learning process as the oracle, while in our setting the best arm depends on the true context at each round. The adversarial setting has also been extended to contextual bandits (Neu and Olkhovskaya 2020; Syrgkanis, Krishnamurthy, and Schapire 2016; Han et al. 2020). Recently, robust bandit algorithms have been proposed for various adversarial settings. Some focus on robust reward estimation and exploration (Altschuler, Brunel, and Malek 2019; Guan et al. 2020; Dudík, Langford, and Li 2011), and others train a robust or distributionally robust policy (Wu et al. 2016; Syrgkanis, Krishnamurthy, and Schapire 2016; Si et al. 2020b,a). Our study differs from the existing adversarial bandits by seeking two different robust algorithms given imperfect (and possibly adversarial) context.

Optimization and bandits with imperfect context. (Rakhlin and Sridharan 2013) considers online optimization with predictable sequences, and (Jadbabaie et al. 2015) focuses on adaptive online optimization competing with dynamic benchmarks. Besides, (Chen et al. 2014; Jiang et al. 2013) study the robust optimization of mini-max regret. These studies assume perfectly known cost functions without learning. A recent study (Bogunovic et al. 2018) considers Bayesian optimization and aims at identifying a worst-case good input region under input perturbation (which can also model a perturbed but fixed environment/context parameter). The study (Wang, Wu, and Wang 2016) considers linear bandits where certain context features are hidden, and uses iterative methods to estimate the hidden contexts and model parameters.
Another recent study (Kirschner and Krause 2019) assumes knowledge of the context distribution for arm selection, and considers a weak oracle that also only knows the context distribution. The relevant papers (Kirschner et al. 2020) and (Nguyen et al. 2020) consider robust Bayesian optimization where the context distribution information is imperfectly provided, and propose to maximize the worst-case expected reward for distributional robustness. Although the objective of Max Min UCB in our paper is similar to the robust optimization objectives in these two papers, we additionally derive a lower bound for the true reward in our analysis, which provides another perspective on the robustness of arm selection. More importantly, considering that the objectives in the two relevant papers are equivalent to minimizing a pseudo robust regret, we propose Min WD and derive an upper bound for the incurred true regret.

Problem Formulation

Assume that at the beginning of round $t$, the agent receives an imperfect context $\hat{x}_t \in \mathcal{X}$ which is exogenously provided and not necessarily the true context $x_t$. Given the imperfect context $\hat{x}_t \in \mathcal{X}$ and an arm set $\mathcal{A}$, the agent selects an arm $a_t \in \mathcal{A}$ for round $t$. Then, the reward $y_t$ along with the true context $x_t$ is revealed to the agent at the end of round $t$. Assume that $\mathcal{X} \times \mathcal{A} \subseteq \mathbb{R}^d$, and we use $x_{a_t,t}$ to denote the $d$-dimensional concatenated vector $[x_t, a_t]$. The reward $y_t$ received by the agent in round $t$ is jointly decided by the true context $x_t$ and the selected arm $a_t$, and can be expressed as

$y_t = f(x_t, a_t) + n_t$, (1)

where $f: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $\mathcal{X}$ is the context domain, and $n_t$ is the noise term. We assume that the reward function $f$ belongs to a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ generated by a kernel function $k: (\mathcal{X} \times \mathcal{A}) \times (\mathcal{X} \times \mathcal{A}) \to \mathbb{R}$. In this RKHS, there exists a mapping function $\varphi: (\mathcal{X} \times \mathcal{A}) \to \mathcal{H}$ which maps a context and an arm to their corresponding feature in $\mathcal{H}$. By the reproducing property, we have $k([x, a], [x', a']) = \langle \varphi(x, a), \varphi(x', a') \rangle$ and $f(x, a) = \langle \varphi(x, a), \theta \rangle$, where $\theta$ is the representation of the function $f(\cdot,\cdot)$ in $\mathcal{H}$. Further, as commonly considered in the bandit literature (Slivkins 2019; Li et al. 2010), the noise $n_t$ follows a $b$-sub-Gaussian distribution for a constant $b \ge 0$, i.e., conditioned on the filtration $\mathcal{F}_{t-1} = \{x_\tau, y_{a,\tau}, a_\tau, \tau = 1, \cdots, t-1\}$, for all $\sigma \in \mathbb{R}$, $\mathbb{E}\left[e^{\sigma n_t} \mid \mathcal{F}_{t-1}\right] \le \exp\left(\frac{\sigma^2 b^2}{2}\right)$.

Without knowledge of the reward function $f$, bandit algorithms are designed to decide an arm sequence $\{a_t, t = 1, \cdots, T\}$ to minimize the cumulative regret

$R_T = \sum_{t=1}^{T} \left[ f(x_t, A^*(x_t)) - f(x_t, a_t) \right]$, (2)

where $A^*(x_t) = \arg\max_{a \in \mathcal{A}} f(x_t, a)$ is the oracle-optimal arm at round $t$ given the true context $x_t$. When the received contexts are perfect, i.e., $\hat{x}_t = x_t$, minimizing the cumulative regret is equivalent to maximizing the cumulative reward $F_T = \sum_{t=1}^{T} f(x_t, a_t)$.

Context Imperfectness

The context error can come from a variety of sources, including imperfect context prediction algorithms and adversarial corruption of the context (Kirschner et al. 2020; Chen, Tan, and Zhang 2019). We simply use context error to encapsulate all the error sources without further differentiation. We assume that the context error $\|x_t - \hat{x}_t\|$, where $\|\cdot\|$ is a certain norm (Bogunovic et al. 2018), is less than $\Delta$. Also, $\Delta$ is referred to as the defense budget and can be considered as the level of robustness/safeguard that the agent intends to provide against context errors: with a larger $\Delta$, the agent wants to make its arm selection robust against larger context errors (at the possible expense of its reward). A time-varying error budget can be captured by using $\Delta_t$. Denote the neighborhood domain of context $x$ as $B_\Delta(x) = \{y \in \mathcal{X} \mid \|y - x\| \le \Delta\}$. Then, we have the true context $x_t \in B_\Delta(\hat{x}_t)$, where $\hat{x}_t$ is available to the agent.
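To make the interaction protocol above concrete, the following minimal Python sketch simulates one round of the setting: the agent observes only $\hat{x}_t$, pulls an arm, and then sees $x_t$ and $y_t$. The quadratic reward function, the uniform error model, and all numerical values are illustrative assumptions for the sketch, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

ARMS = np.array([0.2, 0.5, 0.8])  # finite arm set A (illustrative)
DELTA = 0.05                      # defense budget on the context error
NOISE_B = 0.1                     # sub-Gaussian noise scale b

def reward(x, a):
    """Illustrative reward f(x, a); the paper only assumes f lies in an RKHS."""
    return 1.0 - (x - a) ** 2

def play_round(x_true):
    # The agent only sees an imperfect context inside B_Delta(x_true).
    x_hat = x_true + rng.uniform(-DELTA, DELTA)
    a = rng.choice(ARMS)  # placeholder policy; Max Min UCB / Min WD replace this step
    # True context and noisy reward are revealed only after the arm is pulled (Eqn. (1)).
    y = reward(x_true, a) + NOISE_B * rng.standard_normal()
    return x_hat, a, x_true, y

for t in range(3):
    print(play_round(rng.uniform(0.0, 1.0)))
```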
Reward Estimation

Reward estimation is critical for arm selection. Kernel ridge regression, which is widely used in contextual bandits (Slivkins 2019), serves as the reward estimation method in our algorithm designs. By kernel ridge regression, the estimated reward given arm $a$ and context $x$ is expressed as

$\hat{f}_t(x, a) = k_t^{T}(x, a)(K_t + \lambda I)^{-1} y_t$, (3)

where $I$ is an identity matrix, $y_t \in \mathbb{R}^{t-1}$ contains the history of rewards $y_\tau$, $k_t(x, a) \in \mathbb{R}^{t-1}$ contains the entries $k([x, a], [x_\tau, a_\tau])$, and $K_t \in \mathbb{R}^{(t-1)\times(t-1)}$ contains the entries $k([x_\tau, a_\tau], [x_{\tau'}, a_{\tau'}])$, for $\tau, \tau' \in \{1, \cdots, t-1\}$. The confidence width of kernel ridge regression is given in the following concentration lemma, followed by a definition of the reward UCB.

Lemma 1 (Concentration of Kernel Ridge Regression). Assume that the reward function $f(x, a)$ satisfies $|f(x, a)| \le B$, the noise $n_t$ follows a sub-Gaussian distribution with parameter $b$, and kernel ridge regression is used to estimate the reward function. With a probability of at least $1 - \delta$, $\delta \in (0, 1)$, for all $a \in \mathcal{A}$ and $t \in \mathbb{N}$, the estimation error satisfies $|\hat{f}_t(x, a) - f(x, a)| \le h_t s_t(x, a)$, where $h_t = \sqrt{\gamma_t - 2\log(\delta)}$, $\gamma_t = \log\det(I + K_t/\lambda) \le \bar{d}\log\left(1 + \frac{t}{\bar{d}\lambda}\right)$, and $\bar{d}$ is the rank of $K_t$. Let $V_t = \lambda I + \sum_{s=1}^{t} \varphi(x_{a_s,s})\varphi(x_{a_s,s})^{T}$; the squared confidence width is given by $s_t^2(x, a) = \varphi(x, a)^{T} V_{t-1}^{-1}\varphi(x, a) = \frac{1}{\lambda} k([x, a], [x, a]) - \frac{1}{\lambda} k_t(x, a)^{T}(K_t + \lambda I)^{-1} k_t(x, a)$.

Definition 1. Given arm $a \in \mathcal{A}$ and context $x \in \mathcal{X}$, the reward UCB (Upper Confidence Bound) is defined as $U_t(x, a) = \hat{f}_t(x, a) + h_t s_t(x, a)$.

The next lemma shows that the term $s_t(x_t, a_t)$ has a vanishing impact on the regret over time.

Lemma 2. The sum of squared confidence widths given $x_t$ for $t \in \{1, \cdots, T\}$ satisfies $\sum_{t=1}^{T} s_t^2(x_t, a_t) \le 2\gamma_T$, where $\gamma_T = \log\det(I + K_T/\lambda) \le \bar{d}\log\left(1 + \frac{T}{\bar{d}\lambda}\right)$ and $\bar{d}$ is the rank of $K_T$.

Then, we give the definition of the UCB-optimal arm, which is important in our algorithm designs.

Definition 2. Given context $x \in \mathcal{X}$, the UCB-optimal arm is defined as $A_t^*(x) = \arg\max_{a \in \mathcal{A}} U_t(x, a)$.

Note that if the received contexts are perfect, i.e., $\hat{x}_t = x_t$, the standard contextual UCB strategy selects the arm at round $t$ as $A_t^*(x_t)$. Under the cases with imperfect context, a naive policy (which we call Simple UCB) is simply oblivious of context errors, i.e., the agent selects the UCB-optimal arm with respect to the imperfect context, namely $A_t^*(\hat{x}_t)$, by simply viewing the imperfect context $\hat{x}_t$ as the true context. Nonetheless, if we want to guarantee the arm selection performance even in the worst case, robust arm selection that accounts for context errors is needed.
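The estimate in Eqn. (3), the confidence width $s_t$, and the reward UCB of Definition 1 all have closed forms, so they can be computed directly from the collected history. Below is a minimal NumPy sketch under the assumptions of a Gaussian kernel, a non-empty history buffer, and a user-supplied exploration weight $h_t$; the kernel choice and bandwidth are illustrative, not prescribed by the paper.

```python
import numpy as np

LAM = 0.1  # ridge parameter lambda

def gauss_kernel(Z1, Z2, bw=0.1):
    """Gaussian kernel k([x,a],[x',a']) on stacked context-arm vectors."""
    d2 = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

def ucb(z, Z_hist, y_hist, h_t):
    """Reward UCB U_t(x,a) = f_hat + h_t * s_t for one context-arm vector z.

    Assumes Z_hist (history of [x_tau, a_tau] vectors) is non-empty.
    """
    z = np.atleast_2d(z)
    K = gauss_kernel(Z_hist, Z_hist)                 # K_t
    k_vec = gauss_kernel(Z_hist, z)[:, 0]            # k_t(x, a)
    A = K + LAM * np.eye(len(K))
    f_hat = k_vec @ np.linalg.solve(A, y_hist)       # Eqn. (3)
    # Squared confidence width from Lemma 1 (closed form).
    s2 = (gauss_kernel(z, z)[0, 0] - k_vec @ np.linalg.solve(A, k_vec)) / LAM
    return f_hat + h_t * np.sqrt(max(s2, 0.0))
```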
Robustness Objectives

In the existing bandit literature such as (Auer and Chiang 2016; Han et al. 2020; Li et al. 2010), maximizing the cumulative reward is equivalent to minimizing the cumulative regret, under the assumption of perfect context for arm selection. In this section, we will show that maximizing the worst-case reward is equivalent to minimizing a pseudo regret and is different from minimizing the worst-case true regret.

[Figure 1: Illustration of reward and regret functions that the Type-I and Type-II robustness objectives are suitable for, respectively: (a) Type-I robustness; (b) Type-II robustness. The golden dotted vertical line represents the imperfect context $c^I$, and the gray region represents the defense region $B_\Delta(c^I)$.]

Type-I Robustness

With imperfect context, one approach to robust arm selection is to maximize the worst-case reward. With perfect knowledge of the reward function, the oracle arm that maximizes the worst-case reward at round $t$ is

$\tilde{a}_t = \arg\max_{a \in \mathcal{A}} \min_{x \in B_\Delta(\hat{x}_t)} f(x, a)$. (4)

For the analysis in the following sections, given $\tilde{a}_t$, the corresponding context for the worst-case reward is denoted as

$\tilde{x}_t = \arg\min_{x \in B_\Delta(\hat{x}_t)} f(x, \tilde{a}_t)$, (5)

and the resulting optimal worst-case reward is denoted as

$MF_t = f(\tilde{x}_t, \tilde{a}_t)$. (6)

Next, the Type-I robustness objective is defined based on the difference $\sum_{t=1}^{T} MF_t - F_T$, where $F_T = \sum_{t=1}^{T} f(x_t, a_t)$ is the actual cumulative reward.

Definition 3. If, with an arm selection strategy $\{a_1, \cdots, a_T\}$, the difference between the optimal cumulative worst-case reward and the cumulative true reward, $\sum_{t=1}^{T} MF_t - F_T$, is sub-linear with respect to $T$, then the strategy achieves Type-I robustness.

If an arm selection strategy achieves Type-I robustness, the lower bound for the true reward $f(x_t, a_t)$ approaches the optimal worst-case reward $MF_t$ in the defense region as $t$ increases. Therefore, a strategy achieving the type-I robustness objective can prevent very low reward. For example, in Fig. 1(a), arm 1 is the one that maximizes the worst-case reward, which is not necessarily optimal but always avoids extremely low reward under any context in the defense region. Note that maximizing the worst-case reward is equivalent to minimizing the robust regret defined in (Kirschner et al. 2020), which is written using our formulation as

$\sum_{t=1}^{T} \left[ \min_{x \in B_\Delta(\hat{x}_t)} f(x, \tilde{a}_t) - \min_{x \in B_\Delta(\hat{x}_t)} f(x, a_t) \right]$. (7)

However, this robust regret is a pseudo regret because the rewards of the oracle arm $\tilde{a}_t$ and the selected arm $a_t$ are compared under different contexts (i.e., their respective worst-case contexts), and it is not an upper or lower bound of the true regret $R_T$. To obtain a robust regret performance, we need to define another robustness objective based on the true regret.

Type-II Robustness

To provide robustness for the regret with imperfect context, we can minimize the cumulative worst-case regret, which is expressed as

$\bar{R}_T = \sum_{t=1}^{T} \max_{x \in B_\Delta(\hat{x}_t)} \left[ f(x, A^*(x)) - f(x, a_t) \right]$. (8)

Clearly, the true regret satisfies $R_T \le \bar{R}_T$, and minimizing the worst-case regret is equivalent to minimizing an upper bound of the true regret. Define the instantaneous regret function with respect to context $x$ and arm $a$ as $r(x, a) = f(x, A^*(x)) - f(x, a)$. Since, given the reward function, the optimization is decoupled among different rounds, the robust oracle arm that minimizes the worst-case regret at round $t$ is

$\bar{a}_t = \arg\min_{a \in \mathcal{A}} \max_{x \in B_\Delta(\hat{x}_t)} r(x, a)$. (9)

For the analysis in the following sections, given $\bar{a}_t$, the corresponding context for the worst-case regret is denoted as

$\bar{x}_t = \arg\max_{x \in B_\Delta(\hat{x}_t)} r(x, \bar{a}_t)$, (10)

and the resulting optimal worst-case regret is

$MR_t = r(\bar{x}_t, \bar{a}_t)$. (11)

Now, we can give the definition of Type-II robustness as follows.

Definition 4. If, with an arm selection strategy $\{a_1, \cdots, a_T\}$, the difference between the cumulative true regret and the optimal cumulative worst-case regret, $R_T - \sum_{t=1}^{T} MR_t$, is sub-linear with respect to $T$, then the strategy achieves Type-II robustness.

If an arm selection strategy achieves Type-II robustness, as time increases, the upper bound for the true regret $r(x_t, a_t)$ also approaches the optimal worst-case regret $MR_t$. Hence, a strategy achieving the type-II robustness objective can prevent a high regret. As shown in Fig. 1(b), arm 1 is selected by minimizing the worst-case regret, which is a robust arm selection because the regret of arm 1 under any context in the defense region is not too high.
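To illustrate how the oracle objectives in Eqns. (4) and (9) can disagree, the toy example below evaluates both on a discretized defense region with a known reward function; the two-arm reward model and all numbers are made up purely for illustration.

```python
import numpy as np

ARMS = [0, 1]
x_hat, DELTA = 0.5, 0.2
GRID = np.linspace(x_hat - DELTA, x_hat + DELTA, 101)  # discretized B_Delta(x_hat)

def f(x, a):
    # Illustrative reward: arm 0 is flat and safe, arm 1 is high for small x but drops.
    return 0.5 if a == 0 else 1.6 - 2.0 * x

def regret(x, a):
    return max(f(x, b) for b in ARMS) - f(x, a)

# Type-I oracle (Eqn. (4)): maximize the worst-case reward over the defense region.
a_type1 = max(ARMS, key=lambda a: min(f(x, a) for x in GRID))
# Type-II oracle (Eqn. (9)): minimize the worst-case regret over the defense region.
a_type2 = min(ARMS, key=lambda a: max(regret(x, a) for x in GRID))

print(a_type1, a_type2)  # prints "0 1": the two objectives pick different arms
```

Here the flat arm has the higher worst-case reward (0.5 vs. 0.2), while the other arm has the smaller worst-case regret (0.3 vs. 0.5), so the two oracles select different arms, mirroring the discussion around Fig. 1.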
Comparison of Two Robustness Objectives

The two types of robustness correspond to the algorithms maximizing the worst-case reward and minimizing the worst-case regret, respectively. In many cases, they result in different arm selections. Take the two scenarios in Fig. 1 as examples. In the scenario of Fig. 1(a), given the defense region, arm 1 is selected by maximizing the worst-case reward and arm 2 is selected by minimizing the worst-case regret. It can be observed that the worst-case regrets of the two arms are very close, but the worst-case reward of arm 2 is much lower than that of arm 1. Thus, the strategy of maximizing the worst-case reward is more suitable for this scenario. Differently, in the scenario of Fig. 1(b), arm 2 is selected by maximizing the worst-case reward and arm 1 is selected by minimizing the worst-case regret. Since the worst-case rewards of the two arms are very close and the worst-case regret of arm 2 is much larger than that of arm 1, it is more suitable to minimize the worst-case regret.

Robust Bandit Arm Selection

In this section, we propose two robust arm selection algorithms: (1) Max Min UCB (Maximize Minimum Upper Confidence Bound), which aims to maximize the minimum reward (Type-I robustness objective); and (2) Min WD (Minimize Worst-case Degradation), which aims to minimize the maximum regret (Type-II robustness objective). We derive the regret and reward bounds for both algorithms, and the proofs are available in (Yang and Ren 2021).

Algorithm 1: Robust Arm Selection with Imperfect Context
Input: context error budget $\Delta$
for $t = 1, \cdots, T$ do
    Receive imperfect context $\hat{x}_t$.
    Select arm $a_t^{I}$ to solve Eqn. (12) in Max Min UCB; or select arm $a_t^{II}$ to solve Eqn. (16) in Min WD.
    Observe the true context $x_t$ and the reward $y_t$.
end for

Max Min UCB: Maximize Minimum UCB

Algorithm. To achieve type-I robustness, Max Min UCB in Algorithm 1 selects an arm $a_t^{I}$ by maximizing the minimum UCB within the defense region $B_\Delta(\hat{x}_t)$:

$a_t^{I} = \arg\max_{a \in \mathcal{A}} \min_{x \in B_\Delta(\hat{x}_t)} U_t(x, a)$. (12)

The corresponding context that attains the minimum UCB in Eqn. (12) is $x_t^{I} = \arg\min_{x \in B_\Delta(\hat{x}_t)} U_t(x, a_t^{I})$.

Analysis. The next theorem gives a lower bound on the cumulative true reward of Max Min UCB in terms of the optimal worst-case reward and a sub-linear term.

Theorem 3. If Max Min UCB is used to select arms with imperfect context, then for any true contexts $x_t \in B_\Delta(\hat{x}_t)$ at round $t$, $t = 1, \cdots, T$, with a probability of $1 - \delta$, $\delta \in (0, 1)$, we have the following lower bound on the cumulative true reward:

$F_T \ge \sum_{t=1}^{T} MF_t - 2h_T \sqrt{2T\bar{d}\log\left(1 + \frac{T}{\bar{d}\lambda}\right)}$,

where $MF_t$ is the optimal worst-case reward in Eqn. (6), $\bar{d}$ is the rank of $K_T$, and $h_T$ is given in Lemma 1.

Remark 1. Theorem 3 shows that, with Max Min UCB, the difference between the optimal cumulative worst-case reward and the cumulative true reward is sub-linear, and thus Max Min UCB effectively achieves Type-I robustness according to Definition 3. This means that the reward of Max Min UCB has a bounded sub-linear gap compared to the optimal worst-case reward $\sum_{t=1}^{T} MF_t$ obtained with perfect knowledge of the reward function.
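Since Eqn. (12) only involves the reward UCB, Max Min UCB's selection step can be approximated on a finite grid over $B_\Delta(\hat{x}_t)$. The sketch below does this for a one-dimensional context, reusing the illustrative `ucb` helper from the reward-estimation sketch; the grid discretization and the helper itself are assumptions of the sketch rather than the paper's implementation.

```python
import numpy as np

def max_min_ucb_arm(x_hat, arms, Z_hist, y_hist, h_t, delta, n_grid=21):
    """Approximate Eqn. (12): pick the arm whose worst-case UCB over B_delta(x_hat) is largest."""
    grid = np.linspace(x_hat - delta, x_hat + delta, n_grid)  # discretized defense region
    best_arm, best_val = None, -np.inf
    for a in arms:
        # Worst-case (minimum) UCB of arm a over the defense region.
        worst = min(ucb(np.array([x, a]), Z_hist, y_hist, h_t) for x in grid)
        if worst > best_val:
            best_arm, best_val = a, worst
    return best_arm
```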
We are also interested in the cumulative true regret of Max Min UCB, which is given in the following corollary.

Corollary 3.1. If Max Min UCB is used to select arms with imperfect context, then for any true contexts $x_t \in B_\Delta(\hat{x}_t)$ at round $t$, $t = 1, \cdots, T$, with a probability of $1 - \delta$, $\delta \in (0, 1)$, we have the following bound on the cumulative true regret defined in Eqn. (2):

$R_T \le \sum_{t=1}^{T} \widehat{MR}_t + 2h_T \sqrt{2T\bar{d}\log\left(1 + \frac{T}{\bar{d}\lambda}\right)}$,

where $\widehat{MR}_t = \max_{x \in B_\Delta(\hat{x}_t)} f(x, A^*(x)) - MF_t$, and $MF_t$ is the optimal worst-case reward in Eqn. (6).

Remark 2. Corollary 3.1 shows that the worst-case regret of Max Min UCB can be considerably larger than the optimal worst-case regret $MR_t$ given in Eqn. (11) (the Type-II robustness objective). Actually, despite being robust in terms of rewards, arms selected by Max Min UCB can still have a very large regret, as shown in Fig. 1(b). Thus, to achieve type-II robustness, it is necessary to develop an arm selection algorithm that minimizes the worst-case regret.

Min WD: Minimize Worst-case Degradation

Algorithm. Min WD is designed to asymptotically minimize the worst-case regret. Without the oracle knowledge of the reward function, Min WD performs arm selection based on an upper bound of the regret. Denote $D_{a,t}(x) = U_t(x, A_t^*(x)) - U_t(x, a)$, referred to as the UCB degradation of arm $a$ at context $x$. By Lemma 1, the instantaneous true regret can be bounded as

$r(x_t, a_t) \le D_{a_t,t}(x_t) + 2h_t s_t(x_t, a_t) \le \bar{D}_{a_t,t} + 2h_t s_t(x_t, a_t)$, (15)

where $\bar{D}_{a,t} = \max_{x \in B_\Delta(\hat{x}_t)} D_{a,t}(x)$ is called the worst-case degradation, and $2h_t s_t(x_t, a_t)$ has a vanishing impact by Lemma 2. Thus, to minimize the worst-case regret, Min WD minimizes the upper bound $\bar{D}_{a,t}$ over arms, excluding the vanishing term $2h_t s_t(x_t, a_t)$, i.e.,

$a_t^{II} = \arg\min_{a \in \mathcal{A}} \max_{x \in B_\Delta(\hat{x}_t)} \left\{ U_t(x, A_t^*(x)) - U_t(x, a) \right\}$. (16)

The context that attains the worst case in Eqn. (16) is written as $x_t^{II} = \arg\max_{x \in B_\Delta(\hat{x}_t)} D_{a_t^{II},t}(x)$.
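Analogously, the selection step in Eqn. (16) compares each arm's UCB with that of the UCB-optimal arm at every context in the defense region. A grid-based sketch, again reusing the illustrative `ucb` helper and therefore subject to the same assumptions, could look as follows.

```python
import numpy as np

def min_wd_arm(x_hat, arms, Z_hist, y_hist, h_t, delta, n_grid=21):
    """Approximate Eqn. (16): pick the arm with the smallest worst-case UCB degradation."""
    grid = np.linspace(x_hat - delta, x_hat + delta, n_grid)
    # Pre-compute U_t(x, a) on the grid for every arm.
    U = {a: np.array([ucb(np.array([x, a]), Z_hist, y_hist, h_t) for x in grid])
         for a in arms}
    U_best = np.max(np.stack([U[a] for a in arms]), axis=0)   # U_t(x, A_t^*(x)) on the grid
    # Worst-case degradation of each arm over the defense region, then minimize over arms.
    return min(arms, key=lambda a: np.max(U_best - U[a]))
```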
Analysis. Given the arm $a_t^{II}$ selected by Min WD, the next lemma gives an upper bound on the worst-case degradation.

Lemma 4. If Min WD is used to select arms with imperfect context, then for each $t = 1, 2, \cdots, T$, with a probability of at least $1 - \delta$, $\delta \in (0, 1)$, we have

$\bar{D}_{a_t^{II},t} \le MR_t + 2h_t s_t\!\left(\check{x}_t, A_t^*(\check{x}_t)\right)$, (17)

where $MR_t$ is the optimal worst-case regret defined in Eqn. (11), and $\check{x}_t = \arg\max_{x \in B_\Delta(\hat{x}_t)} D_{\bar{a}_t,t}(x)$ is the context that maximizes the degradation given the arm $\bar{a}_t$ defined for the optimal worst-case regret in Eqn. (9).

Then, in order to show that $\bar{D}_{a_t^{II},t}$ approaches $MR_t$, we need to prove that $2h_t s_t(\check{x}_t, A_t^*(\check{x}_t))$ vanishes as $t$ increases. However, this is difficult because the considered sequence $\{[\check{x}_t, A_t^*(\check{x}_t)]\}$ is different from the actual sequence of contexts and selected arms $\{[x_t, a_t^{II}]\}$ under Min WD. To circumvent this issue, we first introduce the concept of $\epsilon$-covering (Wu 2016). Denote $\Phi = \mathcal{X} \times \mathcal{A}$ as the context-arm space. If a finite set $\Phi_\epsilon$ is an $\epsilon$-covering of the space $\Phi$, then for each $\phi \in \Phi$, there exists at least one $\bar{\phi} \in \Phi_\epsilon$ satisfying $\|\phi - \bar{\phi}\|_2 \le \epsilon$. Denote $C_\epsilon(\bar{\phi}) = \{\phi \mid \|\phi - \bar{\phi}\|_2 \le \epsilon\}$ as the cell with respect to $\bar{\phi} \in \Phi_\epsilon$. Since the dimension of the entries in $\Phi$ is $d$, the size of $\Phi_\epsilon$ is $|\Phi_\epsilon| \le O\!\left(\frac{1}{\epsilon^d}\right)$. Besides, we assume the mapping function $\varphi$ is Lipschitz continuous, i.e., $\forall x, y \in \Phi$, $\|\varphi(x) - \varphi(y)\| \le L_\varphi \|x - y\|$. Next, we prove the following proposition to bound the sum of confidence widths under some conditions.

Proposition 5. Let $X_T = \{x_{a_1,1}, \cdots, x_{a_T,T}\}$ be the sequence of true contexts and selected arms of the bandit algorithm, and $\check{X}_T = \{\check{x}_{\check{a}_1,1}, \cdots, \check{x}_{\check{a}_T,T}\}$ be the considered sequence of contexts and actions. Suppose that both $x_{a_t,t}$ and $\check{x}_{\check{a}_t,t}$ belong to $\Phi$. Besides, with an $\epsilon$-covering $\Phi_\epsilon \subseteq \Phi$, $\epsilon > 0$, there exists $\kappa \ge 0$ such that two conditions are satisfied. First, $\forall \bar{\phi} \in \Phi_\epsilon$, $\exists t \le \kappa/\epsilon^d$ such that $x_{a_t,t} \in C_\epsilon(\bar{\phi})$. Second, if at round $t$, $x_{a_t,t} \in C_\epsilon(\bar{\phi})$ for some $\bar{\phi} \in \Phi_\epsilon$, then $\exists t'$ with $t < t' < t + \kappa/\epsilon^d$ such that $x_{a_{t'},t'} \in C_\epsilon(\bar{\phi})$. If the mapping function $\varphi$ is Lipschitz continuous with constant $L_\varphi$, the sum of squared confidence widths is bounded as

$\sum_{t=1}^{T} s_t^2(\check{x}_{\check{a}_t,t}) \le 4\tilde{d}\sqrt{T}\log\left(1 + \frac{T}{\lambda}\right) + \frac{8 L_\varphi^2 \kappa^{2/d}}{\lambda} T^{1-\frac{1}{d}}$,

where $d$ is the dimension of $x_{a_t,t}$, $\tilde{d}$ is the effective dimension defined in the proof, $s_t^2(\check{x}_{\check{a}_t,t}) = \varphi(\check{x}_{\check{a}_t,t})^{T} V_{t-1}^{-1}\varphi(\check{x}_{\check{a}_t,t})$, and $V_t = \lambda I + \sum_{s=1}^{t} \varphi(x_{a_s,s})\varphi(x_{a_s,s})^{T}$.

Remark 3. The conditions in Proposition 5 guarantee that the time interval between the events that the true context-arm feature lies in the same cell is not larger than $\kappa/\epsilon^d$, which is proportional to the size of the $\epsilon$-covering $|\Phi_\epsilon|$. That means, similar contexts and selected arms occur in the true sequence repeatedly if $T$ is large enough. If contexts are sampled from a bounded space $\mathcal{X}$ with some distribution, then similar contexts will occur repeatedly. Also, note that the arm in our considered sequence, $A_t^*(\check{x}_t)$, is the UCB-optimal arm, which becomes close to the optimal arm for $\check{x}_t$ if the confidence width is sufficiently small. Hence, there exists some context error budget sequence $\{\Delta_t\}$ such that, starting from a certain round $T_0$, the two conditions are satisfied. The two conditions in Proposition 5 are mainly for the theoretical analysis of Min WD. By Lemma 4 and Proposition 5, we bound the cumulative regret of Min WD.

Theorem 6. If Min WD is used to select arms with imperfect context and, as time goes on, the conditions in Proposition 5 are satisfied, then for any true context $x_t \in B_\Delta(\hat{x}_t)$ at round $t$, $t = 1, \cdots, T$, with a probability of $1 - \delta$, $\delta \in (0, 1)$, we have the following bound on the cumulative true regret:

$R_T \le \sum_{t=1}^{T} MR_t + 2h_T T^{\frac{3}{4}} \sqrt{4\tilde{d}\log\left(1 + \frac{T}{\lambda}\right)} + 2\sqrt{\frac{8}{\lambda}}\, L_\varphi \kappa^{\frac{1}{d}} h_T T^{1-\frac{1}{2d}} + 2h_T \sqrt{2T\bar{d}\log\left(1 + \frac{T}{\bar{d}\lambda}\right)}$,

where $MR_t$ is the optimal worst-case regret for round $t$ in Eqn. (11), $d$ is the dimension of $x_{a_t,t}$, $\tilde{d}$ is the effective dimension defined in the proof of Proposition 5, $\bar{d}$ is the rank of $K_T$, and $h_T$ is given in Lemma 1.

Remark 4. Theorem 6 shows that, with Min WD, $R_T - \sum_{t=1}^{T} MR_t$ is sub-linear w.r.t. $T$, and thus Type-II robustness is effectively achieved according to Definition 4. This means the true regret bound approaches $\sum_{t=1}^{T} MR_t$, the optimal cumulative worst-case regret, asymptotically.

Next, in parallel with Max Min UCB, we derive a bound on the true reward of Min WD.

Corollary 6.1. If Min WD is used to select arms with imperfect context and, as time goes on, the true sequence of contexts and arms obeys the conditions in Proposition 5, then for any true contexts $x_t \in B_\Delta(\hat{x}_t)$ at round $t$, $t = 1, \cdots, T$, with a probability of $1 - \delta$, $\delta \in (0, 1)$, we have the following lower bound on the cumulative reward:

$F_T \ge \sum_{t=1}^{T} \left[ MF_t - MR_t \right] - 2h_T T^{\frac{3}{4}} \sqrt{4\tilde{d}\log\left(1 + \frac{T}{\lambda}\right)} - 2\sqrt{\frac{8}{\lambda}}\, L_\varphi \kappa^{\frac{1}{d}} h_T T^{1-\frac{1}{2d}} - 2h_T \sqrt{2T\bar{d}\log\left(1 + \frac{T}{\bar{d}\lambda}\right)}$,

where $MR_t$ is the optimal worst-case regret for round $t$ in Eqn. (11), $d$ is the dimension of $x_{a_t,t}$, $\tilde{d}$ is the effective dimension defined in the proof of Proposition 5, $\bar{d}$ is the rank of $K_T$, and $h_T$ is given in Lemma 1.

Remark 5. Corollary 6.1 shows that, as $t$ becomes sufficiently large, the difference between the optimal worst-case reward and the true reward of the selected arm is no larger than the optimal worst-case regret $MR_t$. With perfect context, we have $MR_t = 0$, and hence Max Min UCB and Min WD both asymptotically maximize the reward, implying that the two types of robustness are the same under perfect context.

Summary of Main Results

We summarize our analysis of Max Min UCB and Min WD in Table 1, while the algorithm details are available in Algorithm 1.
In the table, $d$ is the dimension of the context-arm vector $[x, a]$, $\widehat{MR}_t = \max_{x \in B_\Delta(\hat{x}_t)} f(x, A^*(x)) - MF_t$, and $MF_t$ and $MR_t$ are defined in Eqn. (6) and Eqn. (11), respectively. Type-I and type-II robustness objectives are achieved by Max Min UCB and Min WD, respectively.

| Algorithms | Regret | Reward |
|---|---|---|
| Max Min UCB | $\sum_{t=1}^{T} \widehat{MR}_t + O(\sqrt{T}\log T)$ | $\sum_{t=1}^{T} MF_t - O(\sqrt{T}\log T)$ |
| Min WD | $\sum_{t=1}^{T} MR_t + O(T^{\frac{3}{4}}\log T + T^{1-\frac{1}{2d}} + \sqrt{T}\log T)$ | $\sum_{t=1}^{T} [MF_t - MR_t] - O(T^{\frac{3}{4}}\log T + T^{1-\frac{1}{2d}} + \sqrt{T}\log T)$ |

Table 1: Summary of Analysis

Edge computing is a promising technique to meet the demand of latency-sensitive applications (Shi et al. 2016). Given multiple heterogeneous edge datacenters located in different locations, which one should be selected? Specifically, each edge datacenter is viewed as an arm, and the users' workload is the context that can only be predicted prior to arm selection. Our goal is to learn datacenter selection to optimize the latency in a robust manner given imperfect workload information. We assume that the service rate of edge datacenter $a$, $a \in \mathcal{A}$, is $\mu_a$, the computation latency follows an M/M/1 queueing model, and the average communication delay between this datacenter and the users is $p_a$. Hence, the average total latency cost can be expressed as $l(x, a) = p_a x + \frac{x}{\mu_a - x}$, which is commonly considered in the literature (Lin et al. 2011; Xu, Chen, and Ren 2017; Lin et al. 2012). The detailed settings are given in (Yang and Ren 2021).

In Fig. 2, we compare different algorithms in terms of three cumulative regret objectives: the robust regret in Eqn. (7), the worst-case regret in Eqn. (8), and the true regret in Eqn. (2). We consider the following algorithms: Simple UCB with imperfect context, Max Min UCB with imperfect context, and Min WD with imperfect context. Given a sequence of true contexts, the imperfect context sequence is generated by i.i.d. sampling from the uniform distribution over $B_\Delta(x_t)$ at each round. In the simulations, a Gaussian kernel with parameter 0.1 is used for reward (loss) estimation. $\lambda$ in Eqn. (3) is set as 0.1. The exploration rate is set as $h_t = 0.04$.

[Figure 2: Different cumulative regret objectives for different algorithms over 1000 time slots, comparing Simple UCB, Max Min UCB, and Min WD: (a) robust regret; (b) worst-case regret; (c) true regret.]

As shown in Fig. 2(a), Max Min UCB has the best robust-regret performance among the three algorithms. This is because Max Min UCB targets the type-I robustness objective, which is equivalent to minimizing the robust regret. However, Max Min UCB is not the best algorithm in terms of the true regret, as shown in Fig. 2(c), since the robust regret is not an upper or lower bound of the true regret. The other robust algorithm, Min WD, is also better than Simple UCB in terms of the robust regret, and it has the best performance among the three algorithms in terms of the worst-case regret, as shown in Fig. 2(b). This is because the regret of Min WD approaches the optimal worst-case regret (Theorem 6). Min WD also has a good true-regret performance, which coincides with the fact that the worst-case regret is an upper bound of the true regret. By comparing the three algorithms in terms of the three regret objectives, we can clearly see that Max Min UCB and Min WD achieve performance robustness in terms of the robust regret and the worst-case regret, respectively.
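For reference, the latency model and the simulation hyperparameters stated above can be collected into a short sketch. The service rates, communication delays, and workload range below are placeholders (the actual values are given in (Yang and Ren 2021)); only the kernel parameter 0.1, $\lambda = 0.1$, $h_t = 0.04$, and the uniform context-error model follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder edge datacenters: (service rate mu_a, communication delay p_a).
DATACENTERS = [(1.5, 0.05), (2.0, 0.12), (3.0, 0.20)]
DELTA = 0.05          # defense budget on the workload prediction error (placeholder)
KERNEL_BW = 0.1       # Gaussian kernel parameter used for reward (loss) estimation
LAM = 0.1             # ridge parameter lambda in Eqn. (3)
H_T = 0.04            # fixed exploration rate h_t

def latency(x, a):
    """Average total latency l(x, a) = p_a * x + x / (mu_a - x) under an M/M/1 model."""
    mu_a, p_a = DATACENTERS[a]
    return p_a * x + x / (mu_a - x)   # requires x < mu_a for queue stability

def imperfect_context(x_true):
    """Imperfect workload: i.i.d. uniform sample from B_Delta(x_true)."""
    return x_true + rng.uniform(-DELTA, DELTA)

x_t = rng.uniform(0.2, 1.0)           # placeholder workload range
print(imperfect_context(x_t), [latency(x_t, a) for a in range(len(DATACENTERS))])
```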
Conclusion

In this paper, considering a bandit setting with imperfect context, we propose Max Min UCB, which maximizes the worst-case reward, and Min WD, which minimizes the worst-case regret. Our analysis of Max Min UCB and Min WD based on regret and reward bounds shows that, as time goes on, Max Min UCB and Min WD both perform asymptotically as well as their counterparts that have perfect knowledge of the reward function. Finally, we consider online edge datacenter selection and run synthetic simulations for evaluation.

Acknowledgments

This work was supported in part by the NSF under grants CNS-1551661 and ECCS-1610471.

References

Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved Algorithms for Linear Stochastic Bandits. NeurIPS.
Agrawal, S.; and Goyal, N. 2012. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. COLT.
Agrawal, S.; and Goyal, N. 2013. Thompson Sampling for Contextual Bandits with Linear Payoffs. ICML.
Altschuler, J.; Brunel, V.-E.; and Malek, A. 2019. Best Arm Identification for Contaminated Bandits. Journal of Machine Learning Research 20(91): 1–39.
Audibert, J.; and Bubeck, S. 2009. Minimax policies for adversarial and stochastic bandits. COLT.
Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47: 235–256.
Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 2002. The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing 32: 48–77.
Auer, P.; and Chiang, C.-K. 2016. An Algorithm with Nearly Optimal Pseudo-regret for Both Stochastic and Adversarial Bandits. In COLT.
Bogunovic, I.; Scarlett, J.; Jegelka, S.; and Cevher, V. 2018. Adversarially Robust Optimization with Gaussian Processes. In NIPS.
Brockwell, P. J.; Brockwell, P. J.; Davis, R. A.; and Davis, R. A. 2016. Introduction to time series and forecasting. Springer.
Bubeck, S.; and Cesa-Bianchi, N. 2012. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning 5: 1–122.
Chen, B.; Wang, J.; Wang, L.; He, Y.; and Wang, Z. 2014. Robust optimization for transmission expansion planning: Minimax cost vs. minimax regret. IEEE Transactions on Power Systems 29(6): 3069–3077.
Chen, L.; and Xu, J. 2019. Budget-constrained edge service provisioning with demand estimation via bandit learning. IEEE Journal on Selected Areas in Communications 37(10): 2364–2376.
Chen, L.; Xu, J.; Ren, S.; and Zhou, P. 2018. Spatio-temporal edge service placement: A bandit learning approach. IEEE Transactions on Wireless Communications 17(12): 8388–8401.
Chen, Y.; Tan, Y.; and Zhang, B. 2019. Exploiting Vulnerabilities of Load Forecasting Through Adversarial Attacks. In e-Energy.
Chowdhury, S. R.; and Gopalan, A. 2017. On kernelized multi-armed bandits. ICML.
Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. 2011. Contextual bandits with linear payoff functions. NeurIPS.
Dani, V.; Hayes, T.; Thomas, P.; and Kakade, S. 2008. Stochastic linear optimization under bandit feedback. COLT.
Deshmukh, A. A.; Dogan, U.; and Scott, C. 2017. Multi-Task Learning for Contextual Bandits. NeurIPS.
Dudík, M.; Langford, J.; and Li, L. 2011. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601.
García, J.; and Fernández, F. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16(1): 1437–1480.
Gerchinovitz, S.; and Lattimore, T. 2016. Refined lower bounds for adversarial bandits. NeurIPS.
Gers, F. A.; Schmidhuber, J.; and Cummins, F. 2000. Learning to Forget: Continual Prediction with LSTM. Neural Computation 12(10): 2451–2471.
Goldsmith, A. 2005. Wireless Communications. Cambridge University Press.
Guan, Z.; Ji, K.; Bucci Jr., D. J.; Hu, T. Y.; Palombo, J.; Liston, M.; and Liang, Y. 2020. Robust Stochastic Bandit Algorithms under Probabilistic Unbounded Adversarial Attack. In AAAI.
Han, Y.; Zhou, Z.; Zhou, Z.; Blanchet, J.; Glynn, P. W.; and Ye, Y. 2020. Sequential Batch Learning in Finite-Action Linear Contextual Bandits. arXiv preprint arXiv:2004.06321.
Jadbabaie, A.; Rakhlin, A.; Shahrampour, S.; and Sridharan, K. 2015. Online optimization: Competing with dynamic comparators. In AISTATS.
Jiang, R.; Wang, J.; Zhang, M.; and Guan, Y. 2013. Two-stage minimax regret robust unit commitment. IEEE Transactions on Power Systems 28(3): 2271–2282.
Jun, K.-S.; Li, L.; Ma, Y.; and Zhu, X. 2018. Adversarial Attacks on Stochastic Bandits. In NIPS.
Kirschner, J.; Bogunovic, I.; Jegelka, S.; and Krause, A. 2020. Distributionally Robust Bayesian Optimization. In AISTATS.
Kirschner, J.; and Krause, A. 2019. Stochastic Bandits with Context Distributions. In NeurIPS.
Lei, H.; Tewari, A.; and Murphy, S. 2014. An actor-critic contextual bandit algorithm for personalized interventions using mobile devices. NeurIPS.
Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A Contextual-bandit Approach to Personalized News Article Recommendation. In WWW.
Lin, M.; Liu, Z.; Wierman, A.; and Andrew, L. L. H. 2012. Online algorithms for geographical load balancing. In IGCC.
Lin, M.; Wierman, A.; Andrew, L. L. H.; and Thereska, E. 2011. Dynamic right-sizing for power-proportional data centers. In INFOCOM.
Liu, F.; and Shroff, N. 2019. Data Poisoning Attacks on Stochastic Bandits. In ICML.
Lu, T.; Pál, D.; and Pál, M. 2010. Contextual multi-armed bandits. AISTATS.
Neu, G.; and Olkhovskaya, J. 2020. Efficient and robust algorithms for adversarial linear contextual bandits. In COLT.
Nguyen, T.; Gupta, S.; Ha, H.; Rana, S.; and Venkatesh, S. 2020. Distributionally robust Bayesian quadrature optimization. In AISTATS.
Rakhlin, A.; and Sridharan, K. 2013. Online Learning with Predictable Sequences. In COLT.
Saxena, V.; Jaldén, J.; Gonzalez, J. E.; Bengtsson, M.; Tullberg, H.; and Stoica, I. 2019. Contextual Multi-Armed Bandits for Link Adaptation in Cellular Networks. In Workshop on Network Meets AI & ML (NetAI).
Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; and Xu, L. 2016. Edge Computing: Vision and Challenges. IEEE Internet of Things Journal 3(5): 637–646.
Si, N.; Zhang, F.; Zhou, Z.; and Blanchet, J. 2020a. Distributional Robust Batch Contextual Bandits. arXiv preprint arXiv:2006.05630.
Si, N.; Zhang, F.; Zhou, Z.; and Blanchet, J. 2020b. Distributionally Robust Policy Evaluation and Learning in Offline Contextual Bandits. In ICML.
Slivkins, A. 2019. Introduction to Multi-Armed Bandits. Foundations and Trends in Machine Learning 12(1-2): 1–286.
Srinivas, N.; Krause, A.; Kakade, S.; and Seeger, M. 2010. Gaussian process optimization in the bandit setting: no regret and experimental design. ICML.
Sun, W.; Dey, D.; and Kapoor, A. 2017. Safety-aware algorithms for adversarial contextual bandit. In ICML.
Syrgkanis, V.; Krishnamurthy, A.; and Schapire, R. 2016. Efficient algorithms for adversarial contextual learning. ICML.
Valko, M.; Korda, N.; Munos, R.; Flaounas, I.; and Cristianini, N. 2013. Finite-time analysis of kernelised contextual bandits. UAI.
Wang, H.; Wu, Q.; and Wang, H. 2016. Learning Hidden Features for Contextual Bandits. CIKM.
Wu, Y. 2016. Packing, covering, and consequences on minimax risk. http://www.stat.yale.edu/~yw562/teaching/598/lec14.pdf.
Wu, Y.; Shariff, R.; Lattimore, T.; and Szepesvári, C. 2016. Conservative bandits. In ICML.
Xu, J.; Chen, L.; and Ren, S. 2017. Online Learning for Offloading and Autoscaling in Energy Harvesting Mobile Edge Computing. IEEE Transactions on Cognitive Communications and Networking 3(3): 361–373.
Yang, J.; and Ren, S. 2021. Robust Bandit Learning with Imperfect Context. arXiv preprint arXiv:2102.05018.
Zhu, F.; Guo, J.; Li, R.; and Huang, J. 2018. Robust actor-critic contextual bandit for mobile health (mHealth) interventions. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 492–501.