Asymptotically-Optimal Gaussian Bandits with Side Observations

Alexia Atsidakou*1, Orestis Papadigenopoulos*2, Constantine Caramanis1, Sujay Sanghavi1,3, Sanjay Shakkottai1

*Equal contribution. 1Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, Texas, USA. 2Department of Computer Science, University of Texas at Austin, Austin, Texas, USA. 3Amazon Science, USA. Correspondence to: Alexia Atsidakou, Orestis Papadigenopoulos. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

We study the problem of Gaussian bandits with general side information, as first introduced by Wu, Szepesvári, and György. In this setting, the play of an arm reveals information about other arms, according to an arbitrary a priori known side information matrix: each element of this matrix encodes the fidelity of the information that the row arm reveals about the column arm. In the case of Gaussian noise, this model subsumes standard bandits, full-feedback, and graph-structured feedback as special cases. In this work, we first construct an LP-based asymptotic instance-dependent lower bound on the regret. The LP optimizes the cost (regret) required to reliably estimate the suboptimality gap of each arm. This LP lower bound motivates our main contribution: the first known asymptotically optimal algorithm for this general setting.

1. Introduction

In the stochastic online learning framework, a player sequentially selects from a set of available actions (or arms) and collects a stochastic reward associated with the chosen action. Under the objective of maximizing the (expected) cumulative reward collected over a number of rounds, the complexity of an instance is greatly impacted by the nature of the feedback the player receives at the end of each round. In the full-feedback case, for instance, where the player observes the realized rewards of all actions, the most reasonable (and provably optimal) strategy is to greedily select at each round the action with maximum estimated mean reward computed using the observed samples. In the bandit-feedback case (Lai & Robbins, 1985; Bubeck & Cesa-Bianchi, 2012), on the other hand, where only the reward of the chosen action is revealed to the player, more sophisticated ideas have to be leveraged in order to align the two conflicting objectives of the problem: exploration (learning the best action) and exploitation (not playing suboptimal actions). In a number of real-world applications, however, the feedback does not fall into either of the above extreme categories, since an action can potentially leak information regarding not only its own reward distribution, but also the distributions of other actions. Drawing an example from music recommendation, the fact that a user likes a specific recommended song (corresponding to an action chosen by the platform) can potentially imply that similar songs (e.g., of the same artist or genre) might also be appreciated by the user. As another example, by encouraging a certain user of a social network to advertise a purchased product (potentially by offering a discount as part of a promotion campaign), the seller can obtain valuable information about other, neighboring users by recording their reactions.
Motivated by such practical scenarios, researchers have focused their attention on online learning models with richer information structures, namely, feedback which interpolates between full and bandit (Mannor & Shamir, 2011; Caron et al., 2012; Lin et al., 2014; Yun et al., 2018; Cortes et al., 2020). While the presence of side observations permits improved regret guarantees compared to those of standard multi-armed bandits, techniques and algorithms that have proved successful for bandit feedback (e.g., the optimism principle and the UCB method) fail to achieve optimality. Indeed, enriching the feedback model substantially complicates the interplay between exploration and exploitation. To illustrate, one can think of the following example: $K - 1$ arms provide standard bandit feedback to the player. In addition to these, there also exists an information-revealing arm with (deterministically) zero reward, such that, once played, the player gets full feedback on the realized rewards of the round, for all arms. Even in this simple example, the number of times (if any) this information-revealing arm should be played in a minimum-regret solution is a complex function of $K$ and the (a priori unknown) suboptimality gaps.

Initiated by the work of (Mannor & Shamir, 2011) in the context of adversarial bandits, the majority of works in this field focuses on the so-called graph-structured feedback model (Alon et al., 2015; Wu et al., 2015; Kocák et al., 2016; Arora et al., 2019; Cortes et al., 2020). There, each arm corresponds to a node of a given (directed) graph and, every time the player chooses an action, she observes the reward realization of the arm played, but also the realizations of all the adjacent arms in the graph. In this setting, a series of works (Buccapatnam et al., 2014; Alon et al., 2015; Cohen et al., 2016; Lykouris et al., 2020) has provided algorithms with regret guarantees that no longer depend on the number of arms, but on smaller inherent characteristics of the feedback graph (e.g., clique or independence number).

One of the most general feedback models introduced in the literature is due to Wu, Szepesvári, and György (Wu et al., 2015). According to their model, playing an arm reveals noisy information about the reward of all other arms according to an arbitrary a priori known side information matrix. Specifically, each element of this square matrix encodes the fidelity of the information, expressed in terms of the standard deviation of the noise, that the row arm reveals about the column arm. Notice that, modulo the Gaussian noise assumption made in (Wu et al., 2015), the above model subsumes full, bandit, and graph-structured feedback as special cases. For the case of Gaussian noise, (Wu et al., 2015) provide finite-time instance-dependent lower bounds on the regret of any algorithm. In terms of upper bounds, the authors only address the special case where each entry of the feedback matrix satisfies $\sigma_{i,j} \in \{\sigma, \infty\}$ for some fixed $\sigma$ (which is essentially the graph-structured feedback model), but they leave open the question of an algorithm for general feedback matrices.

1.1. Our Contributions

In our work, we provide the first algorithm and regret analysis for the setting of general feedback matrices proposed by (Wu et al., 2015). As we show, the regret guarantee of our algorithm is asymptotically optimal, as it matches our asymptotic instance-dependent lower bound. We now outline the main challenges and technical contributions of this work.
Concentration bounds for a natural ML estimator. In the restricted case where the entries of the feedback matrix are in {σ, }, the natural estimator of the mean reward of an arm is the sample average of the (finite-variance) samples collected. In our algorithm, where the collected samples are subject to different noise levels, we replace the above estimator by the weighted average of the collected samples, where the weight of each sample is the inverse variance of the noise of its source. The above maximum-likelihood estimator suggests that the notion of number of samples , as a measure of the amount of information collected, needs to be replaced by the weighted number of samples. The use of weighted average as an estimator introduces technical hurdles: in scenarios where the number of samples is a random variable that depends on the trajectory of the algorithm, one needs to apply a union bound over all possible sample numbers in order to decorrelate the estimate from the evolution of the algorithm up to each round. However, in our case where each sample can be collected under K different noise levels, the weighted number of samples of an arm can take exponentially many different values a fact that invalidates the use of union bound. In Section 3, we show how to overcome the above issue by proving concentration guarantees for our estimator, which extend to the more general case of sub-Gaussian noise. Asymptotic regret lower bound. In our setting, a minimum regret arm-pulling schedule is a complicated function of the number of arms, the suboptimality gaps, and the feedback matrix. In the imaginary scenario where prior knowledge of the suboptimality gaps is assumed, the underlying (combinatorial) problem is to collect sufficient information in order to distinguish the optimal arm by paying the minimum cost (i.e., total suboptimality gap of the arms played for exploration). Although this problem is computationally hard, it can be closely approximated (up to rounding errors) and relaxed by a linear program (LP). As we prove in Section 4, since the above approximation becomes tighter as the time horizon increases, the exact same LP provides a clear and intuitive asymptotic lower bound on the regret. Asymptotically optimal algorithm. The LP-based asymptotic regret lower bound that we construct serves as a starting point for the design of our asymptotically optimal algorithm. Since the suboptimality gaps are initially unknown, the role of our algorithm is now twofold: to estimate the LP (including constraints and objective) up to a sufficient degree, while simultaneously to implement its optimal solution in an online manner. In order to achieve the above dual objective, our algorithm interleaves rounds of pure greedy exploitation, when sufficient information (with respect to the estimated LP) has been collected, and exploration rounds. The latter case is further partitioned into rounds where, by greedily collecting high fidelity samples, the algorithm attempts to uniformly estimate the LP, and more refined exploration rounds that are dictated by its solution. In Section 5, we describe our algorithm and prove its asymptotic optimality. Due to space constraints, all omitted proofs have been moved to the Appendix. 1.2. Related Work The problem of bandits with graph-structured feedback was introduced by (Mannor & Shamir, 2011) in an adversarial setting. 
Their model naturally interpolates between experts, where the learner observes feedback from all actions, and bandits, where the learner receives feedback only from the action selected at each round. Since then, the graph feedback model has been extensively studied in both adversarial and stochastic settings. In the stochastic case, the work of (Caron et al., 2012) presents a Ω(log(T)) regret lower bound for the graph feedback setting, as long as not all suboptimal arms have a maximum-reward neighbor. In addition, (Caron et al., 2012) examine the regret of natural UCB variants that include the side observations in the UCB indices. They provide instance-dependent regret guarantees of worst-case order O χ max 2 min log(T) , where χ is the clique-partition number of the graph (i.e. the minimum number of cliques into which the graph can be partitioned). (Buccapatnam et al., 2014) provide an asymptotic instance-dependent LP lower bound for the graph-structured feedback model. The authors also propose two algorithms which exploit (a relaxation of) the minimum dominating set of the graph (i.e., the smallest set of nodes which, in terms of their one-step neighbors, covers all arms) in order to perform more efficient exploration. Their gap-dependent regret guarantees are of the form O P i D log(T ) , where D is a special dominating set of the graph. Similar guarantees for UCB and Thomson Sampling are provided by (Lykouris et al., 2020) in terms of the independence number of the feedback graph (i.e. the size of the maximum independent set). In principle, it seems that the absence of initial knowledge of the suboptimality gaps prevents the above approaches from achieving optimal regret guarantees. The work of (Wu et al., 2015) is the first to resolve the above issue and achieve optimal instance-dependent regret. Focusing on the Gaussian setting, they present an algorithm that attempts to estimate the gaps as well as satisfy the constraints of the asymptotic instance-dependent LP lower bound due to (Buccapatnam et al., 2014). They also present a minimax optimal algorithm for this setting. In addition, (Wu et al., 2015) generalize the graph-structured feedback model to arbitrary feedback matrices, for which they present finitetime gap-dependent lower bounds. The graph-structured feedback model has also been extensively studied in an adversarial setting. (Mannor & Shamir, 2011) present algorithms with regret guarantees of the form O( p α log(k) T), for undirected, and O( p χ log(k) T) for directed graphs, together with Ω( α T) lower bounds. (Alon et al., 2013) provide improved results for the directed feedback graph case. In (Alon et al., 2015), the authors investigate the relation between the observability properties of the graph and the optimal achievable regret. (Koc ak et al., 2016) generalize the graph feedback to weighted graph and provide O( α T) guarantees, where α is a generalization of the independence number for weighted graphs and the O( ) notation is used to hide logarithmic terms. Other variants of adversarial online learning with graph-structured feedback have been studied in (Rangi & Franceschetti, 2019; Cortes et al., 2019; Arora et al., 2019; Resler & Mansour, 2019; Cortes et al., 2020). Finally, additional examples of bandits that incorporate graph-structured feedback include the work of (Liu et al., 2018a;b), where the authors study the Bayesian version of the problem and provide O( p α T log(K)) regret guarantees for time-varying graphs. 
In (Liu et al., 2018a) the authors employ information directed sampling for this setting. (Li et al., 2020) extend the graph-structured feedback by adding observation probabilities on the edges. (Cohen et al., 2016) study the case where the feedback graph is never fully revealed to the learner. Other variations of the problem include (Yun et al., 2018; Singh et al., 2020; Wang et al., 2020), and the so-called partial monitoring setting (Lin et al., 2014; Bartók et al., 2014; Hanawal et al., 2016).

2. Problem Definition

Model. We consider the variant of the stochastic multi-armed bandit problem where the player is given $K$ arms with (unknown) expected rewards $\mu = (\mu_1, \dots, \mu_K)$ and a feedback matrix $\Sigma = (\sigma_{i,j})_{i,j \in [K]}$. By playing an action $i \in [K]$, the player observes a noisy sample $X_j$ of the reward of each arm $j \in [K]$, distributed independently as $X_j \sim \mathcal{N}(\mu_j, \sigma^2_{i,j})$, and collects the realized value $X_i \sim \mathcal{N}(\mu_i, \sigma^2_{i,i})$. Therefore, the matrix $\Sigma$ quantifies the quality of the noisy observations of the expected rewards in terms of the standard deviation of the Gaussian noise. At any round $t$ where arm $i_t$ is played, we denote by $X_{j,t}$ the noisy observation for each arm $j \in [K]$, which coincides with the realized collected reward in the case where $j = i_t$. The objective is to minimize the expected cumulative regret over an unknown time horizon $T$, defined as
$$R_T(\mu) = T\,\mu^* - \mathbb{E}\Big[\textstyle\sum_{t \in [T]} X_{i_t,t}\Big],$$
where $\mu^* = \max_{i \in [K]} \mu_i$ is the maximum expected reward. We remark that we do not impose any non-trivial restrictions on the feedback matrix $\Sigma \in \mathbb{R}^{K \times K}_{\geq 0}$. In particular, we allow $\Sigma$ to be asymmetric, i.e., $\sigma_{i,j} \neq \sigma_{j,i}$ for $i \neq j$, and we permit that $\sigma_{j,i} < \sigma_{i,i}$, namely, higher quality information about the reward of an action can potentially be obtained by playing a different action. Finally, $\sigma_{i,j} = \infty$ corresponds to the case where pulling arm $i$ reveals no information about the reward of arm $j$.

Technical notation. We denote by $i^*(\mu)$ the maximum expected reward arm of vector $\mu$ and by $\mu^*$ its expected reward. In the case of more than one optimal arm, we choose $i^*(\mu)$ to be the optimal arm of smallest index. For any vector $\mu$, the suboptimality gap of arm $i \in [K]$ is defined as $\Delta_i(\mu) = \mu^* - \mu_i$. We define the minimum and maximum suboptimality gaps as $\Delta_{\min}(\mu) = \min_{i:\,\Delta_i(\mu) > 0} \Delta_i(\mu)$ and $\Delta_{\max}(\mu) = \max_{i \in [K]} \Delta_i(\mu)$, respectively. For brevity, we simply use $\Delta_{\max}$ instead of $\Delta_{\max}(\mu)$ when $\mu$ is the actual mean reward vector. For any feedback matrix $\Sigma$, we define $\sigma^{\min}_i = \min_{j \in [K]} \sigma_{j,i}$ to be the minimum standard deviation of the noise under which we can obtain information about $\mu_i$. Note that, although the condition $\sigma^{\min}_i < \infty$ for any $i \in [K]$ is essential for an instance to be identifiable (indeed, if $\sigma^{\min}_i = \infty$ for some $i \in [K]$, then it is impossible to estimate $\mu_i$), we do not explicitly make this assumption, given that the dependence on $\{\sigma^{\min}_i\}_{i \in [K]}$ appears naturally in our results. Finally, we define $\bar{\sigma} = \max_{i \in [K]} \sigma^{\min}_i$ to be the maximum $\sigma^{\min}_i$ over all arms. At any round $t$, we denote by $N_i(t)$ the number of times arm $i$ has been pulled up to and including round $t-1$. Finally, we define $\zeta_i(t) = \sum_{j \in [K]} N_j(t)/\sigma^2_{j,i}$ to be the weighted number of samples corresponding to arm $i \in [K]$ at the beginning of round $t$.

3. Maximum-Likelihood Estimator and Concentration Bounds

By definition of our model, every time an arm $j \in [K]$ is played at some round $\tau$, for each arm $i \in [K]$ the player observes a sample $X_{i,\tau}$ drawn independently from $\mathcal{N}(\mu_i, \sigma^2_{j,i})$. From the perspective of a fixed arm $i \in [K]$, the player thus collects, at each round, a noisy estimate of $\mu_i$ under various (Gaussian) noise levels.
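To make the observation model concrete, the following minimal Python sketch simulates a pull under a feedback matrix $\Sigma$ and maintains the weighted numbers of samples $\zeta_i(t)$ together with the inverse-variance-weighted estimates that are formalized in Equation (1) below. The instance values, the helper name `play`, and the random seed are illustrative choices, not taken from the paper; `np.inf` encodes $\sigma_{i,j} = \infty$ (no information), so that a diagonal matrix with infinite off-diagonal entries recovers bandit feedback, an all-finite matrix recovers full feedback, and a $\{\sigma, \infty\}$-valued matrix recovers graph-structured feedback.

```python
import numpy as np

# sigma[i, j]: std. dev. of the observation of arm j when arm i is played (np.inf = no info).
mu = np.array([1.0, 0.8, 0.5])                      # unknown mean rewards (hypothetical)
sigma = np.array([[1.0,    2.0,    np.inf],
                  [np.inf, 1.0,    1.5],
                  [3.0,    np.inf, 1.0]])           # hypothetical feedback matrix

rng = np.random.default_rng(0)
K = len(mu)
sum_w = np.zeros(K)     # running sums of X / sigma^2, one per arm
zeta = np.zeros(K)      # weighted numbers of samples zeta_i(t)

def play(i):
    """Play arm i: observe X_j ~ N(mu_j, sigma[i, j]^2) for every arm j,
    update the inverse-variance statistics, and return the collected reward X_i."""
    obs = mu + sigma[i] * rng.standard_normal(K)    # entries with sigma = inf are +/- inf
    w = 1.0 / sigma[i] ** 2                         # weight 0 where sigma = inf
    finite = np.isfinite(obs)
    sum_w[finite] += w[finite] * obs[finite]
    zeta[finite] += w[finite]
    return obs[i]

for i in range(K):                                  # one pull of each arm
    play(i)
mu_hat = np.where(zeta > 0, sum_w / np.maximum(zeta, 1e-12), np.nan)
print(mu_hat, zeta)                                 # weighted estimates and sample weights
```

The point of the sketch is only that each pull contributes to every arm's statistics with weight equal to the inverse variance of the corresponding noise level, which is exactly the quantity $\zeta_i(t)$ tracked by the analysis.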
In particular, the noise variance of each such observation depends on the played arm, which in turn is a function of the trajectory of the algorithm up to that round. The results of the following sections rely on a natural maximum-likelihood (ML) estimator for the mean of samples coming from heterogeneous sources with identical means and different standard deviations, given by the matrix $\Sigma$. In this section, we define this estimator and prove useful concentration properties. For brevity, for the rest of this section we fix an arm $i \in [K]$ and drop any reference to it (writing $\sigma_j$ for $\sigma_{j,i}$ and $X_\tau$ for $X_{i,\tau}$).

The ML estimator for the mean of any arm at the beginning of round $t$ (namely, using $t-1$ samples) is defined as follows:
$$\hat{\mu}(t) \;=\; \frac{1}{\zeta(t)} \sum_{\tau=1}^{t-1} \sum_{j \in [K]} \frac{X_\tau}{\sigma^2_j}\, \mathbb{I}(i_\tau = j), \qquad (1)$$
where, again, we use the definition of the weighted number of samples $\zeta(t) = \sum_{j \in [K]} N_j(t)/\sigma^2_j$. Observe that the above estimator is a weighted average of $t-1$ samples coming from $K$ possibly different sources (namely, arms played), where each sample is weighted by the inverse variance of its source. Here, the weighted number of samples $\zeta(t)$ reflects the total quality of the information collected up to time $t$, and generalizes the number of collected samples, a critical notion in the standard concentration results used in multi-armed bandits. In the case where the number of samples from each type of noise distribution is fixed, the following guarantee can be trivially proved:

Fact 3.1. Assuming that the trajectory of the algorithm (namely, the types of the sources) up to and including time $t-1$ is fixed and independent of the observed samples (thus, $\zeta(t)$ is deterministic), the following inequality holds for any $\epsilon > 0$ and $i \in [K]$:
$$\Pr\big[|\hat{\mu}(t) - \mu| > \epsilon\big] < 2\exp\Big(-\frac{\zeta(t)\,\epsilon^2}{2}\Big).$$

In scenarios of active sampling, however, where the actions depend on the history of observed rewards, the quantity $\zeta(t)$ is also a random variable depending on the trajectory up to time $t-1$, a fact that invalidates the use of the above standard concentration result.

Remark. A usual approach in such settings is to apply a union bound over all possible numbers of samples, bypassing in that way the dependence between the trajectory and the observed samples. However, given that in our case $\zeta(t)$ can take $\binom{t+K-2}{K-1}$ different values (since the reward of each arm can be sampled in $K$ different ways at each round), this approach would result in an additional $O(t^{K})$ factor, with an undesirable exponential dependence on $K$, in the bound of Fact 3.1.

In order to overcome the above issue, we first prove an auxiliary result, which includes concentration bounds for active sampling related to the estimator in Equation (1).

Lemma 3.1. Let $\{Z_t\}_{t \in \mathbb{N}}$ be a sequence of random variables. We denote by $\mathcal{F}_t$ the $\sigma$-algebra generated by $\{Z_\tau\}_{\tau \le t}$ and by $\mathcal{F} = (\mathcal{F}_t)_{t \in \mathbb{N}}$ the corresponding filtration. Each random variable $Z_t$ is drawn independently from a zero-mean sub-Gaussian distribution with variance proxy $\sigma^2_t$, where $\sigma_t$ is an $\mathcal{F}_{t-1}$-measurable random variable. We define $W_t = \sum_{\tau=1}^{t} \frac{Z_\tau}{\sigma^2_\tau}$ and $\zeta_t = \sum_{\tau=1}^{t} \frac{1}{\sigma^2_\tau}$.
(a) Let $\phi$ be an $\mathcal{F}$-stopping time which satisfies either $\zeta_\phi \in I$ for some interval $I = [L, H]$ with $H > L > 0$, or $\phi = t+1$. Then, we have that
$$\Pr\Big[|W_\phi| > \sqrt{2\alpha\,\zeta_\phi \log t}\ \text{ and }\ \phi \le t\Big] \le 2\, t^{-\alpha L/H}.$$
(b) Let $\psi$ be an $\mathcal{F}$-stopping time which satisfies either $\zeta_\psi \ge r$ for some $r \in \mathbb{R}_{\ge 0}$, or $\psi = t+1$. Then, for any $\epsilon > 0$, we have that
$$\Pr\big[|W_\psi| > \zeta_\psi\, \epsilon\ \text{ and }\ \psi \le t\big] \le 2\exp\Big(-\frac{r\,\epsilon^2}{2}\Big).$$

Proof sketch. As an auxiliary result, we first prove that the sequence $(\bar{G}_s)_{s \in \mathbb{N}}$, where $\bar{G}_s = \exp\big(\lambda W_s - \frac{\lambda^2 \zeta_s}{2}\big)\,\mathbb{I}(s \le t)$ for some $\lambda \in \mathbb{R}$, is a super-martingale which satisfies $\mathbb{E}\big[\bar{G}_s\big] \le 1$ for any $s \in \mathbb{N}$.
For proving part (a), we focus on bounding the probability that $W_\phi > \sqrt{2\alpha\,\zeta_\phi \log t}$ and $\phi \le t$, since the other tail bound follows symmetrically. By denoting $G_s = \exp\big(\lambda(W_s - \sqrt{2\alpha\,\zeta_s \log t})\big)\,\mathbb{I}(s \le t)$ for some $\lambda > 0$, and using Markov's inequality, we get that
$$\Pr\Big[W_\phi > \sqrt{2\alpha\,\zeta_\phi \log t}\ \text{ and }\ \phi \le t\Big] \le \mathbb{E}[G_\phi].$$
In order to upper-bound $\mathbb{E}[G_\phi]$, recalling that $\bar{G}_s = \exp\big(\lambda W_s - \frac{\lambda^2 \zeta_s}{2}\big)\,\mathbb{I}(s \le t)$, we first rewrite $G_s = \bar{G}_s \exp\big(\frac{\lambda^2 \zeta_s}{2} - \lambda\sqrt{2\alpha\,\zeta_s \log t}\big)$. Note that, for $s = \phi$, the event that $\phi \le t$ implies that $\zeta_\phi \in I$ and, thus, $L \le \zeta_\phi \le H$. Therefore, by setting $\lambda = \frac{1}{H}\sqrt{2\alpha L \log t}$, $G_\phi$ can be upper-bounded as $G_\phi \le \bar{G}_\phi \exp\big(-\alpha \frac{L}{H} \log t\big)$. The proof follows by using the fact that $\mathbb{E}\big[\bar{G}_\phi\big] \le 1$ for any $\lambda > 0$. The proof of (b) follows by similar arguments.

In the next Lemma, we provide a concentration result for our estimator in the case where the weighted number of samples randomly depends on the trajectory of observed realizations. The result holds in the case where we have at least one sample from the source with minimum noise.

Lemma 3.2. Let $\alpha > 0$. For any $t \ge 2$, for the estimator defined in Equation (1), if $\zeta(t) \ge \frac{1}{(\sigma^{\min})^2}$, where $\sigma^{\min} = \min_{j \in [K]} \sigma_j$, then we have that
$$\Pr\bigg[\,|\hat{\mu}(t) - \mu| > \sqrt{\frac{2\alpha \log t}{\zeta(t)}}\,\bigg] < 2\,\lceil \log_2(t-1)\rceil\, t^{-\alpha/2}.$$

Note that imposing conditions on $\zeta(t)$ or the source noise is necessary in order to obtain any concentration result for the estimator in Equation (1); for instance, we cannot hope for concentration in the case where all $t$ samples come from a source with $\sigma = \infty$ noise. The proof of the above Lemma is based on an exponentially-spaced discretization of the possible range of values of $\zeta(t)$, combined with Lemma 3.1. We remark that another possible approach for obtaining concentration guarantees for our estimator is through the idea of the self-normalizing bound (Abbasi-yadkori et al., 2011) combined with the assumptions of Lemma 3.2. However, this approach comes at the expense of a slightly worse dependence on $t$.

4. Asymptotic Regret Lower Bound

In this section, we provide an asymptotic regret lower bound for policies that are consistent with respect to any environment $(\mu, \Sigma)$. We recall the definition of consistent policies:

Definition 4.1. A policy is called consistent if, for any environment $(\mu, \Sigma)$, its regret satisfies $\lim_{T \to \infty} \frac{R_T(\mu)}{T^p} = 0$ for any $p > 0$.

We show that, in the asymptotic regime, the problem of collecting sufficient information to distinguish the suboptimality gaps of the arms, while paying the minimum cost in terms of regret accumulated during suboptimal plays, can be closely approximated by an LP. For any reward vector $\mu$, we define $C(\mu)$ to be the set of vectors $c \in \mathbb{R}^K_{\ge 0}$ satisfying the following parameterized set of constraints:
$$\sum_{j \in [K]} \frac{c_j}{\sigma^2_{j,i}} \ge \frac{2}{\Delta^2_i(\mu)}, \quad \forall i \ne i^*(\mu), \qquad\qquad \sum_{j \in [K]} \frac{c_j}{\sigma^2_{j,i}} \ge \frac{2}{\Delta^2_{\min}(\mu)}, \quad i = i^*(\mu).$$
To provide some intuition on the above constraint set, we recall the case of standard bandit feedback, where $\sigma_{i,i} < \infty$ and $\sigma_{i,j} = \infty$ for any $i \ne j$. There, for any suboptimal arm $i$ the corresponding constraint of $C(\mu)$ becomes $c_i \ge \frac{2\,\sigma^2_{i,i}}{\Delta^2_i(\mu)}$. This matches the multiplicative factor in the existing lower bounds for the standard bandit feedback case, where each arm must be played at least $\frac{2\,\sigma^2_{i,i}}{\Delta^2_i(\mu)} \log(T)$ times. In our general feedback setting, for any arm $i$, in addition to the information collected by playing the arm itself, the constraints also take into account the information collected by any other arm $j \in [K] \setminus \{i\}$, weighted by its inverse variance. A natural lower bound on the number of suboptimal plays of each arm can be constructed using the minimum-cost feasible vector in $C(\mu)$ with respect to the suboptimality gaps, formally defined as
$$c^*(\mu) = \operatorname*{argmin}_{c \in C(\mu)} \sum_{i \in [K]} c_i\, \Delta_i(\mu).$$
(3) As we show in the following theorem, the optimal solution of the LP in Equation (3) asymptotically characterizes a lower bound on the number of plays of each suboptimal arm relative to log(T). Theorem 4.1. For any environment (µ, Σ), the regret of any consistent policy within T rounds satisfies lim inf T RT (µ) i [K] c i (µ) i(µ). In order to prove Theorem 4.1, first, we consider a reward vector µ and identify a suboptimal arm k in µ. For some ϵ > 0 we construct a vector µ such that ( µi, if i = k µ + ϵ, if i = k. (4) Then, in the following proposition, we decompose the KLdivergence of two distributions P, P , each capturing the interplay of a policy with environments (µ, Σ) and (µ , Σ), respectively. Proposition 4.1. Let two Gaussian K-armed bandit instances ν and ν with the same side information matrix Σ and mean reward vectors µ and µ , respectively. Let P (resp. P ) be the distribution associated with the interplay of ν (resp. ν ) and a policy π within t rounds. If µ and µ differ only in the reward of arm k, then the KL-divergence of P with respect to P satisfies: D(P||P ) = X i [K] E ν [Ni(t)] (µk µ k)2 2 σ2 i,k . (5) Theorem 4.1 follows by Proposition 4.1, Definition 4.1 and an application of Bretagnolle-Huber inequality for distributions with mean rewards µ, µ as defined in Equation (4). Before we proceed to the presentation of our algorithm, we comment that any given solution c (µ) of the LP in Equation (3) could be turned into an optimal Explore-Then Commit (ETC) strategy for our problem: by playing each arm c i (µ) log T times, thus incurring regret at most P i [K] c i (µ) i(µ) log T + P i [K] i(µ), by Lemma 3.2 we have collected enough information to distinguish the optimal arm with high probability. Hence, c (µ) readily describes an optimal (up to rounding errors) arm-pulling schedule with respect to the lower bound. 5. Algorithm and Regret Analysis In this section we provide the first algorithm for the general setting of Gaussian bandits with arbitrary side information matrix Σ and prove its asymptotic optimality. 5.1. Description of the Algorithm and Main Results The general idea behind our algorithm is the following: recall that in the ideal scenario where the suboptimality gaps are known a priori, the linear program described in Equation (3) would directly provide an optimal (up to vanishing rounding errors) exploration strategy, namely, a minimumregret arm-pulling schedule collecting the necessary information to distinguish the optimal arm with high probability. The above intuition, however, cannot be readily transformed into an algorithm, since prior knowledge of the suboptimality gaps would trivialize our problem. At a high level, our algorithm attempts to estimate the LP (including constraints and objective) described in Equation (3) up to some accuracy, while at the same time implement its solution (i.e., play, for each arm, the indicated number of samples) in an online manner. The choice of the desired accuracy, up to which the LP must be estimated, is a key for the success of our algorithm: the LP needs to be estimated wellenough to allow the learner to take near-optimal decisions, yet without suffering a significant estimation overhead. In principle, our algorithm generalizes and extends that of (Wu et al., 2015) for the simple case of graph-structured feedback, i.e., when each entry of Σ satisfies σi,j {σ, } for some σ < . Our algorithm is presented in pseudocode in Algorithm 1. 
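Computing an optimal solution $c^*(\hat{\mu}(t))$ of the estimated LP in Equation (3) is the key subroutine of the exploration logic described next. The following is a minimal sketch of that step, assuming SciPy's `linprog` is available; the function name `solve_lp`, the hypothetical estimated values, and the handling of the optimal arm's constraint are our own choices rather than the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def solve_lp(mu_hat, sigma):
    """Sketch of c*(mu_hat) from Equation (3):
    minimize sum_i c_i * Delta_i(mu_hat)
    s.t. sum_j c_j / sigma[j, i]^2 >= 2 / Delta_i(mu_hat)^2 for i != i*,
         the same constraint with Delta_min for i = i*, and c >= 0.
    Assumes a unique empirically optimal arm and at least one finite sigma[., i] per arm."""
    K = len(mu_hat)
    i_star = int(np.argmax(mu_hat))
    gaps = np.max(mu_hat) - mu_hat
    eff = gaps.copy()
    eff[i_star] = gaps[gaps > 0].min()          # constraint for i* uses Delta_min
    inv_var = 1.0 / sigma.T ** 2                # row i holds 1 / sigma[j, i]^2 over j
    # linprog solves min c^T x s.t. A_ub @ x <= b_ub, so the ">=" constraints are negated.
    res = linprog(c=gaps, A_ub=-inv_var, b_ub=-2.0 / eff ** 2, bounds=[(0, None)] * K)
    return res.x

# Hypothetical estimated instance:
mu_hat = np.array([1.0, 0.8, 0.5])
sigma = np.array([[1.0,    2.0,    np.inf],
                  [np.inf, 1.0,    1.5],
                  [3.0,    np.inf, 1.0]])
print(solve_lp(mu_hat, sigma))                  # prescribed exploration rates per arm
```

Any LP solver would do; the only structural point is that the constraint matrix is built from the inverse variances $1/\sigma^2_{j,i}$ and the right-hand sides from the estimated gaps, so infinite entries of $\Sigma$ simply contribute zero coefficients.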
The algorithm starts by playing, for each arm j [K], the arm ij [K] that provides the most accurate information about j. For the rest of the rounds, the algorithm either exploits the already collected information, or further explores the environment. Greedy exploitation. Let ˆµ(t) RK be the vector of estimated means, where the coefficient ˆµi(t) for every i [K] is given in Equation (1) (note that in Section 3 we have dropped any reference to i for generality). Recall that C(ˆµ(t)) is the constraint set of the estimated LP at time t, constructed using the collected samples. In the case where the collected number of samples for each arm provides (up to proper scaling) a feasible solution to C(ˆµ(t)) (see Case (A) of Algorithm 1), the algorithm greedily plays the arm with the maximum empirical mean reward. Uniform and LP-dictated exploration. If the collected samples are not feasible for C(ˆµ(t)) (up to proper scaling), then the algorithm enters an exploration phase where it either attempts to refine the estimate of the LP, or to make progress in exploring arms according to what the currently estimated LP dictates. We refer to each case as Uniform and LP-dictated exploration, respectively. Here, the player needs to address the delicate issue that the total exploration cost needs to be close to that of the optimal solution of the LP. This is achieved by performing a relatively small number of uniform exploration rounds, which allow the learner to obtain a sufficiently good estimate of the LP. The algorithm considers the LP to be sufficiently (uniformly) explored if the weighted number of samples for each arm satisfies a uniform lower bound. In particular, our algorithm compares the minimum weighted number of samples over all arms with a sublinear function of the total sample weight collected during all exploration rounds. Formally, Algorithm 1 inspects the weighted number of samples of each arm, and checks whether mini [K] ζi(t) < 1 K β (ne(t)), where β(x) = xγ 2 σ2 , with γ (0, 1) and σ = maxi [K] σmin i , and ne(t) is the total number of exploration rounds up to time t. In the case where the arms are not uniformly explored (see Case (B) of Algorithm 1), our algorithm attempts to satisfy this uniform exploration bound, by playing the arm which provides the less noisy information about an arm i that has the minimum ζi(t) (breaking ties arbitrarily). On the other hand, if the arms are sufficiently uniformly explored (see Case (C) of Algorithm 1), our algorithm computes an optimal solution c (ˆµ(t)) to the estimated LP, and then plays any arm i such that Ni(t) < 4α c i (ˆµ(t)) log t, namely, which is considered unexplored (again up to proper scaling) according to the LP solution. Main Results. Before we state the regret guarantee of Algorithm 1, we introduce some useful definitions. For any vector µ, we consider the family of vectors µ that are guaranteed to be ϵ-close to µ with respect to the ℓ -norm, that is, |µ i µi| ϵ for all arms i [K]. Let c (µ ) be the optimal solution of the minimization problem in Equation (3) using parameters µ . The following quantity appears naturally in our regret guarantees: Definition 5.1. For any vector µ, the worst-case ϵapproximate LP solution for arm j [K] is defined as c j(µ, ϵ) = sup µ : µ µ ϵ c j(µ ). Note that, by continuity, taking ϵ 0 we get c j(µ, ϵ) c j(µ) for every arm j [K]. We prove the following regret upper bound for Algorithm 1: Theorem 5.1. 
For any α > 4, γ (0, 1), and ϵ > 0, the Algorithm 1 Asymptotically-Optimal Algorithm for Gaussian Bandits with Side Observations Input: K arms, Σ, β(x) = xγ 2 σ2 , γ (0, 1), α > 4. Set ne(K + 1) 0. For each arm j [K], play arm ij arg mini [K] σ2 i,j. for t = K + 1, K + 2, . . . do if N1(t) 4a log t, ..., NK(t) 4a log t C(ˆµ(t)) then // Case (A): Greedy exploitation (event At in the analysis). Play arm it arg maxi [K] ˆµi(t). Set ne(t + 1) ne(t). else if mini [K] ζi(t) < 1 K β (ne(t)) then // Case (B): Uniform exploration (event At c, Bt in the analysis). Let i arg mink [K] ζk(t). Play it arg mink [K] σ2 k,i. Set ne(t + 1) ne(t) + 1. else // Case (C): LP-Dictated exploration (event At c, Bt c in the analysis). Compute c (ˆµ(t)) arg minc C(ˆµ(t)) P i [K] ci i(ˆµ(t)). Play it i for some i [K] such that Ni(t) < 4α c i (ˆµ(t)) log t. Set ne(t + 1) ne(t) + 1. end if end for regret of algorithm Algorithm 1 satisfies RT (µ) 2K + 8K α 4 + 2 max + 2 max K X τ [T ] exp τ γϵ2 j [K] c j(µ, ϵ) log T + K j [K] j(µ)c j(µ, ϵ) log T. We remark that, in the restricting case of graph sideinformation structure (i.e. each σi,j = σ or ), the regret upper bound of Algorithm 1 also matches the asymptotically-optimal regret shown in (Wu et al., 2015). Further, notice that, as σ = maxi [K] σmin i grows to infinity, the dependence on T in the second term of the above bound becomes linear. This is a natural effect, however, since having a large σ implies the existence of an arm i [K] with large σmin i , for which there is no possible way of collecting accurate information. The following corollary bounds the asymptotic regret of Algorithm 1 in terms of the worst-case ϵ-approximate LP solution. Corollary 5.1. As the time horizon tends to infinity, the regret of Algorithm 1 satisfies j [K] j(µ)c j(µ). Notice that, although the above bound matches the lower bound of Theorem 4.1 for ϵ = 0, we are not allowed to directly use ϵ 0 in the bound of Theorem 5.1, since that would make the term P τ [T ] exp τ γϵ2 4K σ2 linear in T. However, by choosing ϵ to be a decreasing function of T, e.g. ϵ log(T) γ/4, the suboptimal terms in Theorem 5.1 are vanishing as T and the regret bound in Corollary 5.1 matches the asymptotic regret lower bound (up to constant factors). 5.2. Outline of the Regret Analysis For the rest of this section, we provide an overview of our regret analysis, while the complete proof can be found in Appendix C. Clearly, the first K rounds contribute at most a (K 1) max-additive loss in the regret. For each round t K + 1, we consider the following events: 4a log t, . . . , NK(t) Bt = min i [K] ζi(t) < 1 K β(ne(t)) . Notice that the above events immediately characterize how Algorithm 1 operates at each round. Specifically, for t K + 1, the algorithm enters in Case A when At holds, in Case B when At c, Bt, and in Case C, when At c, Bt c. We also define the error event: | ˆµi(t) µi| ζi(t) , i [K] As we show in Lemma 3.2 of Section 3, for any t the event Ut holds with high probability. Therefore, we can assume that Ut holds in every round, since the probability that Ut does not hold at some round converges to a small O( K α 4 max)-additive loss in the regret. Rounds satisfying Ut, At. When the event Ut holds, if the algorithm enters Case A, it can be shown that it chooses the optimal arm during the greedy selection. 
In particular, the event Ut, At implies that for any arm i that is suboptimal with respect to ˆµ(t), that is, i = i (ˆµ(t)), we have that µi ˆµi(t) i(ˆµ(t)) while for the optimal arm i (ˆµ(t)), we get ˆµi (ˆµ(t))(t) µi (ˆµ(t)) min(ˆµ(t)) Since each true parameter µi is i(ˆµ(t)) 2 -close to the corresponding estimate at time t, we can conclude that the arm selected by Algorithm 1 in that case is the optimal. Hence, the rounds such that Ut, At holds do not contribute to the regret. Rounds satisfying Ut, At c, Bt. We now turn our focus to the rounds where At c, Bt holds, which implies that Algorithm 1 enters Case B. Through a counting argument we show that the above can happen at most σ2 β(ne) + 1 times, where ne is the total number of exploration rounds, i.e. ne = PT t=K+1 I (At c). Specifically, we show the following: Lemma 5.1. The following inequality holds: t=K+1 I (At c, Bt) 1 t=K+1 I (At c) Notice that the fact that our algorithm plays the most informative arm (i.e., the one with smaller noise) in order to collect a sample for arm i = arg mink [K] ζk(t) is crucial for the above counting argument to hold. Thus, using Lemma 5.1, we have that t=K+1 I (Ut, At c, Bt) t=K+1 I (At c, Bt) PT t=K+1 I (At c) γ We now bound the regret accumulated during all exploration rounds (that is, including Cases (2) and (3)). We have that I (At c) I (Ut c) + I (Ut, At c, Bt c) + I (Ut, At c, Bt) . (7) By combining Equation (6) with Equation (7), we can see that in order to upper bound PT t=K+1 I (Ut, At c, Bt), it suffices to provide a bound on PT t=K+1 I (Ut, At c, Bt c). By doing this we are able to conclude that, overall, the number of rounds such that Ut, At c, Bt holds is at most order τ [T ] exp τ γϵ2 j [K] c j(µ, ϵ) log T + K Rounds satisfying Ut, At c, Bt c. It remains to describe how we bound the regret accumulated in rounds where {At c, Bt c} holds, and thus the algorithm enters Case C. We define the following event which states that, at time t, the error in the mean estimates upper bounded by ϵ: Et = {| ˆµi(t) µi| ϵ, i [K]} . (8) In order to bound the error of case {Ut, At c, Bt c}, we further distinguish two cases according to whether Et holds: Recall that, when {At c, Bt c} holds, the algorithm selects an arm j such that Nj(t) 4a log t < c j(ˆµ(t)). For rounds where {Ut, At c, Bt c, Et} holds, by Definition 5.1 in combination with the definition of Et, we prove the following result: Lemma 5.2. For every arm j [K], it holds that t=K+1 I (Ut, At c, Bt c, Et, it = j) 4α c j(µ, ϵ) log T. The above results immediately implies that the contribution to the regret of rounds t K + 1 such that {Ut, At c, Bt c, Et} is at most 4α P j [K] j(µ)c j(µ, ϵ) log T. Finally, for the rounds such that {Ut, At c, Bt c, Et c} holds, i.e., when the error in some of the estimates is greater than ϵ, by applying a union bound over all arms, we get I (Ut, At c, Bt c, Et c) i [K] I Ut, At c, min j [K] ζj(t) β(ne(t)) K , | ˆµi(t) µi| > ϵ . Given that minj [K] ζj(t) β(ne(t)) K implies that ζi(t) β(ne(t)) K for any arm i [K], using part (c) of Lemma 3.1, we can upper bound the probability of each term in the above summation by exp ϵ2 2 1 K β(τ) . Thus, the total contribution to the regret of the rounds such that {Ut, At c, Bt c, Et c} holds can be upper bounded by τ [T ] exp τ γϵ2 The regret guarantee presented in Theorem 5.1 follows by combining the above losses. 
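For reference, the case analysis above corresponds to the following schematic sketch of one round of Algorithm 1. It assumes the current estimates $\hat{\mu}(t)$, weighted sample counts $\zeta_i(t)$, and play counts $N_i(t)$ are available, reuses the `solve_lp` helper sketched after Section 4, and uses placeholder constants; it is an illustration of the three cases, not the authors' implementation.

```python
import numpy as np
# Assumes solve_lp(mu_hat, sigma) as sketched earlier; alpha > 4, gamma in (0, 1),
# and sigma_bar = max_i sigma_i^min are parameters as in Algorithm 1.

def in_C(c, mu_hat, sigma):
    """Is the vector c feasible for the constraint set C(mu_hat)?"""
    gaps = np.max(mu_hat) - mu_hat
    i_star = int(np.argmax(mu_hat))
    eff = gaps.copy()
    eff[i_star] = gaps[gaps > 0].min() if np.any(gaps > 0) else np.inf
    lhs = (c[:, None] / sigma ** 2).sum(axis=0)     # sum_j c_j / sigma[j, i]^2, per arm i
    return bool(np.all(lhs >= 2.0 / eff ** 2))

def one_round(t, N, zeta, mu_hat, sigma, n_explore, alpha=4.5, gamma=0.5, sigma_bar=1.0):
    """Return (arm to play, updated number of exploration rounds)."""
    if in_C(N / (4 * alpha * np.log(t)), mu_hat, sigma):
        return int(np.argmax(mu_hat)), n_explore                # Case (A): greedy exploitation
    if zeta.min() < (n_explore ** gamma / (2 * sigma_bar ** 2)) / len(mu_hat):
        i = int(np.argmin(zeta))                                # Case (B): uniform exploration
        return int(np.argmin(sigma[:, i])), n_explore + 1       # most informative arm for i
    c_star = solve_lp(mu_hat, sigma)                            # Case (C): LP-dictated exploration
    under = np.flatnonzero(N < 4 * alpha * c_star * np.log(t))
    return int(under[0]), n_explore + 1                         # an under-explored arm exists
```

Note that, since the constraints of $C(\hat{\mu}(t))$ are monotone in $c$, the failure of the Case (A) test guarantees that at least one arm violates $N_i(t) \ge 4\alpha\, c^*_i(\hat{\mu}(t)) \log t$, so Case (C) always has an arm to play.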
Conclusion and Further Directions In this work, we revisit the general feedback model introduced by Wu, Szepesv ari, and Gy orgy in (Wu et al., 2015) and provide the first algorithm for the case of arbitrary feedback matrices. To the best of our knowledge, this is one of the most general feedback models that appears in the literature, modulo the Gaussian noise assumption. A number of questions still remain open in the area of online learning under rich feedback structures. For instance, it would be interesting to explore the existence of algorithms that achieve (minimax) optimal regret in the finite time horizon regime. Another direction would be to examine different noise models, other than Gaussian, or even general sub-Gaussian noise with known variance proxies. We believe that our work could serve as a building block in the above direction. Acknowledgements This research is supported in part by NSF Grants 2019844, 2107037 and 2112471, the Machine Learning Lab (MLL) at UT Austin, and the Wireless Networking and Communications Group (WNCG) Industrial Affiliates Program. Abbasi-yadkori, Y., P al, D., and Szepesv ari, C. Improved algorithms for linear stochastic bandits. In Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011. Alon, N., Cesa-Bianchi, N., Gentile, C., and Mansour, Y. From bandits to experts: A tale of domination and independence. Advances in Neural Information Processing Systems, 07 2013. Alon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T. Online learning with feedback graphs: Beyond bandits. In COLT, 2015. Arora, R., Marinov, T. V., and Mohri, M. Bandits with feedback graphs and switching costs. In Neur IPS, 2019. Bart ok, G., Foster, D., P al, D., Rakhlin, A., and Szepesv ari, C. Partial monitoring classification, regret bounds, and algorithms. Mathematics of Operations Research, 39: 967 997, 11 2014. doi: 10.1287/moor.2014.0663. Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Found. Trends Mach. Learn., 5:1 122, 2012. Buccapatnam, S., Eryilmaz, A., and Shroff, N. Stochastic bandits with side observations on networks. ACM SIGMETRICS Performance Evaluation Review, 42, 06 2014. doi: 10.1145/2591971.2591989. Caron, S., Kveton, B., Lelarge, M., and Bhagat, S. Leveraging side observations in stochastic bandits. In UAI, 2012. Cohen, A., Hazan, T., and Koren, T. Online learning with feedback graphs without the graphs. In Balcan, M. F. and Weinberger, K. Q. (eds.), Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 811 819, New York, New York, USA, 20 22 Jun 2016. PMLR. Cortes, C., De Salvo, G., Gentile, C., Mohri, M., and Yang, S. Online learning with sleeping experts and feedback graphs. In ICML, 2019. Cortes, C., De Salvo, G., Gentile, C., Mohri, M., and Zhang, N. Online learning with dependent stochastic feedback graphs. In ICML, 2020. Hanawal, M. K., Szepesv ari, C., and Saligrama, V. Sequential learning without feedback. Co RR, abs/1610.05394, 2016. Koc ak, T., Neu, G., and Valko, M. Online learning with noisy side observations. In AISTATS, 2016. Lai, T. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1): 4 22, 1985. ISSN 0196-8858. doi: https://doi.org/10. 1016/0196-8858(85)90002-8. Lattimore, T. and Szepesv ari, C. Bandit Algorithms. 
Cambridge University Press, 2020. doi: 10.1017/ 9781108571401. Li, S., Chen, W., Wen, Z., and Leung, K.-S. Stochastic online learning with probabilistic graph feedback. Ar Xiv, abs/1903.01083, 2020. Lin, T., Abrahao, B., Kleinberg, R., Lui, J., and Chen, W. Combinatorial partial monitoring game with linear feedback and its applications. volume 3, 06 2014. doi: 10.13140/RG.2.1.1502.1521. Liu, F., Buccapatnam, S., and Shroff, N. B. Information directed sampling for stochastic bandits with graph feedback. In AAAI, 2018a. Liu, F., Zheng, Z., and Shroff, N. B. Analysis of thompson sampling for graphical bandits without the graphs. Ar Xiv, abs/1805.08930, 2018b. Lykouris, T., Tardos, E., and Wali, D. Graph regret bounds for thompson sampling and ucb. In ALT, 2020. Mannor, S. and Shamir, O. From bandits to experts: On the value of side-observations. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS 11, pp. 684 692, Red Hook, NY, USA, 2011. Curran Associates Inc. ISBN 9781618395993. Rangi, A. and Franceschetti, M. Online learning with feedback graphs and switching costs. AISTATS, 2019. Resler, A. and Mansour, Y. Adversarial online learning with noise. In ICML, 2019. Singh, R., Liu, F., Liu, X., and Shroff, N. Contextual bandits with side-observations, 06 2020. Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387790519. Wang, L., Li, B., Zhou, H., Giannakis, G. B., Varshney, L. R., and Zhao, Z. Adversarial linear contextual bandits with graph-structured side observations. Co RR, abs/2012.05756, 2020. Wu, Y., Gy orgy, A., and Szepesv ari, C. Online learning with gaussian payoffs and side observations. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS 15, pp. 1360 1368, Cambridge, MA, USA, 2015. MIT Press. Yun, D., Ahn, S., Proutiere, A., Shin, J., and Yi, Y. Multiarmed bandit with additional observations. ACM SIGMETRICS Performance Evaluation Review, 46:53 55, 06 2018. doi: 10.1145/3292040.3219639. A. ML Estimator and Concentration Bounds: Omitted Proofs A.1. Proof of Lemma 3.1 Lemma 3.1. Let {Zt }t N be a sequence of random variables. We denote by Ft the σ-algebra generated by {Zτ}τ t and by F = (Ft )t N the corresponding filtration. Each random variable Zt is drawn independently from a zeromean sub-Gaussian distribution with variance proxy σ2 t , where σt is an Ft 1-measurable random variable. We define Wt = Pt τ=1 Zτ σ2τ and ζt = Pt τ=1 1 σ2τ . (a) Let ϕ be an F-stopping time which satisfies either ζϕ I for some interval I = [L, H] with H > L > 0, or ϕ = t + 1. Then, we have that Pr |Wϕ| > q 2α ζϕ log t and ϕ t 2 t αL/H. (b) Let ψ be an F-stopping time which satisfies either ζψ r for some r R 0, or ψ = t + 1. Then, for any ϵ > 0, we have that Pr |Wψ| > ζψ ϵ and ψ t 2 exp r ϵ2 Proof. Before we proceed to the main part of the proof, we first show the following auxiliary result: Proposition A.1. Let Gt = exp λWt λ2 ζt 2 I (t t) for some λ R. Then, ( Gt )t N is a super-martingale satisfying E h Gt i 1 for any t N. Proof. In order to show that the sequence ( Gt )t N is a super-martingale satisfying E h Gt i 1 for any t N, we proceed by induction. For the base case where t = t + 1, we have E h Gt+1 | Ft i = 0 Gt. Let us now fix any time t t. Given that σt is Ft 1-measurable and Zt is sub-Gaussian with variance proxy σ2 t , for any λ R we have that E h exp λ Zt λ 2σ2 t 2 Ft 1 i 1. 
Thus, by setting λ = λ σ2 t , we get that E h exp λ Zt Ft 1 i 1. Using that, we have E h Gt | Ft 1 i = E exp λWt λ2 ζt = Gt 1 E exp λZt thus proving that ( Gt )t is a super-martingale. In addition, since ϕ satisfies ϕ t + 1 almost surely, by Doob s optional stopping theorem, we can conclude that E h Gt i 1 for all t N. (a) It suffices to upper-bound the probability that Wϕ > p2α ζϕ log t and ϕ t by t α L H . Notice that the other tail bound follows by symmetry and the factor of 2 in the desired bound results from a union bound over the two tails. By denoting Gt = exp λ(Wt p 2α ζt log t) I (t t) for some λ > 0, we get 2α ζϕ log t and ϕ t = Pr Gϕ 1 E [Gϕ], where the last step above follows by Markov s inequality, given that Gϕ is by construction a non-negative random variable. Thus, in order to complete the proof, it suffices to upper-bound E [Gϕ]. By setting Gt = exp λWt λ2 ζt 2 I (t t), we can rewrite Gt = Gt exp λ2 ζt 2α ζt log t . Now, for t = ϕ, the event that ϕ t implies by definition that ζϕ I and, thus, L ζϕ H. Therefore, by setting λ = 1 H 2αL log t, we get that Gϕ = Gϕ exp 2α ζϕ log t 2αL log t = Gϕ exp α L Finally, by using the fact that E h Gϕ i 1 for any λ > 0, as proved in Proposition A.1, we have that E [Gϕ] H log t = t α L H , which concludes the proof. (b) As in the proof of part (a), we can focus on bounding the probability that Wψ > ζψ ϵ and ψ t, since the other tail bound follows by symmetry. By denoting G t = exp (λ(Wt ζt ϵ)) I (t t) for some λ > 0, we get Pr Wψ > ζψ and ψ t = Pr G ψ 1 E G ψ , where the last step above follows by Markov s inequality. In order to upper-bound E h G ϕ i , by setting Gt = exp λWt λ2 ζt 2 I (t t), we can rewrite G t = Gt exp λ2 ζt By setting λ = ϵ, we get that G t Gt exp ζt ϵ2 2 . Further, if ψ t, then by definition we have that ζψ r, which implies that G ψ Gψ exp r ϵ2 The proof follows by taking expectation in the above expression and using the fact that E h Gψ i 1 for any λ R, as we show in Proposition A.1. A.2. Proof of Lemma 3.2 Lemma 3.2. Let α > 0. For any t 2, for the estimator defined in Equation (1), if ζ(t) 1 (σmin)2 , where σmin = minj [K] σj, then we have that | ˆµ(t) µ| > < 2 log2(t 1) t α/2. Proof. Recall that ζ(t) is defined as ζ(t) = P j [K] Nj(t)/σ2 j and that σmin = minj [K] σj. Without loss of generality, we can assume that σmin > 0, since, otherwise, by collecting any sample with σj = 0, the concentration inequality follows trivially. Recalling that ζ(t) involves t 1 samples, by definition, by time t 2, it is easy to see that ζ(t) 1 (σmin)2 , t 1 (σmin)2 r [ log2(t 1) ] (σmin)2 , 2r and, hence, by using the above decomposition and applying union bound we get that |ˆµ(t) µ| > r [ log2(t 1) ] | ˆµ(t) µ| > ζ(t) and ζ(t) 2r 1 (σmin)2 , 2r r [ log2(t 1) ] Pr | ˆµ(t) µ| > ζ(t) and ζ(t) 2r 1 (σmin)2 , 2r Notice that, by multiplying both sides with ζ(t), the inequality |ˆµ(t) µ| > q ζ(t) is equivalent to Pt 1 τ=1 Xτ µ 2a ζ(t) log t. By setting Zt = Xt µ, we can observe that the sequence {Zt }t t satisfies the conditions of Lemma 3.1. For any fixed r [ log2(t 1) ] let us define I(r) = h 2r 1 (σmin)2 , 2r (σmin)2 i . Hence, for L = 2r 1 (σmin)2 and H = 2r (σmin)2 , then the event that Pt 1 τ=1 Xτ µ 2a ζ(t) log t and ζ(t) I(r) implies that there must exist a stopping-time ϕ where ζϕ I(r) as described in Lemma 3.1 for I = I(r), such that |Wϕ| > p2α ζϕ log ϕ and ϕ t 1. 
Thus, by the result of Lemma 3.1 we get |ˆµ(t) µ| > ζ(t) and ζ(t) I(r) 2a ζ(t) log t and ζ(t) I(r) Pr h |Wϕ| > q 2α ζϕ log ϕ and ϕ t 1 i < 2 t α 2r 1 Therefore, we can conclude that: | ˆµ(t) µ| > < 2 log2(t 1) t α/2. B. Asymptotic Regret Lower Bound: Omitted Proofs B.1. Proof of Proposition 4.1 Proposition 4.1. Let two Gaussian K-armed bandit instances ν and ν with the same side information matrix Σ and mean reward vectors µ and µ , respectively. Let P (resp. P ) be the distribution associated with the interplay of ν (resp. ν ) and a policy π within t rounds. If µ and µ differ only in the reward of arm k, then the KL-divergence of P with respect to P D(P||P ) = X i [K] E ν [Ni(t)] (µk µ k)2 2 σ2 i,k . (5) Proof. Recall that we denote by iτ the arm played at round τ by policy π. Let Pi (resp. P i) be the distribution over the random vector of the observations obtained at any round where arm i is played under the instance ν (resp. ν ). By working along the lines of Lemma 15.1 in (Lattimore & Szepesv ari, 2020), it can be proved that the KL-divergence between distributions P and P satisfies τ=1 E ν D(Piτ ||P iτ ) = i=1 E ν [D(Pi||P i) I (iτ = i)] = i=1 D(Pi||P i) E ν [Ni(t)] . In contrast to Lemma 15.1 in (Lattimore & Szepesv ari, 2020), here Pk, P k are multivariate distributions of K arms. We denote by Pi,j the marginal distribution that corresponds to the reward observation of arm j, when arm i is played. Given that the noisy reward observations are independent for each arm j [K], by using the additivity of the KL-divergence in that case, we have that D(Pi||P i) = PK j=1 D(Pi,j, Pi ,j), which in turn leads to j=1 D(Pi,j, Pi ,j) E ν [Ni(t)] = i=1 E ν [Ni(t)] (µk µ k)2 where the last equation follows by the fact that for any i [K], the distributions Pi,k and Pi ,k are both Gaussian with means µk and µk , respectively, and the same variance σ2 i,j. B.2. Proof of Theorem 4.1 Theorem 4.1. For any environment (µ, Σ), the regret of any consistent policy within T rounds satisfies lim inf T RT (µ) i [K] c i (µ) i(µ). Proof. Let us fix any consistent policy π. Let µ be a mean reward vector and let k be a suboptimal arm in µ. We define vector µ such that ( µi, if i = k µ + ϵ, if i = k (9) for some ϵ > 0. Moreover, let P and P be the measures induced by the interplay of π with the environments (µ, Σ) and (µ , Σ) respectively, and let us denote by E and E the corresponding expectations. For any event E, due to the Bretagnolle-Huber inequality (Tsybakov, 2008), we have that P[E] + P [Ec] 1 2 exp ( D (P || P )) , (10) where D (P || P ) is the KL-divergence of P with respect to P . Let us define the event Q = Nk(T) > T 2 , namely, the event that, by round T OP: or by time T +1?, arm k has being played more than T/2 times. Since k is a suboptimal arm in µ and optimal in µ , for the regret of policy π in the two environments we have that Rπ T (µ) + Rπ T (µ ) T 2 P[Q] k(µ) + T By considering the minimum gap in the above two terms we obtain Rπ T (µ) + Rπ T (µ ) min{ k(µ), ϵ} T 2 (P[Q] + P [Qc]) min{ k(µ), ϵ} T 4 exp ( D (P || P )) min{ k(µ), ϵ} T i [K] E [Ni(t)] ( k(µ) + ϵ)2 where the second inequality follows by Equation (10), while the last inequality follows by Proposition 4.1. By rearranging terms, we obtain σ2 i,k 2 ( k(µ) + ϵ)2 log T min{ k(µ), ϵ} 4(Rπ T (µ) + Rπ T (µ )) We recall that for any consistent policy π, any environment (µ, Σ), and any p > 0, we have that lim T Rπ T (µ) T p = 0. 
Thus, dividing Equation (12) by log T and taking the limit of T , we get lim inf T 1 log T σ2 i,k lim inf T 1 log T 2 ( k(µ) + ϵ)2 log T min{ k(µ), ϵ} 4(Rπ T (µ) + Rπ T (µ )) = 2 ( k(µ) + ϵ)2 lim inf T 1 log T log T log (Rπ T (µ) + Rπ T (µ )) + log min{ k(µ), ϵ} Now, observe that, by Definition 4.1 for consistent policies, we have that lim inf T 1 log T log T log (Rπ T (µ) + Rπ T (µ )) + log min{ k(µ), ϵ} = 1 lim inf T log (Rπ T (µ) + Rπ T (µ )) log T = 1, and, therefore, the bound in Equation (13) becomes i [K] E [Ni(t)] σ2 i,k log t 2 ( k(µ) + ϵ)2 . Notice that the above analysis holds for any suboptimal arm and ϵ > 0 and thus is sufficient to conclude the theorem. To illustrate: Let us consider the reward vector µ and the following constraint set indicated by the above analysis: cj σ2 j,i 2 2 i (µ) i = i (µ) Suppose that the expected number of plays of policy π divided by log t do not satisfy the above constraints, i.e. E[N1(t)] log t , ..., E[NK(t)] log t C(µ). Then, w.l.o.g. we assume that the constraint of C(µ) that is not satisfied is the one that corresponds to the k-th suboptimal arm, that is: E[Nj(t)] σ2 j,k log t < 2 2 k(µ). Then, we can write that P j [K] E[Nj(t)] σ2 j,k = 2 log t ( k(µ)+ϵ )2 for some ϵ > 0. If we consider environments µ and µ as defined in Equation (9) for some ϵ < ϵ , by using Equation (11) we get that: Rπ t (µ) + Rπ t (µ ) t min{ k(µ), ϵ} P i [K] E [Ni(t)] 2 σ2 i,k ( k(µ) + ϵ)2 ! = t min{ k(µ), ϵ} 4 exp 2 log t 2( k(µ) + ϵ )2 ( k(µ) + ϵ)2 = t min{ k(µ), ϵ} which is a contradiction to the fact that policy π is consistent. Finally, by construction of the objective of the LP, we conclude that the optimal solution of the LP characterizes an asymptotic lower bound on the regret of any consistent policy. C. Algorithm and Regret Analysis: Omitted Proofs C.1. Proof of Theorem 5.1 We prove the main result of Section 5: Theorem 5.1. For any α > 4, γ (0, 1), and ϵ > 0, the regret of algorithm Algorithm 1 satisfies RT (µ) 2K + 8K α 4 + 2 max + 2 max K X τ [T ] exp τ γϵ2 j [K] c j(µ, ϵ) log T + K j [K] j(µ)c j(µ, ϵ) log T. Proof. As a first step, we define the following events which correspond to the conditions examined in Cases A and B of Algorithm 1, respectively: 4a log t, ..., NK(t) C(ˆµ(t)) and Bt = min i [K] ζi(t) < 1 K β(ne(t)) . Moreover, we consider the following error events for the ML estimator ˆµ(t): | ˆµi(t) µi| ζi(t) , i [K] and Et = { ˆµ(t) µ ϵ} . Using the above definitions, we can provide an upper bound on the regret accumulated by Algorithm 1 within T rounds through the following decomposition: RT (µ) = Tµ E t [T ] Xit,t t [T ] t(µ) t=K+1 E [ t(µ) (I (Ut c) + I (Ut, At) + I (Ut, At c, Bt) + I (Ut, At c, Bt c, Et) + I (Ut, At c, Bt c, Et c))] . In the rest of this proof, we bound each of the above terms separately. Regret due to Ut c. The regret accumulated due to the event Ut c over T rounds can be upper-bounded by via a union bound over all arms and, then, by applying the concentration result proved in Lemma 3.2. In particular, we obtain that t=K+1 I (Ut c) | ˆµi(t) µi| > t=K+1 2 log2(t 1) t α/2 4K α 4. (15) Therefore, for the rest of the cases we can assume the the event Ut holds for every t K. Regret due to Ut, At. When the events Ut and At hold, we show that the greedy arm selection in Case A of Algorithm 1 leads to the selection of the optimal arm. 
By using the definitions of the events Ut and At, we have that the error in the mean estimate of any arm i [K] which is suboptimal in the vector of estimations ˆµ(t) can be upper-bounded as | ˆµi(t) µi| 8α log t 2 i (ˆµ(t)) = i(ˆµ(t)) 2 , i = i (ˆµ(t)). In particular, this implies that for any arm i = i (ˆµ(t)) (i.e., which is suboptimal with respect to estimates the ˆµ(t)), we have that µi ˆµi(t) i(ˆµ(t)) Similarly, by using the definitions of At and C(ˆµt), for the optimal arm of ˆµ(t), i (ˆµ(t)), we have ˆµi (ˆµ(t))(t) µi (ˆµ(t)) min(ˆµ(t)) By adding Equation (16) and Equation (17) we get µi ˆµi(t) + ˆµi (ˆµ(t))(t) µi (ˆµ(t)) i(ˆµ(t)) 2 + min(ˆµ(t)) 2 i(ˆµ(t)). Now, since i(ˆµ(t)) = ˆµi (ˆµ(t))(t) ˆµi(t) by definition, eliminating the identical terms in the above expression yields µi µi (ˆµ(t)). By repeating the above analysis for each arm i = i (ˆµ(t)), we can conclude that the mean reward of i (ˆµ(t)) satisfies µi (ˆµ(t)) maxi [K] µi, which in turn implies that i (ˆµ(t)) also corresponds to the optimal arm of vector µ. Thus, the rounds where the events Ut and At hold do not contribute to the regret, namely, t(µ) I (Ut, At) = 0. (18) Regret due to Ut, At c, Bt. We focus on the regret accumulated during the rounds where Ut, At c and Bt hold. These correspond to the the rounds where Algorithm 1 attempts to ensure uniform exploration for all arms in terms of their weighted numbers of samples. In order to upper-bound the regret accumulated in the rounds where the events Ut, At c and Bt hold, we relate the number of rounds where At c and Bt hold with the total number of exploration rounds , i.e., the number of times the algorithm enters Case B or C (which happens only if At c holds). We achieve the above via a counting argument, based on the following proposition: Proposition C.1. Let t2 > t1 > K. If Pt2 1 t=t1 I (At c, Bt) K, then mini [K] ζi(t2) mini [K] ζi(t1) + 1 σ2 . In simple terms, Proposition C.1 states that if At c and Bt hold (which corresponds to Case B of Algorithm 1) at least K times within an interval [t1, t2 1], then the minimum weighted number of samples, ζi(t), over all arms i [K] after these rounds, must have increased by at least 1/ σ2. Proof of Proposition C.1. The proposition follows from a pigeonhole argument. Let ℓ1 be the first round in the interval [t1, t2 1], where At c, Bt are satisfied and, hence, Algorithm 1 enters Case B. Let j = argmink [K] ζk(ℓ1) be the arm with minimum weighted number of samples at the beginning of round ℓ1, which was chosen in Case B of Algorithm 1. At round ℓ1, Algorithm 1 plays by construction an arm iℓ1 = arg mink [K] σ2 k,j and, hence, the weighted number of samples of arm j at round ℓ1 + 1 must satisfy ζj(ℓ1 + 1) = ζj(ℓ1) + max k [K] 1 σ2 k,j = ζj(ℓ1) + 1 (σmin j )2 ζj(ℓ1) + 1 σ2 min i [K] ζi(t1) + 1 where the last inequality follows by the fact that ζj(ℓ1) = mini [K] ζi(ℓ1) mini [K] ζi(t1), since the weighted number of samples of any arm can only increase as time progresses. Observe that, if at any round ℓ [ℓ1 + 1, t2 1] arm i is again a minimizer of mink [K] ζk(ℓ), Equation (19) implies that, at round t2, it has to be that mini [K] ζi(t2) ζj(ℓ1) mini [K] ζi(t1)+ 1/ σ2, in which case the proposition follows directly. Thus, we can assume that for any round ℓ [ℓ1 + 1, t2 1] only arms in [K] \ {j} are minimizers of mink [K] ζk(ℓ). By repeatedly applying the above argument, it follows that each arm in [K] can be the minimizer of mink [K] ζk(ℓ) for at most one round ℓ [t1, t2 1]. 
Lemma 5.1. The following inequality holds:
$$\sum_{t=K+1}^{T}\mathbb{I}(A_t^c, B_t) \;\le\; 1 + \frac{1}{2}\left(\sum_{t=K+1}^{T}\mathbb{I}(A_t^c)\right)^{\!\gamma}.$$

Proof. Let $t^*$, with $K+1\le t^*\le T$, be the last round where the events $A_t^c, B_t$ hold, that is,
$$t^* = \max\big\{t\in\mathbb{N} \,:\, \mathbb{I}(A_t^c, B_t)=1 \text{ and } K+1\le t\le T\big\}.$$
Due to Proposition C.1, the minimum weighted number of samples over all arms $i\in[K]$ at time $t^*$ satisfies
$$\min_{i\in[K]}\zeta_i(t^*) \;\ge\; \frac{1}{\bar\sigma^2 K}\sum_{t=K+1}^{t^*-1}\mathbb{I}(A_t^c, B_t),$$
since the above minimum increases by at least $1/\bar\sigma^2$ every $K$ times the events $A_t^c, B_t$ occur. By rearranging terms in the above inequality and using the fact that, when $A_t^c, B_t$ hold, we have $\min_{i\in[K]}\zeta_i(t) < \frac{1}{K}\beta(n_e(t))$, we obtain the following upper bound on the number of times the events $A_t^c, B_t$ can occur over $T$ rounds:
$$\sum_{t=K+1}^{T}\mathbb{I}(A_t^c, B_t) = \sum_{t=K+1}^{t^*-1}\mathbb{I}(A_t^c, B_t) + 1 \le K\bar\sigma^2\min_{i\in[K]}\zeta_i(t^*) + 1 < K\bar\sigma^2\cdot\frac{1}{K}\,\beta(n_e(t^*)) + 1 = \bar\sigma^2\,\beta(n_e(t^*)) + 1. \quad (20)$$
Now, using that $n_e(t^*) = \sum_{t=K+1}^{t^*-1}\mathbb{I}(A_t^c)$, the definition $\beta(x) = \frac{x^\gamma}{2\bar\sigma^2}$, and the fact that $t^*\le T$, we have
$$(20) \;=\; \bar\sigma^2\,\beta\!\left(\sum_{t=K+1}^{t^*-1}\mathbb{I}(A_t^c)\right) + 1 \;\le\; \bar\sigma^2\,\beta\!\left(\sum_{t=K+1}^{T}\mathbb{I}(A_t^c)\right) + 1 \;=\; \frac{1}{2}\left(\sum_{t=K+1}^{T}\mathbb{I}(A_t^c)\right)^{\!\gamma} + 1,$$
which concludes the proof.

Equipped with Lemma 5.1 and using a decomposition of the event $A_t^c$, we can bound the term of the regret depending on $\mathbb{I}(U_t, A_t^c, B_t)$ as follows:
$$\sum_{t=K+1}^{T}\mathbb{I}(U_t, A_t^c, B_t) \le \sum_{t=K+1}^{T}\mathbb{I}(A_t^c, B_t) \le 1 + \frac{1}{2}\left(\sum_{t=K+1}^{T}\big[\mathbb{I}(U_t^c, A_t^c) + \mathbb{I}(U_t, A_t^c, B_t) + \mathbb{I}(U_t, A_t^c, B_t^c, E_t) + \mathbb{I}(U_t, A_t^c, B_t^c, E_t^c)\big]\right)^{\!\gamma} \le 1 + \frac{1}{2}\sum_{t=K+1}^{T}\big(\mathbb{I}(U_t^c) + \mathbb{I}(U_t, A_t^c, B_t) + \mathbb{I}(U_t, A_t^c, B_t^c, E_t^c)\big) + \frac{1}{2}\left(\sum_{t=K+1}^{T}\mathbb{I}(U_t, A_t^c, B_t^c, E_t)\right)^{\!\gamma},$$
where the last inequality comes from the fact that the function $f(x) = x^\gamma$ is subadditive for any $\gamma\in(0,1)$, and that $x^\gamma\le x$ for any non-negative integer $x$. By rearranging and grouping terms, we obtain the following bound:
$$\sum_{t=K+1}^{T}\mathbb{I}(U_t, A_t^c, B_t) \;\le\; 2 + \sum_{t=K+1}^{T}\big(\mathbb{I}(U_t^c) + \mathbb{I}(U_t, A_t^c, B_t^c, E_t^c)\big) + \left(\sum_{t=K+1}^{T}\mathbb{I}(U_t, A_t^c, B_t^c, E_t)\right)^{\!\gamma}. \quad (21)$$
Thus, in order to complete the bound in Equation (21), and to handle the remaining cases of the decomposition, we need to bound the regret accumulated due to the terms $\mathbb{I}(U_t, A_t^c, B_t^c, E_t)$ and $\mathbb{I}(U_t, A_t^c, B_t^c, E_t^c)$.

Regret due to $U_t, A_t^c, B_t^c, E_t$. In rounds where the events $A_t^c$ and $B_t^c$ hold and, thus, the constraints of the estimated LP $\mathcal{C}(\hat\mu(t))$ are not satisfied, Algorithm 1 plays an arm $j\in[K]$ that violates its corresponding constraint, i.e., $\frac{N_j(t)}{4\alpha\log t} < c^*_j(\hat\mu(t))$. In addition, recall that the event $E_t$ implies that $\|\hat\mu(t)-\mu\|_\infty\le\epsilon$, namely, all mean estimates are $\epsilon$-close to the true mean rewards. In that case, for any arm $j\in[K]$, we can control the number of such plays using the worst-case $\epsilon$-approximate LP solution introduced in Definition 5.1:
$$c^*_j(\mu,\epsilon) \;=\; \sup_{\mu'\,:\,\|\mu'-\mu\|_\infty\le\epsilon} c^*_j(\mu').$$

Lemma 5.2. For every arm $j\in[K]$, it holds that
$$\sum_{t=K+1}^{T}\mathbb{I}(U_t, A_t^c, B_t^c, E_t, i_t=j) \;\le\; 4\alpha\, c^*_j(\mu,\epsilon)\log T.$$

Proof. Let $t^*\le T$ be the last round where $\{U_t, A_t^c, B_t^c, E_t\}$ holds and the algorithm plays arm $j$. Then we have
$$\sum_{t=K+1}^{T}\mathbb{I}(U_t, A_t^c, B_t^c, E_t, i_t=j) = \sum_{t=K+1}^{t^*}\mathbb{I}(U_t, A_t^c, B_t^c, E_t, i_t=j) \le N_j(t^*) \le 4\alpha\, c^*_j(\hat\mu(t^*))\log t^* \le 4\alpha\, c^*_j(\mu,\epsilon)\log T,$$
where the second inequality comes from the fact that $\frac{N_j(t^*)}{4\alpha\log t^*} < c^*_j(\hat\mu(t^*))$, by the definition of $j$ as an arm whose constraint is violated at round $t^*$. The last inequality follows from $t^*\le T$ and Definition 5.1, combined with the fact that $\|\hat\mu(t^*)-\mu\|_\infty\le\epsilon$.
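The worst-case quantity $c^*_j(\mu,\epsilon)$ of Definition 5.1 is a supremum over an $\ell_\infty$-ball and generally has no closed form. The following rough sketch (ours, building on the hypothetical solve_lower_bound_lp helper introduced earlier) estimates it by random search over perturbed mean vectors; it only yields a lower estimate of the true supremum and assumes $\epsilon$ is small relative to the minimum gap, since otherwise the supremum can blow up.

```python
# Rough numerical stand-in (ours) for c*(mu, eps) = sup over ||mu'-mu||_inf <= eps of c*(mu').
# Assumes solve_lower_bound_lp from the earlier sketch is in scope.
import numpy as np

def approx_eps_lp(mu, sigma, eps, n_samples=2000, seed=0):
    """Coordinate-wise lower estimate of c*(mu, eps) by sampling perturbations
    of mu inside the infinity-ball of radius eps."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    best = solve_lower_bound_lp(mu, sigma)        # start from mu itself
    for _ in range(n_samples):
        mu_pert = mu + rng.uniform(-eps, eps, size=len(mu))
        best = np.maximum(best, solve_lower_bound_lp(mu_pert, sigma))
    return best
```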
By applying Lemma 5.2 to every arm $j\in[K]$, we can bound the term of the regret depending on $\mathbb{I}(U_t, A_t^c, B_t^c, E_t)$ as follows:
$$\sum_{t=K+1}^{T}\mathbb{E}\big[\Delta_{i_t}(\mu)\,\mathbb{I}(U_t, A_t^c, B_t^c, E_t)\big] \le \sum_{j\in[K]}\Delta_j(\mu)\,\mathbb{E}\left[\sum_{t=K+1}^{T}\mathbb{I}(U_t, A_t^c, B_t^c, E_t, i_t=j)\right] \le 4\alpha\sum_{j\in[K]}\Delta_j(\mu)\,c^*_j(\mu,\epsilon)\log T. \quad (22)$$

Regret due to $U_t, A_t^c, B_t^c, E_t^c$. We can bound the term of the regret due to the events $\{U_t, A_t^c, B_t^c, E_t^c\}$ as follows:
$$\sum_{t=K+1}^{T}\mathbb{E}\big[\Delta_{i_t}(\mu)\,\mathbb{I}(U_t, A_t^c, B_t^c, E_t^c)\big] \le \Delta_{\max}\sum_{t=K+1}^{T}\mathbb{E}\Big[\mathbb{I}\big(U_t, A_t^c, B_t^c,\ \exists j\in[K]: |\hat\mu_j(t)-\mu_j|>\epsilon\big)\Big] \le \Delta_{\max}\sum_{j\in[K]}\sum_{t=K+1}^{T}\mathbb{E}\Big[\mathbb{I}\big(U_t, A_t^c, B_t^c,\ |\hat\mu_j(t)-\mu_j|>\epsilon\big)\Big], \quad (23)$$
where the last inequality follows by a union bound over the arms $j\in[K]$ satisfying $|\hat\mu_j(t)-\mu_j|>\epsilon$. For an easier manipulation of the bound in Equation (23), let us define the following sets:
$$\Lambda = \left\{t\in[T] \,:\, U_t,\ A_t^c,\ \min_{i\in[K]}\zeta_i(t)\ge\frac{1}{K}\,\beta(n_e(t))\right\}$$
and the more refined
$$\Lambda(\tau) = \left\{t\in[T] \,:\, U_t,\ A_t^c,\ \min_{i\in[K]}\zeta_i(t)\ge\frac{1}{K}\,\beta(\tau),\ n_e(t)=\tau\right\}.$$
Note that if $t\in\Lambda(\tau)$, then $\zeta_j(t)\ge\frac{1}{K}\beta(\tau)$ for every arm $j\in[K]$, and that $|\Lambda(\tau)|\le 1$ for any $\tau$, by construction. In addition, $\Lambda\subseteq\bigcup_{\tau\in[T]}\Lambda(\tau)$, and the events $\{U_t, A_t^c, B_t^c\}$ holding at round $t$ means precisely that $t\in\Lambda$. Then, let $\phi_\tau$ be the stopping time such that $\phi_\tau = t$ if $\Lambda(\tau)=\{t\}$, and $\phi_\tau = T+1$ otherwise. For any fixed arm $j\in[K]$, we can write
$$\mathbb{E}\left[\sum_{t\in[T]}\mathbb{I}\big(t\in\Lambda,\ |\hat\mu_j(t)-\mu_j|>\epsilon\big)\right] \le \sum_{\tau\in[T]}\Pr\big[\phi_\tau\le T,\ |\hat\mu_j(\phi_\tau)-\mu_j|>\epsilon\big] \le \sum_{\tau\in[T]} 2\exp\!\left(-\frac{\epsilon^2}{2}\cdot\frac{\beta(\tau)}{K}\right),$$
where in the last inequality we apply the result of Lemma 3.1. Therefore, Equation (23) can be bounded as
$$(23) \;\le\; \Delta_{\max}\sum_{j\in[K]}\sum_{\tau\in[T]} 2\exp\!\left(-\frac{\beta(\tau)\,\epsilon^2}{2K}\right) \;=\; 2\Delta_{\max}K\sum_{\tau\in[T]}\exp\!\left(-\frac{\tau^\gamma\epsilon^2}{4K\bar\sigma^2}\right), \quad (24)$$
where in the last equality we used the definition of the function $\beta(\cdot)$.

Combining everything. By combining the bounds in Equations (14), (15), (18), (21), (22), and (24), and taking expectations, we obtain the regret upper bound claimed in the statement of Theorem 5.1, which concludes the proof.

C.2. Proof of Corollary 5.1

Corollary 5.1. As the time horizon tends to infinity, the regret of Algorithm 1 satisfies
$$\limsup_{T\to\infty}\frac{R_T(\mu)}{\log T} \;\le\; 4\alpha\sum_{j\in[K]}\Delta_j(\mu)\,c^*_j(\mu).$$

Proof. Recall that, by Theorem 5.1,
$$R_T(\mu) \;\le\; 2K\Delta_{\max} + \frac{8K\Delta_{\max}}{\alpha-4} + 2\Delta_{\max} + 2\Delta_{\max}K\sum_{\tau\in[T]}\exp\!\left(-\frac{\tau^\gamma\epsilon^2}{4K\bar\sigma^2}\right) + \Delta_{\max}\left(4\alpha\sum_{j\in[K]}c^*_j(\mu,\epsilon)\log T + K\right)^{\!\gamma} + 4\alpha\sum_{j\in[K]}\Delta_j(\mu)\,c^*_j(\mu,\epsilon)\log T.$$
Observe that for any constants $\epsilon, \bar\sigma > 0$ and $\gamma\in(0,1)$, we have $\lim_{T\to\infty}\sum_{\tau\in[T]}\exp\!\big(-\frac{\tau^\gamma\epsilon^2}{4K\bar\sigma^2}\big) < \infty$. We emphasize that, while $\bar\sigma$ is a fixed (possibly arbitrarily large) parameter of the problem instance, $\epsilon$ is only a component of the analysis. Moreover, notice that we are not allowed to directly set $\epsilon\to 0$ in the above regret upper bound, since that would make the term $\sum_{\tau\in[T]}\exp\!\big(-\frac{\tau^\gamma\epsilon^2}{4K\bar\sigma^2}\big)$ linear in $T$. However, if we choose $\epsilon$ to be a decreasing function of $T$, then the suboptimal terms in Theorem 5.1 vanish, after dividing by $\log T$, as $T\to\infty$. In particular, choosing $\epsilon = (4K\bar\sigma^2)^{1/2}(\log T)^{-\gamma/4}$ gives
$$\sum_{\tau\in[T]}\exp\!\left(-\frac{\tau^\gamma\epsilon^2}{4K\bar\sigma^2}\right) = \sum_{\tau\in[T]}\exp\!\left(-\Big(\frac{\tau}{(\log T)^{1/2}}\Big)^{\!\gamma}\right) = \sum_{\tau\le\lceil(\log T)^{1/2}\rceil}\exp\!\left(-\Big(\frac{\tau}{(\log T)^{1/2}}\Big)^{\!\gamma}\right) + \sum_{\lceil(\log T)^{1/2}\rceil<\tau\le T}\exp\!\left(-\Big(\frac{\tau}{(\log T)^{1/2}}\Big)^{\!\gamma}\right) \le \lceil(\log T)^{1/2}\rceil + (\log T)^{1/2}\int_{0}^{\infty}e^{-u^\gamma}\,du,$$
where the second sum is compared with the corresponding integral. Since $\int_0^\infty e^{-u^\gamma}\,du = \Gamma(1+1/\gamma) < \infty$, both terms are $O\big((\log T)^{1/2}\big)$ and hence $o(\log T)$. Furthermore, by subadditivity of $x\mapsto x^\gamma$ for $\gamma\in(0,1)$,
$$\left(4\alpha\sum_{j\in[K]}c^*_j(\mu,\epsilon)\log T + K\right)^{\!\gamma} \le \left(4\alpha\sum_{j\in[K]}c^*_j(\mu,\epsilon)\log T\right)^{\!\gamma} + K^\gamma,$$
and thus this term, when divided by $\log T$, also vanishes asymptotically as $T\to\infty$. Finally, for the value $\epsilon = (4K\bar\sigma^2)^{1/2}(\log T)^{-\gamma/4}$ selected above, we have $\epsilon\to 0$ as $T\to\infty$, and thus the leading term satisfies
$$\limsup_{T\to\infty}\frac{4\alpha\sum_{j\in[K]}\Delta_j(\mu)\,c^*_j(\mu,\epsilon)\log T}{\log T} = \limsup_{T\to\infty}\, 4\alpha\sum_{j\in[K]}\Delta_j(\mu)\,c^*_j(\mu,\epsilon) = 4\alpha\sum_{j\in[K]}\Delta_j(\mu)\,c^*_j(\mu),$$
by Definition 5.1 of the $\epsilon$-approximate LP solution. Therefore, we conclude that
$$\limsup_{T\to\infty}\frac{R_T(\mu)}{\log T} \;\le\; 4\alpha\sum_{j\in[K]}\Delta_j(\mu)\,c^*_j(\mu).$$
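As a small usage example (again relying on the hypothetical solve_lower_bound_lp sketch from above, with made-up numbers), one can estimate the asymptotic slope $4\alpha\sum_{j\in[K]}\Delta_j(\mu)\,c^*_j(\mu)$ of Corollary 5.1 for a toy three-armed instance:

```python
# Toy usage of the earlier solve_lower_bound_lp sketch (ours): estimate the
# asymptotic leading constant 4 * alpha * sum_j Delta_j(mu) * c*_j(mu)
# for a small, made-up 3-armed instance.
import numpy as np

mu = np.array([1.0, 0.7, 0.4])
sigma = np.array([[1.0, 2.0, 5.0],        # playing arm 0 reveals arm 1 coarsely, arm 2 barely
                  [2.0, 1.0, 1.5],
                  [5.0, 1.5, 1.0]])
alpha = 4.5                               # any alpha > 4 is admissible in Theorem 5.1

c_star = solve_lower_bound_lp(mu, sigma)  # hypothetical helper from the sketch above
gaps = mu.max() - mu
leading_constant = 4 * alpha * float(gaps @ c_star)
print("c*(mu) =", np.round(c_star, 3))
print("asymptotic regret slope (per log T):", round(leading_constant, 3))
```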