# Nested Bandits

Matthieu Martin 1, Panayotis Mertikopoulos 1,2, Thibaud Rahier 1, Houssam Zenati 1,3

All authors in alphabetical order. 1 Criteo AI Lab. 2 Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France. 3 Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. Correspondence to: Panayotis Mertikopoulos. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract. In many online decision processes, the optimizing agent is called to choose between large numbers of alternatives with many inherent similarities; in turn, these similarities imply closely correlated losses that may confound standard discrete choice models and bandit algorithms. We study this question in the context of nested bandits, a class of adversarial multi-armed bandit problems where the learner seeks to minimize their regret in the presence of a large number of distinct alternatives with a hierarchy of embedded (non-combinatorial) similarities. In this setting, optimal algorithms based on the exponential weights blueprint (like Hedge, EXP3, and their variants) may incur significant regret because they tend to spend excessive amounts of time exploring irrelevant alternatives with similar, suboptimal costs. To account for this, we propose a nested exponential weights (NEW) algorithm that performs a layered exploration of the learner's set of alternatives based on a nested, step-by-step selection method. In so doing, we obtain a series of tight bounds for the learner's regret showing that online learning problems with a high degree of similarity between alternatives can be resolved efficiently, without a red bus / blue bus paradox occurring.

1. Introduction

Consider the following discrete choice problem (known as the red bus / blue bus paradox in the context of transportation economics). A commuter has a choice between taking a car or a bus to work: commuting by car takes on average half an hour modulo random fluctuations, whereas commuting by bus takes an hour, again modulo random fluctuations (it's a long commute). Then, under the classical multinomial logit choice model for action selection [20, 21], the commuter's odds of selecting a car over a bus would be exp(-1/2)/exp(-1) ≈ 1.6 : 1. This indicates a very clear preference for taking a car to work and is commensurate with the fact that, on average, commuting by bus takes twice as long.

Consider now the same model, but with a twist. The company operating the bus network purchases a fleet of new buses that are otherwise completely identical to the existing ones, except for their color: the old buses are red, the new buses are blue. This change has absolutely no effect on the travel time of the bus; however, since the new set of alternatives presented to the commuter is {car, red bus, blue bus}, the odds of selecting a car over a bus (red or blue, it doesn't matter) now drop to exp(-1/2)/[exp(-1) + exp(-1)] ≈ 0.8 : 1. Thus, by introducing an irrelevant feature (the color of the bus), the odds of selecting the alternative with the highest utility have dropped dramatically, to the extent that commuting by car is no longer the most probable outcome in this example. Of course, the shift in choice probabilities may not always be this dramatic, but the point of the example is that the presence of an irrelevant alternative (the blue bus) always induces such a shift, which is, of course, absurd.
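For readers who want to check the arithmetic, the snippet below reproduces the two softmax computations of the example. It is a minimal numerical illustration, not part of the paper's machinery: the utilities -1/2 and -1 are the negated travel times from the example, and the helper name logit_choice is ours.

```python
import math

def logit_choice(utilities):
    """Multinomial logit choice probabilities: P(a) is proportional to exp(u_a)."""
    weights = {a: math.exp(u) for a, u in utilities.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# Original choice set: car (utility -1/2, i.e. half an hour) vs. bus (utility -1).
before = logit_choice({"car": -0.5, "bus": -1.0})
# After adding the "irrelevant" blue bus: two bus alternatives with identical utility.
after = logit_choice({"car": -0.5, "red bus": -1.0, "blue bus": -1.0})

print(before["car"] / before["bus"])                          # ~1.65, i.e. odds of about 1.6 : 1
print(after["car"] / (after["red bus"] + after["blue bus"]))  # ~0.82, i.e. odds of about 0.8 : 1
```

Note that the pairwise odds of the car against any single bus are unchanged, which is exactly the IIA property; the paradox arises because the aggregate probability of the bus category doubles.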
In fact, the red bus / blue bus paradox was originally proposed as a sharp criticism of the independence from irrelevant alternatives (IIA) axiom that underlies the multinomial logit choice model [20] and which makes it unsuitable for choice problems with inherent similarities between different alternatives. In turn, this has led to a vast corpus of literature in social choice and decision theory, with an extensive array of different axioms and models proposed to overcome the failures of the IIA assumption. For an introduction to the topic, we refer the reader to the masterful accounts of McFadden [21], Ben-Akiva & Lerman [7], and Anderson et al. [2].

Perhaps surprisingly, the implications of the red bus / blue bus paradox have not been explored in the context of online learning, despite the fact that similarities between alternatives are prevalent in the field's application domains: for example, in recommender systems with categorized product recommendation catalogues, in the economics of transport and product differentiation, etc. What makes this gap particularly pronounced is the fact that logit choice underlies some of the most widely used algorithmic schemes for learning in multi-armed bandit problems, namely the exponential weights algorithm for exploration and exploitation (EXP3) [4, 19, 29] as well as its variants, Hedge [5], EXP3.P [6], EXP3-IX [17], EXP4 [6] / EXP4-IX [23], etc. Thus, given the vulnerability of logit choice to irrelevant alternatives, it stands to reason that said algorithms may be suboptimal when faced with a set of alternatives with many inherent similarities.

Our contributions. Our paper examines this question in the context of repeated decision problems where a learner seeks to minimize their regret in the presence of a large number of distinct alternatives with a hierarchy of embedded (non-combinatorial) similarities. This similarity structure, which we formalize in Section 2, is defined in terms of a nested series of attributes (like type or color) and induces commensurate similarities in the losses of alternatives that lie in the same class (just as the red and blue buses have identical losses in the example described above). Inspired by the nested logit choice model introduced by McFadden [21] to resolve the original red bus / blue bus paradox, we develop in Section 3 a nested exponential weights (NEW) algorithm for no-regret learning in decision problems of this type. Our main result is that the regret incurred by NEW is bounded as O(√(n_eff log n · T)), where n is the total number of alternatives and n_eff is the effective number of alternatives once similarities are taken into account (for example, in the standard red bus / blue bus paradox, n_eff = 2; cf. Section 4). The gap between nested and non-nested algorithms can be quantified by the problem's price of affinity (PoAf), defined here as the ratio α = √(n/n_eff) measuring the worst-case ratio between the regret guarantees of the NEW and EXP3 algorithms (the latter scaling as O(√(n log n · T)) in the problem at hand). In practical applications (such as the type of recommendation problems that arise in online advertising), α can be exponential in the number of attributes, indicating that the NEW algorithm could lead to significant performance gains in this context. We verify that this is indeed the case in a range of synthetic experiments in Section 5.

Related Work. The problem of exploiting the structure of the loss model and/or any side information available to the learner is a staple of the bandit literature.
More precisely, in the setting of contextual bandits, the learner is assumed to observe some context-based information and tries to learn the context to reward mapping underlying the model in order to make better predictions. Bandit algorithms of this type like EXP4 are often studied as expert models [6, 11] or attempt to model the agent s loss function with a semiparametric contextual dependency in the stochastic setting to derive optimistic action selection rules [1]; for a survey, we refer the reader to [18] and references therein. While the nested bandit model we study assumes an additional layer of information relative to standard bandit models, there are no experts or a contextual mapping conditioning the action taken, so it is not comparable to the contextual setup. The type of feedback we consider assumes that the learner observes the intra-class losses of their chosen alternative, similar to the semi-bandit in the study of combinatorial bandit algorithms [12, 15]. However, the similarity with combinatorial bandit models ends there: even though the categorization of alternatives gives rise to a tree structure with losses obtained at its leaves, there is no combinatorial structure defining these costs, and modeling this as a combinatorial bandit would lead to the same number of arms and ground elements, thus invalidating the concept. Besides these major threads in the literature, [28] recently showed that the range of losses can be exploited with an additional free observation, while [13] improves the regret guarantees by using effective loss estimates. However, both works are susceptible to the advent of irrelevant alternatives and can incur significant regret when faced with such a problem. Finally, in the Lipschitz bandit setting, [14, 16] obtain order-optimal regret bounds by building a hierarchical covering model in the spirit of [10]; the correlations induced by a Lipschitz loss model cannot be compared to our model, so there is no overlap of techniques or results. 2. The general model We begin in this section by defining our general nested choice model. Because the technical details involved can become cumbersome at times, it will help to keep in mind the running example of a music catalogue where songs are classified by, say, genre (classical music, jazz, rock,...), artist (Rachmaninov, Miles Davis, Led Zeppelin,...), and album. This is a simple but not simplistic use case which requires the full capacity of our model, so we will use it as our go-to example throughout. 2.1. Attributes, classes, and the relations between them Let A = {ai : i = 1, . . . , n} be a set of alternatives (or atoms) indexed by i = 1, . . . , n. A similarity structure (or structure of attributes) on A is defined as a tower of nested similarity partitions (or attributes) Sℓ, ℓ= 0, . . . , L, of A with {A} =: S0 S1 SL := {{a} : a A}. As a result of this definition, each partition Sℓcaptures successively finer attributes of the elements of A (in our music catalogue example, these attributes would correspond to genre, artist, album, etc.).1 Accordingly, each constituent 1The trivial partitions S0 = {A} and SL = {{a} : a A} do not carry much information in themselves, but they are included for completeness and notational convenience later on. 
Nested Bandits set A of a partition Sℓwill be referred to as a similarity class and we assume it collects all elements of A that share the attribute defining Sℓ: for example, a similarity class for the attribute artist might consist of all Beethoven symphonies, all songs by Led Zeppelin, etc. Collectively, a structure of attributes will be represented by the disjoint union ℓ=0{(A, ℓ) : A Sℓ} (1) of all class/attribute pairs of the form (A, ℓ) for A Sℓ. In a slight abuse of terminology (and when there is no danger of confusion), the pair S = (A, ℓ) will also be referred to as a class , and we will write S Sℓand a S instead of A Sℓand a A respectively. By contrast, when we need to clearly distinguish between a class and its underlying set, we will write A = elem(S) for the set of atoms contained in S and ℓ= attr(S) for the attached attribute label. Remark 1. The reason for including the attribute label ℓin the definition of S is that a set of alternatives may appear in different partitions of A in a different context. For example, if IV is the only album by Led Zeppelin in the catalogue, the album s track list represents both the set of all songs in IV as well as the set of all Led Zeppelin songs . However, the focal attribute in each case is different artist in the former versus album in the latter and this additional information would be lost in the non-discriminating union SL ℓ=0 Sℓ(unless, of course, the partitions Sℓhappen to be mutually disjoint, in which case the distinction between union and disjoint union becomes set-theoretically superfluous). Moving forward, if a class S Sℓcontains the class S Sk for some k > ℓ, we will say that S is a descendant of S (resp. S is an ancestor of S ), and we will write S S (resp. S S ).2 As a special case of this relation, if S S and k = ℓ+ 1, we will say that S is a child of S (resp. S is parent of S ) and we will write S S (resp. S S ). For completeness, we will also say that S and S are siblings if they are children of the same parent, and we will write S S in this case. Finally, when we wish to focus on descendants sharing a certain attribute, we will write S ℓS as shorthand for the predicate S S and attr(S ) = ℓ . Building on this, a similarity structure on A can also be represented graphically as a rooted directed tree an arborescence by connecting two classes S, S S with a directed edge S S whenever S S . By construction, 2More formally, we will write S S when elem(S ) elem(S) and attr(S ) > attr(S). The corresponding weak relation is defined in the standard way, i.e., allowing for the case attr(S ) = attr(S) which in turn implies that S = S. S2 2 S1 2 S3 2 S0 3 S1 3 S2 3 S3 3 S4 3 S5 3 S6 3 S7 3 {a1} {a2} {a3} {a4} {a5} {a6} {a7} {a8} Figure 1: A structure with L = 3 attributes on the set A = {a1, . . . , a8}; for example, the class S1 2 consists of {a3, a4}. the root of this tree is A itself,3 and the unique directed path A S0 S1 Sℓ S from A to any class S S will be referred to as the lineage of S. For notational simplicity, we will not distinguish between S and its graphical representation, and we will use the two interchangeably; for an illustration, see Fig. 1. 2.2. The loss model Throughout what follows, we will consider loss models in which alternatives that share a common set of attributes incur similar costs, with the degree of similarity depending on the number of shared attributes. 
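Before specifying the loss model, it may help to see the similarity structure of Section 2.1 in code. The sketch below is a minimal illustration, assuming a toy music catalogue in the spirit of the running example; the class name SimClass, its fields, and the helper lineage are our own and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SimClass:
    """A similarity class S: a label, its attribute level attr(S), and tree pointers."""
    name: str
    attr: int                                   # 0 = the root class A, L = singleton leaves {a}
    parent: Optional["SimClass"] = None
    children: list = field(default_factory=list)

    def add_child(self, name: str) -> "SimClass":
        child = SimClass(name, self.attr + 1, parent=self)
        self.children.append(child)
        return child

    def lineage(self):
        """The unique directed path A = S_0, S_1, ..., S from the root down to this class."""
        node, path = self, []
        while node is not None:
            path.append(node)
            node = node.parent
        return list(reversed(path))

# A toy catalogue: genre at level 1, artist at level 2, individual songs as singleton leaves.
catalogue = SimClass("A", attr=0)
jazz = catalogue.add_child("jazz")
rock = catalogue.add_child("rock")
davis = jazz.add_child("Miles Davis")
so_what = davis.add_child("So What")            # a leaf, i.e. the singleton class {a}

print([s.name for s in so_what.lineage()])      # ['A', 'jazz', 'Miles Davis', 'So What']
```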
More precisely, given a similarity class S S, we will assume that all its immediate subclasses S share the same base cost c S (determined by the parent class S) plus an idiosyncratic cost increment r S (which is specific to the child S S in question). Formally, starting with c A = 0 (for the root class A), this boils down to the recursive definition c S = c S + r S for all S S, (2) which, when unrolled over the lineage A S0 S1 Sℓ S of a target class S Sℓ, yields the expression S S r S = r S1 + + r Sℓ. (3) Thus, in particular, when S a A, the cost assigned to an individual alternative a A will be given by ℓ=1 r Sℓ= X S a r S for all a A. (4) Finally, to quantify the intra-class variability of costs, we will assume throughout that the idiosyncratic cost increments within a given parent class S are bounded as r S [0, RS] for all S S. (5) 3Stricto sensu, the root of the tree is (A, 0), but since there is no danger of confusion, the attribute label 0 will be dropped. Nested Bandits This terminology is justified by the fact that, under the loss model (2), the costs c S , c S to any two sibling classes S , S S (i.e., any two classes parented by S) differ by at most RS. Analogously, the costs to any two alternatives a, a A that share a set of common attributes S1, . . . , Sℓ will differ by at most PL k=ℓ+1 RSk. Example 1. To represent the original red bus / blue bus problem as an instance of the above framework, let S1 = {{red bus, blue bus}, car} be the partition of the set A = {red bus, blue bus, car} by type ( bus or car ), and let S2 be the corresponding sub-partition by color ( red or blue for elements of the class bus ). The fact that color does not affect travel times may then be represented succinctly by taking Rcolor = 0 (representing the fact that color does not affect travel times). Remark 2. We make no distinction here between ca and c{a}, i.e., between an alternative a of A and the (unique) singleton class of {a} SL containing it. This is done purely for reasons of notational convenience. Remark 3. For posterity, we also note that the optimizing agent is assumed to be aware of the cost decomposition (4) after selecting an alternative a A. In the context of combinatorial bandits [12] this would correspond to the so-called semi-bandit setting. 2.3. Sequence of events With all this in hand, we will consider a generic online decision process that unfolds over a set of alternatives A endowed with a similarity structure S = ℓSℓas follows: 1. At each stage t = 1, 2, . . . , the learner selects an alternative at A by selecting attributes from S one-by-one. 2. Concurrently, nature sets the idiosyncratic, intra-class losses r S,t for each similarity class S S. 3. The learner incurs r S,t for each chosen class S at for a total cost of ct = P S at r S,t, and the process repeats. To align our presentation with standard bandit models with losses in [0, 1], we will assume throughout that P S a RS 1 for all a A, meaning in particular that the maximal cost incurred by any alternative a A is upper bounded by 1. Other than this normalization, the sequence of idiosyncratic loss vectors rt RS, t = 1, 2, . . . , is assumed arbitrary and unknown to the learner as per the standard adversarial setting [11, 26]. To avoid deterministic strategies that could be exploited by an adversary, we will assume that the learner selects an alternative at at time t based on a mixed strategy xt (A), i.e., at xt. The regret of a policy xt, t = 1, 2, . . . 
, against a benchmark strategy p (A) is then defined as the cumulative difference between the player s mean cost under p and xt, that is t=1 [Ext[cat,t] Ep[cat,t]] = t=1 ct, xt p (6) where ct = (ca,t)a A RA denotes the vector of costs encountered by the learner at time t, i.e., ca,t = P S a r S,t for all a A. This definition will be our main figure of merit in the sequel. 3. The nested exponential weights algorithm Our goal in what follows will be to design a learning policy capable of exploiting the type of similarity structures introduced in the previous section. The main ingredients of our method are a nested attribute selection and cost estimation rule, which we describe in detail in Sections 3.1 and 3.2 respectively; the proposed nested exponential weights (NEW) algorithm is then developed and discussed in Section 3.3. 3.1. Probabilities, propensities, and nested logit choice We begin by introducing the attribute selection scheme that forms the backbone of our proposed policy. Our guiding principle in this is the nested logit choice (NLC) rule of Mc Fadden [21] which selects an alternative a A by traversing S one attribute at a time and prescribing the corresponding conditional choice probabilities at each level of S. To set the stage for all this, if x = (x1, . . . , xn) (A) is a mixed strategy on A we will write for the probability of choosing S S under x, and x S |S = x S /x S (8) for the conditional probability of choosing a descendant S of S assuming that S has already been selected under x.4 Then the NLC rule proceeds as follows: first, it prescribes choice probabilities x S1 for all classes S1 S1 (i.e., the coarsest ones); subsequently, once a class S1 S1 has been selected, NLC prescribes the conditional choice probabilities x S2|S1 for all children S2 of S1 and draws a class from S2 based on x S2|S1. The process then continues downwards along S until reaching the finest partition SL and selecting an atom {a} SL SL 1 S0 A. This step-by-step selection process captures the nested part of the nested logit choice rule; the logit part refers to the way that the conditional probabilities (8) are actually prescribed given the agent s predisposition towards each alternative a A. To make this precise, suppose that the learner associates to each element a A a propensity score 4Note here that the joint probability of selecting both S and S under x is simply x S whenever S S. Nested Bandits ya R indicating their tendency or propensity to select it. The associated propensity score of a similarity class Sℓ 1 Sℓ 1, ℓ= 1, . . . , L, is then defined inductively as y Sℓ 1 = µℓlog X Sℓ Sℓ 1 exp(y Sℓ/µℓ) (9) where µℓ> 0 is a tunable parameter that reflects the learner s uncertainty level regarding the ℓ-th attribute Sℓ of S. In words, this means that the score of a class is the weighted softmax of the scores of its children; thus, starting with the individual alternatives of A that is, the leaves of S propensity scores are propagated backwards along S, and this is repeated one attribute at a time until reaching the root of S. Remark 4. We should also note that Eq. (9) assigns a propensity score to any similarity class S S. However, because the primitives of this assignment are the original scores assigned to each alternative a A, we will reserve the notation y = (y1, . . . , yn) RA for the profile of propensity scores (ya)a A that comprises the basis of the recursive definition (9). With all this in hand, given a propensity score profile y = (y1, . . . 
, yn) RA, the nested logit choice (NLC) rule is defined via the family of conditional selection probabilities PSℓ|Sℓ 1(y) = exp(y Sℓ/µℓ) exp(y Sℓ 1/µℓ) (NLC) 1. Sℓ Sℓand Sℓ 1 Sℓ 1 is a child / parent pair of similarity classes of S. 2. µ1 µL > 0 is a nonincreasing sequence of uncertainty parameters (indicating a higher uncertainty level for coarser attributes; we discuss this later). In more detail, the choice of an alternative a A under (NLC) proceeds as follows: given a propensity score ya R for each a A, every similarity class SL 1 SL 1 is assigned a propensity score via the recursive softmax expression (9), and the same procedure is applied inductively up to the root A of S. Then, to select an alternative a A, the conditional logit choice rule (NLC) proceeds in a top-down manner, first by selecting a similarity class S1 S0 A, then by selecting a child S2 S1 of S1, and so on until reaching a leaf {a} SL SL 1 S0 A of S. Equivalently, unrolling (NLC) over the lineage A S0 S1 Sℓ S of a target class S Sℓ, we obtain the expression k=1 exp(y Sk/µk) exp(y Sk 1/µk) (10) for the total probability of selecting class S under the propensity score profile y RA. Clearly, (NLC) and (10) are mathematically equivalent, so we will refer to either one as the definition of the nested logit choice rule. 3.2. The nested importance weighted estimator The second key ingredient of our method is how to estimate the costs of alternatives that were not chosen under (NLC). To that end, given a cost vector c [0, 1]A and a mixed strategy x (A) with full support, a standard way to do this is via the importance-weighted estimator [9, 18] ˆca = 1{a = ˆa} xa ca (IWE) where ˆa x is the (random) element of A chosen under x. This estimator enjoys the following important properties: a) It is non-negative. b) It is unbiased, i.e., E[ˆca] = ca for all a A. (11) c) Its importance-weighted mean square is bounded as a A xaˆc2 a i n (12) This trifecta of properties plays a key role in establishing the no-regret guarantees of the vanilla exponential weights algorithm [5, 19, 29]; at the same time however, (IWE) fails to take into account any side information provided by similarities between different elements of A. This is perhaps most easily seen in the original red bus / blue bus paradox: if the commuter takes a red bus, the observed utility would be immediately translateable to the blue bus (and vice versa). However, (IWE) is treating the red and blue buses as unrelated, so ˆcblue bus is not updated under (IWE), even though cblue bus = cred bus by default. To exploit this type of similarities, we introduce below a layered estimator that shadows the step-by-step selection process of (NLC). To define it, let x (A) be a mixed strategy on A with full support, and assume that an element ˆa A is selected progressively according to x as in the case of (NLC):5 First, the learner chooses a similarity class ˆS1 S1 with probability P( ˆS1 = S1) = x S1; subsequently, conditioned on the choice of ˆS1, a class ˆS2 ˆS1 is selected with probability P( ˆS2 = S2| ˆS1) = x S2| ˆS1, and the process repeats until reaching a leaf ˆSL = {ˆa} of S (at which point the selection procedure terminates and returns ˆa). Then, given a loss profile r [0, + )S and a mixed strategy x (A), the nested importance weighted estimator (NIWE) is defined for all ℓ= 1, . . . , L as 5To clarify, this process adheres to the nested part of (NLC); the conditional probabilities x S |S may of course differ. Nested Bandits ˆr Sℓ= 1 Sℓ= ˆSℓ, . . . 
, S1 = ˆS1 x Sℓ|Sℓ 1 x S2|S1x S1 r Sℓ (NIWE) where the chain of categorical random variables A ˆS0 ˆS1 ˆSL = {ˆa} is drawn according to x (A) as outlined above.6 This estimator will play a central part in our analysis, so some remarks are in order. First and foremost, the nonnested estimator (IWE) is recovered as a special case of (NIWE) when there are no similarity attributes on A (i.e., L = 1). Second, in a bona fide nested model, we should note that ˆc Sℓis ˆSℓ-measurable but not ˆSℓ 1-measurable: this property has no analogue in (IWE), and it is an intrinsic feature of the step-by-step selection process underlying (NIWE). Third, it is also important to note that (NIWE) concerns the idiosyncratic losses of each chosen class, not the base costs ca of each alternative a A. This distinction is again redundant in the non-nested case, but it leads to a distinct estimator for ca in nested environments, namely S a ˆr S for all a A. (13) In particular, in the red bus / blue bus paradox, this means that an observation for the class bus automatically updates both ˆcred bus and ˆcblue bus, thus overcoming one of the main drawbacks of (IWE) when facing irrelevant alternatives. To complete the comparison with the non-nested setting, we summarize below the most important properties of the layered estimator (NIWE): Proposition 1. Let S = L ℓ=1 Sℓbe a similarity structure on A. Then, given a mixed strategy x (A) and a vector of cost increments r RS as per (5), the estimator (NIWE) satisfies the following: 1. It is unbiased: E[ˆr S] = r S for all S S. (14) 2. It enjoys the importance-weighted mean-square bound E x Sˆr2 S R2 S for all S S. (15) Accordingly, the loss estimator (13) is itself unbiased and enjoys the bound a A xaˆc2 a i neff (16) where neff is defined as ℓ=1 nℓ Rℓ (17) 6The indicator in (NIWE) is assumed to take precedence over x Sk|Sk 1, i.e., ˆc Sℓ= 0 if Sk = ˆSk for some k = 1, . . . , ℓ. with nℓ= |Sℓ| denoting the number of classes of attribute Sℓ, and Sℓ SℓR2 Sℓ (18) denoting the root-mean-square range of all classes in Sℓ. Of course, Proposition 1 yields the standard properties of (IWE) as a special case when L = 1 (in which case there are no similarities to exploit between alternatives). To streamline our presentation, we prove this result in Appendix B. 3.3. The nested exponential weights algorithm We are finally in a position to present the nested exponential weights (NEW) algorithm in detail. Building on the original exponential weights blueprint [5, 19, 29], the main steps of the NEW algorithm can be summed up as follows: 1. For each stage t = 1, 2, . . . , the learner maintains and updates a propensity score profile yt RA. 2. The learner selects an action at A based on the nested logit choice rule at P(ηtyt) where ηt 0 is the method s learning rate and P is given by (NLC). 3. The learner incurs r S,t for each class S at and constructs a model ˆct of the cost vector ct of stage t via (NIWE). 4. The learner updates their propensity score profile based on ˆct and the process repeats. For a presentation of the algorithm in pseudocode form, see Algorithm 1; the tuning of the method s uncertainty parameters µ1 . . . µL > 0 and the learning rate ηt is discussed in the next section, where we undertake the analysis of the NEW algorithm. 4. Analysis and results We are now in a position to state and discuss our main regret guarantees for the NEW algorithm. These are as follows: Theorem 1. 
Suppose that Algorithm 1 is run with a nonincreasing learning rate ηt > 0 and uncertainty parameters µ1 µL > 0 against a sequence of cost vectors ct [0, 1]A, t = 1, 2, . . . , as per (4). Then, for all p (A), the learner enjoys the regret bound E[Regp(T)] H ηT +1 + neff t=1 ηt (19) with neff given by (17) and H H(µ1, . . . , µL) defined by Nested Bandits Algorithm 1: Nested exponential weights (NEW) Require: set of alternatives A; attribute structure S = L ℓ=1 Sℓ Params: uncertainty levels µ1, . . . , µL > 0; learning rate ηt 0 Input: sequence of class costs rt [0, 1]S, t = 1, 2, . . . 1: initialize y 0 RA, S0 = A # initialization 2: for t = 1, 2, . . . do # scoring phase 3: for ℓ= L 1, . . . , 0 and for all S Sℓdo 4: y S µℓ+1 log P S S exp(y S /µℓ+1) # as per (9) 5: set ˆr S 0 # baseline guess 6: end for 7: for ℓ= 1, . . . , L do # selection phase 8: select class Sℓ Sℓ 1 # class choice Sℓ x Sℓ|Sℓ 1 = exp(ηty Sℓ/µℓ) exp(ηty Sℓ 1/µℓ) # (NLC) 9: get r Sℓ,t # intra-class cost 10: set ˆr Sℓ ˆr Sℓ+ r Sℓ,t x Sℓ|Sℓ 1 x S1|S0 # (NIWE) 11: end for 12: set ˆca P S a ˆr S for all a A # loss model 13: set y y ˆc # update propensities 14: end for setting y = 0 in (9) and taking H = y A, i.e., (20) In particular, if Algorithm 1 is run with µ1 = = µL = p neff/2 and ηt = p log n/(2t), we have E[Regp(T)] 2 p neff log n T. (21) Theorem 1 is our main regret guarantee for NEW so, before discussing its proof (which we carry out in detail in Appendices A C), some remarks are in order. The first thing of note is the comparison to the corresponding bound for EXP3, namely E[Regp(T)] 2 p n log n T. (22) This shows that the guarantees of NEW and EXP3 differ by a factor of7 n/neff, (23) which, for reasons that become clear below, we call the price of affinity (Po Af). 7Depending on the source, the bound (22) may differ up to a factor of 2, compare for example [26, Corollary 4.2] and [18, Theorem 11.2]. This factor is due to the fact that (22) is usually stated for a known horizon T (which saves a factor of 2 relative to anytime algorithms). Ceteris paribus, the bound (21) can be sharpened by the same factor, but we omit the details. Since the variabilities of the idiosyncratic losses within each attribute have been normalized to 1 (recall the relevant discussion in Section 2.3), Hölder s inequality trivially gives neff n, no matter the underlying similarity structure. Of course, if there are no similarities to exploit (L = 1), we get neff = n, in which case the two bounds coincide (α = 1). At the other extreme, suppose we have a red bus / blue bus type of problem with, say, n1 = 2 similarity classes, n2 = 100 alternatives per class, and a negligible intra-class loss differential (R2 0). In this case, EXP3 would have to wrestle with n = n1n2 = 200 alternatives, while NEW would only need to discriminate between neff n1 = 2 alternatives, leading to an improvement by a factor of α 10 in terms of regret guarantee. Thus, even though the red bus / blue bus paradox could entangle EXP3 and cause the algorithm to accrue significant regret over time, this is no longer the case under the NEW method; we also explore this issue numerically in Section 5. As another example, suppose that each non-terminal class in S has m children and the variability of the idiosyncratic losses likewise scales down by a factor of m per attribute. In this case, a straightforward calculation shows that neff scales as Θ(m), so the gain in efficiency would be of the order of α = p n/neff = Θ(m(L 1)/2), i.e., polynomial in m and exponential in L. 
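Before continuing with the discussion of these gains, it may help to see Algorithm 1 spelled out in code. The following is a compact Python sketch of the NEW method under simplifying assumptions: a symmetric similarity tree, a single constant uncertainty parameter per level, fixed (rather than adversarial) class losses in the driver, and semi-bandit feedback on the chosen path as described in Section 2.3. One deliberate design choice: the sketch back-propagates the scaled scores η_t·y through (9), so that the step-by-step conditional probabilities coincide with P(η_t y_t), matching the FTRL form (NEW) discussed in the proof outline below. All names (build_tree, new_round, class_loss, the driver's loss assignment and tuning constants) are ours; this is an illustration of the mechanics, not the authors' reference implementation (which is linked in Section 5).

```python
import math
import random

class Node:
    """A similarity class: attribute level, parent pointer, children, and a scratch score."""
    def __init__(self, level, parent=None):
        self.level = level          # attribute index: 0 = root A, L = singleton leaves
        self.parent = parent
        self.children = []
        self.score = 0.0            # scaled propensity score, recomputed every round

def build_tree(L, M):
    """Symmetric similarity structure with L attributes and M children per class."""
    root = Node(0)
    frontier = [root]
    for _ in range(L):
        nxt = []
        for node in frontier:
            node.children = [Node(node.level + 1, parent=node) for _ in range(M)]
            nxt.extend(node.children)
        frontier = nxt
    return root, frontier           # frontier = the leaves, i.e. the alternatives A

def leaves_under(node):
    if not node.children:
        return [node]
    return [a for c in node.children for a in leaves_under(c)]

def new_round(root, y, eta, mu, class_loss, rng):
    """One round of the NEW method (cf. Algorithm 1), sketched under our assumptions."""
    # Scoring phase: back-propagate the *scaled* scores eta*y from leaves to root via the
    # weighted log-sum-exp recursion (9).
    def backprop(node):
        if not node.children:
            node.score = eta * y[node]
            return
        for child in node.children:
            backprop(child)
        m = mu[node.level + 1]
        node.score = m * math.log(sum(math.exp(c.score / m) for c in node.children))
    backprop(root)

    # Selection phase: traverse the tree top-down with the nested logit choice rule (NLC),
    # building the nested importance-weighted estimates (NIWE) along the chosen path.
    node, prob_path, chosen, rhat = root, 1.0, [], {}
    while node.children:
        m = mu[node.level + 1]
        probs = [math.exp(c.score / m - node.score / m) for c in node.children]  # sums to 1 by (9)
        idx = rng.choices(range(len(probs)), weights=probs)[0]
        node = node.children[idx]
        prob_path *= probs[idx]                      # total probability x_S of the chosen class
        rhat[node] = class_loss(node) / prob_path    # observed intra-class loss, importance-weighted
        chosen.append(node)

    # Update phase: every alternative below a chosen class inherits that class's estimate (13),
    # and propensity scores are decreased accordingly (y <- y - c_hat).
    for S in chosen:
        for a in leaves_under(S):
            y[a] -= rhat[S]
    return chosen[-1]                                # the selected alternative (a leaf of S)

# Hypothetical driver: n = M**L alternatives, fixed class losses, constant uncertainty
# parameters, and the anytime learning rate eta_t = sqrt(log n / (2t)).
if __name__ == "__main__":
    rng = random.Random(0)
    L, M, T = 2, 3, 1000
    root, leaves = build_tree(L, M)
    base = {}                       # illustrative mean loss increment per class
    def assign(node):
        for c in node.children:
            base[c] = rng.uniform(0.0, 0.5 / L)      # keeps every path cost at most 1
            assign(c)
    assign(root)
    y = {a: 0.0 for a in leaves}
    n = len(leaves)
    mu = [None] + [math.sqrt(n / 2.0)] * L           # placeholder; Theorem 1 tunes mu = sqrt(n_eff / 2)
    for t in range(1, T + 1):
        eta = math.sqrt(math.log(n) / (2 * t))
        new_round(root, y, eta, mu, lambda S: base[S], rng)
```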
This gain in performance can become especially pronounced when there is a very large number of atlernatives organized in categories and subcategories of geometrically decreasing impact on the end cost of each alternative. We explore this issue in practical scenarios in Section 5 and Appendix D. Finally, we should also note that the parameters of NEW have been tuned so as to facilitate the comparison with EXP3. This tuning is calibrated for the case where S is fully symmetric, i.e., all subcategories of a given attribute have the same number of children. Otherwise, in full generality, the tuning of the algorithm s uncertainty levels would boil down to a transcedental equation involving the nested term H(µ1, . . . , µL) of (19). This can be done efficiently offline via a line search, but since the result would be structuredependent, we do not undertake this analysis here. Proof outline of Theorem 1. The detailed proof of Theorem 1 is quite lengthy, so we defer it to Appendices A C and only sketch here the main ideas. The first basic step is to derive a suitable potential function that can be used to track the evolution of the NEW policy relative to the benchmark p (A). The main ingredient of this potential is the nested entropy function Sk Sk x Sk log x Sk, (24) where δk = µk µk+1 for all k = 1, . . . , L (with µL+1 = 0 Nested Bandits by convention).8 As we show in Proposition A.1 in Appendix A, the tiers of h can be unrolled to give the nontiered recursive representation S S h(x|S) (25) where h(x|S) = µℓ+1 P S S x S log(x S /x S) denotes the conditional entropy of x relative to class S Sℓ. Then, by means of this decomposition and a delicate backwards induction argument, we show in Proposition A.2 that a) the recursively defined propensity score y A of A can be expressed non-recursively as y A = arg maxx (A){ y, x h(x)}; and b) that the choice rule (NLC) can be expressed itself as Pa(y) = y A ya for all y RA, a A. (26) This representation of (NLC) provides the first building block of our proof because, by Danskin s theorem [8], it allows us to rewrite Algorithm 1 in more concise form as yt+1 = yt ˆct xt+1 = arg max x (A) { ηt+1yt+1, x h(x)} (NEW) with ˆct given by (13) appplied to x xt. Importantly, this shows that the NEW algorithm is an instance of the wellknown follow the regularized leader (FTRL) algorithmic framework [26, 27]. Albeit interesting, this observation is not particularly helpful in itself because there is no universal, regularizer-agnostic analysis giving optimal (or near-optimal) regret rates for FTRL with bandit/partial information.9 Nonetheless, by adapting a series of techniques that are used in the analysis of FTRL algorithms, we show in Appendix C that the iterates of (NEW) satisfy the energy inequality ˆct, xt p Et Et+1 + 1 ηt F(xt, ηtyt+1) + (η 1 t+1 η 1 t )[h(p) min h] (27) where ˆct is the nested importance weighted estimator (13) for the cost vector encountered ct, and we have set F(x, y) = h(x) + y A y, x (28) and Et = η 1 t F(p, ηtyt). Then, by Proposition 1, we obtain: 8In the non-nested case, (24) boils down to the standard (negative) entropy h(x) = P a xa log xa. However, the inverse problem of deriving the correct form of h in a nested environment involves a technical leap of faith and a fair degree of trial-and-error. 9For the analysis of specific versions of FTRL with nonentropic regularizers, cf. [3, 30] and references therein. 
Proposition 2. The NEW algorithm enjoys the bound

E[Reg_p(T)] ≤ H/η_{T+1} + Σ_{t=1}^{T} (1/η_t) E[F(x_t, η_t y_{t+1})].

Proposition 2 provides the first half of the bound (19), with the precise form of H derived in Lemma C.1. The second half of (19) revolves around the term E[F(x_t, η_t y_{t+1})] and boils down to estimating how propensity scores are backpropagated along S. In particular, the main difficulty is to bound the difference y⁺_A − y_A in the propensity score of the root node A of S when the underlying score profile y ∈ R^A is incremented to y⁺ = y + w for some w ∈ R^A. A first bound that can be obtained by convex analysis arguments is |y⁺_A − y_A| ≤ ⟨w, P(y)⟩ + ‖w‖²/2; however, because the increments of (NEW) are unbounded in norm, this global bound is far too lax for our purposes. A similar issue arises in the analysis of EXP3, and it is circumvented by deriving a bound for the log-sum-exp function using the inequality exp(x) ≤ 1 + x + x²/2 for x ≤ 0 and the fact that the estimator (IWE) is non-negative [11, 18, 26]. Extending this idea to nested environments is a very delicate affair, because each tier in S introduces an additional layer of error propagation in the increments y_{t+1} − y_t. However, by a series of inductive arguments that traverse S both forward and backward, we are able to show the bound

y⁺_A − y_A ≤ ⟨w, P(y)⟩ + (1/(2µ_L)) Σ_{ℓ} Σ_{S_ℓ ∈ S_ℓ} P_{S_ℓ}(y) r²_{S_ℓ}   (30)

which, after taking expectations and using the bounds of Proposition 1, finally yields the pseudo-regret bound (19).

Figure 2: Regret of EXP3 and NEW in the red bus / blue bus problem with different numbers of buses.

5. Numerical experiments

In this section we present a series of numerical experiments designed to test the efficiency of our method compared to EXP3. We use a synthetic environment where we simulate nested similarity partitions with trees. While NEW exploits the similarity structure by making forward/backward passes through the associated tree with its logit choice rule (NLC), EXP3 is simply run over the leaves of the tree, i.e., over A. All experiment details (as well as additional results) are presented in Appendix D. For every setting, we report the results of our experiments by plotting the average regret of each algorithm over 20 seeds of randomly drawn losses. The code to reproduce the experiments can be found at https://github.com/criteo-research/Nested-Exponential-Weights.

Figure 3: Regret of EXP3 and NEW in a tree environment with different numbers of levels L and classes per level M.

Figure 4: Regret of EXP3 and NEW in a tree environment with different numbers of levels L and classes per level M.

Benefits in the red bus / blue bus problem. We consider here a variant of the red bus / blue bus problem with N different buses (the original paradox has N = 2). In this experiment (see the illustration in Fig. 5, Appendix D.2), we allow each bus to have non-zero intrinsic losses and illustrate in Fig. 2 how both algorithms perform as N grows. We observe there that, for all configurations, NEW achieves better regret than EXP3.
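For concreteness, the snippet below sketches one way such an environment can be generated. It is a hypothetical stand-in for the setup of Appendix D.2 (whose exact parameters are not reproduced in the main text), with the type attribute carrying most of the cost and each bus contributing only a small idiosyncratic increment; the function name and numerical values are ours.

```python
import random

def red_blue_bus_losses(n_buses, rng, bus_spread=0.05):
    """Draw one round of class-level losses r_{S,t} for an N-bus red bus / blue bus variant.

    Hypothetical stand-in for the environment of Appendix D.2: the first attribute (type)
    carries most of the cost, while each individual bus only adds a small idiosyncratic
    increment, so the n_buses bus alternatives are nearly identical and largely irrelevant.
    """
    losses = {
        "car": 0.4 + 0.1 * rng.random(),   # class-level loss of the (singleton) car class
        "bus": 0.8 + 0.1 * rng.random(),   # class-level loss shared by every bus
    }
    for b in range(n_buses):
        # intra-class increments r_S in [0, R_S] with a small range R_S = bus_spread
        losses[f"bus/{b}"] = bus_spread * rng.random()
    return losses

print(red_blue_bus_losses(3, random.Random(0)))
```

In such an environment, EXP3 treats the n_buses + 1 leaves as unrelated arms, whereas NEW can amortize the shared bus-level cost across all of them, which is the effect visible in Fig. 2.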
While both methods achieve sublinear regret, EXP3 requires far more steps to identify the best alternative as N grows and suffers overall from worse regret while NEW achieves similar regret and does not suffer as much from the number of irrelevant alternatives. We provide additional plots in Appendix D.2 which show that NEW performs consistently better than EXP3 when there exists a similarity structure allowing to efficiently update scores of classes that have very similar losses. Performance in general nested structures. In this setting we generate symmetric trees and experiment with different values of number of levels L and number of child per nodes M = |Sℓ| for ℓ= 1, . . . , L. Specifically, in Fig. 3 with a fixed M, we see that NEW obtains better regret than EXP3 even when L increases. We provide variance plots for the experiments that generated the same performance on the plots in D.3 as well as additional visualisations. Finally, in Fig. 4, we can see that for a shallow tree (L = 2) NEW performs always better than EXP3, even for high values of M. Indeed, when the number of children per nodes M increases, the tree loses its factorized structure which also affects NEW due to the less "structured" tree. Thus, again, NEW performs consistently better than EXP3 when it is possible to efficiently handle classes with similar losses. Overall, our experiments confirm that a learning algorithm based on nested logit choice can lead to significant benefits in problems with a high degree of similarity between alternatives. This leaves open the question of whether a similar approach can be applied to structures with non-nested attributes; we defer this question to future work. 6. Concluding remarks One limitation of the current framework is that the nested estimator (13) requires knowledge of the intra-class cost increments r S for every chosen similarity class S at. This is akin to the difference between the full bandit and semibandit setting that arises in combinatorial bandits [12]. While relevant in a number of application domains (e.g., in path-planning or when layering a structured security, such as the tranches of a CDO), treating the fully unobservable case possibly using an approach in the spirit of the hierarchical contextual analysis of Sen et al. [25] is an important open question for future research. Finally, it is also interesting to note that our analysis has been carried out in an arbitrarily changing adversarial environment. In a stochastic environment, it would be fruitful to consider other, contextual-based approaches such as Lin UCB, Kernel UCB and their variants [18]. Ideally, one would like to employ a nested variant of the universal algorithm of Zimmert & Seldin [30] that attains optimal regret guarantees in both stochastic and adversarial environments, but this question lies beyond the scope of our work. Acknowledgements P. Mertikopoulos is grateful for financial support by the French National Research Agency (ANR) in the framework of the Investissements d avenir program (ANR-15-IDEX-02), the Lab Ex PERSYVAL (ANR-11-LABX-0025-01), MIAI@Grenoble Alpes (ANR-19-P3IA-0003), and the bilateral ANR-NRF grant ALIAS (ANR-19-CE48-0018-01). Nested Bandits [1] Abbasi-yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Adv. Neural Information Processing Systems (NIPS), 2011. [2] Anderson, S. P., de Palma, A., and Thisse, J.-F. Discrete Choice Theory of Product Differentiation. MIT Press, Cambridge, MA, 1992. 
[3] Audibert, J.-Y., Bubeck, S., and Lugosi, G. Minimax policies for combinatorial prediction games. In COLT 11: Proceedings of the 24th Annual Conference on Learning Theory, 2011. [4] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, 1995. [5] Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235 256, 2002. [6] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48 77, 2002. [7] Ben-Akiva, M. and Lerman, S. R. Discrete Choice Analysis: Theory and Application to Travel Demand. MIT Press, Cambridge, 1985. [8] Berge, C. Topological Spaces. Dover, New York, 1997. [9] Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1 122, 2012. [10] Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. Xarmed bandits. Journal of Machine Learning Research, 12: 1655 1695, 2011. [11] Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006. [12] Cesa-Bianchi, N. and Lugosi, G. Combinatorial bandits. Journal of Computer and System Sciences, 78:1404 1422, 2012. [13] Cesa-Bianchi, N. and Shamir, O. Bandit regret scaling with the effective loss range. In ALT 18: Proceedings of the 29th International Conference on Algorithmic Learning Theory, 2018. [14] Cesa-Bianchi, N., Gaillard, P., Gentile, C., and Gerchinovitz, S. Algorithmic chaining and the role of partial feedback in online nonparametric learning. In COLT 17: Proceedings of the 30th Annual Conference on Learning Theory, 2017. [15] György, A., Linder, T., Lugosi, G., and Ottucsák, G. The online shortest path problem under partial monitoring. Journal of Machine Learning Research, 8:2369 2403, 2007. [16] Héliou, A., Martin, M., Mertikopoulos, P., and Rahier, T. Zeroth-order non-convex learning via hierarchical dual averaging. In ICML 21: Proceedings of the 38th International Conference on Machine Learning, 2021. [17] Kocák, T., Neu, G., Valko, M., and Munos, R. Efficient learning by implicit exploration in bandit problems with side observations. In NIPS 14: Proceedings of the 28th International Conference on Neural Information Processing Systems, 2014. [18] Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, Cambridge, UK, 2020. [19] Littlestone, N. and Warmuth, M. K. The weighted majority algorithm. Information and Computation, 108(2):212 261, 1994. [20] Luce, R. D. Individual Choice Behavior: A Theoretical Analysis. Wiley, New York, 1959. [21] Mc Fadden, D. L. Conditional logit analysis of qualitative choice behavior. In Zarembka, P. (ed.), Frontiers in Econometrics, pp. 105 142. Academic Press, New York, NY, 1974. [22] Nesterov, Y. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221 259, 2009. [23] Neu, G. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In NIPS 15: Proceedings of the 29th International Conference on Neural Information Processing Systems, 2015. [24] Rockafellar, R. T. Convex Analysis. Princeton University Press, Princeton, NJ, 1970. [25] Sen, R., Rakhlin, A., Ying, L., Kidambi, R., Foster, D., Hill, D., and Dhillon, I. Top-k e Xtreme contextual bandits with arm hierarchy. 
In ICML 21: Proceedings of the 38th International Conference on Machine Learning, 2021. [26] Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4 (2):107 194, 2011. [27] Shalev-Shwartz, S. and Singer, Y. Convex repeated games and Fenchel duality. In NIPS 06: Proceedings of the 19th Annual Conference on Neural Information Processing Systems, pp. 1265 1272. MIT Press, 2006. [28] Thune, T. S. and Seldin, Y. Adaptation to easy data in prediction with limited advice. In Advances in Neural Information Processing Systems, volume 31, 2018. [29] Vovk, V. G. Aggregating strategies. In COLT 90: Proceedings of the 3rd Workshop on Computational Learning Theory, pp. 371 383, 1990. [30] Zimmert, J. and Seldin, Y. An optimal algorithm for stochastic and adversarial bandits. In AISTATS 19: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019. Nested Bandits A. The nested entropy and its properties Our aim in this appendix is to prove the basic properties of the series of (negative) entropy functions that fuel the regret analysis of the nested exponential weights (NEW) algorithm. To begin with, given a similarity structure S on A and a sequence of uncertainty parameters µ1 µL > 0 (with µL+1 = 0 by convention), we define: 1. The conditional entropy of x (A) relative to a target class S Sℓ: h(x|S) = µℓ+1 X S S x S log x S x S = µℓ+1 x S X S S x S |S log x S |S. (A.1) 2. The nested entropy of x (A) relative to S Sℓ: Sk k S x Sk log x Sk (A.2) where δk = µk µk+1 for all k = 1, . . . , L. 3. The restricted entropy of x (A) relative to S Sℓ: h|S(x) = h S(x) + χ (S)(x) = ( h S(x) if x (S), otherwise, (A.3) where χ (S) denotes the (convex) characteristic function of (S), i.e., χ (S)(x) = 0 if x (S) and χ (S)(x) = otherwise. [Obviously, h|S(x) = h S(x) whenever x (S).] Remark 1. As per our standard conventions, we are treating S interchangeably as a subset of A or as an element of S; by analogy, to avoid notational inflation, we are also viewing (S) as a subset of (A) more precisely, a face thereof. Finally, in all cases, the functions h(x|S), h S(x) and h|S(x) are assumed to take the value + for x RA \ (A). Remark 2. For posterity, we also note that the nested and restricted entropy functions (h S(x) and h|S(x) respectively) are both convex though not necessarily strictly convex over (A). This is a consequence of the fact that each summand x S log x S in (A.2) is convex in x and that δk = µk µk+1 0 for all k = 1, . . . , L. Of course, any two distributions x, x (A) that assign the same probabilities to elements of S but not otherwise have h S(x) = h S(x ), so h S is not strictly convex over (A) if S = A. However, since the function P a S xa log xa is strictly convex over (S), it follows that h S and hence h|S is strictly convex over (S). Our main goal in the sequel will be to prove the following fundamental properties of the entropy functions defined above: Proposition A.1. For all S Sℓ, ℓ= 1, . . . , L, and for all x (A), we have: S S h(x|S ) + µℓx S log x S. (A.4) Consequently, for all x (S), we have: S S h(x|S ). (A.5) Proposition A.2. For all S S and all y RA, we have: 1. The recursively defined propensity score y S of S as given by (9) can be expressed as y S = max x (S){ y, x h|S(x)} (A.6) Nested Bandits 2. 
The conditional probability of choosing a A given that S has already been selected under (NLC) is given by Pa|S(y) = y S and the conditional probability vector P|S(y) = (Pa|S(y))a A solves the problem (A.6), viz. P|S(y) = arg max x (S) { y, x h|S(x)} (A.8) These propositions will be the linchpin of the analysis to follow, so some remarks are in order: Remark 3. Note here that the maximum in (A.6) is taken over the restricted entropy function h|S, not the nested entropy h S. This distinction will play a crucial role in the sequel; in particular, since h|S is strictly convex over (S), it implies that the arg max in (A.8) is a singleton. Remark 4. The first part of Proposition A.2 can be rephrased more concisely (but otherwise equivalently) as y S = h |S(y) (A.9) where h |S(y) = max x (A){ y, x h|S(x)} (A.10) denotes the convex conjugate of h|S. This interpretation is conceptually important because it spells out the precise functional dependence between the (primitive) propensity score profile y RA and the propensity scores y S that are propagated to higher-tier similarity classes S S via the recursive definition (9). In particular, this observation leads to the recursive rule for all S Sℓ, ℓ= 0, 1, . . . , L 1. (A.11) We will we use this representation freely in the sequel. Remark 5. It is also worth noting that the propensity scores y Sℓ, Sℓ Sℓ, can also be seen as primitives for the arborescence S = ℓ k=0 Sk obtained from S by excising all (proper) descendants of Sℓ. Under this interpretation, the second part of Proposition A.2 readily gives the more general expression PS |S(y) = y S y S for all S S, (A.12) where, in the right-hand side, y S is to be construed as a function of y S , defined recursively via (9) applied to the truncated arborescence S . Even though we will not need this specific result, it is instructive to keep it in mind for the sequel. The rest of this appendix is devoted to the proofs of Propositions A.1 and A.2. Proof of Proposition A.1. Let ℓ= attr(S), and fix some attribute label k > ℓ. We will proceed inductively by collecting all terms in (A.4) associated to the attribute Sk and then summing everything together. Indeed, we have: S k S x S log x S = µk X S Sk 1 x S log x S # collect attributes S Sk 1 x S |Sk 1x Sk 1 log(x S |Sk 1x Sk 1) # by definition S Sk 1 x S |Sk 1x Sk 1 log x S |Sk 1 S Sk 1 x S |Sk 1x Sk 1 log x Sk 1 Nested Bandits with the tacit understanding that any empty sum that appears above is taken equal to zero. Now, by the definition of the nested entropy, we readily obtain that (A.13a) = X Sk 1 k 1S h(x|Sk 1) (A.14a) whereas, by noting that P S Sk 1 x S |Sk 1 = 1 (by the definition of conditional class choice probabilities), Eq. (A.13b) becomes (A.13b) = µk X Sk 1 k 1S x Sk 1 log x Sk 1. (A.14b) Hence, combining Eqs. (A.13), (A.14a) and (A.14b), we get: S k S x S log x S = X Sk 1 k 1S h(x|Sk 1) + µk X Sk 1 k 1S x Sk 1 log x Sk 1. (A.15) The above expression is our basic inductive step. Indeed, summing (A.15) over all k = L, . . . , ℓ= attr(S), we obtain: k=ℓ (µk µk+1) X S k S x S log x S # by definition S k S x S log x S µk+1 X S k S x S log x S + (µℓ µℓ+1) x S log x S # isolate S Sk 1 k 1S h(x|Sk 1) + µk X Sk 1 k 1S x Sk 1 log x Sk 1 µk+1 X S k S x S log x S + (µℓ µℓ+1) x S log x S # by (A.15) S k S h(x|S ) + µℓx S log x S µL+1 X S LS x S log x S (A.16) with the last equality following by telescoping the terms involving µk. Now, given that µL+1 = 0 by convention, the third sum above is zero. 
Finally, since the conditional entropy of x relative to any childless class is zero by definition, the first sum in (A.16) can be rewritten as PL 1 k=ℓ P S k S h(x|S ) = P S S h(x|S ), and our claim follows. Finally, (A.5) is a consequence of the fact that x S = 1 whenever x (S) i.e., whenever supp(x) S. Proof of Proposition A.2. We begin by noting that the optimization problem (A.6) can be written more explicitly as maximize y, x h S(x), subject to x (A) and supp(x) S. (Opt S) We will proceed to show that the (unique) solution of (Opt S) is given by the vector of conditional probabilities (Pa|S(y))a A. The expression (A.6) for the maximal value of (Opt S) will then be derived from Proposition A.1, and the differential representation (A.8) will follow from Legendre s identity. We make all this precise in a series of individual steps below. Step 1: Optimality conditions for (Opt S). For all a S, the definition of the nested entropy gives xa (x S log x S ) = S k S (1 + log x S ) x S S k S (1 + log x S ) 1{a S } Nested Bandits k=ℓ δk(1 + log x Sk) k=ℓ δk log x Sk (A.17) where S Sℓ Sℓ+1 SL {a} denotes the lineage of a up to S (inclusive). This implies that ah S(x) whenever xa 0, so any solution x of (Opt S) must have xa > 0 for all a S. In view of this, the first-order optimality conditions for (Opt S) become k=ℓ δk log x Sk = λ for all a S, (A.18) where λ is the Lagrange multiplier for the equality constraint P a A xa = 1.10 Thus, after rearranging terms and exponentiating, we get xδL SL xδL 1 SL 1 xδℓ Sℓ= exp(ya) for some proportionality constant Z Z(y) > 0. Step 2: Solving (Opt S). The next step of our proof will focus on unrolling the chain (A.19), one attribute at a time. To start, recall that δL = µL, so (A.19) becomes x SL xδL 1/µL SL 1 xδℓ/µL Sℓ = exp(y SL/µL) Z1/µL , (A.20) where we used the fact that SL = a by definition. Now, since SL 1 Sℓ= S, it follows that all children of SL 1 are also desendants of S, so (A.20) applies to all siblings of SL as well. Hence, summing (A.20) over SL SL 1, we get x SL 1 xδL 1/µL SL 1 xδℓ/µL Sℓ = exp(y SL 1/µL) Z1/µL , (A.21) where we used the definition (7) of x SL 1 = P SL SL 1 x SL and the recursive definition (9) for y SL 1, i.e., the fact that exp(y SL 1/µL) = P SL SL 1 exp(y SL/µL). Therefore, noting that µL = 1 + µL 1 µL the product (A.21) becomes xµL 1 SL 1 xδL 2 SL 2 xδℓ Sℓ= exp(y SL 1) or, equivalently x SL 1 xδL 2/µL 1 SL 2 xδℓ/µL 1 Sℓ = exp(y SL 1/µL 1) Z1/µL 1 . (A.24) This last equation has the same form as (A.21) applied to the chain Sℓ Sℓ+1 SL 1 instead of Sℓ Sℓ+1 SL. Thus, proceeding inductively, we conclude that j=k 1 xδj Sj = exp(y Sk) Z for all k = L, . . . , ℓ (A.25) with the empty product Q j xδj Sj taken equal to 1 by standard convention. Now, substituting k k + 1 in (A.25), we readily get xµk+1 Sk+1 xδk Sk j=k 1 xδj Sj = exp(y Sk+1) Z for all k = L 1, . . . , ℓ. (A.26) 10Since xa > 0 for all a S, the multipliers for the corresponding inequality constraints all vanish by complementary slackness. Nested Bandits Consequently, recalling that δk = µk µk+1 and dividing (A.25) by (A.26), we get xµk+1 Sk+1 xµk+1 Sk = exp(y Sk+1) exp(y Sk) , (A.27) and hence x Sk+1 x Sk = exp(y Sk+1/µk+1) exp(y Sk/µk+1) = PSk+1|Sk(y) (A.28) by the definition of the conditional logit choice model (NLC). Therefore, by unrolling the chain x S = x SL x SL 1 x SL 1 x SL 2 x Sℓ+1 x Sℓ = PSL|SL 1(y) PSL 1|SL 2(y) PSℓ+1|Sℓ(y) (A.29) we obtain the nested expression k=ℓ PSk+1|Sk(y) for all a S. 
(A.30) Thus, with x S = 1 (by the fact that supp(x) = S), we finally conclude that k=ℓ PSk+1|Sk(y) = Pa|S(y) for all a S. (A.31) Step 3: The maximal value of (Opt S). To obtain the value of the maximization problem (Opt S), we will proceed to substitute (A.31) in the expression (A.4) provided by Proposition A.1 for h S(x). To that end, for all k = ℓ, . . . , L 1 and all Sk k S, the definition (A.1) of the conditional entropy gives: h(x|Sk) = µk+1 x Sk X Sk+1 Sk x Sk+1|Sk log x Sk+1|Sk # by definition = µk+1 x Sk X Sk+1 Sk x Sk+1|Sk log exp(y Sk+1/µk+1) exp(y Sk/µk+1) # by (A.28) Sk+1 Sk x Sk+1|Sky Sk+1 x Sky Sk # since P Sk+1 Sk x Sk+1|Sk = 1 Sk+1 Sk x Sk+1y Sk+1 x Sky Sk (A.32) Sk k S h(x|Sk) = X Sk+1 Sk x Sk+1y Sk+1 x Sky Sk Sk+1 k+1S x Sk+1y Sk+1 X Sk k S x Sky Sk. (A.33) Thus, telescoping this last releation over k = ℓ, . . . , L and invoking Proposition A.1, we obtain: S S h(x|S ) + µk x S log x S # by Proposition A.1 Sk k S h(x|Sk) # collect parent classes Sk+1 k+1S x Sk+1y Sk+1 X Sk k S x Sky Sk # by (A.33) = y, x x Sy S (A.34) where, in the second line, we used the fact that the conditional entropy h(x|SL) relative to any childless class SL SL is zero by definition. Accordingly, substituting back to (Opt S) we conclude that val (Opt S) = y, x h S(x) = x Sy S = y S, (A.35) as claimed. Nested Bandits Step 4: Differential representation of conditional probabilities. To prove the second part of the proposition, recall that the restricted entropy function h|S is convex, and let h |S(y) = max x (A){ y, x h|S(x)} (A.36) denote its convex conjugate.11 By standard results in convex analysis [e.g., Theorem 23.5 in 24], h |S is differentiable in y and we have the Legendre identity: x = h |S(y) y h|S(x) x arg max x (A) { y, x h|S(x )} (A.37) Now, by (A.31), we have xa = Pa|S(y) whenever x solves (Opt S) and hence, by Fermat s rule, whenever y h|S(x) 0. Our claim then follows by noting that h |S(y) = y S and combining the first and third legs of the equivalence (A.37). These properties of the nested entropy function (and its restricted variant) will play a key role in deriving a suitable energy function for the nested exponential weights algorithm. We make this precise in Appendix C below. B. Auxiliary bounds and results Throughout this appendix, we assume the following primitives: A fixed sequence of real numbers µ1 µ2 µL > 0; all entropy-related objects will be defined relative to this sequence as per the previous section. A score vector y RA that defines inductively the score y S of any class S S using (9), as well as the associated nested choice probability P(y) as per (NLC). A vector of cost increments r = (r S)S S RS that defines an associated cost vector c RA as per (4), viz. S a r S for all a A. (B.1) Moreover, for all c, y RA, we define the nested power sum function σc,y : S\SL R which, to any S S\SL, associates the real number a S Pa|S(y) exp( ca/µL) if attr(S) = L 1, S S PS |S(y)σc,y(S ) µℓ+2 µℓ+1 if attr(S) = ℓ< L 1. (B.2) The following lemma links the increments of the conjugate entropy h to the nested power sum defined above: Lemma B.1. For all y RA, c RA, we have h (y c) = h (y) + µ1 log(σc,y(A)). (B.3) Lemma B.1 will be proved as a corollary of the more general result below: Lemma B.2. Fix some y RA and c RA. Then, for all Sℓ Sℓ, ℓ< L,we have σc,y(Sℓ) (B.4) Proof of Lemma B.1. Simply invoke Lemma B.2 with S A. Proof of Lemma B.2. We proceed by descending induction on ℓ= attr(S). 
11Note here that h |S(y) is bounded from above by the convex conjugate h S(y) of h S(x) because the latter does not include the constraint supp(x) S. Nested Bandits Base step. Fix some S S with attr(S) = L 1. We then have: # by Eq. (A.11) # the a s are leaves exp h |a(y) exp h |S(y) | {z } =σc,y(S) by definition σc,y(S) (B.5) with the last equality following from the definition of Pa|S via (NLC) and by the definition of σc,y(S). This concludes the start of the induction process. Induction step. Fix some S S with attr(S) = ℓ 1, ℓ< L, and suppose that (B.4) holds at level ℓ. We then have: µℓ # inductive hypothesis exp h |S (y) exp h |S(y) | {z } =σc,y(S) by definition σc,y(S) (B.6) with the last equality following from the definition of PS |S and σc,y(S). This being true for all S S with attr(S) = ℓ 1, the inductive step and a fortiori our proof are complete. The next lemma provides an upper bound for σc,y(A), which will in turn allow us to derive a bound for the increment of h . Lemma B.3. For y RA and c [0, + )A, we have: σc,y(A) 1 1 a A Pa(y)ca 1 2µL a A Pa(y)c2 a Nested Bandits As in the case of B.1, Lemma B.3 will follow as a special case of the more general, class-based result below: Lemma B.4. Fix some y RA and c RA +. Then, for all Sℓ Sℓ, ℓ< L,we have σc,y(Sℓ) 1 1 µℓ+1 a Sℓ Pa|Sℓ(y)ca 1 2µL a Sℓ Pa|Sℓ(y)c2 a Proof of Lemma B.3. Simply invoke Lemma B.4 with S A. Proof of Lemma B.4. We proceed again by descending induction on ℓ= attr(S). Base step. Fix some S S with attr(S) = L 1. We then have: σc,y(S) = X S S PS |S(y) exp( c S S S PS |S(y)(1 c S c2 S 2µ2 L ) # e x 1 x + x2/2 for x 0 S S PS |S(y)c S 1 2µL S S PS |S(y)c2 S = 1 1 µ(L 1)+1 a S Pa|S(y)ca 1 2µL a S Pa|S(y)c2 a so the initialization of the induction process is complete. Induction step. Fix some S S with attr(S) = ℓ 1, ℓ< L, and suppose that (B.8) holds at level ℓ. We then have: σc,y(S) = X S S PS |S(y)σc,y(S ) S S PS |S(y) a S Pa|S (y)ca + 1 2µL a S Pa|S (y)c2 a µℓ # inductive hypothesis S S PS |S(y) a S Pa|S (y)ca + 1 2µL a S Pa|S (y)c2 a # (1 + x)β 1 + βx for β 1 a S Pa|S (y)PS |S(y)ca + 1 2µL a S Pa|S (y)PS |S(y)c2 a = 1 + 1 µ(ℓ 1)+1 a S Pa|S(y)ca + 1 2µL a S Pa|S(y)c2 a This being true for all S S s.t. attr(S) = ℓ 1, the induction step and the proof of our assertion are complete. With all this in hand, we are now in a position to upper bound the increments of the conjugate nested entropy h . Proposition B.1. For y RA and c [0, + )A, we have: h (y c) h (y) P(y), c + 1 2µL a A Pa(y)c2 a. (B.12) Proof. Using Lemmas B.1 and B.3 and the concavity inequality log x 1 + x directly delivers our assertion. Nested Bandits Remark 6. It is useful to note that, given a cost increment vector r RS with associated aggregate costs given by c RA P(y), c = X a A Pa(y)ca a A Pa(y) X a A Pa(y) X S S r S 1a S a A Pa(y) 1a S S S PS(y)r S. We are finally in a position to prove the basic properties of the NIWE estimator, which we restate below for convenience: Proposition 1. Let S = L ℓ=1 Sℓbe a similarity structure on A. Then, given a mixed strategy x (A) and a vector of cost increments r RS as per (5), the estimator (NIWE) satisfies the following: 1. It is unbiased: E[ˆr S] = r S for all S S. (14) 2. It enjoys the importance-weighted mean-square bound E x Sˆr2 S R2 S for all S S. 
(15) Accordingly, the loss estimator (13) is itself unbiased and enjoys the bound a A xaˆc2 a i neff (16) where neff is defined as neff = XL ℓ=1 nℓ Rℓ (17) with nℓ= |Sℓ| denoting the number of classes of attribute Sℓ, and Sℓ SℓR2 Sℓ (18) denoting the root-mean-square range of all classes in Sℓ. Proof. Fix some S S with attr(S) = ℓ {1, . . . , L} and lineage A S0 S1 Sℓ S. We will now prove both properties of the (NIWE) estimator. Part 1. We begin by showing that the estimator (NIWE) is unbiased. Indeed, we have: E[ˆr S] = E " 1 Sℓ= ˆSℓ, . . . , S1 = ˆS1 x Sℓ|Sℓ 1 x S2|S1x S1 r Sℓ # Rewriting (NIWE) x S E h 1 S = ˆS i = r S. (B.13) Part 2. We now turn to the proof of the importance-weighted mean-square bound of the estimator (NIWE). In this case, for any S S, we have: E x Sˆr2 S = x S E ˆr2 S = x S E Nested Bandits = x S r2 Sℓ x2 S E h 1 S = ˆS i = r2 Sℓ # because E 1 S = ˆS = x S R2 S. (B.14) We are left to derive the bound for the aggregate cost estimator (13), viz. S a ˆr S. (B.15) With this in mind, we can write: a A xaˆc2 a = X S a ˆr2 S + 2 X S S ˆr Sˆr S S S xaˆr2 S 1a S +2 X S S xaˆr Sˆr S 1a S S S ˆr2 S X a A xa 1a S S S ˆr Sˆr S X a A xa 1a S S S x Sˆr2 S + 2 X S S x S ˆr Sˆr S . (B.16) Now, decomposing the above sums attribute-by-attribute and taking expectations in (B.16), we get: a A xaˆc2 a Sℓ Sℓ x SℓE ˆr2 Sℓ + 2 X Sℓ Sℓ Sℓ ℓ Sℓ x Sℓ E ˆr Sℓˆr Sℓ . (B.17) The first term in (B.17) can simply be bounded using (B.14). Indeed: Sℓ Sℓ x SℓE ˆr2 Sℓ Sℓ Sℓ R2 Sℓ= ℓ=1 nℓ R2 ℓ. (B.18) Sℓ SℓR2 Sℓfor any ℓ= 1, . . . , L. We now turn to the second term in (B.17). Let {ϵℓ,ℓ }1 ℓ <ℓ L be any fixed sequence of positive numbers. For any ℓ, ℓ {1, . . . , L} and any Sℓ Sℓand Sℓ Sℓ , the Peter-Paul inequality yields: 2ˆr Sℓ ˆr Sℓ 1 ϵℓ,ℓ ˆr2 Sℓ + ϵℓ,ℓ ˆr2 Sℓ (B.19) Injecting (B.19) into the second term of (B.17) yields: Sℓ Sℓ Sℓ ℓ Sℓ x Sℓ E ˆr Sℓˆr Sℓ Sℓ Sℓ Sℓ ℓ Sℓ ϵℓ,ℓ E h ˆr2 Sℓ i + ϵℓ,ℓ E ˆr2 Sℓ Sℓ Sℓ Sℓ ℓ Sℓ x Sℓ E h ˆr2 Sℓ i + X 1 ℓ<ℓ L ϵℓ,ℓ X Sℓ Sℓ Sℓ ℓ Sℓ x Sℓ E ˆr2 Sℓ Nested Bandits Sℓ Sℓ Sℓ ℓ Sℓ x Sℓ E h ˆr2 Sℓ i + X 1 ℓ<ℓ L ϵℓ,ℓ X Sℓ Sℓ E ˆr2 Sℓ X Sℓ ℓ Sℓ x Sℓ | {z } x Sℓ Sℓ Sℓ x Sℓ E h ˆr2 Sℓ i + X 1 ℓ<ℓ L ϵℓ,ℓ X Sℓ Sℓ x SℓE ˆr2 Sℓ Sℓ Sℓ R2 Sℓ + X 1 ℓ<ℓ L ϵℓ,ℓ X Sℓ Sℓ R2 Sℓ # by (B.14) 1 ϵℓ,ℓ nℓ R2 ℓ + X 1 ℓ<ℓ L ϵℓ,ℓ nℓ R2 ℓ. (B.20) Injecting (B.18) and (B.20) into (B.17) ensures that: a A xaˆc2 a i XL ℓ=1 nℓ R2 ℓ+ X ϵℓ,ℓ nℓ R2 ℓ + ϵℓ,ℓ nℓ R2 ℓ holds for any sequence of positive numbers {ϵℓ,ℓ }1 ℓ <ℓ L. As a result, taking ϵℓ,ℓ = q Rℓyields the tight bound a A xaˆc2 a i XL ℓ=1 nℓ R2 ℓ+ 2 X 1 ℓ<ℓ L nℓ Rℓ nℓ R2 ℓ= XL which proves our original assertion. C. Regret analysis As we mentioned in the main text, the principal component of our analysis is a recursive inequality which, when telescoped over t = 1, 2, . . . , will yield the desired regret bound. To establish this template inequality , we will first require an energy function measuring the disparity between a benchmark strategy x (A) and a propensity score profile y RA. To that end, building on the notions introduced in Appendix A, let h: (A) R denote the total nested entropy function h(x) = h A(x) = Sk Sk x Sk log x Sk, x (A), (C.1) h (y) = max x (A){ y, x h(x)}, y RA, (C.2) denote the convex conjugate of h so, by Proposition A.2, we have h (y) = y A and Pa(y) = h ya for all y RA. (C.3) The Fenchel coupling between x (A) and y RA is then defined as F(x, y) = h(x) + h (y) y, x for all x (A), y RA, (C.4) and we have the following key result: Proposition C.1. 
Let S = L ℓ=0 Sℓbe a similarity structure on A with uncertainty parameters µ1 µL > 0. Then: 1. The Fenchel coupling (C.4) is positive-definite, i.e., F(x, y) 0 for all x (A) and all y RA, (C.5) with equality if and only if x is given by (NLC), i.e., if and only if x = P(y). Nested Bandits 2. For all x A, we have F(x, 0) = h(x) + h (0) = h(x) min h (C.6) where min h minx (A) h(x ) denotes the minimum of h over (A). Proof. Our first claim follows by setting S A in Propositions A.1 and A.2 and noting that h S = h|S when S = A: indeed, by Young s inequality, we have h(x) + h (y) y, x 0 with equality if and only if y h(x), so the equality x = P(y) follows from (A.37) applied to S A and the fact that Pa|A(y) = Pa(y). As for our second claim, simply note that h (0) = maxx (A){ 0, x h(x)} = minx (A) h(x) and set y 0 in the definition (C.4) of the Fenchel coupling. With all this in hand, the specific energy function that we will use for our regret analysis is the rate-deflated Fenchel coupling ηt F(p, ηtyt) (C.7) where p (A) is the regret comparator, ηt is the algorithm s learning rate at stage t, and yt is the corrsponding propensity score estimate. In words, since the mixed strategy employed by the learner at stage t is xt = P(ηtyt), the energy Et essentially measures the disparity between xt and the target strategy p (suitably rescaled by the method s learning rate). We then have the following fundamental estimate: Proposition C.2. For all p (A) and all t = 1, 2, . . . , we have: Et+1 Et + ˆct, xt p + (η 1 t+1 η 1 t )[h(p) min h] + 1 ηt F(xt, ηtyt+1). (C.8) Proof. By the definition of Et, we have Et+1 Et = 1 ηt+1 F(p, ηt+1yt+1) 1 ηt F(p, ηtyt) = 1 ηt+1 F(p, ηt+1yt+1) 1 ηt F(p, ηtyt+1) (C.9a) ηt F(p, ηtyt+1) 1 ηt F(p, ηtyt). (C.9b) We now proceed to upper-bound each of the two terms (C.9a) and (C.9b) separately. For the term (C.9a), the definition of the Fenchel coupling (C.4) readily yields: (C.9a) = 1 ηt+1 1 h(p) + 1 ηt+1 h (ηt+1yt+1) 1 ηt h (ηtyt+1). (C.10) Inspired by a trick of Nesterov [22], consider the function φ(η) = η 1[h (ηy) + min h]. Then, by Proposition A.2, letting x = P(ηy) and differentiating φ with respect to η gives η y, P(ηy) 1 η2 [h (ηy) + min h] η2 [ ηy, x h (ηy) min h] η2 [h(x) min h] 0. (C.11) Since ηt+1 ηt, the above shows that φ(ηt) φ(ηt+1). Accordingly, setting y yt+1 in the definition of φ yields 1 ηt+1 h (ηt+1yt+1) 1 ηt h (ηtyt+1) 1 min h (C.12) and hence (C.9a) (η 1 t+1 η 1 t )[h(p) min h]. (C.13) Nested Bandits Now, after a straightforward rearrangement, the second term of (C.9) becomes ηt [h(p) + h (ηtyt+1) ηt yt+1, p ] 1 ηt [h(p) + h (ηtyt) ηt yt, p ] ηt [h (ηtyt+1) h (ηtyt) ηt ˆct, p ] # by (NEW) ηt [h (ηtyt+1) h (ηtyt) ηt ˆct, xt ] + ˆct, xt p # isolate benchmark ηt [h (ηtyt+1) ηtyt, xt + h(xt) ηt ˆct, xt ] + ˆct, xt p # by Proposition A.2 ηt F(xt, ηtyt+1) + ˆct, xt p (C.14) Thus, combining the above with (C.13), we finally obtain Et+1 = Et + (C.9a) + (C.9b) Et + (η 1 t+1 η 1 t )[h(p) min h] + ˆct, xt p + 1 ηt F(xt, ηtyt+1) (C.15) and our proof is complete. We are now in a position to state and prove the template inequality that provides the scaffolding for our regret bounds: Proposition 2. The NEW algorithm enjoys the bound E[Regp(T)] H ηT +1 + E[F(xt, ηtyt+1)] Proof. Let Zt = ˆct vt denote the error in the learner s estimation of the t-th stage payoff vector vt. Then, by substituting in Proposition C.2 and rearranging, we readily get: vt, p xt Et Et+1 + Zt, xt p + η 1 t+1 η 1 t [h(p) min h] + ηt F(p, ηtyt+1) (C.16) Thus, telescoping over t = 1, 2, . . . 
, T, we have Regp(T) E1 ET +1 + 1 ηT +1 1 [h(p) min h] + t=1 Zt, xt p + 1 ηt F(xt, ηtyt+1) t=1 Zt, xt p + 1 ηt F(xt, ηtyt+1) (C.17) where we used the fact that a) Et 0 for all t (a consequence of the first part of Proposition C.1); and that b) E1 = η 1 1 [h(p) + h (0)] = η 1 1 [h(p) min h] (from the second part of the same proposition). Our claim then follows by taking expectations in (C.17) and noting that E[Zt | Ft] = 0 (by Proposition 1). In view of the above, our main regret bound follows by bounding the two terms in the template inequality (C.8). The second term is by far the most difficult one to bound, and is where Appendix B comes in; the first term is easier to handle, and it can be bounded as follows: Lemma C.1. Suppose that each class S Sℓ 1 has at most mℓchildren, ℓ= 1, . . . , L. Then, for all p (A), we have ℓ=1 µℓlog mℓ with equality iff the tree is symmetric, (C.18) H = µ log(n) if µ1 = µ2 = = µL = µ. (C.19) Nested Bandits Proof. Suppose that ya = 0 for all a A. Then, applying (9) inductively, we have: y SL = 0 for all SL SL y SL 1 = µL log X SL SL 1 exp(0) µL log m L for all SL 1 SL 1 y SL 2 = µL 1 log X SL 1 SL 2 exp(y SL 1/µL 1) µL 1 log m L 1 + µL log m L for all SL 2 SL 2 y Sℓ 1 = µℓlog X Sℓ Sℓ 1 exp(y Sℓ/µℓ) k=ℓ µk log mk for all Sℓ 1 Sℓ 1 and hence H = h (0) = y A PL ℓ=1 µℓlog mℓ. Eq. (C.18) then follows from Proposition C.1. Now, if µ1 = µ2 = = µL = µ, we have = µ log n, (C.21) which proves Eq. (C.19) and completes our proof. Proposition C.3. For all p (A) and all t = {1, 2, . . . }, we have: F(xt, µtyt+1) + ηt ˆct, xt = h (ηtyt + ηtˆct) h (ηtyt). (C.22) Proof. Let p (A) and t 1, 2, . . . , we simply write: F(xt, ηtyt+1) = h(xt) + h (ηtyt+1) ηt yt+1, xt = h(xt) + h (ηtyt) ηtyt, xt | {z } =F (xt,ηtyt) +h (ηtyt+1) h (ηtyt) ηt ˆct, xt = h (ηtyt + ηtˆct) h (yt) ηt ˆct, xt # F(xt, ηtyt) = 0 and our assertion follows. We are finally in a position to prove our main result (which we restate below for convenience): Theorem 1. Suppose that Algorithm 1 is run with a non-increasing learning rate ηt > 0 and uncertainty parameters µ1 µL > 0 against a sequence of cost vectors ct [0, 1]A, t = 1, 2, . . . , as per (4). Then, for all p (A), the learner enjoys the regret bound E[Regp(T)] H ηT +1 + neff t=1 ηt (19) with neff given by (17) and H H(µ1, . . . , µL) defined by setting y = 0 in (9) and taking H = y A, i.e., Nested Bandits In particular, if Algorithm 1 is run with µ1 = = µL = p neff/2 and ηt = p log n/(2t), we have E[Regp(T)] 2 p neff log n T. (21) Proof. Injecting Eq. (C.22) in the result of Proposition 2 and using Proposition B.1 and Eq. (16) of Proposition 1 directly yields the pseudo-regret bound (19). Finally, if we choose µ1 = = µL = p neff/2, Lemma C.1 gives neff/2 log n. (C.23) Thus, taking ηt = p log n/(2t) and substituting in (19) along with (C.23) finally delivers E[Regp(T)] 2 p neff log n T, (C.24) and our claim follows. D. Additional Experiment Details and Discussions In this appendix we provide additional details on the experiments as well as further discussions on the settings we presented. The code with the implementation of the algorithms as well as the code to reproduce the figures will be open-sourced and is provided along with the supplementary materials. D.1. Experiment additional details In the synthetic environment, at each level, the rewards are generated randomly according for each class nodes, through uniform distributions of randomly generated means and fixed bandwidth. 
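For concreteness, the per-class reward generation just described could be sketched as follows; this is a minimal illustration under assumptions of ours (the function and parameter names, e.g. `sample_class_rewards` and `reward_range`, are hypothetical and not taken from the released code):

```python
import numpy as np

def sample_class_rewards(rng, n_classes, reward_range):
    # Hypothetical sketch: every class at a given level carries a uniform reward
    # distribution whose mean is drawn at random and whose bandwidth (range) is
    # fixed for that level, as described above.
    means = rng.uniform(0.0, 1.0, size=n_classes)   # randomly generated means
    low = means - reward_range / 2.0                # fixed bandwidth around each mean
    high = means + reward_range / 2.0
    return low, high                                # reward of class S ~ Uniform(low[S], high[S])

rng = np.random.default_rng(0)
low, high = sample_class_rewards(rng, n_classes=3, reward_range=0.1)
rewards = rng.uniform(low, high)                    # one realization of the per-class rewards
```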
From one level ℓ to the next ℓ + 1, the reward ranges are divided by a multiplicative factor Rℓ/Rℓ+1 = 10. The implemented version of NEW uses the reward-based importance-weighted estimator. Moreover, no model selection was performed in these experiments, as no hyperparameter was tuned: a decaying rate of 1/√t was used for the score updates of all methods, as is common in the bandit literature [18].

D.2. Blue bus/red bus environment

We provide in Figure 5 a graphical representation of the blue bus/red bus environment, in which the many colors of the "bus" item constitute irrelevant alternatives. In this setting, with few arms, we run the methods up to the horizon T = 1000. We report in Figure 6 the regret and average reward of NEW and EXP3 for a varying number of subclasses of "bus".

Figure 5: Diagram of the blue bus/red bus environment.

While NEW ends up selecting the best alternative and achieves the lowest regret, EXP3 sometimes picks a wrong alternative, incurring higher regret and requiring more iterations to converge to a high average reward. In some of the random runs, alternatives with very low sampling probability that happened to be drawn changed the score vector too abruptly through the IPS estimator; this variance hurt EXP3 much more than it did NEW.

D.3. Tree structures

In this appendix we show additional results and visualisations for the second setting presented in the main paper. We first discuss the depth parameter L, and then the breadth parameter, i.e., the number of children per class M = |S|.

Figure 6: Regret and average reward of NEW and EXP3 on the blue bus/red bus environment, for N = 2, 5, 10 and 50 bus subclasses.

Influence of the depth parameter L. In Figure 7 we show the influence of the depth parameter for a fixed number of children per class. By making the tree deeper, we illustrate the benefit of knowing the nested structure compared to running logit choice over the whole set of alternatives. As shown in both the regret and average reward plots, NEW outperforms EXP3. While NEW also relies on an importance-weighted estimator, it is less prone to variance issues than EXP3: owing to the nested structure and to the reward decay governed by the ratio Rℓ+1/Rℓ, the NEW estimator keeps selecting the "right" parent classes, so its estimation noise does not hurt the regret.

Figure 7: Regret and average reward of NEW and EXP3 on the synthetic environment with a varying number of levels L (L = 4, 5, 6, with M = 3 children per class).
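To give a sense of scale for the depth experiments of Figure 7, note that with M = 3 children per class the number of leaf alternatives faced by EXP3 grows geometrically with the depth, namely n = M^L = 3^4 = 81, 3^5 = 243 and 3^6 = 729 for the three settings plotted above, whereas the nested selection (A.29) used by NEW only ever chooses among M = 3 children at each of the L levels.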
Influence of the number of children per class (wideness) M = |S|. In this setting we fix the number of levels L and vary the number of children per class M. In Figure 8 we see that NEW outperforms EXP3 in terms of both regret and average reward. Interestingly, the gap between the two methods shrinks as the number of children per class grows. This is because, when the size of a class increases, NEW has less local knowledge and also ends up with a large number of alternatives to choose among within each class.

Figure 8: Regret and average reward of NEW and EXP3 on the synthetic environment with a varying number of children per class M = |S| (L = 2, with M = 50, 100, 200).

D.4. A visualisation of the effects of NEW

In this appendix we illustrate the behaviour of NEW in a simple setting with a nested structure of depth L = 4 and M = |S| = 3 children per class. Figure 9 shows the score vectors of NEW along the optimal path in the tree (the path whose nodes have the highest cumulative mean, i.e., which generates the highest reward), together with the oracle means of the child nodes. The algorithm takes advantage of the nested structure and updates its score vectors consistently with the oracle means of all nodes; in other words, NEW correctly estimates the rewards of the environment.

Figure 9: Histograms of the normalized score vectors of NEW along the optimal path (levels 0 to 3), together with the rescaled reward distribution (mean value) of each node.

Conversely, Figure 10 shows that EXP3 suffers from variance issues and selects a suboptimal alternative among the |S|^L = 81 possible ones. EXP3 does not exploit the nested structure and therefore does not learn the reward values as accurately as NEW.

Figure 10: Histogram of the normalized score vector of EXP3 over all 81 alternatives, together with the rescaled reward distribution (mean value) of each node.

D.5. Cases where both algorithms perform identically

In this appendix we simply verify that our implementations of NEW and EXP3 match exactly and exhibit the same behaviour when the number of levels is set to L = 1. In this setting we have no knowledge of any nested structure, so both algorithms perform identically, as shown in Figure 11.

Figure 11: Regret and average reward of NEW and EXP3 on the synthetic environment with L = 1 (M = 5, 10, 50).
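To see why the two methods must coincide when L = 1, note that in this case the only classes are A itself and the individual alternatives, so the recursion (9) and the conditional choice rule (A.28) collapse to a single level: y_A = \mu_1 \log \sum_{a \in A} \exp(y_a/\mu_1) and
\[
P_a(y) = P_{a|A}(y) = \frac{\exp(y_a/\mu_1)}{\sum_{a' \in A} \exp(y_{a'}/\mu_1)},
\]
which is the standard logit choice rule underlying the exponential-weights sampling of EXP3 (with \mu_1 absorbed into the learning rate). Run with the same scores and learning rate, the two algorithms therefore induce the same sampling distributions.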
D.6. Variance plots for the synthetic experiments

We discuss here the variance of the regret at the final time step T = 10000. Indeed, as shown in Figure 6 for NEW, in Figure 7 for both EXP3 and NEW, and in Figure 8 for EXP3, some of the curves do not exhibit the monotonicity one would expect when increasing the number of arms through L or M, and even overlap on the regret plots. This is explained by Figure 12 for the red bus/blue bus environment, and by Figures 13 and 14 for the depth and wideness experiments, respectively. These plots show the dispersion (across the 20 random seeds) of the final regret of both methods. In Figure 13 we see that the EXP3 curves have similar mean values with large variances, which explains why they overlap in Figure 3. In Figure 14, where M varies, we can take a closer look at how NEW outperforms EXP3, and see that the close final regret values of NEW across the different values of M are likewise explained by their high variance.

Figure 12: Final-regret confidence intervals at T = 1000 on the red bus/blue bus environment (EXP3 and NEW, N = 2, 5, 10, 50).

Figure 13: Final-regret confidence intervals at T = 10000 when varying the depth parameter L (EXP3 and NEW, L = 4, 5, 6, M = 3).

Figure 14: Final-regret confidence intervals at T = 10000 when varying the wideness parameter M (EXP3 and NEW, L = 2, M = 50, 100, 200).

D.7. Reproducibility

We provide code to reproduce our experiments and plots, together with a more general implementation of both the NEW algorithm and the EXP3 baseline. All experiments were run on a MacBook Pro laptop with a single 6-core Intel Core i7 processor at 2.6 GHz. The code and all experiments can be found in the attached .zip archive.
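To complement the released implementation, the following is a minimal, self-contained sketch of one round of nested sampling and of a NIWE-style importance-weighted update on a toy similarity structure. It is only meant to make the conditional sampling rule (A.28)–(A.29) and the estimator of Proposition 1 concrete: all names are ours, the tree and cost increments are made up, and we assume (as in (4)–(5)) that the learner observes the cost increments r_S of the classes along the sampled lineage; the code in the attached archive remains the reference implementation.

```python
import math
import random

# Toy similarity structure: root "A", two level-1 classes, four leaves (arms).
children = {"A": ["S1", "S2"], "S1": ["a1", "a2"], "S2": ["a3", "a4"]}
level = {"A": 0, "S1": 1, "S2": 1, "a1": 2, "a2": 2, "a3": 2, "a4": 2}
mu = {1: 1.0, 2: 1.0}                             # uncertainty parameters mu_1 >= mu_2 > 0
y = {a: 0.0 for a in ("a1", "a2", "a3", "a4")}    # leaf propensity scores

def class_score(S, scores):
    """Score of a class under the leaf scores `scores`, via the log-sum-exp recursion (9)."""
    if S not in children:                         # leaf: its own score
        return scores[S]
    m = mu[level[S] + 1]
    return m * math.log(sum(math.exp(class_score(Sp, scores) / m) for Sp in children[S]))

def leaves(S):
    """All arms below class S (S itself if it is an arm)."""
    if S not in children:
        return [S]
    return [a for Sp in children[S] for a in leaves(Sp)]

def sample_lineage(eta):
    """Top-down sampling of a lineage via the conditional logit rule (A.28),
    applied to the rescaled scores eta * y, i.e. x_t = P(eta_t * y_t)."""
    scaled = {a: eta * v for a, v in y.items()}
    path, probs, S = [], [], "A"
    while S in children:
        m = mu[level[S] + 1]
        w = [math.exp(class_score(Sp, scaled) / m) for Sp in children[S]]
        k = random.choices(range(len(w)), weights=w)[0]
        path.append(children[S][k])
        probs.append(w[k] / sum(w))               # conditional probability of the chosen child
        S = children[S][k]
    return path, probs                            # the last element of `path` is the chosen arm

def niwe_update(path, probs, r_obs):
    """NIWE-style update: each observed increment r_S along the sampled lineage is
    importance-weighted by the probability x_S of reaching S and credited to every
    arm below S; leaf scores then decrease by the estimated costs (y <- y - c_hat)."""
    c_hat = {a: 0.0 for a in y}
    x_S = 1.0
    for S, p, r in zip(path, probs, r_obs):
        x_S *= p                                  # probability of sampling the lineage down to S
        for a in leaves(S):
            c_hat[a] += r / x_S                   # hat r_S = 1{S sampled} r_S / x_S, summed over S containing a
    for a in y:
        y[a] -= c_hat[a]

# One illustrative round with made-up cost increments along the sampled lineage.
path, probs = sample_lineage(eta=0.1)
niwe_update(path, probs, r_obs=[0.3, 0.1])
print(path, {a: round(v, 3) for a, v in y.items()})
```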