# Factored Bandits

Julian Zimmert (University of Copenhagen, zimmert@di.ku.dk) and Yevgeny Seldin (University of Copenhagen, seldin@di.ku.dk)

**Abstract.** We introduce the factored bandits model, a framework for learning with limited (bandit) feedback where actions can be decomposed into a Cartesian product of atomic actions. Factored bandits incorporate rank-1 bandits as a special case, but significantly relax the assumptions on the form of the reward function. We provide an anytime algorithm for stochastic factored bandits and matching (up to constants) upper and lower regret bounds for the problem. Furthermore, we show how a slight modification enables the proposed algorithm to be applied to utility-based dueling bandits. We obtain an improvement in the additive terms of the regret bound compared to state-of-the-art algorithms (the additive terms are dominating up to time horizons that are exponential in the number of arms).

## 1 Introduction

We introduce factored bandits, a bandit learning model in which actions can be decomposed into a Cartesian product of atomic actions. As an example, consider an advertising task where the actions decompose into (1) selection of an advertisement from a pool of advertisements and (2) selection of a location on a web page out of a set of locations where it can be presented. The probability of a click is then a function of the quality of the two actions: the attractiveness of the advertisement and the visibility of the location it was placed at. In order to maximize the reward, the learner has to maximize the quality of actions along each dimension of the problem. Factored bandits generalize the above example to an arbitrary number of atomic actions and arbitrary reward functions satisfying some mild assumptions.

[Figure 1: Relations between factored bandits and other bandit models. The diagram contrasts explicit reward models (combinatorial bandits, (generalized) linear bandits) with weakly constrained reward models (factored bandits, covering stochastic rank-1 and uniformly identifiable variants, and the relaxation of combinatorial bandits by Chen et al. (2016)); utility-based and Condorcet-winner dueling bandits are shown alongside.]

In a nutshell, at every round of a factored bandit game the player selects $L$ atomic actions, $a_1, \dots, a_L$, each from a corresponding finite set $\mathcal{A}_\ell$ of size $|\mathcal{A}_\ell|$ of possible actions. The player then observes a reward, which is an arbitrary function of $a_1, \dots, a_L$ satisfying some mild assumptions. For example, it can be a sum of the qualities of the atomic actions, a product of the qualities, or something else that does not necessarily have an analytical expression. The learner does not have to know the form of the reward function. Our way of dealing with the combinatorial complexity of the problem is through the introduction of a uniform identifiability assumption, by which the best action along each dimension is uniquely identifiable. A bit more precisely, when looking at a given dimension we call the collection of actions along all other dimensions a reference set. The uniform identifiability assumption states that in expectation the best action along a dimension outperforms any other action along the same dimension by a certain margin when both are played with the same reference set, irrespective of the composition of the reference set. This assumption is satisfied, for example, by the reward structure in linear and generalized linear bandits, but it is much weaker than the linearity assumption.
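To make the setting concrete, here is a minimal simulation sketch of the advertising example. The qualities and the reward formula below are our own illustrative choices, not from the paper; the point is that the best advertisement beats every other advertisement by a positive margin against every reference location, which is exactly the uniform identifiability requirement.

```python
import random

# Hypothetical qualities for two atomic sets: 3 advertisements, 2 locations.
AD_QUALITY = [0.2, 0.8, 0.5]    # atomic set A_1
LOC_QUALITY = [0.9, 0.4]        # atomic set A_2

def mean_reward(ad, loc):
    # An arbitrary monotone combination of the atomic qualities; the learner
    # never needs this formula, only that the margins below are positive.
    q1, q2 = AD_QUALITY[ad], LOC_QUALITY[loc]
    return q1 * q2 + 0.1 * min(q1, q2)   # neither a sum nor a pure product

def pull(ad, loc):
    # Bernoulli click feedback with the above mean.
    return 1.0 if random.random() < mean_reward(ad, loc) else 0.0

# Uniform identifiability along the first dimension: the best ad must beat
# every other ad by a positive margin for *every* reference location.
best_ad = max(range(3), key=lambda a: min(mean_reward(a, l) for l in range(2)))
for a in range(3):
    if a != best_ad:
        margin = min(mean_reward(best_ad, l) - mean_reward(a, l)
                     for l in range(2))
        assert margin > 0, "uniform identifiability violated"
        print(f"margin of best ad over ad {a}: {margin:.3f}")

clicks = sum(pull(best_ad, 0) for _ in range(1000))
print(f"empirical click rate of (best ad, location 0): {clicks / 1000:.2f}")
```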
In Figure 1, we sketch the relations between factored bandits and other bandit models. We distinguish between bandits with explicit reward models, such as linear and generalized linear bandits, and bandits with weakly constrained reward models, including factored bandits and some relaxations of combinatorial bandits. Rank-1 bandits [7] are a special case of factored bandits. In rank-1 bandits the player selects two actions and the reward is the product of their qualities. Factored bandits generalize this to an arbitrary number of actions and significantly relax the assumption on the form of the reward function.

The relation with other bandit models is a bit more involved. There is an overlap between factored bandits and (generalized) linear bandits [1; 6], but neither is a special case of the other. When actions are represented by unit vectors, then for (generalized) linear reward functions the models coincide. However, (generalized) linear bandits allow a continuum of actions, whereas factored bandits relax the (generalized) linearity assumption on the reward structure to uniform identifiability. There is a partial overlap between factored bandits and combinatorial bandits [3]. The action set in combinatorial bandits is a subset of $\{0,1\}^d$. If the action set is unrestricted, i.e. $\mathcal{A} = \{0,1\}^d$, then combinatorial bandits can be seen as factored bandits with just two actions along each of the $d$ dimensions. However, typically in combinatorial bandits the action set is a strict subset of $\{0,1\}^d$ and one of the parameters of interest is the permitted number of non-zero elements. This setting is not covered by factored bandits. While in the classical combinatorial bandits setting the reward structure is linear, there exist relaxations of the model, e.g. Chen et al. [4].

Dueling bandits are not directly related to factored bandits and, therefore, we depict them with faded dashed blocks in Figure 1. While the action set in dueling bandits can be decomposed into a product of the basic action set with itself (one for the first and one for the second action in the duel), the observations in dueling bandits are the identities of the winners rather than rewards. Nevertheless, we show that the proposed algorithm for factored bandits can be applied to utility-based dueling bandits.

The main contributions of the paper can be summarized as follows:

1. We introduce factored bandits and the uniform identifiability assumption.
2. Factored bandits with uniformly identifiable actions are a generalization of rank-1 bandits.
3. We provide an anytime algorithm for playing factored bandits under the uniform identifiability assumption in stochastic environments and analyze its regret. We also provide a lower bound matching up to constants.
4. Unlike the majority of bandit models, our approach does not require explicit specification or knowledge of the form of the reward function (as long as the uniform identifiability assumption is satisfied). For example, it can be a weighted sum of the qualities of atomic actions (as in linear bandits), a product thereof, or any other function not necessarily known to the algorithm.
5. We show that the algorithm can also be applied to utility-based dueling bandits, where the additive factor in the regret bound is reduced by a multiplicative factor of $K$ compared to the state of the art (where $K$ is the number of actions).
It should be emphasized that in state-of-the-art regret bounds for utility-based dueling bandits the additive factor is dominating for time horizons below $\Omega(\exp(K))$, whereas in the new result it is only dominant for time horizons up to $O(K)$.
6. Our work provides a unified treatment of two distinct bandit models: rank-1 bandits and utility-based dueling bandits.

The paper is organized in the following way. In Section 2 we introduce the factored bandit model and the uniform identifiability assumption. In Section 3 we provide algorithms for factored bandits and dueling bandits. In Section 4 we analyze the regret of our algorithm and provide matching upper and lower regret bounds. In Section 5 we compare our work empirically and theoretically with prior work. We finish with a discussion in Section 6.

## 2 Problem Setting

### 2.1 Factored bandits

We define the game in the following way. We assume that the set of actions $\mathcal{A}$ can be represented as a Cartesian product of atomic actions, $\mathcal{A} = \bigotimes_{\ell=1}^{L} \mathcal{A}_\ell$. We call the elements of $\mathcal{A}_\ell$ atomic arms. For rounds $t = 1, 2, \dots$ the player chooses an action $A_t \in \mathcal{A}$ and observes a reward $r_t$ drawn according to an unknown probability distribution $p_{A_t}$ (i.e., the game is stochastic). We assume that the mean rewards $\mu(a) = \mathbb{E}[r_t \mid A_t = a]$ are bounded in $[-1, 1]$ and that the noise $\eta_t = r_t - \mu(A_t)$ is conditionally 1-sub-Gaussian. Formally, this means that

$$\forall \lambda \in \mathbb{R}: \quad \mathbb{E}\left[e^{\lambda \eta_t} \mid \mathcal{F}_{t-1}\right] \le \exp\left(\frac{\lambda^2}{2}\right),$$

where $\mathcal{F}_t := \{A_1, r_1, A_2, r_2, \dots, A_t, r_t\}$ is the filtration defined by the history of the game up to and including round $t$. We denote $a^* = (a_1^*, a_2^*, \dots, a_L^*) = \arg\max_{a \in \mathcal{A}} \mu(a)$.

**Definition 1** (uniform identifiability). An atomic set $\mathcal{A}_k$ has a uniformly identifiable best arm $a_k^*$ if and only if

$$\forall a \in \mathcal{A}_k \setminus \{a_k^*\}: \quad \Delta_k(a) := \min_{b \in \bigotimes_{\ell \neq k} \mathcal{A}_\ell} \mu(a_k^*, b) - \mu(a, b) > 0. \tag{1}$$

We assume that all atomic sets have uniformly identifiable best arms. The goal is to minimize the pseudo-regret, which is defined as

$$\mathrm{Reg}_T = \mathbb{E}\left[\sum_{t=1}^{T} \mu(a^*) - \mu(A_t)\right].$$

Due to the generality of the uniform identifiability assumption, we cannot upper bound the instantaneous regret $\mu(a^*) - \mu(A_t)$ in terms of the gaps $\Delta_\ell(a_\ell)$. However, a sequential application of (1) provides a lower bound

$$\mu(a^*) - \mu(a) = \mu(a^*) - \mu(a_1, a_2^*, \dots, a_L^*) + \mu(a_1, a_2^*, \dots, a_L^*) - \mu(a) \ge \Delta_1(a_1) + \mu(a_1, a_2^*, \dots, a_L^*) - \mu(a) \ge \dots \ge \sum_{\ell=1}^{L} \Delta_\ell(a_\ell). \tag{2}$$

For the upper bound, let $\kappa$ be a problem-dependent constant such that $\mu(a^*) - \mu(a) \le \kappa \sum_{\ell=1}^{L} \Delta_\ell(a_\ell)$ holds for all $a$. Since the mean rewards are in $[-1, 1]$, the condition is always satisfied by $\kappa = 2\left(\min_{a,\ell} \Delta_\ell(a_\ell)\right)^{-1}$, and by equation (2) $\kappa$ is always at least 1. The constant $\kappa$ appears in the regret bounds. In the extreme case when $\kappa = 2\left(\min_{a,\ell} \Delta_\ell(a_\ell)\right)^{-1}$ the regret guarantees are fairly weak. However, in many specific cases mentioned in the previous section, $\kappa$ is typically small or even 1. We emphasize that the algorithms proposed in the paper do not require knowledge of $\kappa$. Thus, the dependence of the regret bounds on $\kappa$ is not a limitation and the algorithms automatically adapt to more favorable environments.
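The following toy computation evaluates the gaps $\Delta_\ell(a)$ of Definition 1 and the smallest constant $\kappa$ for a small two-factor problem; the reward matrix is our own illustrative choice, not from the paper.

```python
import itertools

import numpy as np

# mu[a1, a2]: mean rewards of a 3 x 2 factored problem, values in [-1, 1].
# Chosen so that uniform identifiability holds in both dimensions.
mu = np.array([[0.10, 0.00],
               [0.60, 0.35],
               [0.30, 0.20]])

a_star = np.unravel_index(np.argmax(mu), mu.shape)  # best combined action

def gap(dim, a):
    """Delta_dim(a): margin of the best atomic arm over a, minimized over
    all reference sets (Definition 1)."""
    if dim == 0:
        return (mu[a_star[0], :] - mu[a, :]).min()
    return (mu[:, a_star[1]] - mu[:, a]).min()

# Smallest kappa with mu(a*) - mu(a) <= kappa * sum_l Delta_l(a_l) for all a.
kappa = 1.0
for a1, a2 in itertools.product(range(3), range(2)):
    if (a1, a2) == a_star:
        continue
    total_gap = gap(0, a1) + gap(1, a2)   # zero terms for optimal coordinates
    kappa = max(kappa, (mu[a_star] - mu[a1, a2]) / total_gap)

print("a* =", a_star, " kappa =", round(kappa, 3))
```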
### 2.2 Dueling bandits

The set of actions in dueling bandits is factored into $\mathcal{A} \times \mathcal{A}$. However, strictly speaking the problem is not a factored bandit problem, because the observations in dueling bandits are not the rewards.¹ When playing two arms, $a$ and $b$, we observe the identity of the winning arm, but the regret is typically defined via the average relative quality of $a$ and $b$ with respect to the best arm in $\mathcal{A}$. The literature distinguishes between different dueling bandit settings. We focus on utility-based dueling bandits [14] and show that they satisfy the uniform identifiability assumption.

In utility-based dueling bandits, it is assumed that each arm $a$ has a utility $u(a)$ and that the winning probabilities are defined by $\Pr[a \text{ wins against } b] = \nu(u(a) - u(b))$ for a monotonically increasing link function $\nu$. Let $w(a, b)$ be 1 if $a$ wins against $b$ and 0 if $b$ wins against $a$. Let $a^* := \arg\max_{a \in \mathcal{A}} u(a)$ denote the best arm. Then for any arm $b \in \mathcal{A}$ and any $a \in \mathcal{A} \setminus \{a^*\}$, it holds that $\mathbb{E}[w(a^*, b)] - \mathbb{E}[w(a, b)] = \nu(u(a^*) - u(b)) - \nu(u(a) - u(b)) > 0$, which satisfies the uniform identifiability assumption. For the rest of the paper we consider the linear link function $\nu(x) = \frac{1+x}{2}$. The regret is then defined by

$$\mathrm{Reg}_T = \mathbb{E}\left[\sum_{t=1}^{T} \frac{u(a^*) - u(A_t)}{2} + \frac{u(a^*) - u(B_t)}{2}\right], \tag{3}$$

where $A_t$ and $B_t$ are the two arms played in the duel at round $t$.

## 3 Algorithms

Although in theory an asymptotically optimal algorithm for any structured bandit problem was presented in [5], for factored bandits this algorithm not only requires solving an intractable semi-infinite linear program at every round, but also suffers from additive constants which are exponential in the number of atomic actions $L$. An alternative naive approach could be an adaptation of sparring [16], where each factor runs an independent $K$-armed bandit algorithm and does not observe the atomic arm choices of the other factors. The downside of sparring algorithms, both theoretically and practically, is that each algorithm operates under limited information and the rewards become non-i.i.d. from the perspective of each individual factor.

Our Temporary Elimination Algorithm (TEA, Algorithm 1) avoids these downsides. It runs independent instances of the Temporary Elimination Module (TEM, Algorithm 3) in parallel, one per factor of the problem. Each TEM operates on a single atomic set. The TEA is responsible for the synchronization of the TEM instances. Two main ingredients ensure information efficiency. First, we use relative comparisons between arms instead of comparing absolute mean rewards. This cancels out the effect of non-stationary means. The second idea is to use local randomization in order to obtain unbiased estimates of the relative performance without having to actually play each atomic arm with the same reference, which would have led to prohibitive time complexity.

**Algorithm 1: Factored Bandit TEA**

    1:  ∀ℓ: TEM_ℓ ← new TEM(A_ℓ)
    2:  t ← 1
    3:  for s = 1, 2, ... do
    4:      M_s ← max_ℓ |TEM_ℓ.getActiveSet(f(t)⁻¹)|
    5:      T_s ← (t, t+1, ..., t + M_s − 1)
    6:      for ℓ ∈ {1, ..., L} in parallel do
    7:          TEM_ℓ.scheduleNext(T_s)
    8:      for t ∈ T_s do
    9:          r_t ← play((TEM_ℓ.A_t)_{ℓ=1,...,L})
    10:     for ℓ ∈ {1, ..., L} in parallel do
    11:         TEM_ℓ.feedback((r_{t'})_{t' ∈ T_s})
    12:     t ← t + |T_s|

**Algorithm 2: Dueling Bandit TEA**

    1:  TEM ← new TEM(A)
    2:  t ← 1
    3:  for s = 1, 2, ... do
    4:      A_s ← TEM.getActiveSet(f(t)⁻¹)
    5:      T_s ← (t, t+1, ..., t + |A_s| − 1)
    6:      TEM.scheduleNext(T_s)
    7:      for b ∈ random permutation of A_s do
    8:          r_t ← play(TEM.A_t, b)
    9:          t ← t + 1
    10:     TEM.feedback((r_{t'})_{t' ∈ T_s})

¹ In principle, it is possible to formulate a more general problem that would incorporate both factored bandits and dueling bandits. However, such a definition becomes too general and hard to work with. For the sake of clarity we have avoided this path.

The TEM instances run in parallel in externally synchronized phases. Each module selects active arms in getActiveSet(δ) such that the optimal arm is included with high probability. The length of a phase is chosen such that each module can play each potentially optimal arm at least once during the phase. All modules schedule all arms for the phase in scheduleNext. This is done by choosing arms in a round-robin fashion (with random choices if not all arms can be played equally often) and ordering them randomly. All scheduled plays are executed and the modules update their statistics through a call of the feedback routine. The modules use slowly increasing lower confidence bounds for the gaps in order to temporarily eliminate arms that are suboptimal with high probability. In all algorithms, we use $f(t) := (t+1)\log^2(t+1)$.
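The following is a schematic, runnable rendering of the synchronization loop of Algorithm 1. The `MiniTEM` class and all names are our own: it performs only the scheduling and bookkeeping, with the LCB-based elimination of Algorithm 3 stubbed out (a sketch of that rule follows the analysis below).

```python
import math
import random

def f(t):
    return (t + 1) * math.log(t + 1) ** 2   # f(t) = (t+1) log^2(t+1)

class MiniTEM:
    def __init__(self, arms):
        self.arms = list(arms)
        self.schedule = {}

    def get_active_set(self, delta):
        # Placeholder: the real TEM shrinks this set using LCBs on the gaps.
        return self.arms

    def schedule_next(self, ts):
        # Round-robin over active arms; leftover slots are filled by a random
        # sample, and the whole phase is randomly ordered (local randomization).
        order = self.arms * (len(ts) // len(self.arms))
        order += random.sample(self.arms, len(ts) - len(order))
        random.shuffle(order)
        self.schedule = dict(zip(ts, order))

    def action_at(self, s):
        return self.schedule[s]

    def feedback(self, rewards):
        pass  # the real TEM updates the pairwise statistics N and D here

def tea(atomic_sets, play, horizon):
    modules = [MiniTEM(A) for A in atomic_sets]
    t = 1
    while t <= horizon:
        # Phase length: the largest active set must fit in at least once.
        M_s = max(len(m.get_active_set(1.0 / f(t))) for m in modules)
        ts = list(range(t, t + M_s))
        for m in modules:
            m.schedule_next(ts)
        rewards = {s: play([m.action_at(s) for m in modules]) for s in ts}
        for m in modules:                      # synchronized feedback
            m.feedback(rewards)
        t += M_s

# Demo with made-up qualities and a noisy product reward.
quality = [{0: 0.2, 1: 0.8}, {0: 0.9, 1: 0.4, 2: 0.6}]
tea([list(q) for q in quality],
    lambda a: quality[0][a[0]] * quality[1][a[1]] + random.gauss(0, 0.1),
    horizon=1000)
print("ran one TEA simulation")
```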
**Dueling bandits.** For dueling bandits we only use a single instance of TEM. In each phase the algorithm generates two random permutations of the active set and plays the corresponding actions from the two lists against each other. (The first permutation is generated in Line 6 and the second in Line 7 of Algorithm 2.)

The TEM tracks the empirical differences between the rewards of all arms $a_i$ and $a_j$ in $D_{ij}$. Based on these differences, it computes lower confidence bounds for all gaps. The set $K^*$ contains those arms for which all LCB gaps are at most zero. Additionally, the algorithm keeps track, in $B$, of the arms that were never removed from the active set. During a phase, each arm from $K^*$ is played at least once, but only arms in $B$ can be played more than once. This is necessary to keep the additive constants at $M\log(K)$ instead of $MK$.

**Algorithm 3: Temporary Elimination Module (TEM) implementation**

    global: N_{i,j}, D_{i,j}, K*, B
    Function initialize(K):
        ∀a_i, a_j ∈ K: N_{i,j}, D_{i,j} ← 0, 0
        K* ← K;  B ← K
    Function getActiveSet(δ):
        if ∃ N_{i,j} = 0 then
            K* ← K
        else
            for a_i ∈ K do
                Δ̂_LCB(a_i) ← max_{a_j ≠ a_i} [ D_{j,i}/N_{j,i} − √(12 log(2 K f(N_{j,i}) δ⁻¹) / N_{j,i}) ]
            K* ← { a_i ∈ K : Δ̂_LCB(a_i) ≤ 0 }
            if |K*| = 0 then K* ← K
        B ← B ∩ K*
        if |B| = 0 then B ← K*
        return K*
    Function scheduleNext(T):
        for a ∈ K* do
            t ← random unassigned index in T;  A_t ← a
        while not all A_{t_s}, ..., A_{t_s + |T| − 1} are assigned do
            for a ∈ B do
                t ← random unassigned index in T;  A_t ← a
    Function feedback({R_t}_{t = t_s, ..., t_s + M_s − 1}):
        ∀a_i: N_i^s, R_i^s ← 0, 0
        for t = t_s, ..., t_s + M_s − 1 do
            R^s_{A_t} ← R^s_{A_t} + R_t;  N^s_{A_t} ← N^s_{A_t} + 1
        for a_i, a_j ∈ K* do
            D_{i,j} ← D_{i,j} + min{N_i^s, N_j^s} (R_i^s / N_i^s − R_j^s / N_j^s)
            N_{i,j} ← N_{i,j} + min{N_i^s, N_j^s}

## 4 Analysis

We start this section with the main theorem, which bounds the number of times the TEM pulls suboptimal arms. Then we prove upper bounds on the regret of our main algorithms. Finally, we prove a lower bound for factored bandits that shows that our regret bound is tight up to constants.

### 4.1 Upper bound on the number of suboptimal pulls by TEM

**Theorem 1.** For any TEM submodule $\mathrm{TEM}_\ell$ with an arm set of size $K = |\mathcal{A}_\ell|$, running in the TEA algorithm with $M := \max_\ell |\mathcal{A}_\ell|$, and any suboptimal atomic arm $a \neq a_\ell^*$, let $N_t(a)$ denote the number of times $\mathrm{TEM}_\ell$ has played the arm $a$ up to time $t$. Then there exist constants $C(a) \le M$ for $a \neq a_\ell^*$, such that

$$\mathbb{E}[N_t(a)] \le \frac{120\left(\log(2Kt\log^2(t)) + 4\log\left(48\log(2Kt\log^2(t))\right)\right)}{\Delta(a)^2} + C(a),$$

where $\sum_{a \neq a_\ell^*} C(a) \le M\log(K) + \frac{5}{2}K$ in the case of factored bandits and $C(a) \le \frac{5}{2}$ for dueling bandits.

*Proof sketch.* [The complete proof is provided in the Appendix.]

**Step 1.** We show that the confidence intervals are constructed in such a way that the probability of all confidence intervals holding at all epochs from $s'$ onward is at least $1 - \max_{s \ge s'} f(t_s)^{-1}$. This requires a novel concentration inequality (Lemma 3) for a sum of conditionally $\sigma_s$-sub-Gaussian random variables, where $\sigma_s$ can depend on the history. This technique might be useful for other problems as well.

**Step 2.** We split the number of pulls into pulls that happen in rounds where the confidence intervals hold and those where they fail: $N_t(a) = N_t^{\mathrm{conf}}(a) + N_t^{\overline{\mathrm{conf}}}(a)$. We can bound the expectation of $N_t^{\overline{\mathrm{conf}}}(a)$ based on the failure probabilities given by $\Pr[\text{conf. failure at round } s] \le \frac{1}{f(t_s)}$.

**Step 3.** We define $s'$ as the last round in which the confidence intervals held and $a$ was not eliminated. We can split $N_t^{\mathrm{conf}}(a) = N_{t_{s'}}^{\mathrm{conf}}(a) + C(a)$ and use the confidence intervals to upper bound $N_{t_{s'}}^{\mathrm{conf}}(a)$. The upper bound on $\sum_a C(a)$ requires special handling of arms that were eliminated once, carefully separating the cases where the confidence intervals never fail from those where they might fail.
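Before moving on to the regret bounds, here is a self-contained sketch of the LCB elimination rule at the heart of getActiveSet in Algorithm 3. The function names and the pairwise statistics below are made up for illustration; an arm stays active as long as no rival's lower confidence bound on its gap is positive.

```python
import math

def f(t):
    return (t + 1) * math.log(t + 1) ** 2

def active_set(arms, N, D, delta):
    """N[j][i]: paired-comparison counts; D[j][i]: cumulative estimated
    reward difference mu(a_j) - mu(a_i), as maintained by feedback()."""
    if any(N[i][j] == 0 for i in arms for j in arms if i != j):
        return list(arms)                      # not enough data yet
    K = len(arms)
    active = []
    for i in arms:
        # Lower confidence bound on the gap of arm i against its best rival.
        lcb = max(
            D[j][i] / N[j][i]
            - math.sqrt(12 * math.log(2 * K * f(N[j][i]) / delta) / N[j][i])
            for j in arms if j != i)
        if lcb <= 0:                           # no arm provably beats i
            active.append(i)
    return active or list(arms)                # never return an empty set

# Made-up statistics for 3 arms after 5000 paired plays of each pair:
# arm 2 trails arm 0 by 0.3 on average, arm 1 trails by only 0.08.
N = [[0, 5000, 5000], [5000, 0, 5000], [5000, 5000, 0]]
D = [[0.0, 400.0, 1500.0], [-400.0, 0.0, 1100.0], [-1500.0, -1100.0, 0.0]]
print(active_set(range(3), N, D, delta=0.01))  # -> [0, 1]: arm 2 is dropped
```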
### 4.2 Regret upper bound for Factored Bandit TEA

A regret bound for the Factored Bandit TEA algorithm, Algorithm 1, is provided in the following theorem.

**Theorem 2.** The pseudo-regret of Algorithm 1 at any time $T$ is bounded by

$$\mathrm{Reg}_T \le \kappa \sum_{\ell=1}^{L}\left(\sum_{a \neq a_\ell^*} \frac{120\left(\log(2|\mathcal{A}_\ell|T\log^2(T)) + 4\log\left(48\log(2|\mathcal{A}_\ell|T\log^2(T))\right)\right)}{\Delta_\ell(a)} + M\log(|\mathcal{A}_\ell|) + \frac{5}{2}|\mathcal{A}_\ell|\right).$$

*Proof.* The design of TEA allows the application of Theorem 1 to each instance of TEM. Using $\mu(a^*) - \mu(a) \le \kappa \sum_{\ell=1}^{L} \Delta_\ell(a_\ell)$, we have

$$\mathbb{E}\left[\sum_{t=1}^{T} \mu(a^*) - \mu(A_t)\right] \le \kappa \sum_{\ell=1}^{L} \sum_{a \neq a_\ell^*} \mathbb{E}[N_T(a)]\Delta_\ell(a).$$

Applying Theorem 1 to the expected number of pulls and bounding the sums $\sum_a C(a)\Delta(a) \le \sum_a C(a)$ completes the proof.

### 4.3 Dueling bandits

A regret bound for the Dueling Bandit TEA algorithm (DBTEA), Algorithm 2, is provided in the following theorem.

**Theorem 3.** The pseudo-regret of Algorithm 2 (defined in equation (3)) for any utility-based dueling bandit problem at any time $T$ satisfies

$$\mathrm{Reg}_T \le O\left(\sum_{a \neq a^*} \frac{\log(T)}{\Delta(a)}\right) + O(K).$$

*Proof.* At every round, each arm in the active set is played once in position $A$ and once in position $B$ of $\mathrm{play}(A, B)$. Denote by $N_t^A(a)$ the number of plays of an arm $a$ in the first position, by $N_t^B(a)$ the number of plays in the second position, and by $N_t(a)$ the total number of plays of the arm. We have

$$\sum_{a \neq a^*} \mathbb{E}[N_t(a)]\Delta(a) = \sum_{a \neq a^*} \mathbb{E}[N_t^A(a) + N_t^B(a)]\Delta(a) = \sum_{a \neq a^*} 2\,\mathbb{E}[N_t^A(a)]\Delta(a).$$

The proof is completed by applying Theorem 1 to bound $\mathbb{E}[N_t^A(a)]$.

### 4.4 Lower bound

We show that without additional assumptions the regret bound cannot be improved. The lower bound is based on the following construction. The mean reward of every arm is given by $\mu(a) = \mu(a^*) - \sum_{\ell} \Delta_\ell(a_\ell)$. The noise is Gaussian with variance 1. In this problem, the regret can be decomposed into a sum over atomic arms of the regret induced by pulling these arms,

$$\mathrm{Reg}_T = \sum_{\ell=1}^{L} \sum_{a \in \mathcal{A}_\ell \setminus \{a_\ell^*\}} \mathbb{E}[N_T(a)]\Delta_\ell(a).$$

Assume that we only want to minimize the regret induced by a single atomic set $\mathcal{A}_\ell$. Further, assume that $\Delta_k(a)$ for all $k \neq \ell$ are given. Then the problem is reduced to a regular $K$-armed bandit problem. The asymptotic lower bound for the $K$-armed bandit under 1-Gaussian noise goes back to [10]: for any consistent strategy $\theta$, the asymptotic regret is lower bounded by

$$\liminf_{T\to\infty} \frac{\mathrm{Reg}_T^\theta}{\log(T)} \ge \sum_{a \neq a^*} \frac{2}{\Delta(a)}.$$

Due to the regret decomposition, we can apply this bound to every atomic set separately. Therefore, the asymptotic regret in the factored bandit problem is

$$\liminf_{T\to\infty} \frac{\mathrm{Reg}_T^\theta}{\log(T)} \ge \sum_{\ell=1}^{L} \sum_{a \neq a_\ell^*} \frac{2}{\Delta_\ell(a)}.$$

This shows that our general upper bound is asymptotically tight up to leading constants and $\kappa$.
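As a quick numeric illustration of this lower bound, the following computes the asymptotic constant for a toy two-factor instance of the above construction (the gap values are our own).

```python
# Toy instance of our own choosing: two factors with additive gaps and
# unit-variance Gaussian noise, as in the construction of Section 4.4.
gaps = [
    [0.1, 0.3],   # Delta_1(a) for the suboptimal arms of factor 1
    [0.2],        # Delta_2(a) for the suboptimal arm of factor 2
]

# Per-factor Lai-Robbins constants, summed over factors.
constant = sum(2.0 / d for factor in gaps for d in factor)
print(f"liminf Reg_T / log(T) >= {constant:.1f}")   # 20 + 6.7 + 10 = 36.7
```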
**κ-gap.** We note that there is a problem-dependent gap of $\kappa$ between our upper and lower bounds. Currently we believe that this gap stems from the difference between the information and computational complexity of the problem. Our algorithm operates on each factor of the problem independently of the other factors and is based on the optimism in the face of uncertainty principle. It is possible to construct examples in which the optimal strategy requires playing surely suboptimal arms for the sake of information gain. Constructions of this kind were used by Lattimore and Szepesvári [11] to show the suboptimality of optimism-based algorithms. Therefore, we believe that removing $\kappa$ from the upper bound is possible, but requires a fundamentally different algorithm design. What is not clear is whether it is possible to remove $\kappa$ without a significant sacrifice in computational complexity.

## 5 Comparison to Prior Work

### 5.1 Stochastic rank-1 bandits

Stochastic rank-1 bandits, introduced by Katariya et al. [7], are a special case of factored bandits. The authors published a refined algorithm for Bernoulli rank-1 bandits using KL confidence sets in Katariya et al. [8]. We compare our theoretical results with the first paper, because it matches our problem assumptions. In our experiments, we provide a comparison to both the original algorithm and the KL version.

In the stochastic rank-1 problem there are only two atomic sets, of sizes $K_1$ and $K_2$. The matrix of expected rewards for each pair of arms is of rank 1, meaning that for each $u \in \mathcal{A}_1$ and $v \in \mathcal{A}_2$ there exist $\bar{u}, \bar{v} \in [0, 1]$ such that $\mathbb{E}[r(u, v)] = \bar{u}\bar{v}$. The Stochastic rank-1 Elimination algorithm introduced by Katariya et al. is a typical elimination-style algorithm. It requires knowledge of the time horizon and uses phases that increase exponentially in length. In each phase, all arms are played uniformly. At the end of a phase, all arms that are suboptimal with high probability are eliminated.

**Theoretical comparison.** It is hard to make a fair comparison of the theoretical bounds because TEA operates under much weaker assumptions. Both algorithms have a regret bound of

$$O\left(\left(\sum_{u \in \mathcal{A}_1 \setminus \{u^*\}} \frac{1}{\Delta_1(u)} + \sum_{v \in \mathcal{A}_2 \setminus \{v^*\}} \frac{1}{\Delta_2(v)}\right)\log(t)\right).$$

The problem-independent multiplicative factors hidden under the $O$-notation are smaller for TEA, even without considering that rank-1 Elimination requires a doubling trick for anytime applications. However, the problem-dependent factors are in favor of rank-1 Elimination, where the gaps correspond to the mean difference under uniform sampling, $(\bar{u}^* - \bar{u})\frac{\sum_{v \in \mathcal{A}_2} \bar{v}}{K_2}$. In factored bandits, the gaps are defined as $(\bar{u}^* - \bar{u})\min_{v \in \mathcal{A}_2} \bar{v}$, which is naturally smaller. The difference stems from the different problem assumptions. The stronger assumptions of rank-1 bandits make elimination easier as the number of eliminated suboptimal arms increases. The TEA analysis holds in cases where it becomes harder to identify suboptimal arms after the removal of bad arms. This may happen when highly suboptimal atomic actions in one factor provide more discriminative information about atomic actions in other factors than close-to-optimal atomic actions in the same factor (this follows the spirit of the illustration of the suboptimality of optimistic algorithms in [11]). We leave it to future work to improve the upper bound of TEA under stronger model assumptions.

In terms of memory and computational complexity, TEA is inferior to regular elimination-style algorithms, because we need to keep track of the relative performances of the arms. Both the computational and the memory complexity are $O\left(\sum_\ell |\mathcal{A}_\ell|^2\right)$ per round in the worst case, as opposed to rank-1 Elimination, which only requires $O(|\mathcal{A}_1| + |\mathcal{A}_2|)$.
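The following toy computation (quality values our own) contrasts the two gap notions for a rank-1 instance; the factored-bandit gap is smaller, which makes the corresponding regret term larger.

```python
# Toy values of our own choosing for a rank-1 instance E[r(u, v)] = u * v.
u_bar = [0.9, 0.7, 0.5]          # row qualities; the best is 0.9
v_bar = [0.8, 0.3, 0.2, 0.1]     # column qualities; the best is 0.8

for u in u_bar[1:]:
    rank1_gap = (u_bar[0] - u) * sum(v_bar) / len(v_bar)  # uniform reference
    factored_gap = (u_bar[0] - u) * min(v_bar)            # worst reference
    print(f"u = {u}: rank1Elim gap {rank1_gap:.3f}, "
          f"factored-bandit gap {factored_gap:.3f}")
```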
**Empirical comparison.** The number of arms is set to 16 in both sets. We always fix $\bar{u}^* - \bar{u} = \bar{v}^* - \bar{v} = 0.2$ and vary the absolute value of $\bar{u}^*\bar{v}^*$. As expected, Rank1ElimKL has an advantage when the Bernoulli random variables are strongly biased towards one side. When the bias is close to $\frac{1}{2}$, we clearly see the better constants of TEA.

[Figure 2: Comparison of Rank1Elim, Rank1ElimKL, and TEA for K = L = 16. The results are averaged over 20 repetitions of the experiment.]

In the evaluation we clearly outperform rank-1 Elimination over different parameter settings and even beat the KL-optimized version if the means are not too close to zero or one. This suggests that our algorithm not only provides a more practical anytime version of elimination, but also improves on the constant factors in the regret. We believe that our algorithm design can be used to improve other elimination-style algorithms as well.

### 5.2 Dueling Bandits: Related Work

To the best of our knowledge, the proposed Dueling Bandit TEA is the first algorithm for utility-based dueling bandits that satisfies the following three criteria simultaneously:

- It requires no prior knowledge of the time horizon (and uses neither the doubling trick nor restarts).
- Its pseudo-regret is bounded by $O\left(\sum_{a \neq a^*} \frac{\log(t)}{\Delta(a)}\right)$.
- There are no additive constants that dominate the regret for time horizons $T > O(K)$.

We want to stress the importance of the last point. For all state-of-the-art algorithms known to us, when the number of actions $K$ is moderately large, the additive term dominates for any realistic time horizon $T$. In particular, Ailon et al. [2] introduce three algorithms for the utility-based dueling bandit problem. The regret of Doubler scales with $O(\log^2(t))$. The regret of MultiSBM has an additive term of order $\sum_{a \neq a^*} \frac{K}{\Delta(a)}$ that is dominating for $T < \Omega(\exp(K))$. The last algorithm, Sparring, has no theoretical analysis.

Algorithms based on the weaker Condorcet winner assumption apply to the utility-based setting, but they all suffer from equally large or even larger additive terms. The RUCB algorithm introduced by Zoghi et al. [17] has an additive term in the bound that is defined as $2D\Delta_{\max}\log(2D)$, for $\Delta_{\max} = \max_{a \neq a^*} \Delta(a)$ and $D > \frac{1}{2}\sum_{a_i \neq a^*}\sum_{a_j \neq a_i} \frac{4\alpha}{\min\{\Delta(a_i)^2, \Delta(a_j)^2\}}$. Unwrapping these definitions shows that the RUCB regret bound has an additive term of order $2D\Delta_{\max} \ge \sum_{a \neq a^*} \frac{K}{\Delta(a)}$. This is again the dominating term for time horizons $T \le \Omega(\exp(K))$. The same applies to the RMED algorithm introduced by Komiyama et al. [9], which has an additive term of $O(K^2)$. (The dependencies on the gaps are hidden behind the $O$-notation.) The D-TS algorithm by Wu and Liu [13], based on Thompson sampling, shows one of the best empirical performances, but its regret bound includes an additive constant of order $O(K^3)$. Other algorithms known to us, Interleaved Filter [16], Beat the Mean [15], and SAVAGE [12], all require knowledge of the time horizon $T$ in advance.

**Empirical comparison.** We have used the framework provided by Komiyama et al. [9] and use the same utility for all suboptimal arms. In Figure 3, the winning probability of the optimal arm over suboptimal arms is always set to 0.7, and we run the experiment for different numbers of arms $K$. TEA outperforms all algorithms besides the RMED variants, as long as the number of arms is sufficiently large. To show that there also exists a regime where the improved constants gain an advantage over RMED, we conducted a second experiment in Figure 4 (in the Appendix), where we set the winning probability to 0.95 (smaller gaps show the same behavior but require more arms and more timesteps) and significantly increase the number of arms. The evaluation shows that the additive terms are indeed non-negligible and that Dueling Bandit TEA outperforms all baseline algorithms when the number of arms is sufficiently large.

[Figure 3: Comparison of dueling bandit algorithms with identical gaps of 0.4. The results are averaged over 20 repetitions of the experiment.]
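As a sanity check on the experimental setup, the following two-line computation (ours) converts winning probabilities to utility gaps under the linear link of Section 2.2: a winning probability of 0.7 corresponds to the gap of 0.4 reported in Figure 3.

```python
def nu(x):
    # Linear link function from Section 2.2: nu(x) = (1 + x) / 2.
    return (1 + x) / 2

def gap_from_win_prob(p):
    # Inverse of nu: the utility gap that yields winning probability p.
    return 2 * p - 1

print(gap_from_win_prob(0.7))   # 0.4, the gap used in Figure 3
print(gap_from_win_prob(0.95))  # 0.9, the gap in the second experiment
```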
We presented an algorithm for playing stochastic factored bandits with uniformly identifiable actions and provided matching upper and lower bounds for the problem up to constant factors. Our algorithm and proofs might serve as a template to turn other elimination style algorithms into improved anytime algorithms. Factored bandits with uniformly identifiable actions generalize rank-1 bandits. We have also provided a unified framework for the analysis of factored bandits and utility-based dueling bandits. Furthermore, we improve the additive constants in the regret bound compared to state-of-the-art algorithms for utility-based dueling bandits. There are multiple potential directions for future research. One example mentioned in the text is the possibility of improving the regret bound when additional restrictions on the form of the reward function are introduced or improvements of the lower bound when algorithms are restricted in computational or memory complexity. Another example is the adversarial version of the problem. 2Smaller gaps show the same behavior but require more arms and more timesteps. [1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312 2320, 2011. [2] N. Ailon, Z. Karnin, and T. Joachims. Reducing dueling bandits to cardinal bandits. In International Conference on Machine Learning, pages 856 864, 2014. [3] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404 1422, 2012. [4] W. Chen, Y. Wang, Y. Yuan, and Q. Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research, 17(1):1746 1778, 2016. [5] R. Combes, S. Magureanu, and A. Proutiere. Minimal exploration in structured stochastic bandits. In Advances in Neural Information Processing Systems, pages 1761 1769, 2017. [6] S. Filippi, O. Cappe, A. Garivier, and C. Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586 594, 2010. [7] S. Katariya, B. Kveton, C. Szepesvári, C. Vernade, and Z. Wen. Stochastic rank-1 bandits (long version). In AISTATS, volume 54 of PMLR, pages 392 401, April 2017. [8] S. Katariya, B. Kveton, C. Szepesvári, C. Vernade, and Z. Wen. Bernoulli rank-1 bandits for click feedback. International Joint Conference on Artificial Intelligence, 2017. [9] J. Komiyama, J. Honda, H. Kashima, and H. Nakagawa. Regret lower bound and optimal algorithm in dueling bandit problem. In Conference on Learning Theory, pages 1141 1154, 2015. [10] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4 22, 1985. [11] T. Lattimore and C. Szepesvári. The end of optimism? An asymptotic analysis of finite-armed linear bandits (long version). In AISTATS, volume 54 of PMLR, pages 728 737, April 2017. [12] T. Urvoy, F. Clerot, R. Féraud, and S. Naamane. Generic exploration and k-armed voting bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 91 99, 2013. [13] H. Wu and X. Liu. Double thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pages 649 657, 2016. [14] Y. Yue and T. Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1201 1208. ACM, 2009. [15] Y. 
[16] Y. Yue, J. Broder, R. Kleinberg, and T. Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.

[17] M. Zoghi, S. Whiteson, R. Munos, and M. de Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. In International Conference on Machine Learning, pages 10–18, 2014.