# Leveraging Good Representations in Linear Contextual Bandits

Matteo Papini*¹, Andrea Tirinzoni*¹, Marcello Restelli¹, Alessandro Lazaric², Matteo Pirotta²

*Equal contribution. Work done while at Facebook AI Research. ¹Politecnico di Milano, Milan, Italy. ²Facebook AI Research, Paris, France. Correspondence to: Matteo Papini. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

The linear contextual bandit literature is mostly focused on the design of efficient learning algorithms for a given representation. However, a contextual bandit problem may admit multiple linear representations, each one with different characteristics that directly impact the regret of the learning algorithm. In particular, recent works showed that there exist "good" representations for which constant problem-dependent regret can be achieved. In this paper, we first provide a systematic analysis of the different definitions of good representations proposed in the literature. We then propose a novel selection algorithm able to adapt to the best representation in a set of $M$ candidates. We show that the regret is indeed never worse than the regret obtained by running LINUCB on the best representation (up to a $\ln M$ factor). As a result, our algorithm achieves constant regret whenever a good representation is available in the set. Furthermore, we show that the algorithm may still achieve constant regret by implicitly constructing a good representation, even when none of the initial representations is good. Finally, we empirically validate our theoretical findings in a number of standard contextual bandit problems.

1. Introduction

The stochastic contextual bandit is a general framework to formalize sequential decision-making problems in which at each step the learner observes a context drawn from a fixed distribution, plays an action, and receives a noisy reward. The goal of the learner is to maximize the reward accumulated over $n$ rounds, and the performance is typically measured by the regret w.r.t. playing the optimal action in each context. This paradigm has found application in a large range of domains, including recommendation systems, online advertising, and clinical trials (e.g., Bouneffouf & Rish, 2019).

Linear contextual bandit (Lattimore & Szepesvári, 2020) is one of the most studied instances of contextual bandit due to its efficiency and strong theoretical guarantees. In this setting, the reward for each context $x$ and action $a$ is assumed to be representable as the linear combination between $d$-dimensional features $\phi(x, a) \in \mathbb{R}^d$ and an unknown parameter $\theta^\star \in \mathbb{R}^d$. In this case, we refer to $\phi$ as a realizable representation. Algorithms based on the optimism-in-the-face-of-uncertainty principle, such as LINUCB (Chu et al., 2011) and OFUL (Abbasi-Yadkori et al., 2011), have been proved to achieve a minimax regret bound of $O\big(S d \sqrt{n}\,\ln(nL)\big)$ and a problem-dependent regret bound of $O\big(S^2 d^2 \ln^2(nL)/\Delta\big)$, where $\Delta$ is the minimum gap between the reward of the best and second-best action across contexts, and $L$ and $S$ are upper bounds to the $\ell_2$-norm of the features $\phi$ and of $\theta^\star$, respectively.

Unfortunately, the dimension $d$ and the norm upper bounds $L$ and $S$ are not the only characteristics of a representation to have an effect on the regret, and existing bounds may fail at capturing the impact of the context-action features on the performance of the algorithm. In fact, as illustrated in Fig. 1,
running LINUCB with different realizable representations with the same parameters $d$ and $S$ may lead to significantly different performance. Notably, there are good representations for which LINUCB achieves constant regret, i.e., regret not scaling with the horizon $n$. Recent works identified different conditions on the representation that can be exploited to achieve constant regret for LINUCB (Hao et al., 2020; Wu et al., 2020). Similar conditions have also been leveraged to prove other interesting learning properties, such as sublinear regret for greedy algorithms (Bastani et al., 2020), or regret guarantees for model selection between linear and multi-arm representations (Chatterji et al., 2020; Ghosh et al., 2020). While all these conditions, often referred to as diversity conditions, depend on how certain context-arm features span the full $\mathbb{R}^d$ space, there is no systematic analysis of their connections and of which ones can be leveraged to achieve constant regret in linear contextual bandits.

In this paper, we further investigate the concept of good representations in linear bandits and we provide the following contributions: 1) We review the diversity conditions available in the literature, clarify their relationships, and discuss how they are used. We then focus on our primary goal, which is to characterize the assumptions needed to achieve constant regret for LINUCB. 2) We introduce a novel algorithm that effectively selects the best representation in a given set, thus achieving constant regret whenever at least one good representation is provided. 3) Furthermore, we show that, in certain problems, the algorithm is able to combine the given representations to implicitly form a good one, thus achieving constant problem-dependent regret even when running LINUCB on any single representation would not. 4) Finally, we empirically validate our theoretical findings on a number of contextual bandit problems.

[Figure 1. Regret of LINUCB with different realizable representations with the same dimension $d$ and parameter bound $S$. The dashed blue line is LEADER, our proposed representation-selection algorithm. Details in App. G.1.]

Related work. The problem of selecting the best representation in a given set can be seen as a specific instance of the problem of model selection in bandits. In model selection, the objective is to choose the best candidate in a set of base learning algorithms. At each step, a master algorithm is responsible for selecting a base algorithm, which in turn prescribes the action to play, and the reward is then provided as feedback to the base algorithms. Examples of model selection methods include adversarial masters, e.g., EXP4 (Auer et al., 2002; Maillard & Munos, 2011) and CORRAL (Agarwal et al., 2017; Pacchiano et al., 2020b), and stochastic masters (Abbasi-Yadkori et al., 2020; Lee et al., 2020; Bibaut et al., 2020; Pacchiano et al., 2020a). For a broader discussion refer to App. A or (Pacchiano et al., 2020a, Sec. 2). Most of these algorithms achieve the regret of the best base algorithm up to a polynomial dependence on the number $M$ of base algorithms (Agarwal et al., 2017). While existing model selection methods are general and can be applied to any type of base algorithm,¹ they may not be effective in problems with a specific structure. An alternative approach is to design the master algorithm for a specific category of base algorithms. An instance of
this case is the representation-selection problem, where the base algorithms only differ by the representation used to estimate the reward. Foster et al. (2019) and Ghosh et al. (2020) consider a set of nested representations, where the best representation is the one with the smallest dimensionality for which the reward is realizable. Finally, Chatterji et al. (2020) focus on the problem of selecting between a linear and a multi-armed bandit representation. In this paper, we consider an alternative representation-selection problem in linear contextual bandits, where the objective is to exploit constant-regret good representations. Differently from our work, Lattimore et al. (2020) say that a linear representation is good if it has a low misspecification (i.e., it represents the reward up to a small approximation error), while we focus on realizable representations for which LINUCB achieves constant regret.

¹ Most of the existing methods only require prior knowledge of the regret of the optimal base algorithm or a bound on the regret of all base algorithms. CORRAL also requires the base algorithms to satisfy certain stability conditions.

2. Preliminaries

We consider the stochastic contextual bandit problem (contextual problem for short) with context space $\mathcal{X}$ and finite action set $\mathcal{A} = [K] = \{1, \ldots, K\}$. At each round $t \ge 1$, the learner observes a context $x_t$ sampled i.i.d. from a distribution $\rho$ over $\mathcal{X}$, it selects an arm $a_t \in [K]$, and it receives a reward $y_t = \mu(x_t, a_t) + \eta_t$, where $\eta_t$ is a $\sigma$-subgaussian noise. The learner's objective is to minimize the pseudo-regret $R_n = \sum_{t=1}^{n} \big(\mu^\star(x_t) - \mu(x_t, a_t)\big)$ for any $n > 0$, where $\mu^\star(x_t) := \max_{a \in [K]} \mu(x_t, a)$. We define the minimum gap as $\Delta = \inf_{x \in \mathcal{X} : \rho(x) > 0,\ a \in [K],\ \Delta(x,a) > 0} \{\Delta(x, a)\}$, where $\Delta(x, a) = \mu^\star(x) - \mu(x, a)$.

A realizable $d_\phi$-dimensional linear representation is a feature map $\phi : \mathcal{X} \times [K] \to \mathbb{R}^{d_\phi}$ for which there exists an unknown parameter vector $\theta^\star_\phi \in \mathbb{R}^{d_\phi}$ such that $\mu(x, a) = \langle \phi(x, a), \theta^\star_\phi \rangle$. When a realizable linear representation is available, the problem is called (stochastic) linear contextual bandit and can be solved using, among others, optimistic algorithms like LINUCB (Chu et al., 2011) or OFUL (Abbasi-Yadkori et al., 2011). Given a realizable representation $\phi$, at each round $t$, LINUCB builds an estimate $\theta_{t\phi}$ of $\theta^\star_\phi$ by ridge regression on the observed data. Denote by $V_{t\phi} = \lambda I_{d_\phi} + \sum_{k=1}^{t-1} \phi(x_k, a_k)\phi(x_k, a_k)^\mathsf{T}$ the ($\lambda > 0$)-regularized design matrix at round $t$; then $\theta_{t\phi} = V_{t\phi}^{-1} \sum_{k=1}^{t-1} \phi(x_k, a_k)\, y_k$. Assuming that $\|\theta^\star_\phi\|_2 \le S_\phi$ and $\sup_{x,a} \|\phi(x, a)\|_2 \le L_\phi$, LINUCB builds a confidence ellipsoid $C_{t\phi}(\delta) = \big\{\theta \in \mathbb{R}^{d_\phi} : \|\theta_{t\phi} - \theta\|_{V_{t\phi}} \le \beta_{t\phi}(\delta)\big\}$. As shown in (Abbasi-Yadkori et al., 2011, Thm. 1), when $\beta_{t\phi}(\delta) := \sigma \sqrt{2 \ln\frac{\det(V_{t\phi})^{1/2} \det(\lambda I_{d_\phi})^{-1/2}}{\delta}} + \sqrt{\lambda}\, S_\phi$, then $\mathbb{P}\big(\forall t \ge 1,\ \theta^\star_\phi \in C_{t\phi}(\delta)\big) \ge 1 - \delta$. At each step $t$, LINUCB plays the action with the highest upper-confidence bound, $a_t = \operatorname*{argmax}_{a \in [K]} \max_{\theta \in C_{t\phi}(\delta)} \langle \phi(x_t, a), \theta \rangle$, and it is shown to achieve a regret bounded as reported in the following proposition.

Proposition 1 (Abbasi-Yadkori et al., 2011, Thm. 3, 4). For any linear contextual bandit problem with $d_\phi$-dimensional features, $\sup_{x,a} \|\phi(x, a)\|_2 \le L_\phi$, and an unknown parameter vector with $\|\theta^\star_\phi\|_2 \le S_\phi$, with probability at least $1 - \delta$, LINUCB suffers regret $R_n = O\big(S_\phi d_\phi \sqrt{n}\,\ln(n L_\phi/\delta)\big)$. Furthermore, if the problem has a minimum gap $\Delta > 0$, then the regret is bounded as² $R_n = O\big(S_\phi^2 d_\phi^2 \ln^2(n L_\phi/\delta)/\Delta\big)$.

² The logarithmic bound reported in Prop. 1 is slightly different from the one in (Abbasi-Yadkori et al., 2011) since we do not assume that the optimal feature is unique.

In the rest of the paper, we assume w.l.o.g. that all terms $\lambda$, $\Delta_{\max} = \max_{x,a} \Delta(x, a)$, $S_\phi$, $\sigma$ are larger than 1 to simplify the expression of the bounds.
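To make the preliminaries concrete, the following is a minimal numpy sketch of the LINUCB/OFUL loop just described (ridge estimate, confidence radius $\beta_{t\phi}(\delta)$, and optimistic action selection). It is illustrative only: the class name, the interface, and the default constants are our own assumptions, while the confidence radius follows the expression of $\beta_{t\phi}(\delta)$ given above.

```python
import numpy as np

class LinUCB:
    """Minimal sketch of LINUCB/OFUL for a realizable representation phi."""

    def __init__(self, d, n_actions, S=1.0, sigma=0.3, lam=1.0, delta=0.01):
        self.d, self.K = d, n_actions
        self.S, self.sigma, self.lam, self.delta = S, sigma, lam, delta
        self.V = lam * np.eye(d)       # regularized design matrix V_t
        self.b = np.zeros(d)           # running sum of phi(x_k, a_k) * y_k
        self.theta = np.zeros(d)       # ridge estimate theta_t

    def beta(self):
        # Confidence radius beta_t(delta) as in Abbasi-Yadkori et al. (2011)
        logdet = np.linalg.slogdet(self.V)[1] - self.d * np.log(self.lam)
        return self.sigma * np.sqrt(2 * (0.5 * logdet + np.log(1.0 / self.delta))) \
               + np.sqrt(self.lam) * self.S

    def act(self, phi_x):
        # phi_x: (K, d) matrix of features phi(x_t, a) for each action a
        Vinv = np.linalg.inv(self.V)
        width = np.sqrt(np.einsum("ad,de,ae->a", phi_x, Vinv, phi_x))
        ucb = phi_x @ self.theta + self.beta() * width
        return int(np.argmax(ucb))     # optimistic action

    def update(self, phi_xa, y):
        self.V += np.outer(phi_xa, phi_xa)
        self.b += y * phi_xa
        self.theta = np.linalg.solve(self.V, self.b)
```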
3. Diversity Conditions

Several assumptions, usually referred to as diversity conditions, have been proposed to define linear bandit problems with specific properties that can be leveraged to derive improved learning results. While only a few of them were actually leveraged to derive constant regret guarantees for LINUCB (others have been used to prove, e.g., sub-linear regret for the greedy algorithm, or regret guarantees for model selection algorithms), they all rely on very similar conditions on how certain context-action features span the full $\mathbb{R}^{d_\phi}$ space. In this section, we provide a thorough review of these assumptions, their connections, and how they are used in the literature. As diversity conditions are getting more widely used in the bandit literature, we believe this review may be of independent interest. Sect. 4 will then specifically focus on the notion of good representation for LINUCB.

We first introduce additional notation. For a realizable representation $\phi$, let $\phi^\star(x) := \phi(x, a^\star_x)$, where $a^\star_x \in \operatorname*{argmax}_{a \in [K]} \mu(x, a)$ is an optimal action, be the vector of optimal features for context $x$. In the following we make the assumption that $\phi^\star(x)$ is unique. Also, let $\mathcal{X}^\star(a) = \{x \in \mathcal{X} : \mu(x, a) = \mu^\star(x)\}$ denote the set of contexts where $a$ is optimal. Finally, for any matrix $A$, we denote by $\lambda_{\min}(A)$ its minimum eigenvalue.

For any contextual problem with reward $\mu$ and context distribution $\rho$, the diversity conditions introduced in the literature are summarized in Tab. 2, together with how they were leveraged to obtain regret bounds in different settings.³ We first notice that all conditions refer to the smallest eigenvalue of a design matrix constructed on specific context-action features. In other words, diversity conditions require certain features to span the full $\mathbb{R}^{d_\phi}$ space. The non-redundancy condition is a common technical assumption (e.g., Foster et al., 2019) and it simply defines a problem whose dimensionality cannot be reduced without losing information. Assuming the context distribution $\rho$ is full support, BBK and CMB are structural properties of the representation that are independent from the reward. For example, BBK requires that, for each action, there must be feature vectors lying in all orthants of $\mathbb{R}^{d_\phi}$; in the case of finite contexts, this implies there must be at least $2^{d_\phi}$ contexts. WYS and HLS involve the notion of reward optimality. In particular, WYS requires that all actions are optimal for at least a context (in the continuous case, for a non-negligible set of contexts), while HLS only focuses on optimal actions.

³ In some cases, we adapted conditions originally defined in the disjoint-parameter setting, where features only depend on the context (i.e., $\phi(x)$) and the unknown parameter $\theta^\star_a$ is different for each action $a$, to the shared-parameter setting (i.e., where features are functions of both contexts and actions) introduced in Sect. 2.

We now review how these conditions (or variations thereof) were applied in the literature. CMB is a rather strong condition that requires the features associated with each individual action to span the whole $\mathbb{R}^{d_\phi}$ space. Chatterji et al. (2020) leverage a CMB-like assumption to prove regret bounds for OSOM, a model-selection algorithm that unifies multi-armed and linear contextual bandits.
More precisely, they consider a variation of CMB, where the context distribution induces stochastic feature vectors for each action that are independent and centered. The same condition was adopted by Ghosh et al. (2020) to study representation-selection problems and derive algorithms able to adapt to the (unknown) norm of $\theta^\star_\phi$ or select the smallest realizable representation in a set of nested representations. Bastani et al. (2020, Assumption 3) introduced a condition similar to BBK for the disjoint-parameter case. In their setting, they prove that a non-explorative greedy algorithm achieves $O(\ln(n))$ problem-dependent regret in linear contextual bandits (with 2 actions).⁴ Hao et al. (2020, Theorem 3.9) showed that HLS representations can be leveraged to prove constant problem-dependent regret for LINUCB in the shared-parameter case. Concurrently, Wu et al. (2020) showed that, under WYS, LINUCB achieves constant expected regret in the disjoint-parameter case. A WYS-like condition was also used by Bastani et al. (2020, Assumption 4) to extend the result of sublinear regret for the greedy algorithm to more than two actions.

⁴ Whether this is enough for the optimality of the greedy algorithm in the shared-parameter setting is an interesting problem, but it is beyond the scope of this paper.

Figure 2. Diversity conditions proposed in the literature adapted to the shared-parameter setting. The names refer to the authors who first introduced similar conditions.

| Name | Definition | Application |
| --- | --- | --- |
| Non-redundant | $\lambda_{\min}\big(\tfrac{1}{K}\sum_{a \in [K]} \mathbb{E}_{x \sim \rho}[\phi(x,a)\phi(x,a)^\mathsf{T}]\big) > 0$ | |
| CMB | $\forall a,\ \lambda_{\min}\big(\mathbb{E}_{x \sim \rho}[\phi(x,a)\phi(x,a)^\mathsf{T}]\big) > 0$ | Model selection |
| BBK | $\forall a,\ \forall u \in \mathbb{R}^d,\ \lambda_{\min}\big(\mathbb{E}_{x}[\phi(x,a)\phi(x,a)^\mathsf{T}\,\mathbf{1}\{\phi(x,a)^\mathsf{T}u \ge 0\}]\big) > 0$ | Logarithmic regret for greedy |
| HLS | $\lambda_{\min}\big(\mathbb{E}_{x \sim \rho}[\phi^\star(x)\phi^\star(x)^\mathsf{T}]\big) > 0$ | Constant regret for LINUCB |
| WYS | $\forall a,\ \lambda_{\min}\big(\mathbb{E}_{x \sim \rho}[\phi(x,a)\phi(x,a)^\mathsf{T}\,\mathbf{1}\{x \in \mathcal{X}^\star(a)\}]\big) > 0$ | Constant regret for LINUCB |

[Figure 3. Categorization of diversity conditions.]

The relationship between all these conditions is derived in the following lemma.

Lemma 1. For any contextual problem with reward $\mu$ and context distribution $\rho$, let $\phi$ be a realizable linear representation. The relationship between the diversity conditions in Tab. 2 is summarized in Fig. 3, where each inclusion is in a strict sense and each intersection is non-empty.

This lemma reveals non-trivial connections between the diversity conditions, better understood through the examples provided in the proof (see App. B.1). BBK is indeed stronger than CMB, and thus it is sufficient for the model selection results by Chatterji et al. (2020). By superficially examining their definitions, CMB may appear stronger than HLS, but the two properties are actually non-comparable, as there are representations that satisfy one condition but not the other. The implications of Fig. 3 on constant-regret guarantees are particularly relevant for our purposes. There are representations that satisfy BBK or CMB and are neither HLS nor WYS, and thus may not enable constant regret for LINUCB. We notice that WYS is a stronger condition than HLS. Although WYS may be necessary for LINUCB to achieve constant regret in the disjoint-parameter case, HLS is sufficient for the shared-parameter case we consider in this paper. For this reason, in the following section we adopt HLS to define good representations for LINUCB and provide a more complete characterization.
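For finite problems, the conditions in Tab. 2 can be checked numerically from the features, the reward matrix, and the context distribution. The sketch below (not from the paper; input conventions and the tolerance are our assumptions) estimates the relevant design matrices and their smallest eigenvalues; BBK is omitted since it quantifies over all directions $u \in \mathbb{R}^d$.

```python
import numpy as np

def min_eig(M):
    return float(np.linalg.eigvalsh(M).min())

def diversity_report(phi, mu, rho, tol=1e-9):
    """phi: (X, K, d) features, mu: (X, K) rewards, rho: (X,) context probabilities.
    Returns the smallest eigenvalues appearing in the definitions of Tab. 2."""
    X, K, d = phi.shape
    opt = mu.argmax(axis=1)                       # one optimal action per context
    # HLS: E_rho[ phi*(x) phi*(x)^T ]
    phi_star = phi[np.arange(X), opt]             # (X, d) optimal features
    hls = min_eig((rho[:, None, None] *
                   np.einsum("xi,xj->xij", phi_star, phi_star)).sum(0))
    # CMB: for every action a, E_rho[ phi(x,a) phi(x,a)^T ]
    cmb = min(min_eig((rho[:, None, None] *
                       np.einsum("xi,xj->xij", phi[:, a], phi[:, a])).sum(0))
              for a in range(K))
    # WYS: same as CMB but restricted to contexts where action a is optimal
    gaps = mu.max(axis=1, keepdims=True) - mu
    wys = min(min_eig((rho[:, None, None] * (gaps[:, a] < tol)[:, None, None] *
                       np.einsum("xi,xj->xij", phi[:, a], phi[:, a])).sum(0))
              for a in range(K))
    # Non-redundancy: design matrix averaged over actions
    nonred = min_eig((rho[:, None, None] *
                      np.einsum("xai,xaj->xij", phi, phi)).sum(0) / K)
    return {"non-redundant": nonred, "CMB": cmb, "HLS": hls, "WYS": wys}
```

A condition holds whenever the corresponding eigenvalue is strictly positive (up to numerical tolerance).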
4. Good Representations for Constant Regret

The HLS condition was introduced by Hao et al. (2020), who provided a first analysis of its properties. In this section, we complement those results by providing a complete proof of a constant regret bound, a proof of the fact that HLS is actually necessary for constant regret, and a novel characterization of the existence of HLS representations. In the following we define $\lambda_{\phi,\mathrm{HLS}} := \lambda_{\min}\big(\mathbb{E}_{x \sim \rho}[\phi^\star(x)\phi^\star(x)^\mathsf{T}]\big)$, which is strictly positive for HLS representations.

4.1. Constant Regret Bound

We begin by deriving a constant problem-dependent regret bound for LINUCB under the HLS condition.

Lemma 2. Consider a contextual bandit problem with realizable linear representation $\phi$ satisfying the HLS condition (see Tab. 2). Assume $\Delta > 0$, $\max_{x,a} \|\phi(x, a)\|_2 \le L_\phi$ and $\|\theta^\star_\phi\|_2 \le S_\phi$. Then, with probability at least $1 - 2\delta$, the regret of OFUL after $n \ge 1$ steps is at most
$$R_n \le \frac{32\lambda \Delta_{\max}^2 S_\phi^2 \sigma^2}{\Delta^2}\left(\ln\frac{1}{\delta} + d_\phi \ln\left(1 + \frac{\tau_\phi L_\phi^2}{\lambda d_\phi}\right)\right),$$
where $\Delta_{\max} = \max_{x,a} \Delta(x, a)$ is the maximum gap and
$$\tau_\phi \le \max\left\{\frac{384^2\, d_\phi^2 L_\phi^2 S_\phi^2 \sigma^2 \lambda}{\Delta^2\, \lambda_{\phi,\mathrm{HLS}}^2}\, \ln^2\!\left(\frac{64\, d_\phi^2 L_\phi^3 \sigma S_\phi}{\Delta\,\delta\,\lambda_{\phi,\mathrm{HLS}}}\right),\ \frac{768\, L_\phi^4}{\lambda_{\phi,\mathrm{HLS}}^2}\, \ln\!\left(\frac{512\, d_\phi L_\phi^4}{\delta\,\lambda_{\phi,\mathrm{HLS}}^2}\right)\right\}.$$

We first notice that $\tau_\phi$ is independent from the horizon $n$, thus making the previous bound a constant depending only on the problem formulation (i.e., gap $\Delta$, norms $L_\phi$ and $S_\phi$) and on the value $\lambda_{\phi,\mathrm{HLS}}$, which measures how much the representation $\phi$ satisfies the HLS condition. Furthermore, one can always take the minimum between the constant regret bound in Lem. 2 and any other valid regret bound for OFUL (e.g., $O(\log(n)/\Delta)$), which may be tighter for small values of $n$. While Lem. 2 provides high-probability guarantees, we can easily derive a constant expected-regret bound by running LINUCB with a decreasing schedule for $\delta$ (e.g., $\delta_t \propto 1/t^3$) and with a slightly different proof (see App. C and the proof sketch below).

Proof sketch (full proof in App. C). Following Hao et al. (2020), the idea is to show that the instantaneous regret $r_{t+1} = \langle \theta^\star, \phi^\star(x_{t+1}) - \phi(x_{t+1}, a_{t+1}) \rangle$ is zero for sufficiently large (but constant) time $t$. By the standard regret analysis, we have
$$r_{t+1} \le 2\beta_{t+1}(\delta)\, \|\phi(x_{t+1}, a_{t+1})\|_{V_{t+1}^{-1}} \le \frac{2L\beta_{t+1}(\delta)}{\sqrt{\lambda_{\min}(V_{t+1})}}.$$
Given the minimum-gap assumption, a sufficient condition for $r_{t+1} = 0$ is that the previous upper bound is smaller than $\Delta$, which gives $\lambda_{\min}(V_{t+1}) > 4L^2\beta_{t+1}^2(\delta)/\Delta^2$. Since $\Delta > 0$, the problem-dependent regret bound in Prop. 1 holds, and the number of pulls to suboptimal arms up to time $t$ is bounded by $g_t(\delta) = O\big((d \ln(t/\delta)/\Delta)^2\big)$. Hence, the optimal arms are pulled linearly often and, by leveraging the HLS assumption, we are able to show that the minimum eigenvalue of the design matrix grows linearly in time as
$$\lambda_{\min}(V_{t+1}) \ge \lambda + t\lambda_{\mathrm{HLS}} - 8L^2\sqrt{t \ln(2 d_\phi t/\delta)}.$$
By relating the last two equations, we obtain an inequality of the form $t\lambda_{\mathrm{HLS}} - o(t) > o(t)$. If we define $\tau < \infty$ as the smallest (deterministic) time such that this inequality holds, we have that after $\tau$ the immediate regret is zero, thus concluding the proof. Note that, if we wanted to bound the expected regret, we could set $\delta_t \propto 1/t^3$ and the above inequality would still be of the same form (although the resulting $\tau$ would be slightly different).

Comparison with existing bounds. Hao et al. (2020, Theorem 3.9) prove that LINUCB with HLS representations achieves $\limsup_{n\to\infty} R_n < \infty$, without characterizing the time at which the regret vanishes. Instead, our Lem. 2 provides an explicit problem-dependent constant regret bound. Wu et al. (2020, Theorem 2) consider the disjoint-parameter setting and rely on the WYS condition. While they indeed prove a constant regret result, their bound depends on the minimum probability of observing a context (or, in the continuous case, a properly defined meta-context). This reflects the general tendency, in previous works, to frame diversity conditions simply as a property of the context distribution $\rho$. On the other hand, our characterization of $\tau$ in terms of $\lambda_{\phi,\mathrm{HLS}}$ (Lem. 2) allows relating the regret to the goodness of the representation $\phi$ for the problem at hand.
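The mechanism behind Lem. 2 can also be observed empirically: under HLS, the smallest eigenvalue of the design matrix grows linearly, so the confidence width eventually drops below $\Delta$ and the instantaneous regret vanishes. The following is a minimal simulation sketch on a finite problem, assuming the hypothetical `LinUCB` class from the Sec. 2 snippet is in scope; the function name and constants are our own.

```python
import numpy as np

def run_hls_demo(phi, theta_star, rho, n=20000, sigma=0.3, seed=0):
    """Track lambda_min(V_t) and cumulative pseudo-regret of LinUCB.
    phi: (X, K, d) realizable features with mu(x, a) = <phi(x, a), theta_star>."""
    rng = np.random.default_rng(seed)
    X, K, d = phi.shape
    mu = phi @ theta_star                     # (X, K) expected rewards
    agent = LinUCB(d, K, sigma=sigma)         # sketch class from the Preliminaries snippet
    regret, min_eigs, cum = [], [], 0.0
    for t in range(n):
        x = rng.choice(X, p=rho)
        a = agent.act(phi[x])
        y = mu[x, a] + sigma * rng.standard_normal()
        agent.update(phi[x, a], y)
        cum += mu[x].max() - mu[x, a]
        regret.append(cum)
        min_eigs.append(np.linalg.eigvalsh(agent.V).min())
    return np.array(regret), np.array(min_eigs)

# Under an HLS representation, min_eigs grows roughly as t * lambda_{phi,HLS} and the
# regret curve flattens after a problem-dependent time; with a non-HLS representation
# the regret keeps growing logarithmically with n.
```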
4.2. Removing the Minimum-Gap Assumption

Constant-regret bounds for LINUCB rely on a minimum-gap assumption ($\Delta > 0$). In this section we show that LINUCB can still benefit from HLS representations when $\Delta = 0$, provided a margin condition holds (e.g., Rigollet & Zeevi, 2010; Reeve et al., 2018). Intuitively, we require that the probability of observing a context $x$ whose minimum gap $\Delta(x) = \min_a \Delta(x, a)$ is small decays polynomially with the gap itself.

Assumption 1 (Margin condition). There exist $C, \alpha > 2$ such that for all $\epsilon > 0$: $\rho\big(\{x \in \mathcal{X} : \Delta(x) \le \epsilon\}\big) \le C\epsilon^\alpha$.

The following theorem provides a problem-dependent regret bound for LINUCB under this margin assumption.

Theorem 1. Consider a linear contextual bandit problem satisfying the margin condition (Asm. 1). Assume $\max_{x,a} \|\phi(x, a)\|_2 \le L_\phi$ and $\|\theta^\star_\phi\|_2 \le S_\phi$. Then, given a representation $\phi$, with probability at least $1 - 3\delta$, the regret of OFUL after $n \ge 1$ steps is at most
$$R_n \le O\Big(\lambda(\Delta_{\max} S_\phi \sigma d_\phi)^2\, n^{1/\alpha} + \sqrt{C}\, d_\phi \ln^2(L_\phi n/\delta)\Big).$$
When $\phi$ is HLS ($\lambda_{\phi,\mathrm{HLS}} > 0$), let $\tau_\phi \propto (\lambda_{\phi,\mathrm{HLS}})^{\frac{\alpha}{2-\alpha}}$; then
$$R_n \le O\Big(\Delta_{\max}\tau_\phi + \sqrt{C}\, d_\phi \ln^2(L_\phi n/\delta)\Big).$$

We first notice that, in general, LINUCB suffers $\widetilde{O}(n^{1/\alpha})$ regret, which can be significantly larger than in the minimum-gap case. On the other hand, with HLS representations, LINUCB achieves logarithmic regret, regardless of the value of $\alpha$. The intuition is that, when the HLS condition holds, the algorithm collects sufficient information about $\theta^\star_\phi$ by pulling the optimal arms in rounds with large minimum gap, which occur with high probability by the margin condition. This yields at most constant regret in such rounds (first term above), while it can be shown that the regret in steps where the minimum gap is very small is at most logarithmic (second term above).

4.3. Further Analysis of the HLS Condition

While Lem. 2 shows that HLS is sufficient for achieving constant regret, the following proposition shows that it is also necessary. While this property was first mentioned by Hao et al. (2020) as a remark in a footnote, we provide a formal proof in App. C.5.

Proposition 2. For any contextual problem with finite contexts, full-support context distribution, and given a non-redundant realizable representation $\phi$, LINUCB achieves sub-logarithmic regret if and only if $\phi$ satisfies the HLS condition.

As already observed in Section 4, the HLS condition can be equivalently expressed as⁵ $\operatorname{span}\{\phi^\star(x) \mid x \in \mathcal{X}\} = \mathbb{R}^d$, i.e., optimal features must span the whole $d$-dimensional Euclidean space, where $d$ is the dimension of $\phi$. If we admit redundant representations, a weaker condition is sufficient to achieve constant regret: $\operatorname{span}\{\phi^\star(x) \mid x \in \mathcal{X}\} = \operatorname{span}\{\phi(x, a) \mid x \in \mathcal{X}, a \in \mathcal{A}\}$, i.e., optimal features must span the whole feature space, which may be a subspace of $\mathbb{R}^d$ in general. We prove that this weak HLS condition is sufficient for LINUCB to achieve constant regret as Corollary 1 in App. E.2. This also shows that constant-regret guarantees are preserved by adding redundant features to an HLS representation.

⁵ That is assuming the context distribution is full-support. Otherwise, it is enough to replace $\mathcal{X}$ with the support of the context distribution $\operatorname{supp}(\rho)$.

Finally, we derive the following important existence result.
Lemma 3. For any contextual bandit problem with optimal reward⁶ $\mu^\star(x) \neq 0$ for all $x \in \mathcal{X}$ that has either i) a finite context set with at least $d$ contexts with nonzero probability, or ii) a Borel context space and a non-degenerate context distribution⁷, for any dimension $d \ge 1$, there exists an infinite number of $d$-dimensional realizable HLS representations.

⁶ This condition is technical and it can be easily relaxed.

This result crucially shows that the HLS condition is "robust", since in any contextual problem it is possible to construct an infinite number of representations satisfying the HLS condition. In App. B.2, we indeed provide an oracle procedure for constructing an HLS representation. This result also supports the starting point of the next section, where we assume that a learner is provided with a set of representations that may contain at least one good representation, i.e., an HLS representation.

5. Representation Selection

In this section, we study the problem of representation selection in linear bandits. We consider a linear contextual problem with reward $\mu$ and context distribution $\rho$. Given a set of $M$ realizable linear representations $\{\phi_i : \mathcal{X} \times [K] \to \mathbb{R}^{d_i}\}_{i \in [M]}$, the objective is to design a learning algorithm able to perform as well as the best representation, and thus achieve constant regret when a good representation is available. As usual, we assume $\theta^\star_i \in \mathbb{R}^{d_i}$ is unknown, but the algorithm is provided with a bound on the parameter and feature norms of the different representations.

5.1. The LEADER Algorithm

We introduce LEADER (Linear rEpresentation bAnDit mixER), see Alg. 1. At each round $t$, LEADER builds an estimate $\theta_{ti}$ of the unknown parameter $\theta^\star_i$ of each representation $\phi_i$.⁸ These estimates are by nature off-policy, and thus all the samples $(x_l, a_l, y_l)_{l < t}$ can be used to update every representation. For each representation $i$, LEADER computes an optimistic index $U_{ti}(x, a) = \langle \phi_i(x, a), \theta_{ti} \rangle + \beta_{ti}(\delta)\, \|\phi_i(x, a)\|_{V_{ti}^{-1}}$ and plays the action whose smallest index across representations is largest,
$$a_t \in \operatorname*{argmax}_{a \in [K]} \min_{i \in [M]} U_{ti}(x_t, a). \qquad (2)$$
Throughout this section we assume that a minimum gap exists (i.e., $\Delta > 0$). The analysis can be generalized to $\Delta = 0$ as done in Sec. 4.2.

Algorithm 1 LEADER
Input: representations $(\phi_i)_{i \in [M]}$ with values $(L_i, S_i)_{i \in [M]}$, regularization factor $\lambda \ge 1$, confidence level $\delta \in (0, 1)$.
Initialize $V_{1i} = \lambda I_{d_i}$, $\theta_{1i} = 0_{d_i}$ for each $i \in [M]$.
for $t = 1, 2, \ldots$ do
&nbsp;&nbsp;Observe context $x_t$.
&nbsp;&nbsp;Pull action $a_t \in \operatorname*{argmax}_{a \in [K]} \min_{i \in [M]} \{U_{ti}(x_t, a)\}$.
&nbsp;&nbsp;Observe reward $r_t$ and, for each $i \in [M]$, set $V_{t+1,i} = V_{ti} + \phi_i(x_t, a_t)\phi_i(x_t, a_t)^\mathsf{T}$ and $\theta_{t+1,i} = V_{t+1,i}^{-1} \sum_{l=1}^{t} \phi_i(x_l, a_l)\, r_l$.
end for

Thm. 2 establishes the regret guarantee of LEADER (Alg. 1).

Theorem 2. Consider a contextual bandit problem with reward $\mu$, context distribution $\rho$, and $\Delta > 0$. Let $(\phi_i)_{i \in [M]}$ be a set of $M$ linearly realizable representations such that $\max_{x,a} \|\phi_i(x, a)\|_2 \le L_i$ and $\|\theta^\star_i\|_2 \le S_i$. Then, for any $n \ge 1$, with probability at least $1 - 2\delta$, LEADER suffers a regret
$$R_n \le \min_{i \in [M]} \left\{ \frac{32\lambda \Delta_{\max}^2 S_i^2 \sigma^2}{\Delta^2} \left( \ln\frac{M}{\delta} + d_i \ln\left(1 + \frac{\min\{\tau_i, n\}\, L_i^2}{\lambda d_i}\right) \right) \right\},$$
where $\tau_i \propto (\lambda_{i,\mathrm{HLS}})^{-2}$ if $\phi_i$ is HLS and $\tau_i = +\infty$ otherwise.

This shows that the problem-dependent regret bound of LEADER is not worse than the one of the best representation (see Prop. 1), up to a $\ln M$ factor. This means that the cost of representation selection is almost negligible. Furthermore, Thm. 2 shows that LEADER not only achieves a constant regret bound when an HLS representation is available, but this bound scales as the one of the best HLS representation. In fact, notice that the quality of an HLS representation does not depend only on known quantities such as $d_i$, $L_i$, $S_i$, but crucially on the HLS eigenvalue $\lambda_{i,\mathrm{HLS}}$, which is usually not known in advance, as it depends on the features of the optimal arms.
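The selection rule of Alg. 1 is simple to implement. Below is a minimal sketch that reuses the hypothetical `LinUCB` class from the Sec. 2 snippet as the per-representation off-policy estimator; the class name, the $\delta/M$ split across representations (a simple union bound), and the interface are our own assumptions rather than the paper's reference implementation.

```python
import numpy as np

class Leader:
    """Sketch of LEADER (Alg. 1): one off-policy ridge estimator per representation,
    action chosen by maximizing the smallest optimistic index across representations."""

    def __init__(self, dims, n_actions, sigma=0.3, lam=1.0, delta=0.01):
        self.K = n_actions
        # One LinUCB-style estimator per representation (class from the Sec. 2 sketch);
        # delta is split across the M representations as a simple union bound (assumption).
        self.base = [LinUCB(d, n_actions, sigma=sigma, lam=lam, delta=delta / len(dims))
                     for d in dims]

    def act(self, phis_x):
        # phis_x: list of M feature matrices; phis_x[i] has shape (K, d_i) for context x_t
        ucb = np.full(self.K, np.inf)
        for est, phi_x in zip(self.base, phis_x):
            Vinv = np.linalg.inv(est.V)
            width = est.beta() * np.sqrt(np.einsum("ad,de,ae->a", phi_x, Vinv, phi_x))
            # Eq. 2: for each action keep the tightest (smallest) upper confidence bound
            ucb = np.minimum(ucb, phi_x @ est.theta + width)
        return int(np.argmax(ucb))

    def update(self, phis_xa, y):
        # Every representation is updated with the same (off-policy) sample
        for est, phi_xa in zip(self.base, phis_xa):
            est.update(phi_xa, y)
```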
5.2. Combining Representations

In the previous section, we have shown that LEADER can perform as well as the best representation in the set. However, by inspecting the action selection rule (Eq. 2), we notice that, to evaluate the reward of an action in the current context, LEADER selects the representation with the smallest uncertainty, thus potentially using different representations for different context-action pairs. This leads to the question: can LEADER do better than the best representation in the set?

We show that, in certain cases, LEADER is able to combine representations and achieve constant regret when none of the individual representations would. The intuition is that a subset of locally good representations can be combined to recover a condition similar to HLS. This property is formally stated in the following definition.

Definition 1 (Mixing HLS). Consider a linear contextual problem with reward $\mu$ and context distribution $\rho$, and a set of $M$ realizable linear representations $\phi_1, \ldots, \phi_M$. Define $M_i = \mathbb{E}_{x \sim \rho}\big[\phi_i^\star(x)\phi_i^\star(x)^\mathsf{T}\big]$ and let $Z_i = \{(x, a) \in \mathcal{X} \times \mathcal{A} \mid \phi_i(x, a) \in \operatorname{Im}(M_i)\}$ be the set of context-action pairs whose features belong to the column space of $M_i$, i.e., that lie in the span of optimal features. We say that the set $(\phi_i)$ satisfies the mixed-HLS condition if $\mathcal{X} \times \mathcal{A} \subseteq \bigcup_{i=1}^{M} Z_i$. Let $\lambda_i^+ = \lambda_{\min}^+(M_i)$ be the minimum nonzero eigenvalue of $M_i$.

Intuitively, the previous condition relies on the observation that every representation satisfies a restricted HLS condition on the context-action pairs $(x, a)$ whose features $\phi_i(x, a)$ are spanned by the optimal features $\phi_i^\star(x)$. In this case, the characterizing eigenvalue is $\lambda_i^+$, instead of the smallest eigenvalue $\lambda_{i,\mathrm{HLS}}$ (which may be zero). If every context-action pair is in the restriction $Z_i$ of some representation, we have the mixed-HLS property. In particular, if representation $i$ is HLS, $\lambda_i^+ = \lambda_{i,\mathrm{HLS}}$ and $Z_i = \mathcal{X} \times \mathcal{A}$. So, HLS is a special case of mixed-HLS. In App. E.2, we provide simple examples of sets of representations satisfying Def. 1. Note that, strictly speaking, there is not a single mixed representation solving the whole problem. Even defining one would be problematic, since each representation may have a different parameter and even a different dimension. Instead, each representation specializes on a different portion of the context-action space. If together they cover the whole space, the benefits of HLS are recovered, as illustrated in the following theorem.

Theorem 3. Consider a stochastic bandit problem with reward $\mu$, context distribution $\rho$, and $\Delta > 0$. Let $(\phi_i)_{i \in [M]}$ be a set of $M$ realizable linear representations satisfying the mixed-HLS property in Def. 1. Then, with probability at least $1 - 2\delta$, there exists a time $\tau < \infty$ independent from $n$ such that, for any $n \ge 1$, the pseudo-regret of LEADER is
$$R_n \le \min_{i \in [M]} \left\{ \frac{32\lambda \Delta_{\max}^2 S_i^2 \sigma^2}{\Delta^2} \left( \ln\frac{M}{\delta} + d_i \ln\left(1 + \frac{\tau L_i^2}{\lambda d_i}\right) \right) \right\}.$$

First, note that we are still scaling with the characteristics of the best representation in the set (i.e., $d_i$, $L_i$ and $S_i$). However, the time $\tau$ to constant regret is a global value rather than being different for each representation. This highlights that mixed-HLS is a global property of the set of representations rather than an individual one as before. In particular, whenever no representation is (globally) HLS (i.e., $\lambda_{i,\mathrm{HLS}} = 0$ for all $\phi_i$), we can show that in the worst case $\tau$ scales as $(\min_i \lambda_i^+)^{-2}$. In practice, we may expect LEADER to behave even better than that since i) not all the representations may contribute actively to the mixed-HLS condition; and ii) multiple representations may cover the same region of the context-action space. In the latter case, since LEADER leverages all the representations at once, its regret would rather scale with the largest minimum nonzero eigenvalue $\lambda_i^+$ among all the representations covering such a region. We refer to App. E.2 for a more complete discussion.
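On finite problems, Def. 1 can also be verified numerically: compute each $M_i$, project every feature vector onto the column space of $M_i$, and check that each context-action pair is covered by at least one representation. A rough sketch under the same finite-problem conventions as the earlier snippets (the tolerance and input format are our assumptions):

```python
import numpy as np

def mixed_hls_check(phis, mu, rho, tol=1e-8):
    """phis: list of M feature tensors; phis[i] has shape (X, K, d_i).
    Returns (mixed_hls, lambda_plus): mixed_hls is True iff every (x, a) pair belongs to
    some Z_i of Def. 1; lambda_plus[i] is the smallest nonzero eigenvalue of M_i."""
    X, K = mu.shape
    opt = mu.argmax(axis=1)
    covered = np.zeros((X, K), dtype=bool)
    lambda_plus = []
    for phi in phis:
        phi_star = phi[np.arange(X), opt]                   # optimal features, shape (X, d_i)
        M_i = (rho[:, None, None] *
               np.einsum("xi,xj->xij", phi_star, phi_star)).sum(0)
        w, U = np.linalg.eigh(M_i)
        nonzero = w > tol
        lambda_plus.append(float(w[nonzero].min()) if nonzero.any() else 0.0)
        P = U[:, nonzero] @ U[:, nonzero].T                 # projector onto Im(M_i)
        residual = phi - np.einsum("ij,xaj->xai", P, phi)   # component outside Im(M_i)
        covered |= np.linalg.norm(residual, axis=-1) < tol  # (x, a) belongs to Z_i
    return bool(covered.all()), lambda_plus
```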
5.3. Discussion

Most of the model selection algorithms reviewed in the introduction could be readily applied to select the best representation for LINUCB. However, the generality of their objective comes with several shortcomings when instantiated in our specific problem (see App. A for a more detailed comparison). First, model selection methods achieve the performance of the best algorithm up to a polynomial dependence on the number $M$ of models. This already makes them a weaker choice compared to LEADER, which, by leveraging the specific structure of the problem, suffers only a logarithmic dependence on $M$. Second, model selection algorithms are often studied in a worst-case analysis, which reveals a high cost for adaptation. For instance, corralling algorithms (Agarwal et al., 2017; Pacchiano et al., 2020b) pay an extra $\sqrt{n}$ regret, which would make them unsuitable to target the constant regret of good representations. Similar costs are common to other approaches (Abbasi-Yadkori et al., 2020; Pacchiano et al., 2020a). It is unclear whether a problem-dependent analysis can be carried out and whether this could shave off such a dependence. Third, these algorithms are generally designed to adapt to a specific best base algorithm. To the best of our knowledge, there is no evidence that model selection methods could combine algorithms to achieve better performance than the best candidate, a behavior that we proved for LEADER in our setting. On the other hand, model selection algorithms effectively deal with non-realizable representations in certain cases (e.g., Foster et al., 2020; Abbasi-Yadkori et al., 2020; Pacchiano et al., 2020a), while LEADER is limited to the realizable case. While a complete study of the model misspecification case is beyond the scope of this paper, in App. F we discuss how a variation of the approach presented in (Agarwal et al., 2012b) could be paired with LEADER to discard misspecified representations and possibly recover the properties of good representations.

6. Experiments

In this section, we report experimental results on two synthetic problems and one dataset-based problem. For each problem, we evaluate the behavior of LEADER against LINUCB and model selection algorithms: EXP4.IX (Neu, 2015), CORRAL and EXP3.P in the stochastic version by Pacchiano et al. (2020b), and Regret Balancing with and without elimination (REGBALELIM and REGBAL) (Abbasi-Yadkori et al., 2020; Pacchiano et al., 2020a). See App. G for a detailed discussion and additional experiments. All results are averaged over 20 independent runs, with shaded areas corresponding to 2 standard deviations. We always set the parameters to $\lambda = 1$, $\delta = 0.01$, and $\sigma = 0.3$. All the representations we consider are normalized to have $\|\theta^\star_i\|_2 = 1$.

Synthetic Problems. We define a randomly-generated contextual bandit problem, for which we construct sets of realizable linear representations with different properties (see App. G.1 for details; a purely illustrative sketch of one such construction is given below). The purpose of these experiments is twofold: to show the different behavior of LINUCB with different representations, and to evaluate the ability of LEADER to select and mix representations.
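The exact construction used in the experiments is in App. G.1, which is not reproduced here. As a purely illustrative stand-in, the sketch below builds realizable representations of the same finite reward matrix by appending free coordinates and applying a random change of basis, which preserves realizability; whether the result is HLS is controlled by whether the optimal features are allowed to span the full space. All names and design choices are our assumptions.

```python
import numpy as np

def make_representation(mu, d, rng, hls=False):
    """Build one d-dimensional (d >= 2) realizable representation of a reward matrix
    mu of shape (X, K). The first coordinate carries the reward; the remaining d - 1
    coordinates are free. With hls=False they are zeroed on optimal arms, a simple way
    to keep optimal features in a low-dimensional subspace and break HLS."""
    X, K = mu.shape
    opt = mu.argmax(axis=1)
    while True:
        extra = rng.standard_normal((X, K, d - 1))
        if not hls:
            extra[np.arange(X), opt] = 0.0
        phi = np.concatenate([mu[..., None], extra], axis=-1)
        theta = np.zeros(d)
        theta[0] = 1.0
        # Random invertible change of basis: phi -> A phi, theta -> A^{-T} theta,
        # which keeps <phi(x, a), theta> = mu(x, a).
        A = rng.standard_normal((d, d))
        phi = phi @ A.T
        theta = np.linalg.solve(A.T, theta)
        # Normalize so that ||theta||_2 = 1, as in the experiments.
        phi *= np.linalg.norm(theta)
        theta /= np.linalg.norm(theta)
        # Accept only if the HLS status matches the request.
        phi_star = phi[np.arange(X), opt]
        if (np.linalg.matrix_rank(phi_star) == d) == hls:
            return phi, theta
```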
Varying dimension. We construct six representations of varying dimension, from 2 up to 6. Of the two representations of dimension $d = 6$, one is HLS. Fig. 4 (left) shows that in this case, LINUCB with the HLS representation outperforms any non-HLS representation, even those with smaller dimension. This property is inherited by LEADER, which performs better than LINUCB with non-HLS representations even of much smaller dimension ($d = 2$).

Mixing representations. We construct six representations of the same dimension $d = 6$, none of which is HLS. However, they are constructed so that together they satisfy the weaker mixed-HLS assumption (Def. 1). Fig. 4 (middle left) shows that, as predicted by Thm. 3, LEADER leverages different representations in different context-action regions and thus performs significantly better than any LINUCB using the non-HLS representations. The superiority of LEADER w.r.t. the model-selection baselines is evident in this case (Fig. 4 (middle right)), since only LEADER is able to mix representations, whereas model-selection algorithms target the best in a set of bad representations. Additional experiments in App. G confirm that LEADER consistently outperforms all model-selection algorithms.

Jester Dataset. In the last experiment, we extract multiple linear representations from the Jester dataset (Goldberg et al., 2001), which consists of joke ratings in a continuous range from -10 to 10 for a total of 100 jokes and 73421 users. For a subset of 40 jokes and the 19181 users rating all these 40 jokes, we build a linear contextual problem as follows. First, we fit a 32×32 neural network to predict the ratings from features extracted via a low-rank factorization of the full matrix. Then, we take the last layer of the network as our ground-truth linear model and fit multiple smaller networks to clone its predictions, while making sure that the resulting misspecification is small. We thus obtain 7 representations with different dimensions among which, interestingly, we find that 6 are HLS. Figure 4 (right) reports the comparison between LEADER using all representations and LINUCB with each single representation on a log scale. Notably, the ability of LEADER to mix representations makes it perform better than the best candidate, while transitioning to constant regret much sooner. Finally, the fact that HLS representations arise so naturally raises the question of whether this is a more general pattern in context-action features learned from data.

Last.fm dataset. In App. F we study a variant of LEADER that is able to handle misspecified representations, and we test it on the Last.fm music-recommendation dataset (Cantador et al., 2011). See App. F.4 for details.

7. Conclusion

We provided a complete characterization of good realizable representations for LINUCB, ranging from existence to a sufficient and necessary condition to achieve problem-dependent constant regret. We introduced LEADER, a novel algorithm that, given a set of realizable linear representations, is able to adapt to the best one and even leverage their combination to achieve constant regret under the milder mixed-HLS condition. While we have focused on LINUCB, other algorithms (e.g., LinTS (Abeille & Lazaric, 2017)) as well as other settings (e.g., low-rank RL (Jin et al., 2020)) may also benefit from HLS-like assumptions. We have mentioned an approach for eliminating misspecified representations, but a non-trivial trade-off may exist between the level of misspecification and the goodness of the representation.
A slightly imprecise but very informative representation may be preferable to most bad realizable ones. Finally, we believe that moving from selection to representation learning (e.g., provided a class of features such as a neural network) is an important direction both from a theoretical and a practical perspective.

[Figure 4. Regret of LEADER and model-selection baselines on different linear contextual bandit problems. (left) Synthetic problem with varying dimensions. (middle left) Representation mixing. (middle right) Comparison to model selection baselines. (right) Jester dataset.]

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In NIPS, pp. 2312–2320, 2011.

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Online-to-confidence-set conversions and application to sparse stochastic bandits. In AISTATS, volume 22 of JMLR Proceedings, pp. 1–9. JMLR.org, 2012.

Abbasi-Yadkori, Y., Pacchiano, A., and Phan, M. Regret balancing for bandit and RL model selection, 2020.

Abeille, M. and Lazaric, A. Linear Thompson sampling revisited. In AISTATS, volume 54 of Proceedings of Machine Learning Research, pp. 176–184. PMLR, 2017.

Agarwal, A., Dudík, M., Kale, S., Langford, J., and Schapire, R. Contextual bandit learning with predictable rewards. In Artificial Intelligence and Statistics, pp. 19–26. PMLR, 2012a.

Agarwal, A., Dudík, M., Kale, S., Langford, J., and Schapire, R. E. Contextual bandit learning with predictable rewards. In AISTATS, volume 22 of JMLR Proceedings, pp. 19–26. JMLR.org, 2012b.

Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. Corralling a band of bandit algorithms. In COLT, volume 65 of Proceedings of Machine Learning Research, pp. 12–38. PMLR, 2017.

Arora, R., Marinov, T. V., and Mohri, M. Corralling stochastic bandit algorithms. CoRR, abs/2006.09255, 2020.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Bastani, H., Bayati, M., and Khosravi, K. Mostly exploration-free algorithms for contextual bandits. Management Science, 2020.

Bibaut, A. F., Chambaz, A., and van der Laan, M. J. Rate-adaptive model selection over a collection of black-box contextual bandit algorithms. CoRR, abs/2006.03632, 2020.

Bouneffouf, D. and Rish, I. A survey on practical applications of multi-armed and contextual bandits, 2019.

Cantador, I., Brusilovsky, P., and Kuflik, T. 2nd workshop on information heterogeneity and fusion in recommender systems (HetRec 2011). In Proceedings of the 5th ACM Conference on Recommender Systems, RecSys 2011, New York, NY, USA, 2011. ACM. URL http://ir.ii.uam.es/hetrec2011/index.html. Last.fm website, http://www.lastfm.com.

Chatterji, N. S., Muthukumar, V., and Bartlett, P. L. OSOM: A simultaneously optimal algorithm for multi-armed and linear contextual bandits. In AISTATS, volume 108 of Proceedings of Machine Learning Research, pp. 1844–1854. PMLR, 2020.

Chatzigeorgiou, I. Bounds on the Lambert function and their application to the outage analysis of user cooperation. CoRR, abs/1601.04895, 2016.

Chen, F. A note on matrix versions of Kantorovich-type inequality. Journal of Mathematical Inequalities, 7(2):283–288, 2013.

Chu, W., Li, L., Reyzin, L., and Schapire, R. E. Contextual bandits with linear payoff functions. In AISTATS, volume 15 of JMLR Proceedings, pp. 208–214. JMLR.org, 2011.

Degenne, R., Ménard, P., Shang, X., and Valko, M. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, pp. 2432–2442. PMLR, 2020.

Eaton, M. L. and Perlman, M. D. The non-singularity of generalized sample covariance matrices. The Annals of Statistics, pp. 710–717, 1973.

Foster, D. J., Krishnamurthy, A., and Luo, H. Model selection for contextual bandits. In NeurIPS, pp. 14714–14725, 2019.

Foster, D. J., Gentile, C., Mohri, M., and Zimmert, J. Adapting to misspecification in contextual bandits. In NeurIPS, 2020.

Ghosh, A., Sankararaman, A., and Ramchandran, K. Problem-complexity adaptive model selection for stochastic linear bandits. CoRR, abs/2006.02612, 2020.

Goldberg, K. Y., Roeder, T., Gupta, D., and Perkins, C. Eigentaste: A constant time collaborative filtering algorithm. Inf. Retr., 4(2):133–151, 2001.

Hao, B., Lattimore, T., and Szepesvári, C. Adaptive exploration in linear contextual bandit. In AISTATS, volume 108 of Proceedings of Machine Learning Research, pp. 3536–3545. PMLR, 2020.

Hoorfar, A. and Hassani, M. Inequalities on the Lambert W function and hyperpower function. J. Inequal. Pure and Appl. Math, 9(2):5–9, 2008.

Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In COLT, volume 125 of Proceedings of Machine Learning Research, pp. 2137–2143. PMLR, 2020.

Lattimore, T. and Szepesvári, C. The end of optimism? An asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pp. 728–737. PMLR, 2017.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Lattimore, T., Szepesvári, C., and Weisz, G. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning, pp. 5662–5670. PMLR, 2020.

Lee, J. N., Pacchiano, A., Muthukumar, V., Kong, W., and Brunskill, E. Online model selection for reinforcement learning with function approximation, 2020.

Maillard, O. and Munos, R. Adaptive bandits: Towards the best history-dependent strategy. In AISTATS, volume 15 of JMLR Proceedings, pp. 570–578. JMLR.org, 2011.

Neu, G. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In NeurIPS, pp. 3168–3176, 2015.

Pacchiano, A., Dann, C., Gentile, C., and Bartlett, P. Regret bound balancing and elimination for model selection in bandits and RL, 2020a.

Pacchiano, A., Phan, M., Abbasi-Yadkori, Y., Rao, A., Zimmert, J., Lattimore, T., and Szepesvári, C. Model selection in contextual stochastic bandit problems. In NeurIPS, 2020b.

Reeve, H. W. J., Mellor, J., and Brown, G. The k-nearest neighbour UCB algorithm for multi-armed bandits with covariates. In ALT, volume 83 of Proceedings of Machine Learning Research, pp. 725–752. PMLR, 2018.

Rigollet, P. and Zeevi, A. Nonparametric bandits with covariates. arXiv preprint arXiv:1003.1630, 2010.

Tirinzoni, A., Pirotta, M., Restelli, M., and Lazaric, A. An asymptotically optimal primal-dual incremental algorithm for contextual linear bandits. Advances in Neural Information Processing Systems, 33, 2020.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 12(4):389–434, 2012.

Wu, W., Yang, J., and Shen, C. Stochastic linear contextual bandits with diverse contexts. In AISTATS, volume 108 of Proceedings of Machine Learning Research, pp. 2392–2401. PMLR, 2020.