# Leveraging Good Representations in Linear Contextual Bandits

Matteo Papini*¹, Andrea Tirinzoni*¹, Marcello Restelli¹, Alessandro Lazaric², Matteo Pirotta²

*Equal contribution. Work done while at Facebook AI Research. ¹Politecnico di Milano, Milan, Italy. ²Facebook AI Research, Paris, France. Correspondence to: Matteo Papini. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

The linear contextual bandit literature is mostly focused on the design of efficient learning algorithms for a given representation. However, a contextual bandit problem may admit multiple linear representations, each one with different characteristics that directly impact the regret of the learning algorithm. In particular, recent works showed that there exist "good" representations for which constant problem-dependent regret can be achieved. In this paper, we first provide a systematic analysis of the different definitions of good representations proposed in the literature. We then propose a novel selection algorithm able to adapt to the best representation in a set of $M$ candidates. We show that the regret is indeed never worse than the regret obtained by running LINUCB on the best representation (up to a $\ln M$ factor). As a result, our algorithm achieves constant regret whenever a good representation is available in the set. Furthermore, we show that the algorithm may still achieve constant regret by implicitly constructing a good representation, even when none of the initial representations is good. Finally, we empirically validate our theoretical findings in a number of standard contextual bandit problems.

1. Introduction

The stochastic contextual bandit is a general framework to formalize sequential decision-making problems in which at each step the learner observes a context drawn from a fixed distribution, plays an action, and receives a noisy reward. The goal of the learner is to maximize the reward accumulated over $n$ rounds, and the performance is typically measured by the regret w.r.t. playing the optimal action in each context. This paradigm has found application in a large range of domains, including recommendation systems, online advertising, and clinical trials (e.g., Bouneffouf & Rish, 2019).

Linear contextual bandit (Lattimore & Szepesvári, 2020) is one of the most studied instances of contextual bandit due to its efficiency and strong theoretical guarantees. In this setting, the reward for each context $x$ and action $a$ is assumed to be representable as the linear combination between $d$-dimensional features $\phi(x, a) \in \mathbb{R}^d$ and an unknown parameter $\theta^\star \in \mathbb{R}^d$. In this case, we refer to $\phi$ as a realizable representation. Algorithms based on the optimism-in-the-face-of-uncertainty principle, such as LINUCB (Chu et al., 2011) and OFUL (Abbasi-Yadkori et al., 2011), have been proved to achieve a minimax regret bound of $O\big(S d \sqrt{n}\,\ln(nL)\big)$ and a problem-dependent regret bound of $O\big(S^2 d^2 \ln^2(nL)/\Delta\big)$, where $\Delta$ is the minimum gap between the reward of the best and second-best action across contexts, and $L$ and $S$ are upper bounds to the $\ell_2$-norm of the features $\phi$ and of $\theta^\star$, respectively.

Unfortunately, the dimension $d$ and the norm upper bounds $L$ and $S$ are not the only characteristics of a representation to have an effect on the regret, and existing bounds may fail at capturing the impact of the context-action features on the performance of the algorithm. In fact, as illustrated in Fig. 1,
running LINUCB with different realizable representations with the same parameters $d$ and $S$ may lead to significantly different performance. Notably, there are good representations for which LINUCB achieves constant regret, i.e., regret not scaling with the horizon $n$. Recent works identified different conditions on the representation that can be exploited to achieve constant regret for LINUCB (Hao et al., 2020; Wu et al., 2020). Similar conditions have also been leveraged to prove other interesting learning properties, such as sublinear regret for greedy algorithms (Bastani et al., 2020), or regret guarantees for model selection between linear and multi-arm representations (Chatterji et al., 2020; Ghosh et al., 2020). While all these conditions, often referred to as diversity conditions, depend on how certain context-arm features span the full $\mathbb{R}^d$ space, there is no systematic analysis of their connections and of which ones can be leveraged to achieve constant regret in linear contextual bandits.

In this paper, we further investigate the concept of good representations in linear bandits and we provide the following contributions: 1) We review the diversity conditions available in the literature, clarify their relationships, and discuss how they are used. We then focus on our primary goal, which is to characterize the assumptions needed to achieve constant regret for LINUCB. 2) We introduce a novel algorithm that effectively selects the best representation in a given set, thus achieving constant regret whenever at least one good representation is provided. 3) Furthermore, we show that, in certain problems, the algorithm is able to combine the given representations to implicitly form a good one, thus achieving constant problem-dependent regret even when running LINUCB on any single representation would not. 4) Finally, we empirically validate our theoretical findings on a number of contextual bandit problems.

[Figure 1. Regret of LINUCB with different realizable representations with the same dimension $d$ and parameter bound $S$. The dashed blue line is LEADER, our proposed representation-selection algorithm. Details in App. G.1.]

Related work. The problem of selecting the best representation in a given set can be seen as a specific instance of the problem of model selection in bandits. In model selection, the objective is to choose the best candidate in a set of base learning algorithms. At each step, a master algorithm is responsible for selecting a base algorithm, which in turn prescribes the action to play, and the reward is then provided as feedback to the base algorithms. Examples of model selection methods include adversarial masters, e.g., EXP4 (Auer et al., 2002; Maillard & Munos, 2011) and CORRAL (Agarwal et al., 2017; Pacchiano et al., 2020b), and stochastic masters (Abbasi-Yadkori et al., 2020; Lee et al., 2020; Bibaut et al., 2020; Pacchiano et al., 2020a). For a broader discussion refer to App. A or (Pacchiano et al., 2020a, Sec. 2). Most of these algorithms achieve the regret of the best base algorithm up to a polynomial dependence on the number $M$ of base algorithms (Agarwal et al., 2017). While existing model selection methods are general and can be applied to any type of base algorithm,¹ they may not be effective in problems with a specific structure. An alternative approach is to design the master algorithm for a specific category of base algorithms. An instance of
this case is the representation-selection problem, where the base algorithms only differ by the representation used to estimate the reward. Foster et al. (2019) and Ghosh et al. (2020) consider a set of nested representations, where the best representation is the one with the smallest dimensionality for which the reward is realizable. Finally, Chatterji et al. (2020) focus on the problem of selecting between a linear and a multi-armed bandit representation. In this paper, we consider an alternative representation-selection problem in linear contextual bandits, where the objective is to exploit constant-regret good representations. Differently from our work, Lattimore et al. (2020) say that a linear representation is good if it has a low misspecification (i.e., it represents the reward up to a small approximation error), while we focus on realizable representations for which LINUCB achieves constant regret.

¹ Most of the existing methods only require prior knowledge of the regret of the optimal base algorithm or a bound on the regret of all base algorithms. CORRAL also requires the base algorithms to satisfy certain stability conditions.

2. Preliminaries

We consider the stochastic contextual bandit problem (contextual problem for short) with context space $\mathcal{X}$ and finite action set $\mathcal{A} = [K] = \{1, \ldots, K\}$. At each round $t \ge 1$, the learner observes a context $x_t$ sampled i.i.d. from a distribution $\rho$ over $\mathcal{X}$, it selects an arm $a_t \in [K]$, and it receives a reward $y_t = \mu(x_t, a_t) + \eta_t$, where $\eta_t$ is a $\sigma$-subgaussian noise. The learner's objective is to minimize the pseudo-regret $R_n = \sum_{t=1}^{n} \big(\mu^\star(x_t) - \mu(x_t, a_t)\big)$ for any $n > 0$, where $\mu^\star(x_t) := \max_{a \in [K]} \mu(x_t, a)$. We define the minimum gap as $\Delta = \inf_{x \in \mathcal{X} : \rho(x) > 0,\ a \in [K],\ \Delta(x,a) > 0} \{\Delta(x, a)\}$, where $\Delta(x, a) = \mu^\star(x) - \mu(x, a)$.

A realizable $d_\phi$-dimensional linear representation is a feature map $\phi : \mathcal{X} \times [K] \to \mathbb{R}^{d_\phi}$ for which there exists an unknown parameter vector $\theta^\star_\phi \in \mathbb{R}^{d_\phi}$ such that $\mu(x, a) = \langle \phi(x, a), \theta^\star_\phi \rangle$. When a realizable linear representation is available, the problem is called (stochastic) linear contextual bandit and can be solved using, among others, optimistic algorithms like LINUCB (Chu et al., 2011) or OFUL (Abbasi-Yadkori et al., 2011). Given a realizable representation $\phi$, at each round $t$, LINUCB builds an estimate $\theta_{t\phi}$ of $\theta^\star_\phi$ by ridge regression on the observed data. Denote by $V_{t\phi} = \lambda I_{d_\phi} + \sum_{k=1}^{t-1} \phi(x_k, a_k)\phi(x_k, a_k)^\mathsf{T}$ the ($\lambda > 0$)-regularized design matrix at round $t$; then $\theta_{t\phi} = V_{t\phi}^{-1} \sum_{k=1}^{t-1} \phi(x_k, a_k)\, y_k$. Assuming that $\|\theta^\star_\phi\|_2 \le S_\phi$ and $\sup_{x,a} \|\phi(x, a)\|_2 \le L_\phi$, LINUCB builds a confidence ellipsoid $C_{t\phi}(\delta) = \big\{\theta \in \mathbb{R}^{d_\phi} : \|\theta_{t\phi} - \theta\|_{V_{t\phi}} \le \beta_{t\phi}(\delta)\big\}$. As shown in (Abbasi-Yadkori et al., 2011, Thm. 1), when $\beta_{t\phi}(\delta) := \sigma \sqrt{2 \ln\frac{\det(V_{t\phi})^{1/2} \det(\lambda I_{d_\phi})^{-1/2}}{\delta}} + \sqrt{\lambda}\, S_\phi$, then $\mathbb{P}\big(\forall t \ge 1,\ \theta^\star_\phi \in C_{t\phi}(\delta)\big) \ge 1 - \delta$. At each step $t$, LINUCB plays the action with the highest upper-confidence bound, $a_t = \operatorname*{argmax}_{a \in [K]} \max_{\theta \in C_{t\phi}(\delta)} \langle \phi(x_t, a), \theta \rangle$, and it is shown to achieve a regret bounded as reported in the following proposition.

Proposition 1 (Abbasi-Yadkori et al., 2011, Thm. 3, 4). For any linear contextual bandit problem with $d_\phi$-dimensional features, $\sup_{x,a} \|\phi(x, a)\|_2 \le L_\phi$, and an unknown parameter vector with $\|\theta^\star_\phi\|_2 \le S_\phi$, with probability at least $1 - \delta$, LINUCB suffers regret $R_n = O\big(S_\phi d_\phi \sqrt{n}\,\ln(n L_\phi/\delta)\big)$. Furthermore, if the problem has a minimum gap $\Delta > 0$, then the regret is bounded as² $R_n = O\big(S_\phi^2 d_\phi^2 \ln^2(n L_\phi/\delta)/\Delta\big)$.

² The logarithmic bound reported in Prop. 1 is slightly different from the one in (Abbasi-Yadkori et al., 2011) since we do not assume that the optimal feature is unique.

In the rest of the paper, we assume w.l.o.g. that all terms $\lambda$, $\Delta_{\max} = \max_{x,a} \Delta(x, a)$, $S_\phi$, $\sigma$ are larger than 1 to simplify the expression of the bounds.
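To make the preliminaries concrete, the following is a minimal numpy sketch of the LINUCB/OFUL loop just described (ridge estimate, confidence radius $\beta_{t\phi}(\delta)$, and optimistic action selection). It is illustrative only: the class name, the interface, and the default constants are our own assumptions, while the confidence radius follows the expression of $\beta_{t\phi}(\delta)$ given above.

```python
import numpy as np

class LinUCB:
    """Minimal sketch of LINUCB/OFUL for a realizable representation phi."""

    def __init__(self, d, n_actions, S=1.0, sigma=0.3, lam=1.0, delta=0.01):
        self.d, self.K = d, n_actions
        self.S, self.sigma, self.lam, self.delta = S, sigma, lam, delta
        self.V = lam * np.eye(d)       # regularized design matrix V_t
        self.b = np.zeros(d)           # running sum of phi(x_k, a_k) * y_k
        self.theta = np.zeros(d)       # ridge estimate theta_t

    def beta(self):
        # Confidence radius beta_t(delta) as in Abbasi-Yadkori et al. (2011)
        logdet = np.linalg.slogdet(self.V)[1] - self.d * np.log(self.lam)
        return self.sigma * np.sqrt(2 * (0.5 * logdet + np.log(1.0 / self.delta))) \
               + np.sqrt(self.lam) * self.S

    def act(self, phi_x):
        # phi_x: (K, d) matrix of features phi(x_t, a) for each action a
        Vinv = np.linalg.inv(self.V)
        width = np.sqrt(np.einsum("ad,de,ae->a", phi_x, Vinv, phi_x))
        ucb = phi_x @ self.theta + self.beta() * width
        return int(np.argmax(ucb))     # optimistic action

    def update(self, phi_xa, y):
        self.V += np.outer(phi_xa, phi_xa)
        self.b += y * phi_xa
        self.theta = np.linalg.solve(self.V, self.b)
```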
3. Diversity Conditions

Several assumptions, usually referred to as diversity conditions, have been proposed to define linear bandit problems with specific properties that can be leveraged to derive improved learning results. While only a few of them were actually leveraged to derive constant regret guarantees for LINUCB (others have been used to prove, e.g., sub-linear regret for the greedy algorithm, or regret guarantees for model selection algorithms), they all rely on very similar conditions on how certain context-action features span the full $\mathbb{R}^{d_\phi}$ space. In this section, we provide a thorough review of these assumptions, their connections, and how they are used in the literature. As diversity conditions are getting more widely used in the bandit literature, we believe this review may be of independent interest. Sect. 4 will then specifically focus on the notion of good representation for LINUCB.

We first introduce additional notation. For a realizable representation $\phi$, let $\phi^\star(x) := \phi(x, a^\star_x)$, where $a^\star_x \in \operatorname*{argmax}_{a \in [K]} \mu(x, a)$ is an optimal action, be the vector of optimal features for context $x$. In the following we make the assumption that $\phi^\star(x)$ is unique. Also, let $\mathcal{X}^\star(a) = \{x \in \mathcal{X} : \mu(x, a) = \mu^\star(x)\}$ denote the set of contexts where $a$ is optimal. Finally, for any matrix $A$, we denote by $\lambda_{\min}(A)$ its minimum eigenvalue.

For any contextual problem with reward $\mu$ and context distribution $\rho$, the diversity conditions introduced in the literature are summarized in Tab. 2, together with how they were leveraged to obtain regret bounds in different settings.³ We first notice that all conditions refer to the smallest eigenvalue of a design matrix constructed on specific context-action features. In other words, diversity conditions require certain features to span the full $\mathbb{R}^{d_\phi}$ space. The non-redundancy condition is a common technical assumption (e.g., Foster et al., 2019) and it simply defines a problem whose dimensionality cannot be reduced without losing information. Assuming the context distribution $\rho$ is full support, BBK and CMB are structural properties of the representation that are independent from the reward. For example, BBK requires that, for each action, there must be feature vectors lying in all orthants of $\mathbb{R}^{d_\phi}$; in the case of finite contexts, this implies there must be at least $2^{d_\phi}$ contexts. WYS and HLS involve the notion of reward optimality. In particular, WYS requires that all actions are optimal for at least a context (in the continuous case, for a non-negligible set of contexts), while HLS only focuses on optimal actions.

³ In some cases, we adapted conditions originally defined in the disjoint-parameter setting, where features only depend on the context (i.e., $\phi(x)$) and the unknown parameter $\theta^\star_a$ is different for each action $a$, to the shared-parameter setting (i.e., where features are functions of both contexts and actions) introduced in Sect. 2.

We now review how these conditions (or variations thereof) were applied in the literature. CMB is a rather strong condition that requires the features associated with each individual action to span the whole $\mathbb{R}^{d_\phi}$ space. Chatterji et al. (2020) leverage a CMB-like assumption to prove regret bounds for OSOM, a model-selection algorithm that unifies multi-armed and linear contextual bandits.
More precisely, they consider a variation of CMB, where the context distribution induces stochastic feature vectors for each action that are independent and centered. The same condition was adopted by Ghosh et al. (2020) to study representation-selection problems and derive algorithms able to adapt to the (unknown) norm of $\theta^\star_\phi$ or select the smallest realizable representation in a set of nested representations. Bastani et al. (2020, Assumption 3) introduced a condition similar to BBK for the disjoint-parameter case. In their setting, they prove that a non-explorative greedy algorithm achieves $O(\ln(n))$ problem-dependent regret in linear contextual bandits (with 2 actions).⁴ Hao et al. (2020, Theorem 3.9) showed that HLS representations can be leveraged to prove constant problem-dependent regret for LINUCB in the shared-parameter case. Concurrently, Wu et al. (2020) showed that, under WYS, LINUCB achieves constant expected regret in the disjoint-parameter case. A WYS-like condition was also used by Bastani et al. (2020, Assumption 4) to extend the result of sublinear regret for the greedy algorithm to more than two actions.

⁴ Whether this is enough for the optimality of the greedy algorithm in the shared-parameter setting is an interesting problem, but it is beyond the scope of this paper.

Figure 2. Diversity conditions proposed in the literature adapted to the shared-parameter setting. The names refer to the authors who first introduced similar conditions.

| Name | Definition | Application |
| --- | --- | --- |
| Non-redundant | $\lambda_{\min}\big(\tfrac{1}{K}\sum_{a \in [K]} \mathbb{E}_{x \sim \rho}[\phi(x,a)\phi(x,a)^\mathsf{T}]\big) > 0$ | |
| CMB | $\forall a,\ \lambda_{\min}\big(\mathbb{E}_{x \sim \rho}[\phi(x,a)\phi(x,a)^\mathsf{T}]\big) > 0$ | Model selection |
| BBK | $\forall a,\ \forall u \in \mathbb{R}^d,\ \lambda_{\min}\big(\mathbb{E}_{x}[\phi(x,a)\phi(x,a)^\mathsf{T}\,\mathbf{1}\{\phi(x,a)^\mathsf{T}u \ge 0\}]\big) > 0$ | Logarithmic regret for greedy |
| HLS | $\lambda_{\min}\big(\mathbb{E}_{x \sim \rho}[\phi^\star(x)\phi^\star(x)^\mathsf{T}]\big) > 0$ | Constant regret for LINUCB |
| WYS | $\forall a,\ \lambda_{\min}\big(\mathbb{E}_{x \sim \rho}[\phi(x,a)\phi(x,a)^\mathsf{T}\,\mathbf{1}\{x \in \mathcal{X}^\star(a)\}]\big) > 0$ | Constant regret for LINUCB |

[Figure 3. Categorization of diversity conditions.]

The relationship between all these conditions is derived in the following lemma.

Lemma 1. For any contextual problem with reward $\mu$ and context distribution $\rho$, let $\phi$ be a realizable linear representation. The relationship between the diversity conditions in Tab. 2 is summarized in Fig. 3, where each inclusion is in a strict sense and each intersection is non-empty.

This lemma reveals non-trivial connections between the diversity conditions, better understood through the examples provided in the proof (see App. B.1). BBK is indeed stronger than CMB, and thus it is sufficient for the model selection results by Chatterji et al. (2020). By superficially examining their definitions, CMB may appear stronger than HLS, but the two properties are actually non-comparable, as there are representations that satisfy one condition but not the other. The implications of Fig. 3 on constant-regret guarantees are particularly relevant for our purposes. There are representations that satisfy BBK or CMB and are neither HLS nor WYS, and thus may not enable constant regret for LINUCB. We notice that WYS is a stronger condition than HLS. Although WYS may be necessary for LINUCB to achieve constant regret in the disjoint-parameter case, HLS is sufficient for the shared-parameter case we consider in this paper. For this reason, in the following section we adopt HLS to define good representations for LINUCB and provide a more complete characterization.
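For finite problems, the conditions in Tab. 2 can be checked numerically from the features, the reward matrix, and the context distribution. The sketch below (not from the paper; input conventions and the tolerance are our assumptions) estimates the relevant design matrices and their smallest eigenvalues; BBK is omitted since it quantifies over all directions $u \in \mathbb{R}^d$.

```python
import numpy as np

def min_eig(M):
    return float(np.linalg.eigvalsh(M).min())

def diversity_report(phi, mu, rho, tol=1e-9):
    """phi: (X, K, d) features, mu: (X, K) rewards, rho: (X,) context probabilities.
    Returns the smallest eigenvalues appearing in the definitions of Tab. 2."""
    X, K, d = phi.shape
    opt = mu.argmax(axis=1)                       # one optimal action per context
    # HLS: E_rho[ phi*(x) phi*(x)^T ]
    phi_star = phi[np.arange(X), opt]             # (X, d) optimal features
    hls = min_eig((rho[:, None, None] *
                   np.einsum("xi,xj->xij", phi_star, phi_star)).sum(0))
    # CMB: for every action a, E_rho[ phi(x,a) phi(x,a)^T ]
    cmb = min(min_eig((rho[:, None, None] *
                       np.einsum("xi,xj->xij", phi[:, a], phi[:, a])).sum(0))
              for a in range(K))
    # WYS: same as CMB but restricted to contexts where action a is optimal
    gaps = mu.max(axis=1, keepdims=True) - mu
    wys = min(min_eig((rho[:, None, None] * (gaps[:, a] < tol)[:, None, None] *
                       np.einsum("xi,xj->xij", phi[:, a], phi[:, a])).sum(0))
              for a in range(K))
    # Non-redundancy: design matrix averaged over actions
    nonred = min_eig((rho[:, None, None] *
                      np.einsum("xai,xaj->xij", phi, phi)).sum(0) / K)
    return {"non-redundant": nonred, "CMB": cmb, "HLS": hls, "WYS": wys}
```

A condition holds whenever the corresponding eigenvalue is strictly positive (up to numerical tolerance).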
4. Good Representations for Constant Regret

The HLS condition was introduced by Hao et al. (2020), who provided a first analysis of its properties. In this section, we complement those results by providing a complete proof of a constant regret bound, a proof of the fact that HLS is actually necessary for constant regret, and a novel characterization of the existence of HLS representations. In the following we define $\lambda_{\phi,\mathrm{HLS}} := \lambda_{\min}\big(\mathbb{E}_{x \sim \rho}[\phi^\star(x)\phi^\star(x)^\mathsf{T}]\big)$, which is strictly positive for HLS representations.

4.1. Constant Regret Bound

We begin by deriving a constant problem-dependent regret bound for LINUCB under the HLS condition.

Lemma 2. Consider a contextual bandit problem with realizable linear representation $\phi$ satisfying the HLS condition (see Tab. 2). Assume $\Delta > 0$, $\max_{x,a} \|\phi(x, a)\|_2 \le L_\phi$ and $\|\theta^\star_\phi\|_2 \le S_\phi$. Then, with probability at least $1 - 2\delta$, the regret of OFUL after $n \ge 1$ steps is at most
$$R_n \le \frac{32\lambda \Delta_{\max}^2 S_\phi^2 \sigma^2}{\Delta^2}\left(\ln\frac{1}{\delta} + d_\phi \ln\left(1 + \frac{\tau_\phi L_\phi^2}{\lambda d_\phi}\right)\right),$$
where $\Delta_{\max} = \max_{x,a} \Delta(x, a)$ is the maximum gap and
$$\tau_\phi \le \max\left\{\frac{384^2\, d_\phi^2 L_\phi^2 S_\phi^2 \sigma^2 \lambda}{\Delta^2\, \lambda_{\phi,\mathrm{HLS}}^2}\, \ln^2\!\left(\frac{64\, d_\phi^2 L_\phi^3 \sigma S_\phi}{\Delta\,\delta\,\lambda_{\phi,\mathrm{HLS}}}\right),\ \frac{768\, L_\phi^4}{\lambda_{\phi,\mathrm{HLS}}^2}\, \ln\!\left(\frac{512\, d_\phi L_\phi^4}{\delta\,\lambda_{\phi,\mathrm{HLS}}^2}\right)\right\}.$$

We first notice that $\tau_\phi$ is independent from the horizon $n$, thus making the previous bound a constant depending only on the problem formulation (i.e., gap $\Delta$, norms $L_\phi$ and $S_\phi$) and on the value $\lambda_{\phi,\mathrm{HLS}}$, which measures how much the representation $\phi$ satisfies the HLS condition. Furthermore, one can always take the minimum between the constant regret bound in Lem. 2 and any other valid regret bound for OFUL (e.g., $O(\log(n)/\Delta)$), which may be tighter for small values of $n$. While Lem. 2 provides high-probability guarantees, we can easily derive a constant expected-regret bound by running LINUCB with a decreasing schedule for $\delta$ (e.g., $\delta_t \propto 1/t^3$) and with a slightly different proof (see App. C and the proof sketch below).

Proof sketch (full proof in App. C). Following Hao et al. (2020), the idea is to show that the instantaneous regret $r_{t+1} = \langle \theta^\star, \phi^\star(x_{t+1}) - \phi(x_{t+1}, a_{t+1}) \rangle$ is zero for sufficiently large (but constant) time $t$. By the standard regret analysis, we have
$$r_{t+1} \le 2\beta_{t+1}(\delta)\, \|\phi(x_{t+1}, a_{t+1})\|_{V_{t+1}^{-1}} \le \frac{2L\beta_{t+1}(\delta)}{\sqrt{\lambda_{\min}(V_{t+1})}}.$$
Given the minimum-gap assumption, a sufficient condition for $r_{t+1} = 0$ is that the previous upper bound is smaller than $\Delta$, which gives $\lambda_{\min}(V_{t+1}) > 4L^2\beta_{t+1}^2(\delta)/\Delta^2$. Since $\Delta > 0$, the problem-dependent regret bound in Prop. 1 holds, and the number of pulls to suboptimal arms up to time $t$ is bounded by $g_t(\delta) = O\big((d \ln(t/\delta)/\Delta)^2\big)$. Hence, the optimal arms are pulled linearly often and, by leveraging the HLS assumption, we are able to show that the minimum eigenvalue of the design matrix grows linearly in time as
$$\lambda_{\min}(V_{t+1}) \ge \lambda + t\lambda_{\mathrm{HLS}} - 8L^2\sqrt{t \ln(2 d_\phi t/\delta)}.$$
By relating the last two equations, we obtain an inequality of the form $t\lambda_{\mathrm{HLS}} - o(t) > o(t)$. If we define $\tau < \infty$ as the smallest (deterministic) time such that this inequality holds, we have that after $\tau$ the immediate regret is zero, thus concluding the proof. Note that, if we wanted to bound the expected regret, we could set $\delta_t \propto 1/t^3$ and the above inequality would still be of the same form (although the resulting $\tau$ would be slightly different).

Comparison with existing bounds. Hao et al. (2020, Theorem 3.9) prove that LINUCB with HLS representations achieves $\limsup_{n\to\infty} R_n < \infty$, without characterizing the time at which the regret vanishes. Instead, our Lem. 2 provides an explicit problem-dependent constant regret bound. Wu et al. (2020, Theorem 2) consider the disjoint-parameter setting and rely on the WYS condition. While they indeed prove a constant regret result, their bound depends on the minimum probability of observing a context (or, in the continuous case, a properly defined meta-context). This reflects the general tendency, in previous works, to frame diversity conditions simply as a property of the context distribution $\rho$. On the other hand, our characterization of $\tau$ in terms of $\lambda_{\phi,\mathrm{HLS}}$ (Lem. 2) allows relating the regret to the goodness of the representation $\phi$ for the problem at hand.
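The mechanism behind Lem. 2 can also be observed empirically: under HLS, the smallest eigenvalue of the design matrix grows linearly, so the confidence width eventually drops below $\Delta$ and the instantaneous regret vanishes. The following is a minimal simulation sketch on a finite problem, assuming the hypothetical `LinUCB` class from the Sec. 2 snippet is in scope; the function name and constants are our own.

```python
import numpy as np

def run_hls_demo(phi, theta_star, rho, n=20000, sigma=0.3, seed=0):
    """Track lambda_min(V_t) and cumulative pseudo-regret of LinUCB.
    phi: (X, K, d) realizable features with mu(x, a) = <phi(x, a), theta_star>."""
    rng = np.random.default_rng(seed)
    X, K, d = phi.shape
    mu = phi @ theta_star                     # (X, K) expected rewards
    agent = LinUCB(d, K, sigma=sigma)         # sketch class from the Preliminaries snippet
    regret, min_eigs, cum = [], [], 0.0
    for t in range(n):
        x = rng.choice(X, p=rho)
        a = agent.act(phi[x])
        y = mu[x, a] + sigma * rng.standard_normal()
        agent.update(phi[x, a], y)
        cum += mu[x].max() - mu[x, a]
        regret.append(cum)
        min_eigs.append(np.linalg.eigvalsh(agent.V).min())
    return np.array(regret), np.array(min_eigs)

# Under an HLS representation, min_eigs grows roughly as t * lambda_{phi,HLS} and the
# regret curve flattens after a problem-dependent time; with a non-HLS representation
# the regret keeps growing logarithmically with n.
```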
4.2. Removing the Minimum-Gap Assumption

Constant-regret bounds for LINUCB rely on a minimum-gap assumption ($\Delta > 0$). In this section we show that LINUCB can still benefit from HLS representations when $\Delta = 0$, provided a margin condition holds (e.g., Rigollet & Zeevi, 2010; Reeve et al., 2018). Intuitively, we require that the probability of observing a context $x$ whose minimum gap $\Delta(x) = \min_a \Delta(x, a)$ is small decays polynomially with the gap itself.

Assumption 1 (Margin condition). There exist $C, \alpha > 2$ such that for all $\epsilon > 0$: $\rho\big(\{x \in \mathcal{X} : \Delta(x) \le \epsilon\}\big) \le C\epsilon^\alpha$.

The following theorem provides a problem-dependent regret bound for LINUCB under this margin assumption.

Theorem 1. Consider a linear contextual bandit problem satisfying the margin condition (Asm. 1). Assume $\max_{x,a} \|\phi(x, a)\|_2 \le L_\phi$ and $\|\theta^\star_\phi\|_2 \le S_\phi$. Then, given a representation $\phi$, with probability at least $1 - 3\delta$, the regret of OFUL after $n \ge 1$ steps is at most
$$R_n \le O\Big(\lambda(\Delta_{\max} S_\phi \sigma d_\phi)^2\, n^{1/\alpha} + \sqrt{C}\, d_\phi \ln^2(L_\phi n/\delta)\Big).$$
When $\phi$ is HLS ($\lambda_{\phi,\mathrm{HLS}} > 0$), let $\tau_\phi \propto (\lambda_{\phi,\mathrm{HLS}})^{\frac{\alpha}{2-\alpha}}$; then
$$R_n \le O\Big(\Delta_{\max}\tau_\phi + \sqrt{C}\, d_\phi \ln^2(L_\phi n/\delta)\Big).$$

We first notice that, in general, LINUCB suffers $\widetilde{O}(n^{1/\alpha})$ regret, which can be significantly larger than in the minimum-gap case. On the other hand, with HLS representations, LINUCB achieves logarithmic regret, regardless of the value of $\alpha$. The intuition is that, when the HLS condition holds, the algorithm collects sufficient information about $\theta^\star_\phi$ by pulling the optimal arms in rounds with large minimum gap, which occur with high probability by the margin condition. This yields at most constant regret in such rounds (first term above), while it can be shown that the regret in steps where the minimum gap is very small is at most logarithmic (second term above).

4.3. Further Analysis of the HLS Condition

While Lem. 2 shows that HLS is sufficient for achieving constant regret, the following proposition shows that it is also necessary. While this property was first mentioned by Hao et al. (2020) as a remark in a footnote, we provide a formal proof in App. C.5.

Proposition 2. For any contextual problem with finite contexts, full-support context distribution, and given a non-redundant realizable representation $\phi$, LINUCB achieves sub-logarithmic regret if and only if $\phi$ satisfies the HLS condition.

As already observed in Section 4, the HLS condition can be equivalently expressed as⁵ $\operatorname{span}\{\phi^\star(x) \mid x \in \mathcal{X}\} = \mathbb{R}^d$, i.e., optimal features must span the whole $d$-dimensional Euclidean space, where $d$ is the dimension of $\phi$. If we admit redundant representations, a weaker condition is sufficient to achieve constant regret: $\operatorname{span}\{\phi^\star(x) \mid x \in \mathcal{X}\} = \operatorname{span}\{\phi(x, a) \mid x \in \mathcal{X}, a \in \mathcal{A}\}$, i.e., optimal features must span the whole feature space, which may be a subspace of $\mathbb{R}^d$ in general. We prove that this weak HLS condition is sufficient for LINUCB to achieve constant regret as Corollary 1 in App. E.2. This also shows that constant-regret guarantees are preserved by adding redundant features to an HLS representation.

⁵ That is assuming the context distribution is full-support. Otherwise, it is enough to replace $\mathcal{X}$ with the support of the context distribution $\operatorname{supp}(\rho)$.

Finally, we derive the following important existence result.
Lemma 3. For any contextual bandit problem with optimal reward⁶ $\mu^\star(x) \neq 0$ for all $x \in \mathcal{X}$ that has either i) a finite context set with at least $d$ contexts with nonzero probability, or ii) a Borel context space and a non-degenerate context distribution⁷, for any dimension $d \ge 1$, there exists an infinite number of $d$-dimensional realizable HLS representations.

⁶ This condition is technical and it can be easily relaxed.

This result crucially shows that the HLS condition is "robust", since in any contextual problem it is possible to construct an infinite number of representations satisfying the HLS condition. In App. B.2, we indeed provide an oracle procedure for constructing an HLS representation. This result also supports the starting point of the next section, where we assume that a learner is provided with a set of representations that may contain at least one good representation, i.e., an HLS representation.

5. Representation Selection

In this section, we study the problem of representation selection in linear bandits. We consider a linear contextual problem with reward $\mu$ and context distribution $\rho$. Given a set of $M$ realizable linear representations $\{\phi_i : \mathcal{X} \times [K] \to \mathbb{R}^{d_i}\}_{i \in [M]}$, the objective is to design a learning algorithm able to perform as well as the best representation, and thus achieve constant regret when a good representation is available. As usual, we assume $\theta^\star_i \in \mathbb{R}^{d_i}$ is unknown, but the algorithm is provided with a bound on the parameter and feature norms of the different representations.

5.1. The LEADER Algorithm

We introduce LEADER (Linear rEpresentation bAnDit mixER), see Alg. 1. At each round $t$, LEADER builds an estimate $\theta_{ti}$ of the unknown parameter $\theta^\star_i$ of each representation $\phi_i$.⁸ These estimates are by nature off-policy, and thus all the samples $(x_l, a_l, y_l)_{l < t}$ can be used to update every representation. For each representation $i$, LEADER computes an optimistic index $U_{ti}(x, a) = \langle \phi_i(x, a), \theta_{ti} \rangle + \beta_{ti}(\delta)\, \|\phi_i(x, a)\|_{V_{ti}^{-1}}$ and plays the action whose smallest index across representations is largest,
$$a_t \in \operatorname*{argmax}_{a \in [K]} \min_{i \in [M]} U_{ti}(x_t, a). \qquad (2)$$
Throughout this section we assume that a minimum gap exists (i.e., $\Delta > 0$). The analysis can be generalized to $\Delta = 0$ as done in Sec. 4.2.

Algorithm 1 LEADER
Input: representations $(\phi_i)_{i \in [M]}$ with values $(L_i, S_i)_{i \in [M]}$, regularization factor $\lambda \ge 1$, confidence level $\delta \in (0, 1)$.
Initialize $V_{1i} = \lambda I_{d_i}$, $\theta_{1i} = 0_{d_i}$ for each $i \in [M]$.
for $t = 1, 2, \ldots$ do
&nbsp;&nbsp;Observe context $x_t$.
&nbsp;&nbsp;Pull action $a_t \in \operatorname*{argmax}_{a \in [K]} \min_{i \in [M]} \{U_{ti}(x_t, a)\}$.
&nbsp;&nbsp;Observe reward $r_t$ and, for each $i \in [M]$, set $V_{t+1,i} = V_{ti} + \phi_i(x_t, a_t)\phi_i(x_t, a_t)^\mathsf{T}$ and $\theta_{t+1,i} = V_{t+1,i}^{-1} \sum_{l=1}^{t} \phi_i(x_l, a_l)\, r_l$.
end for

Thm. 2 establishes the regret guarantee of LEADER (Alg. 1).

Theorem 2. Consider a contextual bandit problem with reward $\mu$, context distribution $\rho$, and $\Delta > 0$. Let $(\phi_i)_{i \in [M]}$ be a set of $M$ linearly realizable representations such that $\max_{x,a} \|\phi_i(x, a)\|_2 \le L_i$ and $\|\theta^\star_i\|_2 \le S_i$. Then, for any $n \ge 1$, with probability at least $1 - 2\delta$, LEADER suffers a regret
$$R_n \le \min_{i \in [M]} \left\{ \frac{32\lambda \Delta_{\max}^2 S_i^2 \sigma^2}{\Delta^2} \left( \ln\frac{M}{\delta} + d_i \ln\left(1 + \frac{\min\{\tau_i, n\}\, L_i^2}{\lambda d_i}\right) \right) \right\},$$
where $\tau_i \propto (\lambda_{i,\mathrm{HLS}})^{-2}$ if $\phi_i$ is HLS and $\tau_i = +\infty$ otherwise.

This shows that the problem-dependent regret bound of LEADER is not worse than the one of the best representation (see Prop. 1), up to a $\ln M$ factor. This means that the cost of representation selection is almost negligible. Furthermore, Thm. 2 shows that LEADER not only achieves a constant regret bound when an HLS representation is available, but this bound scales as the one of the best HLS representation. In fact, notice that the quality of an HLS representation does not depend only on known quantities such as $d_i$, $L_i$, $S_i$, but crucially on the HLS eigenvalue $\lambda_{i,\mathrm{HLS}}$, which is usually not known in advance, as it depends on the features of the optimal arms.
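The selection rule of Alg. 1 is simple to implement. Below is a minimal sketch that reuses the hypothetical `LinUCB` class from the Sec. 2 snippet as the per-representation off-policy estimator; the class name, the $\delta/M$ split across representations (a simple union bound), and the interface are our own assumptions rather than the paper's reference implementation.

```python
import numpy as np

class Leader:
    """Sketch of LEADER (Alg. 1): one off-policy ridge estimator per representation,
    action chosen by maximizing the smallest optimistic index across representations."""

    def __init__(self, dims, n_actions, sigma=0.3, lam=1.0, delta=0.01):
        self.K = n_actions
        # One LinUCB-style estimator per representation (class from the Sec. 2 sketch);
        # delta is split across the M representations as a simple union bound (assumption).
        self.base = [LinUCB(d, n_actions, sigma=sigma, lam=lam, delta=delta / len(dims))
                     for d in dims]

    def act(self, phis_x):
        # phis_x: list of M feature matrices; phis_x[i] has shape (K, d_i) for context x_t
        ucb = np.full(self.K, np.inf)
        for est, phi_x in zip(self.base, phis_x):
            Vinv = np.linalg.inv(est.V)
            width = est.beta() * np.sqrt(np.einsum("ad,de,ae->a", phi_x, Vinv, phi_x))
            # Eq. 2: for each action keep the tightest (smallest) upper confidence bound
            ucb = np.minimum(ucb, phi_x @ est.theta + width)
        return int(np.argmax(ucb))

    def update(self, phis_xa, y):
        # Every representation is updated with the same (off-policy) sample
        for est, phi_xa in zip(self.base, phis_xa):
            est.update(phi_xa, y)
```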
5.2. Combining Representations

In the previous section, we have shown that LEADER can perform as well as the best representation in the set. However, by inspecting the action selection rule (Eq. 2), we notice that, to evaluate the reward of an action in the current context, LEADER selects the representation with the smallest uncertainty, thus potentially using different representations for different context-action pairs. This leads to the question: can LEADER do better than the best representation in the set?

We show that, in certain cases, LEADER is able to combine representations and achieve constant regret when none of the individual representations would. The intuition is that a subset of locally good representations can be combined to recover a condition similar to HLS. This property is formally stated in the following definition.

Definition 1 (Mixing HLS). Consider a linear contextual problem with reward $\mu$ and context distribution $\rho$, and a set of $M$ realizable linear representations $\phi_1, \ldots, \phi_M$. Define $M_i = \mathbb{E}_{x \sim \rho}\big[\phi_i^\star(x)\phi_i^\star(x)^\mathsf{T}\big]$ and let $Z_i = \{(x, a) \in \mathcal{X} \times \mathcal{A} \mid \phi_i(x, a) \in \operatorname{Im}(M_i)\}$ be the set of context-action pairs whose features belong to the column space of $M_i$, i.e., that lie in the span of optimal features. We say that the set $(\phi_i)$ satisfies the mixed-HLS condition if $\mathcal{X} \times \mathcal{A} \subseteq \bigcup_{i=1}^{M} Z_i$. Let $\lambda_i^+ = \lambda_{\min}^+(M_i)$ be the minimum nonzero eigenvalue of $M_i$.

Intuitively, the previous condition relies on the observation that every representation satisfies a restricted HLS condition on the context-action pairs $(x, a)$ whose features $\phi_i(x, a)$ are spanned by the optimal features $\phi_i^\star(x)$. In this case, the characterizing eigenvalue is $\lambda_i^+$, instead of the smallest eigenvalue $\lambda_{i,\mathrm{HLS}}$ (which may be zero). If every context-action pair is in the restriction $Z_i$ of some representation, we have the mixed-HLS property. In particular, if representation $i$ is HLS, $\lambda_i^+ = \lambda_{i,\mathrm{HLS}}$ and $Z_i = \mathcal{X} \times \mathcal{A}$. So, HLS is a special case of mixed-HLS. In App. E.2, we provide simple examples of sets of representations satisfying Def. 1. Note that, strictly speaking, there is not a single mixed representation solving the whole problem. Even defining one would be problematic, since each representation may have a different parameter and even a different dimension. Instead, each representation specializes on a different portion of the context-action space. If together they cover the whole space, the benefits of HLS are recovered, as illustrated in the following theorem.

Theorem 3. Consider a stochastic bandit problem with reward $\mu$, context distribution $\rho$, and $\Delta > 0$. Let $(\phi_i)_{i \in [M]}$ be a set of $M$ realizable linear representations satisfying the mixed-HLS property in Def. 1. Then, with probability at least $1 - 2\delta$, there exists a time $\tau < \infty$ independent from $n$ such that, for any $n \ge 1$, the pseudo-regret of LEADER is
$$R_n \le \min_{i \in [M]} \left\{ \frac{32\lambda \Delta_{\max}^2 S_i^2 \sigma^2}{\Delta^2} \left( \ln\frac{M}{\delta} + d_i \ln\left(1 + \frac{\tau L_i^2}{\lambda d_i}\right) \right) \right\}.$$

First, note that we are still scaling with the characteristics of the best representation in the set (i.e., $d_i$, $L_i$ and $S_i$). However, the time $\tau$ to constant regret is a global value rather than being different for each representation. This highlights that mixed-HLS is a global property of the set of representations rather than an individual one as before. In particular, whenever no representation is (globally) HLS (i.e., $\lambda_{i,\mathrm{HLS}} = 0$ for all $\phi_i$), we can show that in the worst case $\tau$ scales as $(\min_i \lambda_i^+)^{-2}$. In practice, we may expect LEADER to behave even better than that since i) not all the representations may contribute actively to the mixed-HLS condition; and ii) multiple representations may cover the same region of the context-action space. In the latter case, since LEADER leverages all the representations at once, its regret would rather scale with the largest minimum nonzero eigenvalue $\lambda_i^+$ among all the representations covering such a region. We refer to App. E.2 for a more complete discussion.
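On finite problems, Def. 1 can also be verified numerically: compute each $M_i$, project every feature vector onto the column space of $M_i$, and check that each context-action pair is covered by at least one representation. A rough sketch under the same finite-problem conventions as the earlier snippets (the tolerance and input format are our assumptions):

```python
import numpy as np

def mixed_hls_check(phis, mu, rho, tol=1e-8):
    """phis: list of M feature tensors; phis[i] has shape (X, K, d_i).
    Returns (mixed_hls, lambda_plus): mixed_hls is True iff every (x, a) pair belongs to
    some Z_i of Def. 1; lambda_plus[i] is the smallest nonzero eigenvalue of M_i."""
    X, K = mu.shape
    opt = mu.argmax(axis=1)
    covered = np.zeros((X, K), dtype=bool)
    lambda_plus = []
    for phi in phis:
        phi_star = phi[np.arange(X), opt]                   # optimal features, shape (X, d_i)
        M_i = (rho[:, None, None] *
               np.einsum("xi,xj->xij", phi_star, phi_star)).sum(0)
        w, U = np.linalg.eigh(M_i)
        nonzero = w > tol
        lambda_plus.append(float(w[nonzero].min()) if nonzero.any() else 0.0)
        P = U[:, nonzero] @ U[:, nonzero].T                 # projector onto Im(M_i)
        residual = phi - np.einsum("ij,xaj->xai", P, phi)   # component outside Im(M_i)
        covered |= np.linalg.norm(residual, axis=-1) < tol  # (x, a) belongs to Z_i
    return bool(covered.all()), lambda_plus
```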
5.3. Discussion

Most of the model selection algorithms reviewed in the introduction could be readily applied to select the best representation for LINUCB. However, the generality of their objective comes with several shortcomings when instantiated in our specific problem (see App. A for a more detailed comparison). First, model selection methods achieve the performance of the best algorithm up to a polynomial dependence on the number $M$ of models. This already makes them a weaker choice compared to LEADER, which, by leveraging the specific structure of the problem, suffers only a logarithmic dependence on $M$. Second, model selection algorithms are often studied in a worst-case analysis, which reveals a high cost for adaptation. For instance, corralling algorithms (Agarwal et al., 2017; Pacchiano et al., 2020b) pay an extra $\sqrt{n}$ regret, which would make them unsuitable to target the constant regret of good representations. Similar costs are common to other approaches (Abbasi-Yadkori et al., 2020; Pacchiano et al., 2020a). It is unclear whether a problem-dependent analysis can be carried out and whether this could shave off such a dependence. Third, these algorithms are generally designed to adapt to a specific best base algorithm. To the best of our knowledge, there is no evidence that model selection methods could combine algorithms to achieve better performance than the best candidate, a behavior that we proved for LEADER in our setting. On the other hand, model selection algorithms effectively deal with non-realizable representations in certain cases (e.g., Foster et al., 2020; Abbasi-Yadkori et al., 2020; Pacchiano et al., 2020a), while LEADER is limited to the realizable case. While a complete study of the model misspecification case is beyond the scope of this paper, in App. F we discuss how a variation of the approach presented in (Agarwal et al., 2012b) could be paired with LEADER to discard misspecified representations and possibly recover the properties of good representations.

6. Experiments

In this section, we report experimental results on two synthetic problems and one dataset-based problem. For each problem, we evaluate the behavior of LEADER against LINUCB and model selection algorithms: EXP4.IX (Neu, 2015), CORRAL and EXP3.P in the stochastic version by Pacchiano et al. (2020b), and Regret Balancing with and without elimination (REGBALELIM and REGBAL) (Abbasi-Yadkori et al., 2020; Pacchiano et al., 2020a). See App. G for a detailed discussion and additional experiments. All results are averaged over 20 independent runs, with shaded areas corresponding to 2 standard deviations. We always set the parameters to $\lambda = 1$, $\delta = 0.01$, and $\sigma = 0.3$. All the representations we consider are normalized to have $\|\theta^\star_i\|_2 = 1$.

Synthetic Problems. We define a randomly-generated contextual bandit problem, for which we construct sets of realizable linear representations with different properties (see App. G.1 for details; a purely illustrative sketch of one such construction is given below). The purpose of these experiments is twofold: to show the different behavior of LINUCB with different representations, and to evaluate the ability of LEADER to select and mix representations.
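The exact construction used in the experiments is in App. G.1, which is not reproduced here. As a purely illustrative stand-in, the sketch below builds realizable representations of the same finite reward matrix by appending free coordinates and applying a random change of basis, which preserves realizability; whether the result is HLS is controlled by whether the optimal features are allowed to span the full space. All names and design choices are our assumptions.

```python
import numpy as np

def make_representation(mu, d, rng, hls=False):
    """Build one d-dimensional (d >= 2) realizable representation of a reward matrix
    mu of shape (X, K). The first coordinate carries the reward; the remaining d - 1
    coordinates are free. With hls=False they are zeroed on optimal arms, a simple way
    to keep optimal features in a low-dimensional subspace and break HLS."""
    X, K = mu.shape
    opt = mu.argmax(axis=1)
    while True:
        extra = rng.standard_normal((X, K, d - 1))
        if not hls:
            extra[np.arange(X), opt] = 0.0
        phi = np.concatenate([mu[..., None], extra], axis=-1)
        theta = np.zeros(d)
        theta[0] = 1.0
        # Random invertible change of basis: phi -> A phi, theta -> A^{-T} theta,
        # which keeps <phi(x, a), theta> = mu(x, a).
        A = rng.standard_normal((d, d))
        phi = phi @ A.T
        theta = np.linalg.solve(A.T, theta)
        # Normalize so that ||theta||_2 = 1, as in the experiments.
        phi *= np.linalg.norm(theta)
        theta /= np.linalg.norm(theta)
        # Accept only if the HLS status matches the request.
        phi_star = phi[np.arange(X), opt]
        if (np.linalg.matrix_rank(phi_star) == d) == hls:
            return phi, theta
```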
Varying dimension. We construct six representations of varying dimension, from 2 up to 6. Of the two representations of dimension $d = 6$, one is HLS. Fig. 4 (left) shows that in this case, LINUCB with the HLS representation outperforms any non-HLS representation, even those with smaller dimension. This property is inherited by LEADER, which performs better than LINUCB with non-HLS representations even of much smaller dimension ($d = 2$).

Mixing representations. We construct six representations of the same dimension $d = 6$, none of which is HLS. However, they are constructed so that together they satisfy the weaker mixed-HLS assumption (Def. 1). Fig. 4 (middle left) shows that, as predicted by Thm. 3, LEADER leverages different representations in different context-action regions and thus performs significantly better than any LINUCB using the non-HLS representations. The superiority of LEADER w.r.t. the model-selection baselines is evident in this case (Fig. 4 (middle right)), since only LEADER is able to mix representations, whereas model-selection algorithms target the best in a set of bad representations. Additional experiments in App. G confirm that LEADER consistently outperforms all model-selection algorithms.

Jester Dataset. In the last experiment, we extract multiple linear representations from the Jester dataset (Goldberg et al., 2001), which consists of joke ratings in a continuous range from -10 to 10 for a total of 100 jokes and 73421 users. For a subset of 40 jokes and the 19181 users rating all these 40 jokes, we build a linear contextual problem as follows. First, we fit a 32×32 neural network to predict the ratings from features extracted via a low-rank factorization of the full matrix. Then, we take the last layer of the network as our ground-truth linear model and fit multiple smaller networks to clone its predictions, while making sure that the resulting misspecification is small. We thus obtain 7 representations with different dimensions among which, interestingly, we find that 6 are HLS. Figure 4 (right) reports the comparison between LEADER using all representations and LINUCB with each single representation on a log scale. Notably, the ability of LEADER to mix representations makes it perform better than the best candidate, while transitioning to constant regret much sooner. Finally, the fact that HLS representations arise so naturally raises the question of whether this is a more general pattern in context-action features learned from data.

Last.fm dataset. In App. F we study a variant of LEADER that is able to handle misspecified representations, and we test it on the Last.fm music-recommendation dataset (Cantador et al., 2011). See App. F.4 for details.

7. Conclusion

We provided a complete characterization of good realizable representations for LINUCB, ranging from existence to a sufficient and necessary condition to achieve problem-dependent constant regret. We introduced LEADER, a novel algorithm that, given a set of realizable linear representations, is able to adapt to the best one and even leverage their combination to achieve constant regret under the milder mixed-HLS condition. While we have focused on LINUCB, other algorithms (e.g., LinTS (Abeille & Lazaric, 2017)) as well as other settings (e.g., low-rank RL (Jin et al., 2020)) may also benefit from HLS-like assumptions. We have mentioned an approach for eliminating misspecified representations, but a non-trivial trade-off may exist between the level of misspecification and the goodness of the representation.
A slightly imprecise but very informative representation may be preferable to most bad realizable ones. Finally, we believe that moving from selection to representation learning (e.g., provided a class of features such as a neural network) is an important direction both from a theoretical and a practical perspective.

[Figure 4. Regret of LEADER and model-selection baselines on different linear contextual bandit problems. (left) Synthetic problem with varying dimensions. (middle left) Representation mixing. (middle right) Comparison to model selection baselines. (right) Jester dataset.]

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In NIPS, pp. 2312–2320, 2011.

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Online-to-confidence-set conversions and application to sparse stochastic bandits. In AISTATS, volume 22 of JMLR Proceedings, pp. 1–9. JMLR.org, 2012.

Abbasi-Yadkori, Y., Pacchiano, A., and Phan, M. Regret balancing for bandit and RL model selection, 2020.

Abeille, M. and Lazaric, A. Linear Thompson sampling revisited. In AISTATS, volume 54 of Proceedings of Machine Learning Research, pp. 176–184. PMLR, 2017.

Agarwal, A., Dudík, M., Kale, S., Langford, J., and Schapire, R. Contextual bandit learning with predictable rewards. In Artificial Intelligence and Statistics, pp. 19–26. PMLR, 2012a.

Agarwal, A., Dudík, M., Kale, S., Langford, J., and Schapire, R. E. Contextual bandit learning with predictable rewards. In AISTATS, volume 22 of JMLR Proceedings, pp. 19–26. JMLR.org, 2012b.

Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. Corralling a band of bandit algorithms. In COLT, volume 65 of Proceedings of Machine Learning Research, pp. 12–38. PMLR, 2017.

Arora, R., Marinov, T. V., and Mohri, M. Corralling stochastic bandit algorithms. CoRR, abs/2006.09255, 2020.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Bastani, H., Bayati, M., and Khosravi, K. Mostly exploration-free algorithms for contextual bandits. Management Science, 2020.

Bibaut, A. F., Chambaz, A., and van der Laan, M. J. Rate-adaptive model selection over a collection of black-box contextual bandit algorithms. CoRR, abs/2006.03632, 2020.

Bouneffouf, D. and Rish, I. A survey on practical applications of multi-armed and contextual bandits, 2019.

Cantador, I., Brusilovsky, P., and Kuflik, T. 2nd workshop on information heterogeneity and fusion in recommender systems (HetRec 2011). In Proceedings of the 5th ACM Conference on Recommender Systems, RecSys 2011, New York, NY, USA, 2011. ACM. URL http://ir.ii.uam.es/hetrec2011/index.html. Last.fm website, http://www.lastfm.com.

Chatterji, N. S., Muthukumar, V., and Bartlett, P. L. OSOM: A simultaneously optimal algorithm for multi-armed and linear contextual bandits. In AISTATS, volume 108 of Proceedings of Machine Learning Research, pp. 1844–1854. PMLR, 2020.

Chatzigeorgiou, I. Bounds on the Lambert function and their application to the outage analysis of user cooperation. CoRR, abs/1601.04895, 2016.

Chen, F. A note on matrix versions of Kantorovich-type inequality. Journal of Mathematical Inequalities, 7(2):283–288, 2013.

Chu, W., Li, L., Reyzin, L., and Schapire, R. E. Contextual bandits with linear payoff functions. In AISTATS, volume 15 of JMLR Proceedings, pp. 208–214. JMLR.org, 2011.

Degenne, R., Ménard, P., Shang, X., and Valko, M. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, pp. 2432–2442. PMLR, 2020.

Eaton, M. L. and Perlman, M. D. The non-singularity of generalized sample covariance matrices. The Annals of Statistics, pp. 710–717, 1973.

Foster, D. J., Krishnamurthy, A., and Luo, H. Model selection for contextual bandits. In NeurIPS, pp. 14714–14725, 2019.

Foster, D. J., Gentile, C., Mohri, M., and Zimmert, J. Adapting to misspecification in contextual bandits. In NeurIPS, 2020.

Ghosh, A., Sankararaman, A., and Ramchandran, K. Problem-complexity adaptive model selection for stochastic linear bandits. CoRR, abs/2006.02612, 2020.

Goldberg, K. Y., Roeder, T., Gupta, D., and Perkins, C. Eigentaste: A constant time collaborative filtering algorithm. Inf. Retr., 4(2):133–151, 2001.

Hao, B., Lattimore, T., and Szepesvári, C. Adaptive exploration in linear contextual bandit. In AISTATS, volume 108 of Proceedings of Machine Learning Research, pp. 3536–3545. PMLR, 2020.

Hoorfar, A. and Hassani, M. Inequalities on the Lambert W function and hyperpower function. J. Inequal. Pure and Appl. Math, 9(2):5–9, 2008.

Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In COLT, volume 125 of Proceedings of Machine Learning Research, pp. 2137–2143. PMLR, 2020.

Lattimore, T. and Szepesvári, C. The end of optimism? An asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pp. 728–737. PMLR, 2017.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Lattimore, T., Szepesvári, C., and Weisz, G. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning, pp. 5662–5670. PMLR, 2020.

Lee, J. N., Pacchiano, A., Muthukumar, V., Kong, W., and Brunskill, E. Online model selection for reinforcement learning with function approximation, 2020.

Maillard, O. and Munos, R. Adaptive bandits: Towards the best history-dependent strategy. In AISTATS, volume 15 of JMLR Proceedings, pp. 570–578. JMLR.org, 2011.

Neu, G. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In NeurIPS, pp. 3168–3176, 2015.

Pacchiano, A., Dann, C., Gentile, C., and Bartlett, P. Regret bound balancing and elimination for model selection in bandits and RL, 2020a.

Pacchiano, A., Phan, M., Abbasi-Yadkori, Y., Rao, A., Zimmert, J., Lattimore, T., and Szepesvári, C. Model selection in contextual stochastic bandit problems. In NeurIPS, 2020b.

Reeve, H. W. J., Mellor, J., and Brown, G. The k-nearest neighbour UCB algorithm for multi-armed bandits with covariates. In ALT, volume 83 of Proceedings of Machine Learning Research, pp. 725–752. PMLR, 2018.

Rigollet, P. and Zeevi, A. Nonparametric bandits with covariates. arXiv preprint arXiv:1003.1630, 2010.

Tirinzoni, A., Pirotta, M., Restelli, M., and Lazaric, A. An asymptotically optimal primal-dual incremental algorithm for contextual linear bandits. Advances in Neural Information Processing Systems, 33, 2020.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 12(4):389–434, 2012.

Wu, W., Yang, J., and Shen, C. Stochastic linear contextual bandits with diverse contexts. In AISTATS, volume 108 of Proceedings of Machine Learning Research, pp. 2392–2401. PMLR, 2020.