# Regret Minimization via Saddle Point Optimization

Johannes Kirschner, Department of Computer Science, University of Alberta, jkirschn@ualberta.ca
Seyed Alireza Bakhtiari, Department of Computer Science, University of Alberta, sbakhtia@ualberta.ca
Kushagra Chandak, Department of Computer Science, University of Alberta, kchandak@ualberta.ca
Volodymyr Tkachuk, Department of Computer Science, University of Alberta, vtkachuk@ualberta.ca
Csaba Szepesvári, Department of Computer Science, University of Alberta, szepesva@ualberta.ca

A long line of works characterizes the sample complexity of regret minimization in sequential decision-making by min-max programs. In the corresponding saddle-point game, the min-player optimizes the sampling distribution against an adversarial max-player that chooses confusing models leading to large regret. The most recent instantiation of this idea is the decision-estimation coefficient (DEC), which was shown to provide nearly tight lower and upper bounds on the worst-case expected regret in structured bandits and reinforcement learning. By reparametrizing the offset DEC with the confidence radius and solving the corresponding min-max program, we derive an anytime variant of the Estimation-To-Decisions algorithm (ANYTIME-E2D). Importantly, the algorithm optimizes the exploration-exploitation trade-off online instead of via the analysis. Our formulation leads to a practical algorithm for finite model classes and linear feedback models. We further point out connections to the information ratio, decoupling coefficient and PAC-DEC, and numerically evaluate the performance of E2D on simple examples.

1 Introduction

Regret minimization is a widely studied objective in bandit and reinforcement learning theory [Lattimore and Szepesvári, 2020a] that has inspired practical algorithms, for example, in noisy zero-order optimization [e.g., Srinivas et al., 2010] and deep reinforcement learning [e.g., Osband et al., 2016]. Cumulative regret measures the online performance of the algorithm by the total loss suffered due to choosing suboptimal decisions. Regret is unavoidable to a certain extent, as the learner must collect information to reduce uncertainty about the environment. In other words, a learner inevitably faces the exploration-exploitation trade-off, where it must balance collecting rewards and collecting information. Finding the right balance is the central challenge of sequential decision-making under uncertainty.

More formally, denote by $\Pi$ a decision space and by $\mathcal{O}$ an observation space. Let $\mathcal{M}$ be a class of models, where each $f = (r_f, M_f) \in \mathcal{M}$ is associated with a reward function $r_f : \Pi \to \mathbb{R}$ and an observation map $M_f : \Pi \to \mathcal{P}(\mathcal{O})$, where $\mathcal{P}(\mathcal{O})$ is the set of all probability distributions over $\mathcal{O}$.¹ The learner's objective is to collect as much reward as possible in $n$ steps when facing a model $f^* \in \mathcal{M}$. The learner's prior information is $\mathcal{M}$ and the associated reward and observation maps, but it does not know the true instance $f^* \in \mathcal{M}$. The learner constructs a stochastic sequence $\pi_1, \dots, \pi_n$ of decisions taking values in $\Pi$ and adapted to the history of observations $y_t \sim M_{f^*}(\pi_t)$. The policy of the learner is the sequence of probability kernels $\mu_{1:n} = (\mu_t)_{t=1}^n$ that are used to take decisions. A minimal sketch of this interaction protocol, instantiated for a simple Gaussian model, is given below.
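The sketch below is illustrative only: the Gaussian bandit model, the uniform policy, and all names (`GaussianBanditModel`, `run`) are assumptions made for this example and are not part of the paper.

```python
# Minimal sketch of the decision-making protocol described above (illustrative only).
# A model f = (r_f, M_f) pairs a reward function with an observation distribution;
# here M_f(pi) is a unit-variance Gaussian centered at r_f(pi).
import numpy as np

class GaussianBanditModel:
    """A model f: rewards r_f(pi), observations M_f(pi) = N(r_f(pi), 1)."""
    def __init__(self, mean_rewards):
        self.mean_rewards = np.asarray(mean_rewards, dtype=float)

    def reward(self, pi):           # r_f(pi)
        return self.mean_rewards[pi]

    def observe(self, pi, rng):     # sample y ~ M_f(pi)
        return rng.normal(self.mean_rewards[pi], 1.0)

def run(policy, true_model, n, rng):
    """Interact for n steps; `policy` maps the history to a distribution over decisions."""
    history, expected_regret = [], 0.0
    best = true_model.mean_rewards.max()
    for t in range(n):
        mu_t = policy(history)                  # sampling distribution mu_t in P(Pi)
        pi_t = rng.choice(len(mu_t), p=mu_t)    # decision pi_t ~ mu_t
        y_t = true_model.observe(pi_t, rng)     # observation y_t ~ M_{f*}(pi_t)
        history.append((pi_t, y_t))
        expected_regret += best - true_model.reward(pi_t)
    return expected_regret

# Example: uniform exploration on a 3-armed Gaussian bandit.
rng = np.random.default_rng(0)
f_star = GaussianBanditModel([0.2, 0.5, 0.1])
uniform = lambda history: np.ones(3) / 3
print(run(uniform, f_star, n=100, rng=rng))
```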
The expected regret of a policy $\mu_{1:n}$ and model $f$ after $n \in \mathbb{N}$ steps is
\[
R_n(\mu_{1:n}, f) = \max_{\pi \in \Pi} \mathbb{E}\Big[\sum_{t=1}^{n} r_f(\pi) - r_f(\pi_t)\Big].
\]
The literature studies regret minimization for various objectives, including worst-case and instance-dependent frequentist regret [Lattimore and Szepesvári, 2020a], Bayesian regret [Russo and Van Roy, 2014] and robust variants [Garcelon et al., 2020, Kirschner et al., 2020a]. For the frequentist analysis, all prior knowledge is encoded in the model class $\mathcal{M}$. The worst-case regret of a policy $\mu_{1:n}$ on $\mathcal{M}$ is $\sup_{f \in \mathcal{M}} R_n(\mu_{1:n}, f)$, and therefore the optimal minimax regret $\inf_{\mu} \sup_{f \in \mathcal{M}} R_n(\mu_{1:n}, f)$ only depends on $\mathcal{M}$ and the horizon $n$. The Bayesian setting, in addition, assumes access to a prior $\nu \in \mathcal{P}(\mathcal{M})$, which leads to the Bayesian regret $\mathbb{E}_{f \sim \nu}[R_n(\mu_{1:n}, f)]$. Interestingly, the worst-case frequentist regret and the Bayesian regret are dual in the following sense [Lattimore and Szepesvári, 2019]:²
\[
\inf_{\mu_{1:n}} \sup_{f \in \mathcal{M}} R_n(\mu_{1:n}, f) = \sup_{\nu \in \mathcal{P}(\mathcal{M})} \inf_{\mu_{1:n}} \mathbb{E}_{f \sim \nu}[R_n(\mu_{1:n}, f)] \quad (1)
\]
Unfortunately, directly solving for the minimax policy (or the worst-case prior) is intractable, except in superficially simple problems. This is because the optimization is over the exponentially large space of adaptive policies. However, the relationship in Eq. (1) has been directly exploited in prior works, for example, to derive non-constructive upper bounds on the worst-case regret via a Bayesian analysis [Bubeck et al., 2015]. Moreover, it can be seen as the inspiration underlying optimization-based algorithms for regret minimization: the crucial step is to carefully relax the saddle-point problem in a way that preserves the statistical complexity but can be analyzed and computed more easily. This idea manifests in several closely related algorithms, including information-directed sampling [Russo and Van Roy, 2014, Kirschner and Krause, 2018], exploration-by-optimization [Lattimore and Szepesvári, 2020b, Lattimore and Gyorgy, 2021], and most recently, the Estimation-To-Decisions (E2D) framework [Foster et al., 2021, 2023]. These algorithms have in common that they optimize the information trade-off directly, which in structured settings leads to large improvements compared to standard optimistic exploration approaches and Thompson sampling. On the other hand, algorithms that directly optimize the information trade-off can be computationally more demanding and, consequently, are often not the first choice of practitioners. This is partly due to the literature primarily focusing on statistical aspects, leaving computational and practical considerations underexplored.

Contributions. Building on the results by Foster et al. [2021], we introduce the average-constrained decision-estimation coefficient ($\mathrm{dec}^{\mathrm{ac}}_{\epsilon}$), a saddle-point objective that characterizes the frequentist worst-case regret in sequential decision-making with structured observations. Compared to the decision-estimation coefficient of Foster et al. [2021], the $\mathrm{dec}^{\mathrm{ac}}_{\epsilon}$ is parametrized via the confidence radius $\epsilon$, instead of the Lagrangian offset multiplier. This allows the algorithm to optimize the information trade-off online, instead of via the derived regret upper bound. Moreover, optimizing the $\mathrm{dec}^{\mathrm{ac}}_{\epsilon}$ leads to an anytime version of the E2D algorithm (ANYTIME-E2D) with a straightforward analysis. We also point out relations between the $\mathrm{dec}^{\mathrm{ac}}_{\epsilon}$, the information ratio [Russo and Van Roy, 2016], the decoupling coefficient [Zhang, 2022] and a PAC version of the DEC [Foster et al., 2023].
We further detail how to implement the algorithm for finite model classes and linear feedback models, and demonstrate the advantage of the approach by providing improved bounds for linear bandits with sideobservations. Lastly, we report the first empirical results of the E2D algorithm on simple examples. 1.1 Related Work There is a broad literature on regret minimization in bandits [Lattimore and Szepesvári, 2020a] and reinforcement learning [Jin et al., 2018, Azar et al., 2017, Zhou et al., 2021, Du et al., 2021, 1To simplify the presentation, we ignore tedious measure-theoretic details in this paper. The reader could either fill out the missing details, or just assume that all sets, unless otherwise stated, are discrete. 2The result by Lattimore and Szepesvári [2019] was only shown for finite action, reward and observation spaces, but can likely be extended to the infinite case under suitable continuity assumptions. Zanette et al., 2020]. Arguably the most popular approaches are based on optimism, leading to the widely analysed upper confidence bound (UCB) algorithms [Lattimore and Szepesvári, 2020a], and Thompson sampling (TS) [Thompson, 1933, Russo and Van Roy, 2016]. A long line of work approaches regret minimization as a saddle point problem. Degenne et al. [2020b] showed that in the structured bandit setting, an algorithm based on solving a saddle point equation achieves asymptotically optimal regret bounds, while explicitly controlling the finite-order terms. Lattimore and Szepesvári [2020b] propose an algorithm based on exponential weights in the partial monitoring setting [Rustichini, 1999] that finds a distribution for exploration by solving a saddle-point problem. The saddle-point problem balances the trade-off between the exponential weights distribution and an information or stability term. The same approach was further refined by Lattimore and Gyorgy [2021]. In stochastic linear bandits, Kirschner et al. [2021] demonstrated that information-directed sampling can be understood as a primal-dual method solving the asymptotic lower bound, which leads to an algorithm that is both worst-case and asymptotically optimal. The saddle-point approach has been further explored in the PAC setting [e.g., Degenne et al., 2020b,a]. Our work is closely related to recent work by Foster et al. [2021, 2023]. They consider decision making with structured observations (DMSO), which generalizes the bandit and RL setting. They introduce a complexity measure, the offset decision-estimation coefficient (offset DEC), defined as a min-max game between a learner and an environment, and provide lower bounds in terms of the offset DEC. Further, they provide an algorithm, Estimation-to-Decisions (E2D) with corresponding worst-case upper bounds in terms of the offset DEC. Notably, the lower and upper bound nearly match and recover many known results in bandits and RL. More recently, Foster et al. [2023] refined the previous bounds by introducing the constrained DEC and a corresponding algorithm E2D+. There are various other results related to the DEC and the E2D algorithm. Foster et al. [2022a] show that the E2D achieves improved bounds in model-free RL when combined with optimistic estimation (as introduced by Zhang [2022]). Chen et al. [2022] introduced two new complexity measures based on the DEC that are necessary and sufficient for reward-free learning and PAC learning. They also introduced new algorithms based on the E2D algorithm for the above two settings and various other improvements. Foster et al. 
[2022b] have shown that the DEC is necessary and sufficient to obtain low regret for adversarial decision-making. An asymptotically instance-optimal algorithm for DMSO has been proposed by Dong and Ma [2022], extending a similar approach for the linear bandit setting [Lattimore and Szepesvari, 2017]. The decision-estimation coefficient is also related to the information ratio [Russo and Van Roy, 2014] and the decoupling coefficient [Zhang, 2022]. The information ratio has been studied under both the Bayesian [Russo and Van Roy, 2014] and the frequentist regret [Kirschner and Krause, 2018, Kirschner et al., 2020b, 2021, 2023] in various settings including bandits, reinforcement learning, and partial monitoring. The decoupling coefficient was studied for the Thompson sampling algorithm in contextual bandits [Zhang, 2022], and RL [Dann et al., 2021, Agarwal and Zhang, 2022]. We consider the sequential decision-making problem already introduced in the preface. Recall that Π is a compact decision space and O is an observation space. The model class M is a set of tuples f = (rf, Mf) containing a reward function rf : Π R and an observation distribution Mf : Π P(O). We define the gap function (π, g) = rg(π g) rg(π) , where π g = arg maxπ Π rg(π) is an optimal decision for model g, chosen arbitrarily if not unique. A randomized policy is a sequence of kernels µ1:n = (µt)n t=1 from histories ht 1 = (π1, y1, . . . , πt 1, yt 1) (Π O)t 1 to sampling distributions P(Π). The filtration generated by the history ht is Ft. The learner s decisions π1, . . . , πn are sampled from the policy πt µt and observations yt Mf (πt) are generated by an unknown true model f M. The expected regret under model f is formally defined as follows: Rn(µ1:n, f ) = E t=1 Eπt µt(ht)[ (πt, f )] For now, we do not make any assumption about the reward being observed. This provides additional flexibility to model a wide range of scenarios, including for example, duelling and ranking feedback [Yue and Joachims, 2009, Radlinski et al., 2008, Combes et al., 2015, Lattimore et al., 2018, Kirschner and Krause, 2021] (e.g. used in reinforcement learning with human feedback, RLHF) or dynamic pricing [den Boer, 2015]. The setting is more widely known as partial monitoring Rustichini [1999]. The special case where the reward is part of the observation distribution is called decision-making with structured observations [DMSO, Foster et al., 2021]. Earlier work studies the closely related structured bandit setting [Combes et al., 2017]. A variety of examples across bandit models and reinforcement learning are discussed in [Combes et al., 2017, Foster et al., 2021, 2023, Kirschner et al., 2023]. For the purpose of this paper, we focus on simple cases for which we can provide tractable implementations. Besides the finite setting where M can be enumerated, these are the following linearly parametrized feedback models. Example 2.1 (Linear Bandits, Abe and Long [1999]). The model class is identified with a subset of Rd and features ϕπ Rd for each π Π. The reward function is rf(π) = ϕ(π), f and the observation distribution is Mf(π) = N( ϕπ, f , 1). The linear bandit setting can be generalized by separating reward and feedback maps, which leads to the linear partial monitoring framework [Lin et al., 2014, Kirschner et al., 2020b]. Here we restrict our attention to the special case of linear bandits with side-observations [c.f. 
Kirschner et al., 2023], which, for example, generalizes the classical semi-bandit setting Mannor and Shamir [2011] Example 2.2 (Linear Bandits with Side-Observations). As in the linear bandit setting, we have M Rd, and features ϕπ Rd that define the reward functions rf(π) = ϕπ, f . Observation matrices Mπ Rmπ d for each π Π define mπ-dimensional observation distributions Mf(π) = N(Mπf, σ21mπ). In addition, we assume that ϕπϕ π M π Mπ, which is automatically satisfied if ϕ π is included in the rows of Mπ, i.e. when the reward is part of the observations. 3 Regret Minimization via Saddle-Point Optimization The goal of the learner is to choose decisions π Π that achieve a small gap (π, f ) under the true model f M. Since the true model is unknown, the learner has to collect data that provides statistical evidence to reject models g = f for which the regret (π, g) is large. To quantify the information-regret trade-off, we use a divergence D( ) defined for distributions in P(O). For a reference model f, the information (or divergence) function is defined by: If(π, g) = DKL(Mg(π) Mf(π)) , where DKL( ) is the KL divergence. Intuitively, If(π, g) is the rate at which the learner collects statistical information to reject g M when choosing π Π and data is generated under the reference model f. Note that If(π, f) = 0 for all f M and π Π. As we will see shortly, the regret-information trade-off can be written precisely as a combination of the gap function, , and the information function, If. We remark in passing that other choices such as the Hellinger distance are also possible, and the KL divergence is mostly for concreteness and practical reasons. To simplify the notation and emphasize the bilinear nature of the saddle point problem that we study, we will view , If RΠ M + as |Π| |M| matrices (by fixing a canonical ordering on Π and M). For vectors µ RΠ and ν RM, we will frequently write bilinear forms µ fν and µIfν. This also means that by convention, µ will always denote a row vector, while ν will always denote a column vector. The standard basis for RΠ and RM is (eπ)π Π and (eg)g M. 3.1 The Decision-Estimation Coefficient To motivate our approach, we recall the decision-estimation coefficient (DEC) introduced by Foster et al. [2021, 2023], before introducing the main quantity of interest, the average-constrained DEC. First, the offset decision-estimation coefficient (without localization) [Foster et al., 2021] is deco λ(f) = min µ P(Π) max g M µ eg λµIfeg The tuning parameter λ > 0 controls the weight of the information matrix relative to the gaps: Viewing the above as a two-player zero-sum game, we see that increasing λ forces the max-player to avoid models that differ significantly from f under the min-player s sampling distribution. The advantage of this formulation is that the information term µIfeg can be telescoped in the analysis, which directly leads to regret bounds in terms of the estimation error (introduced below in Eq. (8)). The disadvantage of the λ-parametrization is that the trade-off parameter is chosen by optimizing the final regret upper bound. This is inconvenient because the optimal choice requires knowledge of the horizon and a bound on maxf M deco λ(f). Moreover, any choice informed by the upper bound may be conservative, leading to sub-optimal performance. The constrained decision-estimation coefficient [Foster et al., 2023] is decc ϵ(f) = min µ P(Π) max g M µ eg s.t. 
µIfeg ϵ2 (2) In this formulation, the max player is restricted to choose models g that differ from f at most by ϵ2 in terms of the observed divergence under the min-player s sampling distribution. Note that because eπIfef = 0 for all eπ Π, there always exists a feasible solution. For horizon n, the radius can be set to ϵ2 βM n , where βM is a model estimation complexity parameter, thereby essentially eliminating the trade-off parameter from the algorithm. However, because of the hard constraint, strong duality of the Lagrangian saddle point problem (for fixed µ) fails, and consequently, telescoping the information gain in the analysis is no longer easily possible (or at least, with the existing analysis). To achieve sample complexity decc ϵ(f), Foster et al. [2023] propose a sophisticated scheme that combines phased exploration with a refinement procedure (E2D+). As the main quantity of interest in the current work, we now introduce the average-constrained decision-estimation coefficient, defined as follows: decac ϵ (f) = min µ P(Π) max ν P(M) µ ν s.t. µIfν ϵ2 (3) Similar to the decc ϵ, the parameterization of the decac ϵ is via the confidence radius ϵ2, making the choice of the hyperparameter straightforward in many cases (more details in Section 3.2). By convexifying the domain P(M) of the max-player, we recover strong duality of the Lagrangian (for fixed µ). Thereby, the formulation inherits the ease of choosing the ϵ-parameter from the decc ϵ, while, at the same time, admitting a telescoping argument in the analysis and a much simpler algorithm. Specifically, Sion s theorem implies three equivalent Lagrangian representations for Eq. (3): decac ϵ (f) = min µ P(Π) max ν P(M) min λ 0 µ ν λ(µIfν ϵ2) (4) = min λ 0,µ P(Π) max ν P(M) µ ν λ(µIfν ϵ2) (5) = min λ 0 max ν P(M) min µ P(Π) µ ν λ(µIfν ϵ2) (6) When fixing the outer problem, strong duality holds for the inner saddle-point problem in each line, however, the joint program in Eq. (5) is not convex-concave. An immediate consequence of relaxing the domain of the max player and Eq. (5) is that decc ϵ(f) decac ϵ (f) = min λ 0{deco λ(f) + λϵ2} (7) The decac ϵ can therefore be understood as setting the λ parameter of the deco λ optimally for the given confidence radius ϵ2. On the other hand, the cost paid for relaxing the program is that there exist model classes M where the inequality in Eq. (7) is strict, and decac ϵ does not lead to a tight characterization of the regret [Foster et al., 2023, Proposition 4.4]. The remedy is that under a stronger regularity condition and localization, the two notions are essentially equivalent [Foster et al., 2023, Proposition 4.8]. 3.2 Anytime Estimation-To-Decisions (Anytime-E2D) Estimations-To-Decisions (E2D) is an algorithmic framework that directly leverages the decisionestimation coefficient for choosing a decision in each round. The key idea is to compute a sampling distribution µt P(Π) attaining the minimal DEC for an estimate ˆft of the underlying model, and then define the policy to sample πt µt. The E2D approach, using the decac ϵ formulation, is summarized in Algorithm 1. To compute the estimate ˆft, the E2D algorithm takes an abstract estimation oracle EST as input, that, given the collected data, returns ˆft M. The final guarantee Algorithm 1: ANYTIME-E2D Input : Hypothesis class M, estimation oracle EST, sequence ϵt 0, data D0 = 1 for t = 1, 2, 3, . . . 
do 2 Estimate ˆft = EST(Dt 1) 3 Compute gap and information matrices, and I ˆ ft RΠ M 4 µt = arg minµ P(Π) maxν P(M){µ ν : µI ˆ ftν ϵ2 t} 5 Sample πt µt and observe yt Mf (πt) 6 Append data Dt = Dt 1 {(πt, yt)} depends on the estimation error (or estimation regret), defined as the sum over divergences of the observation distributions under the estimate ˆft and the true model f : t=1 µt I ˆ ftef Intuitively, the estimation error is well-behaved if ˆft f , since µt If ef = 0. Equation (8) is closely related to the total information gain used in the literature on information-directed sampling [Russo and Van Roy, 2014] and kernel bandits [Srinivas et al., 2010]. To bound the estimation error, Foster et al. [2021] rely on online density estimation (also, online regression or online aggregation) [Cesa-Bianchi and Lugosi, 2006, Chapter 9]. For finite M, the default approach is the exponential weights algorithm (EWA), which we provide for reference in Appendix A. When using this algorithm, the estimation error always satisfies Estn log(|M|), see [Cesa-Bianchi and Lugosi, 2006, Proposition 3.1]. While these bounds extend to continuous model classes via standard covering arguments, the resulting algorithm is often not tractable without additional assumptions. For linear feedback models (Examples 2.1 and 2.2), one can rely on the more familiar ridge regression estimator, which, we show, achieves bounded estimation regret Estn O(d log(n)). For further discussion, see Appendix A.1. With this in mind, we state our main result. Theorem 1. Let λt 0 be any sequence adapted to the filtration Ft. Then the regret of ANYTIMEE2D (Algorithm 1) with input sequence λt satisfies for all n 1: Rn ess sup t [n] ( decac ϵt,λt( ˆft) t=1 ϵ2 t + Estn where we defined decac ϵ,λ(f) = minµ P(Π) maxν P(M) µ ν λ(µIfν ϵ2). As an immediate corollary, we obtain a regret bound for Algorithm 1 where the sampling distribution µt is chosen to optimize decac ϵt for any sequence ϵt. Corollary 1. The regret of ANYTIME-E2D (Algorithm 1) with input ϵt 0 satisfies for all n 1: Rn max t [n],f M decac ϵt (f) ϵ2 t t=1 ϵ2 t + Estn Importantly, the regret of Algorithm 1 is directly controlled by the worst-case DEC, maxf M decac ϵ (f), and the estimation error Estn. It remains to set ϵ2 t (respectively λt) appropriately. For a fixed horizon n, we let ϵ2 t = Estn n . With the reasonable assumption that maxf M ϵ 2decac ϵ (f) is non-decreasing in ϵ, Corollary 1 reads Rn 2n max f M Estn/n(f) . (9) This almost matches the lower bound Rn Ω(ndecc 1/ n(F))3 [Foster et al., 2023, Theorem 2.2], up to the estimation error and the beforehand mentioned gap between decc ϵ and decac ϵ . 3Here, decc ϵ(F) = maxf co(M) minµ P(Π) maxg M {f}{µ ν : µIfeg ϵ2}. Setting deco γ decac ϵ Multi-Armed Bandits |Π|/γ 2ϵ p |Π| Linear Bandits d/4γ ϵ Lipschitz Bandits 2γ 1 d+1 2 d+1 d+2 ϵ 2 d+2 Convex Bandits O(d4/γ) O(ϵd2) Table 1: Comparison of deco γ and decac ϵ for different settings. Bounds between deco γ and decac ϵ can be converted using Eq. (7). Refined bounds for linear bandits with side-observations are in Lemma 4. To get an anytime algorithm with essentially the same scaling as in Eq. (9), we set ϵ2 t = log(|M|)/t for finite model classes, and ϵ2 t = βM t if Estt βM log(t) for βM > 0. For linear bandits, decac ϵ ϵ d (see Section 3.3), and Estn d log(n). Choosing ϵ2 t = d/t recovers the optimal regret bound Rn O(d n) [Lattimore and Szepesvári, 2020a]. Alternatively, one can also choose λt by minimizing an upper bound on maxt [n],f M decac ϵt,λt(f)/ϵ2 t . 
For example, in linear bandits, decac ϵt,λ d 4λ + λϵ2 t (see Table 1); hence, for ϵ2 t = d/t, we can set λt = t/4. Further discussion and refined upper bound for linear feedback models are in Section 3.3. Proof of Theorem 1. Let µ t and ν t be a saddle-point solution to the offset dec, deco λt( ˆft) = min µ P(Π) max ν P(M) µ ν λtµI ˆ ftν Note that µ t ν t λtµ t Ifν t µ t ef λtµ t Ifef 0, which implies that λtϵ2 t decac ϵt,λt. Next, t=1 E h µt ef λt(µt I ˆ ftef ϵ2 t) + λt(µt I ˆ ftef ϵ2 t) i t=1 E max g M µt ˆ fteg λt(µt I ˆ fteg ϵ2 t) + λt(µt I ˆ ftef ϵ2 t) t=1 E min µ P(Π) max ν P(M) µ ν λt(µI ˆ ftν ϵ2 t) + λt(µt I ˆ ftef ϵ2 t) So far, we only introduced the saddle point problem by maximizing over f . The last equality is by our choice of λt and µt, and noting that ν P(M) can always be realized as a Dirac. Continuing, t=1 E h decac ϵt,λt( ˆft) + λt(µt I ˆ ftef ϵ2 t) i t=1 E decac ϵt,λt( ˆft) + 1 ϵ2 t decac ϵt,λt( ˆft)µt I ˆ ftef (ii) ess sup t [n] max f M ϵ2 t decac ϵt,λt(f) n X ϵ2 t + E h µt I ˆ ftef i We first drop the negative term in (i) and use the beforehand stated fact that λtϵ2 t decac ϵt,λt( ˆft). The last step, (ii), is taking the maximum out of the sum. 3.3 Certifying Upper Bounds As shown by Corollary 1, the regret of Algorithm 1 scales directly with the decac ϵ . For analysis purposes, it is however useful to compute upper bounds on the decac ϵ to verify the scaling w.r.t. parameters of interest. Via the equivalence Eq. (7), bounds on the deco λ directly translate to the decac ϵ (see Table 1). For a detailed discussion of upper bounds in various models, we refer to Foster et al. [2021]. Below, we highlight three connections that are directly facilitated by the decac ϵ . To this end, we first introduce a variant of the decac ϵ where the gap function depends on f: decac,f ϵ (f) = min µ P(Π) max ν P(M) µ fν s.t. µIfν ϵ2 , (10) where f(π, g) = rg(π g) rf(π). We remark that for distributions ν P(M) and µ P(Π), the gap f can be decoupled, µ fν = δfν + µ fef, where we defined δf(g) = rg(π g) rf(π f). The following assumption implies that the observations for a decision π are at least as informative as observing the rewards. Assumption 1 (Reward Data Processing). The rewards and information matrices are related via the following data-processing inequality that holds for any µ P(Π): |Eπ µ[rf(π) rg(π)]| q Eπ µ[D(Mf(π) Mg(π))] The next lemma shows that under Assumption 1, decac ϵ (f) and decac,f ϵ (f) are essentially equivalent, at least for the typical worst-case bounds where maxf M decac ϵ (f) Ω(ϵ). Lemma 1. If Assumption 1 holds, then decac,f ϵ (f) ϵ decac ϵ (f) decac,f ϵ (f) + ϵ The proof is in Appendix C.1. We remark that Algorithm 1 where the sampling distribution is computed for decac,f ϵ ( ˆft) and f achieves a bound analogous to Theorem 1, as long as Assumption 1 holds. For details see Lemma 8 in Appendix C. Upper Bounds via Decoupling First, we introduce the information ratio, Ψf(µ, ν) = (µ fν)2 The definition is closely related to the Bayesian information ratio [Russo and Van Roy, 2016], where ν takes the role of a prior over M. The Thompson sampling distribution is µTS ν = P h M νheπ h. The decoupling coefficient, dc(f), [Zhang, 2022, Definition 1] is defined as the smallest number K 0, such that for all distributions ν P(M), µTS ν fν inf η 0 g,h M νgνheπ h(rg rf)2 + K g,h M νgνheπ h(rg rf)2 (11) The next lemma provides upper bounds on the decac ϵ (f) in terms of the information ratio, which is further upper-bounded by the decoupling coefficient. Lemma 2. 
With Ψ(f) = maxν M minµ P(Π) Ψf(µ, ν) and Assumption 1 satisfied, we have decac,f ϵ (f) ϵ p The proof follows directly using the AM-GM inequality, see Appendix C.2. By [Zhang, 2022, Lemma 2], this further implies decac,f ϵ ϵ d. An analogous result for the generalized information ratio [Lattimore and Gyorgy, 2021] that recovers rates ϵρ for ρ 1 is given in Appendix C.4. PAC to Regret Another useful way to upper bound the decac,f ϵ is via an analogous definition for the PAC setting [c.f. Eq. (10), Foster et al., 2023]: pac-decac,f ϵ (f) = min µ P(Π) max ν M δfν s.t. µIfν ϵ2 (12) Lemma 3. Under Assumption 1, decac,f ϵ (f) min p [0,1] n pac-decac,f ϵp 1/2(f) + p max o The proof is given in Appendix B.1. Lemma 3 combined with Theorem 1 leads to O(n2/3) upper bounds on the regret that are reminiscent of so-called globally observable games in linear partial monitoring [Kirschner et al., 2023]. Application to Linear Feedback Models To illustrate the techniques introduced, we compute a regret bound for Algorithm 1 for linear bandits with side-observations (Examples 2.1 and 2.2). Lemma 4. For linear bandits with side-observations and divergence If(π, g) = Mπ(g f) 2, pac-decac,f ϵ (f) min µ P(Π) max b Π ϵ ϕb V (µ) 1 ϵ where V (µ) = P π Π µπMπM π . Moreover, denoting Ω= minµ P(Π) maxb Π ϕb V (µ) 1, decac,f ϵ (f) min ϵ p Ψ(f), 2ϵ2/3Ω1/3 1/3 max The proof is given in Appendix B.2. While in the worst-case for linear bandits, there is no improvement over the standard O(d n) without further refinement or specification of the upper bounds, in the case of linear side-observations there is an improvement whenever Ω maxf M Ψ(f). To exemplify the improvement, consider a semi-bandit with a revealing action ˆπ, e.g. Mˆπ = 1d. Here, the regret bound improves to Rn min{d n, d1/3n2/3}, since then pac-decac,f ϵ (f) ϵ. The corresponding improvement in the regime n d4 might seem modest, but is relevant in high-dimensional and non-parametric models. Moreover, in (deep) reinforcement learning, high-dimensional models are commonly used and the learner obtains side information in the form of state observations. Therefore, it is plausible that the n2/3 rate is dominant even for a moderate horizon. Exploring this effect in reinforcement learning is therefore an important direction for future work. Notably, this improvement is not observed by upper confidence bound algorithms and Thompson sampling, because both approaches discard informative but suboptimal actions early on [c.f. Lattimore and Szepesvari, 2017], including the action ˆπ in the example above. E2D for a constant offset parameter λ > 0, in principle, attains the better rate, but only if one pre-commits to a fixed horizon. Lastly, we note that a similar effect was observed for information-directed sampling in sparse high-dimensional linear bandits [Hao et al., 2020]. 3.4 Computational Aspects For finite model classes, Algorithm 1 can be readily implemented. Since almost no structure is imposed on the gap and information matrices of size |Π| |M|, avoiding scaling with |Π| |M| seems hardly possible without introducing additional assumptions. Even in the finite case, solving Eq. (3) is not immediate because the corresponding Lagrangian is not convex-concave. A practical approach is to solve the inner saddle point for Eq. (5) as a function of λ. Strong duality holds for the inner problem, and one can obtain a solution efficiently by solving the corresponding linear program using standard solvers. It then remains to optimize over λ 0. 
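To make this concrete, the following is a minimal sketch (not the authors' code) of the inner problem for a finite model class and a fixed multiplier $\lambda \ge 0$: the matrix game $\min_{\mu} \max_{\nu} \mu^\top(\Delta - \lambda I_f)\nu + \lambda\epsilon^2$ reduces to a small linear program over $(t, \mu)$. The function name `inner_dec_value` and the use of scipy's HiGHS solver are illustrative assumptions.

```python
# Sketch: inner saddle point of Eq. (5) for a FIXED lambda and a finite model class.
# Delta and I_f are |Pi| x |M| matrices (gaps and divergences w.r.t. the reference model).
import numpy as np
from scipy.optimize import linprog

def inner_dec_value(Delta, I_f, lam, eps):
    """Return (value, mu) of  min_mu max_g [mu^T (Delta - lam*I_f)]_g + lam*eps^2."""
    A = Delta - lam * I_f                       # payoff matrix of the zero-sum game
    n_pi, n_models = A.shape
    # Variables: x = (t, mu_1, ..., mu_{n_pi}); objective: minimize t.
    c = np.concatenate(([1.0], np.zeros(n_pi)))
    # For every model g:  sum_pi mu_pi * A[pi, g] - t <= 0.
    A_ub = np.hstack([-np.ones((n_models, 1)), A.T])
    b_ub = np.zeros(n_models)
    # mu must be a probability distribution over decisions.
    A_eq = np.concatenate(([0.0], np.ones(n_pi))).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(None, None)] + [(0.0, 1.0)] * n_pi
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.fun + lam * eps**2, res.x[1:]
# The remaining one-dimensional minimization over lam is discussed in the text below.
```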
This can be done, for example, via a grid search over the range [0, maxf M ϵ 2decac ϵ (f)]. In the linear setting, the above is not satisfactory because most commonly M is identified with parameters in Rd. As noted before, ridge regression can be used instead of online aggregation while preserving the optimal scaling of the estimation error (see Appendix A.1). The next lemma further shows that the saddle point problem Eq. (3) can be rewritten to only scale with the size of the decision set |Π|. Lemma 5. Consider linear bandits with side observations, M = Rd and quadratic divergence, If(π, g) = Mπ(g f) 2, and denote ϕµ = P π Π µπϕπ and V (µ) = P π Π µπM π Mπ. Then decac,f ϵ ( ˆft) = min λ 0 min µ P(Π) max b Π ϕb ϕµ, ˆft + 1 4λ ϕb 2 V (µ) 1 + λϵ2 Moreover, the objective is convex in µ P(Π). The proof is a straightforward calculation provided in Appendix C.5. Note that the saddle point expression is analogous to Eq. (5), and in fact, one can linearize the inner maximization over P(Π), such that the inner saddle point becomes convex-concave. This leads to expressions equivalent to Eqs. (4) and (6), albeit the objective is no longer linear in µ P(Π). We use Lemma 5 to employ the same strategy as before: As a function of λ 0, solve the inner problem of the expression in Lemma 5, for example, as a convex program with |Π| variables and |Π| constraints (Appendix D). Then all that remains is to solve a one-dimensional optimization problem over λ [0, maxf M ϵ 2decac ϵ (f)]. We demonstrate this approach in Appendix E to showcase the performance of E2D on simple examples. 4 Conclusion We introduced ANYTIME-E2D, an algorithm based on the estimation-to-decisions framework for sequential decision-making with structured observations. The algorithm optimizes the averageconstrained decision-making coefficient, which can be understood as a reparametrization of the corresponding offset version. The reparametrization facilitates an elegant anytime analysis and makes setting all remaining hyperparameters immediate. We demonstrate the improvement with a novel bound for linear bandits with side-observations, that is not attained by previous approaches. Lastly, we discuss how the algorithm can be implemented for finite and linear model classes. Nevertheless, much remains to be done. For example, one can expect the reference model to change very little from round to round, and therefore, it seems wasteful to solve Eq. (3) from scratch repetitively. Preferable instead would be an incremental scheme that iteratively computes updates to the sampling distribution. Acknowledgments and Disclosure of Funding Johannes Kirschner gratefully acknowledges funding from the SNSF Early Postdoc.Mobility fellowship P2EZP2_199781. Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011. N. Abe and P. M. Long. Associative reinforcement learning using linear probabilistic concepts. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML 99, pages 3 11, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. ISBN 1-55860-612-2. A. Agarwal and T. Zhang. Model-based rl with optimistic posterior sampling: Structural conditions and sample complexity. ar Xiv preprint ar Xiv:2206.07659, 2022. M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263 272. PMLR, 2017. S. Bubeck, O. Dekel, T. Koren, and Y. Peres. 
Bandit convex optimization:\sqrtt regret in one dimension. In Conference on Learning Theory, pages 266 278. PMLR, 2015. N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge university press, 2006. F. Chen, S. Mei, and Y. Bai. Unified algorithms for rl with decision-estimation coefficients: No-regret, pac, and reward-free learning. ar Xiv preprint ar Xiv:2209.11745, 2022. R. Combes, S. Magureanu, A. Proutiere, and C. Laroche. Learning to rank: Regret lower bounds and efficient algorithms. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 231 244. ACM, 2015. ISBN 978-14503-3486-0. R. Combes, S. Magureanu, and A. Proutiere. Minimal exploration in structured stochastic bandits. Advances in Neural Information Processing Systems, 30, 2017. C. Dann, M. Mohri, T. Zhang, and J. Zimmert. A provably efficient model-free posterior sampling method for episodic reinforcement learning. Advances in Neural Information Processing Systems, 34:12040 12051, 2021. R. Degenne, P. Ménard, X. Shang, and M. Valko. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, pages 2432 2442. PMLR, 2020a. R. Degenne, H. Shao, and W. Koolen. Structure adaptive algorithms for stochastic bandits. In H. D. III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 2443 2452. PMLR, 13 18 Jul 2020b. URL https://proceedings.mlr.press/v119/degenne20b.html. A. V. den Boer. Dynamic pricing and learning: Historical origins, current research, and new directions. Surveys in Operations Research and Management Science, 20(1):1 18, 2015. K. Dong and T. Ma. Asymptotic instance-optimal algorithms for interactive decision making. ar Xiv preprint ar Xiv:2206.02326, 2022. S. Du, S. Kakade, J. Lee, S. Lovett, G. Mahajan, W. Sun, and R. Wang. Bilinear classes: A structural framework for provable generalization in rl. In International Conference on Machine Learning, pages 2826 2836. PMLR, 2021. J. C. Dunn and S. Harshbarger. Conditional gradient algorithms with open loop step size rules. Journal of Mathematical Analysis and Applications, 62(2):432 444, 1978. D. J. Foster, S. M. Kakade, J. Qian, and A. Rakhlin. The statistical complexity of interactive decision making. ar Xiv preprint ar Xiv:2112.13487, 2021. D. J. Foster, N. Golowich, J. Qian, A. Rakhlin, and A. Sekhari. A note on model-free reinforcement learning with the decision-estimation coefficient. ar Xiv preprint ar Xiv:2211.14250, 2022a. D. J. Foster, A. Rakhlin, A. Sekhari, and K. Sridharan. On the complexity of adversarial decision making. ar Xiv preprint ar Xiv:2206.13063, 2022b. D. J. Foster, N. Golowich, and Y. Han. Tight guarantees for interactive decision making with the decision-estimation coefficient. ar Xiv preprint ar Xiv:2301.08215, 2023. M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95 110, 1956. E. Garcelon, B. Roziere, L. Meunier, J. Tarbouriech, O. Teytaud, A. Lazaric, and M. Pirotta. Adversarial attacks on linear contextual bandits. Advances in Neural Information Processing Systems, 33:14362 14373, 2020. A. György, D. Pál, and C. Szepesvári. Online learning: Algorithms for big data. Lecture Notes, 2013. B. Hao, T. Lattimore, and M. Wang. High-dimensional sparse linear bandits. Advances in Neural Information Processing Systems, 33:10753 10763, 2020. M. Jaggi. 
Revisiting frank-wolfe: Projection-free sparse convex optimization. In International conference on machine learning, pages 427 435. PMLR, 2013. C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018. J. Kirschner and A. Krause. Information directed sampling and bandits with heteroscedastic noise. In Conference On Learning Theory, pages 358 384. PMLR, 2018. J. Kirschner and A. Krause. Bias-robust bayesian optimization via dueling bandits. In International Conference on Machine Learning, pages 5595 5605. PMLR, 2021. J. Kirschner, I. Bogunovic, S. Jegelka, and A. Krause. Distributionally Robust Bayesian Optimization. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), August 2020a. URL http://proceedings.mlr.press/v108/kirschner20a/kirschner20a.pdf. J. Kirschner, T. Lattimore, and A. Krause. Information directed sampling for linear partial monitoring. In Conference on Learning Theory, pages 2328 2369. PMLR, 2020b. J. Kirschner, T. Lattimore, C. Vernade, and C. Szepesvári. Asymptotically optimal informationdirected sampling. In Conference on Learning Theory, pages 2777 2821. PMLR, 2021. J. Kirschner, T. Lattimore, and A. Krause. Linear partial monitoring for sequential decision making: Algorithms, regret bounds and applications. Journal of Machine Learning Research, August 2023. T. Lattimore and A. Gyorgy. Mirror descent and the information ratio. In Conference on Learning Theory, pages 2965 2992. PMLR, 2021. T. Lattimore and C. Szepesvari. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pages 728 737. PMLR, 2017. T. Lattimore and C. Szepesvári. An information-theoretic approach to minimax regret in partial monitoring. In Conference on Learning Theory, pages 2111 2139. PMLR, 2019. T. Lattimore and C. Szepesvári. Bandit algorithms. Cambridge University Press, 2020a. T. Lattimore and C. Szepesvári. Exploration by optimisation in partial monitoring. In Conference on Learning Theory, pages 2488 2515. PMLR, 2020b. T. Lattimore, B. Kveton, S. Li, and C. Szepesvári. Toprank: A practical algorithm for online stochastic ranking. In Advances in Neural Information Processing Systems, pages 3949 3958. Curran Associates, Inc., 2018. T. Lin, B. Abrahao, R. Kleinberg, J. Lui, and W. Chen. Combinatorial partial monitoring game with linear feedback and its applications. In International Conference on Machine Learning, pages 901 909, 2014. S. Mannor and O. Shamir. From bandits to experts: On the value of side-observations. Advances in Neural Information Processing Systems, 24, 2011. I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped dqn. Advances in neural information processing systems, 29, 2016. F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pages 784 791. ACM, 2008. D. Russo and B. Van Roy. Learning to optimize via information-directed sampling. Advances in Neural Information Processing Systems, 27, 2014. D. Russo and B. Van Roy. An information-theoretic analysis of thompson sampling. The Journal of Machine Learning Research, 17(1):2442 2471, 2016. A. Rustichini. Minimizing regret: The general case. Games and Economic Behavior, 29(1-2): 224 243, 1999. S. Shalev-Shwartz et al. Online learning and online convex optimization. 
Foundations and Trends in Machine Learning, 4(2):107 194, 2012. N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. International Conference on Machine Learning (ICML), 2010. W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285 294, 1933. Y. Yue and T. Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of the 26th International Conference on Machine Learning, pages 1201 1208. ACM, 2009. A. Zanette, A. Lazaric, M. Kochenderfer, and E. Brunskill. Learning near optimal policies with low inherent bellman error. In International Conference on Machine Learning, pages 10978 10989. PMLR, 2020. T. Zhang. Feel-good thompson sampling for contextual bandits and reinforcement learning. SIAM Journal on Mathematics of Data Science, 4(2):834 857, 2022. D. Zhou, Q. Gu, and C. Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory, pages 4532 4576. PMLR, 2021. Algorithm 2: Exponential Weights Algorithm (EWA) for Density Estimation Input :Finite model class M, data Dt = {(y1, π1), . . . , (yt, πt)}, Learning rate η > 0 1 Define L(f) = Pt s=1 log Mf(ys|πs) 2 Let p(f) exp( ηL(f)) 3 For convex M: Return P 4 Else: Return f p( ). A Online Density Estimation For any f M and π Π, we denote by p( |π, f) the the density function of the observation distribution Mf(π) w.r.t. a reference measure over the observation space O. Consider a finite model class M and the KL divergence, eπIfeg = Ey Mg(π) log p(y|π, g) In this case, the estimation error can be written as follows: t=1 eπt I ˆ ftef t=1 log(p(yt|πt, f )/p(yt|πt, ˆft) p(yt|πt, ˆft) t=1 log 1 p(yt|πt, f ) The last line can be understood as the estimation regret of the estimates ˆf1, . . . , ˆfn under the logarithmic loss. A classical approach to control this term is the exponential weights algorithm (EWA) given in Algorithm 2. For the EWA algorithm, we have the following bound. Lemma 6 (EWA for Online Density Estimation). For any data stream {y1, π1, . . . , yn, πn} the predictions ˆf1, . . . ˆfn obtained via Algorithm 2 with η = 1 satisfy p(yt|πt, ˆft) t=1 log 1 p(yt|πt, g) log(|M|) (14) For a proof, see [Cesa-Bianchi and Lugosi, 2006, Proposition 3.1]. A.1 Bounding the Estimation Error of Projected Regularized Least-Squares In this section, we consider the linear model from Example 2.2. We denote by the Euclidean norm. For simplicity, the observation maps Mπ Rm d are assumed to have the same output dimension m N. The observation distribution is such that yt = Mπtf + ξt, where ξ Rm is random noise such that Et[ξ] = 0 and Et ξ 2 σ2. Here, Et[ ] = E[ |π1, y1, . . . , πt 1, yt 1, πt] is the conditional observation in round t including the decision πt chosen in round t. We will use the quadratic divergence4, eπIfeg = 1 2 Mπ(g f) 2 This choice corresponds to the Gaussian KL, but we do not require that the noise distribution is Gaussian is the following. In the linear bandit model, this choice reduces to eπIfeg = 1 2 ϕπ, g f 2. Let K Rd be a closed convex set. Our goal is to control the estimation regret for the projected regularized least-squares estimator, ˆft = arg min f K s=1 Mπsf ys 2 + f 2 V0 = Proj Vt where V0 is a positive definite matrix, Vt = Pt 1 s=1 M πs Mπs + V0 and Proj Vt( ) is the orthogonal projection w.r.t. the Vt norm. 
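A minimal numpy sketch of this estimator in the unconstrained case $K = \mathbb{R}^d$ (so the projection step is omitted) is given below; the function names and the choice $V_0 = \eta \mathbf{1}_d$ are assumptions made for the example, not the authors' implementation.

```python
# Sketch of the regularized least-squares estimate above for K = R^d (no projection).
import numpy as np

def regularized_least_squares(observations, d, eta=1.0):
    """observations: list of (M_pi, y) pairs, M_pi of shape (m, d), y of shape (m,)."""
    V = eta * np.eye(d)               # V_t = sum_s M_{pi_s}^T M_{pi_s} + V_0
    b = np.zeros(d)
    for M_pi, y in observations:
        V += M_pi.T @ M_pi
        b += M_pi.T @ y
    f_hat = np.linalg.solve(V, b)     # minimizer of sum_s ||M_{pi_s} f - y_s||^2 + ||f||_{V_0}^2
    return f_hat, V

def estimation_error_term(M_pi, f_hat, f_star):
    """One summand of Est_n under the quadratic divergence 0.5 * ||M_pi (f_hat - f*)||^2."""
    return 0.5 * np.linalg.norm(M_pi @ (f_hat - f_star)) ** 2
```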
For K = Rd and V0 = η1d, this recovers the standard ridge 4We added a factor of 1 2 for convenience. regression. The projection is necessary to bound the magnitude of the squared loss, and the result will depend on an almost-surely bound on the observed diameter, max f,g K max π Π Mπ(f g) B Recall that our goal is to bound the estimation error, t=1 eπt I ˆ ftef 1 2 Mπt(f ˆft) 2 # We remark that one can get the following naive bound by applying Cauchy-Schwarz: t=1 Mπt(f ˆft) 2 t=1 Mπt 2 V 1 t f ˆft 2 Vt O(d2 log(n)2) (17) The last inequality follows from the elliptic potential lemma and standard concentration inequalities [Lattimore and Szepesvári, 2020a, Lemma 19.4 and Theorem 20.5]. However, this will lead to an additional d-factor in the regret that can be avoided, as we see next. For K = Rd, one-dimensional observations and noise bounded in the range [ B, B], one can also directly apply [Cesa-Bianchi and Lugosi, 2006, Theorem 11.7] to get Estn O( B2d log(n)), thereby improving the naive bound by a factor d log(n). This result is obtained in a more general setting, where no assumptions, other than boundedness, are placed on the observation sequence y1, . . . , yn. Here we refine and generalize this result in two directions: First, we allow for the more general feedback model in with multi-dimensional observations (Example 2.2). Second, we directly exploit the stochastic observation model to obtain a stronger result that does not require the observation noise to be bounded. Theorem 2. Consider the linear observation setting with additive noise and quadratic divergence eπIfeg = 1 2 Mπ(g f) 2, as described at the beginning of this section. Assume that maxf,g M,π Π Mπ(f g) B and E ξt 2 σ2. Then Estn (σ2 + B2)E log det Vn If in addition Mπ L and V0 = η1d, then Estn (σ2 + B2) log 1 + n L2 Remark 1. Note that by the [Lattimore and Szepesvári, 2020a, Theorem 19.4], log det Vn det V0 further be upper bounded by d log trace V0+n L2 d det(V0)1/d , which effectively results the desired bound. Proof. The proof adapts [György et al., 2013, Theorem 19.8] to multi-dimensional observations and takes advantage of the stochastic loss function by taking the expectation. First, define lt(f) = 1 2 Mπtf yt 2. Then, using that Et[yt] = Mπtf , 1 2 Mπt(f ˆft) 2 # 1 2 Mπt ˆft yt 2 1 2 Mπtf yt 2 # t=1 lt( ˆft) lt(f ) Further, by directly generalizing [György et al., 2013, Lemma 19.7], we have that lt( ˆft+1) lt( ˆft) lt( ˆft)V 1 t lt( ˆft) = (Mπt ˆft yt) M πt V 1 t Mπt(Mπt ˆft yt) (18) We now start upper bounding the estimation error, (i) f 2 + E lt(wt) lt(wt+1) # (ii) f 2 + E ξt + Mπt(f ˆft 1) Mπt V 1 t Mπt ξt + Mπt(f ˆft 1) # (iii) = f 2 + E t=1 ξt Mπt V 1 t Mπtξt # t=1 xt Mπt V 1 t Mπt xt # (iv) f 2 + E t=1 λmax(Mπt V 1 t Mπt) ξt 2 # t=1 λmax(Mπt V 1 t Mπt) xt 2 # (v) f 2 + (σ2 + B2)E t=1 λmax(Mπt V 1 t Mπt) The inequality (i) follows from [Shalev-Shwartz et al., 2012, Lemma 2.3]. For (ii) we used Eq. (18). For (iii) we used that Et[ξt] = 0. In (iv), we introduce the maximum eigenvalue λmax(A) for A Rm m and denote xt = Mπt(f ft 1). Lastly, in (v) we used that xt 2 B and Et ξt 2 σ2. We conclude the proof with basic linear algebra. Denote by λi(A) the i-th eigenvalue of a matrix M Rm m. Using the generalized matrix determinant lemma, we get det(Vt 1) = det(Vt M πt Mπt) = det(Vt) det(I M πt V 1 t Mπt) i=1 (1 λi(M πt V 1 t Mπt)) Note that λi(M πt V 1 t Mπt) (0, 1]. 
Next, using that log(1 x) x for all x < 1, we get that log det(Vt 1) i=1 log(1 λi(M πt V 1 t Mπt)) i=1 λi(M πt V 1 t Mπt) Rearranging the last display, and bounding the sum by its maximum element, we get λmax(M πt V 1 t Mπt) i=1 λi(M πt V 1 t Mπt) log det(Vt) The proof is concluded by combining Eqs. (19) and (20). Remark 2 (Expected Regret). The beauty of Theorem 1 is that the proof uses only in-expectation arguments. This is unlike most previous analysis, that controls the regret via controlling tail-events, and bounds on the expected regret are then derived a-posteriori from high-probability bounds. In the context of linear bandits, Theorem 2 leads to bound on the expected regret that only requires the noise variance to be bounded, whereas most previous work relies on the stronger sub-Gaussian noise assumption [e.g. Abbasi-Yadkori et al., 2011]. Remark 3 (Kernel Bandits / Bayesian Optimization). Using the standard kernel-trick , the analysis can further be extended to the non-parametric setting where M is an infinite-dimensional reproducing kernel Hilbert space (RKHS). B PAC to Regret Bounds B.1 Proof of Lemma 3 Proof. The Lagrangian for Eq. (12) is pac-decac,f ϵ (f) = min λ 0 min µ P(Π) max ν P(M) δfν λ(µIfν ϵ2). Reparametrize any µ P(Π) as µ(p) = (1 p)eπ f + pµ2. We bound decac,f ϵ by a function of pac-decac,f ϵ . Starting from Eq. (5), we have decac,f ϵ (f) = min λ 0 min µ P(Π) max ν P(M) µ fν λ(µIfν ϵ2) = min λ 0 min µ P(Π) max ν P(M) µ fν λ( µIfν ϵ2) = min λ 0 min 0 p 1 min µ2 P(Π) max ν P(M) δfν + pµ2 fef λ µIfν λϵ2 min λ 0 min 0 p 1 min µ2 P(Π) max ν P(M) δfν + pµ2 fef λpµ2Ifν λϵ2 min 0 p 1 min λ 0 min µ2 P(Π) max ν P(M) δfν λ (µ2Ifν ϵ2 p ) + p max min 0 p 1 pac-decac,f ϵ p (f) + p max . B.2 Proof of Lemma 4 Proof of Lemma 4. For the first part, note that pac-decac,f ϵ (f) = min µ P(Π) min λ 0 max ν P(M) δfν λµIfν + λϵ2 = min µ P(Π) min λ 0 max b Π max g M ϕb, g ϕπ f , f λ g f 2 V (µ) + λϵ2 (i) = min µ P(Π) min λ 0 max b Π ϕb ϕπ f , f + 1 4λ ϕb 2 V (µ) 1 + λϵ2 (ii) min µ P(Π) min λ 0 max b Π 1 4λ ϕb 2 V (µ) 1 + λϵ2 = min µ P(Π) max b Π ϵ ϕb V (µ) 1 Equation (i) follows by computing the maximizer attaining the quadratic form over M = Rd. The inequality (ii) is by definition of π f and the last inequality (iii) by the assumption that the reward is observed, respectively, ϕπϕ π M π Mπ, and the Kiefer Wolfowitz theorem. The second part of the statement follows by combining Lemmas 2 and 3. C Coefficient Relations Results and Proofs Lemma 7. Assume Assumption 1 holds, i.e. (µ(rf rg))2 µIfeg . (21) g µ(rf rg)νg µ(rf rg) 2νg µIfν . Proof. First Jensen s inequality, then Eq. (21). C.1 Proof of Lemma 1 Proof of Lemma 1. Note that decac ϵ (f) = min µ P(Π) max ν P(M) µ ν s.t. µIfν ϵ2 = min µ P(Π) max ν P(M) µ fν + X π Π νgµπ(rf(π) rg(π)) s.t. µIfν ϵ2 min µ P(Π) max ν P(M) µ fν + p µIfν s.t. µIfν ϵ2 ϵ + min µ P(Π) max ν P(M) µ fν s.t. µIfν ϵ2 ϵ + decac,f ϵ (f) s.t. µIfν ϵ2, where the first inequality is by Lemma 7. Also, by lower bounding the sum in the second inequality by p µIfν we get left inequality. C.2 Proof of Lemma 2 Proof of Lemma 2. For the first inequality, using the definition of decac ϵ (f, f) and the AM-GM inequality: decac,f ϵ (f) = min λ 0 max ν P(M) min µ P(Π) µ fν λµIfν + λϵ2 min λ>0 max ν P(M) min µ P(Π) (µ fν)2 4λµIfν + λϵ2 (22) = min λ>0 Ψ(f) 4λ + λϵ2 = ϵ p Ψ(f) . (23) Further, by Eq. (11) and Assumption 1 we have µTS ν fν q g,h M νgνheπ h(rg rf)2 q KµTS f Ifν, which gives Ψ(f) K. Plugging this into Eq. (23) gives the second inequality. 
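For completeness, the minimization over $\lambda$ in Eq. (23) is the standard AM-GM step; assuming $\Psi(f) > 0$, it is attained at $\lambda^\star = \sqrt{\Psi(f)}/(2\epsilon)$:
\[
\min_{\lambda > 0}\left\{ \frac{\Psi(f)}{4\lambda} + \lambda \epsilon^2 \right\}
= \frac{\Psi(f)}{4\lambda^\star} + \lambda^\star \epsilon^2
= \frac{\epsilon\sqrt{\Psi(f)}}{2} + \frac{\epsilon\sqrt{\Psi(f)}}{2}
= \epsilon\sqrt{\Psi(f)} .
\]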
C.3 Regret bound for Algorithm 1 defined for f and decac,f ϵ Lemma 8. If Assumption 1 holds, then the regret of ANYTIME-E2D (Algorithm 1) with replaced with f is bounded as follows: Rn max t [n],f M ( decac,f ϵt (f) ϵ2 t t=1 ϵ2 t + Estn Proof. The proof follows along the lines of the proof of Theorem 1. The main difference is that when introducing f, we get a term that captures the reward estimation error: t=1 E h decac ϵt ( ˆft) + λt(µt I ˆ ftef ϵ2 t) i + t=1 E h µt(r ˆ ft rf ) i (24) max t [n] max f M ϵ2 t decac ϵt (f) n X ϵ2 t + E h µt I ˆ ftef i + n Estn (25) For the last inequality, we used Cauchy-Schwarz and Assumption 1 to bound the error term, t=1 E h µt(r ˆ ft rf ) i t=1 E h (µt(r ˆ ft rf ))2 i t=1 E h µt I ˆ ftef i = C.4 Generalized Information Ratio The generalized information ratio [Lattimore and Gyorgy, 2021] for µ P(Π), ν P(M), and α > 1 is defined as Ψα,f(µ, ν) = (µ fν)α For α = 2, we get the standard information ratio introduced by Russo and Van Roy [2014] with ν as a prior over the model class M. Define Ψα(f) = maxν M minµ P(Π) Ψα,f(µ, ν). To upper bound decac ϵ , we have the following lemma. Lemma 9. For the reference model f, the ac-dec can be upper bounded as decac ϵ (f) min λ>0 λ 1 1 α α α 1 α (α 1)Ψα(f) 1 α 1 + λϵ2 (27) Proof. We start by noting that for x1, . . . , xα 0, from AM-GM we have that α(x1 x2 xα)1/α x1 + + xα. Substituting x2 = x3 = xα, we get α x α 2 x1 (α 1)x2. Writing x1 = λµIfν and x2 = α α 1 α (µ ν)α λµIf ν 1 α 1 and using the previous inequality with the decac ϵ program gives the result. The information ratio Ψα,f(µ, ν) can be thought of as the Bayesian information ratio in [Russo and Van Roy, 2014] where the expectation is taken over the distribution ν of possible environments. However, for information gain If, [Russo and Van Roy, 2014] use entropy difference in the posterior distribution of π f before and after the observation is revealed. C.5 Proof of Lemma 5 Proof of Lemma 5. decac,f ϵ ( ˆft) = min µ P(Π) min λ 0 max b Π max g M ϕb, g ϕµ, ˆft λ g ˆft 2 V (µ) + λϵ2 = min λ 0 min µ P(Π) max b Π ϕb ϕµ, ˆft + 1 4λ ϕb 2 V (µ) 1 + λϵ2 . (28) The first equality is by definition, and the second equality follows from solving the quadratic maximization over g M = Rd. To show that the problem is convex in µ, note that taking inverses of positive semi-definite matrices X, Y is a convex function, i.e. ((1 η)X + ηY ) 1 (1 η)X 1 + ηY 1. In particular, V ((1 η)µ1 + ηµ2) 1 (1 η)V (µ1) 1 + ηV (µ2) 1. With this the claim follows. D Convex Program for Fixed λ Take Eq. (28) and fix λ > 0. Then we have the following saddle-point problem: min µ P(Π) max b Π ϕb ϕµ, ˆft + 1 4λ ϕb 2 V (µ) 1 + λϵ2 = λϵ2 + min µ P(Π) max b Π ϕb ϕµ, ˆft + 1 4λ ϕb 2 V (µ) 1 Up to the constant additive term, this saddle point problem is equivalent to the following convex program min y R,µ RΠ y s.t. y ϕb ϕµ, ˆft + 1 4λ ϕb 2 V (µ) 1 b Π 1µ = 1 µπ 0 π E Experiments All experiments below were run on a semi-bandit problem with a "revealing action", as alluded to in the paragraph below Lemma 4. Specifically, we assume a semi-bandit model where M = Rd and the features are ϕπ Rd. For an instance f M, the reward function is rf = ϕπ, f for all π Π. There is one revealing (sub-optimal) action ˆπ = π f . The observation for any action π = ˆπ is Mf (π) = N( ϕπ, f , 1) (29) Define Mˆπ = [ϕπ1, . . . , ϕπ|Π|] . 
Then the observation for action $\hat\pi$ is
\[
M_{f^*}(\hat\pi) = \mathcal{N}(M_{\hat\pi} f^*, \mathbf{1}_{|\Pi|}) \quad (30)
\]
Thus, the information for any action $\pi \neq \hat\pi$ is
\[
I_f(g, \pi) = \frac{\sigma^2}{2}\,\langle \phi_\pi, g - f\rangle^2 \quad (31)
\]
while the information for action $\hat\pi$ is
\[
I_f(g, \hat\pi) = \frac{\sigma^2}{2}\,\| M_{\hat\pi}(g - f)\|^2 = \frac{\sigma^2}{2}\sum_{\pi \in \Pi}\langle \phi_\pi, g - f\rangle^2 \quad (32)
\]
For this setting, $\mathrm{Est}_n \leq O(d \log(n))$ (see Appendix A.1).

E.1 Experimental Setup

Our main objective is to compare our algorithm ANYTIME-E2D to the fixed-horizon E2D algorithm by Foster et al. [2021]. ANYTIME-E2D and E2D were implemented using the procedure described in Section 3.4. Both ANYTIME-E2D and E2D need to solve the inner convex problem in Lemma 5. To do so, we run Frank-Wolfe [Frank and Wolfe, 1956, Dunn and Harshbarger, 1978, Jaggi, 2013] for 100 steps, warm-starting the optimization at the solution from the previous round, $\mu_{t-1}$. For ANYTIME-E2D, we further perform a grid search over $\lambda \in [0, \max_{g \in \mathcal{M}} \epsilon^{-2}\mathrm{dec}^{\mathrm{ac},f}_{\epsilon}(g)]$ (with a discretization of 50 points) to optimize over $\lambda$ within each iteration of Frank-Wolfe. For both the E2D and ANYTIME-E2D algorithms, we used the version with the gaps $\Delta$ replaced by $\Delta_f$, since we noticed that both algorithms performed better with $\Delta_f$. For E2D, the scale hyperparameter $\lambda$ was set to $\lambda = \sqrt{n / (4\log(n))}$, as mentioned in Foster et al. [2021, Section 6.1.1], while for ANYTIME-E2D we set the hyperparameter $\epsilon_t^2 = d/t$. Further, we compare to standard bandit algorithms: Upper Confidence Bound (UCB) and Thompson Sampling (TS) [Lattimore and Szepesvári, 2020a].

E.2 Experiment 1

In this experiment, we aim to demonstrate the advantage of having an anytime algorithm. Specifically, we tune $\lambda$ in the E2D algorithm for different horizons $n = 200, 500, 1000, 2000$, but run it for a fixed horizon of $n = 2000$. As such, we expect our algorithm ANYTIME-E2D to perform better than E2D when $\lambda$ was tuned for the incorrect horizons (i.e., $n = 200, 500, 1000$). The feature dimension is $d = 3$ and the number of decisions is $|\Pi| = 10$. We generated the features $\phi_\pi$ for each $\pi \in \Pi$ and the parameter $f^* \in \mathbb{R}^d$ randomly at the beginning and then kept them fixed throughout the experiment. 100 independent runs were performed for each algorithm. The results of the experiment are shown in the left plot of Fig. 1. As expected, our algorithm ANYTIME-E2D performs better than E2D tuned for $n = 200, 500, 1000$. This indicates that the E2D algorithm is sensitive to the setting of $\lambda$, which is problematic when the horizon is not known beforehand, whereas our ANYTIME-E2D algorithm performs well even when the horizon is not known.

[Figure 1: Regret over steps for ANYTIME-E2D, TS, UCB, and E2D optimized for different horizons $n \in \{200, 500, 1000, 2000\}$ (curves E2D-200, E2D-500, E2D-1000, E2D-2000). Left: horizon $n = 2000$, feature dimension $d = 3$. Right: horizon $n = 1000$, feature dimension $d = 30$.]

E.3 Experiment 2

In this experiment, we investigate the case $n < d^4$. As pointed out below Lemma 3, we expect an improvement in this regime, as the regret bound of our algorithm is $R_n \lesssim \min\{d\sqrt{n}, d^{1/3}n^{2/3}\}$, while the default, fixed-horizon E2D algorithm cannot achieve these bounds simultaneously and one has to pick one of $d\sqrt{n}$ or $d^{1/3}n^{2/3}$ beforehand when setting the scale hyperparameter $\lambda$. It is standard to choose $\lambda$ according to the $d\sqrt{n}$ regret bound for E2D [Foster et al., 2021] (which is not optimal when $n \ll d^4$), especially if the horizon is not known beforehand.
Thus, we set the horizon to n = 1000 and the dimension of the feature space to d = 30, which gives us that n = 1000 << 810000 = d4. The rest of the setup and parameters are the same as in the previous experiment except for the features ϕπ and f which are again chosen randomly in the beginning and then kept fixed throughout the experiment. The results of the experiment can be seen as the right plot in Fig. 1. As expected, our algorithm ANYTIME-E2D performs better than E2D, UCB, and TS. This indicates that indeed, ANYTIMEE2D is likely setting λ appropriately to achieve the preferred d1/3n2/3 regret rate for small horizons. The poor performance of the other algorithms can be justified, since E2D is optimized based on the worse d n regret rate (for small horizons), while the UCB and TS algorithms are not known to get regret better than d n.
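For reference, the following is a minimal sketch of the semi-bandit environment with a revealing action described above (Eqs. (29)-(32)); the random feature construction, the seed, and all names are illustrative assumptions and are not the authors' experimental code.

```python
# Sketch of the semi-bandit environment with one revealing action (Appendix E setting).
import numpy as np

def make_environment(d=3, n_actions=10, seed=0):
    rng = np.random.default_rng(seed)
    Phi = rng.normal(size=(n_actions, d))        # features phi_pi, one row per decision
    f_star = rng.normal(size=d)                  # unknown true parameter
    revealing = int(np.argmin(Phi @ f_star))     # a suboptimal action chosen as revealing action
    return Phi, f_star, revealing, rng

def observe(Phi, f_star, revealing, pi, rng):
    """Bandit feedback for pi != revealing; full feedback M_pihat = Phi for the revealing action."""
    if pi == revealing:
        return Phi @ f_star + rng.normal(size=Phi.shape[0])
    return np.array([Phi[pi] @ f_star + rng.normal()])

def information(Phi, revealing, pi, g, f):
    """Divergence I_f(g, pi) as in Eqs. (31)-(32), for unit-variance noise."""
    if pi == revealing:
        return 0.5 * float(np.sum((Phi @ (g - f)) ** 2))
    return 0.5 * float((Phi[pi] @ (g - f)) ** 2)
```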