# online_episodic_convex_reinforcement_learning__4f84f47b.pdf

Online Episodic Convex Reinforcement Learning

Bianca Marin Moreno * 1 2 3 Khaled Eldowa * 4 5 Pierre Gaillard 1 Margaux Br eg ere 2 6 Nadia Oudjane 2 3

We study online learning in episodic finitehorizon Markov decision processes (MDPs) with convex objective functions, known as the concave utility reinforcement learning (CURL) problem. This setting generalizes RL from linear to convex losses on the state-action distribution induced by the agent s policy. The non-linearity of CURL invalidates classical Bellman equations and requires new algorithmic approaches. We introduce the first algorithm achieving near-optimal regret bounds for online CURL without any prior knowledge on the transition function. To achieve this, we use an online mirror descent algorithm with varying constraint sets and a carefully designed exploration bonus. We then address for the first time a bandit version of CURL, where the only feedback is the value of the objective function on the state-action distribution induced by the agent s policy. We achieve a sub-linear regret bound for this more challenging problem by adapting techniques from bandit convex optimization to the MDP setting.

1. Introduction

Reinforcement learning (RL) studies the problem where an agent interacts with an environment over time, adhering to a probabilistic policy that maps states to actions and aiming to minimize the cumulative expected losses. The environment s dynamics are represented by a Markov decision process (MDP), assumed here to be episodic, with episodes of length N, a finite state space X, a finite action space A, and

*Equal contribution 1Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. 2EDF Lab, 7 bd Gaspard Monge, 91120 Palaiseau, France 3Fi ME (Laboratoire de Finance des March es de l Energie - Dauphine, CREST, EDF R&D) 4Universit a degli Studi di Milano, Milan, Italy 5Politecnico di Milano, Milan, Italy 6Sorbonne Universit e LPSM, Paris, France. Correspondence to: Bianca Marin Moreno <bianca.marinmoreno@inria.fr>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

a sequence of probability transition kernels p : ppnqn Pr Ns, such that for each px, aq P X ˆA, pnp |x, aq P X , the simplex over the state space. Formally, the RL problem involves finding a policy π that, under a transition kernel p, induces a state-action distribution sequence µπ,p P p XˆAq N minimizing the inner product with a loss vector ℓ: pℓnqn Pr Ns, with ℓn P RXˆA, i.e.: minπPp Aq XˆN xℓ, µπ,py. A large body of literature is devoted to solving the RL problem efficiently and with theoretical guarantees in many challenging environments (Bertsekas, 2019; Sutton & Barto, 2018).

However, numerous practical problems entail more intricate objectives, such as those encountered within the Concave Utility Reinforcement Learning (CURL) framework (Hazan et al., 2019; Zahavy et al., 2021) (also known as convex RL). The CURL problem consists in minimizing a convex function (or maximizing a concave function) on the stateaction distributions induced by an agent s policy:

minπPp Aq XˆN Fpµπ,pq. (1)

In addition to RL, other examples of machine learning problems that can be written as CURL are pure exploration (Hazan et al., 2019; Mutti et al., 2021; 2022b), where Fpµπ,pq xµπ,p, logpµπ,pqy; imitation learning (Ghasemipour et al., 2020; Lavington et al., 2022) and apprenticeship learning (Zahavy et al., 2019; Abbeel & Ng, 2004), where Fpµπ,pq Dgpµπ,p, µ q, with Dg representing a Bregman divergence induced by a function g and µ being a behavior to be imitated; certain instances of mean-field control (Bensoussan et al., 2013), where Fpµπ,pq xℓpµπ,pq, µπ,py; mean-field games with potential rewards (Lavigne & Pfeiffer, 2023); risk-averse RL (Garcıa & Fern andez, 2015; Pan et al., 2019; Greenberg et al., 2022), among others. The non-linearity of CURL alters the additive structure inherent in standard RL, invalidating the classical Bellman equations. Consequently, dynamic programming approaches become infeasible, necessitating the development of novel methodologies.

A natural extension of CURL is the online scenario, wherein a sequence of policies pπtqt Pr T s is computed over T episodes, aimed at minimizing a cumulative loss LT : řT t 1 F tpµπt,pq, where the objective F t can change arbitrarily (known as the adversarial scenario (Even-Dar et al., 2009)), and the MDP probability kernel p is unknown. Most existing approaches to CURL fail to address the challenges

Online Episodic Convex Reinforcement Learning

of the online setting (adversarial losses and unknown dynamics). The few methods that attempt to tackle this problem rely on strong assumptions about the probability transition kernel (Moreno et al., 2024), which can be overly restrictive in real-world scenarios. To overcome this, we need an approach capable of optimizing the objective function while simultaneously learning the environment, effectively balancing the exploration-exploitation dilemma.

Motivations. The CURL framework presented in this paper is motivated by a variety of application scenarios, some of which we outline below. Energy grid optimization: To balance the energy production with the consumption, an energy provider may want to control the average consumption of electrical appliances (electric vehicles, water heaters, etc) to better match a target consumption. The task involves daily control, with the target consumption varying daily due to fluctuations in energy production. To protect user privacy, the energy provider has limited access to individual trajectories, but receives the average consumption of the whole population at the end of each day. The loss is usually quadratic on the state-action distribution. This problem can be framed as our CURL formulation. See (Coffman et al., 2023; Moreno et al., 2025). Mean-field games (MFG) with potential reward: As shown by (Barakat et al., 2023), a MFG with potential reward can be framed as a CURL problem. Therefore, any sequential decision problem with a large population of anonymous agents with symmetric interests and potential rewards, such as epidemic spreading, crowd motion control, etc, can be cast as CURL.

Contribution 1. In the full-information feedback setting, where the objective function F t is fully revealed to the learner at the end of episode t, we propose the first method achieving sub-linear regret for online CURL with adversarial losses and unknown transition kernels, without relying on additional model assumptions. Our algorithm uses an Online Mirror Descent (OMD) variant incorporating welldesigned exploration bonuses into the sub-gradient of the objective function to handle the exploration-exploitation trade-off. It achieves a regret of Op ?

Tq, matching the stateof-the-art (So TA) in more restricted settings (Moreno et al., 2024), while obtaining a closed-form solution.

Contribution 2. We extend our approach to incorporate bandit feedback on the objective function. We first consider the RL case where F tpµq : xℓt, µy. Bandit feedback in this setting means that the agent only observes the loss function in the state-action pairs they visit during each episode, i.e. pℓt npxt n, at nqqn Pr Ns where pxt n, at nqn Pr Ns is the agent s trajectory. We obtain the optimal regret of Op ?

Tq in this setting. We then address for the first time the general CURL problem under more strict bandit feedback. In this setting, the learner only has access to the value of the objective function evaluated on the state-action distribution sequence

induced by the agent s policy, i.e., F tpµπt,pq. We propose two algorithms for this setting and show that they achieve sub-linear regret. One algorithm requires that the MDP is known, while the other, under the assumption that the probability transition kernel is lower bounded by a positive constant, operates in the setting where the MDP is estimated progressively from observed trajectories. We rely on gradient estimation techniques from the bandit convex optimization literature, even as the peculiar structure of our constraint set and uncertainty regarding the true transition kernel present some unique challenges.

1.1. Related Work

Offline CURL. An extensive line of work focus on the offline version of CURL (Problem (1)), where the objective function is known and fixed. The methodologies proposed by (Zhang et al., 2020; 2021; Barakat et al., 2023) rely on policy gradient techniques, requiring the estimation of F s gradient concerning the policy π, a task often complex. Taking a different approach, Zahavy et al. (2021) cast the CURL problem as a min-max game using Fenchel duality, demonstrating that conventional RL algorithms can be tailored to fit the CURL framework. Recently, Geist et al. (2022) established that CURL is a specific instance of meanfield games. Moreover, Moreno et al. (2024) undertake a convexification of Problem (1) and propose a mirror descent algorithm with a non-standard Bregman divergence. Mutti et al. (2022a; 2023) study the gap between evaluating agent performance over infinite realizations versus finite trials and question the classic CURL formulation in Eq. (1). However, they show that non-Markovian policies can be necessary to optimize the finite trials objective, which entails an increased computational burden. On the other hand, the occupancy-measure-based formulation that we study can be solved efficiently, and allows direct comparison with methods from the CURL literature such as (Moreno et al., 2024). Moreover, in application scenarios with many homogeneous agents, a mean-field approach can justify this choice, as we discuss in Sec. 1.

Online CURL. To the best of our knowledge, Greedy MDCURL from (Moreno et al., 2024) is the only regret minimization algorithm designed for online CURL. However, it only achieves sublinear regret when the system dynamics follow the form xn 1 gnpxn, an, εnq, where gn is a known deterministic function, and εn is an external noise with an unknown distribution independent of pxn, anq, which significantly limits its applicability, as we empirically show in Sec. 5. This assumption simplifies the problem, as the algorithm only needs to learn the noise distribution, which can be done independently of the policy, eliminating the need for exploration. In contrast, our approach does not assume any specific form for the dynamics, which introduces the challenge of developing a policy that minimizes

Online Episodic Convex Reinforcement Learning

Table 1. Comparisons of So TA finite-horizon tabular MDPs methods. MD stands for Mirror Descent, KL for Kullback-Leibler divergence and Γ is defined in Eq. (4). MD + p q indicates the regularization added to the MD iteration. MD on π indicates a policy optimization approach in which MD iterations are performed on policies instead of state-action distributions (occupancy-measures).

Algorithm Optimal regret in T CURL Closedform Exploration No model assumption Adversarial Losses Bandit feedback

(Jin et al., 2020) MD + KL UCRL (Moreno et al., 2024) MD + Γ None (ours) MD + Γ Bonus (Luo et al., 2021) MD on π Bonus

total loss while simultaneously enabling sufficient exploration to improve estimates of the transition kernels. The technical novelty we introduce to overcome this challenge are well-designed exploration bonuses detailed in Sec. 3.

The work of (Rosenberg & Mansour, 2019b) also studies convex performance criteria in adversarial MDPs, but under a different setup. They use fixed, known convex functions applied to linear losses, i.e., Fpxµ, ℓtyq. In contrast, our setting generalizes this by allowing the convex function F itself to be adversarial and applied directly to the occupancy measure, i.e., F tpµq.

RL approaches. Model-optimistic methods construct a set of plausible MDPs by forming confidence bounds around the empirical transition kernels, then select the policy that maximizes the expected reward in the best feasible MDP. A key example of this approach is UCRL (Upper Confidence RL) methods (Jaksch et al., 2008; Zimin & Neu, 2013; Rosenberg & Mansour, 2019b; Jin et al., 2020). While these methods offer strong theoretical guarantees, they are often difficult to implement due to the complexity of optimizing over all plausible MDPs. While we believe these approaches could be generalized to CURL, their computational complexity has led us to propose an alternative method. Valueoptimistic methods are value-based approaches that compute optimistic value functions, rather than optimistic models, using dynamic programming. An example is UCB-VI (Azar et al., 2017). However, these methods are limited to stochastic losses. Policy-optimization (PO) methods directly optimize the policy and are widely used in RL due to their faster performance and closed-form solutions. Recently, Luo et al. (2021) achieved So TA regret for PO methods with adversarial losses and bandit feedback by introducing dilated bonuses, which satisfy a dilated Bellman equation and are added to the Q-function. However, their approach cannot be applied here due to CURL s non-linearity (the expectation of the trajectory appears inside the objective function) which invalidates the Bellman s equations.

We achieve our results by computing local bonuses and adding them to the (sub-)gradient of the objective function in each OMD instance as exploration bonuses. This

is more computationally efficient than model-optimistic approaches and addresses the exploration issues in previous online CURL methods. We believe our analysis is of independent interest, as it also offers a new way to study RL approaches over occupancy measures, while providing closed-form solutions. See Table 1 for comparisons.

2. Problem Formulation

2.1. Setting

For a finite set S, |S| represents its cardinality, while S denotes the |S|-dimensional simplex. For all d P N we denote rds : t1, . . . , du. We let } }1 be the L1 norm, and for all v : pvnqn Pr Ns, such that vn P RXˆA we define }v}8,1 : sup1ďnďN }vn}1. We denote by } }1,8 its dual. Let Π : p Aq XˆN denote the set of policies. We consider an episodic MDP as introduced in Sec. 1. We assume that the initial state-action pair of an agent is sampled from a fixed distribution µ0 P XˆA at the beginning of each episode. At time step n P r Ns, the agent moves to a state xn pnp |xn 1, an 1q, and chooses an action an πnp |xnq by means of a policy πn : X Ñ A. When the agent follows a policy π : pπnqn Pr Ns for an episode in an environment described by the MDP with a transition kernel p, this induces a state-action distribution, which we denote by µπ,p : pµπ,p n qn Pr Ns, that can be calculated recursively for all pn, x, aq P r Ns ˆ X ˆ A, by

µπ,p 0 px, aq µ0px, aq

µπ,p n px, aq ÿ

px1,a1q µπ,p n 1px1, a1qpnpx|x1, a1qπnpa|xq. (2)

We define the set of all state-action distribution sequences satisfying the dynamics of the MDP as

Mp µ0 : " µ P p XˆAq Nˇˇ ÿ

a1PA µnpx1, a1q (3)

x PX,a PA pnpx1|x, aqµn 1px, aq , @x1 P X, @n P r Ns * .

For any µ P Mp µ0, there is a strategy π such that µπ,p µ. It suffices to take πnpa|xq9µnpx, aq when the normal-

Online Episodic Convex Reinforcement Learning

ization factor is non-zero, and arbitrarily defined otherwise. Let Mp, µ0 be the subset of Mp µ0 where the corresponding policies π satisfy πnpa|xq 0 for all px, aq. For any two probability transition kernels p, q, we define Γ : Mp µ0 ˆ Mq, µ0 Ñ R such that, for all µ, µ1 P Mp µ0 ˆ Mq, µ0 with policies π, π1,

Γpµ, µ1q : řN n 1 Epx,aq µnp q log πnpa|xq π1npa|xq ı . (4)

In the online extension of CURL, the objective function for episode t is denoted as F t : řN n 1 f t n, where f t n : XˆA Ñ R is convex and L-Lipschitz with respect to the } }1 norm (hence F t is LF -Lipschitz with respect to the norm } }8,1 with LF : LN). The objective function F t is unknown to the learner in the start of episode t. In this paper, we examine three types of objective function feedback: Full-information: In this case, F t is fully disclosed to the learner at the end of episode t, and is treated in Sec. 3.2. Bandit in RL: Here, F tpµq : xℓt, µy, and the learner observes the loss function only for the state-action pairs visited, i.e., pℓt npxt n, at nqqn Pr Ns, which is covered in Sec. 4.1. Bandit in CURL: In this scenario, the learner only has access to the objective function evaluated on the stateaction distribution sequence induced by the agent s policy, i.e., F tpµπt,pq, and is treated in Sec. 4.2.

The learner s goal is to compute a sequence of strategies pπtqt Pr T s, where T represents the total number of episodes, that minimizes their total loss LT : řT t 1 F tpµπt,pq. The learner s performance is evaluated by comparing it to any policy π P p Aq XˆN using the static regret:

RT pπq : řT t 1 F tpµπt,pq F tpµπ,pq. (5)

We assume the probability transition kernel p is unknown to the learner. Hence, to minimize its total loss, the learner must optimize the objective function while simultaneously learn the environment dynamics, facing an explorationexploitation dilemma. The interaction between the learner and the environment proceeds in episodes. At each episode t, the learner selects a policy πt, sends it to the agent, and observes its trajectory ot : pxt 0, at 0, . . . , xt N, at Nq. The learner uses this observation to compute an estimation of the probability transition kernel ppt 1. At the end of episode t, the learner receives one of the three feedbacks described above for the objective function F t, and then calculates the policy for the next episode, πt 1, based on πt, ppt 1, and the feedback on F t.

2.2. Preliminary Results

The results in this section are either known or extensions of existing results needed for the analysis.

Since the probability transition kernel is unknown, we propose an online mirror descent (OMD) instance that opti-

mizes over the state-action distributions induced by the estimated MDP as if it was the true model. This approach differs from the model-optimistic methods for RL discussed in Sec. 1.1 where each iteration is performed over the union of all state-action distribution sets induced by MDPs within a confidence set around the estimated model, which results in a computationally expensive optimization problem per iteration. Lemma 2.1 presents an auxiliary result concerning the quality of the state-action distribution sequence pµtqt Pr T s when µt is the solution of Eq. (6), an OMD instance on the set of state-action distributions induced by a transition kernel qt. It extends the upper bound result from (Moreno et al., 2024) for OMD with smoothly varying constraint sets to any sequence of bounded vectors pztqt Pr T s and any sequence of smootlhy varying transitions pqtqt Pr T s. Lemma 2.1. Let pqtqt Pr T s be a sequence of probability transition kernels and pztqt Pr T s a sequence of vectors in RNˆ|X|ˆ|A|, such that maxt Pr T s }zt}1,8 ď ζ. Initialize π1 npa|xq : 1{|A|. For t P r Ts, let πt : t t 1πt 1 t 1|A| 1

be a smoothed version of the policy and compute iteratively

µt 1 P arg min

µPMqt 1 µ0 τxzt, µy Γpµ, µ πt,qtq. (6)

Then, there is a τ ą 0 such that, for any sequence pνtqt Pr T s, with νt : νπ,qt for a common policy π, řT t 1xzt, µt νty ď O ζN a

VT |X| logp|A|q T logp Tq ,

where VT ě 1 max pn,x,aq

t 1 }qt np |x, aq qt 1 n p |x, aq}1.

This lemma is proved in App. C. It is known (Moreno et al., 2024) that for the divergence Γ defined in Eq. (4), Eq. (6) has a closed-form solution for the policy (see App. A.2).

Learning the model. Since the learner does not know the probability transition kernel, it must estimate p from the agents trajectories. Below we present the empirical way for estimating the transition and a well-known result (Lem. 2.2) on its quality using Hoeffding s inequality. Let N t npx, aq řt 1 s 1 1txsn x,asn au, M t npx1|x, aq řt 1 s 1 1txs n 1 x1,xs n x,as n au. The learner s estimate for the transition kernel at the end of episode t 1, to be used in episode t, is as follows

ppt n 1px1|x, aq : M t npx1|x, aq maxt1, N tnpx, aqu. (7)

Lemma 2.2 (Lem. 17 of Jaksch et al., 2008). For any 0 ă δ ă 1, with a probability of at least 1 δ,

}pnp |x, aq ppt np |x, aq}1 ď

g f f e2|X| log |X||A|NT

max t1, N t n 1px, aqu

holds simultaneously for all pt, n, x, aq P r Tsˆr NsˆX ˆA.

Online Episodic Convex Reinforcement Learning

These results suffice for analyzing CURL with fullinformation feedback (Sec. 3). For bandit feedback, more refined tools are needed. In bandit RL, we need Bernstein s inequality to bound the L1 distance (Lem. D.2). In bandit CURL, we also need a bound on the Kullback-Leibler (KL) divergence (Lem. E.3), which requires the Laplace (addone) estimator (Eq. (51)), as the KL of the empirical one can be unbounded.

3. Exploration Bonus in CURL

We now present our novel approach for online CURL with adversarial losses and unknown dynamics.

3.1. Limitations of previous approaches

The performance measure of a learner playing a sequence of strategies pπtqt Pr T s is given by the static regret defined in Eq. (5). Using the estimate of the probability transition kernel ppt computed by the learner, the static regret can be further decomposed as follows

t 1 F tpµπt,pq F tpµπt,ptq

t 1 F tpµπt,ptq F tpµπ,pq

t 1 x F tpµπt,pq, µπt,p µπt,pty looooooooooooooooooomooooooooooooooooooon

t 1 x F tpµπt,ptq, µπt,pt µπ,py looooooooooooooooooomooooooooooooooooooon

where the inequality comes from the convexity of F t. Let ξt npx, aq : }pnp |x, aq ppt np |x, aq}1. The term RMDP T , accounts for the error in estimating the MDP, and satisfies RMDP T Op ?

Tq with high probability. This is a classic result (see Neu et al., 2012). We first show that

x,a µπt,p i px, aqξt i 1px, aq. (9)

Then, using Lem. 2.2 and that N t npx, aq increases with the empirical version of the state-action distribution µπt,p n px, aq we achieve the final bound (see App. B.2). The second term, Rpolicy T , depends on the algorithm used to derive the policies. As mentioned in Sec. 1.1, model-optimistic approaches could be adapted to CURL, but they are computationally expensive. To achieve low complexity, we explore potential problems that might arise from the absence of explicit

exploration. We decompose this regret term as follows:

t 1 x F tpµπt,ptq, µπt,pt µπ,pty loooooooooooooooooooomoooooooooooooooooooon

Rpolicy/MD T

t 1 x F tpµπt,ptq, µπ,pt µπ,py looooooooooooooooooomooooooooooooooooooon

Rpolicy/MDP T

Assume the learner computes its policy sequence pπtqt Pr T s by solving Eq. (6) with qt 1 : ppt 1 and zt : F tpµπt,ptq. Hence, from Lem. 2.1, Rpolicy/MD T Op ?

Tq (Lemmas A.3 and A.4 in the Appendix demonstrate that řT t 1 }ppt 1p |x, aq pptp |x, aq}1 ď e logp Tq. By hypothesis, } F tpµπt,ptq}1,8 ď LF . Hence, we meet all the assumptions from Lem. 2.1). But the term Rpolicy/MDP T poses a challenge. It can be decomposed as RMDP T in Eq. (9). However, the state-action distribution multiplying ξt i 1px, aq would either be µπ,p i px, aq or µπ,pt 1

i px, aq, and neither is related to N t i px, aq. Consequently, we do not have the same convergence effect as RMDP T . In fact, this term can become prohibitively large. Without exploration, previous work using similar analysis (Moreno et al., 2024) only achieved optimal regret under strong model assumptions, limiting its applicability in realistic scenarios.

3.2. CURL with full-information feedback

We outline our idea to overcome previous limitations presented in Subsec. 3.1. Let bt : pbt nqn Pr Ns be a sequence of vectors, to be properly defined later, such that bt n P RXˆA. We assume that πt is the policy inducing µt computed as in Eq. (6) with qt : ppt, but instead of considering zt F tpµπt,ptq as the (sub-)gradient of MD to be used in episode t 1, we let zt : F tpµπt,ptq bt, i.e.,

µt 1 : arg min

µPM pt 1 µ0

! τx F tpµπt,ptq bt, µy Γpµ, µtq ) .

If we assume that bt is such that, for all t P r Ts and for some ζ ą 0, } F tpµπt,ptq bt}1,8 ď ζ, then by Lem. 2.1 and by adding and subtracting the bonus vector, we would have that Rpolicy T is bounded by

Tq řT t 1xbt, µπt,pt µπ,pty Rpolicy/MDP T . (10)

2|X| logp|X||A|NT{δq, and for all n P t0, r Nsu, px, aq P X ˆ A, let

bt npx, aq : Lp N nq Cδ a

max t1, N tnpx, aqu . (11)

Note that }bt n}8 ď LNCδ, ensuring that the hypothesis of Lem. 2.1 remains valid for this sequence. Decomposing

Online Episodic Convex Reinforcement Learning

Rpolicy/MDP T as we do for RMDP T in Eq. (9), and then applying Lem. 2.2, we get that for any δ P p0, 1q, with probability at least 1 δ, Rpolicy/MDP T is bounded by

n 0 p N nq ÿ

µπ,pt n px, aq a

max t1, N tnpx, aqu

t 1 xµπ,pt, bty.

By replacing Eq. (12) in Eq. (10), the additive property in the decomposition allows us to cancel out the problematic regret term Rpolicy/MDP T . As a result, we obtain that Rpolicy T ď Op ?

Tq řT t 1xbt, µπt,pty. All that remains is to analyze the new term due to the added bonus, řT t 1xbt, µπt,pty, which we do in Prop. 3.1. Proposition 3.1. Let pbtqt Pr T s be the bonus vector in Eq. (11). For any δ1 P p0, 1q, with probability 1 3δ1, řT t 1xbt, µπt,pty O LN 3|X|3{2a

With all the ingredients in place, we introduce our new method, Bonus O-MD-CURL, in Alg. 1. The main result is in Thm. 3.2 and its proof is in App. B.2. In terms of T and |A|, our result matches the optimal one in RL from (Jin et al., 2020), but we have additional factors of N and a

|X| that are due to using bonuses and to our approach to deal with adversarial convex RL. Theorem 3.2. Running Alg. 1 for online CURL with unknown transition kernel, full-information feedback, where F t : řN n 1 f t n is convex and each f t n is L-Lipschitz under } }1, ensures that, with probability at least 1 6δ for any δ P p0, 1q, the optimal choice of τ achieves, for any π P Π,

RT pπq O LN 3|X|3{2a

4. Bandit Feedback

4.1. Bandit feedback with bonus in RL

We generalize Alg. 1 to handle the RL case with bandit feedback. Our aim is not to improve the existing algorithms for bandit RL; rather, we show that our new methodology and analysis for CURL achieves comparable results to the So TA in bandit RL. In this case, an adversary selects a sequence of loss functions pℓtqt Pr T s, with ℓt : pℓt nqn Pr Ns, where ℓt n : X ˆ A Ñ r0, 1s, and the objective function is given by F tpµq : xℓt, µy řN n 1xℓt n, µny. Note that now the gradient of F t with respect to µ is always equal to ℓt due to the linearity of the objective function. Bandit feedback in this setting implies that the learner observes the loss function only for the state-action pairs visited by the agent during each episode, i.e., pℓt npxt n, at nqqn Pr Ns where pxt n, at nqn Pr Ns is the agent s trajectory.

Algorithm 1 Bonus O-MD-CURL (Full-information)

1: Input: number of episodes T, initial policy π1 P Π, initial state-action distribution µ0 and state-action distribution sequence µ1 µ1 µπ1,p1 with pp1 np |x, aq 1{|X|, learning rate τ ą 0. 2: Init.: @pn, x, a, x1q, N 1 npx, aq M 1 npx1|x, aq 0 3: for t 1, . . . , T do 4: agent starts at pxt 0, at 0q µ0p q 5: for n 1, . . . , N do 6: Env. draws new state xt n pnp |xt n 1, at n 1q 7: Update counts

N t 1 n 1pxt n 1, at n 1q N t n 1pxt n 1, at n 1q 1

M t 1 n 1pxt n|xt n 1, at n 1q M t n 1pxt n|xt n 1, at n 1q 1

8: Agent chooses an action at n πt np |xt nq 9: end for 10: Compute bonus sequence as in Eq. (11) 11: Observe objective function F t

12: Compute µπt,pt as in Eq. (2) 13: Update transition estimate as in Eq. (7) 14: Compute the πt 1 associated to the solution of Eq. 6 with zt : F tpµπt,ptq bt and qt 1 ppt 1

15: Compute πt 1 (Lem. 2.1), and µt 1 : µ πt 1,pt 1

16: end for

We define Alg. 2 in App. D, a version of Bonus O-MDCURL where for each OMD update we take zt : pℓt bt, with pℓt an importance-weighted estimator of ℓt defined in Eq. (40) and bt the bonus vector defined in Eq. (11). Thm. 4.1 states that Alg. 2 achieves the regret bound of Op ?

Tq known to be the optimal for RL with bandit feedback (Jin et al., 2020). For the proof and for an overview of approaches for bandit RL see App. D.

Theorem 4.1. Playing Alg. 2 for RL with adversarial losses pℓtqt Pr T s, unknown transition kernel, and bandit feedback, obtains with high probability for any policy π P Π,

RT pπq O N 3|X|3{2a

|A|T N 3{2|X|5{4|A| ?

4.2. CURL with bandit feedback

Returning back to the CURL framework, we now assume that F t : XˆA Ñ r0, Ns can be any convex, L-Lipschitz function with respect to } }1. In contrast to Sec. 3, we assume here that after executing a policy πt we observe F tpµπt,pq instead of F tpµπt,pq. We will consider both the case when the MDP is known in advance and when it needs (as in previous sections) to be estimated progressively from observed trajectories.

Main challenges. This problem can be broadly categorized as a bandit convex optimization (BCO) problem. This places us in a more challenging domain compared to the

Online Episodic Convex Reinforcement Learning

bandit feedback setting in the standard RL problem, where the gradient of the loss function is identical for any point in p XˆAq N and is easier to estimate. Moreover, as a BCO problem, the present setting still exhibits distinctive challenges. One being the peculiar nature of our decision set Mp µ0 and how it impedes the efficacy of some standard gradient estimation techniques as we explain below. Another issue arises when the MDP is not known as that induces uncertainty over the true set of permissible occupancy measures. This incomplete knowledge of the decision set is atypical in the BCO literature and introduces multiple sources of bias for any adopted method.

4.2.1. ENTROPIC REGULARIZATION METHOD

Our first approach is to extend our MD-based algorithm from Sec. 3, supposing still that the MDP is not known. Since the algorithm required knowledge of the gradient F tpµπt,pq, we propose to estimate it by querying F t

at a random perturbation of µπt,p, a standard approach in the convex bandit literature popularized by Flaxman et al. (2005). This method yields T 3{4 regret under convex and Lipschitz conditions, and is incapable of doing better (Hu et al., 2016). Although more advanced algorithms and analyses achieve ?

T regret (Hazan & Li, 2016; Bubeck et al., 2021; Fokkema et al., 2024), they are arguably less practical, more complicated, and have worse dimension dependence. For d P Z , we denote by Bd

and Sd the unit ball and sphere respectively in Rd, and by 1d P Rd the vector with all entries equal to one. Let k: S Ñ R be a convex function, where S Ď Rd is a convex set satisfying Bd Ď S. Fix some δ P p0, 1q. The approach of Flaxman et al. (2005) relies on the observation that p1 δqd

δ Eu PSdrkpp1 δqx δuqus kpxq. Hence, p1 δqd

δ kpp1 δqx δuqu (for some u uniformly sampled from Sd) can be used as a one-point stochastic surrogate for the gradient. Applying this idea to our problem presents several challenges. Mainly, Mp µ0 has an empty interior in RN|X||A|. This can be addressed, assuming for the moment that the kernel p is known, by defining a bijection Λp : p Mp µ0q Ñ Mp µ0, where p Mp µ0q Ď RN|X|p|A| 1q is a representation of the constraint set in a lower-dimensional space where it is possible for its interior to be non-empty, see App. E.1 for more details. Next, we need to specify a (hyper)sphere that is contained in p Mp µ0q , which would allow us to use the aforementioned spherical estimation technique while remaining inside the feasible set of occupancy measures. To guarantee the existence of such an object, we rely on the following assumption (discussed further below).

Assumption 4.2. There exists a value ε ą 0 such that pnpx1|x, aq ě ε for all x, x1 P X 2, a P A, and n P r Ns.

Under this assumption, we show in App. E.2.1 that for κ : ε{p|A| 1 a

|A| 1q, it holds that κ1N|X|p|A| 1q

κBN|X|p|A| 1q Ď p Mp µ0q . For any v P BN|X|p|A| 1q, define ζv,p : Λppκ1N|X|p|A| 1q κvq. Motivated by the preceding discussion, we use (a simple transformation of)

δκ N|X|p|A| 1q F t p1 δqµt δζut,p ut

as a surrogate for F tpµtq, where ut is sampled uniformly from SN|X|p|A| 1q. What remains is to address the issue that the true kernel p is unknown. Similarly to the full information case, we compute an estimate ppt at each round to be used in place of the true kernel, and we employ bonuses to explore. One difference is that we rely on a slightly altered transition kernel estimator (see App. E.2.2) to ensure that ppt

too satisfies the condition of Asm. 4.2. Another discrepancy to be accounted for in the analysis is that although we compute πt relying on ppt (in particular, πt is the policy induced by p1 δqµt δζut,pt P Mpt µ0), we observe F tpµπt,pq, the evaluation of πt in the true environment. This induces an extra source of bias in the gradient estimator. We summarize our approach in Alg. 3 in App. E.2.3, and prove the following result in App. E.2.5:

Theorem 4.3. Under Asm. 4.2, Alg. 3 with a suitable tuning of τ, δ, and pαtqt Pr T s satisfies for any policy π P Π that

E r RT pπqs O a

Lp L 1q{ε|X| 5{4|A| 5{4N 3T 3{4 .

The main shortcoming of this method is its reliance on the restrictive Asm. 4.2, which also affects the regret guarantee through its dependence on ε. This assumption is not necessary to guarantee that p Mp µ0q has a non-empty interior; it suffices instead to assume that every state is reachable at every step, as we do later. Enforcing Asm. 4.2 only serves as a simple way to enable the construction of a sampling sphere with a certain radius. One can construct a different sampling sphere (or ellipsoid) without this assumption; nevertheless, the magnitude of the gradient estimator (which is featured in the current regret bound) would still scale with the reciprocal of the radius of that sphere, the permissible values for which depend on the structure of the MDP and can be arbitrarily small. It seems then that the current approach leads to an inevitable degradation of the bound subject to the structure of the MDP.

4.2.2. SELF-CONCORDANT REGULARIZATION METHOD

Fortunately, we can adopt a more principled approach via the use of self-concordant regularization, which is a common technique in bandit convex (and linear) optimization (see, e.g., Abernethy et al., 2008; Saha & Tewari, 2011; Hazan & Levy, 2014), and has been used for online learning in MDPs in different (linear) settings (Lee et al., 2020; Cohen et al., 2021; Van der Hoeven et al., 2023). We show in App. E.3 that p Mp µ0q is a convex polytope specified as the intersection of N|X||A| half-spaces. We define ψlb : p Mp µ0q Ñ R as the standard logarithmic barrier for

Online Episodic Convex Reinforcement Learning

p Mp µ0q (see Nemirovski, 2004, Cor. 3.1.1) which is a ϑself-concordant barrier (see Nemirovski, 2004, Def. 3.1.1) for p Mp µ0q with ϑ N|X||A|. The second approach we adopt here is to run mirror descent directly on p Mp µ0q

as the decision set and take ψlb as the regularizer in place of the entropic regularizer that induces Γ. Let ξ belong to the interior of p Mp µ0q , which we assume is not empty. Property I in (Nemirovski, 2004, Sec. 2.2) implies that ξ p 2ψlbpξqq 1{2BNSp A 1q Ď p Mp µ0q . Hence, we can construct an ellipsoid entirely contained in p Mp µ0q around any point in intp Mp µ0q . Let ξt be the output of mirror descent at round t and Ut : p 2ψlbpξtqq 1{2. We can then use the following as a surrogate for the gradient of F t Λp at ξt (see also Saha & Tewari, 2011):

δ N|X|p|A| 1q F t Λp ξt δUtut U 1 t ut

with ut again sampled uniformly from SN|X|p|A| 1q. The eigenvalues of Ut correspond to the lengths of the semi-axes of the ellipsoid used at round t, which could be arbitrarily small and lead again to the gradient surrogate having large magnitude. However, thanks to the relationship between ξt

and Ut, a local norm analysis of mirror descent (see, e.g., Lem. 6.16 in Orabona, 2019) absolves the regret of any dependence on the properties of Ut. Unfortunately, due to technical barriers, this log-barrier-based approach is not readily extendable to the setting where the decision set can change over time (in particular, it is not clear whether an analogue for Lem. 2.1 can hold in this case). Hence, we restrict its application only to the case when the MDP is known, see Alg. 4 in App. E.3. We state next a regret bound for this algorithm (proved in App. E.3.2), which requires the following less restrictive assumption in place of Asm. 4.2.

Assumption 4.4. For every state x P X and step n P r Ns, there exists a policy π such that ř

a PA µπ,p n px, aq ą 0.

Note that this can be imposed without loss of generality since the MDP is known; defining Xn Ď X as the subset of states reachable at step n, one can represent occupancy measures as sequences of distributions in p XnˆAqn Pr Ns.

Theorem 4.5. Under Asm. 4.4, Alg. 4 with a suitable tuning of τ and δ satisfies for any policy π P Π that

E r RT pπqs O ?

LN 7{4 p|X||A|Tq 3{4 .

Though holding only for the known MDP case, this bound maintains the T 3{4 rate of Thm. 4.3 while eliminating its reliance on Asm. 4.2 and its undesirable dependence on the MDP s structure. We leave extending this result to unknown MDPs and designing practical approaches enjoying the optimal ?

T rate for future work.

Figure 1. [left] Initial agent distribution; [middle] The three targets from multi-objectives; [right] The constrained MDP (reward in yellow, constraints in blue).

5. Experiments

We evaluate Bonus O-MD-CURL on the multi-objective and constrained MDP tasks from (Geist et al., 2022), which use fixed objective functions and fixed probability kernels across time steps. Adversarial and bandit MDPs are harder to implement due to challenges in finding optimal stationary policies, and there is a a lack of experimental validation in the literature. We focus on evaluating how well the additive bonus helps the algorithm to learn the environment. We also compare it to Greedy MD-CURL from (Moreno et al., 2024). The state space is an 11 ˆ 11 four-room grid world, with a single door connecting adjacent rooms. The agent can choose to stay still or move right, left, up, or down, as long as there are no walls blocking the path: xn 1 xn an εn. The external noise εn is a perturbation that can move the agent to a neighboring state with some probability. The initial distribution is a Dirac delta at the upper left corner of the grid, as in Fig. 1 [left]. We take N 40, τ 0.01, and 5 repetitions per experiment.1

Multi-objectives: The goal is to concentrate the distribution on three targets by the final step N, as in Fig. 1 [middle]. The objective function is defined as fnpµπ,p n q : ř3 k 1p1 xµπ,p n , ekyq2, where ek P R|X| is a vector with a 1 at the target state and 0 elsewhere. Constrained MDPs: The goal is to concentrate the state distribution on the yellow target in Fig. 1 [right] while avoiding the constraint states in blue. The objective function is defined as fnpµπ,p n q : xr, µπ,p n y pxµπ,p n , cyq2, where r, c P R|X|ˆ|A| . Here, r and c are zero everywhere except at the target and constraint states respectively.

To compute the regret, we compare against the oracle optimal policy, which can be closely approximated when the dynamics are known. For the Multi-objective task, Fig. 2 displays the state distribution at the final time step after 50 iterations for Bonus O-MD-CURL [up, left], and Greedy MD-CURL [up,right], and plot the log-loss [down,left] and regret [down,right] after 1000 iterations. We see that Bonus O-MD-CURL reaches the targets much faster than Greedy MD-CURL. As for the Constrained MDP task, Fig. 3 dis-

1The code to reproduce the empirical results are available at: https://github.com/biancammoreno/Convex_RL

Online Episodic Convex Reinforcement Learning

100 101 102 103

Greedy MD-CURL Bonus O-MD-CURL

0 200 400 600 800 1,000

Greedy MD-CURL Bonus O-MD-CURL

Figure 2. Multi-objective: distribution at N 40 after 50 iters. for Bonus O-MD-CURL [up,left], Greedy MD-CURL [up,right]; log-loss [down,left] and regret [down,right] for 103 iters.

100 101 102 103

Greedy MD-CURL Bonus O-MD-CURL

0 200 400 600 800 1,000 0

Greedy MD-CURL Bonus O-MD-CURL

Figure 3. Constrained MDP after 103 iters.: sum distributions over all time steps n P r40s at [up,left]; distribution at the last time step N 40 for Bonus O-MD-CURL [up,center], and Greedy MDCURL [up,right]; the log-loss [down, left] and regret [down,right].

plays the log-sum of all state distributions for all time steps n P r40s at iteration 1000 for Bonus O-MD-CURL [up,left]; the state distribution at the last time step n 40 after 1000 iterations for Bonus O-MD-CURL [up,center], and Greedy MD-CURL [up,right]; and the log-loss [down,left] and regret [down,right]. In this case, Greedy MD-CURL fails to reach the target state even after 1000 iterations, while Bonus O-MD-CURL successfully reaches the target state avoiding constrained states to minimize cost thanks to the additive bonuses. These examples empirically demonstrate the value of the additive bonus in tasks requiring exploration.

Impact Statement

This work is of a theoretical nature, we do not foresee any notable societal consequences.

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty First International Conference on Machine Learning, pp. 1, 2004.

Abernethy, J., Hazan, E., and Rakhlin, A. Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, volume 3, pp. 263 274, 2008.

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 263 272, 2017.

Barakat, A., Fatkhullin, I., and He, N. Reinforcement learning with general utilities: simpler variance reduction and large state-action space. In Proceedings of the 40th International Conference on Machine Learning, pp. 1753 1800, 2023.

Beck, A. and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett., 31(3):167 175, 2003.

Bensoussan, A., Yam, P., and Frehse, J. Mean Field Games and Mean Field Type Control Theory. Springer, 2013.

Bertsekas, D. Reinforcement learning and optimal control, volume 1. Athena Scientific, 2019.

Bubeck, S. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4): 231 357, 2015.

Bubeck, S., Eldan, R., and Lee, Y. T. Kernel-based methods for bandit convex optimization. Journal of the ACM (JACM), 68(4):1 35, 2021.

Cammardella, N., Buˇsi c, A., and Meyn, S. P. Kullback leibler-quadratic optimal control. SIAM Journal on Control and Optimization, 61(5):3234 3258, 2023.

Canonne, C. L., Sun, Z., and Suresh, A. T. Concentration bounds for discrete distribution estimation in kl divergence, 2023.

Coffman, A., Buˇsi c, A., and Barooah, P. A unified framework for coordination of thermostatically controlled loads. Automatica, 152:111002, 2023.

Cohen, A., Kaplan, H., Koren, T., and Mansour, Y. Online Markov decision processes with aggregate bandit feedback. In Conference on Learning Theory, pp. 1301 1329. PMLR, 2021.

Cover, T. and Thomas, J. Elements of Information Theory. Wiley, 2012.

Online Episodic Convex Reinforcement Learning

Even-Dar, E., Kakade, S. M., and Mansour, Y. Online Markov decision processes. Mathematics of Operations Research, 34(3):726 736, 2009. ISSN 0364765X, 15265471.

Flaxman, A. D., Kalai, A. T., and Mc Mahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 385 394, 2005.

Fokkema, H., Van der Hoeven, D., Lattimore, T., and Mayo, J. J. Online newton method for bandit convex optimisation extended abstract, 2024.

Garcıa, J. and Fern andez, F. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437 1480, 2015.

Geist, M., P erolat, J., Lauri ere, M., Elie, R., Perrin, S., Bachem, O., Munos, R., and Pietquin, O. Concave utility reinforcement learning: The mean-field game viewpoint. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pp. 489 497. International Foundation for Autonomous Agents and Multiagent Systems, 2022.

Ghasemipour, S. K. S., Zemel, R., and Gu, S. A divergence minimization perspective on imitation learning methods. In Conference on robot learning, pp. 1259 1277, 2020.

Greenberg, I., Chow, Y., Ghavamzadeh, M., and Mannor, S. Efficient risk-averse reinforcement learning. In Advances in Neural Information Processing Systems, volume 35, pp. 32639 32652, 2022.

Hazan, E. and Levy, K. Bandit convex optimization: Towards tight bounds. In Advances in Neural Information Processing Systems, volume 27, 2014.

Hazan, E. and Li, Y. An optimal algorithm for bandit convex optimization. ar Xiv preprint ar Xiv:1603.04350, 2016.

Hazan, E., Kakade, S., Singh, K., and Van Soest, A. Provably efficient maximum entropy exploration. In International Conference on Machine Learning, pp. 2681 2691, 2019.

Hazan, E. et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157 325, 2016.

Hu, X., Prashanth, L., Gy orgy, A., and Szepesvari, C. (bandit) convex optimization with biased noisy gradient oracles. In Artificial Intelligence and Statistics, pp. 819 828, 2016.

Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res., 11:1563 1600, 2008.

J ez equel, R., Ostrovskii, D. M., and Gaillard, P. Efficient and near-optimal online portfolio selection. ar Xiv preprint ar Xiv:2209.13932, 2022.

Jin, C., Jin, T., Luo, H., Sra, S., and Yu, T. Learning adversarial Markov decision processes with bandit feedback and unknown transition. In Proceedings of the 37th International Conference on Machine Learning, pp. 4860 4869, 2020.

Lavigne, P. and Pfeiffer, L. Generalized conditional gradient and learning in potential mean field games. Applied Mathematics & Optimization, 88(3):89, 2023.

Lavington, J. W., Vaswani, S., and Schmidt, M. Improved policy optimization for online imitation learning. In Proceedings of The 1st Conference on Lifelong Learning Agents, pp. 1146 1173, 2022.

Lee, C.-W., Luo, H., Wei, C.-Y., and Zhang, M. Bias no more: high-probability data-dependent regret bounds for adversarial bandits and mdps. In Advances in Neural Information Processing Systems, pp. 15522 15533, 2020.

Luo, H., Wei, C.-Y., and Lee, C.-W. Policy optimization in adversarial mdps: Improved exploration via dilated bonuses. In Advances in Neural Information Processing Systems, volume 34, pp. 22931 22942, 2021.

Maurer, A. and Pontil, M. Empirical bernstein bounds and sample variance penalization. ar Xiv preprint ar Xiv:0907.3740, 2009.

Moreno, B. M., Br eg ere, M., Gaillard, P., and Oudjane, N. Efficient model-based concave utility reinforcement learning through greedy mirror descent. In International Conference on Artificial Intelligence and Statistics, pp. 2206 2214, 2024.

Moreno, B. M., Br eg ere, M., Gaillard, P., and Oudjane, N. (Online) convex optimization for demand-side management: Application to thermostatically controlled loads. Journal of Optimization Theory and Applications, 205(3): 43, 2025.

Mourtada, J. and Ga ıffas, S. An improper estimator with optimal excess risk in misspecified density estimation and logistic regression. Journal of Machine Learning Research, 23:1 49, 2022.

Mutti, M., Pratissoli, L., and Restelli, M. Task-agnostic exploration via policy gradient of a non-parametric state entropy estimate. Proceedings of the AAAI Conference on Artificial Intelligence, 35(10):9028 9036, 2021.

Online Episodic Convex Reinforcement Learning

Mutti, M., De Santi, R., De Bartolomeis, P., and Restelli, M. Challenging common assumptions in convex reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4489 4502, 2022a.

Mutti, M., De Santi, R., and Restelli, M. The importance of non-Markovianity in maximum state entropy exploration. In Proceedings of the 39th International Conference on Machine Learning, pp. 16223 16239, 2022b.

Mutti, M., Santi, R. D., Bartolomeis, P. D., and Restelli, M. Convex reinforcement learning in finite trials. Journal of Machine Learning Research, 24(250):1 42, 2023.

Nemirovski, A. Interior point polynomial time methods in convex programming. Lecture notes, 42(16):3215 3224, 2004.

Neu, G. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Advances in Neural Information Processing Systems, 28, 2015.

Neu, G., Gyorgy, A., and Szepesvari, C. The adversarial stochastic shortest path problem with unknown transition probabilities. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, pp. 805 813, 2012.

Orabona, F. A modern introduction to online learning. ar Xiv preprint ar Xiv:1912.13213, 2019.

Pan, X., Seita, D., Gao, Y., and Canny, J. Risk averse robust adversarial reinforcement learning. In International Conference on Robotics and Automation (ICRA), pp. 8522 8528, 2019.

Rosenberg, A. and Mansour, Y. Online stochastic shortest path with bandit feedback and unknown transition function. Advances in Neural Information Processing Systems, 32, 2019a.

Rosenberg, A. and Mansour, Y. Online convex optimization in adversarial Markov decision processes. In Proceedings of the 36th International Conference on Machine Learning, pp. 5478 5486, 2019b.

Saha, A. and Tewari, A. Improved regret guarantees for online smooth convex optimization with bandit feedback. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 636 642, 2011.

Shalev-Shwartz, S. Online learning and online convex optimization. Found. Trends Mach. Learn., pp. 107 194, 2012.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. A Bradford Book, 2018.

van der Hoeven, D., Zierahn, L., Lancewicki, T., Rosenberg, A., and Cesa-Bianchi, N. A unified analysis of nonstochastic delayed feedback for combinatorial semibandits, linear bandits, and MDPs. In Proceedings of Thirty Sixth Conference on Learning Theory, pp. 1285 1321, 2023.

Wei, C.-Y. and Luo, H. More adaptive algorithms for adversarial bandits. In Conference On Learning Theory, pp. 1263 1291, 2018.

Wolsey, L. and Nemhauser, G. Integer and Combinatorial Optimization. Wiley Series in Discrete Mathematics and Optimization. Wiley, 1999.

Zahavy, T., Cohen, A., Kaplan, H., and Mansour, Y. Apprenticeship learning via Frank-Wolfe. In AAAI Conference on Artificial Intelligence, 2019.

Zahavy, T., O' Donoghue, B., Desjardins, G., and Singh, S. Reward is enough for convex MDPs. In Advances in Neural Information Processing Systems, pp. 25746 25759, 2021.

Zhang, J., Koppel, A., Bedi, A. S., Szepesvari, C., and Wang, M. Variational policy gradient method for reinforcement learning with general utilities. In Advances in Neural Information Processing Systems, pp. 4572 4583, 2020.

Zhang, J., Ni, C., Yu, z., Szepesvari, C., and Wang, M. On the convergence and sample efficiency of variancereduced policy gradient method. In Advances in Neural Information Processing Systems, pp. 2228 2240, 2021.

Zimin, A. and Neu, G. Online learning in episodic Markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems, pp. 1583 1591, 2013.

Online Episodic Convex Reinforcement Learning

A. Auxiliary Results

A.1. Auxiliary lemmas

Lemma A.1. For 0 ă δ ă 1,

x,a µπt,p i px, aq}pi 1p |x, aq ppt i 1p |x, aq}1

ď 3|X|N 2 d

2|A|T log ˆ|X||A|NT

with probability at least 1 2δ.

Proof. Let ξt npx, aq : }pnp |x, aq ppt np |x, aq}1. We denote by ot : pxt n, at nqn Pr Ns the trajectory of the agent at episode t when playing policy πt. Let pµπt,p n px, aq : 1tpxtn,atnq px,aqu be the empirical state-action distribution computed from the agent s trajectory. We consider the following decomposition:

x,a µπt,p i px, aqξt i 1px, aq

x,a pµπt,p i px, aqξt i 1px, aq loooooooooooooooooooooomoooooooooooooooooooooon

µπt,p n pµπt,p i px, aq ξt i 1px, aq looooooooooooooooooooooooooooomooooooooooooooooooooooooooooon

Term p1q analysis. We start by analysing the first term. Using Lem. 2.2, we have that for δ P p0, 1q, with probability 1 δ,

x,a pµπt,p i px, aqξt i 1px, aq ď

2|X| log ˆ|X||A|NT

µπt,p i px, aq a

maxt1, N t i px, aqu .

Using Lem. 19 from (Jaksch et al., 2008), we have that for all i P r Ns and px, aq P X ˆ A,

pµπt,p i px, aq a

maxt1, N t i px, aqu ď p ?

N T i px, aq.

Therefore, using Jensen s inequality and that ř

px,aq N T i px, aq T for all i P r Ns, we have that

pµπt,p i px, aq a

maxt1, N t i px, aqu ď 3

N T i px, aq

Substituting this inequality into the upper bound for term p1q yields

2|X| log ˆ|X||A|NT

2|A|T log ˆ|X||A|NT

Online Episodic Convex Reinforcement Learning

Term p2q analysis. We now analyse the second term. Let Ft : σpo1, . . . , ot 1q be the filtration generated by the trajectories of the agent from the first episode, up to the end of episode t 1. Note that ξt n 1px, aq is Ft measurable, as it only depends on observations up to episode t 1. Therefore,

Erξt n 1px, aqpµπt,p n px, aq|Ft ns ξt n 1px, aq Erpµπt,p n px, aq|Ft ns ξt n 1px, aqµπt,p n px, aq.

For all n P r Ns, let M 0 n 0 and for all t P r Ts,

µπs,p n px, aq pµπs,p n px, aq ξs n 1px, aq.

From the observation above, p M t nqt Pr T s is a martingale sequence with respect to the filtration Ft. Furthermore, as by definition |ξt n 1px, aq| ď 2,

|M t n M t 1 n | ď ÿ

µπt,p n px, aq pµπt,p n px, aq ξt n 1px, aq ˇˇˇˇ ď 2|X|.

Therefore, by Azuma-Hoeffding, we have that for any ε ą 0,

P M T n ě ε ď exp ˆ ε2

Applying the union bound on all n P r Ns, we then have that for any δ P p0, 1q, with probability at least 1 δ,

M T n ď 2|X|

holds simultaneously for all n P r Ns.

Substituting this inequality into term p2q and summing over n P r Ns and i P rn 1s, we obtain, with probability at least 1 δ, that

i 0 M T i ď 2|X|N 2 c

Final step. Combining the upper bounds for term p1q from Eq. (14) and term p2q from Eq. (15), we obtain, with probability at least 1 2δ, that

x,a µπt,p i px, aqξt i 1px, aq ď 3|X|N 2 d

2|A|T log ˆ|X||A|NT

concluding the proof.

Lemma A.2. For any 0 ă δ ă 1,

n 0 p N nq ÿ

µπt,p n px, aq a

maxt1, N tnpx, aqu ď 3N 2a

|X||A|T |X|N 2 c

holds with probability at least 1 δ.

Proof. Recall that we denote by pxt n, at nqn Pt0,r Nsu the trajectory of the agent during episode t, when playing policy πt, and that we define by pµπt,ppx, aq : 1tpxtn,atnq px,aqu as the empirical state-action distribution computed from the trajectory of

Online Episodic Convex Reinforcement Learning

the agent. We consider the following decomposition:

n 0 p N nq ÿ

µπt,p n px, aq a

maxt1, N tnpx, aqu

n 0 p N nq ÿ

pµπt,p n px, aq a

maxt1, N tnpx, aqu looooooooooooooooooooooooomooooooooooooooooooooooooon

n 0 p N nq ÿ

µπt,p n pµπt,p n px, aq

maxt1, N tnpx, aqu looooooooooooooooooooooooomooooooooooooooooooooooooon

Term p1q analysis. Using the same decomposition of term p1q of Lem. A.1 in Eq. (13) we have that

n 0 p N nq ÿ

pµπt,p n px, aq a

maxt1, N tnpx, aqu ď 3N 2a

|X||A|T. (17)

Term p2q analysis. The analysis of term p2q follows a similar approach to the analysis of term p2q in Lem. A.1, with the key difference being that, instead of carrying the term related to the difference between the true probability transition and the estimated one, we now have the term 1{ a

max t1, N tnpx, aqu.

Let Ft : σpo1, . . . , ot 1q be the filtration generated by the trajectories of the agent from the first episode, up to the end of episode t 1. Note that 1{ a

max t1, N tnpx, aqu is Ft measurable, as it only depends from observations of time step n up to episode t 1. Therefore,

max t1, N tnpx, aqupµπt,p n px, aq|Ft ns 1{ a

max t1, N tnpx, aquµπt,p n px, aq.

For all n P r Ns, let M 0 n 0 and for all t P r Ts,

s 1 p N nq ÿ

µπs,p n px, aq pµπs,p n px, aq 1{ a

max t1, N tnpx, aqu.

From the observation above, p M t nqt Pr T s is a martingale sequence with respect to the filtration Ft. Furthermore, as by definition |1{ a

max t1, N tnpx, aqu| ď 1,

|M t n M t 1 n | ď p N nq ÿ

µπt,p n px, aq pµπt,p n px, aq ξt npx, aq ˇˇˇˇ ď p N nq|X|.

Therefore, by Azuma-Hoeffding, we have that for any ε ą 0,

P M T n ě ε ď exp ˆ ε2

2|X|2p N nq2T

Applying the union bound on all n P r Ns, we then have that for any δ P p0, 1q, with probability at least 1 δ,

M T n ď |X|N

Summing over n P r Ns, we have that with probability at least 1 δ,

n 0 M T n ď |X|N 2 c

Online Episodic Convex Reinforcement Learning

Joining terms p1q and p2q. To conclude, we replace the final upper bounds of the terms p1q and p2q of Eq. (17) and (18) respectively in the decomposition of Eq. (16), and we obtain that, for any δ P p0, 1q, with probability at least 1 δ,

n 0 p N nq ÿ

µπt,p n px, aq a

maxt1, N tnpx, aqu ď 3N 2a

|X||A|T |X|N 2 c

concluding the proof.

Lemma A.3. For all n P r Ns, px, a, x1q P X ˆ A ˆ X, and t P r Ts, let ppt n 1px1|x, aq be defined as in Eq. (7). Hence,

}ppt 1 n 1p |x, aq ppt n 1p |x, aq}1 ď 1txtn x,atn au maxt1, N t 1 n px, aqu.

Proof. From the definition of the estimator ppt, we have that

ppt 1 n 1px1|x, aq 1 maxt1, N t 1 n px, aqu

N t npx, aqppt n 1px1|x, aq 1txt n 1 x1,xtn x,atn au .

|ppt 1 n 1px1|x, aq ppt n 1px1|x, aq| 1 maxt1, N t 1 n px, aqu

ˇˇˇ1txt n 1 x1,xt n x,at n au ppt n 1px1|x, aq N t 1 n px, aq N t npx, aq ˇˇˇ

1 maxt1, N t 1 n px, aqu

ˇˇˇ1txt n 1 x1,xtn x,atn au ppt n 1px1|x, aq1txtn x,atn au ˇˇˇ.

Summing over x1 P X we then have that

}ppt 1 n 1p |x, aq ppt n 1p |x, aq}1 ď 1txtn x,atn au maxt1, N t 1 n px, aqu,

concluding the proof.

Lemma A.4. For pn, x, aq P r NsˆX ˆA, let pqtqt Pr T s be a sequence of probability transition kernels with qt : pqt nqn Pr Ns such that

}qt 1 n p |x, aq qt np |x, aq}1 ď c1txt n 1 x,at n 1 au max t1, N t 1 n 1px, aqu

for some constant c ą 0. Then, Tÿ

t 1 }qt 1 n p |x, aq qt np |x, aq}1 ď ec logp Tq .

Proof. We have that

t 1 }qt 1 n p |x, aq qt np |x, aq}1 ď c

1txt n 1 x,at n 1 au max t1, N t 1 n 1px, aqu c

NT 1 n 1 px,aq ÿ

t ď c logpe Tq T ě2 ď ec logp Tq .

Lemma A.5. Let pqtqt Pr T s be a sequence of probability transition kernels, i.e., qt : pqt nqn Pr Ns such that for any state-action pair px, aq and any step n P r Ns, řT t 1 }qt 1 n p |x, aq qt np |x, aq}1 ď c logp Tq for some constant c ą 0. Then, for any sequence of policies pπtqt Pr T s, Tÿ

t 1 }µπt,qt 1 µπt,qt}8,1 ď c|X||A|N logp Tq .

While for a fixed policy π, Tÿ

t 1 }µπ,qt 1 µπ,qt}8,1 ď c|X|N logp Tq .

Online Episodic Convex Reinforcement Learning

Proof. Using Lem. B.1 we obtain that

t 1 }µπt,qt 1 µπt,qt}8,1 ď

t 1 sup n Pr Ns

i px, aq}qt 1 i 1p |x, aq qt i 1p |x, aq}1

x,a µπt,qt n px, aq}qt 1 n 1p |x, aq qt n 1p |x, aq}1

t 1 }qt 1 n 1p |x, aq qt n 1p |x, aq}1 ď c|X||A|N logp Tq .

While for a fixed policy π,

t 1 }µπ,qt 1 µπ,qt}8,1 ď

x,a µπ,qt n px, aq}qt 1 n 1p |x, aq qt n 1p |x, aq}1

x,a πnpa|xq}qt 1 n 1p |x, aq qt n 1p |x, aq}1

x,a πnpa|xq logp Tq c|X|N logp Tq .

Lemma A.6. Consider a sequence of policies pπtqt Pr T s, and define a smoothed version of each policy πt for all t P r Ts as πt : p1 αtqπt αt

|A|, where αt P p0, 1q. Let p and q be two probability transition kernels, denoted as p : ppnqn Pr Ns and q : pqnqn Pr Ns, respectively. Therefore, for all t P r Ts,

}µπt,p µ πt,q}8,1 ď

x,a µπt,p i px, aq}pi 1p |x, aq qi 1p |x, aq}1 2Nαt.

Proof. See Lem. D.4 from (Moreno et al., 2024).

A.2. Building a closed-form solution for each OMD iteration

In this subsection we argue that the MD optimization problem solved at each iteration in Lem. 2.1 has a closed-form solution. Define the convex function Gtpµq : τxzt, µy Γpµ, µtq, for τ ą 0.

Optimizing a convex objective function over policies is equivalent to optimizing it over state-action distributions in Mp µ0. Therefore, the optimization problem solved in Lem. 2.1 over the state-action distributions induced by qt 1 is equivalent to minimizing the same function over the space of policies:

min µPMqt 1 µ0 Gtpµq looooooomooooooon

piq: state-action problem

min πPp Aq XˆN Gtpµπ,qt 1q looooooooooooomooooooooooooon

piiq: policy problem

In Thm. 4.1 of (Moreno et al., 2024), it is shown that for each episode t P r Ts, an optimal policy for the problem

min πPp Aq XˆN Gtpµπ,qt 1q : τxzt, µy Γpµ, µtq, (20)

defined in Eq. (19), denoted by πt 1, can be computed using an auxiliary sequence of functions p Qt nqn Pr Ns, where Qt n : X ˆ A Ñ R. The sequence starts with Qt Npx, aq zt Npx, aq, and for n P t N, . . . , 1u, the following recursion is

Online Episodic Convex Reinforcement Learning

πt 1 n 1pa|xq πt n 1pa|xq exp τ Qt n 1px, aq

a1 πt n 1pa1|xq exp τ Qt n 1px, a1q ,

Qt npx, aq zt npx, aq ÿ

x1 qt 1 n 1px1|x, aq ÿ

a1 πt 1 n 1pa1|x1q 1

τ log πt 1 n 1pa1|x1q πt n 1pa1|x1q

Qt n 1px1, a1q ȷ .

The core idea of the proof is to show that, due to the specific divergence used (defined in Eq. (4)), Eq. (20) can be solved using dynamic programming. For further details, the reader is referred to Appendix B of (Moreno et al., 2024). A similar result was also obtained by (Cammardella et al., 2023), though they approached the optimization problem using Lagrangian multipliers instead of dynamic programming.

Problem piq of Eq. (19) is convex, and the theoretical analysis are given in Lem. 2.1. Thanks to the equivalence between problems piq and piiq in Eq. (19), we can use the analysis of problem piq to provide theoretical guarantees for the closed-form solution policy of problem piiq.

B. Missing Results and Proofs

Lemma B.1. For any strategy π P p Aq XˆN, for any two probability kernels p ppnqn Pr Ns and q pqnqn Pr Ns such that pn, qn : X ˆ A ˆ X Ñ r0, 1s, and n P r Ns,

}µπ,p n µπ,q n }1 ď

x,a µπ,p i px, aq}pi 1p |x, aq qi 1p |x, aq}1.

Proof. From the definition of a state-action distribution sequence induced by a policy π in a probability kernel p in Eq. (2), we have that for all px, aq P X ˆ A and n P r Ns,

µπ,p n px, aq ÿ

x1,a1 µπ,p n 1px1, a1qpnpx|x1, a1qπnpa|xq.

}µπ,p n µπ,q n }1 ÿ

ˇˇµπ,p n px, aq µπ,q n px, aq ˇˇ

ˇˇµπ,p n 1px1, a1qpnpx|x1, a1q µπ,q n 1px1, a1qqnpx|x1, a1q ˇˇπnpa|xq

ˇˇµπ,p n 1px1, a1qpnpx|x1, a1q µπ,q n 1px1, a1qqnpx|x1, a1q ˇˇ

ˇˇµπ,p n 1px1, a1qpnpx|x1, a1q µπ,p n 1px1, a1qqnpx|x1, a1q

µπ,p n 1px1, a1qqnpx|x1, a1q µπ,q n 1px1, a1qqnpx|x1, a1q ˇˇ

x1,a1 µπ,p n 1px1, a1q}pnp |x1, a1q qnp |x1, a1q}1 ÿ

ˇˇµπ,p n 1px1, a1q µπ,q n 1px1, a1q ˇˇ

x1,a1 µπ,p n 1px1, a1q}pnp |x1, a1q qnp |x1, a1q}1 }µπ,p n 1 µπ,q n 1}1.

Since for n 0, }µπ,p 0 µπ,q 0 }1 0, by induction we get that

}µπ,p n µπ,q n }1 ď

x1,a1 µπ,p i px1, a1q}pi 1p |x1, a1q qi 1p |x1, a1q}1.

Online Episodic Convex Reinforcement Learning

B.1. Proof of Prop. 3.1

Proof. In the analysis, we explicitly write the term n 0 separately from the other n P r Ns. We begin with the following decomposition:

t 1 xbt, µπt,pty

t 1 xbt 0, µ0y

t 1 xbt, µπt,pt µπt,py loooooooooooomoooooooooooon

t 1 xbt, µπt,py

t 1 xbt 0, µ0y loooooooooooooooomoooooooooooooooon

Term p1q analysis. Using Holder s inequality, we have that

n 1 }bt n}8}µπt,pt n µπt,p n }1.

From the definition of the bonus sequence, we have that for all n P r Ns, }bt n}8 ď Lp N nq Cδ. Hence,

x,a µπt,p i px, aq}pi 1p |x, aq ppt i 1p |x, aq}1

ď LCδ|X|N 3 3

2|A|T log ˆ|X||A|NT

where the first inequality comes from Lem. B.1, and the second inequality is achieved for any δ P p0, 1q, with probability at least 1 2δ, using Lem. A.1.

Term p2q analysis. Using the definition of the bonus sequence in equation (11), and recalling that the initial state-action distribution µ0 is always the same, we have that, for any δ P p0, 1q, with probability at least 1 δ,

n 0 p N nq ÿ

µπt,p n px, aq a

maxt1, N tnpx, aqu

ď LCδN 2 3 a

|X||A|T |X|

where the inequality comes from Lem. A.2.

Joining the upper bounds in term p1q and p2q. Putting both upper bounds together we get that for any δ P p0, 1q, with probability at least 1 3δ, and from the definition of Cδ,

t 1 xbt, µπt,pty

t 1 xbt 0, µ0y ď LCδ|X|N 3 3

2|A|T log ˆ|X||A|NT

|X||A|T |X|

O ˆ LN 3|X|3{2a

|A|T log ˆ|X||A|NT

B.2. Proof of Thm. 3.2 (Main result)

For proving the main result we join together all the pieces we presented in the main paper and the appendix.

Online Episodic Convex Reinforcement Learning

Proof. We start by decomposing the regret and using the convexity of the objective function obtaining that

t 1 F tpµπt,pq F tpµπt,ptq

t 1 F tpµπt,ptq F tpµπ,pq

t 1 x F tpµπt,pq, µπt,p µπt,pty looooooooooooooooooomooooooooooooooooooon

t 1 x F tpµπt,ptq, µπt,pt µπ,py looooooooooooooooooomooooooooooooooooooon

We analyse each term separately:

Analysis of RMDP T . We begin by analyzing the term RMDP T , which represents the cost incurred due to not knowing the true probability kernel. First, we apply Hoeffding s inequality, and the fact that f t n is L-Lipschitz with respect to the norm } }1. Following, we apply Lem. B.1, obtaining that

n 1 } f t npµπt,pq}8}µπt,p n µπt,pt n }1

x,a µπt,p i px, aq}pi 1p |x, aq ppt i 1p |x, aq}1.

We can now apply Lem. A.1 to obtain that for any δ P p0, 1q, with probability at least 1 2δ,

RMDP T ď L3|X|N 2 d

2|A|T log ˆ|X||A|NT

Analysis of Rpolicy T . To analyse Rpolicy T we further decompose it as

t 1 x F tpµπt,ptq bt, µπt,pt µπ,pty looooooooooooooooooooooomooooooooooooooooooooooon

Rpolicy/MD T

t 1 xbt, µπt,pt µπ,pty

t 1 x F tpµπt,ptq, µπ,pt µπ,py loooooooooooooooooooooooooooooooooooomoooooooooooooooooooooooooooooooooooon

Rpolicy/bonus T

where recall that bt : pbt nqn Pr Ns is the bonus vector defined in Eq. (11).

Analysis of Rpolicy/MD T . We begin by addressing the term that accounts for the regret incurred by using online Mirror Descent with changing constraint sets.

From Lemmas A.3 and A.4, we know that the probability sequence ppptqt Pr T s satisfies the condition that řT t 1 }ppt 1 n ppt n}1 ď c logp Tq for c e. Additionally, at each time step t, since F t is LF -Lipschitz with respect to the norm } }8,1, we have } F tpµq}1,8 ď LF LN for any state-action distribution µ. From the definition of the bonus vector, we also have that }bt}1,8 ď LN 2Cδ. Consequently, } F tpµq bt}1,8 ď 2LN 2Cδ. Therefore, as we compute µt 1 by solving

µt 1 : arg min

µPM pt 1 µ0

! τx F tpµπt,ptq bt, µy Γpµ, µtq ) ,

by applying Lem. 2.1 with νt : µπ,pt, ζ 2LN 2Cδ, and the sequence of probability transition kernels ppptqt Pr T s, we

obtain that for the optimal parameter τ b

b ζ2T , where

b : N ˆ logp Tq e|X||A| 4 logp|A|q 2Ne|X| logp Tq2 logp|A|q ,

Online Episodic Convex Reinforcement Learning

and recalling that Cδ : b

2|X| log |X||A|NT{δq,

Rpolicy/MD T ď 2ζ ?

b T ζNe|X| logp Tq

2|X| log ˆ|X||A|TN

b T N logp Tqe|X|

O ˆ LN 2|X|

T log ˆ|X||A|NT

N|A| logp Tq N a

logp|A|q logp Tq .

Analysis of Rpolicy/bonus T . We start by analysing the second term of the sum in Rpolicy/bonus T . For any δ P p0, 1q, with probability at least 1 δ, we have that

t 1 x F tpµπt,ptq, µπ,pt µπ,py ď

n 1 } f t npµπ,pq}8}µπ,p n µπ,pt n }1

i px, aq}pi 1p |x, aq ppt i 1p |x, aq}1

i px, aq Cδ a

maxt1, N t i px, aqu

n 0 p N nq ÿ

x,a µπ,pt n px, aq Cδ a

maxt1, N tnpx, aqu

t 1 xbt, µπ,pty

t 1 xbt 0, µ0y

where the first inequality follows from Holder s inequality, the second from the fact that f t n is L-Lipschitz with respect to the norm } }1 and Lem. B.1, the third from the concentration bound in Lem. 2.2 where we define

2|X| log |X||A|NT{δq, and the last equality comes from the definition of the bonus vector in Eq. (11).

Replacing it at the Rpolicy/bonus T term we have that

Rpolicy/bonus T

t 1 xbt, µπt,pt µπ,pty

t 1 x F tpµπt,ptq, µπ,pt µπ,py

t 1 xbt, µπt,pt µπ,pty

t 1 xbt, µπ,pty

t 1 xbt 0, µ0y

t 1 xbt, µπt,pty

t 1 xbt 0, µ0y.

Lastly, we apply Prop. 3.1 to achieve that, for any δ P p0, 1q, with probability at least 1 4δ,

Rpolicy/bonus T ď

t 1 xbt, µπt,pty

t 1 xbt 0, µ0y O ˆ LN 3|X|3{2a

|A|T log ˆ|X||A|NT

Final upper bound on Rpolicy T . Joining the upper bounds on Rpolicy/MD T and Rpolicy/bonus T from Eq.s (22) and (23) respectively, we achieve that for any δ P p0, 1q, with probability at least 1 4δ, ignoring logarithmic terms,

Rpolicy T ď O LN 3|X|3{2a

|A|T . (24)

Online Episodic Convex Reinforcement Learning

Joining the upper bounds on RMDP T and on Rpolicy T . Note that the terms in the upper bound on Rpolicy T from Eq. (24) dominate those in the upper bound on RMDP T from Eq. (21). Therefore, when combining both terms to complete the upper bound on the regret, we obtain that, with probability 1 6δ,

RT pπq ď O LN 3|X|3{2a

concluding the proof.

C. Proof of Lem. 2.1: Online Mirror Descent with Varying Constraint Sets

Before stating the proof we recall a few results from Bregman divergences and in particular the divergence Γ defined in Eq. (4) that are used throughout the proof.

To simplify notation, for any probability measure η P E, where E is any finite space, we define the neg-entropy function, using the convention that 0 logp0q 0, as ϕpηq : ř

x PE ηpxq log ηpxq. For any µ : pµnqn Pr Ns P p XˆAq N, we define ρnpxq : ř

a PA µnpx, aq for all n P r Ns and x P X, representing the marginal distribution over the state space. The function inducing the divergence Γ, defined in Eq. (4), is given by

n 1 ϕpρnq. (25)

By definition of a Bregman divergence, for any two probability transition kernels p, q, for all µ P Mp µ0 and µ1 P Mq, µ0 , where Mq, µ0 is the subset of Mq µ0 where the corresponding policies π satisfy πnpa|xq 0, we then have that

Γpµ, µ1q : ψpµq ψpµ1q x ψpµ1q, µ µ1y. (26)

Additionally, for any probability transition kernel p, the function ψ is 1-strongly convex with respect to } }8,1 within Mp µ0 (see Thm. 4.1 from (Moreno et al., 2024)). Consequently, a consequence from a known property of Bregman divergences (Shalev-Shwartz, 2012) is that, for any µ P Mp µ0 and µ1 P Mp, µ0 ,

Γpµ, µ1q ě 1

2}µ µ1}2 8,1. (27)

Lemma. Let pqtqt Pr T s be a sequence of probability transition kernels, and pztqt Pr T s a sequence of vectors in RNˆ|X|ˆ|A|, such that }zt}1,8 ď ζ for all t P r Ts. Initialize π1 npa|xq : 1{|A| as the uniform policy. For every t P r Ts, let πt : p1 αtqπt αt|A| 1 be a smoothed version of the policy with αt : 1{pt 1q and µt : µ πt,qt. For each t P r Ts, compute iteratively µt 1 P arg min

µPMqt 1 µ0 τxzt, µy Γpµ, µtq. (28)

Hence, there is a τ ą 0 such that, for any sequence pνtqt Pr T s, with νt : νπ,qt for a common policy π,

řT t 1xzt, µt νty ď O ζN a

VT |X| logp|A|q T logp Tq ,

where VT ě 1 maxpn,x,aq řT 1 t 1 }qt np |x, aq qt 1 n p |x, aq}1.

Proof. Throughout this proof, for all t P r Ts we denote by πt the policy inducing µt, meaning that µt : µπt,qt and µt : µ πt,qt. We assume here that maxpn,x,aq řT 1 t 1 }qt np |x, aq qt 1 n p |x, aq}1 ď c logp Tq for c a constant, as this is the case for all the transition estimators we use to obtain the main results of the article.

As Mqt 1 µ0 is a convex set (only linear constraints), the optimality conditions and the definition of a Bregman divergence in Eq. (26) imply that for all νt 1 P Mqt 1 µ0 ,

xτzt ψpµt 1q ψp µtq, νt 1 µt 1y ě 0.

Online Episodic Convex Reinforcement Learning

Re-arranging the terms and using the three points inequality for Bregman divergences (Bubeck, 2015) we get that,

τxzt, µt 1 νt 1y ď x ψpµt 1q ψp µtq, νt 1 µt 1y Γpνt 1, µtq Γpνt 1, µt 1q Γpµt 1, µtq.

Therefore, by adding and subtracting τxzt, µt νty on the left-hand side,

τxzt, µt 1 νt 1y τxzt, µt νty τxzt, µt νty ď Γpνt 1, µtq Γpνt 1, µt 1q Γpµt 1, µtq

ñ τxzt, µt νty ď τxzt, νt 1 νty τxzt, µt µt 1y Γpνt 1, µtq Γpνt 1, µt 1q Γpµt 1, µtq.

Then, by summing over t P r Ts, we obtain that

t 1 xzt, µt νty ď 1

τxzt, µt µt 1y Γpµt 1, µtq

loooooooooooooooooooooooomoooooooooooooooooooooooon

Γpνt 1, µtq Γpνt 1, µt 1q

loooooooooooooooooooooomoooooooooooooooooooooon

t 1 xzt, νt 1 νty loooooooooomoooooooooon

The term A arises due to our lack of knowledge of zt at the beginning of episode t for all episodes (adversarial losses hypothesis). To address this, we employ Young s inequality and the strong convexity of Γ. For the term B, in the classic Online Mirror Descent proof (Shalev-Shwartz, 2012), where the set of constraints is fixed, the sum of the differences between the Bregman divergences telescopes (as would be the case with a fixed ν). However, because we are dealing with time-varying constraint sets, this telescoping effect does not occur in our situation. We will now proceed to derive an upper bound for each term, starting with term C that is straightforward.

Step 0: upper bound on C. Applying Holder s inequality, Lem. A.5 with a fixed policy π, and the hypothesis that }zt}1,8 ď ζ,

t 1 xzt, νt 1 νty ď

t 1 }zt}1,8}νt 1 νt}8,1 ď ζc|X|N logp Tq. (30)

Step 1: upper bound on B. We now analyse the second term of the sum in Eq. (29). To make the Bregman divergence terms telescope we add and subtract Γpνt, µtq Γpνt, µtq, obtaining

t 1 Γpνt 1, µtq Γpνt 1, µt 1q

t 1 Γpνt 1, µtq Γpνt, µtq loooooooooooooooomoooooooooooooooon

t 1 Γpνt, µtq Γpνt, µtq loooooooooooooomoooooooooooooon

t 1 Γpνt, µtq Γpνt 1, µt 1q looooooooooooooooomooooooooooooooooon

We analyze each term. Using the definition of a Bregman divergence induced by ψ in Eq. (26) we get that

t 1 ψpνt 1q ψp µtq x ψp µtq, νt 1 µty ψpνtq ψp µtq x ψp µtq, νt µty

t 1 ψpνt 1q ψpνtq

t 1 x ψp µtq, νt νt 1y

t 1 } ψp µtq}1,8}νt νt 1}8,1,

Online Episodic Convex Reinforcement Learning

where in the last inequality we use the telescoping nature of the first term and applied H older s inequality to the second term. Recall that for v : pvnqn Pr Ns such that vn P RXˆA, we defined }v}8,1 : supn Pr Ns }vn}1. We now also define

}ω}1,8 : supvt|xω, vy|, }v}8,1 ď 1u řN n 1 supx,a |ωnpx, aq| as the respective dual norm.

With our choice of Bregman divergence, and given that

πt : p1 αtqπt αt 1 |A|,

for each n P r Ns, px, aq P X ˆ A, | ψp µtqpn, x, aq| | logp πt npa|xqq| ď logp|A|{αtq.

From the Lemma hypothesis, there is a common policy π such that for all t P r Ts, νt : νπ,qt. Hence, from the result above,

piq ď ψpν1q

t 1 N log ˆ|A|

}νπ,qt νπ,qt 1}8,1

ď ψpν1q N log ˆ |A| mint Pr T s αt

c|X|N logp Tq,

where the last inequality comes from Lem. A.5 with a fixed π.

As for the second term, using our definition of Γ, we obtain that

n,x,a νπ,qt n px, aq log ˆπnpa|xq

n,x,a νπ,qt n px, aq log ˆπnpa|xq

n,x,a νπ,qt n px, aq log ˆπt npa|xq πtnpa|xq

n,x,a νπ,qt n px, aq log ˆ πt npa|xq p1 αtqπtnpa|xq αt{|A|

t 1 p logp1 αtqq ď 2N

where the last inequality is valid if 0 ď αt ď 0.5.

The third term telescopes, hence, since ΓpνT 1, µT 1q ď 0 because a Bregman divergence is always non-negative,

piiiq ď Γpν1, µ1q.

Before adding back the three terms, note that, for π1 npa|xq 1{|A|, we have Γpν1, µ1q ψpν1q ψpµ1q. Furthermore, ψpµ1q ď N logp|A|q. Therefore, Γpν1, µ1q ψpν1q ď N logp|A|q. (32)

Summing over our bounds and using the Inequality (32), we get that B is upper bounded as

Γpνt 1, µtq Γpνt 1, µt 1q ď 1

piq piiq piiiq

τ logp|A|q N 2c|X|

τ log ˆ |A| mint Pr T s αt

Step 2: Upper bound on A. It remains to upper bound term A from Eq. (29),

t 1 τxzt, µt µt 1y Γpµt 1, µtq ȷ , (34)

Online Episodic Convex Reinforcement Learning

representing what we pay for not knowing the loss function in advance. For that we use Young s inequality (Beck & Teboulle, 2003): for any σ ą 0 to be optimized later, and for each episode t P r Ts,

τxzt, µt µt 1y Γpµt 1, µtq ď τ 2}zt}2 1,8 2σ σ

2 }µt µt 1}2 8,1 Γpµt 1, µtq. (35)

From the definition of Γ in Eq. (4), we have that

Γpµt 1, µtq

px,aq µt 1 n px, aq log ˆπt 1 n pa|xq πtpa|xq

Γpµt 1, µ πt,qt 1q.

From the strong convexity of ψ, as µt 1 P Mqt 1 µ0 and µ πt,qt 1 P Mqt 1, µ0 , we then have from Eq. (27) that

Γpµt 1, µtq Γpµt 1, µ πt,qt 1q ě 1

2}µt 1 µ πt,qt 1}2 8,1. (36)

Using the fact that for any vectors a, b, c P Rd and for any norm } }, the inequality }a b}2 ď 2p}a c}2 }b c}2q holds, we then have by Eq. (36)

1 4}µt µt 1}2 8,1 Γpµt 1, µtq ď 1

4}µt µt 1}2 8,1 1

2}µ πt,qt 1 µt 1}2 8,1

}µt µ πt,qt 1}2 8,1 }µ πt,qt 1 µt 1}2 8,1 1

2}µ πt,qt 1 µt 1}2 8,1

2}µt µ πt,qt 1}2 8,1.

For any n P r Ns, we have }µt n µ πt,qt 1 n }1 ď 2. Using this result along with Lem. A.6 for p ppt and q ppt 1, we derive the first inequality below. To obtain the second inequality, we apply Lem. A.5 with the sequence of policies pπtqt Pr T s.

t 1 }µt µ πt,qt 1}2 8,1 ď 2

t 1 sup n Pr Ns

x,a µt ipx, aq}qt i 1p |x, aq qt 1 i 1p |x, aq}1 4N

ď 2c|X||A|N logp Tq 4N

Therefore, summing Eq. (35) over t P r Ts with σ 1{2, and plugging the inequality above, yields

t 1 τxzt, µt µt 1y Γpµt 1, µtq ď τ 2 Tÿ

t 1 }zt}2 1,8 c|X||A|N logp Tq 2N

Using that }zt}1,8 ď ζ and dividing by τ entails:

ˆ c|X||A| logp Tq 2

Conclusion. Finally, by replacing the final bounds of Eqs. (33), (39), and (30), we obtain

t 1 xzt, µt νty ď A B C

ˆ c|X||A| logp Tq 2

τ log ˆ |A| mint Pr T s αt

t 1 αt ζNc|X| logp Tq.

Online Episodic Convex Reinforcement Learning

In particular, for αt 1{pt 1q,

t 1 xzt, µt νty ď τTζ2 1

τ N logp Tq c|X||A| 4 logp|A|q 2Nc|X| logp Tq2 logp|A|q ȷ

loooooooooooooooooooooooooooooooooooooooooomoooooooooooooooooooooooooooooooooooooooooon

ζNc|X| logp Tq.

Optimising over τ b

t 1 xzt, µt νty ď 2ζ ?

b T ζNc|X| logp Tq O ζ a

c N|X||A|T logp Tq ζN a

c|X| logp|A|q T logp Tq ζNc|X| logp Tq ,

concluding the proof.

D. Bandit Feedback with Bonus in RL

Notations. Throughout App. D, we define the trajectory observed by the learner in episode t as ot : pxt n, at n, ℓt npxt n, at nqqn Pr Ns. Let Ft denote the σ-algebra generated by the observations up to episode t, i.e., Ft : σpo1, . . . , ot 1q. We use Et to represent the conditional expectation with respect to the observations up to episode t, i.e., Etr s : Er |Fts.

Overview of existing approaches. To adapt Alg. 1 to the bandit case, we need to estimate the loss function for each MD update. A classic choice using importance sampling is:

pℓt npx, aq ℓt npx, aq

µπt,p n px, aq 1txtn x,atn au.

This update is unbiased, as Er1txtn x,atn aus Er Er1txtn x,atn au|Ftss µπt,p n px, aq. However, since we do not know the true transition probability p, we cannot use this estimate directly. In (Rosenberg & Mansour, 2019a), they use µπt,pt

with UC-O-REPS and achieve a regret of Op T 3{4q.

Consider the following confidence set, that is further detailed in Eq. (41),

Ωt : tq | |qnpx1|x, aq ppt npx1|x, aq| ď εt npx1|x, aq, @px, a, x1q P X ˆ A ˆ X, n P r Nsu.

In (Jin et al., 2020), the authors incorporate a parameter γ for implicit exploration, an idea from multi-armed bandits (Neu, 2015), and use the following estimate:

pℓt npx, aq ℓt npx, aq µtnpx, aq γ 1txtn x,atn au, (40)

where µt npx, aq : maxq PΩt µπt,q. Although this is a biased estimate (µπt,p n px, aq ď µnpx, aq), Ωt is constructed to ensure that the bias introduced is reasonably small. They also argue that µ can be computed efficiently through dynamic programming. They demonstrate that running UC-O-REPS from (Rosenberg & Mansour, 2019b) with this estimate achieves Op ?

Tq regret, improving upon previous results.

In Alg. 2 we detail our method for solving the RL problem with adversarial losses, unknown probability transitions and bandit feedback. We proceed to the regret analysis.

D.1. Auxiliary lemmas

Lemma D.1 (Lem. A.2 of (Luo et al., 2021), adapted from Lem. 1 of (Neu, 2015)). Let pzt npx, aqqt Pr T s be a sequence of functions Ft-measurable, such that zt npx, aq P r0, Rs for each px, aq P X ˆ A, and n P r Ns. Let Zt npx, aq P r0, Rs be a random variable such that Etr Zt npx, aqs zt npx, aq. Then with probability 1 δ,

ˆ1txtn x,atn au Zt npx, aq µtnpx, aq γ µπt,p n px, aqzt npx, aq µtnpx, aq

Online Episodic Convex Reinforcement Learning

Algorithm 2 MD-CURL with Additive Bonus for Bandit feedback RL

1: Input: number of episodes T, initial policy π1 P Π, initial state-action distribution µ0, initial state-action distribution sequence µ1 µ1 µπ1,p1 with pp1 np |x, aq 1{|X| for all pn, x, aq, learning rate τ ą 0, exploration parameter γ τ (tuned in the proof of Thm. 4.1), sequence of parameters pαtqt Pr T s with αt 1{pt 1q. 2: Init.: @pn, x, a, x1q, N 1 npx, aq M 1 npx1|x, aq 0 3: for t 1, . . . , T do 4: Agent starts at pxt 0, at 0q µ0p q 5: for n 1, . . . , N do 6: Env. draws new state xt n pnp |xt n 1, at n 1q 7: Update counts

N t 1 n 1pxt n 1, at n 1q Ð N t n 1pxt n 1, at n 1q 1

M t 1 n 1pxt n|xt n 1, at n 1q Ð M t n 1pxt n|xt n 1, at n 1q 1

8: Agent chooses an action at n πt np |xt nq 9: Observe local loss ℓt npxt n, at nq 10: end for 11: Update transition estimate for all pn, x, a, x1q: ppt 1 n px1|x, aq : M t 1 n px1|x,aq maxt1,N tnpx,aqu 12: Compute bonus sequence for all pn, x, aq: bt npx, aq : p N nq Cδ ?

max t1,N t 1 n px,aqu 13: Compute optimistic state-action distribution for all pn, x, aq: µt npx, aq : maxq PΩt µπt,q, where Ωt is defined as in Eq. (41)

14: Compute loss estimate for all pn, x, aq: pℓt npx, aq ℓt npx,aq µtnpx,aq γ 1txtn x,atn au 15: Compute policy πt 1 n px, aq by solving

µt 1 P arg min

µPM pt 1 µ0

τxpℓt bt, µy Γpµ, µtq ( ,

which has a closed-form solution for πt 1 (see Sec. A.2) 16: Compute πt 1, the smooth version of πt 1:

πt 1 p1 αtqπt 1 αt{|A|

and the associated state-action distribution µt 1 : µ πt 1,pt 1

17: end for

Online Episodic Convex Reinforcement Learning

We define the confidence interval used in Alg. 2 as

Ωt : tq ˇˇ|qnpx1|x, aq ppt npx1|x, aq| ď εt npx1|x, aq, for all px, a, x1q P X ˆ A ˆ X, n P r Nsu, (41)

εt npx1|x, aq : 2

g f f e pptnpx1|x, aq log T N|X||A|

maxt1, N t n 1px, aqu 14 log T N|X||A|

3 maxt1, N t n 1px, aqu.

We present two results regarding this confidence set. The first result, based on the empirical Bernstein inequality, shows that the true probability kernel p belongs to this confidence set with high probability. The second is a key lemma from (Jin et al., 2020), which explains how the confidence set shrinks over time. For the proofs, the reader is referred to the original references.

Lemma D.2 (Empirical Bernstein inequality, Thm. 4 (Maurer & Pontil, 2009)/Lem. 2 (Jin et al., 2020)). With probability at least 1 4δ, we have that p P Ωt for all t P r Ts.

Lemma D.3 (Lem. 4 (Jin et al., 2020)). With probability at least 1 6δ, for any collection of transition functions ppx,tqx PX such that px,t P Ωt for all x, we have

x,a |µπt,px,t n px, aq µπt,p n px, aq| O ˆ N 2|X|

AT log TN|X||A|

D.2. Proof of Thm. 4.1

Proof. We start by decomposing the static regret with respect to any policy π P p Aq XˆN as follows

t 1 xℓt, µπt,p µπt,pty loooooooooooomoooooooooooon

t 1 xℓt bt, µπt,pt µπ,pty loooooooooooooooomoooooooooooooooon

2,Rpolicy/MD T

t 1 xℓt, µπ,pt µπ,py

t 1 xbt, µπ,pty loooooooooooooooooooooomoooooooooooooooooooooon

3,Rpolicy/MD T

t 1 xbt, µπt,pty looooooomooooooon

4, Bonus term

D.2.1. TERM 1: RMDP T

The analysis of this term is already provided in App. B.2. Here, we can further leverage the fact that the objective function is linear and that, by definition, ℓt n P r0, 1s. Therefore, with probability at least 1 2δ, we have:

RMDP T ď 3|X|N 2 d

2|A|T log ˆ|X||A|NT

D.2.2. TERM 2: RPOLICY/MD T

In practice, the learner plays using the estimated loss function minus the bonus. Hence, Rpolicy/MD T accounts for both the bias introduced by the loss estimation and the standard mirror descent regret bound. We start with the following decomposition:

Rpolicy/MD T

t 1 xℓt bt, µπt,pt µπ,pty

t 1 xℓt pℓt, µπt,pt µπ,pty loooooooooooooooomoooooooooooooooon

t 1 xpℓt bt, µπt,pt µπ,pty loooooooooooooooomoooooooooooooooon

Online Episodic Convex Reinforcement Learning

Mirror Descent term in Rpolicy/MD T . We begin by analyzing the error term from applying Mirror Descent with varying constraint sets, which is similar to that of Lem. 2.1 with zt pℓt bt, and ppptqt Pr T s as the probability kernel sequence defining the varying constraint sets. However, special attention is needed for the sup norm of the subgradient term since it now involves an estimate of the loss function. Additionally, the optimal learning rate τ now also depends on both the exploration parameter γ and the analysis of the bias terms, we provide a detailed explanation of how the entire analysis is affected below.

In proving Lem. 2.1 in App. C, the regret term for Mirror Descent is split into three terms: term A in Eq. (34), term B in Eq. (31), and term C in Eq. (30). The analysis of term B remains unchanged since this term is independent of the chosen loss function. We focus on what changes for terms A and C.

As in the proof of Lem. 2.1, we use again the notation µt : µπt,pt for all t P r Ts. From Eq. (34) term A is defined as

τxpℓt bt, µt µt 1y Γpµt 1, µtq .

For a fixed t, from Young s inequality we have that

n 1 τxpℓt n bt n, µt n µt 1 n y ď

n 1 τ 2 }pℓt n bt n}2 8 2σ σ

2 }µt n µt 1 n }2 1.

Following the analysis of term A in App. C, in special Eqs. (36), (37), and (38), we obtain that for σ 1{2,

n 1 τ}pℓt n bt n}2 8 e N|X||A| logp Tq

t 1 αt. (44)

From Eq. (30), with the notation νt : νπ,pt, term C is defined as C 1

τ řT t 1 τxpℓt bt, νt 1 νty. From Young s inequality with σ 1{2 we obtain that

τ 2}pℓt bt}2 1,8 2σ 1

t 1 }νt 1 νt}2 8,1

n 1 τ}pℓt n bt n}2 8 e N|X| logp Tq

Bounding the sup norm of the estimated loss function. Recall from the definition of the bonus function in Eq. (11), with L 1, that }bt n}8 ď NCδ : b for all n P r Ns and t P r Ts. As }pℓt n bt n}2 8 ď }pℓt n}2 8 }bt n}2 8, we can focus on the term involving the sup norm of the estimated loss function.

We apply Lem. D.1 with Zt npx, aq 1txtn x,atn auℓt npx,aq2

µtnpx,aq γ and zt npx, aq µπt,p n px,aqℓt npx,aq2

µtnpx,aq γ . Note that Zt npx, aq, zt npx, aq ď

1 γ , that zt npx, aq is Ft-measurable, and that Etr Zt npx, aqs zt npx, aq. Therefore, with probability 1 δ,

n 1 }pℓt n}2 8 ď τ

x,a pℓt npx, aq2

1txt n x,at n auℓt npx, aq µtnpx, aq γ 1txt n x,at n auℓt npx, aq µtnpx, aq γ

1txtn x,atn au Zt npx, aq µtnpx, aq γ

µπt,p n px, aq µtnpx, aq µπt,p n px, aqℓt npx, aq2

µtnpx, aq γ τ 1

γ N 2γ log ˆN

Online Episodic Convex Reinforcement Learning

Lem. D.2 states that the true probability transition p P Ωt with probability 1 4δ for all t P r Ts. Hence, with probability 1 4δ, µt npx, aq ě µπt,p n px, aq. Consequently, by replacing it in the previous inequality, we obtain that with probability 1 5δ,

n 1 }pℓt n}2 8 ď τ

x,a ℓt npx, aq2 Nτ

ď τTN|X||A| Nτ

where for the last inequality we use that ℓt n P r0, 1s.

Therefore, by replacing it in the upper bound of the terms A and C in Eqs. (44) and (45) respectively, we obtain that

A C ď τ4b TN|X||A| Nτ

3e N|X||A| logp Tq

The upper bound on term B from Eq. (33) in App. C remains the same:

τ logp|A|q e N 2|X|

τ log ˆ |A| mint Pr T s αt

Thus, setting αt 1{pt 1q, the final upper bound on the Mirror Descent term is, with high probability, given by

t 1 xpℓt bt, µπt,pt µπ,pty ď A B C

ď τ4b TN|X||A| Nτ

6e N|X||A| logp Tq

τ logp|A|q e N 2|X|

τ logp|A|Tq logp Tq.

Before tuning the optimal parameter τ, we must first analyze the bias terms.

Bias terms. We now proceed to analyze the bias terms. Our approach is similar to the one used in (Jin et al., 2020), with a key difference: they utilize confidence sets in their Mirror Descent iterations, whereas we perform iterations over the set induced by ppt. We start by dividing the bias term in two:

t 1 xℓt pℓt, µπt,pt µπ,pty

t 1 xℓt pℓt, µπt,pty loooooooooomoooooooooon

t 1 xpℓt ℓt, µπ,pty loooooooooomoooooooooon

Bias 1. Since µπt,pt is Ft-measurable, we have that Etrxℓt pℓt, µπt,ptys x Etrℓt pℓts, µπt,pty. For any couple px, aq, and for any time step n P r Ns,

Etrℓt npx, aq pℓt npx, aqs ℓt npx, aq ℓt npx, aqµπt,p n px, aq µtnpx, aq γ ℓt npx, aq ˆ µt npx, aq γ µπt,p n px, aq µtnpx, aq γ

Etr Bias 1s

x,a µπt,pt n px, aqℓt npx, aq ˆ µt npx, aq γ µπt,p n px, aq µtnpx, aq γ

From Lem. D.2, and from the definition of µ, we have that with high probability, µt npx, aq ě µπt,pt n px, aq, therefore,

Etr Bias 1s ď

x,a | µt npx, aq γ µπt,p n px, aq| ď

x,a | µt npx, aq µπt,p n px, aq| γ|X||A|NT.

Online Episodic Convex Reinforcement Learning

Note that µt npx, aq maxp PΩt µπt,p n px, aq πt npa|xq maxp PΩt ρπt,p n pxq, where ρπt,p n pxq : ř

a PA µπt,p n px, aq. Therefore, for each x P X, there is a px,t P Ωt such that µt npx, aq µπt,px,t n px, aq.

From Lem. D.3 we obtain that

x,a |µπt,px,t n px, aq µπt,p n px, aq| O ˆ N 2|X|

|A|T log |X||A|NT

Etr Bias 1s O ˆ N 2|X|

|A|T log |X||A|NT

As we have that Bias 1 Etr Bias 1s řT t 1x Etrpℓts pℓt, µπt,pty, all that remains is to treat the second term of the sum. With high probability, µt npx, aq ě µπt,pt n px, aq, therefore

x,a pℓt npx, aqµπt,pt n px, aq ď

x,a ℓt npx, aq1txtn x,atn au ď N.

Thus, Azuma s inequality gives us that

t 1 x Etrpℓts pℓt, µπt,pty ď N

which is of a smaller order than the terms previously appearing in the bias bound. Hence,

Bias 1 O ˆ N 2|X|

|A|T log |X||A|NT

Bias 2. The result follows directly from Lem. 14 of (Jin et al., 2020) using µπ,pt instead of µπ,p:

t 1 xpℓt ℓt, µπ,pty O ˆN logp |X||A|N

Optimizing the learning and exploration parameters τ and γ. By joinning the Mirror Descent term from Eq. (46) along with the bounds on Bias 1 and Bias 2 terms, and setting γ τ, we obtain that

Rpolicy/MD T O ˆ τb N|X||A|T N

|X||A| logp Tq logp|A|q N|X| logp|A|Tq logp Tq ı

|A|T log |X||A|NT

τ|X||A|NT N logp |X||A|N

Let φ1 : pb 1q N|X||A|, and

φ2 : N log N

|X||A| logp Tq logp|A|q N|X| logp|A|Tq logp Tq log ˆ|X||A|N

φ2{φ1T, recalling that b : NCδ, and that Cδ : a

2|X| logp|X||A|NT{δq, we obtain that, with high probability,

Rpolicy/MD T 2 a

φ1φ2T O N 3{2|X|5{4|A| ?

T N 2|X|5{4a

|A|T . (47)

D.2.3. TERM 3: RPOLICY/MDP T

The upper bound for this term directly follows from the analysis of adding the bonus term to compensate for insufficient exploration, as discussed in Subsec. 3.2 of the main paper, and is detailed in App. B.2. Thus, we have that with high probability, Rpolicy/MDP T ď 0.

Online Episodic Convex Reinforcement Learning

D.2.4. TERM 4: BONUS TERM

The analysis of this term follows directly from Prop. 3.1 from the main paper: for any δ P p0, 1q, with probability at least 1 3δ, we have that Tÿ

t 1 xbt, µπt,pty O N 3|X|3{2a

|A|T . (48)

D.3. Final bound

Replacing the upper bound on all four terms from Eq.s (43), (47), (48), and that Rpolicy/MDP T ď 0 into the regret decomposition in Eq. (42), we obtain that playing Alg. 2 for the RL problem with bandit feedback on adversarial loss functions has, with high probability, a static regret of order

RT pπq O N 3|X|3{2a

|A|T N 3{2|X|5{4|A| ?

E. Curl with Bandit Feedback

Notation Let S and A be two positive integers. For convenience and brevity, we suppose in what follows that X r Ss and A r As. Accordingly, we will often use S and A in place of |X| and |A| respectively. Let 9A P t A, A 1u. For a vector ξ P RNS 9A and pn, x, aq P r Ns ˆ r Ss ˆ r 9As, we use ξnpx, aq as a shorthand for ξppn 1q S 9A px 1q 9A aq. Similarly, let :A P t A, A 1u. Then, for an NS 9A ˆ NS :A matrix M, we use Mpn, x, a, n1, x1, a1q to denote the item in row pn 1q S 9A px 1q 9A a and column pn1 1q S :A px1 1q :A a1 of M. For any d P Z , let 1d P Rd be the vector with all entries equal to one and Id the d ˆ d identity matrix.

E.1. An alternative representation for the decision sets

In the following, we will fix an arbitrary transition kernel p : ppnqn Pr Ns. We recall the notation that for ζ P Mp µ0, ρζ npxq : ř

a PA ζnpx, aq for pn, xq P r Ns ˆ X, which satisfies ρζ npxq ř

x1,a1PXˆA ζn 1px1, a1qpnpx|x1, a1q for n ě 2. At the first step, we define ρp 1pxq : ř

x1,a1PXˆA µ0px1, a1qp1px|x1, a1q, which satisfies ρp 1pxq ρζ 1pxq for every ζ P Mp µ0 and x P X since the initial state distribution is the same for all occupancy measures in Mp µ0.

We describe here the mapping alluded to in Sec. 4.2.1 of Mp µ0 to a lower-dimensional space where it could have a non-empty interior. This is analogous to how one can define a bijective map between the simplex d and the set tx P Rd 1 : 1 d 1x ď 1 and xi ě 0 @i P rd 1su, which is the intersection of the positive orthant of Rd 1 with the L1 unit ball, see (J ez equel et al., 2022, Section 2). This can be done since any coordinate xi of a vector x P d can be recovered from the rest of the coordinates: xi 1 ř i i xi. In our case, denoting by a the last action in A (i.e., a A, recalling that A r As), we will represent the occupancy measures as vectors in RNSp A 1q by omitting all coordinates that correspond to this action. We can afford to do so, since for any µ P Mp µ0, we have that µnpx, a q ρµ npxq ř

a a µnpx, aq where ρµ n is recoverable from µn 1 and given in the first step by the initial state distribution ρp 1pxq, which does not depend on µ. In the following, we use this idea to define the sought mapping.

Define the A ˆ p A 1q matrix

G : IA 1 1 A 1

and let H be the NSA ˆ NSp A 1q matrix obtained via taking the direct sum of NS copies of G: H : ÀNS i 1 G.2 Define wp,1 P RNSA such that wp,1 n px, aq : ρp 1pxq Itn 1, a a u .

Next, for every 2 ď m ď N, we define W p,m as the NSA ˆ NSA matrix where

W p,mpn, x, a, n1, x1, a1q : Itn m, n1 m 1, a a upmpx|x1, a1q .

2For an n ˆ m matrix M and an n1 ˆ m1 matrix M 1, M À M 1 is the pn n1q ˆ pm m1q block matrix M 0 0 M 1

Online Episodic Convex Reinforcement Learning

Then, we define the NSA ˆ NSp A 1q matrix

Bp : p INSA W p,Nq . . . p INSA W p,3qp INSA W p,2q H

and the vector βp : p INSA W p,Nq . . . p INSA W p,3qp INSA W p,2qwp,1 .

Finally, define the function Ξp : RNSp A 1q Ñ RNSA where

Ξppξq : Bpξ βp

for ξ P RNSp A 1q.

To explain the semantics of Ξp, let µ P Mp µ0 and µ P RNSp A 1q be such that µnpx, aq : µnpx, aq for all pn, x, aq P r Ns ˆ X ˆ Aza . It then holds that Ξpp µq µ. To see this, note that H µ expands µ setting p H µqnpx, a q ř

a a µnpx, aq. To fully recover µnpx, a q, what remains is to add ρµ npxq. This is achieved at n 1 by adding wp,1 to H µ since wp,1 n px, a q ρp 1pxq ρµ 1pxq and wp,1 n px, aq 0 for a a . Next, at n 2, the matrix W p,2 extracts the values ρµ 2pxq when operated on H µ wp,1 such that µ2px, a q is recovered at coordinate p2, x, a q of p INSA W p,2qp H µ wp,1q. Iterating this procedure until step N allows us to fully recover µ from µ. While for a generic ξ P RNSp A 1q, Ξppξq is the unique vector in RNSA satisfying pΞppξqqnpx, aq ξnpx, aq for all n, x, and a a ; pΞppξqq1px, a q ρp 1pxq ř

a a pΞppξqq1px, aq for all x; and pΞppξqqnpx, a q ř

x1,a1pΞppξqqn 1px1, a1qpnpx|x1, a1q ř

a a pΞppξqqnpx, aq for all x and n ě 2.

Note that Bp has full column rank since for any ξ P RNSp A 1q, Bpξ is only an expansion of ξ; hence, we can define its left pseudo-inverse p Bpq : pp Bpq Bpq 1p Bpq , which satisfies p Bpq Bp INSp A 1q. On the other hand, the matrix Bpp Bpq projects vectors in RNSA onto the column space of Bp, which is given by

" µ P RNSA : ÿ

a µnpx, aq ÿ

x1,a1 µn 1px1, a1qpnpx|x1, a1q @x P X, 2 ď n ď N and ÿ

a µ1px, aq 0 @x P X * . (49)

It is easy to verify that for any µ, µ1 P Mp µ0, µ µ1 lies in the column space of Bp (recall that ř

a µ1px, aq ř

a µ1 1px, aq ρp 1pxq). Moreover, βp P Mp µ0 as it corresponds to a policy π where πnpa |xq 1 for all n and x. Therefore, for any µ P Mp µ0, µ βp belongs to the column space of Bp, and we consequently have that

Ξp p Bpq pµ βpq Bpp Bpq pµ βpq βp µ .

Hence, by the definition of Ξp, p Bpq pµ βpq coincides with µ on all coordinates pn, x, aq P r Ns ˆ X ˆ Azta u (since the map Ξp only expands the input vector adding the coordinates corresponding to action a ), and is then the only point in RNSp A 1q that Ξp maps to µ. In light of this, we define

p Mp µ0q : tξ P RNSp A 1q | Ξppξq P Mp µ0u ,

the pre-image of Mp µ0 under Ξp. Accordingly, we define Λp : p Mp µ0q Ñ Mp µ0 as the restriction of Ξp to p Mp µ0q ; that is,

Λp : Ξp|p Mp µ0q .

This then is a bijective function, with Λ 1 p pµq p Bpq pµ βpq.

Still, p Mp µ0q is not guaranteed to have a non-empty interior. Suppose that some state x is not reachable at a certain step n ; that is, for every state x and action a, pn px |x, aq 0 if n ě 2, or just that ρp 1px q 0 if n 1. Then, for any µ P Mp µ0, µn px , aq 0 for every action a. This implies that for every ξ P p Mp µ0q , ξn px , aq 0 for all a a

(since these coordinates are preserved under Λp), and hence, p Mp µ0q has an empty interior. To remedy this, we rely on Asm. 4.4, which is equivalent to imposing that for every state x, ρp 1pxq ą 0 and there exists for every step n a state-action pair px1, a1q such that pn 1px|x1, a1q ą 0. We show next that this condition is sufficient for p Mp µ0q to have a non-empty interior. We first present an alternative characterization of p Mp µ0q .

Lemma E.1. It holds that p Mp µ0q ξ P RNSp A 1q : Bpξ ě βp( .

Online Episodic Convex Reinforcement Learning

Hence, p Mp µ0q is a polyhedral set formed by NSA constraints; namely that for n, x, a P r Ns ˆ X ˆ A, Bppn, x, a, , , q ξ βp npx, aq ě 0.

Proof. Any ξ P p Mp µ0q clearly satisfies Bpξ ě βp since Bpξ βp Λppξq P Mp µ0, whose coordinates are nonnegative. Conversely, assume that Bpξ ě βp for some ξ P RNSp A 1q, and let µ : Ξppξq Bpξ βp. Showing that ξ P p Mp µ0q is equivalent, by definition, to showing that Ξppξq P Mp µ0. Since βp P Mp µ0 and Bpξ belongs to the column space of Bp specified in (49), it holds that ÿ

a µnpx, aq ÿ

x1,a1 µn 1px1, a1qpnpx|x1, a1q

for every n ě 2, and that ř

a µ1px, aq ř

a βp 1px, aq ρβp 1 pxq ρp 1pxq. Then, to show that µ P Mp µ0, it remains to show that µ P p XˆAq N. By assumption, µ only has non-negative coordinates; therefore, we only have to show that ř

x,a µnpx, aq 1 at every n. This easily done via induction: ř

x,a µ1px, aq ř

x ρp 1pxq 1, and for n ě 2,

x,a µnpx, aq ÿ

x1,a1 µn 1px1, a1qpnpx|x1, a1q ÿ

x1,a1 µn 1px1, a1q .

Lemma E.2. p Mp µ0q has a non-empty interior if and only if Asm. 4.4 holds.

Proof. Necessity is immediate as argued before. We prove sufficiency utilizing an argument from the proof of Proposition 2.3 in (Wolsey & Nemhauser, 1999). For every step-state-action triple pn, x, aq, it is easy to verify that Asm. 4.4 implies the existence of some µ P Mp µ0 such that µnpx, aq ą 0. Taking a convex combination with full support of one such occupancy measure for every pn, x, aq results, via the convexity of Mp µ0, in an occupancy measure µ P Mp µ0 whose entries are all strictly positive. Hence, ξ : Λ 1 p pµ q is an interior point of the polyhedral set p Mp µ0q as it satisfies with strict inequality all the constraints defining it.

E.2. Entropic Regularization Approach

E.2.1. FITTING A EUCLIDEAN BALL IN THE CONSTRAINT SET

For the following, fix ε P p0, 1{Sq. From Sec. 4.2.1, recall the definition κ : ε{p A 1 ?

A 1q. We now show that κ1NSp A 1q κv P p Mp µ0q for any v P BNSp A 1q assuming the transition kernel p : ppnqn Pr Ns satisfies the condition of Asm. 4.2; that is, pnpx1|x, aq ě ε for all pn, x, x1, aq P r Ns ˆ X 2 ˆ A. Take ζv,p : Ξppκ1NSp A 1q κvq. Note that showing that κ1NSp A 1q κv P p Mp µ0q is equivalent to showing that ζv,p P Mp µ0. In the following, we proceed with the latter.

Note that via Lem. E.1, it suffices to show that ζv,p is non-negative. We use induction in the following to show more particularly that ζv,p P p XˆAq N. By the definition of ζv,p n , we have that for pn, xq P r Ns ˆ X,

ζv,p n px, aq ε A 1 ?

A 1p1 vnpx, aqq @a P Aza and ζv,p n px, a q ρζv,p n pxq ÿ

a a ζv,p n px, aq , (50)

where ρζv,p n pxq ř

a1,x1PAˆX ζv,p n 1px1, a1qpnpx|x1, a1q for n ě 2 and ρζv,p 1 pxq ř

a1,x1PAˆX µ0px1, a1qp1px|x1, a1q ρp 1pxq. For a a , clearly ζv,p n px, aq ě 0 as vnpx, aq ě 1. Note that at any step n and state x, the Cauchy-Schwarz inequality and the fact that v P BNSp A 1q yield that

a a vnpx, aq ď ?

a a |vnpx, aq|2 ď ?

a a ζv,p n px, aq ÿ

A 1p1 vnpx, aqq ď ε .

Online Episodic Convex Reinforcement Learning

On the other hand, Asm. 4.2 implies that ρζv,p 1 pxq ě ε for every x. Hence, (50) gives that ζv,p 1 px, a q ě 0 at every x. Moreover, (50) also implies that ř

x,a ζv,p 1 px, aq ř

x ρp 1pxq 1, yielding that ζv,p 1 P XˆA. For n ě 2, assuming that ζv,p n 1 P XˆA, Asm. 4.2 implies again that ρζv,p n pxq ě ε for every x. We then get via (50) that ζv,p n px, a q ě 0 and that ζv,p n P XˆA since ř

x,a ζv,p n px, aq ř

x ρζv,p n pxq 1, which holds again via (50) and the assumption that ζv,p n 1 P XˆA. Induction then establishes that ζv,p P p XˆAq N as sought. As mentioned above, this implies via Lem. E.1 that ζv,p P Mp µ0, or equivalently, that κ1NSp A 1q κv P p Mp µ0q and ζv,p Λppκ1NSp A 1q κvq.

E.2.2. ESTIMATING THE TRANSITION KERNEL

In this section, we define and analyze an alternative transition kernel estimator to the one given in Eq. (7). What we seek in this new estimator is (1) that it estimates well the true transition kernel, with a guarantee similar to that of Lem. 2.2; (2) that it drifts across rounds in a controlled manner, satisfying the bound of Lem. A.3 up to a constant; and (3) that, at the same time, it satisfies the condition of Asm. 4.2 almost surely, supposing, naturally, that it is satisfied by the true kernel.

To recall the notation, for each round t P r Ts, ot denotes a random trajectory obtained by executing the policy πt in the environment; that is, ot : pxt 1, at 1, . . . , xt N, at Nq where at n πtp |xt nq and xt n pnp |xt n 1, at n 1q.3 We also recall the definitions

N t npx, aq :

s 1 1txsn x,asn au and M t npx1|x, aq :

s 1 1txs n 1 x1,xsn x,asn au .

Fix n, x, a P r Ns ˆ X ˆ A. As an intermediate step, we compute at the beginning of each round t the Laplace (add-one) estimator for pt np |x, aq; that is, for x1 P X,

pt npx1|x, aq : M t n 1px1|x, aq 1 N t n 1px, aq S . (51)

To obtain a guarantee on the accuracy of this estimator, we firstly describe a slightly different setting. Let p 9xs nq T s 1 be an i.i.d. sequence of states such that 9xs n pnp |x, aq. Then, for k P r Ts, we define the Laplace estimator

9pk npx1|x, aq : 1 řk s 1 1t 9xsn x1u

Notice that in our setting, the distribution pt np |x, aq is equivalent to 9p Nt n 1px,aq n p |x, aq, keeping in mind that the number of samples N t n 1px, aq is random and dependent on the observed samples. Let DKLpp } qq denote the KL-divergence between distributions (probability mass functions) p and q. We derive the following result concerning the divergence between pt and p using known properties of the Laplace estimator and a union bound argument.

Lemma E.3. For fixed t, n, x, a P r Ts ˆ r Ns ˆ X ˆ A, it holds with probability at least 1 δ that

DKL pnp |x, aq pt np |x, aq ď 161S 6 ?

S log5{2 ST

4δ 310 maxt1, N t n 1px, aqu .

Proof. For a fixed k P r Ts, Thm. 2 in (Canonne et al., 2023) and Prop. 1 in (Mourtada & Ga ıffas, 2022) imply that

P ˆ DKL pnp |x, aq 9pk np |x, aq ą 161S 6 ?

Via a union bound, we obtain that4

P ˆ DKL pnp |x, aq pt np |x, aq ą 161S 6 ?

4δ 310 maxt1, N t n 1px, aqu

3Recall that pxt 0, at 0q µ0p , q. 4Note that if N t n 1px, aq 0, then pt np |x, aq is the uniform distribution and DKL pnp |x, aq pt np |x, aq ď log S; hence, the bound trivially holds.

Online Episodic Convex Reinforcement Learning

ď P ˆ Dk P r Ts: DKL pnp |x, aq 9pk np |x, aq ą 161S 6 ?

The lemma then follows after rescaling δ.

Note that the distribution pt np |x, aq does not necessarily satisfy the conditions of Asm. 4.2 uniformly. Next, we define for a given ε P r0, 1{Ss the set

ε X : tx P Rd : 1 dx 1 and xi ě ε @i P rdsu Ď X ,

which is the set of state distribution assigning probability at least ε to every state. We then define ppt np |x, aq as the information projection of pt np |x, aq onto ε X ; that is,

ppt np |x, aq : arg min q P ε X DKL q pt np |x, aq , (52)

which exists and is unique since ε X is compact and DKL pt np |x, aq is continuous and strictly convex where it is finite (note that pt np |x, aq never assigns zero probability to any state; hence, DKL q pt np |x, aq is finite for any q P ε X ). If pt np |x, aq is not already in ε X , this projection can only bring us closer to pnp |x, aq in the KL-divergence sense as the following inequality (Cover & Thomas, 2012, Thm. 11.6.1) states:

DKLppnp |x, aq } ppt np |x, aqq ď DKLppnp |x, aq } pt np |x, aqq DKLpppt np |x, aq } pt np |x, aqq . (53)

With this fact in mind, we can arrive at the following result, a parallel of Lem. 2.2.

Lemma E.4. With probability at least 1 δ, it holds for all t, n, x, a P r Ts ˆ r Ns ˆ X ˆ A simultaneously that

}pnp |x, aq ppt np |x, aq}1 ď

S log5{2 S2ANT 2

4δ 620 maxt1, N t n 1px, aqu .

Proof. The statement is a consequence of (53), Lem. E.3, and Pinsker s inequality; followed by an application of a union bound over all rounds, steps, and state-action pairs.

What remains now is to show that there exists a constant c ą 0 such that

}ppt 1 n 1p |x, aq ppt n 1p |x, aq}1 ď c 1txtn x,as t au maxt1, N t 1 n px, aqu .

This can be easily shown to hold for pt, i.e., before the projection step, as states the following lemma.

Lemma E.5. For all n P r N 1s, px, a, x1q P X ˆ A ˆ X, and t P r Ts; pt n 1px1|x, aq as defined in (51) satisfies

} pt 1 n 1p |x, aq pt n 1p |x, aq}1 ď 21txtn x,atn au

N t 1 n px, aq S .

Proof. The derivation follows along the same lines as the proof of Lem. A.3. We have that

pt 1 n 1px1|x, aq 1txt n 1 x1,xt n x,at n au M t npx1|x, aq 1

N t 1 n px, aq S

1txt n 1 x1,xtn x,atn au N t 1 n px, aq S N t npx, aq S N t 1 n px, aq S pt n 1px1|x, aq .

pt 1 n 1px1|x, aq pt n 1px1|x, aq 1txt n 1 x1,xtn x,atn au N t 1 n px, aq S pt n 1px1|x, aq N t npx, aq N t 1 n px, aq N t 1 n px, aq S

1txtn x,atn au N t 1 n px, aq S

1txt n 1 x1u pt n 1px1|x, aq .

Online Episodic Convex Reinforcement Learning

Finally, we conclude that

} pt 1 n 1p |x, aq pt n 1p |x, aq}1 2 ÿ

x1: pt 1 n 1px1|x,aqě pt n 1px1|x,aq

pt 1 n 1px1|x, aq pt n 1px1|x, aq

21txtn x,atn au

N t 1 n px, aq S

1 pt n 1pxt n 1|x, aq ď 21txtn x,atn au

N t 1 n px, aq S .

To derive a similar bound for the projected estimator ppt, we firstly derive a more explicit characterization of the information projection onto ε X . For a fixed ε P p0, 1{Sq, define the function gε : R ˆ X Ñ R as

gεpr; pq : ÿ

x PX maxtrε, ppxqu .

Lemma E.6. For any given p P X , the map r ÞÑ gεpr; pq is εS-Lipschitz and has a unique fixed point. Moreover, denoting this fixed point by r , it holds that r P r1, maxx ppxq{εq, and that gεpr; pq ą r for r ă r and gεpr; pq ă r for r ą r .

Proof. We firstly note that gεp ; pq can be easily verified to be convex. For any r P R and any subgradient h of gεp ; pq at r, it holds that |h| ď εS. Hence, the convexity of gεp ; pq implies that |gεpr; pq gεpr1; pq| ď εS|r r1| for any r, r1 P R, or that gεp ; pq is εS-Lipschitz. This implies, since εS ă 1 by assumption, that gεp ; pq is a contraction mapping; hence, via Banach s fixed point theorem, it admits a unique fixed point r P R. For r ă 1, it holds that gεpr; pq ě ř

x ppxq 1 ą r. While for r ě maxx ppxq{ε, gεpr; pq rεS ă r. Therefore, r P r1, maxx ppxq{εq. Moreover, for any r ă r (r ą r ), it must hold that gεpr; pq ą r (gεpr; pq ă r); as otherwise, the intermediate value theorem, applied to gεpr; pq r, would imply the existence of another fixed point, a contradiction.

Next, we define rε : X Ñ R as the function that maps a distribution p P X to the fixed point of gεp ; pq. This function is well-defined as implied by Lem. E.6. We now show that the solution of the information projection problem onto ε X can be expressed in terms of the function rε. For p P X , we define pε P ε X as

pεpxq : maxtrεppqε, ppxqu ř x1PX maxtrεppqε, ppx1qu maxtε, ppxq{rεppqu .

Lemma E.7. For p P X , it holds that pε arg minq P ε X DKLpq } pq.

Proof. We assume without loss of generality that ppxq ą 0 for all x P X; as otherwise, we can cast the problem into a lower dimensional one considering only the elements x P X for which ppxq ą 0. Since the constraint set is compact and the objective is continuous and strictly convex, this minimization problem admits a unique optimal solution. We start by rewriting the problem as

min q PRS ÿ

x PX qpxq log qpxq

subject to ε qpxq ď 0 @x P X ÿ

x PX qpxq 1 0

Define the Lagrangian

Lpq, u, vq : ÿ

x PX qpxq log qpxq

x PX upxqpε qpxqq v

x PX qpxq 1

for v P R and u P RS ě0. We have that

BL Bqpxqpq, u, vq log qpxq

ppxq 1 upxq v .

Online Episodic Convex Reinforcement Learning

We now show that we can satisfy the KKT conditions by choosing a solution pair q and u , v where

q pxq : pεpxq maxtε, ppxq{rεppqu , u pxq : log maxtε, ppxq{rεppqu

ppxq{rεppq , and v : 1 log rεppq .

Firstly, q indeed belongs to ε X by the definition of pε, and u is non-negative. Moreover, whenever q pxq ą ε, we get that u pxq 0; hence, complementary slackness holds. Finally,

BL Bqpxqpq , u , v q log maxtε, ppxq{rεppqu

ppxq 1 log maxtε, ppxq{rεppqu

ppxq{rεppq 1 log rεppq 0 .

Therefore, we conclude that pε is the optimal solution.

Computing pε, or the information projection of p onto ε X , can be performed efficiently. In particular, the following characterization implies that rεppq can be computed exactly in a finite number of steps by iterating over the set of states.

Lemma E.8. Let X p : tx P X : gεpppxq{ε; pq ă ppxq{εu and X p : Xz X p . Then,

x PX ppxq 1 ε|X p | .

Proof. As stated in the proof of Lem. E.6, for r ě maxx PX ppxq{ε, gεpr; pq rεS ă r; hence X p is non-empty as it at least includes arg maxx PX ppxq. Moreover, from the same lemma, we have that rεppq ă minx PX p ppxq{ε and rεppq ě maxx PX p ppxq{ε (if X p is non-empty). Therefore,

rεppq gεprεppq; pq ÿ

x PX maxtrεppqε, ppxqu rεppqε|X p | ÿ

x PX ppxq .

The previous lemma also implies that rεppq ď p1 εSq 1. Returning back to our original objective, we show next that }pε qε}1 is no larger than a constant multiple of }p q}1 for any two distributions p and q. Towards that end, we first show that rε is Lipschitz continuous.

Lemma E.9. For ε ď 1 2S , the function rε is 1-Lipschitz with respect to the } }1 norm; that is,

|rεppq rεpqq| ď }p q}1

for any p, q P X .

Proof. Note that, for any fixed r P R, gεpr; q is convex; and that for any p P X and subgradient k of gεpr; q at p, it holds that k is non-negative and satisfies }k}8 ď 1. Hence, for any p, q P X ,

|gεpr; pq gεpr; qq| ď ÿ

x: ppxqąqpxq pppxq qpxqq ÿ

x: qpxqąppxq pqpxq ppxqq 1

2}p q}1 . (54)

Then, we obtain that

|rεppq rεpqq| |gεprεppq; pq gεprεpqq; qq|

ď |gεprεppq; pq gεprεpqq; pq| |gεprεpqq; pq gεprεpqq; qq|

ď εS|rεppq rεpqq| 1

2}p q}1 ď 1

2|rεppq rεpqq| 1

where the second inequality follows from (54) and Lem. E.6, and the last inequality holds since ε ď 1 2S . The lemma then follows after rearranging the last result.

Lemma E.10. Assuming ε ď 1 2S , it holds for any p, q P X that }pε qε}1 ď 5

Online Episodic Convex Reinforcement Learning

Proof. We have that

qεpxq maxtrεpqqε, qpxqu

maxtrεpqqε, qpxqu maxtrεppqε, ppxqu maxtrεppqε, ppxqu

maxtrεpqqε, qpxqu maxtrεppqε, ppxqu

rεpqq rεppq

rεpqqpεpxq .

qεpxq pεpxq 1 rεpqq

maxtrεpqqε, qpxqu maxtrεppqε, ppxqu rεppq rεpqq pεpxq .

Using Lem. E.9 and the fact that

| maxtrεpqqε, qpxqu maxtrεppqε, ppxqu| ď max ε|rεpqq rεppq|, |qpxq ppxq| (

ď ε|rεpqq rεppq| |qpxq ppxq| ,

we obtain that

x |qεpxq pεpxq|

ε|rεpqq rεppq| |qpxq ppxq| pεpxq|rεppq rεpqq|

where the last step uses that εS ď 1{2 and rεpqq ě 1.

Finally, we arrive at the sought result, a parallel of Lem. A.3.

Lemma E.11. For all n P r N 1s, px, a, x1q P X ˆ A ˆ X, and t P r Ts; ppt n 1px1|x, aq as defined in (52) with ε ď 1 2S satisfies

}ppt 1 n 1p |x, aq ppt n 1p |x, aq}1 ď 51txtn x,atn au

N t 1 n px, aq S .

Proof. This is a direct consequence of the definition in (52) and Lems. E.5, E.7 and E.10.

E.2.3. THE ALGORITHM

For δ P p0, 1q, define

S log5{2 S2ANT 2

4δ 620 , (55)

which is the leading factor in the confidence bound of Lem. E.4. For the purpose of exploration, much like the full information case, we will utilize at each round t a bonus reward vector bt P RNSA to be subtracted from the estimated gradient, where

bt npx, aq : Lp N nq C1 1{T a

maxt1, N tnpx, aqu (56)

for pt, n, x, aq P r Ts ˆ pt0u Ťr Nsq ˆ X ˆ A.

Finally, with all its components detailed, we present Alg. 3, our first approach for CURL with bandit feedback. As mentioned in Sec. 4.2.1, the main changes compared to Alg. 1 are the use of spherical estimation to obtain a surrogate for the gradient and the use of a suitably altered transition kernel estimator.

Online Episodic Convex Reinforcement Learning

Algorithm 3 Bonus O-MD-CURL (bandit feedback)

input: learning rate τ ą 0, perturbation rate δ P p0, 1s, sequence of exploration parameters pαtqt Pr T s P p0, 1q T

initialization: pp1 npx1|x, aq Ð 1{S @pn, x, x1, aq, µ1 P arg minµPM p1 µ0 ψpµq

for t 1, . . . , T do

draw ut P SNSp A 1q uniformly at random

ζt Ð ζut,pt Λptpκ1NSp A 1q κutq

pµt Ð p1 δqµt δζt

πt npa|xq Ð pµtpx, aq{ ř

a PA pµtpx, aq

execute πt and observe F tpµπt,pq and a sampled trajectory ot : pxt 1, at 1, . . . , xt N, at Nq

δκ NSp A 1q F tpµπt,pqut

construct gt P RNSA as gt npx, aq Ð gt npx, aq for a a and gt npx, a q Ð 0

construct bonus vector bt as in (56)

πt npa|xq Ð p1 αtqµtpx, aq{ ř

a PA µtpx, aq αt{A

construct the new estimated kernel ppt 1 via (51) and (52)

set µt 1 P arg minµPM pt 1 µ0 τ x gt bt, µy Γpµ, µ πt,ptq

E.2.4. AUXILIARY LEMMAS

Lemma E.12. For 0 ă δ ă 1 and ppt as defined in (52), it holds that

x,a µπt,p i px, aq}pi 1p |x, aq ppt i 1p |x, aq}1 ď 3C1 δN 2?

SAT 2SN 2 c

with probability at least 1 2δ.

Proof. This lemma can be proved in the same manner as its full information version Lem. A.1 with only two small changes; we use the bound of Lem. E.4 instead of Lem. B.1 and we modify the definition of the filtration to be Ft : σpu1, o1, . . . , ut 1, ot 1, utq.

Lemma E.13. For any 0 ă δ ă 1,

n 0 p N nq ÿ

µπt,p n px, aq a

maxt1, N tnpx, aqu ď 3N 2?

holds with probability at least 1 δ.

Proof. The proof is the same as for Lem. A.2 (the version proved in the full information case), except that, again, the filtration used in the proof would be defined as Ft : σpu1, o1, . . . , ut 1, ot 1, utq.

Proposition E.14. Let bt and ppt be as defined in (56) and (52) respectively. Then, for any δ P p0, 1q, with probability at least 1 3δ,

@ µπt,pt, bt D

@ µ0, bt 0 D ď LC1 1{T N 3 ˆ 3C1 δ ?

LC1 1{T N 2 ˆ 3 ?

Proof. The proof is the same as that of Prop. 3.1 except that we would rely on Lems. E.12 and E.13 in place of Lems. A.1 and A.2, and use the definition of bt in (56) instead of (11).

Online Episodic Convex Reinforcement Learning

Lemma E.15. Let X be a random variable taking values in R, z1, z2 ě 0 be two constants, and δ1 P p0, 1q. If X ď z2 uniformly and Pp X ą z1q ď δ1, then Er Xs ď z1 δ1z2 .

Proof. Simply, Er Xs Er It X ď z1u Xs Er It X ą z1u Xs ď z1 z2Ppx ą z1q ď z1 δ1z2.

E.2.5. REGRET ANALYSIS

The following theorem, a restatement of Thm. 4.3, provides a regret bound for Alg. 3. Recall that we have adopted in this section the shorthand notation S |X| and A |A|.

Theorem E.16. Under Asm. 4.2, Alg. 3 with a suitable tuning of τ, δ, and pαtqt Pr T s satisfies for any policy π P p Aq XˆN

E r RT pπqs À

ε S5{4A5{4N 3T 3{4 L 1

ε S2A5{2N 4?

where À signifies that the inequality holds up to factors logarithmic in T, N, S, and A.

Proof. Fixing π P p Aq XˆN, we have that

E r RT pπqs E

F tpµπt,pq F tpµπ,pq

F tpµπt,pq F tpµπt,ptq

looooooooooooooooomooooooooooooooooon

F tpµπt,ptq F tpµπ,ptq

looooooooooooooooomooooooooooooooooon

F tpµπ,ptq F tpµπ,pq

loooooooooooooooomoooooooooooooooon

It holds with probability at least 1 2

µπt,p µπt,pt 1 L

µπt,p n µπt,pt n 1

x,a µπt,p i px, aq pi 1p |x, aq ppt i 1p |x, aq 1

SATC1 1{T 2LSN 2a

2T logp NTq ,

where the first inequality uses the Lipschitz continuity of F t, the second inequality follows from Lem. B.1, and the last inequality follows from Lem. E.12. Hence, since L řT t 1 µπt,p µπt,pt 1 ď 2NLT, it holds via Lem. E.15 (with δ1 2

SATC1 1{T 2LSN 2a

2T logp NTq 4LN . (57)

For the third sum, we use again the Lipschitz continuity of F t, Lem. B.1, and Lem. E.4 to get that with probability at least 1 1

µπ,pt µπ,p 1 ď L

µπ,pt n µπ,p n 1

i px, aq pi 1p |x, aq ppt i 1p |x, aq 1

i px, aq C1 1{T a

maxt1, N t i px, aqu

n 0 p N nq ÿ

x,a µπ,pt n px, aq C1 1{T a

maxt1, N tnpx, aqu

Online Episodic Convex Reinforcement Learning

@ µπ,pt, bt D

@ µ0, bt 0 D

@ µπt,pt, bt D

@ µ0, bt 0 D

@ µπ,pt µπt,pt, bt D .

Via Prop. E.14, it holds with probability 1 3

@ µπt,pt, bt D

@ µ0, bt 0 D ď LC1 1{T N 3 3C1 1{T ?

2T logp NTq

LC1 1{T N 2 3 ?

2T logp NTq .

Hence, chaining these last two results and using a union bound, we get via Lem. E.15 (with δ1 4

@ µπ,pt µπt,pt, bt D ff

ď LC1 1{T N 3 3C1 1{T ?

2T logp NTq

LC1 1{T N 2 3 ?

2T logp NTq 8LNp1 C1 1{T Nq , (58)

where we have used that

@ µπ,pt µπt,pt, bt D ď L

µπ,pt µπ,p 1

t 1 }bt}8}µπ,pt µπt,pt}1 ď 2LNTp1 C1 1{T Nq .

Define p F t : p XˆAq N Ñ R as

p F tpµq Ev PBNSp A 1q F t p1 δqµ δζv,pt ı .

As ppt satisfies the condition of Asm. 4.2 by design, ζv,pt P Mpt µ0 Ă p XˆAq N as argued in App. E.2.1; thus, p F t is well-defined. Similarly, since ut P SNSp A 1q Ă BNSp A 1q and ζt ζut,pt, it holds that ζt P Mpt µ0. Via the convexity of Mpt µ0, the fact that µt P Mpt µ0, and the definition of pµt; it holds that pµt P Mpt µ0. This yields that pµt µπt,pt, recalling the definition of πt in Alg. 3. Using the Lipschitz smoothness of F t, we have that

F tpµπt,ptq p F tpµtq F tppµtq p F tpµtq F tpp1 δqµt δζtq Ev PBNSp A 1q F t p1 δqµt δζv,pt ı

ď δLEv PBNSp A 1q}ζt ζv,pt}1 ď 2δLN

p F tpµπ,ptq F tpµπ,ptq Ev PBNSp A 1q F tpp1 δqµπ,pt δζv,ptq ı F tpµπ,ptq

ď δLEv PBNSp A 1q}ζv,pt µπ,pt}1 ď 2δLN .

F tpµπt,ptq p F tpµtq p F tpµtq p F tpµπ,ptq p F tpµπ,ptq F tpµπ,ptq

@ p F tpµtq, µt µπ,pt D 4δLNT

@ p F tpµtq bt, µt µπ,pt D 4δLNT

@ bt, µπt,pt µπ,pt D

@ bt, µt µπt,pt D .

Online Episodic Convex Reinforcement Learning

The last term is easily bounded as follows:

@ bt, µt µπt,pt D

@ bt, µt pµt D δ

@ bt, µt ζt D ď δ

t 1 }bt}8}µt ζt}1 ď 2δC1 1{T LN 2T .

We then conclude that

@ bt, µπ,pt µπt,pt D ff

@ p F tpµtq bt, µt µπ,pt D 4δLNT 2δC1 1{T LN 2T . (59)

Then, combining (57), (58), and (59) yields that

E r RT pπqs ď E

@ p F tpµtq bt, µt µπ,pt D 2δLp2 C1 1{T Nq NT

3LC1 1{T N 3 3C1 1{T ?

2T logp NTq 4LNp3 2C1 1{T Nq . (60)

Define Ft, p Ft : Mpt µ0 Ñ R as Ftpξq : F t Λptpξq and p Ftpξq : p F t Λptpξq . Then, recalling that κ : ε A 1 ?A 1,

p Ft Λ 1 pt pµtq p F tpµtq Ev PBNSp A 1q F t p1 δqµt δζv,pt ı

Ev PBNSp A 1q Ft Λ 1 pt p1 δqµt δζv,pt ı

Ev PBNSp A 1q Ft B pt p1 δqµt δζv,pt β pt ı

Ev PBNSp A 1q Ft p1 δq B pt µt β pt δ B pt ζv,pt β pt ı

Ev PBNSp A 1q Ft p1 δqΛ 1 pt pµtq δΛ 1 pt pζv,ptq ı

Ev PBNSp A 1q Ftpp1 δqΛ 1 pt pµtq δκ1NSp A 1q δκvq ı ,

where the fourth equality follows form the fact that Λ 1 pt pµq B pt µ β pt , and the last equality follows since

ζv,pt Λptpκ1NSp A 1q κvq. Lem. 1 in (Flaxman et al., 2005) and the chain rule imply that

p Ft Λ 1 pt pµtq 1 δ

δκ NSp A 1q Eu PSNSp A 1q Ftpp1 δqΛ 1 pt pµtq δκ1NSp A 1q δκuqu ı

δκ NSp A 1q Eu PSNSp A 1q Ft p1 δqΛ 1 pt pµtq δΛ 1 pt ζu,pt u ı

δκ NSp A 1q Eu PSNSp A 1q F tpp1 δqµt δζu,ptqu ı

δκ NSp A 1q Eut PSNSp A 1q F tpp1 δqµt δζtqut ,

where the last equality uses that ζt ζut,pt and that both µt and ppt are independent with respect to ut. And since p Ft Λ 1 pt pµtq B pt p F tpµtq, we obtain that

B pt B pt p F tpµtq 1 δ

δκ NSp A 1q Eut PSNSp A 1q F tpp1 δqµt δζtq B pt utı

Eut PSNSp A 1q B pt pgt , (61)

δκ NSp A 1q F tpp1 δqµt δζtqut 1 δ

δκ NSp A 1q F tppµtqut .

Online Episodic Convex Reinforcement Learning

The vector pgt differs from gt (which is employed in Alg. 3) in that it is defined using F tppµtq instead of F tpµπt,pq. For round t P r Ts, let Ft : σpu1, o1, . . . , ut, otq denote the σ-algebra generated by the random events up to the end of round t; and let Etr s : Er | Ft 1s with F0 being the trivial σ-algebra. We then have that

@ p F tpµtq, µt µπ,pt D E

@ B pt B pt p F tpµtq, µt µπ,pt D

@ Et B pt pgt , µt µπ,pt D

@ B pt pgt, µt µπ,pt D ,

where the first equality holds via the fact that B pt B pt pµt µπ,ptq µt µπ,pt since µt µπ,pt belongs to the column space of B pt (see App. E.1), the second equality uses (61) and the fact that conditioned on Ft 1, the only source of randomness in B pt pgt is ut, which is sampled independently in each round; and the last equality uses the tower rule, linearity of expectation, and the fact that µt µπ,pt is measurable with respect to Ft 1. Since µt µπ,pt B pt Λ 1 pt pµtq Λ 1 pt µπ,pt , we have that

ppgtq T B pt µt µπ,pt ppgtq T B pt B pt Λ 1 pt pµtq Λ 1 pt µπ,pt ppgtq T Λ 1 pt pµtq Λ 1 pt µπ,pt

since B pt B pt INSp A 1q, see App. E.1. Therefore,

@ p F tpµtq, µt µπ,pt D E

@ pgt, Λ 1 pt pµtq Λ 1 pt µπ,pt D

@ gt, Λ 1 pt pµtq Λ 1 pt µπ,pt D E

@ pgt gt, Λ 1 pt pµtq Λ 1 pt µπ,pt D

@ gt, µt µπ,pt D E

@ pgt gt, Λ 1 pt pµtq Λ 1 pt µπ,pt D ,

where the last equality follows from the definition of gt (see Alg. 3) and the fact that µt and µπ,pt are expansions of Λ 1 pt pµtq

and Λ 1 pt µπ,pt respectively, augmented with the entries corresponding to action a . Focusing on the second sum, we have that

@ pgt gt, Λ 1 pt pµtq Λ 1 pt µπ,pt D ď

t 1 }pgt gt}8 Λ 1 pt pµtq Λ 1 pt µπ,pt 1

t 1 }pgt gt}8 µt µπ,pt 1

t 1 }pgt gt}8

δκ N 2Sp A 1q

t 1 }ut}8|F tppµtq F tpµπt,pq|

εδ N 2SA2 Tÿ

t 1 |F tppµtq F tpµπt,pq|

εδ N 2SA2 Tÿ

t 1 |F tpµπt,ptq F tpµπt,pq|

εδ LN 2SA2 Tÿ

µπt,pt µπt,p 1 ,

Online Episodic Convex Reinforcement Learning

where the fourth inequality uses that κ ě ε 2A and that ut P SNSp A 1q, and the last inequality uses the Lipschitz continuity of F t. As shown in (57), we have that

µπt,pt µπt,p 1 ď 3N 2?

SATC1 1{T 2SN 2a

2T logp NTq 4N .

@ p F tpµtq, µt µπ,pt D ď E

@ gt, µt µπ,pt D 4

εδ LN 3SA2 3N ?

SATC1 1{T 2SN a

2T logp NTq 4 .

Combining this result with (60) yields that

E r RT pπqs ď E

@ gt bt, µt µπ,pt D 4

εδ LN 3SA2 3N ?

SATC1 1{T 2SN a

2T logp NTq 4

2δLp2 C1 1{T Nq NT 3LC1 1{T N 3 3C1 1{T ?

2T logp NTq 4LNp3 2C1 1{T Nq

@ gt bt, µt µπ,pt D δ 2Lp2 C1 1{T Nq NT looooooooooomooooooooooon

4LNp3 2C1 1{T Nq

δ 7 εLC1 1{T N 3SA2 3N ?

ATC1 1{T 2N a

2ST logp NTq 4

loooooooooooooooooooooooooooooooooooooomoooooooooooooooooooooooooooooooooooooon

where the last inequality uses that C1 1{T ě ?

S. Note that

n 1 }gt n}8

εδ NSp A 1qp A 1 ?

A 1q F tpµπt,pq

n 1 }ut n}8

εδ N 2SA2 N ÿ

n 1 }ut n}8

n 1 }utn}28 ď 2

εδ N 5{2SA2

x,a |utnpx, aq|2 ď 2

εδ N 5{2SA2 ,

where the second inequality uses Cauchy-Schwarz and the last inequality uses that ut P SNSp A 1q. Moreover, we have that }bt}1,8 řN n 1 }bt n}8 ď řN n 1 Lp N nq C1 1{T ď LN 2C1 1{T . Hence, using that C1 1{T ě ?

} gt bt}1,8 ď 2

εδ N 5{2SA2 LN 2C1 1{T ď 2

εδ p L 1q C1 1{T ?

Via Lems. A.4 and E.11,5 we can invoke Lem. 2.1 with c 5e, ζ 2 εδp L 1q C1 1{T ?

SA2N 5{2, and αt 1{pt 1q to get that (from the proof of Lem. 2.1)

@ gt bt, µt µπ,pt D ď τ 2

εδ p L 1q C1 1{T ?

SA2N 5{2 2 T 20e2SN logp ATq2p N Aq

εδ p L 1q C1 1{T S3{2A2N 7{2 logp Tq .

5To invoke Lem. E.11, we assume without loss of generality that the constant ε specified in Asm. 4.2 satisfies ε ď 1 2S .

Online Episodic Convex Reinforcement Learning

Tuning τ optimally yields that

@ gt bt, µt µπ,pt D ď 4

εδ p L 1q C1 1{T SA2N 5{2a

20e2Np N Aq T logp ATq2

εδ p L 1q C1 1{T S3{2A2N 7{2 logp Tq

ε p L 1q C1 1{T SA2N 3 logp ATq a

loooooooooooooooooooooooooooooooooooooomoooooooooooooooooooooooooooooooooooooon

Hence, plugging back into (62) yields that

E r RT pπqs ď δΞ1 1

δ pΞ2 Ξ3q 4LNp3 2C1 1{T Nq .

Setting δ : min ! 1, b

) , we get that

E r RT pπqs ď max ! 2 a

Ξ1pΞ2 Ξ3q, 2pΞ2 Ξ3q ) 4LNp3 2C1 1{T Nq .

Consequently, the theorem follows after using the definition of C1 1{T from Eq. (55) and ignoring log factors.

E.3. Self-Concordant Regularization Approach

We have used the set p Mp µ0q , the preimage of Mp µ0 under the map Ξp (or Λp), to represent in RNSp A 1q the set of valid occupancy measures. A more concise characterization, given by Lem. E.1, is that

p Mp µ0q ξ P RNSp A 1q : Bpξ ě βp( ;

in other words, p Mp µ0q is a convex polytope formed by the constraints Bppn, x, a, , , q ξ βp npx, aq ě 0 for n, x, a P r Ns ˆ X ˆ A. Moreover, Lem. E.2 asserts that int p Mp µ0q , the interior of p Mp µ0q , is not empty under Asm. 4.4.

We consider then the function ψlb : int p Mp µ0q Ñ R defined as

n,x,a log Bpn, x, a, , , q ξ βnpx, aq .

As mentioned in Sec. 4.2.2, Corollary 3.1.1 in (Nemirovski, 2004) yields that ψlb is a ϑ-self-concordant barrier (see Definition 3.1.1 in Nemirovski, 2004) for p Mp µ0q with ϑ N S A. The approach we analyze here is to perform OMD directly on the set p Mp µ0q with ψlb as the regularizer.

For ξ P int p Mp µ0q and y P RNSp A 1q, define the local norm }y}ξ : a

y 2ψlbpξqy. This is indeed a norm since the fact that p Mp µ0q is bounded implies via Property II in (Nemirovski, 2004, Section 2.2) that the Hessian of ψlb is non-singular everywhere. Its dual norm is denoted as }y}ξ, : a

y p 2ψlbpξqq 1y. The Dikin ellipsoid of radius r at ξ P int p Mp µ0q

is given by Erpξq : ty P RNSp A 1q : }y ξ}ξ ď ru ξ rp 2ψlbpξqq 1{2BNSp A 1q .

Via Property I in (Nemirovski, 2004, Section 2.2), E1pξq Ď p Mp µ0q for any ξ P int p Mp µ0q .

For ξ, y P int p Mp µ0q , we denote by Dψlbpy, ξq : ψlbpyq ψlbpξq xy ξ, ψlbpξqy the Bregman divergence between y and ξ with respect to ψlb. From the proof of Thm. E.16, we recall the definition Ft : F t Λp. As alluded to above, our OMD updates will take the form ξt 1 Ð arg min ξPp Mp µ0q τ @ gt, ξ D Dψlbpξ, ξtq ,

where gt will be chosen as a surrogate for Ftpξtq. Differently from the proof of Thm. E.16, we redefine the smoothed approximation p Ft : p Mp µ0q Ñ R such that

p Ftpξq : Ev PBNSp A 1q Ft p1 δqξ δ ξt p 2ψlbpξtqq 1{2v .

Online Episodic Convex Reinforcement Learning

Figure 4. This figure provides a graphical comparison between the sampling approach used in Alg. 3, represented on the left, and that used in Alg. 4, represented on the right. The simplified domain here is tx P r0, 1s2 : }x}1 ď 1u. Both approaches are illustrated at three points: a, b, and c. In the first approach, with some δ P p0, 1q and δ : 1 δ, we sample from a circle of radius δ{p2 ?

2q centered at a convex combination between the point of interest and o : 1{p2 ?

2q . In the second approach, we consider the barrier logp1 x1 x2q ř

i 1,2 logpxiq and sample from the Dikin ellipsoid (of a certain common radius) induced by this function at each point.

This is well-defined since we are evaluating Ft on a convex combination of the argument ξ and a point inside the ellipsoid E1pξtq, which is a subset of p Mp µ0q as cited before. Via Corollary 6.8 in (Hazan et al., 2016) and the chain rule, we have that

p Ftpξq p1 δq

δ NSp A 1q Eu PSNSp A 1q Ft p1 δqξ δ ξt p 2ψlbpξtqq 1{2u p 2ψlbpξtqq 1{2u . (63)

Hence, with ut sampled uniformly from SNSp A 1q, we pick (as mentioned in Sec. 4.2.2)

δ NSp A 1q Ft ξt δp 2ψlbpξtqq 1{2ut p 2ψlbpξtqq 1{2ut (64)

such that Eut gt p Ftpξtq, see also (Saha & Tewari, 2011) for a similar estimator in another BCO setting. We summarize this approach in Alg. 4, and provide in Fig. 4 a graphical comparison with the sampling approach of Alg. 3 on a simple decision set. Before proving the regret bound of Thm. 4.5, we collect a few standard properties and auxiliary results concerning self-concordant barriers and their use as regularizers.

E.3.1. AUXILIARY LEMMAS

For x, y P int p Mp µ0q , it holds via Property I in (Nemirovski, 2004, Section 2.2) that

p1 }y x}xq2 2ψlbpxq ď 2ψlbpyq ď 1 p1 }y x}xq2 2ψlbpxq (65)

whenever }y x}x ă 1. We state the following auxiliary lemma, which will be used to assert the proximity between ξt and ξt 1 for our algorithm. Establishing this stability is a crucial step in the local norm analysis. Lemma E.17. Let x P int p Mp µ0q and ℓP RNSp A 1q be such that }ℓ}x, ď 1 16, and define

y : arg min ξPint p Mp µ0q xℓ, ξy Dψlbpξ, xq .

Then, y P E1{2pxq.

Online Episodic Convex Reinforcement Learning

Algorithm 4 Bandit O-MD-CURL with logarithmic barrier regularization

input: domain p Mp µ0q with non-empty interior, learning rate τ ą 0, exploration parameter δ P p0, 1s

initialization: ξ1 Ð arg minξPint p Mp µ0q ψlbpξq

for t 1, . . . , T do

draw ut P SNSp A 1q uniformly at random

pξt Ð ξt δp 2ψlbpξtqq 1{2ut

pµt Ð Λpppξtq

πt npa|xq Ð pµtpx, aq{ ř

a PA pµtpx, aq

output πt and observe F tppµtq

δ NSp A 1q F tppµtqp 2ψlbpξtqq 1{2ut

ξt 1 Ð arg minξPint p Mp µ0q τ xgt, ξy Dψlbpξ, ξtq

Proof. For ξ P int p Mp µ0q , let

gpξq : xℓ, ξy Dψlbpξ, xq xℓ, ξy ψlbpξq ψlbpxq xξ x, ψlbpxqy .

Note that g is a self-concordant function on int p Mp µ0q (Item (ii) in Nemirovski, 2004, Proposition 2.1.1), whose Hessian (hence, local norms and Dikin ellipsoids) coincides with that of ψlb everywhere. Moreover, g is below bounded thanks to p Mp µ0q being a bounded set, which implies that g attains its minimum on int p Mp µ0q (Property VI in Nemirovski, 2004, Section 2.2). This minimum is also unique via strict convexity. Hence, y is well-defined.

The rest of the proof is similar to the proof of Lem. 13 in (Wei & Luo, 2018) and Lem. 9 in (Van der Hoeven et al., 2023). Thanks to the strict convexity of g, to show that y P E1{2pxq it suffices to show that for any ξ on the boundary of E1{2pxq, gpxq ď gpξq; this is because x P E1{2pxq and y arg minξPp Mp µ0q gpξq. For any such ξ on the boundary of E1{2pxq, Taylor s theorem implies that there exists some z on the line segment between x and ξ such that

gpξq gpxq xξ x, gpxqy 1

2pξ xq 2gpzqpξ xq

2pξ xq 2ψlbpzqpξ xq

ě xξ x, ℓy 1

8pξ xq 2ψlbpxqpξ xq

ě }ξ x}x}ℓ}x, 1

where the second equality holds since 2g 2ψlb and gpxq ℓ ψlbpxq ψlbpxq ℓ, the first inequality holds via (65) and the fact that z P E1{2pxq, the second inequality holds via the definition of a dual norm, the last equality holds since ξ is on the boundary of E1{2pxq, and the last inequality holds via the assumption that }ℓ}x, ď 1 16.

For x P int p Mp µ0q , the Minkowski function of p Mp µ0q with the pole at x is defined as (Nemirovski, 2004, Section 3.2)

πxpyq : inf t ą 0: x t 1py xq P p Mp µ0q (

for y P p Mp µ0q . The following lemma readily follows from the properties of the Minkowski function. It is used in the analysis to handle the bias term of the standard OMD regret guarantee, which is slightly more involved in this case considering that the comparator need not belong to the interior of p Mp µ0q , where ψlb is defined (and finite).

Online Episodic Convex Reinforcement Learning

Lemma E.18. Let x P int p Mp µ0q , y P p Mp µ0q , δ P p0, 1q, and z : p1 δqy δx. Then,

ψlbpzq ď ψlbpxq NSA log δ 1 .

Further, let 9x : arg minx Pint p Mp µ0q ψlbpxq and 9z : p1 δqy δ 9x. Then,

Dψlbp 9z, 9xq ď NSA log δ 1 .

Proof. Since ψlb is an NSA-self-concordant barrier for Mp µ0, Property II in (Nemirovski, 2004, Section 3.2) implies that

ψlbpzq ď ψlbpxq NSA log 1 1 πxpzq .

On the other hand,

x p1 δq 1pz xq x p1 δq 1pp1 δqy δx xq x y x y P p Mp µ0q ,

implying that πxpzq ď 1 δ. Hence, ψlbpzq ď ψlbpxq NSA log δ 1.

Next, we note that the optimality of 9x implies that

Dψlbp 9z, 9xq ψlbp 9zq ψlbp 9xq x 9z 9x, ψlbp 9xqy ď ψlbp 9zq ψlbp 9xq ,

which concludes the proof when combined with the first part.

E.3.2. REGRET ANALYSIS

We are now ready to prove the regret bound of Thm. 4.5, which is stated more explicitly in the following theorem.

Theorem E.19. Under Asm. 4.4, Alg. 4 with τ δ 16

log T N3SAT and δ min "b

17 4L N 3{4S3{4A3{4plog T q1{4

T 1{4 , 1 * satisfies for

any policy π P Π that

E r RT pπqs ď max ! 4 ?

17LN 7{4 p SATq3{4 plog Tq1{4, 34 a

N 5S3A3T log T ) 2LN .

Proof. Firstly, we assert that the iterates ξt are well defined; similar to what was argued in the proof of Lem. E.17, the functions ψlbp q and τ xgt, y Dψlbp , ξtq are self-concordant on int p Mp µ0q (Item (ii) in Nemirovski, 2004, Proposition 2.1.1) and bounded from below thanks to p Mp µ0q being a bounded set, implying via Property VI in (Nemirovski, 2004, Section 2.2) that each of these functions attains its minimum on int p Mp µ0q , which is also unique via strict convexity. Also note that indeed pµt P Mp µ0 since pξt P p Mp µ0q as we argued before presenting the algorithm.

Let µ P arg minµPMp µ0 řT t 1 F tpµq and RT : E řT t 1 F tpµπt,pq F tpµ q , which satisfies RT maxπPΠ E r RT pπqs.

Define ξ : p1 9δqΛ 1 p pµ q 9δξ1, where 9δ P p0, 1q is a constant to be specified later. To start with, we have that

F tpµπt,pq F tpµ q E

F tppµtq F tpµ q E

Ftppξtq Ft Λ 1 p pµ q .

Next, we derive that

Ftppξtq p Ftpξtq

Ft ξt δp 2ψlbpξtqq 1{2ut Ev PBNSp A 1q Ft ξt δp 2ψlbpξtqq 1{2v ı

ď LEv PBNSp A 1q Λp ξt δp 2ψlbpξtqq 1{2ut Λp ξt δp 2ψlbpξtqq 1{2v 1 δLEv PBNSp A 1q Λp p 2ψlbpξtqq 1{2ut Λp p 2ψlbpξtqq 1{2v 1 δLEv PBNSp A 1q Λp ξt p 2ψlbpξtqq 1{2ut Λp ξt p 2ψlbpξtqq 1{2v 1

Online Episodic Convex Reinforcement Learning

ď δLEv PBNSp A 1q Λp ξt p 2ψlbpξtqq 1{2ut 1 Λp ξt p 2ψlbpξtqq 1{2v 1

where the first inequality uses the Lipschitz smoothness of F t and the fact that Ft F t Λp; the second and third equalities use the fact that Λp is an affine function; and the last inequality holds since both ξt p 2ψlbpξtqq 1{2ut, ξt p 2ψlbpξtqq 1{2v P E1pξtq Ă p Mp µ0q , and that for any ξ P p Mp µ0q , Λppξq P Mp µ0 and therefore satisfies }Λppξq}1 ď N. We similarly derive that

p Ftpξ q Ft Λ 1 p pµ q

Ev PBNSp A 1q Ft p1 δqξ δ ξt p 2ψlbpξtqq 1{2v Ft Λ 1 p pµ q

Ev PBNSp A 1q Ft p1 δqp1 9δqΛ 1 p pµ q p1 δq 9δξ1 δ ξt p 2ψlbpξtqq 1{2v

Ft Λ 1 p pµ q

ď LEv PBNSp A 1q p1 δqp1 9δqµ p1 δq 9δΛppξ1q δΛp ξt p 2ψlbpξtqq 1{2v µ 1

ď LEv PBNSp A 1q |δ 9δ δ 9δ|}µ }1 p1 δq 9δ Λppξ1q 1 δ Λp ξt p 2ψlbpξtqq 1{2v 1

ď LEv PBNSp A 1q pδ 9δq}µ }1 9δ Λppξ1q 1 δ Λp ξt p 2ψlbpξtqq 1{2v 1

ď 2δLN 2 9δLN .

Hence, using also the convexity of p Ft, we obtain that

p Ftpξtq p Ftpξ q 4δLNT 2 9δLNT

@ p Ftpξtq, ξt ξ D 4δLNT 2 9δLNT . (66)

In this proof, let Ft : σpu1, . . . , utq denote the σ-algebra generated by u1, . . . , ut; and let Etr s : Er | Ft 1s with F0 being the trivial σ-algebra. We then have that

p Ftpξtq Eut gt Et gt ,

where the first equality follows from (63) and the the second equality holds since conditioned on Ft 1, ut is the only source of randomness in gt and is sampled identically and independently in every round. Using that ξt ξ is measurable with respect to Ft 1, we then obtain that

@ Etgt, ξt ξ D 4δLNT 2 9δLNT E

@ gt, ξt ξ D 4δLNT 2 9δLNT .

Via the definition of ξt and the fact that ξ P int p Mp µ0q , Lem. 6.16 in (Orabona, 2019) implies that

@ gt, ξt ξ D ď Dψlbpξ , ξ1q

τ 2}gt}2 ζt, ,

where ζt lies on the line segment between ξt and ξt 1. We firstly observe that

}gt}2 ξt, ˆp1 δq

δ NSp A 1q F tppµtq 2 putq p 2ψlbpξtqq1{2 2ψlbpξtq 1p 2ψlbpξtqq1{2ut

δ NSp A 1q F tppµtq 2 ď 1

δ2 N 4S2A2 ,

where we have used that F tppµtq ď N. Hence, if

τ ď δ 16N 2SA , (67)

Online Episodic Convex Reinforcement Learning

then τ}gt}ξt, ď 1{16. Consequently, Lem. E.17 (with x ξt, y ξt 1, and ℓ τgt) would assert that ξt 1 P E1{2pξtq; and hence, ζt P E1{2pξtq. It would then hold via (65) that

}gt}2 ζt, ď 1 p1 }ζt ξt}ξtq2 }gt}2 ξt, ď 4}gt}2 ξt, ,

On the other hand, via Lem. E.18 and the definitions of ξ1 and ξ , we have that

Dψlbpξ , ξ1q ď NSA log 9δ 1 .

Hence, conditioned on (67), we obtain the following regret bound

RT ď NSA log 9δ 1

δ2 N 4S2A2T 4δLNT 2 9δLNT . (68)

log T N 3SAT , and δ min "c

17 4L N 3{4S3{4A3{4plog Tq1{4

T 1{4 , 1 * ;

we obtain that

RT ď NSA log T

δ2 N 4S2A2T 4δLNT 2LN

N 5S3A3T log T 4δLNT 2LN

ď max ! 4 ?

17LN 7{4 p SATq3{4 plog Tq1{4, 34 a

N 5S3A3T log T ) 2LN . (69)

If T ě NSA logp Tq, then our choice of τ indeed satisfies (67):

log T N 3SAT ď δ 16N 2SA .

Otherwise, we can fall back to the trivial regret bound

RT ď NT ď N 2SA logp Tq ,

which is dominated by the bound in (69); hence, the theorem follows.