# Multi-Reward Best Policy Identification

Alessio Russo (Ericsson AB, Stockholm, Sweden) and Filippo Vannella (Ericsson Research, Stockholm, Sweden)

Code repository: https://github.com/rssalessio/Multi-Reward-Best-Policy-Identification. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract. Rewards are a critical aspect of formulating Reinforcement Learning (RL) problems; often, one may be interested in testing multiple reward functions, or the problem may naturally involve multiple rewards. In this study, we investigate the Multi-Reward Best Policy Identification (MR-BPI) problem, where the goal is to determine the best policy for all rewards in a given set R with minimal sample complexity and a prescribed confidence level. We derive a fundamental instance-specific lower bound on the sample complexity required by any Probably Correct (PC) algorithm in this setting. This bound guides the design of an optimal exploration policy attaining minimal sample complexity. However, this lower bound involves solving a hard non-convex optimization problem. We address this challenge by devising a convex approximation, enabling the design of sample-efficient algorithms. We propose MR-NaS, a PC algorithm with competitive performance on hard-exploration tabular environments. Extending this approach to Deep RL (DRL), we also introduce DBMR-BPI, an efficient algorithm for model-free exploration in multi-reward settings.

## 1 Introduction

Reinforcement Learning (RL) provides a powerful framework for sequential decision-making, which has achieved a number of breakthroughs in a wide range of applications, including mastering video games [1, 2], accelerating algorithm design [3], optimizing fusion control systems [4, 5], and enhancing autonomous robotic control [6, 7, 8]. Rewards are a critical aspect of formulating an RL problem. Yet, designing a reward function can require numerous iterations of reward engineering [9]. Furthermore, in many practical scenarios, the reward signal is not necessarily a scalar, and the agent may seek to optimize performance across a set of reward functions [10, 11] (e.g., goal-reaching applications in robotics [12], recommender systems [13], intent-based radio network optimization tasks [14, 15], etc.).

In this work, we investigate how to efficiently identify optimal policies across multiple rewards in a given set R, a setting we refer to as Multi-Reward Best Policy Identification (MR-BPI). This setting is a variant of traditional Best Policy Identification (BPI) with fixed confidence [16, 17, 18, 19, 20], where the goal is to identify optimal policies using the least number of samples and with a prescribed confidence level. Exploration algorithms for BPI typically aim to find the optimal policy for a single reward function, which is known and fixed in advance. However, as previously mentioned, the agent may face uncertainty about which reward function is the most appropriate for a given task, or may seek to optimize performance across a set of rewards. This problem remains relatively unexplored in the current RL literature. Closely related settings are reward-free RL [21, 22, 23], unsupervised RL [24, 25, 26, 27, 28, 29], and multi-objective RL [30, 31]. In reward-free RL, the goal of the agent is to estimate the optimal policy for any possible reward, using methods such as maximum entropy exploration [32, 33]. This reward-free phase is then followed by a downstream task adaptation in unsupervised RL.
Lastly, in multi-objective RL, the agent optimizes over multiple (potentially conflicting) rewards, with the goal of identifying a Pareto frontier of optimal policies. While reward-free RL is the closest to our setting, exploration for any possible reward is unlikely to occur in practical problems. On the other hand, unsupervised RL and multi-objective RL do not directly address the multi-reward pure exploration problem, a setting we deem to be of practical use in a wide range of applications. For instance, in intent-based radio network optimization, rewards are designed to reflect network operator intents, such as maximizing throughput or coverage in a given network area. MR-BPI enables the agent to identify policies that perform well across different reward functions, enhancing robustness and adaptability to varying conditions and goals.

Our contributions are as follows. For MR-BPI with a finite set of user-specified rewards, we establish an asymptotic instance-specific sample complexity lower bound for Markov Decision Processes (MDPs) with a unique optimal policy, which generalizes the lower bound in the single-reward BPI setting [17]. This bound provides insights into the fundamental difficulty of the problem and guides the design of an optimal exploration policy. As in previous works [16, 17, 19, 34], determining such an exploration policy requires solving a non-convex optimization problem. To address this issue, we propose a convex formulation of the lower bound based on a finite set of rewards. We also propose extensions for two cases: (1) for random exogenous reward signals provided by the environment (app. B.4), and (2) for a set of continuous rewards, which can be addressed by considering a carefully chosen finite set of rewards (app. B.3). We then devise MR-NaS (Multi-Reward Navigate and Stop), a PC algorithm with asymptotic sample complexity guarantees, adaptable to the aforementioned extensions. We evaluate the performance of MR-NaS on different hard-exploration tabular environments, comparing to RF-UCRL [22] (a reward-free exploration method), ID3AL [33] (a maximum entropy exploration approach) and MR-PSRL, a multi-reward adaptation of PSRL [35]. Results demonstrate the efficiency of MR-NaS in identifying optimal policies across various rewards and in generalizing to unseen rewards when the reward set is sufficiently diverse. Lastly, we further extend our multi-reward exploration approach to Deep RL (DRL) by devising DBMR-BPI (Deep Bootstrapped Multi-Reward BPI), a model-free exploration algorithm for continuous-state MDPs. We assess its performance, along with RND [26] (Random Network Distillation), Disagreement [27], and APT [29], on hard-exploration problems from the DeepMind BSuite [36] and on link adaptation (a radio network control problem [37]).

## 2 Problem setting

In this section, we introduce the notation used in the manuscript and briefly introduce the Best Policy Identification (BPI) framework. Lastly, we describe the MR-BPI setting.

### 2.1 Markov Decision Processes and the Set of Rewards

Markov Decision Processes (MDPs). We consider discounted MDPs of the type $M = (\mathcal{S}, \mathcal{A}, P, \gamma)$, where $\mathcal{S}$ is the finite state space (of size $S = |\mathcal{S}|$), $\mathcal{A}$ is the finite action space (of size $A = |\mathcal{A}|$), $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition function, which maps state-action pairs to a distribution over states, and $\gamma \in (0, 1)$ is the discount factor. We indicate by $M_r = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$ the MDP when considering a reward $r : \mathcal{S} \times \mathcal{A} \to [0, 1]$.
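To fix ideas, the following is a minimal sketch (not part of the paper's codebase) of how such a tabular MDP and a reward vector can be represented with NumPy arrays, and how the optimal value and action-value functions defined in the next paragraph can be computed by value iteration. The array shapes, tolerance, and random example instance are illustrative assumptions.

```python
# Minimal sketch: tabular MDP with transition tensor P of shape (S, A, S) and a
# reward vector r of shape (S, A); value iteration computes V*_r and Q*_r.
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """P[s, a, s'] = P(s' | s, a); r[s, a] in [0, 1]; returns (V_star, Q_star)."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = r + gamma * P @ V          # Q[s, a] = r(s, a) + gamma * E_{s'}[V(s')]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new

# Example: a random 5-state, 2-action MDP and a random reward in [0, 1]^{SA}.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(5), size=(5, 2))   # each row is a distribution over next states
r = rng.uniform(size=(5, 2))
V_star, Q_star = value_iteration(P, r, gamma=0.9)
pi_star = Q_star.argmax(axis=1)              # greedy (optimal) deterministic policy
```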
For a given reward $r$, we define the discounted value of a Markov policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ to be $V^\pi_r(s) = \mathbb{E}^\pi\big[\sum_{t \ge 0} \gamma^t r(s_t, a_t) \mid s_0 = s\big]$, where $s_{t+1} \sim P(\cdot|s_t, a_t)$ and $a_t \sim \pi(\cdot|s_t)$. We also let $V^\star_r(s) = \max_\pi V^\pi_r(s)$ be the optimal value in $s$ and, for MDPs with a unique optimal policy, we indicate the optimal policy by $\pi^\star_r(s) = \arg\max_\pi V^\pi_r(s)$. Similarly, we also define the action-value function $Q^\pi_r(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a)}[V^\pi_r(s')]$. For an MDP $M_r$, we indicate by $\Pi^\star(M_r) = \{\pi \in \Pi_d : V^\star_r - V^\pi_r = 0\}$ the set of optimal policies in $M_r$, where $\Pi_d$ is the set of deterministic policies.

Set of rewards. We denote the set of rewards by $\mathcal{R}$, and assume that it is known to the agent beforehand. In subsequent sections, for the sake of simplicity, we consider this set to be discrete and finite, while we present results for continuous sets of rewards in app. B.4. For finite MDPs, rewards are represented as vectors within $[0,1]^{SA}$, with each coordinate $i$ corresponding to the reward of the $i$-th state-action pair (thus the reward in each state-action pair is bounded in $[0,1]$). We also denote the canonical basis of rewards as $\mathcal{R}_{\rm canonical} = \{e_1, \ldots, e_{SA}\}$ in $\mathbb{R}^{SA}$ (with $e_i$ being the $i$-th vector of the canonical basis).

### 2.2 Single-Reward Best Policy Identification with Fixed Confidence

Best Policy Identification consists of finding, for a specific reward function, an optimal policy in an MDP using the least number of samples and with a given confidence level [16, 17], and the proof techniques are inspired by the Best Arm Identification literature in the bandit setting [20].

Optimal exploration strategy. The main idea is to derive an instance-specific lower bound on the expected number of samples needed to identify an optimal policy. Interestingly, the lower bound is often formulated as an optimization problem, whose solution $\omega^\star$ allows us to derive an optimal exploration algorithm. In other words, the solution $\omega^\star$ provides the best static proportions of draws. That is, a policy following $\omega^\star$ matches the lower bound and is therefore optimal [20, 17].

Change of measure. The recipe to derive instance-specific sample complexity lower bounds is based on a change-of-measure argument. In general, the reasoning adopted is adversarial: does there exist an MDP, similar to the one considered, for which the optimal policy is different? The analysis of such a problem delves into the minimum amount of evidence needed to discern between different hypotheses. The evidence is quantified by the log-likelihood ratio of the observations under the true model and a confusing model. This confusing model is usually the one that is statistically closest to the true model while admitting a different optimal policy. This argument allows us to derive an instance-dependent quantity $T^\star$, also known as the characteristic time, that lower bounds the sample complexity of an algorithm. Change-of-measure arguments have a long history [38, 39, 40, 41], and have been applied to derive lower bounds for regret minimization [42, 43, 44], best-arm identification [20, 45, 46, 47] and change detection problems [48, 49, 50].

### 2.3 Multi-Reward Best Policy Identification with Fixed Confidence

The MR-BPI setting follows closely the single-reward BPI framework [17]. In MR-BPI, the goal is to learn the best policy for all possible rewards $r \in \mathcal{R}$ as quickly as possible. This objective can be formalized in the δ-Probably-Correct (δ-PC) framework [51, 22], where the learner interacts with the MDP until she can output an optimal policy with confidence $1-\delta$.
Contrary to the classical BPI framework, which solely considers a single reward, here we are interested in algorithms that identify a best policy for all rewards in $\mathcal{R}$.

Algorithm. Formally, an algorithm consists of (i) an exploration strategy (a.k.a. sampling rule), (ii) a stopping rule $\tau$ and (iii) an estimated optimal policy $\hat\pi_{\tau,r}$ for any reward $r \in \mathcal{R}$. At each time step $t$, the algorithm observes a state $s_t$, selects an action $a_t$, and then observes the next state $s_{t+1} \sim P(\cdot|s_t, a_t)$. We denote by $\mathbb{P}$ (resp. $\mathbb{E}$) the probability law (resp. expectation) of the process $(Z_t)_t$, where $Z_t = (s_t, a_t)$. We indicate by $\mathcal{F}_t = \sigma(\{Z_0, \ldots, Z_t\})$ the σ-algebra generated by the random observations made under the algorithm up to time $t$ in the exploration phase. For such an algorithm, we define $\tau$ to be a stopping rule w.r.t. the filtration $(\mathcal{F}_t)_{t \ge 1}$. This stopping rule $\tau$ simply states when to stop the exploration phase. At the stopping time $\tau$ the algorithm outputs an estimate $\hat\pi_{\tau,r}$ of the best policy for any given reward $r \in \mathcal{R}$ using the collected data. We focus on δ-PC algorithms, according to the following definition, and we impose a few mild assumptions.

Definition 2.1 (δ-PC Algorithm). We say that an algorithm is δ-PC if, for any MDP $M$, it outputs (in finite time) an optimal policy for any reward $r \in \mathcal{R}$ with probability at least $1-\delta$, i.e., $\mathbb{P}[\tau < \infty,\ (\forall r \in \mathcal{R},\ \hat\pi_{\tau,r} \in \Pi^\star(M_r))] \ge 1 - \delta$.

Assumption 2.1. (I) $M \in \mathcal{M}$, where $\mathcal{M}$ is the set of communicating MDPs [52] with a unique optimal policy for every $r \in \mathcal{R}$; (II) $M$ is aperiodic under a policy that assigns a positive probability to all actions in every state; (III) for all $r \in \mathcal{R}$ there exists a model $P'$ such that $P(\cdot|s,a) \ll P'(\cdot|s,a)$ for all $(s,a)$, and $\Pi^\star(M_r) \cap \Pi^\star(M'_r) = \emptyset$, where $M'_r = (\mathcal{S}, \mathcal{A}, P', r, \gamma)$ and, for any two measures $(\mu, \lambda)$, $\mu \ll \lambda$ denotes absolute continuity of $\mu$ w.r.t. $\lambda$.

Assumptions (I)-(II) are standard in the BPI literature [51, 53]. Assumption (III) is made to avoid degenerate cases in which there are no confusing models for rewards $r \in \mathcal{R}$. For example, this occurs when the reward is identical in all state-action pairs (therefore all actions are optimal).

## 3 Best Policy Identification with a Set of Rewards

In this section, we investigate the MR-BPI problem, i.e., how to optimally explore an MDP to derive the best policy for any reward in $\mathcal{R}$. In the first subsection, we derive a sample complexity lower bound for any δ-PC algorithm. Later, in the second subsection, we present MR-NaS, a δ-PC algorithm inspired by NaS [17] with asymptotic optimality guarantees.

### 3.1 Sample Complexity Lower Bound

In the next theorem, we state an instance-specific lower bound on the sample complexity of any δ-PC algorithm. This bound provides a fundamental limit on the performance of any δ-PC algorithm and guides the design of optimal exploration methods for MR-BPI. To state the bound, we indicate by $\omega \in \mathbb{R}^{SA}$ the allocation (or exploration policy), and by $\omega(s,a)$ the corresponding frequency of visits of a state-action pair $(s,a)$; we introduce the set of confusing models $\mathrm{Alt}_r(M)$ for $M_r$ as the set of MDPs that are statistically similar to the original model $M_r$, but admit a different set of optimal policies under $r$. Specifically, we let $\mathrm{Alt}_r(M) := \{M' : M \ll M',\ \Pi^\star(M_r) \cap \Pi^\star(M'_r) = \emptyset\}$, where $M \ll M'$ if for all $(s,a)$ we have $P(\cdot|s,a) \ll P'(\cdot|s,a)$.
Theorem 3.1. Let $\delta \in (0, 1/2)$. For any δ-PC algorithm we have $\liminf_{\delta \to 0} \frac{\mathbb{E}[\tau]}{\log(1/\delta)} \ge T^\star(M)$, where
$$T^\star(M)^{-1} = \sup_{\omega \in \Omega(M)}\ \inf_{r \in \mathcal{R}}\ \inf_{M' \in \mathrm{Alt}_r(M)}\ \sum_{s,a} \omega(s,a)\, \mathrm{KL}_{M|M'}(s,a), \qquad (1)$$
with $\mathrm{KL}_{M|M'}(s,a) := \mathrm{KL}(P(\cdot|s,a), P'(\cdot|s,a))$ and $\Omega(M) := \{\omega \in \Delta(\mathcal{S} \times \mathcal{A}) : \sum_a \omega(s,a) = \sum_{s',a'} P(s|s',a')\, \omega(s',a'),\ \forall s \in \mathcal{S}\}$.

Discussion. The proof of the theorem is presented in app. B.1. The result, obtained by classical change-of-measure arguments [20], extends the corresponding lower bound in the case of a single reward [17]. The quantity $T^\star(M)$ is referred to as the characteristic rate, and computing it requires solving a non-convex optimization problem. It can be interpreted as a zero-sum game between an agent choosing a distribution (or allocation) $\omega$, and an opponent choosing a confusing model $M'$ [45] (in the multi-reward setting, the opponent also needs to consider the set of rewards when choosing the confusing model, which makes the problem harder to solve). The distribution $\omega$ is a stationary distribution over states and actions that takes into account the transition dynamics, and, interestingly, the optimal solution $\omega_{\rm opt}$ to the optimization problem in eq. (1) also defines the optimal exploration strategy (as $\pi_{\exp}(a|s) = \omega_{\rm opt}(s,a)/\sum_b \omega_{\rm opt}(s,b)$). In principle, with access to an optimization oracle that solves (1), we could apply classical Track-and-Stop (TaS) [20] algorithms and achieve asymptotically optimal sample complexity by tracking the optimal allocation. Unfortunately, even in classical BPI, computing the characteristic rate $T^\star(M)$ is in general a non-convex problem [16]. Instead, as in previous works [16, 17, 53], we focus on deriving a convex upper bound $U^\star(M)$ of $T^\star(M)$, which we call the relaxed characteristic rate, that guarantees that we are still identifying the optimal policy at the cost of an increased sample complexity. We focus on the case of a finite set of user-specified rewards. The extension to externally provided reward signals and the consideration of a continuous set of rewards are detailed in the appendix (app. B.3 and app. B.4, respectively).

### 3.2 Relaxed Characteristic Rate

To define the relaxed characteristic rate, we state a few definitions of problem-dependent quantities for MDPs. We let $\Delta_r(s,a) := V^\star_r(s) - Q^\star_r(s,a)$ be the sub-optimality gap in $(s,a)$, $\mathrm{Var}_r(s,a) := \mathbb{E}_{s' \sim P(\cdot|s,a)}\big[(V^\star_r(s') - \mathbb{E}_{\hat s \sim P(\cdot|s,a)}[V^\star_r(\hat s)])^2\big]$ be the variance of the optimal value function in the next state, and $\mathrm{MD}_r(s,a) := V^\star_r(s) - \mathbb{E}_{s' \sim P(\cdot|s,a)}[V^\star_r(s')]$ be the maximum deviation starting from $(s,a)$. Additionally, for a given reward $r \in \mathcal{R}$ we also let $\Delta_r := \min_{s,\, a \neq \pi^\star_r(s)} \Delta_r(s,a)$, $\mathrm{Var}_r := \max_{s,\, a \neq \pi^\star_r(s)} \mathrm{Var}_r(s,a)$ and $\mathrm{MD}_r := \max_{s,\, a \neq \pi^\star_r(s)} \mathrm{MD}_r(s,a)$ be the minimum gap, maximum variance and maximum deviation across sub-optimal state-action pairs, respectively. We define the relaxed rate in terms of an upper bound on $T(\omega; M)$ that holds for all possible allocations $\omega \in \Delta(\mathcal{S} \times \mathcal{A})$, where
$$T(\omega; M)^{-1} := \inf_{r \in \mathcal{R}}\ \inf_{M' \in \mathrm{Alt}_r(M)}\ \sum_{s,a} \omega(s,a)\, \mathrm{KL}_{M|M'}(s,a). \qquad (2)$$

Theorem 3.2. For any allocation $\omega \in \Delta(\mathcal{S} \times \mathcal{A})$ we have that $T(\omega; M) \le U(\omega; M)$, where
$$U(\omega; M) := \max_{r \in \mathcal{R}}\ \max_{s,\, a \neq \pi^\star_r(s)} \left[ \frac{2\gamma^2\, \mathrm{MD}_r(s,a)^2}{\Delta_r(s,a)^2\, \omega(s,a)} + \max_{s'} \frac{H_r}{\Delta_r^2\, \omega(s', \pi^\star_r(s'))} \right], \qquad (3)$$
where $H_r = \min\left\{ \frac{139(1+\gamma)^2}{(1-\gamma)^3},\ \max\!\left( \frac{16\gamma^2\, \mathrm{Var}_r\, (1+\gamma)^2}{(1-\gamma)^2},\ \frac{6\gamma^{4/3}\, \mathrm{MD}_r^{4/3}\, (1+\gamma)^{4/3}}{(1-\gamma)^{4/3}} \right) \right\}$. We define the optimal value $U^\star(M) = U(\omega^\star; M)$, where $\omega^\star = \arg\inf_{\omega \in \Omega(M)} U(\omega; M)$ is the corresponding optimal allocation.
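Since $U(\omega; M)$ in eq. (3) is, for a finite set of rewards, a maximum of terms that are convex in $\omega$ (as discussed below), the allocation $\omega^\star$ can in principle be obtained with an off-the-shelf convex solver. The following is a minimal sketch, assuming the instance-specific quantities ($\Delta_r$, $\mathrm{MD}_r$, $H_r$, $\pi^\star_r$) have already been computed (e.g., by value iteration as in the earlier sketch); the data layout and the function name `optimal_allocation` are illustrative, not the paper's implementation.

```python
# Sketch: omega* = argmin_{omega in Omega(M)} U(omega; M) via CVXPY.
# Assumes sub-optimality gaps are strictly positive for sub-optimal actions
# (unique optimal policy, as in Assumption 2.1).
import cvxpy as cp
import numpy as np

def optimal_allocation(P, rewards_stats, gamma):
    """P: (S, A, S) transition tensor. rewards_stats: list of dicts per reward with
    'Delta' (S, A gaps), 'MD' (S, A max deviations), 'H' (scalar H_r),
    'Delta_min' (scalar min gap) and 'pi_star' (S,) greedy policy."""
    S, A, _ = P.shape
    w = cp.Variable((S, A), nonneg=True)
    terms = []
    for st in rewards_stats:
        Delta, MD, H = st["Delta"], st["MD"], st["H"]
        pi, dmin = st["pi_star"], st["Delta_min"]
        # Second term of eq. (3): max_{s'} H_r / (Delta_r^2 * w(s', pi*_r(s')))
        second = cp.max(cp.hstack(
            [H / dmin**2 * cp.inv_pos(w[s, int(pi[s])]) for s in range(S)]))
        for s in range(S):
            for a in range(A):
                if a == pi[s]:
                    continue                       # only sub-optimal pairs
                first = 2 * gamma**2 * MD[s, a]**2 / Delta[s, a]**2 * cp.inv_pos(w[s, a])
                terms.append(first + second)
    U = cp.max(cp.hstack(terms))                   # convex: max of convex terms
    # Navigation constraints Omega(M): stationary state-action distribution.
    constraints = [cp.sum(w) == 1]
    for s in range(S):
        constraints.append(cp.sum(w[s, :]) == cp.sum(cp.multiply(P[:, :, s], w)))
    cp.Problem(cp.Minimize(U), constraints).solve()
    return w.value
```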
Discussion. In the above theorem (the proof can be found in app. B.2), the various instance-specific quantities $\mathrm{Var}_r(s,a)$, $\Delta_r(s,a)$, $H_r$ and $\Delta_r$ are computed with respect to $M$. The result extends the corresponding upper bound in [17, 53]. We make some observations to clarify the key aspects of this result. First, $U(\omega; M)$ is convex in $\omega$ for a finite set of rewards. Hence the optimal allocation $\omega^\star$ can be computed efficiently using convex optimization methods (we report the runtime of our simulations in app. E.1). For a continuous set of rewards there are some challenges in guaranteeing convexity, due to the fact that the minimum gap can be non-convex and discontinuous. However, we find that, in practice, it is sufficient to optimize over a set of rewards whose convex hull includes the original set of rewards (e.g., by using the canonical basis of rewards; refer also to app. B.4 for more details). Second, using the allocation $\omega^\star$ guarantees that we identify the optimal policy, at the cost of increased over-exploration [53], since $U^\star(M) \ge T^\star(M)$. The cost of this over-exploration is limited, since by plugging in $\omega(s,a) = 1/(SA)$ one can see that $U^\star(M) = O\!\left(\frac{SA}{(1-\gamma)^3 \min_{r \in \mathcal{R}} \Delta_r^2}\right)$; note that the dependency on $(1-\gamma)^{-1}$ can be improved whenever the maximum deviation is smaller than $(1-\gamma)^{-1}$. Lastly, observe that the dependency on $\Delta_r^2\, \omega(s', \pi^\star_r(s'))$ in the second term of eq. (3) can be improved to $\Delta_r(s,a)^2\, \omega(s', \pi^\star_r(s'))$ as in [53].

### 3.3 MR-NaS Algorithm

In this section we devise MR-NaS, an adaptation of MDP-NaS [17] to MR-BPI for a finite set of rewards. The algorithm consists of (1) an exploration policy (a.k.a. sampling rule) and (2) a stopping rule. We detail these components in the remainder of this section (and in app. C) and present the pseudo-code of MR-NaS in Alg. 1.

Sampling rule. The main idea, as in [16, 17, 53, 20], is that the exploratory policy should follow the allocation $\omega^\star = \arg\inf_{\omega \in \Omega(M)} U(\omega; M)$, as this will yield a sample-efficient exploration policy. However, as the model $M$ is unknown, we cannot directly compute $\omega^\star$. Instead, we employ the certainty-equivalence principle, and use the current estimate of the model $M_t$ in place of the true model $M$ to compute the allocation $\omega^\star_t = \arg\inf_{\omega \in \Omega(M_t)} U(\omega; M_t)$, where
$$U(\omega; M_t) = \max_{r \in \mathcal{R}}\ \max_{s,\, a \in \mathcal{O}^c_{t,r}(s)} \left[ \frac{2\gamma^2\, \mathrm{MD}_{t,r}(s,a)^2}{\Delta_{t,r}(s,a)^2\, \omega(s,a)} + \frac{H_{t,r}}{\Delta_{t,r}^2\, \min_{s',\, a' \in \mathcal{O}_{t,r}(s')} \omega(s',a')} \right]. \qquad (4)$$
In the above expression, for any $t \ge 1$, the estimate $M_t$ is completely defined by the empirical transition function $P_t$:
$$P_t(s'|s,a) = \frac{N_t(s,a,s')}{N_t(s,a)} \ \text{ if } N_t(s,a) > 0, \quad \text{and} \quad P_t(s'|s,a) = \frac{1}{S} \ \text{ otherwise}, \qquad (5)$$
where $N_t(s,a)$ is the number of visits to each state-action pair up to time $t$ (similarly for $N_t(s,a,s')$). Letting $V^\star_{t,r}$ denote the optimal value function in $M_{t,r} := (M_t, r)$ (similarly $Q^\star_{t,r}$), then $\mathcal{O}_{t,r}(s) = \{a : Q^\star_{t,r}(s,a) = V^\star_{t,r}(s)\}$ is the set of optimal actions for reward $r$ in state $s$ (and $\mathcal{O}^c_{t,r}(s)$ is its complement), $\mathrm{MD}_{t,r}(s,a) = V^\star_{t,r}(s) - \mathbb{E}_{s' \sim P_t(\cdot|s,a)}[V^\star_{t,r}(s')]$ is the maximum deviation starting from $(s,a)$ for $M_{t,r}$, $\mathrm{Var}_{t,r}(s,a) = \mathbb{E}_{s' \sim P_t(\cdot|s,a)}\big[(V^\star_{t,r}(s') - \mathbb{E}_{\hat s \sim P_t(\cdot|s,a)}[V^\star_{t,r}(\hat s)])^2\big]$ is the corresponding next-state variance of the optimal value for $M_{t,r}$, and $\Delta_{t,r}(s,a) = V^\star_{t,r}(s) - Q^\star_{t,r}(s,a)$. These quantities can also be used similarly to compute $H_{t,r}$ (i.e., $H_r$ computed w.r.t. $M_{t,r}$).

Algorithm 1: MR-NaS (Multiple Rewards Navigate and Stop)
Require: confidence $\delta$; exploration terms $(\alpha, \beta)$; reward vectors $\mathcal{R}$.
1. Initialize counter $t \leftarrow 1$, $N_t(s,a) \leftarrow 0$ for all $(s,a) \in \mathcal{S} \times \mathcal{A}$.
2. Set the exploration term $\epsilon_t = 1/N_t(s_t)^\alpha$ and observe $s_1$.
3. While $t\, U(N_t/t;\, M_t)^{-1} < \beta(N_t, \delta)$ do:
4. Compute $\omega^\star_t = \arg\inf_{\omega \in \Omega(M_t)} U(\omega; M_t)$ and let $\bar\omega_t = (1/t) \sum_{n=1}^{t} \omega^\star_n$.
5. Set $\pi_t(a|s_t) = (1-\epsilon_t)\, \bar\pi_t(a|s_t) + \epsilon_t\, \pi_{f,t}(a|s_t)$, where $\bar\pi_t(a|s) = \bar\omega_t(s,a) / \sum_{a'} \bar\omega_t(s,a')$.
6. Play $a_t \sim \pi_t(\cdot|s_t)$ and observe $s_{t+1} \sim P(\cdot|s_t, a_t)$.
7. Update $N_t(s_t, a_t)$, $M_t$ and set $t \leftarrow t + 1$.
8. End while.
Output: an optimal policy $\hat\pi_{r,\tau}$ for a given reward $r$.
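As an illustration of the empirical estimates in eq. (5), here is a minimal NumPy sketch (variable names are assumptions) of how the visit counts are turned into the estimated transition function $P_t$ defining $M_t$:

```python
# Sketch of eq. (5): empirical transition estimate from visit counts.
import numpy as np

def empirical_model(counts):
    """counts[s, a, s'] = N_t(s, a, s'); returns P_t of shape (S, A, S)."""
    S, A, _ = counts.shape
    N_sa = counts.sum(axis=2, keepdims=True)              # N_t(s, a)
    # N_t(s, a) > 0: empirical frequencies; otherwise: uniform 1/S.
    return np.where(N_sa > 0, counts / np.maximum(N_sa, 1), 1.0 / S)
```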
At the same time, the sampling strategy must ensure that the estimated model $M_t$ converges to $M$. To that aim, we mix the computed allocation with a forcing policy $\pi_f$, ensuring that each state-action pair is sampled infinitely often. We use the forcing policy $\pi_{f,t}(\cdot|s) = \mathrm{softmax}(-\beta_t(s)\, N_t(s,\cdot))$ with $\beta_t(s) = \frac{\beta \log(N_t(s))}{\max_a |N_t(s,a) - \min_b N_t(s,b)|}$, $\beta \in [0,1]$, and $(\mathrm{softmax}(x))_i = e^{x_i}/\sum_j e^{x_j}$ for a vector $x$. This choice encourages selecting under-sampled actions for $\beta > 0$, while for $\beta = 0$ we obtain a uniform forcing policy $\pi_{f,t}(a|s) = 1/A$. In [16], the authors use an exploration factor $\epsilon_t = t^{-1/(m+1)}$, where $m$ is a problem-specific constant measuring the "connectedness" of the MDP. As estimating this constant is typically hard for most MDPs, we instead use $\epsilon_t = 1/\max(1, N_t(s_t))^\alpha$, with $N_t(s) = \sum_a N_t(s,a)$ and $\alpha \in (0, 1]$, which only depends on the number of state visits. This change requires a modification of the proofs in [17], since now $\epsilon_t$ also depends on the number of visits of $s_t$. Finally, to ensure convergence of $\bar\omega_t \to \omega^\star$, we simply require that $\alpha + \beta \le 1$.

Stopping rule. The stopping rule (line 3) is defined through the relaxed generalized likelihood ratio term $t\, U(N_t/t;\, M_t)^{-1}$ and a threshold function $\beta(N_t, \delta)$. The term $N_t/t$ simply refers to the rates of visit of each state-action pair, while $M_t$ is the current estimate of the MDP (based on (5)). The threshold function is selected to guarantee that the algorithm is δ-PC (the proof is in app. C.4).

Proposition 3.1. Define the threshold $\beta(N_t, \delta) = \log(1/\delta) + (S-1) \sum_{s,a} \log\!\left(e\left[1 + \frac{N_t(s,a)}{S-1}\right]\right)$. Then, the stopping rule $\tau := \inf\{t \ge 1 : t\, U(N_t/t;\, M_t)^{-1} \ge \beta(N_t, \delta)\}$ is δ-PC, i.e.,
$$\mathbb{P}\big[\tau < \infty,\ (\exists r \in \mathcal{R} : \hat\pi_{\tau,r} \notin \Pi^\star(M_r))\big] \le \delta. \qquad (6)$$

Putting everything together. Following our discussion, MR-NaS uses the estimate $M_t$ to compute the allocation $\omega^\star_t = \arg\inf_{\omega \in \Omega(M_t)} U(\omega; M_t)$ (line 4). To stabilize the exploration, we use an averaged allocation $\bar\omega_t = (1/t) \sum_{n=1}^{t} \omega^\star_n$ to obtain $\bar\pi_t$ (similarly to the C-navigation rule in [20, 17]). Lastly, we mix $\bar\pi_t$ with a forcing policy $\pi_{f,t}$ (line 5). The samples observed from the MDP are then used to update the number of state-action visits and the estimate of $M_t$. Finally, the stopping rule (line 3) determines when the algorithm has reached the desired confidence level.

Sample complexity guarantees. Lastly, we establish asymptotic optimality guarantees on the sample complexity of MR-NaS (see proofs in app. C.2.3 and app. C.3.5, respectively).

Theorem 3.3. MR-NaS with the stopping rule in prop. 3.1 satisfies: (i) it stops almost surely and $\mathbb{P}\!\left[\limsup_{\delta \to 0} \frac{\tau}{\log(1/\delta)} \le U^\star(M)\right] = 1$; (ii) $\mathbb{E}[\tau] < \infty$ and $\limsup_{\delta \to 0} \frac{\mathbb{E}[\tau]}{\log(1/\delta)} \le U^\star(M)$.
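To make the preceding pieces concrete, the following is a schematic sketch of a single MR-NaS exploration step: it assumes the hypothetical `empirical_model` and `optimal_allocation` helpers sketched earlier (with the allocation already averaged into $\bar\omega_t$), and uses the threshold of prop. 3.1 as reconstructed above. It is an illustration of the sampling and stopping rules, not the authors' implementation.

```python
# Schematic MR-NaS step: mix the averaged allocation with a softmax forcing
# policy, sample an action, and check the stopping rule. Names are assumptions.
import numpy as np

def mrnas_step(counts, omega_bar, s_t, alpha, beta, delta, rate_U_inv, rng):
    """counts: N_t(s, a, s'); omega_bar: averaged allocation (S, A);
    rate_U_inv: U(N_t/t; M_t)^{-1} evaluated on the empirical model M_t."""
    S, A, _ = counts.shape
    N_sa = counts.sum(axis=2)                       # N_t(s, a)
    N_s = N_sa.sum(axis=1)                          # N_t(s)
    t = max(int(N_sa.sum()), 1)

    # Forcing policy pi_f,t(.|s_t) = softmax(-beta_t(s_t) * N_t(s_t, .)).
    spread = max(np.max(N_sa[s_t]) - np.min(N_sa[s_t]), 1.0)   # guard against 0
    beta_t = beta * np.log(max(N_s[s_t], 2)) / spread
    logits = -beta_t * N_sa[s_t]
    pi_f = np.exp(logits - logits.max()); pi_f /= pi_f.sum()

    # Mixture pi_t = (1 - eps_t) * pi_bar + eps_t * pi_f, with eps_t = 1/max(1, N_t(s_t))^alpha.
    eps_t = 1.0 / max(1.0, N_s[s_t]) ** alpha
    tot = omega_bar[s_t].sum()
    pi_bar = omega_bar[s_t] / tot if tot > 0 else np.full(A, 1.0 / A)
    pi_t = (1.0 - eps_t) * pi_bar + eps_t * pi_f
    a_t = rng.choice(A, p=pi_t)

    # Stopping rule (prop. 3.1): stop when t * U(N_t/t; M_t)^{-1} >= beta(N_t, delta).
    threshold = np.log(1.0 / delta) + (S - 1) * np.sum(np.log(np.e * (1.0 + N_sa / (S - 1))))
    stop = t * rate_U_inv >= threshold
    return a_t, stop
```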
### 3.4 Numerical Results on Tabular MDPs

In this section we show results on different hard-exploration tabular environments: Riverswim [54], Forked Riverswim [53], Double Chain [22] and NArms [54] (an adaptation of SixArms to N arms). We compare MR-NaS against RF-UCRL [22] (a reward-free exploration method), ID3AL [33] (a maximum entropy exploration approach) and MR-PSRL, a multi-reward adaptation of PSRL [35] (posterior sampling for RL) that is outlined in app. D.1.2. A full description of the environments can be found in app. D.1.

Figure 1: Average estimation error $\mathbb{E}[e_t]$ over $t = 1, \ldots, T$ steps (note that $r$ is a random variable); the panels correspond to the tabular environments (e.g., Forked Riverswim, Double Chain), with curves for MR-NaS, RF-UCRL, ID3AL and MR-PSRL. Shaded area indicates 95% confidence intervals over 100 seeds.

As the baselines are not specific to our setting, we consider different sets of experiments. In this section, we show results for MR-NaS on the more challenging setting with continuous sets of rewards, while in app. D.1, we also investigate the problem of identifying best policies with a finite set of rewards. More precisely, we consider the problem of correctly estimating the optimal policies in a convex set of rewards $[0,1]^{SA}$ as well as the optimal policies of the canonical basis, with discount factor $\gamma = 0.9$. We consider $\mathcal{R} = \mathcal{R}_{\rm canonical} \cup \mathcal{R}_{\rm rnd}$, where $\mathcal{R}_{\rm rnd} = \{R_1, \ldots, R_{30}\}$ consists of 30 reward vectors uniformly sampled from $[0,1]^{SA}$, different for each seed. For MR-NaS, we use the insight from the analysis of a continuous set of rewards (see app. B.4.2) which shows that exploring according to the canonical basis $\mathcal{R}_{\rm canonical}$ is sufficient to identify the optimal policies in a continuous set. We run each algorithm for $T = 50 \times 10^3$ steps, and we report the fraction of misidentified policies $e_{t,r}$ at time step $t$ for reward $r$. Fig. 1 shows the results for this setting over 50 seeds. We see how, depending on the environment, certain algorithms may be more efficient. Nonetheless, MR-NaS is able to reach low estimation error relatively quickly on all of the environments. We refer the reader to app. D.1 for further details.

## 4 Exploration in Deep Reinforcement Learning with a Set of Rewards

Extending MR-NaS to Deep RL is not straightforward: firstly, MR-NaS is a model-based method; secondly, there is no closed-form solution to $\arg\inf_{\omega \in \Omega} U(\omega; M)$. To circumvent these issues, we adapt the DBMF-BPI algorithm from [53], a deep-RL exploration approach for single-reward BPI.

The DBMF-BPI algorithm. DBMF-BPI [53] is an algorithm for model-free exploration. In [53], the authors suggest that a model-free solution can be achieved by (i) using the generative solution $\arg\inf_{\omega \in \Delta(\mathcal{S} \times \mathcal{A})} U(\omega; M)$, which can be computed in closed form (see app. B.2.4), and (ii) estimating the parametric uncertainty of the MDP-specific quantities (sub-optimality gaps, etc.). In other words, the generative solution may be close enough to the model-based one. Nonetheless, this solution may be biased because the true sub-optimality gaps are unknown, and therefore we need to account for the parametric uncertainty in the estimates. DBMF-BPI estimates the parametric uncertainty using an ensemble of Q-networks. This ensemble of size $B$, when trained on different data, can be used to mimic the bootstrap procedure [55]: the samples $(x_b)_{b \in [B]}$ from this ensemble are used to build an empirical CDF $\hat{\mathbb{P}}(X \le x) = \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}\{x_b \le x\}$ around some quantity of interest $X$ (in our case the quantity of interest is the Q-function). A randomized prior value mechanism [56] addresses the restricted support of $\hat{\mathbb{P}}$ by adjusting the sampling of $(x_b)_b$. This empirical CDF is then used to sample the Q-values at each time step, which are then used to compute the generative solution $\omega_t$.

Algorithm 2: DBMR-BPI (Deep Bootstrapped Multiple-Rewards BPI) - Exploration phase
1. Initialize replay buffer $\mathcal{D}$, ensemble network parameters $\theta$.
2. For $t = 0, 1, 2, \ldots$ do:
3. Choose reward $r$ with probability proportional to $1/\Delta_{t,r}^2$ and compute the allocation $\omega_t$ using DBMF-BPI with reward $r$.
4. Sample $a_t \sim \omega_t(s_t, \cdot)$ and observe $s_{t+1} \sim P(\cdot|s_t, a_t)$.
5. Add the transition $(s_t, a_t, r_{1,t}, \ldots, r_{|\mathcal{R}|,t}, s_{t+1})$ to the buffer $\mathcal{D}$, with $r_{i,t} = r_i(s_t, a_t)$, $i = 1, \ldots, |\mathcal{R}|$.
6. Train the ensemble $\theta$ according to DBMF-BPI and update the estimates $\Delta_{t,r}$ for each reward $r \in \mathcal{R}$.
7. End for.
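To illustrate the bootstrapping mechanism described above, here is a minimal PyTorch sketch of an ensemble of Q-networks with fixed randomized prior functions, from which Q-values can be sampled by drawing one member uniformly (mimicking a draw from the empirical CDF). The architecture, prior scale, and the multi-reward output layout (one head per action-reward pair, anticipating the adaptation described next) are illustrative assumptions, not the paper's exact implementation; the training loop is omitted.

```python
# Sketch: ensemble of Q-networks with randomized priors (bootstrap-style
# uncertainty). Sizes and prior_scale are illustrative.
import torch
import torch.nn as nn

class PriorQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, n_rewards, prior_scale=3.0, hidden=64):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions * n_rewards))
        self.trainable = mlp()
        self.prior = mlp()                       # fixed, never-trained prior network
        for p in self.prior.parameters():
            p.requires_grad_(False)
        self.prior_scale = prior_scale
        self.out_shape = (n_rewards, n_actions)

    def forward(self, obs):                      # -> (batch, |R|, A)
        q = self.trainable(obs) + self.prior_scale * self.prior(obs)
        return q.view(obs.shape[0], *self.out_shape)

class QEnsemble(nn.Module):
    def __init__(self, B, **net_kwargs):
        super().__init__()
        self.members = nn.ModuleList(PriorQNet(**net_kwargs) for _ in range(B))

    def sample_q(self, obs):
        """Draw one ensemble member uniformly and return its Q-values."""
        b = torch.randint(len(self.members), (1,)).item()
        return self.members[b](obs)

# Usage: sampled Q-values for one observation, for all rewards and actions.
ens = QEnsemble(B=10, obs_dim=8, n_actions=3, n_rewards=5)
q = ens.sample_q(torch.zeros(1, 8))              # shape (1, 5, 3)
```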
Multi-reward adaptation. We propose DBMR-BPI, an extension of DBMF-BPI to the multi-reward setting with a finite set of rewards. At each time step, DBMR-BPI observes a vector of rewards and drives its exploration according to these observed values. More precisely, the extension mainly consists of the following modifications:

1. The Q-values are learned for all rewards. In the finite-action setting, this implies that there are $A|\mathcal{R}|$ values to estimate in every state. Alternatively, one can choose to feed the networks with a one-hot encoding of the chosen reward (or use embeddings), but we found this change to be less effective.
2. We explore according to the difficulty of the reward function. Specifically, we explore a reward $r$ with probability proportional to $1/\Delta_{t,r}^2$, as this term is typically the leading factor in the sample complexity. For the sake of stability, at the beginning of each episode (or every $1/(1-\gamma)$ steps) we sample a reward $r \in \mathcal{R}$ to explore, and apply DBMF-BPI on this reward.

The main steps of the algorithm are outlined in Alg. 2, and all the details can be found in app. E.1.

### 4.1 Numerical Results

We evaluate the performance of DBMR-BPI and compare it with state-of-the-art methods in unsupervised RL [24] (note that this is the first algorithm for multi-reward exploration in Deep RL). More precisely, we compare our algorithm to RND [26] (Random Network Distillation), Disagreement [27] and APT [29]. The exploration procedure of these methods is guided by an intrinsic reward function that is learnt during exploration (e.g., by quantifying the parametric uncertainty of the model, or some prediction uncertainty; these methods are detailed in app. E.2). As in [57, 53], we evaluate on different challenging environments from the DeepMind behavior suite [36]: the Cartpole swing-up problem and a stochastic version of Deep Sea [36, 53] with varying levels of difficulty. We also evaluate the methods on the link adaptation problem, a radio network control task where the agent needs to choose a coding and modulation scheme in wireless networks [37], and report the results in app. D.3.1. Lastly, we also evaluate the exploration quality of DBMR-BPI both on a known set of pre-defined rewards and on unseen rewards.

The Cartpole swing-up [36]. In the classical version of this environment, the goal is to balance a pole initially pointing downward, and the agent incurs a cost $c$ whenever the cart moves. In the multi-reward variant of Cartpole, the goal is to stabilize the pole vertically around a specific position $x_0$. The agent earns a positive reward only if the pole's angle, its angular velocity, and the cart's position, respectively, satisfy the conditions (1) $\cos(\theta) > k/20$, (2) $|\dot\theta| \le \theta_0$, and (3) $|x - x_0| \le 1 - k/20$, where $\theta_0$ is a fixed constant and $k$ is an integer representing the difficulty. We considered 5 values of $x_0$ linearly spaced in $[-1.5, 1.5]$, so that $|\mathcal{R}| = 5$. To assess DBMR-BPI's capacity to generalize to unseen rewards, we uniformly sample 5 additional values of $x_0$ in the same interval that are not used during training, and we denote them by $\mathcal{R}_{\rm rnd}$. In tab. 1 we present results for two levels of difficulty, $k \in \{3, 5\}$, and different numbers of training steps $T$. The table presents statistics for the random variable $X_i = \frac{1}{T}\sum_{t=1}^{T} \mathbb{1}\{r_i(s_t, a_t) > 0\}$, where $i$ indexes the rewards in the set. This metric measures the average amount of information collected by the agent that is useful to learn how to stabilize the pole upright. $\mathrm{Avg}_{\mathcal{R}}(\{X_i\})$ (similarly for the other metrics) indicates the arithmetic mean of $\{X_i\}_i$ over the rewards.
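As a small illustration (on hypothetical logged data), the metric $X_i$ can be computed from a log of per-step reward vectors as follows:

```python
# Sketch: X_i = (1/T) * sum_t 1{r_i(s_t, a_t) > 0} from a (T, |R|) reward log.
# The log below is a random placeholder, not experimental data.
import numpy as np

def information_metric(reward_log):
    return (np.asarray(reward_log) > 0).mean(axis=0)   # one X_i per reward

rng = np.random.default_rng(0)
fake_log = rng.uniform(-1, 1, size=(1000, 5))           # placeholder reward log
X = information_metric(fake_log)                         # shape (5,)
print(X.mean(), X.std(), X.min())                        # Avg, std, min over rewards
```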
We average results across 30 seeds. In general, we see that DBMR-BPI outperforms all of the unsupervised learning baselines. Interestingly, even though DBMR-BPI uses the base set of rewards $\mathcal{R}$ to explore, we see better performance when collecting rewards from $\mathcal{R}_{\rm rnd}$ (which are not used by DBMR-BPI during exploration). The reason may be that some rewards in $\mathcal{R}$ (e.g., those with $|x_0| = 1.5$) are harder to learn, while it is unlikely for the random rewards to have such values of $x_0$. This result seems to suggest that data collected by DBMR-BPI could also be used to optimize similar, but unseen, rewards. We refer the reader to app. D.2.1 for more results and details on the environment.

| Setting | Method | $\mathbb{E}[\mathrm{Avg}_{\mathcal{R}}(\{X_i\})]$ | $\mathbb{E}[\mathrm{std}_{\mathcal{R}}(\{X_i\})]$ | $\mathbb{E}[\min_{\mathcal{R}}(\{X_i\})]$ | $\mathbb{E}[\mathrm{Avg}_{\mathcal{R}_{\rm rnd}}(\{X_i\})]$ | $\mathbb{E}[\mathrm{std}_{\mathcal{R}_{\rm rnd}}(\{X_i\})]$ | $\mathbb{E}[\min_{\mathcal{R}_{\rm rnd}}(\{X_i\})]$ |
|---|---|---|---|---|---|---|---|
| k = 3, T = 150000 | DBMR-BPI | 21.3 (20.6, 21.9) | 6.1 (5.5, 6.6) | 12.8 (12.0, 13.6) | 23.1 (22.2, 24.2) | 4.1 (3.6, 4.6) | 17.1 (16.0, 18.2) |
| | RND | 1.3 (1.0, 1.6) | 0.3 (0.2, 0.3) | 0.9 (0.7, 1.2) | 1.4 (1.1, 1.7) | 0.2 (0.1, 0.2) | 1.1 (0.9, 1.4) |
| | APT | 0.3 (0.3, 0.4) | 0.1 (0.1, 0.2) | 0.2 (0.1, 0.2) | 0.3 (0.3, 0.4) | 0.1 (0.1, 0.1) | 0.2 (0.2, 0.2) |
| | Disagreement | 0.9 (0.7, 1.2) | 0.3 (0.2, 0.6) | 0.5 (0.4, 0.6) | 1.0 (0.7, 1.6) | 0.3 (0.1, 0.5) | 0.6 (0.5, 0.7) |
| k = 5, T = 200000 | DBMR-BPI | 20.2 (19.6, 20.6) | 6.4 (6.0, 6.9) | 11.4 (10.7, 12.0) | 22.3 (21.5, 23.1) | 4.5 (4.0, 4.9) | 15.5 (14.4, 16.7) |
| | RND | 1.2 (0.9, 1.5) | 0.2 (0.2, 0.3) | 0.8 (0.6, 1.1) | 1.1 (0.9, 1.4) | 0.2 (0.1, 0.2) | 0.9 (0.7, 1.1) |
| | APT | 0.3 (0.3, 0.4) | 0.1 (0.1, 0.1) | 0.2 (0.1, 0.2) | 0.3 (0.3, 0.4) | 0.1 (0.1, 0.1) | 0.2 (0.2, 0.3) |
| | Disagreement | 0.7 (0.5, 1.0) | 0.3 (0.2, 0.4) | 0.4 (0.3, 0.6) | 0.8 (0.5, 1.0) | 0.2 (0.1, 0.3) | 0.5 (0.4, 0.7) |

Table 1: Cartpole swing-up. Statistics for the random variable $X_i = \frac{1}{T}\sum_{t=1}^{T} \mathbb{1}\{r_i(s_t,a_t) > 0\}$ (with $r_i \in \mathcal{R}$ or $r_i \in \mathcal{R}_{\rm rnd}$). Values are multiplied by 100 and rounded to the first digit. In bold, statistically significant results (in parentheses we indicate the 95% confidence interval over 30 seeds).

The Deep Sea [36]. In the Deep Sea environment the agent moves in a riverswim-like environment, and the actions are randomized for each different seed. The agent starts in the top-left corner of a grid of size $N \times N$, and can only move diagonally, towards the bottom of the grid. In the classical setup, the agent observes an $N^2$-dimensional one-hot encoding of its position, and the goal is to reach the bottom-right corner. We adapt it to a multi-reward setting by considering $N$ different rewards $\mathcal{R} = \{r_1, \ldots, r_N\}$, so that $r_i$ assigns a positive reward whenever the agent reaches column $i$ in the last row of the grid. As in [53], we include a small probability of the agent taking the wrong action. To assess the capacity of DBMR-BPI to collect unseen rewards, for each seed we sampled 20 reward functions $\mathcal{R}_{\rm rnd} = (r'_i)_{i=1}^{20}$, each of dimension $N(N+1)/2$ (where the $i$-th value corresponds to a possible position in the grid), from a Dirichlet distribution with parameter $\alpha = (\alpha_i)_i$, with $\alpha_i = 1/(10N)$ for all $i = 1, \ldots, N(N+1)/2$, and computed the reward for these additional reward functions. The reader can find more details regarding the environment in app. D.2.2.
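For concreteness, the random Deep Sea reward functions described above can be generated roughly as follows (a sketch; the exact mapping from grid positions to reward indices is an assumption):

```python
# Sketch: 20 random reward functions over the N(N+1)/2 reachable Deep Sea cells,
# drawn from a Dirichlet distribution with concentration 1/(10N).
import numpy as np

N = 20
n_cells = N * (N + 1) // 2                    # reachable positions in the grid
rng = np.random.default_rng(0)
alpha = np.full(n_cells, 1.0 / (10 * N))
R_rnd = rng.dirichlet(alpha, size=20)         # shape (20, n_cells), rows sum to 1
```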
| Setting | Method | $\mathbb{E}[\mathrm{GM}_{\mathcal{R}}(\{X_i\})]$ | $\mathbb{E}[\mathrm{std}_{\mathcal{R}}(\{X_i\})]$ | $\mathbb{E}[\sum_{i=1}^{N} \mathbb{1}\{X_i = 0\}]$ | $\mathbb{E}[\mathrm{Avg}_{\mathcal{R}_{\rm rnd}}(\{X_i\})]$ | $\mathbb{E}[\mathrm{std}_{\mathcal{R}_{\rm rnd}}(\{X_i\})]$ |
|---|---|---|---|---|---|---|
| N = 20, T = 50000 | DBMR-BPI | 4.6 (4.6, 4.6) | 1.8 (1.7, 1.8) | 0.0 (0.0, 0.0) | 8.9 (8.5, 9.2) | 3.6 (3.2, 4.0) |
| | RND | 4.6 (4.2, 4.7) | 1.7 (1.6, 1.8) | 0.0 (0.0, 0.1) | 8.7 (8.4, 9.1) | 3.5 (3.1, 4.0) |
| | APT | 1.4 (0.8, 1.9) | 5.3 (4.8, 5.8) | 0.9 (0.6, 1.3) | 8.7 (8.2, 9.2) | 6.6 (5.9, 7.4) |
| | Disagreement | 0.0 (0.0, 0.0) | 6.2 (5.9, 6.5) | 4.0 (3.4, 4.5) | 8.9 (8.3, 9.5) | 6.4 (5.8, 7.1) |
| N = 30, T = 100000 | DBMR-BPI | 3.1 (2.9, 3.2) | 0.7 (0.7, 0.7) | 0.0 (0.0, 0.1) | 6.0 (5.8, 6.3) | 3.0 (2.5, 3.6) |
| | RND | 1.8 (1.3, 2.2) | 1.9 (1.9, 2.0) | 0.3 (0.2, 0.5) | 6.1 (5.9, 6.4) | 3.0 (2.4, 3.5) |
| | APT | 0.0 (0.0, 0.0) | 3.9 (3.6, 4.2) | 3.3 (2.8, 3.8) | 5.9 (5.6, 6.2) | 4.3 (3.8, 4.7) |
| | Disagreement | 0.0 (0.0, 0.0) | 7.0 (6.6, 7.3) | 11.5 (10.8, 12.2) | 6.2 (5.6, 6.7) | 6.1 (5.6, 6.6) |

Table 2: Deep Sea. Statistics for the random variable $X_i = \frac{1}{T}\sum_{t=1}^{T} r_i(s_t, a_t)$ (with $r_i \in \mathcal{R}$ or $r_i \in \mathcal{R}_{\rm rnd}$). Values are multiplied by 100 (except the third metric for the base rewards) and rounded to the first digit. In bold, statistically significant results (in parentheses we indicate the 95% confidence interval over 30 seeds).

In tab. 2 we show some statistics for the random variable $X_i = \frac{1}{T}\sum_{t=1}^{T} r_i(s_t, a_t)$, which quantifies the average reward collected for the $i$-th reward in a specific instance. In the table, $\mathrm{GM}_{\mathcal{R}}(\{X_i\})$ indicates the geometric mean over the rewards (the arithmetic mean would be a constant for the base rewards in this setting, and the GM is more appropriate [58]), and $\sum_{i=1}^{N} \mathbb{1}\{X_i = 0\}$ indicates the number of cells in the last row of the grid that have not yet been visited at time step $T$. When considering the set of rewards $\mathcal{R}$ used by DBMR-BPI, we see that APT and Disagreement do not match the performance of DBMR-BPI and RND, focusing most of the time on a few rewards. On the other hand, while DBMR-BPI and RND show similar performance for $N = 20$, the difference increases for $N = 30$ on the base set of rewards. When considering the total reward collected from the set $\mathcal{R}_{\rm rnd}$, we see that DBMR-BPI achieves performance similar to RND, which confirms the capability of DBMR-BPI to collect reward from unseen reward functions (refer to app. D.2.2 for further details).

## 5 Discussion and Conclusions

Exploration in RL and Best Policy Identification. There exists a variety of exploration methods in RL [59] and DRL (see [60, 53] and references therein). Our setting is a generalization of classical single-reward BPI with fixed confidence [16, 17, 61, 62, 18, 19]. This line of work typically leverages instance-specific lower bounds on the sample complexity (or approximations of it) to devise efficient exploration strategies, but does not directly address the MR-BPI problem. While we primarily draw inspiration from [17, 53], we enhance their work by introducing a multi-reward generalization applicable to both the tabular case and DRL, along with some technical improvements (see app. C and app. E.1). Lastly, while not directly related, in [63, 64, 65] the authors study the identification of the Pareto-optimal frontier in a multi-objective bandit model, a framework yet to be explored in MDPs.

Multi-Reward. There are several works considering a setting where the agent deals with multiple diverse reward signals [10, 66, 11, 31]. Shelton [10] investigates the case where an agent observes a vector of rewards from multiple sources at each time step. The author proposes an algorithm attaining a Nash equilibrium on the average return of each source.
In [66], the authors investigate a multi-reward setting with continuous states and devise an algorithm yielding optimal policies for any convex combination of reward functions. Arguably, the closest work to ours is Dann et al. [11], who investigate multi-reward tabular MDPs, aiming at optimizing a regret measure that combines the various reward components through an operator, and provide regret bounds for action-elimination algorithms. Despite the similarities, these works focus on the classical regret minimization problem and do not address the pure exploration case. A related framework is multi-objective RL [30, 67, 31], a technique with various applications in areas such as molecular design [68] and pricing strategy optimization [69]. In this framework, the agent seeks to optimize multiple rewards simultaneously. To the best of our knowledge, there is no work investigating the exploration problem in multi-objective RL, a setting that could potentially benefit from more efficient exploration strategies, such as the ones that we propose.

Reward-Free Exploration. A relevant body of literature considers Reward-Free Exploration (RFE) [21, 22, 70, 71, 23, 72], where the agent learns behaviours without a specific reward. For example, Jin et al. [21] study reward-free RL in the tabular episodic case to identify $\varepsilon$-optimal policies for any possible reward. They provide a lower bound on the length of the exploration phase scaling as $\Omega(H^2 S^2 A / \varepsilon^2)$ and devise an algorithm nearly matching it, of order $O(H^5 S^2 A / \varepsilon^2)$. This bound was improved in [22] and [70], where the authors propose the RF-UCRL and RF-Express algorithms for episodic reward-free RL. On a similar line, the authors of [23] focus on an RFE setting where the minimal sub-optimality gap is a known quantity, and provide an algorithm with reduced sample complexity. Another relevant RFE framework is maximum entropy exploration [32, 33, 24, 73], which studies the problem of state entropy maximization. This type of exploration is particularly useful for offline RL [74], since it is well known that the coverage of the state space characterizes the sample complexity of the problem [75, 76]. Lastly, the reward-free exploration phase can be followed by a downstream task adaptation, as in unsupervised RL [24, 25, 26, 27, 28, 29]. While these settings are highly relevant and provide valuable insights for other RL domains (such as facilitating representation learning in a self-supervised manner [77, 78, 79]), some RFE techniques can be impractical in realistic scenarios (e.g., exploring for any reward) or do not address the MR-BPI problem directly. Furthermore, unlike our setup, most of the works in RFE focus on the minimax setting only.

Conclusions. In this work, we investigated the MR-BPI problem, focusing on efficiently determining the best policies for a diverse set of rewards $\mathcal{R}$ with a specified confidence level. We established a fundamental sample complexity result for MR-BPI and proposed two efficient methods for multi-reward exploration: MR-NaS for tabular MDPs and DBMR-BPI for DRL. Numerical results on challenging exploration environments suggest that both algorithms exhibit strong exploration capabilities and generalization properties. We believe that our study benefits various areas of RL by enabling more principled exploration in multi-reward scenarios.
This includes applications in multi-objective RL, which naturally lends itself to optimization over multiple objectives, and offline RL, where it is crucial to collect a dataset that performs well across different rewards.

## Acknowledgments

This research was conducted while the authors were employed at Ericsson in Stockholm, Sweden. We would like to express our gratitude to Burak Demirel, Pablo Soldati, Yu Wang and Peter Björkén for their support during the preparation of this work. We would also like to thank Maxime Bouton and Martin Isaksson for their critical feedback that enhanced the quality of this paper. On a personal note, Alessio Russo wishes to personally thank Letizia Orsini for her assistance in reviewing the manuscript. I also dedicate this work to the memory of my beloved cat, Caterina, whose companionship and gentle spirit brought immeasurable joy to my life and was a comforting presence throughout my academic journey. Her memory will forever be a source of comfort and inspiration.

## References

[1] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 2017.
[2] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 2020.
[3] Daniel J Mankowitz, Andrea Michi, Anton Zhernov, Marco Gelmi, Marco Selvi, Cosmin Paduraru, Edouard Leurent, Shariq Iqbal, Jean-Baptiste Lespiau, Alex Ahern, et al. Faster sorting algorithms discovered using deep reinforcement learning. Nature, 2023.
[4] Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 2022.
[5] Jaemin Seo, Sang Kyeun Kim, Azarakhsh Jalalvand, Rory Conlin, Andrew Rothstein, Joseph Abbate, Keith Erickson, Josiah Wai, Ricardo Shousha, and Egemen Kolemen. Avoiding fusion plasma tearing instability with deep reinforcement learning. Nature, 2024.
[6] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving Rubik's Cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
[7] Elia Kaufmann, Leonard Bauersfeld, Antonio Loquercio, Matthias Müller, Vladlen Koltun, and Davide Scaramuzza. Champion-level drone racing using deep reinforcement learning. Nature, 2023.
[8] Sarmad Ahmad Abbasi, Awais Ahmed, Seungmin Noh, Nader Latifi Gharamaleki, Seonhyoung Kim, AM Masum Bulbul Chowdhury, Jin-young Kim, Salvador Pané, Bradley J Nelson, and Hongsoo Choi. Autonomous 3d positional control of a magnetic microrobot using reinforcement learning. Nature Machine Intelligence, 2024.
[9] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
[10] Christian Shelton. Balancing multiple sources of reward in reinforcement learning. Advances in Neural Information Processing Systems, 13, 2000.
[11] Christoph Dann, Yishay Mansour, and Mehryar Mohri. Reinforcement learning can be more efficient with multiple rewards. In International Conference on Machine Learning, 2023.
[12] Sha Luo, Hamidreza Kasaei, and Lambert Schomaker. Accelerating reinforcement learning for reaching using continuous curriculum learning. In 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020.
[13] Anisio Lacerda. Multi-objective ranked bandits for recommender systems. Neurocomputing, 246:12-24, 2017.
[14] Cleverson V Nahum, Victor Hugo Lopes, Ryan M Dreifuerst, Pedro Batista, Ilan Correa, Kleber V Cardoso, Aldebaro Klautau, and Robert W Heath. Intent-aware radio resource scheduling in a RAN slicing scenario using reinforcement learning. IEEE Transactions on Wireless Communications, 2023.
[15] Chamitha De Alwis, Pardeep Kumar, Quoc-Viet Pham, Kapal Dev, Anshuman Kalla, Madhusanka Liyanage, and Won-Joo Hwang. Towards 6G: Key technological directions. ICT Express, 9(4):525-533, 2023.
[16] Aymen Al Marjani and Alexandre Proutiere. Adaptive sampling for best policy identification in Markov decision processes. In International Conference on Machine Learning, pages 7459-7468. PMLR, 2021.
[17] Aymen Al Marjani, Aurélien Garivier, and Alexandre Proutiere. Navigating to the best policy in Markov decision processes. Advances in Neural Information Processing Systems, 34:25852-25864, 2021.
[18] Jérôme Taupin, Yassir Jedra, and Alexandre Proutiere. Best policy identification in discounted linear MDPs. In Sixteenth European Workshop on Reinforcement Learning, 2023.
[19] Alessio Russo and Alexandre Proutiere. On the sample complexity of representation learning in multi-task bandits with global and local structure. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9658-9667, 2023.
[20] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998-1027. PMLR, 2016.
[21] Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning. PMLR, 2020.
[22] Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, and Michal Valko. Adaptive reward-free exploration. In Proceedings of the 32nd International Conference on Algorithmic Learning Theory, 2021.
[23] Jingfeng Wu, Vladimir Braverman, and Lin Yang. Gap-dependent unsupervised exploration for reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 4109-4131. PMLR, 2022.
[24] Michael Laskin, Denis Yarats, Hao Liu, Kimin Lee, Albert Zhan, Kevin Lu, Catherine Cang, Lerrel Pinto, and Pieter Abbeel. URLB: Unsupervised reinforcement learning benchmark. In Deep RL Workshop NeurIPS 2021, 2021.
[25] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, pages 2778-2787. PMLR, 2017.
[26] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2018.
[27] Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. Self-supervised exploration via disagreement. In International Conference on Machine Learning, pages 5062-5071. PMLR, 2019.
[28] Younggyo Seo, Lili Chen, Jinwoo Shin, Honglak Lee, Pieter Abbeel, and Kimin Lee. State entropy maximization with random encoders for efficient exploration. In International Conference on Machine Learning, pages 9443-9454. PMLR, 2021.
[29] Hao Liu and Pieter Abbeel. Behavior from the void: Unsupervised active pre-training. Advances in Neural Information Processing Systems, 34:18459-18473, 2021.
[30] Diederik M Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67-113, 2013.
[31] Conor F Hayes, Roxana Rădulescu, Eugenio Bargiacchi, Johan Källström, Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, Luisa M Zintgraf, Richard Dazeley, Fredrik Heintz, et al. A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent Systems, 2022.
[32] Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. In International Conference on Machine Learning, pages 2681-2691. PMLR, 2019.
[33] Mirco Mutti and Marcello Restelli. An intrinsically-motivated approach for learning highly exploring and fast mixing policies. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[34] Jérôme Taupin, Yassir Jedra, and Alexandre Proutiere. Best policy identification in discounted linear MDPs. In Sixteenth European Workshop on Reinforcement Learning, 2023.
[35] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. Advances in Neural Information Processing Systems (NeurIPS), 26, 2013.
[36] Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvári, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, and Hado van Hasselt. Behaviour suite for reinforcement learning. In International Conference on Learning Representations, 2020.
[37] Erik Dahlman, Stefan Parkvall, and Johan Skold. 5G NR: The next generation wireless access technology. Academic Press, 2020.
[38] Abraham Wald. Sequential analysis. John Wiley, 1947.
[39] Gary Lorden. Procedures for reacting to a change in distribution. The Annals of Mathematical Statistics, pages 1897-1908, 1971.
[40] Tze Leung Lai. Asymptotic optimality of invariant sequential probability ratio tests. The Annals of Statistics, pages 318-333, 1981.
[41] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.
[42] Richard Combes and Alexandre Proutiere. Unimodal bandits: Regret lower bounds and optimal algorithms. In International Conference on Machine Learning, pages 521-529. PMLR, 2014.
[43] Aurélien Garivier, Pierre Ménard, and Gilles Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. Mathematics of Operations Research, 44(2):377-399, 2019.
[44] Jungseul Ok, Alexandre Proutiere, and Damianos Tranos. Exploration in structured reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018.
[45] Rémy Degenne, Pierre Ménard, Xuedong Shang, and Michal Valko. Gamification of pure exploration for linear bandits, 2020.
[46] Rémy Degenne and Wouter M Koolen. Pure exploration with multiple correct answers. Advances in Neural Information Processing Systems, 32, 2019.
[47] Alessio Russo and Filippo Vannella. Fair best arm identification with fixed confidence. arXiv preprint arXiv:2408.17313, 2024.
[48] Tze Leung Lai. Information bounds and quick detection of parameter changes in stochastic systems. IEEE Transactions on Information Theory, 44(7):2917-2929, 1998.
[49] Alexander Tartakovsky, Igor Nikiforov, and Michele Basseville. Sequential analysis: Hypothesis testing and changepoint detection. CRC Press, 2014.
[50] Alessio Russo and Alexandre Proutiere. Balancing detectability and performance of attacks on the control channel of Markov decision processes. In 2022 American Control Conference (ACC), pages 2843-2850. IEEE, 2022.
[51] Aymen Al Marjani, Aurélien Garivier, and Alexandre Proutiere. Navigating to the best policy in Markov decision processes. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[52] Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.
[53] Alessio Russo and Alexandre Proutiere. Model-free active exploration in reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[54] Alexander L Strehl and Michael L Littman. An empirical evaluation of interval estimation for Markov decision processes. In 16th IEEE International Conference on Tools with Artificial Intelligence, pages 128-135. IEEE, 2004.
[55] Bradley Efron. Bootstrap methods: Another look at the jackknife. In Breakthroughs in Statistics: Methodology and Distribution, pages 569-593. Springer, 1992.
[56] Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018.
[57] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. Advances in Neural Information Processing Systems (NeurIPS), 29, 2016.
[58] Kenric P Nelson. Assessing probabilistic inference by comparing the generalized mean of the model and source probabilities. Entropy, 19(6):286, 2017.
[59] Susan Amin, Maziar Gomrokchi, Harsh Satija, Herke van Hoof, and Doina Precup. A survey of exploration methods in reinforcement learning. arXiv e-prints, arXiv 2109, 2021.
[60] Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Pierre Perrault, Michal Valko, and Pierre Ménard. Model-free posterior sampling via learning rate randomization. Advances in Neural Information Processing Systems, 36, 2023.
[61] Andrea Tirinzoni, Aymen Al Marjani, and Emilie Kaufmann. Near instance-optimal PAC reinforcement learning for deterministic MDPs. Advances in Neural Information Processing Systems (NeurIPS), 35:8785-8798, 2022.
[62] Andrew J Wagenmaker, Max Simchowitz, and Kevin Jamieson. Beyond no regret: Instance-dependent PAC reinforcement learning. In Conference on Learning Theory, pages 358-418. PMLR, 2022.
[63] Peter Auer, Chao-Kai Chiang, Ronald Ortner, and Madalina Drugan. Pareto front identification from stochastic bandit feedback. In Artificial Intelligence and Statistics, pages 939-947. PMLR, 2016.
[64] Cyrille Kone, Emilie Kaufmann, and Laura Richert. Adaptive algorithms for relaxed Pareto set identification. Advances in Neural Information Processing Systems, 36:35190-35201, 2023.
[65] Zixin Zhong, Wang Chi Cheung, and Vincent Tan. Achieving the Pareto frontier of regret minimization and best arm identification in multi-armed bandits. Transactions on Machine Learning Research, 2023.
[66] Daniel J Lizotte, Michael H Bowling, and Susan A Murphy. Efficient reinforcement learning with multiple reward functions for randomized controlled trial analysis. In ICML, 2010.
[67] Runzhe Yang, Xingyuan Sun, and Karthik Narasimhan. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. Advances in Neural Information Processing Systems, 32, 2019.
[68] Julien Horwood and Emmanuel Noutahi. Molecular design in synthetically accessible chemical space via deep reinforcement learning. ACS Omega, 5(51):32984-32994, 2020.
[69] Elena Krasheninnikova, Javier García, Roberto Maestre, and Fernando Fernández. Reinforcement learning for pricing strategy optimization in the insurance industry. Engineering Applications of Artificial Intelligence, 80:8-19, 2019.
[70] Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kaufmann, Edouard Leurent, and Michal Valko. Fast active learning for pure exploration in reinforcement learning. In International Conference on Machine Learning. PMLR, 2021.
[71] Ruosong Wang, Simon S. Du, Lin F. Yang, and Ruslan Salakhutdinov. On reward-free reinforcement learning with linear function approximation. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[72] Andrew Wagenmaker and Kevin Jamieson. Instance-dependent policy learning for linear MDPs via online experiment design. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[73] Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Pierre Perrault, Yunhao Tang, Michal Valko, and Pierre Ménard. Fast rates for maximum entropy exploration. In International Conference on Machine Learning, 2023.
[74] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
[75] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89-129, 2008.
[76] Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042-1051. PMLR, 2019.
[77] Diana Borsa, Andre Barreto, John Quan, Daniel J Mankowitz, Hado van Hasselt, Remi Munos, David Silver, and Tom Schaul. Universal successor features approximators. In International Conference on Learning Representations, 2018.
[78] Ahmed Touati and Yann Ollivier. Learning one representation to optimize all rewards. Advances in Neural Information Processing Systems, 34:13-23, 2021.
[79] Ahmed Touati, Jérémy Rapin, and Yann Ollivier. Does zero-shot reinforcement learning exist? In The Eleventh International Conference on Learning Representations, 2022.
[80] Yassir Jedra and Alexandre Proutiere. Optimal best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 33:10007-10017, 2020.
[81] Aurélien Garivier and Emilie Kaufmann. Nonasymptotic sequential tests for overlapping hypotheses applied to near-optimal arm identification in bandit models. Sequential Analysis, 40(1):61-96, 2021.
[82] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17(1):1-42, 2016.
[83] Satinder Singh, Tommi Jaakkola, Michael L Littman, and Csaba Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 2000.
[84] Gersende Fort, Eric Moulines, and Pierre Priouret. Convergence of adaptive and interacting Markov chain Monte Carlo algorithms. 2011.
[85] David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
Markov chains and mixing times, volume 107. American Mathematical Soc., 2017. [86] Anders Jonsson, Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Edouard Leurent, and Michal Valko. Planning in markov decision processes with gap-dependent sample complexity. Advances in Neural Information Processing Systems, 33:1253 1263, 2020. [87] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017. [88] Guido Van Rossum and Fred L Drake Jr. Python reference manual. Centrum voor Wiskunde en Informatica Amsterdam, 1995. [89] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with Num Py. Nature, 2020. [90] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, Ilhan Polat, Yu Feng, Eric W. Moore, Jake Vander Plas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and Sci Py 1.0 Contributors. Sci Py 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261 272, 2020. [91] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1 5, 2016. [92] Michael Waskom, Olga Botvinnik, Drew O Kane, Paul Hobson, Saulius Lukauskas, David C Gemperline, Tom Augspurger, Yaroslav Halchenko, John B. Cole, Jordi Warmenhoven, Julian de Ruiter, Cameron Pye, Stephan Hoyer, Jake Vanderplas, Santi Villalba, Gero Kunter, Eric Quintero, Pete Bachant, Marcel Martin, Kyle Meyer, Alistair Miles, Yoav Ram, Tal Yarkoni, Mike Lee Williams, Constantine Evans, Clark Fitzgerald, Brian, Chris Fonnesbeck, Antony Lee, and Adel Qalieh. mwaskom/seaborn: v0.8.1, September 2017. [93] Wes Mc Kinney et al. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference, volume 445, pages 51 56. Austin, TX, 2010. [94] John D Hunter. Matplotlib: A 2d graphics environment. Computing in science & engineering, 2007. [95] CLARABEL. CLARABEL Optimizer, 2024. [96] Alexander Domahidi, Eric Chu, and Stephen Boyd. Ecos: An socp solver for embedded systems. In 2013 European control conference (ECC), pages 3071 3076. IEEE, 2013. [97] Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn, and Kurt Smith. Cython: The best of both worlds. Computing in Science & Engineering, 2011. [98] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems (Neur IPS), 32, 2019. [99] Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. 
Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):834-846, 1983.
[100] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
[101] Burak Demiral, Euhanna Ghadimi, Yu Wang, and Pablo Soldati. Pareto-optimal solutions in RANs: How multi-objective reinforcement learning can implement high-level intents? 2024. Manuscript in preparation.
[102] Ian Osband, Benjamin Van Roy, Daniel J Russo, and Zheng Wen. Deep exploration via randomized value functions. Journal of Machine Learning Research, 20:1-62, 2019.

Contents

1 Introduction
2 Problem setting
2.1 Markov Decision Processes and the Set of Rewards
2.2 Single-Reward Best Policy Identification with Fixed Confidence
2.3 Multi-Reward Best Policy Identification with Fixed Confidence
3 Best Policy Identification with a Set of Rewards
3.1 Sample Complexity Lower Bound
3.2 Relaxed Characteristic Rate
3.3 MR-NaS Algorithm
3.4 Numerical Results on Tabular MDPs
4 Exploration in Deep Reinforcement Learning with a Set of Rewards
4.1 Numerical Results
5 Discussion and Conclusions
A Limitations and Broader Impact
B Sample Complexity Bounds
B.1 Lower Bound Proof
B.2 Relaxed Characteristic Rate Proof
B.2.1 Step 1: simplifying the set of confusing models
B.2.2 Step 2: relating the KL divergence to the sub-optimality gaps
B.2.3 Technical lemmas
B.2.4 A Closed Form Solution for the Generative Case
B.3 Extension to Random Rewards
B.3.1 Lower bound with random rewards
B.3.2 Relaxed characteristic rate with random rewards
B.4 Extension to a Continuous Set of Rewards
B.4.1 Non-convexity of the minimum sub-optimality gap
B.4.2 A finite set of rewards can be sufficient
B.4.3 Technical lemmas
C MR-NaS Algorithm
C.1 Forcing policy
C.2 Almost Sure Sample Complexity Upper Bound of MR-NaS
C.2.1 Forced exploration property
C.2.2 Convergence of the state-action visitation frequency
C.2.3 Sample complexity upper bound
C.3 Expected Sample Complexity Upper Bound of MR-NaS
C.3.1 Geometric ergodicity of the sampling rule
C.3.2 Forced exploration with high probability
C.3.3 Concentration of the empirical MDP
C.3.4 Concentration of the state-action visitation frequency
C.3.5 Expected sample complexity of MR-NaS
C.3.6 Alternative forced exploration with high probability
C.4 Stopping Rule
C.5 Technical Lemmas
D Numerical Results
D.1 Numerical Results for the tabular case
D.1.1 Environments Details
D.1.2 Numerical Results
D.2 Numerical Results: Cartpole swing-up and Deep Sea problems
D.2.1 Cartpole swing-up
D.2.2 Deep Sea
D.3 Radio network control problems
D.3.1 Downlink link adaptation problem
E Deep-RL Algorithms Details
E.1 DBMR-BPI algorithm
E.2 Other algorithms
F NeurIPS Paper Checklist

A Limitations and Broader Impact

Limitations. We discuss some limitations of the current work and highlight the main directions for improvement.

Tabular MDPs. The current theoretical analysis only applies to tabular MDPs, and may not necessarily extend to continuous domains in a similar way.

Tracking of the allocation. In MR-NaS we employ a probabilistic policy, which may lead to worse tracking performance compared to deterministic approaches (see, e.g., the classical tracking procedures for best arm identification in multi-armed bandits [20]). These approaches are, however, harder to use in an MDP setting, since it becomes harder to guarantee correct tracking and convergence of the allocation $\omega_t$ to $\omega^\star$.

Lazy updates. MR-NaS computes an allocation $\omega_t^\star = \arg\inf_{\omega \in \Omega(M_t)} U(\omega; M_t)$ at each time step, which could be computationally prohibitive for large-dimensional problems. A possible way to address this issue is to use "laziness" [80], i.e., to compute $\omega_t^\star$ only at time steps $t \in \mathcal{T}$, where $\mathcal{T} \subseteq \mathbb{N}$ prescribes the allocation update schedule (see [80]); a toy sketch of such a schedule is given at the end of this appendix.

Scaling of the rewards for DBMR-BPI. The method that we introduce for exploration in deep RL, DBMR-BPI, explores according to a set of rewards. If these rewards are not correctly scaled, they may impact the learning performance, since the method relies on estimating the sub-optimality gaps.

Number of rewards for DBMR-BPI. The usage of DBMR-BPI could be limited when the number of rewards is too high, as the output size of the Q-network in DBMR-BPI is $A \cdot |R|$. However, our experiments with an ensemble size of 20 and $A \cdot |R|$ of up to 90 did not encounter any issues. Furthermore, the training can be done efficiently, as shown in the code, without looping over the rewards. Nevertheless, exploring the scalability to a larger number of rewards remains an interesting area for future research.

Finite action space for DBMR-BPI. In this paper, we analyze DBMR-BPI within a finite action space.
Although continuous action spaces are equally important, extending the algorithm to such spaces requires additional work, and it is currently unclear how to adapt the approach presented in [53] to continuous action spaces.

Assumption on the uniqueness of the optimal policy. In the paper we assume that there is a unique optimal policy. This assumption is common in the BPI literature [18, 16, 17, 53, 20]. Addressing MDPs with multiple optimal policies, or identifying ε-optimal policies, necessitates the use of more complex overlapping hypothesis tests [81]. This approach is already involved when applied to multi-armed bandits (i.e., MDPs with a single state), and extending it to full MDPs presents an even greater challenge. In that sense, we can provide some insights on the main obstacles and potential solutions. In the presence of multiple optimal policies, one may be interested in ensuring that at the stopping time we have $\Pi^\star(M_{\tau,r}) \subseteq \Pi^\star(M_r)$ for all rewards $r \in R$, which guarantees that we identify a subset of the optimal policies. Unfortunately, ensuring this condition causes an asymmetry between the definition of the set of confusing models and the event $M \in \mathrm{Alt}(M_t)$ in the proof of the stopping rule, an asymmetry that cannot be easily reconciled. A positive result is that, if the set of confusing models can be decomposed as a union, or intersection, over the optimal policies in $M$, then the decomposition proofs presented in apps. B.2 and B.4.3 can be easily extended to the case of multiple optimal policies by considering the sub-optimal actions $a \notin \mathcal{O}_r(s)$ in the decomposition, where $\mathcal{O}_r(s) = \{a : Q^\star_{M_r}(s,a) = V^\star_{M_r}(s)\}$.

Broader impact. This paper primarily focuses on foundational research in reinforcement learning, particularly the exploration problem. Although we do not directly address societal impacts, we recognize their importance. The methods proposed here improve the sample efficiency of RL algorithms and could be applied in various contexts with societal implications. For instance, RL is used in decision-making systems in healthcare, finance, and autonomous vehicles, where biases or errors could have significant consequences. Therefore, while the immediate societal impact of our work may not be evident, we urge future researchers and practitioners to carefully consider the ethical implications and potential negative impacts in their specific applications.
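The "lazy updates" limitation above can be made concrete with a small sketch. This is only an illustration of a doubling update schedule under assumed placeholder names (solve_allocation stands in for the convex program $\arg\inf_{\omega} U(\omega; M_t)$); MR-NaS as analyzed in this paper recomputes the allocation at every step.

```python
import numpy as np

def solve_allocation(S, A):
    # Placeholder for arg inf_w U(w; M_t); here it simply returns a uniform allocation.
    return np.full((S, A), 1.0 / (S * A))

def is_update_step(t):
    # Doubling schedule T = {1, 2, 4, 8, ...}: true iff t is a power of two.
    return t & (t - 1) == 0

S, A = 4, 3
omega = solve_allocation(S, A)
for t in range(1, 33):
    if is_update_step(t):
        omega = solve_allocation(S, A)   # the expensive optimization is solved only O(log t) times
    # ... otherwise keep tracking the last computed allocation while collecting samples ...
```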
B Sample Complexity Bounds

In this appendix we first prove thm. 3.1. In the subsequent subsection we study how to obtain the relaxed characteristic rate presented in thm. 3.2. After that, we present extensions of these results to the case of random rewards (app. B.3) and continuous rewards (app. B.4). Finally, in the last subsection, we present some useful lemmas.

B.1 Lower Bound Proof

The proof leverages the classical change-of-measure argument [41, 20] and proceeds similarly to [16]. For simplicity, we consider the generative setting (i.e., we consider allocations $\omega \in \Delta(S \times A)$), while the extension to the forward model follows similarly to [17].

Proof of thm. 3.1. The proof begins by analysing the likelihood ratio of the observations and comparing it against the worst confusing model. Then, combining the δ-PC event with a data processing inequality yields the lower bound.

Log-likelihood ratio. Consider the MDP $M = (S, A, P, \gamma)$ and an alternative model $M' = (S, A, P', \gamma)$ such that $P \neq P'$. The log-likelihood ratio between $M$ and $M'$ of a sequence of observations $(z_0, z_1, \ldots)$, where $z_i = (s_i, a_i)$, is defined as
$$L_t := \sum_{n=1}^{t} \log \frac{P(s_n \mid z_{n-1})}{P'(s_n \mid z_{n-1})}.$$
To simplify the analysis, let
$$L_t^{s,a} := \sum_{n=1}^{t} \mathbf{1}_{\{z_{n-1}=(s,a)\}} \log \frac{P(s_n \mid s,a)}{P'(s_n \mid s,a)} = \sum_{n=1}^{N_t(s,a)} \log \frac{P(y_n \mid s,a)}{P'(y_n \mid s,a)},$$
where $y_n$ denotes the next state observed after the pair $(s,a)$ has been visited for the $n$-th time and $N_t(s,a)$ is the number of times the state-action pair $(s,a)$ has been visited by the algorithm up to time $t$. Then we can write $L_t = \sum_{s,a} L_t^{s,a}$, and we denote by $\mathbb{E}_M[\cdot]$ (sim. $\mathbb{P}_M[\cdot]$) the expectation (probability) under the measure induced by a model $M$. At the stopping time $\tau$ we have $\mathbb{E}_M[L_\tau] = \sum_{s,a} \mathbb{E}_M[L_\tau^{s,a}]$, with
$$\mathbb{E}_M[L_\tau^{s,a}] = \mathbb{E}_M\Big[\sum_{n=1}^{N_\tau(s,a)} \log \frac{P(y_n \mid s,a)}{P'(y_n \mid s,a)}\Big] = \mathbb{E}_M\Big[\sum_{n \ge 1} \mathbf{1}_{\{N_\tau(s,a) \ge n\}} \log \frac{P(y_n \mid s,a)}{P'(y_n \mid s,a)}\Big].$$
Since $X^{s,a}_n := \log \frac{P(y_n \mid s,a)}{P'(y_n \mid s,a)}$ is independent of $\mathcal{F}_{n-1}$, and since $\{N_\tau(s,a) \le n-1\}^c = \{N_\tau(s,a) \ge n\} \in \mathcal{F}_{n-1}$, we have that
$$\mathbb{E}_M[L_\tau^{s,a}] = \sum_{n \ge 1} \mathbb{P}_M[N_\tau(s,a) \ge n]\, \mathrm{KL}_{M|M'}(s,a) = \mathbb{E}_M[N_\tau(s,a)]\, \mathrm{KL}_{M|M'}(s,a),$$
hence $\mathbb{E}_M[L_\tau] = \sum_{s,a} \mathbb{E}_M[N_\tau(s,a)]\, \mathrm{KL}_{M|M'}(s,a)$, where $\mathrm{KL}_{M|M'}(s,a) = \mathrm{KL}(P(\cdot \mid s,a), P'(\cdot \mid s,a))$.

δ-PC argument. Consider a δ-PC algorithm, define the event $E = \{\forall r \in R,\ \hat\pi_{\tau,r} \in \Pi^\star(M_r)\}$, and observe that $E^c = \{\exists r \in R : \hat\pi_{\tau,r} \notin \Pi^\star(M_r)\}$. Let $\mathrm{Alt}(M) = \{M' \in \mathcal{M} \mid \exists r : \Pi^\star(M'_r) \cap \Pi^\star(M_r) = \emptyset\}$, which can be rewritten as $\mathrm{Alt}(M) = \cup_{r \in R} \mathrm{Alt}_r(M)$, where $\mathrm{Alt}_r(M) = \{M' \in \mathcal{M} \mid \Pi^\star(M'_r) \cap \Pi^\star(M_r) = \emptyset\}$. Before applying the transportation lemma in [82], we first note that for any δ-PC algorithm we have $\mathbb{P}_M[E^c] \le \delta$. Denote by $\Pi$ the set of all Markovian policies and fix $M' \in \mathrm{Alt}(M)$, together with a reward $\bar r \in R$ such that $\Pi^\star(M'_{\bar r}) \cap \Pi^\star(M_{\bar r}) = \emptyset$. Then, under $M'$ we have
$$\mathbb{P}_{M'}[E^c] = \mathbb{P}_{M'}[\exists r : \hat\pi_{\tau,r} \notin \Pi^\star(M_r)] \ge \max_{r \in R} \mathbb{P}_{M'}[\hat\pi_{\tau,r} \notin \Pi^\star(M_r)] \ge \mathbb{P}_{M'}[\hat\pi_{\tau,\bar r} \notin \Pi^\star(M_{\bar r})] = \mathbb{P}_{M'}[\hat\pi_{\tau,\bar r} \in \Pi \setminus \Pi^\star(M_{\bar r})] \ge \mathbb{P}_{M'}[\hat\pi_{\tau,\bar r} \in \Pi^\star(M'_{\bar r})] \ge \mathbb{P}_{M'}[\forall r,\ \hat\pi_{\tau,r} \in \Pi^\star(M'_r)] = 1 - \mathbb{P}_{M'}[\exists r,\ \hat\pi_{\tau,r} \notin \Pi^\star(M'_r)] \ge 1 - \delta.$$
Hence, in view of the transportation lemma (and the data processing inequality) in [82], for any $M' \in \mathrm{Alt}(M)$ we can lower bound the expected log-likelihood ratio at the stopping time $\tau$ as
$$\sum_{s,a} \mathbb{E}_M[N_\tau(s,a)]\, \mathrm{KL}_{M|M'}(s,a) \ge \mathrm{kl}(\mathbb{P}_M[E], \mathbb{P}_{M'}[E]) \ge \mathrm{kl}(1-\delta, \delta).$$
As the inequality above holds for all $M' \in \mathrm{Alt}(M)$, optimizing over the set of confusing models gives
$$\inf_{M' \in \mathrm{Alt}(M)} \sum_{s,a} \mathbb{E}_M[N_\tau(s,a)]\, \mathrm{KL}_{M|M'}(s,a) = \min_{r \in R} \inf_{M' \in \mathrm{Alt}_r(M)} \sum_{s,a} \mathbb{E}_M[N_\tau(s,a)]\, \mathrm{KL}_{M|M'}(s,a) \ge \mathrm{kl}(1-\delta, \delta).$$
Sample complexity. By letting $\omega(s,a) = \mathbb{E}_M[N_\tau(s,a)]/\mathbb{E}_M[\tau]$ and optimizing over all possible allocations $\omega \in \Delta(S \times A)$, we have
$$\mathbb{E}_M[\tau] \cdot \max_{\omega \in \Delta(S \times A)} \min_{r \in R} \inf_{M' \in \mathrm{Alt}_r(M)} \sum_{s,a} \omega(s,a)\, \mathrm{KL}_{M|M'}(s,a) \ge \mathrm{kl}(1-\delta, \delta).$$
Finally, taking the limit as $\delta \to 0$ yields the result. The extension from the generative model to the forward model can be done as in [17, Proposition 2].

B.2 Relaxed Characteristic Rate Proof

Since computing the characteristic time $T^\star(M)$ requires solving a non-convex optimization problem [16], we focus on deriving an upper bound of $T^\star(M)$ that is convex for a finite set of rewards (the continuous case is discussed in app. B.4). This relaxation guarantees that we still identify the optimal policies, at the cost of some over-exploration. In particular, we find a convex upper bound of $T(\omega; M)$ that holds for all possible allocations $\omega \in \Delta(S \times A)$. Recall the definition of $T(\omega; M)$:
$$T(\omega; M)^{-1} := \min_{r \in R} \inf_{M' \in \mathrm{Alt}_r(M)} \sum_{s,a} \omega(s,a)\, \mathrm{KL}_{M|M'}(s,a).$$
To find the upper bound, we proceed in two steps: (1) we simplify the set of confusing models; (2) we relate the KL divergences in $T(\omega; M)$ to the sub-optimality gaps of the MDP, and formulate the computation of the relaxed characteristic rate as a convex problem.
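As a preview of how the resulting convex program can be handled in practice, the following sketch minimizes the simplified upper bound $U(\omega; M)$ obtained at the end of this subsection with CVXPY [91], using an epigraph variable for the outer maximum and an auxiliary variable that lower-bounds $\min_{s'} \omega(s', \pi^\star_r(s'))$. The gaps, spans, and constants $H_r$ below are random placeholders, not quantities computed from a real MDP.

```python
import numpy as np
import cvxpy as cp

# Illustrative placeholder inputs: gaps Delta[r, s, a], spans MD[r, s, a], constants H[r],
# and the optimal action pi_star[r, s] for each reward; gamma is the discount factor.
rng = np.random.default_rng(0)
n_R, S, A, gamma = 2, 4, 3, 0.9
Delta = rng.uniform(0.1, 1.0, size=(n_R, S, A))
MD = rng.uniform(0.5, 2.0, size=(n_R, S, A))
H = rng.uniform(1.0, 5.0, size=n_R)
pi_star = rng.integers(A, size=(n_R, S))

w = cp.Variable((S, A), nonneg=True)   # exploration allocation over state-action pairs
u = cp.Variable(n_R, nonneg=True)      # u[r] lower-bounds min_s w[s, pi_star[r, s]]
t = cp.Variable()                      # epigraph variable for the maximum defining U(w; M)

constraints = [cp.sum(w) == 1]
for r in range(n_R):
    d_min = min(Delta[r, s, a] for s in range(S) for a in range(A) if a != pi_star[r, s])
    for s in range(S):
        constraints.append(u[r] <= w[s, pi_star[r, s]])
        for a in range(A):
            if a == pi_star[r, s]:
                continue  # the bound only involves sub-optimal actions
            c1 = 2 * gamma**2 * MD[r, s, a]**2 / Delta[r, s, a]**2
            c2 = H[r] / d_min**2
            # One term of the maximum: c1 / w(s, a) + c2 / min_s' w(s', pi*_r(s')) <= t
            constraints.append(c1 * cp.inv_pos(w[s, a]) + c2 * cp.inv_pos(u[r]) <= t)

cp.Problem(cp.Minimize(t), constraints).solve()
print("relaxed characteristic time U:", round(float(t.value), 3))
```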
B.2.1 Step 1: simplifying the set of confusing models We use the following lemma from [16, Lemma 2] to decompose the set of confusing models. Lemma B.1 (Decomposition lemma.). We have that the set of confusing models for r R satisfies Altπ r r (M) = s a =π r (s) Altπ r sar(Mr). (7) where Altsar(M) := {M Altπ r (M) : Qπ M r(s, a) > V π M r(s)} is the set of confusing models for (s, a, r, π). B.2.2 Step 2: relating the KL divergence to the sub-optimality gaps Using lem. B.1, we can convexify the characteristic rate and related the KL-divergence terms with the sub-optimality gaps r(s, a). The goal is to obtain an optimization problem where only the sub-optimality gaps depend on the reward function. We have the following result. Proof of thm. 3.2. The proof follows similar steps to [16].We begin by decomposing the set of confusing models. After that, we relate the sub-optimality gaps to the inner product between the transitions and the value functions. Using this relation, we can lower bound the KL terms in T(ω; M). Finally, through an optimization step, we establish the upper bound. In the proof, we indicate by V π Mr the value of a policy π in MDP Mr, and similarly Qπ Mr. Decomposition. Fix r, s, a = π r(s). As in [16], we combine the condition Qπ r M r(s, a) > V π r M r(s) with the fact that Q Mr(s, a) + r(s, a) = V Mr(s) to obtain that r(s, a) < V Mr(s) Q Mr(s, a) + Qπ r M r(s, a) V π r M r(s). This is similar to Eq. (5) in [16]. Next, let P(s, a) = PM (s, a) PM(s, a), where PM(s, a) (sim. PM (s, a)) is a vector of dimension S representing the distribution of the next state given (s, a). Further define the vector difference between the value in Mr and M r: V π r = h V π r M r(s1) V Mr(s1) . . . V π r M r(s S) V Mr(s S) i . Then, letting 1(s) = es be the unit vector with 1 in position s, we find r(s, a) < γ(PM r(s, a) V π r M r PM(s, a) V Mr) 1(s) V π r , = γ P(s, a) V Mr + (γPM (s, a) 1(s)) V π r . (8) Hence, the condition Qπ r M r(s, a) > V π r M r(s) is equivalent to the inequality in (8). This inequality only involves the pairs {(s, a), s (s , π r(s ))}, and therefore the infimum over Altsa(Mr) of P s,a ω(s, a)KLM|M (s, a) is a model M satisfying KLM|M (s, a) = 0 over all the other stateaction pairs. Hence, using lem. B.1, and the reasoning above, we can write T(ω; M) 1 = min r R min s,a =π r (s) inf M Alt π r sar(M) h ω(s, a)KLM|M (s, a) s ω(s , π r(s ))KLM|M (s , π r(s )) i . Sub-optimality gaps. We now relate the KL-terms to the sub-optimality gaps, similarly as in [53, 16]. For any state s and action a = π r(s), we write each of the terms (8) as a fraction of r(s, a) using {αi}2 i=1, which are real values satisfying P2 i=1 αi > 1: ( α1 r(s, a) = γ| P(s, a) V π r Mr|, α2 r(s, a) = (γPM (s, a) 1(s)) V π r . For the first term, we obtain (α1 r(s, a))2 = γ2| P(s, a) (V π r Mr Es P ( |s,a)[V π r Mr(s )])|2, γ2 P(s, a) 2 1MDr(s, a)2, 2γ2KLM|M (s, a)MDr(s, a)2, where we applied Pinsker inequality. Therefore, we can write KLM|M (s, a) α2 1 r(s, a)2 2γ2MDr(s, a)2 | {z } =:B1,r(s,a) For the second term in (8) we have γPM (s, a) 1(s) 1 = |γPM (s|s, a) 1| + γ(1 PM (s|s, a)) 1 + γ and |α2 r(s, a)| (1 + γ) V π . Following the same steps as in lem. 
B.3, we derive that Kr(s, a, α2) Kr, (9) where Kr := maxs KLM|M (s, π r(s)), Kr(s, a, α2) = max(Kr,1(s, a, α2), Kr,2(s, a, α2)) and1 Kr,1(s, a, α2) := min (α2 r(s, a))2(1 γ)2 16γ2Varr(1 + γ)2 , |α2 r(s, a)|4/3(1 γ)4/3 6γ4/3MD4/3 r (1 + γ)4/3 Kr,2(s, a, α2) := min (α2 r(s, a))2(1 γ)3 139(1 + γ)2 , |α2 r(s, a)|(1 γ)5/2 17γ(1 + γ) , (|α2 r(s, a)|(1 γ))4/3 14(1 + γ)4/3MD4/3 r Optimization. Putting all the inequalities together, we find T(ω; M) 1 min r R min s,a =π r (s) inf α:P i αi>1 α2 1ω(s, a)Br,1(s, a) + min s ω(s , π r(s ))Kr(s, a, α2). As in [16], the above inequalities are satisfied also when α belongs to the simplex, and in particular we also have α2 i α4/3 i |αi|, for i {1, 2}. Therefore we can write T(ω; M) 1 min r R min s,a =π r (s) inf α Σ2 α2 1ω(s, a)Br,1(s, a) + min s α2 2ω(s , π r(s )) max(Br,2(s, a), Br,3(s, a)), Br,2(s, a) := min r(s, a)2(1 γ)2 16γ2Varr(1 + γ)2 , r(s, a)4/3(1 γ)4/3 6γ4/3MD4/3 r (1 + γ)4/3 Br,3(s, a) := min r(s, a)2(1 γ)3 139(1 + γ)2 , r(s, a)(1 γ)5/2 17γ(1 + γ) , r(s, a)4/3(1 γ)4/3 14MD4/3 r (1 + γ)4/3 Optimizing for α, we get T(ω; M) 1 min r R min s,a =π r (s) 1 ω(s, a)Br,1(s, a) + 1 maxs ω(s , π r(s )) max(Br,2, Br,3)(s, a) Then, using that r r(s, a) we have T(ω; M) max r R max s,a =π r (s) Hr(s, a) ω(s, a) + min(Hr,1, Hr,2) mins ω(s , π r(s )), 1Here we have different constants compared to [16]. In that work, we believe the authors may have miscomputed the constants by using the logarithm in base 10 instead of the natural logarithm. Hr(s, a) = 2γ2MD(s, a)2 16γ2Varr(1 + γ)2 2r(1 γ)2 , 6γ4/3MD4/3 r (1 + γ)4/3 4/3 r (1 γ)4/3 139(1 + γ)2 2r(1 γ)3 , 17γ(1 + γ) r(1 γ)5/2 , 14MD4/3 r (1 + γ)4/3 4/3 r (1 γ)4/3 Finally, as r < 1 (see [16]) and MDmax 1/(1 γ), we have that Hr,2 = 139(1+γ)2 2r(1 γ)3 . Henceforth, we can derive the following simplified expression for the upper bound T(ω; M) max r R max s,a =π r (s) 2γ2MDr(s, a)2 r(s, a)2ω(s, a) + Hr 2r mins ω(s , π r(s )), with Hr = min 139(1+γ)2 (1 γ)3 , max 16γ2Varr(1+γ)2 (1 γ)2 , 6γ4/3MD4/3 r (1+γ)4/3 Remark B.1. Note that one can improve the dependency in 2 r in the second term of U(ω; M) by considering r(s, a)2 instead, as done in [53] (we did not for the sake of readability). B.2.3 Technical lemmas In this subsection, we present and prove some lemmas useful for the proofs of previous results. Lemma B.2 (Lemma 4 [16]). Consider two transition functions P, P and let P(s, a) = P( |s, a) P( |s, a). Then, r R, we have for the optimal value function V r RS we have | P(s, a) V r |2 8Varr(s, a)KL(P( |s, a), P ( |s, a))+4 2MDr(s, a)2KL(P( |s, a), P ( |s, a))3/2. min | P(s, a) V r |2 16Varr(s, a) , | P(s, a) V r |4/3 27/3MDr(s, a)4/3 KL(P( |s, a), P ( |s, a)). Proof. The first inequality follows from Lemma 4 in [16], and by noting that the variance/span terms are both convex functions in the reward for a fixed state-action pair. The second inequality comes from a simple application of the inequality a + b 2 max(a, b) to the first inequality. Lemma B.3. Consider two MDPs M, M with transitions P, P and a reward vector r R. Let V π r = V π M r V Mr, where π Π (Mr) and V Mr is the optimal value in (M, r). Then max(K1, K2) Kr, (14) where Kr := maxs,a Or(s) KLM|M (s, a), with Or(s) = {a : Q Mr(s, a) = V Mr(s)} and V π r 2 (1 γ)2 16γ2Varr , V π r 4/3 (1 γ)4/3 6γ4/3MD4/3 r V π r 2 (1 γ)3 139 , V π r (1 γ)5/2 17γ , V π r 4/3 (1 γ)4/3 Proof. First, note that V Mr = V π Mr, and hence, V π r = V π M r V π Mr. Since the reward is the same, we have that V π r = γ(P )π V π r + γ P πV π Mr, where P π = (P )π P π. 
Hence, V π r γ 1 γ P πV π Mr . Following the same logic as in lem. B.2, and upper-bounding over all states and optimal actions we obtain V π r 2 4γ2 (1 γ)2 (2Varr Kr + 2MD2 r K3/2 r ), Kr := max s,a Or(s) KLM|M (s, a), Varr := max s,a Or(s) Varr(s, a), MDr := max s,a Or(s) MD(s, a). V π r 2 (1 γ)2 16γ2Varr , V π r 4/3 (1 γ)4/3 27/3γ4/3MD4/3 r and note that 27/3 < 6. Secondly, we can also write V π r = [(I γ(P )π) 1 (I γP π) 1]r. Using Lem. 5 in [16], we derive 32 log(2)2Kr (1 γ)3 + 8 log(2)γKr (1 γ)5/2 + 25/4K3/4 r MDr (1 γ) . Using the fact that a + b + c 3 max(a, b, c), we find the following lower bound on Kr V π r 2 (1 γ)3 288 log(2)2 , V π r (1 γ)5/2 24γ log(2) , V π r 4/3 (1 γ)4/3 34/3 25/3MD4/3 r and note that 288 log(2)2 < 139 (natural logarithm), 24 log(2) < 17 (natural logarithm) and 34/3 25/3MD4/3 r < 14. Combining the two inequalities K1 Kr and K2 Kr, we conclude that max(K1, K2) Kr. B.2.4 A Closed Form Solution for the Generative Case As a corollary, we find an upper bound on U(ω; M) that admits a closed form solution in the generative case (ω Ω(S A)). Although this solution is not optimal (i.e., it does not attain the optima of U(ω; M) under the constraint ω Ω(M)), we can consider it as a computational-efficient approximation of the optimal solution. Corollary B.1. Consider ξ (S A R). We have that an approximate solution to arg infω (S A) U(ω; M) is given by ( Hr(s, a) a = π r(s), q s S,r B,a =π r (s) Hr(s, a)/(S |R|) otherwise, (17) with ω(s, a) = P r R ξ(s, a, r), Hr(s, a) = 2γ2MDr(s, a)2/ r(s, a)2, H = maxr R Hr/ 2 r. Proof. We first derive an upper bound of U(ω; M) and then show the it can be solved in closed form. Upper bound on U(ω; M). Introduce the variable ξ(s, a, r) such that ω(s, a) = P r R ξ(s, a, r), for all (s, a) S A. By rewriting U(ω; M) as a function of ξ, we have U(ξ; M) = max r R max s,a =π r (s) 2γ2MDr(s, a)2 r ξ(s, a, r ) + Hr 2r P r mins ξ(s , π r(s ), r ), max r R max s,a =π r (s) 2γ2MDr(s, a)2 r(s, a)2ξ(s, a, r) + Hr 2r mins ξ(s , π r(s ), r), max r R max s,a =π r (s) 2γ2MDr(s, a)2 r(s, a)2ξ(s, a, r) + H mins ξ(s , π r(s ), r), = max r R max s,a =π r (s) Hr(s, a) ξ(s, a, r) + H mins ,r ξ(s , π r (s ), r ), where we used that 1/(x + y) 1/x for x, y > 0 in the first inequality and we bounded the second term using H = maxr R Hr/ 2 r in the second inequality. Finally, to lighten the notation, we let Hr(s, a) = 2γ2MDr(s, a)2/ r(s, a)2. Closed-form solution. We now show that the problem at the last line of the previous equation can be solved in closed form. For the sake of notation, in the proof we denote by Or(s) = {a : Q Mr(s, a) = V Mr(s)} the set of optimal actions in state s for the MDP Mr. To distinguish ξ when the action is optimal, or not, we write ξsra if a Or(s) and ξ(s, a, r) otherwise. After introducing the auxiliary variables t, p 0 the optimization problem can be equivalently rewritten as min ξ,t,p t + p s.t. t Hr(s, a) ξ(s, a, r) s, r, a / Or(s), ξsra s, r, a Or(s), X s,a,r ξ(s, a, r) = 1, ξ(s, a, r) > 0 s, a, r. The Lagrangian associated to (18) is then defined as L = t + p + X s,r,a Or(s) θsra s,r,a/ Or(s) µsra ξ(s, a, r) p s,a,r ξ(s, a, r) 1 s,a,r νsraξ(s, a, r), where θ, µ, ν are the Lagrange multipliers. Through the KKT conditions one derives νsra = 0 for every (s, a, r), and s,r,a/ Or(s) µsra = 1, s,r,a Or(s) θsra = 1, L ξ(s, a, r) = 0 λ µsra Hr(s, a) ξ(s, a, r)2 = 0, s, r, a / Or(s), L ξsra = 0 λ θsra H ξ2sra = 0, s, r, a Or(s). 
We further find that P s,a,r ξ(s, a, r) = 1, t = Hr(s, a)/ξ(s, a, r) for every s, r, a / Or(s) and p = H ξsra for every s, r, a Or(s). Hence it follows that all the ξsra have the same value, that we denote by ξ0. From ξ(s, a, r) = Hr(s, a)/t, to find t, observe that Hr(s, a) = tξ(s, a, r) P s,r,a/ Or(s) Hr(s, a) = t(1 ξsum), where ξsum = P s,r,a Or(s) ξ0 = ξ0|G|, with G = {(s, r, a) : s S, r R, a Or(s)}. Thus ξ(s, a, r)λ µsrat = 0 t = (1 ξsum)λ. Use the fact that P s,r,a/ Or(s) Hr(s, a) = t(1 ξsum) to find s,r,a/ Or(s) Hr(s, a, r) = t2. From L ξsra = 0 find λ = θsra H /ξ2 sra and ξsraλ = θsrap p = ξ0|G|λ. Thus λ = p/ξ0 = H /(ξ2 0|G|) and s,r,a/ Or(s) Hr(s, a) = t2. s,r,a/ Or(s) Hr(s, a) Finally, to determine t, note that s,r,a ξ(s, a, r) = 1 X s,r,a/ Or(s) t + |G|ξ0 = 1 s,r,a/ Or(s) Hr(s, a) + s s,a =π (s) Hε(s, a). Therefore ξ(s, a, r) Hr(s, a) for all s S, r R, a / Or(s) and ξ(s, a, r) q s,r,a/ Or(s) Hr(s, a)/|G| otherwise. The result in ω follows automatically from the definition ω(s, a) = P r R ξ(s, a, r). B.3 Extension to Random Rewards In this appendix, we consider an extension of MR-BPI in which the observed rewards are random. We begin by motivating the setting, and comparing it to the setting used in the main part of the paper. Then, we introduce the model, and its assumptions. Finally, we extend the lower bound to this setting (app. B.3.1) and derive the corresponding convex upper bound (app. B.3.2). Setting motivation. So far, we have discussed the setting where the rewards are (endogenous signals) specified by the user, rather than being provided by the environment (exogenous signals). This distinction is crucial because it influences how we model and interpret the reward signals. Endogenous rewards are typically deterministic, as they directly reflect the user s specific objectives (e.g., a fixed reward for achieving a particular goal, ensuring a consistent reinforcement signal). Conversely, rewards provided by the environment (exogenous) are typically variable and uncertain. These rewards may be modeled as random quantities with distribution conditioned on observed state-action pair, as described in the next paragraph. Model. We shortly describe the MR-BPI model with random rewards, which uses a slightly different notation (w.r.t. sec. 2) to accommodate the random reward setting. In this section, we assume that the number of rewards is finite. Let Q be a set of distributions over real numbers conditioned on the state-action space. In the following, we use i {1, . . . , |Q|} as an index for the elements of Q, so that qi is the i-th reward distribution. The related problem specific-quantities and definitions are hence indexed by i, (instead of r as in sec. 2). We also denote by rqi(s, a) the expected reward for state s and action a (w.r.t. to the corresponding reward measure qi). Similarly to the deterministic-reward setting, we assume that for all rewards i {1, . . . , |Q|} the optimal policy π i is unique. Agent interaction. At each round t 1, the agent in state st selects an action at, and observes a set of rewards samples (ri,t(st, at))i |Q|, where ri,t(st, at) qi( |st, at) and the rewards conditioned on the state-action are independent. It then transition to a new state st+1 P( |st, at). and the interaction repeats. In this section, we denote by P the probability law (resp. expectation) of the process (ζt)t, where ζt = (st, at, r1,t, . . . , r|Q|,t). We also denote by Ft = σ({ζ0, . . . , ζt}) the σ-algebra generated by the random observations made up to time t. 
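To make the interaction protocol concrete, the following sketch simulates a few rounds of the random-reward model described above: at each step the agent observes one independent sample from every reward distribution $q_i(\cdot \mid s_t, a_t)$ and then transitions according to $P(\cdot \mid s_t, a_t)$. The Bernoulli reward distributions and the randomly generated kernel are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, n_rewards = 3, 2, 2

P = rng.dirichlet(np.ones(S), size=(S, A))           # transition kernel P(. | s, a)
reward_means = rng.uniform(size=(n_rewards, S, A))   # mean of each Bernoulli reward q_i(. | s, a)

def step(s, a):
    """One round: observe one sample from every reward distribution, then transition."""
    rewards = rng.binomial(1, reward_means[:, s, a])  # (r_{1,t}, ..., r_{|Q|,t}), independent draws
    s_next = rng.choice(S, p=P[s, a])                 # s_{t+1} ~ P(. | s_t, a_t)
    return s_next, rewards

s = 0
for t in range(5):
    a = int(rng.integers(A))
    s, r = step(s, a)
    print(f"t={t}: a={a}, rewards={r.tolist()}, next state={s}")
```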
For such algorithm, we define τ to be a stopping rule w.r.t. the filtration (Ft)t 1. This stopping rule τ simply states when to stop the exploration phase. At the stopping time τ the algorithm outputs an estimate ˆπτ,r of a best policy for any reward r R using the estimated MDP Mτ. We focus on δ-PC algorithms, according to the following definition, and we impose a few mild assumptions. Similarly to the deterministic reward setting, the goal is to devise a δ-PC algorithm with minimal sample complexity, according to the following definition Definition B.1 (δ-PC Algorithm). We say that an algorithm is δ-PC if, for any MDP M M, P[τ < , ( i {1, . . . , |Q|}, ˆπτ,i = π i )] 1 δ. B.3.1 Lower bound with random rewards Theorem B.1. Let δ (0, 1/2). For any δ-PC algorithm and under assumption (3), we have lim infδ 0 E[τ] log(1/δ) T Q(M), where T Q(M) 1 = sup ω Ω(M) inf M Alt(M) s,a ω(s, a)KLQ M|M (s, a), (19) with KLQ M,M = KLP |P (s, a) + P|Q| i=1 KLqi|q i(s, a) where KLP |P (s, a) = KL(P( |s, a), P ( |s, a)) and KLqi|q i(s, a) = KL(qi( |s, a), q i( |s, a)). Proof. The proof proceeds similarly to the one of the lower bound with deterministic rewards app. B.1. The main difference is that the log-likelihood ratio Lt also includes the reward observations sampled from distributions in Q. Log-likelihood ratio. Let Q Q, and consider the MDP Mq = (S, A, P, q, γ) and an alternative model M q = (S, A, P , q , γ) such that P P and q q . The log-likelihood ratio between M and M of a sequence of observations (z0, z1, . . . ), where zi = (si, ai, ri,1, ..., ri,s), is defined as log P(sn|zn 1) P (sn|zn 1) + X i |Q| log qi(ri,n|zn 1) q i(ri,n|zn 1) To simplify the analysis, let n=1 1{sn=s,an=a)} log P(sn|s, a) P (sn|s, a) + i=1 log qi(ri,n|s, a) q i(ri,n|s, a) log P(yn|s, a) P (yn|s, a) + i=1 log qi(xi,n|s, a) q i(xi,n|s, a) where yn and xi,n denotes the next state and ith reward component collected after the pair (s, a) has been visited for the n-th time (s, a) and Nt(s, a) is be the number of times the state-action pair (s, a) has been visited by the algorithm up to time t. Then we can write Lt = P s,a Ls,a t . Hence, at the stopping time τ we have s,a EM [Ls,a τ ] , log P(yn|s, a) P (yn|s, a) + i=1 log qi(xi,n|s, a) q i(xi,n|s, a) t=0 1{Nτ (s,a) t} log P(yt|s, a) P (yt|s, a) + i=1 log qi(xi,n|s, a) q i(xi,n|s, a) Since W s,a n := log P (yn|s,a) P (yn|s,a) +P|Q| i=1 log qi(xi,n|s,a) q i(xi,n|s,a) is independent of Fn 1, and since {Nτ(s, a) n 1}c = {Nτ(s, a) n} Fn 1 we have that t=0 PM[Nτ(s, a) t]KLQ M|M (s, a), s,a EM[Nτ(s, a)]KLQ M|M (s, a), where KLQ M,M = KL(P( |s, a)|P ( |s, a)) + P|Q| i=1 KL(qi( |s, a), q i( |s, a)). δ-PC argument. Consider a δ-PC algorithm Alg, and define the event E = i {1,...|Q|}Ei, where Ei = {ˆπi,τ = π i }. Note that, since we are considering a δ-PC algorithm, we have Let then M Alt(M). Let Alt(M) = {M : i 1, ..., |Q|, Π (Mqi) Π (M qi) = }, and note that we can write Alt(M) = i {1,...,|Q|}Alti(Mqi), where Alti(Mqi) = {M : Π (Mqi) Π (M qi) = }. Furthermore, we have that PM (E) 1 δ. In view of the transportation lemma in [82], we can lower bound the previous quantity at the stopping time τ as X s,a EM[Nτ(s, a)]KLQ M|M (s, a) kl(PM[E], PM [E]) kl(δ, 1 δ). As the inequality above holds for all M Alt(M), by optimizing over the set of confusing model, we have inf M Alt(M) s,a EM[Nτ(s, a)]KLQ M|M (s, a) kl(δ, 1 δ). Sample complexity. 
By taking the minimum over the set of rewards, letting ω(s, a) = EM[Nτ(s, a)]/EM[τ], and optimizing over all possible allocations ω Ω, we have EM[τ] sup ω (S A) inf M Alt(M) s,a ω(s, a)KLQ M|M (s, a) kl(δ, 1 δ). Finally, taking the limit as δ 0 yields the result. B.3.2 Relaxed characteristic rate with random rewards Theorem B.2. For any allocation ω (S A) we have that TQ(ω; M) UQ(ω; M), where UQ(ω; M) := max i,s,a =π i (s) 2(γ2MDi(s, a)2 + 1) i(s, a)2ω(s, a) + Hi + 2/(1 γ)2 2 i mins ω(s , π i (s ), (20) and Hi = min 139 2 i (1 γ)3 , max 16Vari 2 i (1 γ)2 , 6MD4/3 i 4/3 i (1 γ)4/3 Proof. The proof follows similarly to the one of thm. 3.2 but it uses a different decomposition of the analyic form ([16, Thm. 2]). We sketch its main steps below. Decomposition. We first state a decomposition result on the set of confusing parameters similar to the one presented in lem. B.1 for deterministic rewards. Lemma B.4 (Decomposition lemma random rewards.). We have that the set of confusing models, for all i = 1, . . . , |Q|, satisfies Alti(Mqi) s,a =π i Altπ i sai(Mqi). where Altπ sai(Mqi) := {M Alti(Mqi) : Qπ M i(s, a) > V π M i(s)} is the set of confusing models for (s, a, i, π). Remark B.2. Note that T Q can be equivalently rewritten using thm. B.2 as (T Q) 1 = sup ω Ω(M) min i,s,a =π i (s) inf M Altsai(Mqi) s ,a ω(s , a ) KLP |P (s , a ) + X j KLqj|q j(s , a ) Lower bound analytic form. We now rewrite the lower bound optimization problem analytically in a similar form to (5) in [16]. For all s, a, i let dri(s, a) = rqi(s, a) rq i(s, a) and d P(s, a) = PM (s, a) PM(s, a). Further define the vector difference between the value in Mi and M i: d V π i = h V π i M i (s1) V Mi(s1) . . . V π i M i (s S) V Mi(s S) i . Finally, we let 1(s) = es be the unit vector with 1 in position s. Fix i, s, a = π i (s). Then, we find that M Altπ i s,a,i(Mi) if and only if dri(s, a) + i(s, a) < γd P(s, a) V Mi + (γPM i(s, a) 1(s)) I γP π i 1 (r π i rπ i ) + (γPM i(s, a) 1(s)) (I γP π i ) 1 (I γPπ i ) 1 1 rπ i ). Hence, the condition Qπ M qi(s, a) > V π M qi(s) is equivalent to the above inequality. This inequality only involves the {(s, a, i), s (s , π i (s ))}, and therefore the infimum over Altsai(Mi) of P s,a ω(s, a)KLQ M|M (s, a) is a model M satisfying KLQ M|M (s, a) = 0 over all the other stateaction pairs. Hence, using lem. B.1, and the reasoning above, we have TQ(ω; M) 1 = min i,s,a =π i (s) inf M Altπ sai(M) ω(s, a)KLi M|M (s, a) s ω(s , π i (s )) KLP |P (s , π i (s ))) + KLqi|q i(s , π i (s ))) Sub-optimality gaps. We now relate the KL-terms to the sub-optimality gaps, similarly as in [53, 16]. By following a similar approach to [16, Thm. 2] and the proof of the relaxed characteristic rate in app. B.2 one can show that, for any α Σ4, we have KLqi|q i(s, a) Bi,1(α1) := α2 1 2 i (s,a) KLP,P (s, a) Bi,2(α2) := α2 2 i(s, a)2 2γ2MDi(s, a)2 maxs KLqi|q i(s , π i (s )) Bi,3(α3) := α2 3 ( i(1 γ))2 maxs KLP |P (s , π i (s ) Bi,4(α4) := min [α4 i(1 γ)]2 16Vari , [α4 i(1 γ)]4/3 27/3MD4/3 i maxs KLP |P (s , π i (s )) Bi,5(α4) := min α2 4 2 i (1 γ)3 288 log(2)2 , α4 2 i (1 γ)5/2 24 log(2) , α4/3 4 4/3 i (1 γ)4/3 25/334/3MDi Optimization. Putting all the inequalities together, we get TQ(ω; M) 1 min i,s,a =π i (s) inf α Σ4 ω(s, a)(Bi,1(α1) + Bi,2(α2)) + min s ω(s, π i (s))(Bi,3 + max(Bi,4(α4), Bi,5(α4))). Then, proceeding as in thm. 3.2, and optimizing for α R4 0 such that P i αi = 1 yields the result. B.4 Extension to a Continuous Set of Rewards In this section we consider the extension to a continuous set of rewards 2. 
We briefly motivate the setting, and then explain the challenges.

Setting motivation. So far, we have discussed the case where the set of rewards R is finite. However, it is also useful to compare with classical reward-free techniques, which do not rely on a pre-specified reward signal. Therefore, we extend our discussion to include a continuous set of rewards. (Footnote 2: in this context, by a continuous set we are not necessarily referring to a connected set in $\mathbb{R}^n$, but rather to a set with an uncountable number of elements.) This extension presents several challenges. First, it may not be feasible to assume that the optimal policy is unique for each reward. Second, the minimum gap can be non-convex and discontinuous, as we discuss in the following subsection. However, we find that, in practice, it is sufficient to consider a finite set of rewards. This observation provides the basis for a practical algorithm. First, we address the challenge of optimizing the inverse of the sub-optimality gap over a continuous set of rewards; later, we propose a practical solution by considering a finite set of rewards.

B.4.1 Non-convexity of the minimum sub-optimality gap

The relaxed characteristic rate is mainly characterized by the minimum sub-optimality gap $\Delta_r = \min_{s,\, a \notin \mathcal{O}_r(s)} \Delta_r(s,a)$, where $\mathcal{O}_r(s) = \{a : Q^\star_{M_r}(s, a) = V^\star_{M_r}(s)\}$ is the set of optimal actions in $s$ for the MDP $M_r$. Unfortunately, even with a convex set of rewards, the minimum sub-optimality gap may be non-convex and possibly discontinuous. The discontinuity arises from the fact that in certain states all actions become optimal for specific rewards. We illustrate this intuition with an example.

Consider the MDP depicted in fig. 2, with states $S = \{s_0, s_1\}$ and actions $A = \{a_0, a_1\}$. For this MDP we have $P(s_0 \mid \{s_0, s_1\}, a_0) = 1$, $P(s_1 \mid s_0, a_1) = 0.3$, $P(s_0 \mid s_0, a_1) = 0.7$, $P(s_1 \mid s_1, a_1) = 0.3$ and $P(s_0 \mid s_1, a_1) = 0.7$. We use the following reward function, parametrized by $\theta \in [0, 1]$ and written in vector form over the pairs $((s_0,a_0), (s_0,a_1), (s_1,a_0), (s_1,a_1))$: $r_\theta = \theta e_2 + (1-\theta) e_4$, i.e., $r_\theta(s_0, a_1) = \theta$, $r_\theta(s_1, a_1) = 1-\theta$, and zero elsewhere. Next, we derive the sub-optimality gaps $\Delta_\theta(s, a) = V^\star_\theta(s) - Q^\star_\theta(s, a)$ as a function of $\theta$.

Optimal value function $V^\star_\theta$ and policy $\pi^\star_\theta$. For a discount factor $\gamma = 0.5$ we obtain
$$V^\star_\theta(s_0) = \max\Big\{ \underbrace{\tfrac{1}{2} V^\star_\theta(s_0)}_{\text{action } a_0},\ \underbrace{\theta + \tfrac{7}{20} V^\star_\theta(s_0) + \tfrac{3}{20} V^\star_\theta(s_1)}_{\text{action } a_1} \Big\}, \qquad V^\star_\theta(s_1) = \max\Big\{ \underbrace{\tfrac{1}{2} V^\star_\theta(s_0)}_{\text{action } a_0},\ \underbrace{(1-\theta) + \tfrac{7}{20} V^\star_\theta(s_0) + \tfrac{3}{20} V^\star_\theta(s_1)}_{\text{action } a_1} \Big\}.$$
One can immediately see that $a_1$ is optimal in $s_0$ for all $\theta$, while in $s_1$ we need to distinguish between two cases. If $a_0$ is optimal in $s_1$, then we find $V^\star_\theta(s_0) = \tfrac{40}{23}\theta$ and $V^\star_\theta(s_1) = \tfrac{20}{23}\theta$. Alternatively, if $a_1$ is optimal in $s_1$, then $V^\star_\theta(s_0) = \tfrac{7}{5}\theta + \tfrac{3}{10}$ and $V^\star_\theta(s_1) = \tfrac{13}{10} - \tfrac{3}{5}\theta$. It is then possible to see that for $\theta < 23/26$ the optimal policy $\pi^\star_\theta = [a_1\ a_1]$ chooses action $a_1$ in both states, while for $\theta > 23/26$ the optimal policy changes to $\pi^\star_\theta = [a_1\ a_0]$, so that action $a_0$ is now optimal in state $s_1$. For $\theta = 23/26$ both actions $\{a_0, a_1\}$ are optimal in state $s_1$.

Optimal action-value function $Q^\star_\theta$. We distinguish two cases. For $\theta \ge 23/26$ we have
$$Q^\star_\theta(s_0, a) = \begin{cases} \tfrac{20}{23}\theta & a = a_0 \\ \tfrac{40}{23}\theta & a = a_1 \end{cases}, \qquad Q^\star_\theta(s_1, a) = \begin{cases} \tfrac{20}{23}\theta & a = a_0 \\ 1 - \tfrac{6}{23}\theta & a = a_1 \end{cases}.$$

[Figure 2: On the left, an MDP with two states and two actions; each tuple (a, p, r) consists of (1) the action a that triggers the transition, (2) the transition probability p, and (3) the reward r. On the right, the minimum sub-optimality gap $\Delta_\theta$ as a function of $\theta \in [0, 1]$ (note the discontinuity at $\theta = 23/26$).]
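As a complementary numerical check of this example (the case $\theta < 23/26$ continues below), the following sketch computes the minimum sub-optimality gap of the two-state MDP by value iteration; the grid of $\theta$ values and the tolerance used to detect (near-)optimal actions are illustrative choices, not part of the paper.

```python
import numpy as np

gamma = 0.5
P = np.zeros((2, 2, 2))          # P[s, a, s'] for the MDP of fig. 2
P[:, 0, 0] = 1.0                 # a0 moves to s0 with probability 1 (reward 0)
P[0, 1, :] = [0.7, 0.3]          # a1 from s0: stay w.p. 0.7, move to s1 w.p. 0.3
P[1, 1, :] = [0.7, 0.3]          # a1 from s1: move to s0 w.p. 0.7, stay w.p. 0.3

def min_gap(theta, tol=1e-8):
    """Minimum gap over pairs (s, a) whose action is sub-optimal in state s."""
    r = np.zeros((2, 2))
    r[0, 1], r[1, 1] = theta, 1.0 - theta    # rewards theta on (s0, a1) and 1 - theta on (s1, a1)
    V = np.zeros(2)
    for _ in range(200):                     # value iteration (gamma = 0.5 converges quickly)
        V = (r + gamma * (P @ V)).max(axis=1)
    Q = r + gamma * (P @ V)
    gaps = Q.max(axis=1)[:, None] - Q        # Delta(s, a) = V*(s) - Q*(s, a)
    suboptimal = gaps > tol                  # actions within tol of optimal count as optimal
    return gaps[suboptimal].min()

for theta in (0.85, 23 / 26, 0.92):
    print(f"theta = {theta:.4f}: minimum gap = {min_gap(theta):.4f}")
# The gap vanishes as theta approaches 23/26 from either side, yet equals 20/26 at theta = 23/26.
```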
While for θ < 23/26 we have Q θ(s0, a) = 7 20 a = a0 7 5θ + 3 10 a = a1 , Q θ(s1, a) = 7 20 a = a0 3 10 a = a1 . Minimum sub-optimality gap. Hence, we conclude that the minimum sub-optimality gap is θ = min s min a/ Oθ(s) θ(s, a) = 20 1 2 θ < 23 26, 20 26 θ = 23 26, 26 23θ 1 23 26 < θ 1. Note that the minimum gap is non-convex on θ and it has a discontinuity at θ = 23/26, as shown in fig. 2. B.4.2 A finite set of rewards can be sufficient In this subsection, we derive a convex upper bounds on T (M) when R is a continuous set of rewards. The idea is to find a finite set of rewards R so that we can still identify the optimal policies for all rewards in R. The proof follows similar step to the one in the discrete reward setting in app. B.2 but relies on the following assumption. Convex hull (CH) assumption. We assume that the convex hull3 of R contains R. This last assumption is easy to satisfy, since one can just consider the canonical basis Rcanonical in R . Using this property, we derive a novel decomposition that leads to an alternative upper bound on T(ω; M). Theorem B.3. Consider a continuous set of rewards R satisfying the non-degeneracy assumption, and a finite set of rewards R . Assume that R satisfies assumption (CH) with respect to R. Then, for all ω Ω(S A) we have that T(ω; M) min r R min s,a 8γ2MDr(s, a)2 2r(1 γ)2ω(s, a). Proof. Using the decomposition in lem. B.5 in conjunction with the results in prop. B.1 and lem. B.6 leads to the desired result. Using also that for any ω RSA >0 (1 γ)2 max s,a KLM|M (s, a)ω(s, a)MDe(s, a)2 (1 γ)2 max s,a KLM|M (s, a)ω(s, a) max s ,a MDe(s , a )2 ω(s , a ) . 3Alternatively one may consider the conic hull of R , yielding a similar result T(ω; M) 1 min r R inf M Altr(M) s,a ω(s, a)KLM|M (s, a) min r R min e R min s inf M Alt π r s ,π r (s ),r,e(M) s,a ω(s, a)KLM|M (s, a) min r R min e R min s,a ω(s, a) 2 e(1 γ)2 8γ2MDe(s, a)2 , = min e R min s,a ω(s, a) 2 e(1 γ)2 8γ2MDe(s, a)2 . While the previous bound in the worst case scenario scales as (1 γ) 4, one can use techniques as in the proof of thm. 3.2 to find a scaling of (1 γ) 3 (we refer the reader to the proof of prop. B.1 and note that we could use lem. B.3 to find an alternative scaling). B.4.3 Technical lemmas Confusing set of models decomposition: continuous set of rewards. From the example in fig. 2 we know that an MDP may have a non-convex sub-optimality gap over a continuous set of rewards. Instead of considering the entire set of rewards, an alternative could be to just deal with a finite set of rewards R R. We obtain a decomposition that depends on R as long as every reward in r R can also be written as a convex combination (or conic combination) of the rewards R (which can be guaranteed simply by including Rcanonical to R ). Lemma B.5 (Decomposition lemma for a continuous set of rewards.). Consider a finite set of rewards R such that R Hull(R ) and each r R admits at-least a confusing model with respect to Mr (as in assumption (III)). Then, we have that the set of confusing models for any r R satisfies Altr(M) e R s,a =π r (s) Altπ r sare(M), (21) where Altπ sare(M) := {M : Qπ M e(s, a) > V π M e(s)}. Proof. Consider a reward r R. From the fact that R is contained in a convex hull of R we can always find θ [0, 1]|R | such that r = P e R θee, where θe is the non-negative coefficient for the corresponding reward e (satisfying P We start by showing that Altr(M) {M | e R , s S, a = π r(s) : Qπ M e(s, a) > V π M e(s)} with π = π r. 
By contradiction, assume there exists M Alt(Mr) so that e R , s S, a = π(s) the inequality Qπ M e(s, a) V π M e(s) holds. Then, since Qπ Me(s, π(s)) = V π Me(s) the inequality Qπ M e(s, a) V π M e(s) holds for every e, s, a. By linearity, the inequality holds also if we multiply by x 0 both sides. Choosing x = θe, and summing over e, we obtain that X e R θe Qπ M e(s, a) X e θe V π M e(s). Letting js,a {1, . . . , SA} be the index corresponding to the state-action pair (s, a) and (P )π be the transition function associated to policy π, i.e., (P )π(s , a |s, a) = π(a |s )P (s |s, a), we obtain that 4 e θi Qπ M e(s, a) = X e θee js,a(I γ(P )π) 1e = e js,a(I γ(P )π) 1r = Qπ Mr(s, a). 4We indicate by ei the i-th element of the canonical basis. Therefore, we conclude that Qπ M r(s, a) V π M r(s) is valid for all state-action pairs. Let now ˆπ be an optimal policy in M r. Then, for all states we have Qπ M r(s, ˆπ(s)) V π M r(s). Hence, as in [16, Lemma 2], by repeated applications of the Bellman operator Γˆπ with respect to the MDP M r, one can conclude that V M r(s) V π M r(s) in all states. However, this implies that π is optimal in M r, which contradicts that M Alt(Mr). Remark B.3. While the previous lemma is stated for a convex hull of the vectors in R , it can be equivalently stated for a conic hull of R . This latter version can be used if, for example, one considers Rcanonical (note that the conic hull of the canonical basis contains all rewards in [0, 1]SA). Upper bound on the minimum sub-optimality gap. In the following proposition, we derive a bound on the minimum sub-optimality gap. Proposition B.1. Consider two rewards r, e and two MDPs M, M such that Π (Mr) Π (M e) = . Then e 2γ 1 γ P(s, a) V Me , (22) where P(s, a) = PM(s, a) PM (s, a), and PM(s, a) (sim. PM (s, a)) is a vector of size S representing the distribution of the next state given (s, a). In particular we also have (1 γ)2 max s,a KLM|M (s, a)MDe(s, a)2. Proof. The fact that Π (Mr) Π (M e) = implies that in all states we have 0 V M e(s) V π M e(s), for all π Π (Mr). Let s0 be a state such that in Me there is a sub-optimal action. Henceforth, we have that e V Me(s0) Q Me(s0, a) for some a. Hence e V Me(s0) Q Me(s0, a) + V M e(s0) V π M e(s0), V Me V π M e + Q M e Q Me . Now we bound the two terms separately. 1. For the first term, let π Π (Me), and since Q Me(s, π (s)) Q Me(s, π(s)) write [V π M e V Me](s) = Qπ M e(s, π(s)) Q Me(s, π (s)), Qπ M e(s, π(s)) Q Me(s, π(s)), Qπ M e Q Me . Therefore V Me V π M e Qπ M e Q Me = Q Me Qπ M e . Hence, since [Q Me Qπ M e](s, a) = γ h P(s, a) V Me + PM (s, a) (V Me V π M e) i , we obtain Q Me Qπ M e γ 1 γ P(s, a) V Me , leading to V Me V π M e γ 1 γ P(s, a) V Me . 2. For the second term we have [Q Me Q M e](s, a) = γ h P(s, a) V Me + PM (s, a) (V Me V M e) i , hence, as for the first term, we have Q M e Q Me γ 1 γ P(s, a) V Me . Therefore we have that e 2γ 1 γ P(s, a) V Me . Finally, an application of Pinsker inequality leads to (1 γ)2 max s,a | P(s, a) V π Me|2 8 γ2 (1 γ)2 max s,a KLM|M (s, a)MDe(s, a)2. Lemma B.6. Consider two MDPs M, M and two rewards r, e. If for all policies π Π (Mr) there exists a state-action pair (s, a) such that Qπ M e(s, a) > V π M e(s), then Π (Mr) Π (M e) = . Proof. The proof simply follows by the fact that for all policies π Π (Mr) there exists a state s where the policy is improvable. Therefore, none of the policies in Π (Mr) is optimal in M e. C MR-Na S Algorithm In this section we describe the MR-Na S algorithm in detail. 
We prove its δ-PC property and its guarantees on the sample complexity. Notation. In addition to the notation introduction in sec. 2, in tab. 3 we provide a summary of the notation used in this section. Table 3: Table of Notation Symbol Description Z State-action space, Z = S A. Pπ Transition induced by π, that is Pπ((s , a )|(s, a)) = P(s |s, a)π(a |s ). Pu Transition induced by uniform policy. Pt Transition induced by the policy πt of MR-Na S. Mt Estimated MDP at time step t. ωt Stationary distribution induced by the policy πt. ω t Stationary distribution induced by the policy π t . ω t Averaged stationary distribution ω t := (1/t) Pt n=1 ω n. ωu Stationary distribution induced by the uniform policy. ω Optimal allocation ω := arg infω Ω(M) U(ω; M) w.r.t. U(ω; M). π (a|s) Oracle exploration policy, defined as π (a|s) := ω (a|s)/ P b ω (b|s). ˆπt(a|s) Exploration policy at time step t, defined as ˆπ t (a|s) := ω t (a|s)/ P b ω t (b|s). µt Defined as µt := 1/Nt(s)α+β. m0 Maximum number of steps to move between any pair of states (s, s ). m Maximum number of steps to move between any pair (s, a), (s , a ). Equal to m = m0 + 1. ηk Minimal probability of reaching z from z in n k transitions using a uniform random policy, defined as ηk := min {P n u (z |z)|z, z Z, n {1, . . . , k}, P n u (z |z) > 0}. η Defined as η := η1ηm. Bρ(P) or (Bρ(M)) Ball of radius ρ around a transition function P, defined as {P : maxs,a P( |s, a) P ( |s, a) 1 ρ}. Wt Rank-one matrix whose rows are equal to ωt. r0 Ergodicity constant such that P r(z |z) > 0 for all r r0. σu Constant, defined as σu := min(z ,z) Z2 P r0(z |z) u /ωu(z ). σ(ϵ, π) Defined as σ(ϵ, π) := (1 ϵ)A mins,a π(α|s)τ + ϵ α t σu. θ(ϵ, π) θ(ϵ, π) := 1 σ(ϵ, π) C(ϵ, π) C(ϵ, π) := 2/θ(ϵ, π) ρ(ϵ, π) ρ(ϵ, π) := θ(ϵ, π)1/r0 L(ϵ, π) L(ϵ, π) := (1 ρ(ϵ, π)) 1C(ϵ, π) σt, θt, Ctρt, Lt Respectively defined as σt := σ(ϵt, πt), θt := θ(ϵt, πt), etc. E Convergence of the MDP estimate and state-action visitation frequency event E := ( (s, a) Z, limt Nt(s, a)/t = ω (s, a), limt Mt = M). Ecnt(λ) Count event, defined as Ecnt(λ) := n t 1, (s, a), Nt(s, a) 1 m k=1 ηm (km)α+β log |S||A| E1 p,T (ξ) Concentration of the empirical MDP event E1 p,T (ξ) = T t=hp(T )(Mt Bρ(ξ)(M)). E2 T (ξ) Concentration of the state-action visitation frequency event E2 T (ξ) := T t=T 1 p | Nt t ω | Kξξ . Remark C.1 (Extension to the setting of continuous/stochastic rewards.). In this setting we consider the case with finite rewards. For the other two settings described in app. B (continuous and stochastic rewards), MR-Na S can be directly adapted by using the relaxed characteristic rate specified in app. B.3.2 for stochastic rewards and in app. B.4.2 for continuous rewards. The primary modification involves the computation of ω , where the relaxed characteristic rate appropriate for the respective setting should be applied. Furthermor, for stochastic rewards, the algorithm also needs to account for an additional exploration threshold in the stopping rule similarly to [17, Prop. 5]. Remark C.2 (Novelties with respect to previous works.). Our work is mostly inspired by [17]. With respect to previous works, we introduce a novel forcing policy that does not use an exploration factor ϵt that depends MDP-specific quantities, such as the "connectedness" of the MDP, that we denote by m. In [17] they use ϵt = t 1 2(m+1) , while we use ϵt = 1/ max(1, Nt(st))α. This change requires a slight modification of the proofs in [17], since now ϵt depends also on the number of visits of st. 
Additionally, we improve on the high probability forced exploration [17, Lemma 11], where the authors have a scaling of Nt(s, a) tη2 m m2 log2(1+SA/λ) 1/4 while we obtain a scaling Nt(s, a) ηm mα+β H t m ,α+β log(SA/λ), where Hx,y is the x-th generalized harmonic number of order y and ηm is the minimal probability of reaching a state-action pair in m steps under a uniform random policy. C.1 Forcing policy We begin by providing some results on the forcing policy, that serve as a basis to obtain sample complexity results. Consider the forcing policy πf,t( |s) = softmax ( βt(s)Nt(s, )), which is inspired by the classical Boltzmann distribution. Intuitively this forcing policy will choose actions that have been undersampled. We find the following lower bound on the probability of picking an action a at time t 1. Lemma C.1. Consider the forcing policy πf,t( |s) = softmax ( βt(s)Nt(s, )) with βt(s) = β log(Nt(s)) maxa |Nt(s,a) minb Nt(s,b)| and β [0, 1]. Then, for all t 1 we have πf,t(a|s) 1 ANt(s)β . (23) Proof. Note that proving (23) is equivalent to showing that ANt(s)βe βt(s)Nt(s,a) X b e βt(s)Nt(s,b). We actually show the following stronger result: ANt(s)βe βt(s)Nt(s,a) Ae βt(s) minb Nt(s,b) X b e βt(s)Nt(s,b). Therefore, taking the logarithm on both hand-sides, we obtain β log(Nt(s)) βt(s)(Nt(s, a) min b Nt(s)) = β log(Nt(s))(Nt(s, a) minb Nt(s) maxa |Nt(s, a) minb Nt(s, b)| . Thus, we obtain the condition 1 Nt(s, a) minb Nt(s) maxa |Nt(s, a) minb Nt(s, b)|, which is always true for all a A, proving the lemma. C.2 Almost Sure Sample Complexity Upper Bound of MR-Na S We now describe the forced exploration property of MR-Na S. We conclude by showing that Nt(s, a)/t ω (s, a) a.s., and the almost sure sample complexity bound of MR-Na S. C.2.1 Forced exploration property The following proposition is a modification to [17, Lem. 4] which combines a result from [83]. Proposition C.1 (Forced Exploration). The sampling rule of MR-Na S with α (0, 1], β [0, 1] and α + β 1 satisfies P ( (s, a) S A, limt Nt(s, a) = ) = 1. Proof. Recall that that ϵt = 1/Nt(st)α and πf,t( |s) = softmax ( βt(s)Nt(s, )) with βt(s) = β log(Nt(s)) maxa |Nt(s,a) minb Nt(s,b)|. Therefore, using lem. C.1 and the fact that α + β 1 we have that n=1 P(at = a|st = s, Nt(s) = n) n=1 εt 1 nβA, Thus if a state s is visited infinitely often, by the Borel-Cantelli lemma we have that the action a is also chosen infinitely often. Using [83, Lem. 4] we conclude that in a communicating MDP we have that P ( (s, a) Z, limt Nt(s, a) = ) = 1. C.2.2 Convergence of the state-action visitation frequency Next, using the previous results we establish the convergence of Nt(s, a)/t. For the sake of compactness, we denote by Z = S A the set of state-action pairs, and by Pπ((s , a )|(s, a)) = P(s |s, a)π(a |s ) the transition induced by π. Similarly, we denote by Pu the transition induced by a uniform policy, and by Pt the transition induced by the policy πt of MR-Na S. We also indicate by ω = arg infω Ω(M) U(ω; M) the optimal allocation. Proposition C.2 (Almost sure convergence of the allocation). Using MR-Na S guarantees that P ( (s, a) Z, limt Nt(s, a)/t = ω (s, a)) = 1, where ω = arg infω Ω(M) U(ω; M). Proof. To show the result it suffices to ensure that MR-Na S satisfies the 5 assumptions of [16, Proposition 12] (see also [84, Corollary 2.9]). Assumption 1 is a consequence of the fact that the MDP is ergodic under any policy that assigns positive probability to all actions. Hence, also Pt is ergodic. Assumption 2 is guaranteed by prop. 
C.1: since Nt(s, a) a.s. for all (s, a) Z, then the estimate of the MDP converges Mt M almost surely. By continuity, and the fact that ωt is a Cesàro mean, we also have ω t ω and ω t ω . Hence Pt Pπ almost surely, where π (a|s) = ω (s, a)/ P a ω (s, a ). Assumption 3 and 4 are satisfied as in the proof of [17, Thm. 7]. In fact, by employing lem. C.1 we have that Pt = (1 ϵt)Pπ t + ϵt Pπf,t (1 ϵt)Pπ t + µt Pu with µt = 1/Nt(s)α+β. Hence, one can follow the same reasoning of [17, Thm. 7], by also noting that µt 1/tα+β for every 1. Similarly, assumption 5 is satisfied by noting that ω t ω a.s., and for all (s, a) Z, ϵt, µt 0 almost surely. C.2.3 Sample complexity upper bound Next, using the forced exploration property, we prove that the stopping time is almost surely finite and an almost sure upper bound of the sample complexity of MR-Na S that holds asymptotically as δ 0. Proof strategy. The idea is to prove that Nt(s, a)/t ω (s, a) and that the estimate Mt M. If Nt(s, a) a.s. (prop. C.1 ), the latter follows automatically, while the former is given by prop. C.2 (given prop. C.1 ). At this point, one can conclude by continuity of the relaxed characteristic rate over its arguments in the stopping rule. Theorem C.1 (Almost sure sample complexity bound). MR-Na S with the stopping rule in props. 3.1 and C.2 guarantees that MR-Na S stops almost surely, and that P lim supδ 0 τ log(1/δ) U (M) = 1. Proof. Consider the event E = (s, a) Z, lim t Nt(s, a)/t = ω (s, a), lim t Mt = M . By prop. C.1 we have that E holds almost surely. Hence, one can prove straightforwardly prove that MR-Na S stops almost surely. Under E, by continuity, λ > 0, tλ N such that t tλ we have U(Nt/t; Mt) 1 (1 λ)U(ω ; M) 1 = (1 λ)U (M). Now, recall the definition of stopping threshold β(Nt, δ) = log(1/δ) + (S s,a log e h 1 + Nt(s,a) S 1 i . Note that there exists C > 0 such that β(Nt, δ) log(Ct/δ). Thus, we have that τ = inf{t tλ, t U(Nt/t; Mt) 1 β(Nt, δ)}, tλ {t tλ, t(1 λ)U (M) 1 β(Nt, δ)}, tλ {t tλ, t(1 λ)U (M) 1 log(Ct/δ)}. Applying [19, Lem. 8] we get 2 log CU (M) Hence lim supδ 0 τ ln(1/δ) U (M)/(1 λ). Letting λ 0 concludes the proof. C.3 Expected Sample Complexity Upper Bound of MR-Na S To derive a sample complexity upper bound in expectation of MR-Na S we define the following quantities: 1. Let m0 N be the (maximum) number of steps to move between any pair of states (s, s ). Then we can move between any pair (z, z ) in m0 + 1 steps at-most [17]. Hence, m0 can be interpreted as a sort of metric measuring the difficulty of navigating the MDP. We also define m := m0 + 1. 2. Let ηk := min {P n u (z |z)|z, z Z, n {1, . . . , k}, P n u (z |z) > 0} be the minimal probability of reaching z from z in n k transitions using a uniform random policy, with k N. 3. We define by Bρ(P) the ball of radius ρ around a transition function P, i.e., Bρ(P) := {P : maxs,a P( |s, a) P ( |s, a) 1 ρ}. Since in the setting of this work an MDP is identified by its transition function, for the sake of notation we also interchengeably write Bρ(M) to indicate the ball of radius ρ around M. 4. We denote by ωt the stationary distribution induced by the policy πt and by ωu the stationary distribution induced by the uniform policy. We also indicate by Wt a rank-one matrix whose rows are equal to ωt. 5. From the ergodicity of Pu there exists an integer r0 such that P r u(z |z) > 0 for all r r0 and all (z , z) Z2 (the existence of r0 is guaranteed by [85, Proposition 1.7]). 6. 
The oracle exploration policy π (a|s) := ω (a|s)/ P b ω (b|s) and the exploration policy at time step t: ˆπ t (a|s) := ω t (a|s)/ P b ω t (b|s). We begin by proving two independent results: (1) the geometric convergence of P n t to Wt in app. C.3.1, a result that is used to prove the concentration of the state-action visitation frequency; (2) the forced exploration property of MR-Na S in app. C.3.2, which is used to prove the concentration of the MDP estimate. This latter result is novel compared to previous works. We subsequently present the concentration of the MDP estimate in app. C.3.3 and of the state-action visitation frequency in app. C.3.4. We conclude with the expected sample complexity of MR-Na S in app. C.3.5. Finally, at the end of the section, we present some technical lemmas as well as an alternative result to the forced exploration property presented in app. C.3.2. C.3.1 Geometric ergodicity of the sampling rule In the following lemma we adapt a result from [17] to our forcing policy πf,t. We also define the following quantities σ(ϵ, π) := [(1 ϵ)A min s,a π(a|s)]r0 + ϵ θ(ϵ, π) := 1 σ(ϵ, π), C(ϵ, π) := 2/θ(ϵ, π), ρ(ϵ, π) := θ(ϵ, π)1/r0, L(ϵ, π) := (1 ρ(ϵ, π)) 1C(ϵ, π). where σu := minz,z P r0 u (z |z)/ωu(z ) and r0, α, β are as before. For the sake of brevity, we let σt = σ(ϵt, π t ), θt = θ(ϵt, π t ), and similarly for Ct, ρt, Lt. Lemma C.2 (Geometric ergodicity of the sampling rule). For MR-Na S it holds that n 1, P n t Wt Ctρn t . (24) Proof. For two state-action pairs z = (s, a), z = (s , a ) denote by P r t (z |z) the probability of reaching z from z in r steps. Then, for all (z, z ) we have P r t (1 ϵt)r0P r0 π t (z |z) + ϵr0 t P r0 πf ,t(z |z), (1 ϵt)r0P r0 π t (z |z) + (1/Nt(s)α+β)r0P r0 u (z |z), (1 ϵt)r0P r0 π t (z |z) + ϵ α t P r0 u (z |z). Let ct = mins,a π t (a|s) and note that Pπ t (z |z) Act Pu(z |z) for all (z, z ). Recall the definitions for the stationary distribution under the uniform policy by ωu, and σu = minz,z P r0 u (z |z)/ωu(z ). Then, we can rewrite the previous inequality as P r t [(1 ϵt)Act]r0 + ϵ P r0 u (z |z)ωu(z ) [(1 ϵt)Act]r0 + ϵ [(1 ϵt)Act]r0 + ϵ σuωu(z )ωt(z ). Then using [17, Lemma 21] we conclude that for all n 1 it holds that: P n t Wt 2θ Hence Pt also satisfies P n t Wt Ctρn t . C.3.2 Forced exploration with high probability In the following lemma, we provide a novel result for the forced exploration property of MR-Na S, which relies on lem. C.7. Lemma C.3. For all λ (0, 1), m := m0 + 1, define the event t 1, (s, a) S A, Nt(s, a) 1 ηm (km)α+β log SA Then, for MR-Na S with α + β 1 we have that P Proof. Fix a state-action pair z = (s, a) and let Yt = 1zt=z and Xk = 1Y(k 1)m+1+ +Ykm>0 for k = 1, . . . , t m . Define N t m (s, a) = P t m k=1 Xk, and note that Nt(s, a) N t m (s, a) for all t 1. Let Qk = P(Xk = 1|Fk 1), where Fk = σ(Y1, . . . , Ykm). Note that we can also write Qk as Qk = P( n {1, . . . , m} : Y(k 1)m+n = 1|Fk 1). Then Qk = E(πt)t P n {1, . . . , m} : Y(k 1)m+n = 1|Fk 1, (πt)t , max 1 n m P Y(k 1)m+n = 1|Fk 1, (πt)t , max 1 n m P z(k 1)m+n = z|z(k 1)m = z0, Fk 1, (πt)t , = E(πt)t,z0 max 1 n m E(zi)n 1 i=1 j=0 P(k 1)m+j(z(k 1)m+j+1 = zj+1|z(k 1)m+j = zj)|z1, . . . , zn 1, zn = z max 1 n m E(zi)n 1 i=1 1 (km)α+β Pu(z(k 1)m+j+1 = zj+1|z(k 1)m+j = zj)|z1, . . . , zn 1, zn = z = E(πt)t,z0 max 1 n m 1 (km)α+β P n u (z(k 1)m+n = z|z(k 1)m = z0)|z0 ηm (km)α+β . Hence, for λ 0 due to lem. 
C.7 we also have t : Nt(s, a) 1 ηm (km)α+β λ Consequently, choosing λ = log(SA/λ), we get that t : Nt(s, a) 1 k=1 Qk(s, a) log SA The following result, automatically follows from the previous lemma. Corollary C.1. For all t 1, under Ecnt(λ) we have ϵt min(1, κt(λ)) where κt(λ) := 2α m k=1 ηm (km)α+β 2 log(SA/λ) α . (25) C.3.3 Concentration of the empirical MDP In the following lemma we bound the probability that Mt is not close to M. In order to do so, recall that π (a|s) = ω (a|s)/ P b ω (b|s) and ˆπ t (a|s) = ω t (a|s)/ P b ω t (b|s). Then, observe that by continuity of Mt ω t (due to Berge s theorem) we that for every ξ > 0 there exists ρ > 0 such that for Mt Bρ(M) we have ˆπ t π ξ. Lemma C.4. For ξ > 0, T 1 define the event E1 p,T (ξ) = T t=hp(T )(Mt Bρ(ξ)(M)) where hp(T) = T p with p (0, 0.5). Then, there exists a constant C > 0 that only depend on ξ and M such that T 1, P E1 p,T (ξ) 1 T 2 + CT exp( 2ghp(T )(λ)ρ), (26) where ghp(T )(λ) = 1 m k=1 ηm (km)α+β log SA Proof. Consider the event Ecnt(λ) from lem. C.3 and let λ = 1/T 2. Note that E1 p,T (ξ) λ + P E1 p,T (ξ) Ecnt(λ) . For simplicity, we omit the dependence on λ and simply write Ecnt. Then, we have 5 P(E1 p,T (ξ) Ecnt) t=hp(T ) P(Mt / Bρ(M) Ecnt), t=hp(T ) P(max s,a Pt( |s, a) P( |s, a) 1 > ρ Ecnt), s,a,s P(|Pt(s |s, a) P(s |s, a)| > ρ/S Ecnt). Now, let gt(λ) = 1 2 Pt k=1 ηm (km)α+β log SA λ . Then, we have P(|Pt(s |s, a) P(s |s, a)| > ρ/S Ecnt) P(|Pt(s |s, a) P(s |s, a)| > ρ/S Nt(s, a) gt(λ)), k=gt(λ) P(|Pt(s |s, a) P(s |s, a)| > ρ/S Nt(s, a) = k), k=gt(λ) 2 exp( 2kρ2), k=0 2 exp( 2(k + gt(λ)ρ2), exp( 2gt(λ)ρ) k=0 2 exp( 2kρ2), 2 exp( 2gt(λ)ρ) 1 exp( 2ρ2) . Hence, we find 2 exp( 2gt(λ)ρ) 1 exp( 2ρ2) 2S2A exp( 2gt(λ)ρ) 1 exp( 2ρ2) , 2S2AT exp( 2ghp(T )(λ)ρ) 1 exp( 2ρ2) . C.3.4 Concentration of the state-action visitation frequency Recall now that ωt is the stationary distribution induced by the exploration policy πt at time t. Further, recall that πt = (1 εt)π t + εtπf,t, ω t = infω Ω(Mt) U(ω; Mt), ω t = (1/t) Pt n=1 ω n and that π t (a|s) = ω t (a|s)/ P a ω t (s, a ). We have the following result. Corollary C.2. Let T 1, p (0, 1), ξ > 0, λ (0, 1). For t T p under E1 p,T (ξ) we have that there exists ρ > 0 and κt(λ) such that As a consequence, under E1 p,T (ξ) we also have that Pt Pt 1 2 T p t 1 + ξ + ϵt 1 5For simplicity, and w.l.o.g., we consider hp(T) as an integer in the summation. Proof. Let π (a|s) = ω (a|s)/ P b ω (b|s) and ˆπ t (a|s) = ω t (a|s)/ P b ω t (b|s). Under E1 p,T (ξ) using [17, Lemma 23] there exists an MDP-specific constant κM such that ωt ω 1 κM Pt Pπ , κM πt π , κM [ π t π + ϵt] , κM [ π t π + ϵt] . Then, for t T p, we have k=1 (ˆπ t π ) which concludes the first part of the proof. For the second part, simply note that Pt Pt 1 Pt Pπ + Pπ Pt 1 , t 1 + ξ + ϵt 1 From the previous lemma we remark the following important fact: under E1 p,T (ξ) for k T q T p we have k + ξ 1 T q p + ξ. In the following proof we provide a a reformulation of [17, Proposition 19]. We require t and T to be larger than some constants. Specifically, we require t T 1 p (note that p (0, 0.5)) and T max Tq(ξ, λ), 1 ξ 1 q p , 1 , where q (0, 1) satisfies p < q, p + q < 1 and Tq(ξ, λ) := inf{T 1 : ξ κT q(λ) + 1 T (1 3p)/2 } with ξ > 0, λ (0, 1) and κt defined in corollary C.1. Proposition C.3. Under MR-Na S there exists T(ξ, λ) such that for all T max Tq(ξ, λ), 1 ξ 1 q p , 1 and all t T 1 p. 
and all functions f : Z [0, 1] we Pt k=1 f(zk) κξξ E1 p,T (ξ) Ecnt(λ) 2 exp( tξ2), (28) where ξ Kξ is a mapping with values in (1, ) such that lim supξ 0 Kξ < . Proof. Let ω (f) = Ez ω [f(z)] and ωt(f) = Ez ωt[f(z)]. Then D = Pt k=1 f(zk) k=1 f(zk) ω (f) Pt k=T q+1 f(zk) ω (f) k=1 f(zk) ω (f) t | {z } (a) Pt k=T q+1 f(zk) ωk 1(f) t | {z } (b) Pt k=T q+1 ωk 1(f) ω (f) t | {z } (c) First term (a). For the first term (a) for t T 1 p and ξ 1/T 1 p q we have t 1 T 1 p q ξ. Last term (c). The last term (c) under E1 p,T (ξ) for t T 1 p is bounded as 6 (c) Pt 1 k=T q ωk ω 1 Pt 1 k=T q κM[T p/k + ξ + ϵTq] ξ + ϵTq + Pt 1 k=T q T p/k ξ + ϵTq + T p log(t) ξ + ϵTq + T p ξ + ϵTq + 1 T (1 3p)/2 Hence, under E1 p,T (ξ) Ecnt(λ), since ϵk κk(λ) we get that (c) κM ξ + κTq(λ) + 1 T (1 3p)/2 . Let Tq(ξ, λ) = inf{T 1 : ξ κT q(λ) + 1 T (1 3p)/2 } where κt is defined in corollary C.1. Then for T Tq(ξ, λ) we have (c) 2κMξ. Second term (b). For the second term we use the function ˆfk solution to the Poisson equation f( ) ωk 1(f) = [ ˆfk Pk ˆfk]( ) [17, Lemma 23]. Hence (b) = (1/t) k=T q+1 ˆfk 1(zk) Pk 1 ˆfk 1(zk) = Mt + Ct + Rt, Pt k=T q+1 ˆfk 1(zk) Pk 1 ˆfk 1(zk 1) Pt k=T q+1 Pk ˆfk(zk) Pk 1 ˆfk 1(zk) Rt := PT q ˆf T q(z T q) Pt ˆft(zt) Before proceeding, always under E1 p,T (ξ) Ecnt(λ) for T Tq(ξ, λ) we have that |ϵk| κTq(λ) ξ, and for k T q, ξ 1/T q p T q p + ξ 2ξ. Therefore Lk Lξ, where Lξ = sup|ϵ| ξ, π π 2ξ L(ϵ, π). Bounding Mt. As in [17], t Mt is a martingale satisfying |k Mk (k 1)Mk 1| 2Lk 1, and hence for t T 1 p we bound using Azuma-Hoeffding inequality P(|Mt| 2Lξξ) 2 exp( tξ2). 6For simplicity, and w.l.o.g., we consider T q as an integer in the summation. Bounding Ct. Again, for t T 1 p we have k=T q+1 Lk[ ωk ωk 1 + Lk 1 Pk Pk 1 ], ωk ωk 1 + 2Lk 1 t 1 + ξ + ϵt 1 ωk ω + ωk 1 ω + 2Lk 1 t 1 + ξ + ϵt 1 k=T q+1 2Lk k 1 + ξ + ϵk 1 k 1 + ξ + ϵk 1 k=T q+1 Lξ(κM + Lξ) T p k 1 + ξ + ϵk 1 2Lξ(κM + Lξ) T p t + ξ + κTq(λ) , 2Lξ(κM + Lξ) 1 T (1 3p)/2 + ξ + κTq(λ) , 4Lξ(κM + Lξ)ξ. Bounding Rt. Finally, for the last term we get for T Tq(ξ, λ) we have |Rt| 2Lξ/t 2Lξ/T 1 p 2Lξ/T (1 3p)/2 2Lξξ. Last step. Summing up all terms yields that for t T 1 p, T max Tq(ξ, λ), 1 ξ 1 q p , 1 Pt k=1 f(zk) κξξ E1 p,T (ξ) Ecnt(λ) 2 exp( tξ2), (29) where κξ = 1 + 2κM + 4Lξ(1 + κM + Lξ). As a corollary of the previous proposition, we find the following concentration result on Nt(s, a)/t and the high probability event E2 T (ξ) = T t=T 1 p | Nt t ω | Kξξ . Corollary C.3. For MR-Na S we have P (s, a) Z, Nt(s, a) t ω (s, a) κξξ E1 p,T (ξ) Ecnt(λ) 2SA exp( tξ2), (30) E2 p,T (ξ) E1 p,T (ξ) Ecnt(λ) 2SA exp( T 1 pξ2) 1 exp( ξ2) . (31) C.3.5 Expected sample complexity of MR-Na S Finally, we have the expected sample complexity upper bound of MR-Na S, which closely follows the classical proofs in [17, 20]. Proof strategy. To summarize, the proof strategy relies on ensuring that Nt(s, a)/t concentrates around ω (s, a). To that aim, the concentration happens under some events E1 p,T (ξ) (defined in lem. C.4) given that Nt(s, a) is visited sufficiently often (see lem. C.3, or the alternative version in lem. C.5). As δ 0, by the properties of β(Nt, δ), one can show that that E[τ]/ ln(1/δ) does not exceed sup M Bρ(ξ)(M), ω ω Kξξ U(ω , M ) ln(1/δ), where Bρ(ξ)(M) is a ball of a certain radius ρ(ξ) around M (see def. of ρ(ξ) in app. C.3.3) and Kξ > 0 for all ξ > 0. Hence, letting ξ 0 concludes the proof. Theorem C.2 (Expected sample complexity). MR-Na S with the stopping rule in prop. 3.1 guarantees that E[τ] < and lim supδ 0 E[τ] log(1/δ) U (M). Proof. 
Define U (ξ) = sup M Bρ(ξ)(M), ω ω Kξξ U(ω , M ). Similarly to [17], by prop. C.3 under E1 p,T (ξ) (defined in lem. C.4) the event E2 p,T (ξ) (defined in corollary C.3) holds with high probability, and there exists T1(ξ, λ) such that for T T(ξ, λ) we have U(Nt/t, Mt) U (ξ) for all t T 1 p. Further, by the properties of the threshold function β(Nt, δ), there exists T3(ξ) such that for t T2(ξ) we have β(Nt, δ) log(1/δ) + ξU (ξ) 1t. Then, let T3(ξ, δ) be such that log(1/δ) + ξU (ξ) 1T3(ξ, δ) = T3(ξ, δ), that is T3(ξ, δ) = U (ξ) log(1/δ) Hence, under E1 p,T (ξ) E2 p,T (ξ) for T max(T1(ξ, λ), T2(ξ), T3(ξ, δ)) it holds that TU 1(NT /T, Mt) β(NT , δ), and thus (τ T) E1 p,T (ξ) E2 p,T (ξ). Therefore, using that E[τ] = P T =1 P(τ > T) one can show7 E[τ] max(T1(ξ, λ), T2(ξ), T3(ξ, δ)) + T =max(T1(ξ,λ),T2(ξ),T3(ξ,δ)) P E1 p,T (ξ) E2 p,T (ξ) , max(T1(ξ, λ), T2(ξ), T3(ξ, δ)) + T =max(T1(ξ,λ),T2(ξ),T3(ξ,δ)) 1 T 2 + CT exp( 2ghp(T )(λ)ρ) + 2SA exp( T 1 pξ2) 1 exp( ξ2) , and thus E[τ] is finite since ghp(T )(λ) grows as T p for a fixed λ. Hence lim sup δ 0 E[τ] log(1/δ) lim sup δ 0 max(T1(ξ, λ), T2(ξ), T3(ξ, δ)) log(1/δ) U (ξ) Letting ξ 0 concludes the proof. C.3.6 Alternative forced exploration with high probability In this section we introduce an alternative result for the forced exploration property with high probability that follows the one proposed in [17]. In addition to the quantities defined in app. C.3, we also define η := η1ηm and m := max m, l 1 2(α+β) m . The following lemma is an alternative to lem. C.3, and it is adaptation of [17, Lemma 11]. Using this lemma, one can obtain a sample complexity upper bound in expectation as in Alg. 2. For the sake of simplicity, we only state the main steps of the proof that differ from the one in [17]. Lemma C.5 (High probability forced exploration). For all λ (0, 1) we have z Z, t 1, Nt(z) t g(λ)2 where g(λ) = m η log 1 + SA λ and m := max m, l 1 2(α+β) m . Proof. Denote by τk(z) the k-th time an agent visit a state-action pair z = (s, a), and define the event E = ( z Z, k 1, τk(z) f(k)) for some increasing function f : N N, satisfying f(0) = 0. One can find that k=1 bk, bk := f(k) f(k 1) 1 1 (f(k 1) + mj + l)α+β 7For simplicity we consider max(T1(ξ, λ), T2(ξ), T3(ξ, δ)) as an integer. For f(k) = λk4 we have f(k) f(k 1) 1 f(k) f(k 1) 1 1 (f(k 1) + mj + l)α+β ! f(k) f(k 1) 1 1 η 1 f(k) m(α+β) f(k) f(k 1) 1 1 η 1 f(k) m(α+β) mf(k) m(α+β) Now, since 1 2(α+β) m, then we have that m(α + β) 1/2, thus bk exp η λk3 λ m . Hence k=1 bk 1 exp η Finally, the conclusion follows similarly as in [17]. The previous result relies on the following close adaptation of [17, Lemma 20]. Again, we only show the main steps of the proof. Lemma C.6. Fix z0 Z, and let Pt be the transition induced by the sampling policy πt of MR-Na S. Define the sub-stochastic matrix Qt obtained by removing from Pt the row and the column corresponding to z0: Qt [Pt(z0|z)]z =z0 [Pt(z|z0)] z =z0 Pt(z0|z0) Then, we have 1 kα+β . (33) Proof. Define rk(n1, n2) := PSA 1 j=1 Qn2 l=n1+1 Ql kj to be the sum of the k-th row of the product of matrices Ql for l {n1 + 1, . . . , n2}, and observe that Qn+m l=n+1 Ql = maxi {1,...,SA 1} ri(n, n + m). Since the MDP is communicating, there exists z Z s.t. Pu(z0|z) η1 and let k be the corresponding row of z in Qt. Then, using lem. C.1 and the fact that ϵt = 1/ max(1, Nt(s))α, for n1 1 we have Pt Pu Nt(s)α+β η1/tα+β, and thus rk (n1, n1 + 1) = j=1 (Qn1+1)k j = 1 Pn1+1(z0|z) 1 η1 1 (n1 + 1)α+β . 
For n1, n2 1, by recursion one can show that rk (n1, n1 + n2) = k j rk (n1, n1 + 1) 1 η1 1 (n1 + 1)α+β . Similarly, for any i {1, . . . , SA} \ {k }, for all n1 {1, . . . , m0} we have ri(n, n + m) = ij rj(n + n1, n + m), 1 η1 1 (n + n1 + 1)α+β To bound the last term, observe that for ni m n+ni Y ϵl Nl(s)β Pu ri(n, n + m) 1 η1ηm which concludes the proof. C.4 Stopping Rule Next, we provide the proof of the δ-PC property of MR-Na S. Note that a necessary condition for the algorithm to stop is that the model at the stopping time Mτ needs to have a unique optimal policy for every r R. Given this condition, we have the following proof of the stopping rule. Proof of prop. 3.1. Let C = { r R : ˆπτ,r / Π (Mr)} be the event that at the stopping time there exists a reward such that the estimated optimal policy is not optimal. Let Alt(M) = r Alt(Mr). Then, since Mτ M, admitting a single optimal policy for all r R, we can say that C { r R : M Alt(Mt,r)} r{M Alt(Mt,r)} {M Alt(Mt)} , where Mt,r = (Mt, r). Then P (τδ < , C) P t 1 : t U(Nt/t; Mt) 1 β(Nt, δ), M Alt(Mt) , P t 1 : t T(Nt/t; Mt) 1 β(Nt, δ), M Alt(Mt) , t 1 : inf r inf M r Alt(Mt,r) s,a Nt(s, a)KLMt|M r(s, a) β(Nt, δ), M Alt(Mt) t 1 : inf M r Alt(Mt,r) s,a Nt(s, a)KLMt|M (s, a) β(Nt, δ), M Alt(Mt) s,a Nt(s, a)KLMt|M(s, a) β(Nt, δ) where the conclusion follows from [86, Prop. 1]. C.5 Technical Lemmas We have the following adaptation of [87, Lemma F.4]. Lemma C.7. Let Y1, . . . , Yt be a sequence of Bernoulli random variables. Let Fi be a filtration, and for i = 1, . . . , t m and m N let Xi = 1Y(i 1)m+1+ +Yim>0 be a sequence of Bernoulli random variables such that Qi = P(Xi|Fi 1) is Fi 1-measurable and Xi is Fi-measurable. Then, for λ 0 we have that Finally, we report the following proposition from [86, 17] that bounds the self-normalized KL process. Proposition C.4 (Prop. 1, [86] and Lemma 15 [17]). Define the threshold β(Nt, δ) := log(1/δ) + s,a log e h 1 + Nt(s,a) S 1 i . Then, for all δ (0, 1) we have s,a Nt(s, a)KLMt|M(s, a) > β(Nt, δ) with the convention that Nt(s, a)KLMt|M(s, a) = 0 whenever Nt(s, a) = 0. D Numerical Results In this section we present more in detail the numerical results for the tabular case and for Deep Reinforcement Learning. We also introduce additional numerical results on radio network control problems. D.1 Numerical Results for the tabular case In this section we delve more into the detail of the numerical results for the tabular case. We focus on different hard-exploration tabular environments: Riverswim [54], Forked Riverswim [53], Double Chain [22] and NArms [54] (an adaptation of Six Arms to N arms). We compare MR-Na S against RF-UCRL [22] (a reward-free exploration method), ID3AL [33] (a maximum entropy exploration approach) and RF-PSRL, a reward-free adaptation of PSRL [35] (posterior sampling for RL). As the baselines are not specific to our setting, we consider different sets of experiments. We begin by describing the environments, and then present the results. D.1.1 Environments Details Here we provide a brief description of the environments. Riverswim. The River Swim environment is a classic reinforcement learning benchmark designed to test exploration [54]. It consists of a series of states arranged in a linear chain, where an agent can choose to swim right (downstream) or left (upstream). In the single-reward setting the agent can achieve a positive return by swimming right, but requires overcoming a strong current, making it a less probable event. 
Conversely, swimming left generally offers small to zero rewards, but is easier. This setup requires the agent to balance immediate, safer rewards with potentially higher but riskier returns. It is exponentially hard for a random uniform agent to reach the final state. In figure fig. 3 is shown a depiction of the environment. There are n states, and two main parameters, p, p (0, 1), and their sum psum = p + p < 1. In the figure, each tuple (a, p) represents the action a that triggers the transition and the probability p of that event. The agent starts in state s0, and in every state can only take two actions {a0, a1}. For small values of p it becomes difficult for the agent to swim right (i.e., reach sn 1), and larger values of p can also hinder the progress. On the other hand, swimming towards the left is easier, since the probability of P(si 1|si, a0) = 1. For the experiments, we used n = 10, p = 0.3, p = 0.6. s0 s2 ... sn 2 sn 1 (a0, 1), (a1, 1 p) (a0, 1), (a1, 1 psum) (a0, 1), (a1, 1 psum) (a0, 1), (a1, 1 psum) (a0, 1), (a1, 1 p) (a0, 1), (a1, p) Figure 3: Riverswim environment [54]. Each tuple (a, p) represents the action a that triggers the transition and the probability p of that event. Forked Riverswim. The Forked River Swim environment [53] is a variation of the traditional River Swim reinforcement learning benchmark, designed to test more complex exploration strategies. In this variant, the state space branches into multiple paths, resembling a river that forks. At intermediate states the agent can switch between the forks, while the end states are not connected. This variant requires the agent to make more sophisticated decisions to explore the environment. This setup increases the sample complexity and challenges the agent s ability to generalize across different paths within the environment. In fig. 4 is shown a depiction of the environment. There are a total of 2n+1 states, and two parameters p, p (0, 1), so that psum = p + p < 1. In the figure, each tuple (a, p) represents the action a that triggers the transition and the probability p of that event. The agent starts in state s0, and in every state can chose between three actions {a0, a1, a2}. For small values of p it becomes difficult for the agent to swim right in both forks, and larger values of p can also hinder the progress. As in Riverswim, swimming towards the left is easier, since the probability of P(si 1|si, a0) = 1. For the experiments, we used n = 4, p = 0.3, p = 0.6. s0 s1 s2 sn 1 sn s 1 s 2 s n 1 s n (a0, 1), (a1, 1 p), (a2, 1) (a0, 1), (a1, 1 psum) (a0, 1), (a1, 1 psum) (a1, p), (a2, 1) (a0, 1), (a1, 1 p) (a0, 1) (a1, 1 psum) (a0, 1) (a1, 1 psum) (a1, p), (a2, 1) (a0, 1), (a1, 1 p) Figure 4: Forked Riverswim environment [53]. Each tuple (a, p) represents the action a that triggers the transition and the probability p of that event. Double Chain. The Double Chain environment [22] consists of two chains, similarly to the Forked Riverswim. The main difference consists in the fact that it is not possible to switch between the two chains, and intermediate states are transient (there is no parameter p ). In fig. 5 is shown a depiction of the environment. There are a total of 2n+1 states, and one parameters p (0, 1). In the figure, each tuple (a, p) represents the action a that triggers the transition and the probability p of that event. The agent starts in state s0, and in every state can chose between two actions {a0, a1}. 
For small values of p it becomes difficult for the agent to move to the end of the chain in both chains. For the experiments, we used n = 6, p = 0.7. s 1 s n 1 s n (a0, 1) (a1, 1 p) (a0, 1), (a1, 1 p) (a0, 1), (a1, 1 p) (a0, 1), (a1, 1 p) (a1, p) (a0, 1) (a1, 1 p) (a0, 1), (a1, 1 p) (a0, 1), (a1, 1 p) (a0, 1), (a1, 1 p) Figure 5: Double Chain environment [22] . Each tuple (a, p) represents the action a that triggers the transition and the probability p of that event. NArms. This environment is an adaptation to N arms of the original 6Arms environment from [54]. Differently from the previous environments, this is a bandit-like environment, where the agent is presented with n different actions (or arms) to choose from. The agent starts in a state s0 and selects an arm ai. Upon selecting an arm, the agent may transition to corresponding state si. Certain arms are more difficult to observe, in the sense that the transition probability is lower. This property mimics the probability of collecting a reward in a bandit problem. In fig. 6 is shown a depiction of the environment. There are a total of n + 1 states, and one parameters p0 (0, 1). In the figure, each tuple (a, p) represents the action a that triggers the transition and the probability p of that event. The notation an0:n0+n indicates all the actions in {an0, . . . , an0+n}. The agent starts in state s0, and in every state she can select between n actions {a0, a1, . . . , an 1}. For small values of p0 it becomes difficult for the agent to move to different states. Similarly, it is harder to navigate to states si for large values of n. We used p0 = 1 and n = 4. (an 1, p0/n) (an 2, p0/(n 1)) (ai, 1 p0/(i + 1))n i=1 (a0:n 2, 1) (an 2:n, 1) (a0:n 3, 1) Figure 6: NArms environment. Each tuple (a, p) represents the action a that triggers the transition and the probability p of that event. In the figure the notation an0:n0+n indicates all the actions in {an0, . . . , an0+n}. In state s0 the probability to remain in s0 for any action ai is P(s0|s0, ai) = 1 p0/(i + 1), with the exception that P(s0|s0, a0) = 0. D.1.2 Numerical Results For the numerical results in the tabular case, we focus on 3 different aspects of the problem: (1) policy estimation error; (2) value estimation error; (3) exploration properties. For the first two aspects we also consider the capacity of MR-Na S to generalize over rewards that do not belong to R, specifically: MR-Na S uses the canonical basis of rewards Rcanonical to compute the allocation ω t and explore the environments. We also use α = 0.99, β = 0.01. We also test the generalization capacity of MR-Na S over a continuous set of rewards. At each time step we collect 30 additional rewards from Rrnd = {R1, . . . , R30}, which is a set of 30 reward vectors uniformly sampled from [0, 1]SA, different for each seed. Note that these vectors are not used by MR-Na S to drive its exploration 8. We consider a discount factor of γ = 0.9, confidence δ = 10 2 and run each algorithm for T = 5 105 steps. In all experiments we depict the average value and its 95% confidence interval over 100 seeds. MR-PSRL algorithm and RF-UCRL. MR-PSRL is a simple adaptation of PSRL (posterior sampling for RL [35]). PSRL is an episodic algorithm that draws an MDP estimate from a posterior distribution. We adapt PSRL by sampling at the beginning of an episode a random reward vector r in [0, 1]SA, 8Since we use the canonical basis the second assumption (CH) in thm. B.3 is satisfied. and exploring according to Mt,r for that episode. 
In our case, we simply draw r ~ Dir(1) from a Dirichlet distribution, where 1 is a vector of ones of size SA. Since we do not consider a finite-horizon setting, and both RF-UCRL and MR-PSRL are episodic algorithms, we set a horizon H = 1/(1 − γ). Policy estimation error. We report in fig. 7 the estimation error of the optimal policies. In the plots we depict the average value of e_{t,r} = |Π*(M_r) Δ Π*(M_{t,r})| / |Π*(M_r) ∪ Π*(M_{t,r})| for a reward r (in the expectation the reward r is a random variable), where A Δ B is the symmetric difference between two sets A, B. Intuitively, e_{t,r} accounts both for policies that are not accurately estimated and for policies that are optimal in M_{t,r} but not in M_r. Figure 7: Estimation error of the optimal policies. The plots depict the average value of e_{t,r} (r is a random quantity) and its 95% confidence interval. In the first plot we combine the results for the canonical basis R_canonical and the random rewards R_rnd. In the second plot we only consider the canonical basis of rewards R_canonical, whilst in the third plot we only consider the random rewards R_rnd. In the first plot we combine the results for the canonical basis R_canonical and the random rewards R_rnd. In the second plot we only consider the canonical basis of rewards R_canonical, whilst in the third plot we only consider the random rewards R_rnd. When considering only the canonical basis, the advantage of MR-Na S is more noticeable, especially in the Riverswim and NArms environments (observe that, for Forked Riverswim, bringing the estimation error to 0 requires a far greater number of samples, since the environment is more difficult and the canonical basis is larger). We also see from the last plot, the estimation error over random rewards, that MR-Na S is able to identify the correct optimal policies over rewards uniformly sampled from [0, 1]^{SA}. Value estimation error. We report in fig. 8 the value estimation error over different rewards. In the plots we depict the average value of E[ ||V*_r − V_{t,r}||_1 / S ], where the reward r is a random variable when accounting for the random rewards in R_rnd. From the results we observe the good performance of MR-Na S, despite the fact that the objective of MR-Na S is not to optimize the value estimation error. Exploration properties. In this paragraph we go into more detail on the exploration properties of the various algorithms. In particular, we focus on three different metrics. 1. The deviation in visitation time, defined as Δ_{visit,t} = max_{s'} t_visit(s') − min_s t_visit(s), where t_visit(s) is the time step at which the algorithm last visited state s. This metric quantifies how much an algorithm favours the visitation of some states compared to others. In fig. 9 we show the results. We see that the initial transient of MR-Na S is relatively quick, and that the metric then converges to a steady-state value. While MR-Na S tends to focus on some states, it also manages to keep visiting all the other states, keeping the deviation small.
2. The least visited state-action pair mins,a Nt(s, a). This metric is particularly important since it is a proxy value for the value estimation error and the policy estimation error. The top figure in fig. 10 show this metric. Also in this case we see a fast initial transient for MR-Na S. 3. The last metric we consider is the entropy of the number of state-action visits H(Nt), defined as s,a pt(s, a) log(pt(s, a)), where pt(s, a) = Nt(s, a) This metric intuitively quantifies how spread is the exploration across the state-action space. The bottom figure in fig. 10 shows this metric normalized in [0, 1]. Larger values indicate that the exploration is more uniform across the state-action space. Compute resources and additional information. For these simulations we used 1 G5.4XLARGE AWS instance with 16 v CPUs, 64 Gi B of memory and 1 A10G GPU with 24 Gi B of memory. To obtain all the results 2-3 days are needed. The entire research project needed roughly 15 days of computation time for this experiment. We set up our experiments using Python 3.11 [88] (for more information, please refer to the following link http://www.python.org), and made use of the following libraries: Num Py [89], Sci Py [90], CVXPY [91], Seaborn [92], Pandas [93], Matplotlib [94]. In CVXPY we used the CLARABEL optimizer [95] and/or the ECOS optimizer [96]. Changes, and new code, are published under the MIT license. To run the code, please, read the attached README file for instructions. D.2 Numerical Results: Cartpole swing-up and Deep Sea problems In this section we discuss the numerical results on the Cartpole swing-up problem and the Deep Sea problem. Libraries used in the experiments. We set up our experiments using Python 3.11 [88] (For more information, please refer to the following link http://www.python.org), and made use of the following libraries: Cython [97], Num Py [89], Sci Py [90], Py Torch [98], Seaborn [92], Pandas [93], Matplotlib [94]. Changes, and new code, are published under the MIT license. To run the code, 0 20000 40000 Steps Value error E [ Vr Vt,r 1/S] 0 20000 40000 Steps Forked Riverswim 0 20000 40000 Steps Double Chain 0 20000 40000 Steps MR-Na S RF-UCRL ID3AL MR-PSRL 0 20000 40000 Steps E [ Vr Vt,r 1/S] 0 20000 40000 Steps Forked Riverswim 0 20000 40000 Steps Double Chain 0 20000 40000 Steps Value error over the canonical basis MR-Na S RF-UCRL ID3AL MR-PSRL 0 20000 40000 Steps E [ Vr Vt,r 1/S] 0 20000 40000 Steps Forked Riverswim 0 20000 40000 Steps Double Chain 0 20000 40000 Steps Value error over random rewards MR-Na S RF-UCRL ID3AL MR-PSRL Figure 8: Value estimation error. The plots depict the average value of E h Vr Vt,r 1 S i ,. In the first plot we combine together the results for the canonical basis Rcanonical and the random rewards Rrnd. In the second plot we only consider the canonical basis of rewards Rcanonical, whilst in the third plot we only consider the random rewards Rrnd. please, read the attached README file for instructions. In the code, we make use of some code from the Behavior suite [36], which is licensed with the APACHE 2.0 license. D.2.1 Cartpole swing-up Environment details. The cartpole swing-up problem is a well-known challenge in control theory and reinforcement learning [99]. The objective is to balance a pole attached to a cart by an unactuated joint, where the cart moves along a frictionless track. Control is exerted by applying force to the cart. 
Initially, the pole hangs downward, and the goal is to swing it up into an upright position and maintain its balance there. Unlike the traditional cartpole balancing problem, this task requires both swinging the pole up and keeping it balanced once upright. Figure 9: Average value of Δ_{visit,t} = max_{s'} t_visit(s') − min_s t_visit(s), where t_visit(s) is the time step at which the algorithm last visited state s. Figure 10: Top: average value of the least visited state-action pair min_{s,a} N_t(s, a). Bottom: normalized average value of the entropy of N_t over time (see text for a definition). Figure 11: Diagram of the Cartpole swing-up problem. The system's state at any given moment is defined by four variables: the cart's position x, the cart's velocity ẋ, the pole's angle θ, and the pole's angular velocity θ̇. There are also four additional variables in the state; for brevity, we direct the reader to [36]. This problem is challenging because the reward is sparse and includes a penalty for movement. Reward functions. In the multi-reward variant the objective is to stabilize the pole vertically around a specified position x_0. The agent receives a positive reward only if the following conditions are met: (1) cos(θ) > k/20 for the pole's angle, (2) |θ̇| ≤ θ̇_0 for its angular velocity, and (3) |x − x_0| ≤ 1 − k/20 for the cart's position, where θ̇_0 is a constant and k is an integer indicating the difficulty level. Additionally, the agent incurs a negative reward of 0.1 for moving, which hinders the learning process. Classical algorithms like DQN [100] tend to remain stationary due to this penalty. The reward at time t for a given value of x_0 is then r_{x_0}(s_t, a_t) = −0.1 · 1{moving} + 1{cos(θ_t) > k/20} · 1{|θ̇_t| ≤ θ̇_0} · 1{|x_t − x_0| ≤ 1 − k/20}. We considered a set of rewards R_base = (r_{x_0})_{x_0 ∈ X} that consists of 5 rewards for different values of x_0, where x_0 is evenly spaced in [−1.5, 1.5] (that is, X = {−1.5, −0.75, 0, 0.75, 1.5}). To evaluate DBMR-BPI's ability to generalize to unseen rewards, we also evaluate the rewards collected in the set R_rnd, which consists of 5 reward functions where, for each reward, x_0 is uniformly sampled in [−1.5, 1.5]. The set R_rnd is different for each seed, and the random rewards in R_rnd were not used during training of DBMR-BPI. Numerical results: performance. In figs. 12 and 13 we depict the performance of the various algorithms over R_base and R_rnd for two difficulty levels k ∈ {3, 5}. In tabs. 4 and 5 we report the values at the last time step T of the various metrics considered. In particular, we focus on the distribution of the random variable X_i = (1/T) Σ_{t=1}^T 1{r_i(s_t, a_t) > 0}, where i indexes the rewards in the corresponding set. This metric measures the average amount of information collected by the agent that is useful to learn how to stabilize the pole upright. In the figures and in the tables, the quantity Avg_R({X_i}) (similarly for the other metrics) indicates the arithmetic mean of {X_i}_i over the rewards. We average results across 30 seeds.
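To make the evaluation protocol above concrete, the following minimal sketch computes X_i and the summary statistics reported in the tables. It is an illustrative reimplementation under our own naming (the (T, R) array reward_traces is a hypothetical log of per-step rewards), not the evaluation code used for the paper.

```python
import numpy as np

def summarize_positive_reward_fraction(reward_traces: np.ndarray) -> dict:
    """reward_traces has shape (T, R): one column per reward function r_i.
    Computes X_i = (1/T) * sum_t 1{r_i(s_t, a_t) > 0} and the statistics
    reported in the tables (median, mean, std, max, min over the rewards)."""
    T = reward_traces.shape[0]
    X = (reward_traces > 0).sum(axis=0) / T          # X_i for each reward i
    return {"med": float(np.median(X)), "avg": float(np.mean(X)),
            "std": float(np.std(X)),    "max": float(np.max(X)),
            "min": float(np.min(X))}

# Synthetic example: 5 rewards, 1000 steps, sparse positive-reward events.
rng = np.random.default_rng(0)
traces = (rng.random((1000, 5)) < 0.05).astype(float)
stats = summarize_positive_reward_fraction(traces)
print({k: 100 * v for k, v in stats.items()})        # values are reported x100, as in the tables
```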
Base rewards R_base | E[med_R(X_i)] | E[Avg_R(X_i)] | E[std_R(X_i)] | E[max_R(X_i)] | E[min_R(X_i)]
k = 3, T = 150000 | DBMR-BPI | 21.8 (20.9, 22.7) | 21.3 (20.6, 21.9) | 6.1 (5.5, 6.6) | 28.9 (27.6, 30.2) | 12.8 (12.0, 13.6)
 | RND | 1.3 (1.0, 1.7) | 1.3 (1.0, 1.6) | 0.3 (0.2, 0.3) | 1.7 (1.3, 2.1) | 0.9 (0.7, 1.2)
 | APT | 0.3 (0.3, 0.4) | 0.3 (0.3, 0.4) | 0.1 (0.1, 0.2) | 0.5 (0.4, 0.6) | 0.2 (0.1, 0.2)
 | Disagreement | 0.9 (0.7, 1.2) | 0.9 (0.7, 1.2) | 0.3 (0.2, 0.6) | 1.3 (0.9, 2.0) | 0.5 (0.4, 0.6)
k = 5, T = 200000 | DBMR-BPI | 21.6 (20.9, 22.3) | 20.2 (19.6, 20.6) | 6.4 (6.0, 6.9) | 28.0 (27.0, 29.0) | 11.4 (10.7, 12.0)
 | RND | 1.2 (0.9, 1.4) | 1.2 (0.9, 1.5) | 0.2 (0.2, 0.3) | 1.5 (1.2, 1.9) | 0.8 (0.6, 1.1)
 | APT | 0.3 (0.3, 0.4) | 0.3 (0.3, 0.4) | 0.1 (0.1, 0.1) | 0.5 (0.4, 0.6) | 0.2 (0.1, 0.2)
 | Disagreement | 0.7 (0.5, 0.9) | 0.7 (0.5, 1.0) | 0.3 (0.2, 0.4) | 1.2 (0.7, 1.6) | 0.4 (0.3, 0.6)
Table 4: Cartpole swing-up. Statistics for the random variable X_i = (1/T) Σ_{t=1}^T 1{r_i(s_t,a_t) > 0}, with r_i ∈ R_base. Values are multiplied by 100 and rounded to the first digit. In bold statistically significant results (in parentheses we indicate the 95% confidence interval). In general, we see that DBMR-BPI outperforms all of the unsupervised learning baselines. Interestingly, even though DBMR-BPI uses the base set of rewards R_base to explore, we see better performance when collecting rewards from R_rnd (which are not used by DBMR-BPI during exploration). The reason may be that some rewards in R_base (e.g., those with |x_0| = 1.5) are harder to learn, while it is unlikely for the random rewards to have such values of x_0. This result suggests that the data collected by DBMR-BPI could also be used for rewards that are similar to those in R. Figure 12: Cartpole swing-up. Statistics over time for the random variable X_i = (1/T) Σ_{t=1}^T 1{r_i(s_t,a_t) > 0}, with r_i ∈ R_base (panels for k = 3 and k = 5). Values are multiplied by 100 and rounded to the first digit. Figure 13: Cartpole swing-up. Statistics for the random variable X_i = (1/T) Σ_{t=1}^T 1{r_i(s_t,a_t) > 0}, with r_i ∈ R_rnd. Values are multiplied by 100 and rounded to the first digit. Numerical results: rewards chosen by DBMR-BPI. In fig. 14 we plot the value of Δ_{t,r} estimated by DBMR-BPI. We see that, according to the algorithm, the third reward, which corresponds to x_0 = 0, is initially the easiest to learn. This fact is also depicted in fig. 15, which shows the number of times a certain reward was picked by DBMR-BPI for exploration (the plot does not account for the uniform exploration over rewards, or for the tracking procedure that ensures each reward is selected at a rate of O(√n), where n is the number of episodes).
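The reward-selection mechanism with forced tracking can be illustrated with the following sketch. It is only a schematic reading of the rule described above: the gap-based scores, the probability of uniform reward selection, and the exact √n threshold are our own placeholder choices, not the precise rule of DBMR-BPI (see app. E).

```python
import numpy as np

def select_reward(n_episode: int, counts: np.ndarray, gap_estimates: np.ndarray,
                  p_uniform: float = 0.5, rng=None) -> int:
    """Pick which reward to explore at the start of episode n_episode.

    counts[i]        : number of episodes in which reward i was selected so far.
    gap_estimates[i] : current estimate of the gap Delta_{t, r_i} (smaller = harder reward).
    Forced tracking  : any reward selected fewer than sqrt(n) times is picked first,
                       so every reward is selected at a rate of at least O(sqrt(n)).
    """
    rng = rng or np.random.default_rng()
    under_tracked = np.flatnonzero(counts < np.sqrt(n_episode))
    if under_tracked.size > 0:
        return int(under_tracked[np.argmin(counts[under_tracked])])
    if rng.random() < p_uniform:                        # uniform exploration over rewards
        return int(rng.integers(len(counts)))
    scores = 1.0 / (np.asarray(gap_estimates) + 1e-8)   # favour rewards with a small estimated gap
    return int(rng.choice(len(counts), p=scores / scores.sum()))
```

Selections made by this forced-tracking branch (and by the uniform branch) are exactly the ones excluded from the counts shown in fig. 15.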
Rewards corresponding to |x0| = 1.5 are harder to learn since the episode terminates if the agent exceeds the x boundary of the environment. Random rewards Rrnd E[med Rrnd(Xi)] E[Avg Rrnd(Xi)] E[std Rrnd(Xi)] E[max Rrnd(Xi)] E[min Rrnd(Xi)] k = 3 DBMR-BPI 23.7(22.5, 24.9) 23.1(22.2, 24.2) 4.1(3.6, 4.6) 28.1(26.9, 29.3) 17.1(16.0, 18.2) T = 150000 RND 1.3(1.1, 1.7) 1.4(1.1, 1.7) 0.2(0.1, 0.2) 1.6(1.2, 2.0) 1.1(0.9, 1.4) APT 0.3(0.3, 0.4) 0.3(0.3, 0.4) 0.1(0.1, 0.1) 0.5(0.4, 0.6) 0.2(0.2, 0.2) Disagreement 1.1(0.7, 1.8) 1.0(0.7, 1.6) 0.3(0.1, 0.5) 1.3(0.8, 2.0) 0.6(0.5, 0.7) k = 5 DBMR-BPI 23.0(21.8, 24.1) 22.3(21.5, 23.1) 4.5(4.0, 4.9) 27.6(26.7, 28.6) 15.5(14.4, 16.7) T = 200000 RND 1.1(0.9, 1.4) 1.1(0.9, 1.4) 0.2(0.1, 0.2) 1.4(1.1, 1.7) 0.9(0.7, 1.1) APT 0.3(0.3, 0.4) 0.3(0.3, 0.4) 0.1(0.1, 0.1) 0.4(0.4, 0.5) 0.2(0.2, 0.3) Disagreement 0.7(0.5, 0.9) 0.8(0.5, 1.0) 0.2(0.1, 0.3) 1.0(0.7, 1.4) 0.5(0.4, 0.7) Table 5: Cartpole swing-up. Statistics for the random variable Xi = 1 T PT t=1 1ri(st,at)>0, with ri Rrnd. Values are multiplied by 100 and rounded to the first digit. In bold statistically significant results (in parentheses we indicate the 95% confidence interval). 0 20000 40000 60000 80000 100000 120000 140000 Timestep t Estimation of t,r for DBMR-BPI - k = 3 0 25000 50000 75000 100000 125000 150000 175000 200000 Timestep t Estimation of t,r for DBMR-BPI - k = 5 Figure 14: Estimate t,r of DBMR-BPI over time steps. The i-th reward corresponds to the i-th reward in Rbase. Shaded area corresponds to the 95% confidence interval over 30 seeds. 1 2 3 4 5 Reward id Number of rewards chosen by DBMR-BPI up to timestep T(k = 3) 1 2 3 4 5 Reward id Number of rewards chosen by DBMR-BPI up to timestep T(k = 5) Figure 15: Distribution of the number of times a certain reward ri was picked by DBMR-BPI up to time step T. The i-th reward corresponds to the i-th reward in Rbase. The plot does not account for rewards chosen uniformly at random or due to forced tracking. Parameters and compute resources. For these simulations we used 1 G5.4XLARGE AWS instance with 16 v CPUs, 64 Gi B of memory and 1 A10G GPU with 24 Gi B of memory. To obtain all the results 3-4 days are needed. The entire research project needed roughly 21 days of computation time for this experiment. The parameters are listed in tab. 6. Refer to app. E for further details on the parameters and the algorithms. DBMR-BPI parameters Value APT parameters Value Disagreement parameters Value RND parameters Value Nensemble 20 Ndqn [128, 128] Ndqn [128, 128] Ndqn [128, 128] Ndbmrbpi [128, 128] C 2 105 C 2 105 C 2 105 τ 10 2 τ 10 2 τ 10 2 τ 10 2 αQ, αM (5 10 4, 5 10 5) α 5 10 4 α 5 10 4 α 5 10 4 puniform 0.5 Nrep 512 Nensemble 20 Nrep 512 pmask 0.7 Ntrunk [512] Ndisag [128] Nrnd [128] k 2 NF [128] αdisag 10 4 αrnd 10 4 sprior 6 NB [128] C 2 105 αapt 10 4 Kk 12 Table 6: Parameters and values of the algorithms. For all algorithms we used a batch size of 128. D.2.2 Deep Sea Environment details. The Deep Sea problem is a hard-exploration reinforcement learning challenge set in an N N grid. The agent starts in the top-left corner (state (0, 0)) and, in the classical version, aims to reach the bottom-right corner (state (N 1, N 1)) for a large reward. The state vector is an N 2-dimensional vector that one-hot encodes the agent s position in the grid. The agent can move diagonally, left or right, and down when near the wall. Reaching the bottom-right corner yields a reward of 1. 
We also incorporate slipping, as described in [53], so that when the agent selects an action a, there is a 5% probability that a different action a' ≠ a is executed instead. Lastly, for each seed, the action mapping is randomized in each state (so that in two different states an action a can correspond to going left in the first state and going right in the second state). Figure 16: Depiction of the Deep Sea problem as a tabular MDP with N = 4. Reward functions. We adapt the Deep Sea environment to a multi-reward setting by introducing N different rewards, R_base = {r_1, …, r_N}, where r_i assigns a positive reward whenever the agent reaches column i in the last row of the grid, that is, r_i(s_t, a_t) = 1{s_t = (N−1, i)}. To evaluate the capability of DBMR-BPI in collecting unseen rewards, we sampled 20 reward functions R_rnd = (r'_i)_{i=1}^{20} for each seed. Each reward function is essentially a vector of dimension N(N+1)/2, corresponding to the possible positions in the grid, and is sampled from a Dirichlet distribution with parameter α = (α_i)_i, where α_i = 1/(10N) for all i = 1, …, N(N+1)/2. When the agent is in the state corresponding to row k and column j, it collects the corresponding entry of the reward vector r'_i. We then computed the rewards for these additional reward functions to assess performance. Note that these rewards are not used by DBMR-BPI during training. Numerical results: performance. In figs. 17 and 18 we depict the performance of the various algorithms over R_base and R_rnd for two sizes N ∈ {20, 30}. In tabs. 7 and 8 we report the values at the last time step T of the various metrics considered. In particular, we focus on the distribution of the random variable X_i = (1/T) Σ_{t=1}^T r_i(s_t, a_t), where i indexes the rewards in the corresponding set. This metric measures the average reward collected for the i-th reward in a specific instance. In the figures and in the tables, the quantity Avg_R({X_i}) (similarly for the other metrics) indicates the arithmetic mean of {X_i}_i over the rewards, while GM_R({X_i}) indicates the geometric mean over the rewards. We average results across 30 seeds. Figure 17: Deep Sea environment. Statistics for the random variable X_i = (1/T) Σ_{t=1}^T r_i(s_t, a_t) (with r_i ∈ R_base). Values are multiplied by 100 and rounded to the first digit (except for the last plot). Shaded areas depict 95% confidence intervals.
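The reward construction just described can be sketched as follows. The flat indexing of the N(N+1)/2 reachable cells (row k, column j with j ≤ k) is our own convention for illustration, not the encoding used in the actual experiments.

```python
import numpy as np

def deep_sea_rewards(N: int, n_random: int = 20, seed: int = 0):
    """Build the multi-reward sets used for the Deep Sea experiments.

    Base rewards:   r_i(s) = 1 if s is cell (row N-1, column i), else 0.
    Random rewards: vectors over the N(N+1)/2 reachable cells, drawn from a
                    Dirichlet distribution with concentration 1/(10*N) per entry.
    """
    rng = np.random.default_rng(seed)
    # Reachable cells are (row k, column j) with j <= k; map each to a flat index.
    cells = [(k, j) for k in range(N) for j in range(k + 1)]
    index = {cell: i for i, cell in enumerate(cells)}      # our own flat indexing
    n_cells = len(cells)                                   # = N * (N + 1) / 2

    base = np.zeros((N, n_cells))
    for i in range(N):
        base[i, index[(N - 1, i)]] = 1.0                   # reward for column i of the last row

    alpha = np.full(n_cells, 1.0 / (10 * N))
    random_rewards = rng.dirichlet(alpha, size=n_random)   # one reward vector per row
    return base, random_rewards, index

base, rnd, idx = deep_sea_rewards(N=20)
print(base.shape, rnd.shape)                               # (20, 210) (20, 210)
```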
Base rewards R_base | E[med_R(X_i)] | E[GM_R(X_i)] | E[std_R(X_i)] | E[max_R(X_i)] | E[min_R(X_i)] | E[Σ_i 1{X_i = 0}]
N = 20, T = 50000 | DBMR-BPI | 5.0 (4.9, 5.1) | 4.6 (4.6, 4.6) | 1.8 (1.7, 1.8) | 7.9 (7.8, 8.0) | 1.7 (1.6, 1.8) | 0.0 (0.0, 0.0)
 | RND | 4.8 (4.7, 4.9) | 4.6 (4.2, 4.7) | 1.7 (1.6, 1.8) | 8.9 (8.5, 9.2) | 2.4 (2.1, 2.6) | 0.0 (0.0, 0.1)
 | APT | 3.3 (3.0, 3.6) | 1.4 (0.8, 1.9) | 5.3 (4.8, 5.8) | 19.7 (17.7, 21.9) | 0.2 (0.1, 0.2) | 0.9 (0.6, 1.3)
 | Disagreement | 1.7 (1.4, 2.1) | 0.0 (0.0, 0.0) | 6.2 (5.9, 6.5) | 20.4 (18.7, 22.2) | 0.0 (0.0, 0.0) | 4.0 (3.4, 4.5)
N = 30, T = 100000 | DBMR-BPI | 3.5 (3.4, 3.5) | 3.1 (2.9, 3.2) | 0.7 (0.7, 0.7) | 4.4 (4.3, 4.6) | 1.1 (1.0, 1.3) | 0.0 (0.0, 0.1)
 | RND | 3.1 (3.0, 3.2) | 1.8 (1.3, 2.2) | 1.9 (1.9, 2.0) | 8.1 (7.7, 8.5) | 0.3 (0.2, 0.4) | 0.3 (0.2, 0.5)
 | APT | 2.0 (1.8, 2.2) | 0.0 (0.0, 0.0) | 3.9 (3.6, 4.2) | 16.0 (14.5, 17.6) | 0.0 (0.0, 0.0) | 3.3 (2.8, 3.8)
 | Disagreement | 0.2 (0.1, 0.2) | 0.0 (0.0, 0.0) | 7.0 (6.6, 7.3) | 27.3 (25.1, 29.6) | 0.0 (0.0, 0.0) | 11.5 (10.8, 12.2)
Table 7: Deep Sea environment. Statistics for the random variable X_i = (1/T) Σ_{t=1}^T r_i(s_t, a_t) (with r_i ∈ R_base). Values are multiplied by 100 and rounded to the first digit. In parentheses we indicate the 95% confidence interval. For the base case, we report the geometric mean over the rewards, since the arithmetic mean would be a constant for this specific setting (the number of visits in the last row, normalized by t, can be seen as a frequency and sums up to 1). Hence, for this setting the geometric mean is more appropriate [58]. We also report the metric Σ_{i=1}^N 1{X_i = 0}, which indicates the number of cells in the last row of the grid that have yet to be visited at a certain time step. Figure 18: Deep Sea environment. Statistics for the random variable X_i = (1/T) Σ_{t=1}^T r_i(s_t, a_t) (with r_i ∈ R_rnd). Values are multiplied by 100 and rounded to the first digit (except for the last plot). Shaded areas depict 95% confidence intervals.
Random rewards R_rnd | E[med_Rrnd(X_i)] | E[Avg_Rrnd(X_i)] | E[std_Rrnd(X_i)] | E[max_Rrnd(X_i)] | E[min_Rrnd(X_i)]
N = 20, T = 50000 | DBMR-BPI | 8.0 (7.7, 8.3) | 8.9 (8.5, 9.2) | 3.6 (3.2, 4.0) | 18.7 (17.0, 20.5) | 4.3 (4.0, 4.6)
 | RND | 7.9 (7.6, 8.1) | 8.7 (8.4, 9.1) | 3.5 (3.1, 4.0) | 18.6 (16.7, 20.8) | 4.5 (4.1, 4.8)
 | APT | 6.6 (6.0, 7.1) | 8.7 (8.2, 9.2) | 6.6 (5.9, 7.4) | 27.2 (24.6, 30.1) | 1.6 (1.3, 1.9)
 | Disagreement | 7.9 (7.3, 8.5) | 8.9 (8.3, 9.5) | 6.4 (5.8, 7.1) | 24.4 (22.1, 26.8) | 0.8 (0.5, 1.0)
N = 30, T = 100000 | DBMR-BPI | 5.0 (4.9, 5.2) | 6.0 (5.8, 6.3) | 3.0 (2.5, 3.6) | 16.0 (13.4, 18.8) | 3.4 (3.2, 3.6)
 | RND | 5.4 (5.3, 5.6) | 6.1 (5.9, 6.4) | 3.0 (2.4, 3.5) | 15.1 (12.7, 17.9) | 2.7 (2.5, 3.0)
 | APT | 4.6 (4.3, 4.9) | 5.9 (5.6, 6.2) | 4.3 (3.8, 4.7) | 18.6 (16.7, 20.7) | 1.3 (1.0, 1.6)
 | Disagreement | 4.3 (3.6, 5.1) | 6.2 (5.6, 6.7) | 6.1 (5.6, 6.6) | 22.2 (20.2, 24.4) | 0.2 (0.1, 0.3)
Table 8: Deep Sea environment. Statistics for the random variable X_i = (1/T) Σ_{t=1}^T r_i(s_t, a_t) (with r_i ∈ R_rnd). Values are multiplied by 100 and rounded to the first digit. In parentheses we indicate the 95% confidence interval. When considering the base set of rewards R_base used by DBMR-BPI, we see that APT and Disagreement do not match the performance of DBMR-BPI and RND, focusing most of the time on a few rewards. On the other hand, while DBMR-BPI and RND show similar performance for N = 20, the difference increases for N = 30 on the base set of rewards.
We also note that DBMR-BPI tend to focus less on specific rewards (on average maxi Xi is smaller). For larger N DBMR-BPI seems also to explore more uniformly (lower standard deviation, and higher minimum). When considering the total reward collected from the set Rrnd, we see that DBMR-BPI achieves performance similar to RND, which confirms the capability of DBMR-BPI to collect reward from unseen reward functions. Numerical results: rewards chosen by DBMR-BPI. In fig. 19 we plot the value of t,r estimated from DBMR-BPI. According to DBMR-BPI the rewards in Rbase are initially the hardest one to learn ( t,r is small for those rewards), but then, as the algorithm eventually reaches those states, the estimated gap grows larger, and the hardest reward to learn is actually the first one, corresponding to the one that yields positive reward in state (N 1, 0). This is indeed expected, since there are different paths in the MDP that the agent can take to reach that state, while to reach state (N 1, N 1) there is only one possible sequence of actions (and one can verify that the sub-optimality gap is also larger). Fig. 20 shows the number of time a certain reward was picked by DBMR-BPI for exploration (the plot does not account the uniform exploration over rewards, or the tracking procedure that ensures each rewards is selected at a rate of O( n), where n is the number of episodes). The last rewards (those with high id value) have been chosen the most since their sub-optimality gap was smaller for a longer period of time. 0 10000 20000 30000 40000 50000 Timestep t Estimation of t,r for DBMR-BPI - N = 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 20000 40000 60000 80000 100000 Timestep t Estimation of t,r for DBMR-BPI - N = 30 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 Figure 19: Estimate t,r of DBMR-BPI over time steps. The i-th reward corresponds to the i-th reward in Rbase. Shaded area corresponds to the 95% confidence interval over 30 seeds. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Reward id Number of rewards chosen by DBMR-BPI up to timestep T(N = 20) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 Number of rewards chosen by DBMR-BPI up to timestep T(N = 30) Figure 20: Distribution of the number of times a certain reward ri was picked by DBMR-BPI up to time step T. The i-th reward corresponds to the i-th reward in Rbase. The plot does not account for rewards chosen uniformly at random or due to forced tracking. Parameters and compute resources. For these simulations we used 1 G5.4XLARGE AWS instance with 16 v CPUs, 64 Gi B of memory and 1 A10G GPU with 24 Gi B of memory. To obtain all the results 2-3 days are needed. The entire research project needed roughly 18 days of computation time for this experiment. The parameters are listed in tab. 9. Refer to app. E for further details on the parameters and the algorithms. DBMR-BPI parameters Value APT parameters Value Disagreement parameters Value RND parameters Value Nensemble 20 Ndqn [64, 64] Ndqn [64, 64] Ndqn [64, 64] Ndbmrbpi [64, 64] C 105 C 105 C 105 τ 10 2 τ 10 2 τ 10 2 τ 10 2 αQ, αM (5 10 4, 5 10 5) α 5 10 4 α 5 10 4 α 5 10 4 puniform 0.5 Nrep 512 Nensemble 20 Nrep 512 pmask 0.5 Ntrunk [512] Ndisag [64] Nrnd [64, 64] k 1 NF [64] αdisag 10 4 αrnd 10 4 sprior 8 NB [64] C 105 αapt 10 4 Kk 12 Table 9: Parameters and values of the algorithms. For all algorithms we used a batch size of 128. 
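The ensemble-related entries in tab. 9 (N_ensemble, p_mask, s_prior) refer to a bootstrapped ensemble with additive randomized priors, in the spirit of [53]. The sketch below shows one plausible reading of these parameters; it is not the exact DBMR-BPI architecture (see app. E), and the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class PriorNet(nn.Module):
    """One bootstrap ensemble member: a trainable network plus a frozen, randomly
    initialised prior network scaled by s_prior (cf. the s_prior entry in tab. 9)."""
    def __init__(self, state_dim: int, out_dim: int, hidden: int = 64, s_prior: float = 8.0):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))
        self.train_net, self.prior, self.s_prior = mlp(), mlp(), s_prior
        for p in self.prior.parameters():
            p.requires_grad_(False)            # the prior is never updated

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.train_net(x) + self.s_prior * self.prior(x)

# Bootstrap masks: each transition is assigned to an ensemble member with probability p_mask;
# the mask is then used to weight that member's loss on the transition.
p_mask, n_ensemble, batch = 0.5, 20, 32
masks = (torch.rand(batch, n_ensemble) < p_mask).float()
```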
D.3 Radio network control problems D.3.1 Downlink link adaptation problem In this section we describe how the DBMR-BPI can be applied to the Downlink link adaptation problem [37] in radio networks. Overview. Link adaptation, or adaptive modulation and coding (AMC), is a technique used in wireless communication systems, such as the 3GPP HSDPA, LTE or NG-RAN, to dynamically adapt the transmission rate of a communication link to the timeand frequency-varying channel conditions [37]. Modulation and Coding Schemes (MCS) effectively adapt the transmission rate by matching the modulation and coding parameters used for communication (such as modulation order, code rate, etc.) to the conditions of the radio link, such as propagation loss, channel strength, interference from other signals concurrently transmitted in the same radio resources, etc. Link adaptation is a dynamic process that acts potentially as frequently as each transmission time interval (e.g., on a sub-millisecond time-scale in the 3GPP NG-RAN system) wherein a communication is scheduled between a base station (a radio node that facilitates wireless communication for user equipment within its coverage area) and a user equipment (UE, a mobile device such as a smartphone or tablet). Link adaptation algorithms attempt to optimally adapt the transmission data rate chosen for a link to the current channel and interference conditions of the link. These algorithms select the most appropriate modulation order and coding rate (also known as modulation and coding scheme, MCS) based on the latest information available about the state of the communication system as well as the state of the individual communication link (e.g., Channel Quality Indicator, CQI) for which the MCS value is selected. RL Agent CQI (CSI, Rank) Figure 21: Downlink link adaptation diagram. Problem Objective. In wireless communication systems, there is a fundamental trade-off between throughput and spectral efficiency, which can be understood in terms of resource elements. Resource elements are the smallest units of time-frequency resources allocated for transmission, encompassing a specific duration of time and a specific frequency bandwidth. Our objective is to explore the environment using DBMR-BPI to design policies that optimize this trade-off based on the desired intent. Higher throughput, the amount of data transmitted successfully per unit of time, often requires higher-order modulation schemes and denser coding rates, which can reduce spectral efficiency the effective use of available spectral resources. Conversely, enhancing spectral efficiency through lower-order modulation schemes and robust coding rates can ensure more reliable transmissions but may limit throughput. Since measuring throughput directly is challenging due to various influencing factors, we consider Transport Block Size (TBS) as a proxy. TBS represents the amount of data that can be transmitted in a given time interval based on the selected Modulation and Coding Scheme (MCS). However, increasing TBS does not necessarily result in higher throughput, as a larger TBS can lead to a higher Block Error Rate (BLER) the percentage of transport blocks that fail to be decoded correctly impacting the reliability of data transmission and reducing the actual amount of successfully received data. Environment, RL formulation and setup. 
We consider a Time-Division Duplexing (TDD) 5G system operating at a 3.5GHz carrier frequency, with PHY layer numerology µ = 0 of the 3GPP technical specifications 38.211 and single-user Multi-Input Multi-Output (SU-MIMO) transmission. Each base station is configured as massive MIMO (m MIMO) with an 8x4x2 antenna array. We consider 1 site, 3 sectors, and a fixed number of 10 UEs with full buffer traffic, thereby experiencing stable interference conditions. We generated 150 randomized scenarios during training that randomize across several variables (e.g., UEs movement and traffic). Each scenario lasted 5 seconds. The downlink link adaptation problem can be formulated as a multiple-rewards RL problem as follows: The base station represents the RL agent that, depending on the channel measurements transmitted by the UEs, chooses the modulation and coding scheme (MCS). The state includes current CQI values, number of the antennas of the UE, estimated speed of the UE, rank indicators, and other relevant metrics that affect link quality, including a windowed history of these values. Actions involve selecting different MCS values, which dictate the modulation order and coding rate for the packet transmission. The transmission of a packet from a UE is modeled as an episodic MDP of at most 5 steps. At each time step t, the agent chooses an action at (the MCS). If the transmission is successful, the agent receives an ACK and the episode terminates. Otherwise, the agent receives a NACK and observes the new channel measurements (next state) until the episode is done. The environment was simulated using a proprietary simulator10, the code for which cannot be disclosed. We evaluated the DBMR-BPI algorithm along with RND, APT and Disagreement. Reward design. We consider two rewards, that optimize for different intents. For each episode we let zt be a boolean value indicating if the packet was successfully transmitted (ACK) or not (NACK). We also let TBSn indicate the transport block size for the n-th transmission, and Nre,n the number of resource elements for the n-th transmission. 1. Transport block size: r1,n = TBSn 1zn=1, this reward encourages the agent to choose actions that maximize the transport block size (at the expense of other metrics, i.e., error rate, spectral efficiency), measured as the number of bits of successfully transmitted packets. 2. Spectral efficiency: r2,n = 0 if the transition is not terminal; otherwise, r2,n = P k n Nre,k, representing the total number of resource elements used up to the final time step n of the episode. These rewards, if weighted together, can optimize for different intents, between throughput, efficiency and transmission rate. Numerical results: performance. In fig. 22 we report the performance of the various algorithms in terms of (1) throughput, (2) BLER (block error rate) and (3) spectral efficiency (which measures the number of resource elements used). 10The environment design and MDP formulation come, or are inspired, from a forthcoming publication [101]. 
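To make the two reward signals concrete, the sketch below computes them from a logged episode of transmissions. The episode structure and field names are a simplified stand-in for the proprietary simulator interface, and the sign of the spectral-efficiency reward (penalizing the total number of resource elements used) is our reading of the description above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Transmission:
    ack: bool    # True if the transport block was decoded (ACK), False for NACK
    tbs: int     # transport block size (bits) of this transmission attempt
    n_re: int    # number of resource elements used by this transmission attempt

def episode_rewards(episode: List[Transmission]) -> tuple:
    """Compute the two per-episode rewards described above.

    r1 (transport block size): TBS of a transmission only when it is ACKed,
                               i.e. the bits of the successfully transmitted packet.
    r2 (spectral efficiency) : zero on non-terminal steps; at the terminal step,
                               (minus) the total number of resource elements used,
                               so that using fewer resource elements is preferred.
    """
    last = episode[-1]
    r1 = float(last.tbs) if last.ack else 0.0
    r2 = -float(sum(tx.n_re for tx in episode))
    return r1, r2

# Hypothetical 2-attempt episode: first transmission NACKed, second ACKed.
ep = [Transmission(ack=False, tbs=4096, n_re=600),
      Transmission(ack=True, tbs=2048, n_re=600)]
print(episode_rewards(ep))   # (2048.0, -1200.0)
```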
Numerical results: performance. In fig. 22 we report the performance of the various algorithms in terms of (1) throughput, (2) BLER (block error rate) and (3) spectral efficiency (the number of bits transmitted per resource element).

Figure 22: From top to bottom: (1) throughput distribution achieved during training (averages: DBMR-BPI 14.70 Mbps, RND 11.64 Mbps, Disagreement 9.09 Mbps, APT 12.67 Mbps); (2) BLER distribution during training (averages: DBMR-BPI 0.40, RND 0.71, Disagreement 0.56, APT 0.51); (3) distribution of the number of bits transmitted per resource element (spectral efficiency) during training (averages: DBMR-BPI 6.9 bits/RE, RND 5.2 bits/RE, Disagreement 4.4 bits/RE, APT 5.6 bits/RE).

In fig. 23 we present the distribution of the number of transmissions attempted per packet and the action selection (MCS index) for each algorithm. We observe that DBMR-BPI achieves better performance in terms of both throughput and spectral efficiency. The MCS index distribution reveals that DBMR-BPI tends to use all the available actions during the first transmission (note that only the last 5 index values are used during re-transmissions), which suggests that it is effectively learning which action to take depending on the state features. On the other hand, RND and Disagreement tend to focus only on the initial MCS index values during the first transmission. We also note a lower BLER for DBMR-BPI, as evidenced by the fact that its average number of transmissions per packet is lower than that of the other methods. These gains appear to be primarily due to the optimized usage of resource elements, as shown in fig. 24, which illustrates the distribution of rewards collected during training for each episode. While the TBS reward is similar across all algorithms, DBMR-BPI clearly focuses on optimizing the number of resource elements. This type of analysis is valuable for practitioners to understand the properties and impacts of each reward function.

Figure 23: From top to bottom: (1) distribution of the number of re-transmissions per packet during training (averages: DBMR-BPI 1.6, RND 2.4, Disagreement 2.1, APT 1.8); (2) MCS index selected by the algorithms during training, where only the last 5 index values are used during re-transmissions (averages: DBMR-BPI 20.4, RND 23.5, Disagreement 19.1, APT 20.4).

Figure 24: Distribution of the normalized rewards collected during training by the algorithms.

Numerical results: rewards chosen by DBMR-BPI. In fig. 25 we plot the estimate $\Delta_{t,r}$ for both rewards (left plot) and the number of times DBMR-BPI selected each reward (right plot; this count does not account for forced tracking or uniform exploration). From the left plot we see that optimizing the number of resource elements appears to be the harder task, due to the lower value of $\Delta_{t,r}$, which explains why DBMR-BPI focuses more on optimizing the number of resource elements used per packet transmission.

Figure 25: Left plot: value of $\Delta_{t,r}$ estimated by DBMR-BPI during training for both rewards (RE and TBS). Right plot: distribution of the rewards chosen by DBMR-BPI during training.
Parameters, compute resources and libraries. For these simulations we used 4 private workstations, each with 24 vCPUs, 128 GiB of memory and one NVIDIA A100 GPU. Obtaining all the results requires 6 days of computation time, while the entire research project required roughly 24 days of computation time for this experiment. The parameters are listed in tab. 10. Refer to app. E for further details on the parameters and the algorithms.

- DBMR-BPI: Nensemble = 20, Ndbmrbpi = [256,]×6, τ = 10⁻³, (αQ, αM) = (5·10⁻⁵, 5·10⁻⁵), puniform = 0.5, pmask = 0.5, k = 1, sprior = 10, C = 7·10⁶.
- APT: Ndqn = [128,]×7, C = 7·10⁶, τ = 10⁻³, α = 5·10⁻⁵, Nrep = 512, Ntrunk = [256, 256], NF = [256, 128, 128, 128], NB = [256, 128, 128, 128], αapt = 5·10⁻⁵, Kk = 15.
- Disagreement: Ndqn = [128,]×7, C = 7·10⁶, τ = 10⁻³, α = 5·10⁻⁵, Nensemble = 20, Ndisag = [256, 256, 128, 128, 128], αdisag = 5·10⁻⁵.
- RND: Ndqn = [128,]×7, C = 7·10⁶, τ = 10⁻³, α = 5·10⁻⁴, Nrep = 512, Nrnd = [256, 256, 256, 128, 128, 128], αrnd = 5·10⁻⁵.

Table 10: Parameters and values of the algorithms. For all algorithms the batch size was 512. The notation [x,]×k indicates a stack of k layers of size x.

We set up our experiments using Python 3.10 [88] (http://www.python.org) and made use of the following libraries: NumPy [89], SciPy [90], PyTorch [98], Seaborn [92], Pandas [93], Matplotlib [94].

E Deep-RL Algorithms Details

E.1 DBMR-BPI algorithm

Overview. As mentioned in sec. 4, extending MR-NaS to Deep RL is not straightforward: firstly, MR-NaS is a model-based method; secondly, there is no closed-form solution to $\arg\inf_{\omega \in \Omega} U(\omega; M)$. To circumvent these issues, we adapt DBMF-BPI [53], a deep-RL exploration approach for single-reward BPI. In [53], the authors suggest that a model-free solution can be achieved by (i) using the generative solution $\arg\inf_{\omega \in \Delta(\mathcal{S} \times \mathcal{A})} U(\omega; M)$, which can be computed in closed form, and (ii) estimating the parametric uncertainty of the MDP-specific quantities (sub-optimality gaps, etc.). In other words, the generative solution may be close enough to the model-based one. Nevertheless, this solution may be biased, since the true sub-optimality gaps are unknown; we therefore need to account for the parametric uncertainty in their estimates. Hence, we propose DBMR-BPI, an extension of DBMF-BPI, which learns an ensemble of Q-functions of the optimal policy and, similarly, an ensemble of the k-th moment of that Q-function (the M-function). More precisely, the extension of DBMF-BPI to a finite set of rewards mainly consists of a few modifications, described in the following paragraphs.

1. Value learning for different rewards. The Q-function in DBMR-BPI (similarly, the M-function) learns the value for multiple rewards. In the finite-action setting this implies that there are $A \cdot R$ values to estimate in every state, one per action-reward pair. Alternatively, we also tried feeding the networks a one-hot encoding of the chosen reward, but this turned out to be less effective in terms of model capacity. A minimal sketch of such a multi-reward value head is given below.
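As an illustration of this first modification, the following PyTorch sketch shows a Q-network whose head outputs $A \cdot R$ values per state, reshaped into an $(R, A)$ table so that a value can be read off for every action-reward pair. The class name, layer sizes and input dimensions are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MultiRewardQNetwork(nn.Module):
    """Q-network estimating Q(s, a; r) for every action a and every reward r in the set R."""
    def __init__(self, state_dim: int, num_actions: int, num_rewards: int, hidden: int = 256):
        super().__init__()
        self.num_actions = num_actions
        self.num_rewards = num_rewards
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Single head with A * R outputs: one value per (reward, action) pair.
        self.head = nn.Linear(hidden, num_actions * num_rewards)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, state_dim) -> values of shape (batch, R, A)
        q = self.head(self.body(state))
        return q.view(-1, self.num_rewards, self.num_actions)

# Greedy action for the currently explored reward index r_idx:
net = MultiRewardQNetwork(state_dim=8, num_actions=4, num_rewards=2)
state = torch.randn(1, 8)
r_idx = 0
action = net(state)[0, r_idx].argmax().item()
```

An ensemble of such networks (and of the analogous M-networks) can then be used, as in DBMF-BPI, to quantify the parametric uncertainty of the estimates.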
2. Reward exploration. We explore according to the difficulty of the reward function. For the sake of stability, at the beginning of each episode (or, for example, every $1/(1-\gamma)$ steps) we sample a reward $r \in \mathcal{R}$ to explore and apply DBMF-BPI to this reward. To estimate the exploration difficulty, we simply use the estimate at time $t$ of the minimum sub-optimality gap $\Delta_{t,r}$ (see the paragraph below), and explore a reward $r$ with probability proportional to $1/\Delta_{t,r}^2$. In addition to the mechanism described above, we also introduce a random uniform exploration of the rewards and a forced tracking of the rewards (summarized in Alg. 3, where $X$ is a sample drawn uniformly at random in $[0, 1]$):

- Random exploration: with probability puniform we choose a reward uniformly at random.
- Forced tracking: this mechanism ensures that each reward is explored sufficiently often. We let $U_t = \{r \in \mathcal{R} : N_{r,t} < 2\sqrt{t}\}$ be the set of under-sampled rewards at time $t$, where $N_{r,t}$ is the number of times the algorithm has chosen reward $r$ in the previous $t$ steps, and choose $r = \arg\min_r N_{r,t}$ whenever $U_t$ is non-empty (otherwise, we sample it as in Alg. 2). This procedure guarantees that each reward is sampled at a rate of $O(\sqrt{t})$.

Algorithm 3 DBMR-BPI - Reward selection
1: if X ≤ puniform then
2:     Choose a reward uniformly at random.
3: else if $U_t$ is not empty then
4:     Choose reward $\arg\min_r N_{r,t}$.
5: else
6:     Choose reward $r$ with probability proportional to $1/\Delta_{t,r}^2$.
7: end if

3. Estimation of the sub-optimality gaps $\Delta_{t,r}$. To estimate the minimum sub-optimality gap for a fixed reward $r$, we initially used the method described in DBMF-BPI. However, its estimates are often overly conservative: in DBMF-BPI, the minimum gap for each batch across all models in the ensemble is calculated and added to a rolling window, and the smallest value in this window is then used as the target for stochastic approximation, employing a decreasing learning rate. We instead compute $\Delta_{t,r}$ as follows:

1. For each reward in the batch, we calculate the gaps and take the minimum across the models in the ensemble. We then average these minimum gaps over the transitions in the batch to obtain an estimate $\hat{\Delta}_{t,r}$. This approach is less conservative than taking the minimum across both the transitions and the models in the ensemble.
2. We treat $\Delta_{t,r}$ as a learnable parameter and use the estimated gap $\hat{\Delta}_{t,r}$ as the target value. We compute the Huber loss between $\Delta_{t,r}$ and $\hat{\Delta}_{t,r}$, and perform a gradient step to update $\Delta_{t,r}$.
3. To further stabilize the estimate of $\Delta_{t,r}$, we can apply Polyak averaging over time steps.

The main parameters for DBMR-BPI are listed in tab. 11:
- Nensemble: ensemble size.
- Ndbmrbpi: hidden layer sizes for each model in the ensemble.
- τ: rate at which the target network is updated.
- αQ, αM: Adam learning rates for the Q- and M-networks.
- puniform: uniform random reward probability.
- pmask: mask probability used by DBMF-BPI.
- k: k-value used by DBMF-BPI.
- sprior: prior scale value used by DBMF-BPI.
- C: maximum size of the replay buffer.

Table 11: DBMR-BPI parameters.

E.2 Other algorithms

For APT, Disagreement and RND we used the open-source implementation from [24]. However, the implementation in [24] only works for continuous action spaces; we therefore adapted it to discrete action spaces as described below. For all the algorithms we used the same architecture as in the original papers, and only slightly tuned the hyper-parameters.

DQN agent. Each algorithm uses a Double DQN agent to learn the Q-values using the intrinsic reward computed by the corresponding algorithm (a minimal sketch of this update is given after tab. 12). The main parameters for this DQN agent are listed in tab. 12:
- Ndqn: hidden layer sizes.
- C: maximum size of the replay buffer.
- τ: rate at which the target network is updated.
- α: Adam learning rate.

Table 12: DQN parameters.
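The sketch below illustrates, under our own assumptions, how a Double DQN target can be computed from the intrinsic reward produced by the exploration method (APT, Disagreement or RND). The function name, batch layout and hyper-parameter defaults are placeholders and are not taken from [24] or from the released code.

```python
import torch
import torch.nn.functional as F

def double_dqn_loss(q_net, target_net, batch, intrinsic_reward, gamma=0.99):
    """Double DQN loss on a batch of transitions, using an intrinsic reward signal.

    batch: dict with tensors 'state', 'action', 'next_state', 'done'.
    intrinsic_reward: tensor of shape (B,) produced by the exploration method
                      (e.g., RND prediction error, ensemble disagreement, APT entropy).
    """
    s, a = batch["state"], batch["action"]              # (B, d), (B,)
    s_next, done = batch["next_state"], batch["done"]   # (B, d), (B,)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a)
    with torch.no_grad():
        # Double DQN: the action is selected by the online network
        # and evaluated by the target network.
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)
        target = intrinsic_reward + gamma * (1.0 - done.float()) * q_next

    return F.smooth_l1_loss(q_sa, target)
```

The target network is then updated towards the online network at rate τ, and the replay buffer is capped at C transitions, matching the parameters in tab. 12.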
APT agent. This agent uses a forward network $F_\theta$ and a backward network $B_\theta$ to learn a representation that is used to compute the intrinsic reward. The forward network predicts a representation of the next state $\hat{s}_{t+1}$ given the current state $s_t$ and action $a_t$, while the backward network predicts the action $\hat{a}_t$ that was taken, given two consecutive state representations $s_t, s_{t+1}$. The representation is $N_{\mathrm{rep}}$-dimensional and is learnt by a trunk network $M_\theta$. The loss is computed by averaging the forward and backward errors over a batch $\mathcal{B}$. The forward error is computed as in [24],
$$e_{\mathrm{forward}}(\mathcal{B}) = \frac{1}{|\mathcal{B}|} \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{B}} \big\| F_\theta(M_\theta(s_t))_{a_t} - M_\theta(s_{t+1}) \big\|^2,$$
where $F_\theta(\cdot)_a$ is the $a$-th predicted representation. We adapt the backward error to discrete action spaces by using the cross-entropy loss
$$e_{\mathrm{backward}}(\mathcal{B}) = -\frac{1}{|\mathcal{B}|} \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{B}} \sum_{a} \mathbf{1}(a_t = a) \log\big(B_\theta(M_\theta(s_t), M_\theta(s_{t+1}))_a\big),$$
where $B_\theta(\cdot, \cdot)_a$ is the probability of action $a$ predicted by the network. The overall loss over a batch $\mathcal{B}$ is $e(\mathcal{B}) = e_{\mathrm{forward}}(\mathcal{B}) + e_{\mathrm{backward}}(\mathcal{B})$. For this agent, in addition to the parameters in tab. 12, we also have the following parameters (tab. 13):
- Nrep: representation dimension.
- Ntrunk: hidden layer sizes for the trunk network.
- NF: hidden layer sizes for the forward network.
- NB: hidden layer sizes for the backward network.
- αapt: learning rate for APT.
- Kk: number of neighbors used to compute the particle-based entropy.

Table 13: APT parameters.

Disagreement agent. This agent uses an ensemble of models to predict the next state. The variance across the models is then used to quantify the parametric uncertainty and compute the intrinsic reward. We kept the same implementation as in [24]. For this agent, in addition to the parameters in tab. 12, we also have the following parameters (tab. 14):
- Nensemble: ensemble size.
- Ndisag: hidden layer sizes for each model in the ensemble.
- αdisag: learning rate for Disagreement.

Table 14: Disagreement parameters.

RND agent. This agent uses a predictor network and a target network to estimate the parametric uncertainty, which is then used as the intrinsic reward (a minimal sketch is given after tab. 15). This is similar to the mechanism employed by prior networks in [102]. For a given state, both the predictor and the target network output a representation of dimension $N_{\mathrm{rep}}$; the predictor is then trained to minimize the $\ell_2$ distance between the predictor representation and the target representation. For this agent, in addition to the parameters in tab. 12, we also have the following parameters (tab. 15):
- Nrep: representation dimension.
- Nrnd: hidden layer sizes for the predictor/target networks.
- αrnd: learning rate for RND.

Table 15: RND parameters.
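The following is a minimal, simplified PyTorch sketch of the RND mechanism described above (it is not the implementation from [24]); the MLP sizes are placeholders.

```python
import torch
import torch.nn as nn

def make_mlp(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Sequential:
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class RND(nn.Module):
    """Random Network Distillation: a fixed random target network and a trained predictor."""
    def __init__(self, state_dim: int, rep_dim: int):
        super().__init__()
        self.target = make_mlp(state_dim, rep_dim)      # never trained
        self.predictor = make_mlp(state_dim, rep_dim)
        for p in self.target.parameters():
            p.requires_grad_(False)

    def prediction_error(self, state: torch.Tensor) -> torch.Tensor:
        """Squared l2 distance between predictor and target representations."""
        return (self.predictor(state) - self.target(state)).pow(2).sum(dim=-1)

    def loss(self, state: torch.Tensor) -> torch.Tensor:
        """Train the predictor to match the (fixed) target representation."""
        return self.prediction_error(state).mean()

    @torch.no_grad()
    def intrinsic_reward(self, state: torch.Tensor) -> torch.Tensor:
        """States that are poorly predicted (rarely visited) receive a larger bonus."""
        return self.prediction_error(state)
```

The resulting intrinsic reward is then fed to the Double DQN agent parameterized as in tab. 12.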
F NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We reported the contributions of the paper in the abstract and (mainly) in the introduction.
Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Limitations are discussed in app. A.
Guidelines:
- The answer NA means that the paper has no limitation, while the answer No means that the paper has limitations, but those are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: All theorems/results are numbered and cross-referenced, and we make clear the assumptions used in the paper. For each result we indicate where to find the proof in the appendix.
Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: Our contributions, the MR-NaS and DBMR-BPI algorithms, are fully described in app. C and app. E.1. Numerical experiments, including the details and parameters used, are fully described in app. D.
We further release the code of the algorithms, and of the environments, except for the simulator used in app. D.3.1, which is proprietary and not disclosable (we provided all the details needed to replicate the experiment).
Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example: (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: Yes, in the code we provide the instructions in the README file to run the simulations, collect data, and plot the results.
Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results.
- See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments is reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyper-parameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We provide all the details of the experiments, including hyper-parameters, optimizers used, etc., in the appendix. The code is also provided.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: In all results we report the average value and confidence intervals. For some results we also show the data distribution directly.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed-form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: For each experiment we reported in the appendix the resources used and the total computation time.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: Yes, the authors followed the NeurIPS Code of Ethics.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: It is discussed in app. A.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: We believe that the paper poses no such risks.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: For all the assets used we cited the original authors. In the README of the code we cite the license of the assets.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: For all the code published we provide documentation in the code or README file. Licensing is mentioned in the README file.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.