# Sleeping Reinforcement Learning

Simone Drago\*¹, Marco Mussi\*¹, Alberto Maria Metelli¹

\*Equal contribution. ¹Politecnico di Milano, Milan, Italy. Correspondence to: Simone Drago.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

## Abstract

In the standard Reinforcement Learning (RL) paradigm, the action space is assumed to be fixed and immutable throughout the learning process. However, in many real-world scenarios, not all actions are available at every decision stage. The available action set may depend on the current environment state, domain-specific constraints, or other (potentially stochastic) factors outside the agent's control. To address these realistic scenarios, we introduce a novel paradigm called Sleeping Reinforcement Learning, where the available action set varies during the interaction with the environment. We start with the simpler scenario in which the available action sets are revealed at the beginning of each episode. We show that a modification of UCBVI achieves regret of order $\tilde{O}(H\sqrt{SAT})$, where $H$ is the horizon, $S$ and $A$ are the cardinalities of the state and action spaces, respectively, and $T$ is the learning horizon. Next, we address the more challenging and realistic scenario in which the available actions are disclosed only at each decision stage. By leveraging a novel construction, we establish a minimax lower bound of order $\Omega(\sqrt{T\,2^{A/2}})$ when the availability of actions is governed by a Markovian process, establishing a statistical barrier for the problem. Focusing on the statistically tractable case where action availability depends only on the current state and stage, we propose a new optimistic algorithm that achieves regret guarantees of order $\tilde{O}(H\sqrt{SAT})$, showing that the problem shares the same complexity as standard RL.

## 1. Introduction

In recent years, Reinforcement Learning (RL, Sutton & Barto, 2018) has demonstrated remarkable success in solving sequential decision-making problems, mainly in simulated environments. Nowadays, we witness an increasing demand to transition its capabilities from simulation to real-world applications. However, to move RL to practical domains, we still face notable challenges that need to be addressed from both the algorithmic and theoretical perspectives. One of these challenges concerns action availability. Indeed, in the standard RL framework, the set of actions that can be played in a given state is assumed to be known and immutable throughout the interaction. However, in several real-world sequential decision-making problems, it may not be possible to play some of the actions under some circumstances.

**Motivation.** Consider the scenario depicted in Figure 1a, in which we want to control a physical system (e.g., a robot) characterized by a given action space $\mathcal{A}$ made of all the actions the agent can perform. At every stage $h$, the RL agent observes the state of the environment $s_h$ and decides the action $a_h \in \mathcal{A}$ to play. Then, such an action is usually validated by a low-level controller (LLC), which checks whether the action is feasible or not (depending, for instance, on some physical constraints or safety reasons). On the one hand, if feasible, the action is executed as is. On the other hand, if the action is not feasible, the low-level controller overrides it with another action, $\tilde{a}_h \in \mathcal{A}$, such as one suggested by a baseline policy or the closest feasible action (according to some domain-specific metric). From the agent's perspective, the LLC can be considered as a part of the environment. However, when the agent is unaware that certain actions are infeasible (and may only realize it after the LLC intervenes), the performance of the learned policy can be significantly harmed (see Example 1).
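The LLC interaction described above can be made concrete with a minimal sketch. All class and function names below are hypothetical illustrations, not from the paper:

```python
class LLCWrappedEnv:
    """Illustrative sketch: the low-level controller (LLC) validates each
    proposed action and silently overrides infeasible ones, so that, from
    the agent's perspective, 'environment + LLC' behave as a single system."""

    def __init__(self, env, feasible_fn, fallback_fn):
        self.env = env                  # underlying environment exposing step(state, action)
        self.feasible_fn = feasible_fn  # (state, action) -> bool: is the action feasible here?
        self.fallback_fn = fallback_fn  # (state, action) -> replacement feasible action

    def step(self, state, action):
        # If the proposed action is infeasible, the LLC replaces it (e.g.,
        # with a baseline action or the closest feasible one).
        if not self.feasible_fn(state, action):
            action = self.fallback_fn(state, action)
        return self.env.step(state, action)


class ToyEnv:
    """Minimal deterministic environment used only for illustration."""
    def step(self, state, action):
        return state + action, float(action)  # (next state, reward)
```

For instance, with `feasible_fn=lambda s, a: a <= s` and `fallback_fn=lambda s, a: s`, an infeasible action is silently replaced, and the agent only discovers the override through the returned transition.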
If the agent is instead aware of the available actions, we ideally want it to select an action among those deemed feasible by the LLC. This scenario can be addressed by adopting a solution in which a low-level filter makes the RL agent aware of which actions $\mathcal{A}_h \subseteq \mathcal{A}$ are available before actually making a decision (Figure 1b).

[Figure 1. Example of a possible interaction protocol for (a) Classic Reinforcement Learning and (b) Sleeping Reinforcement Learning.]

This problem of learning in scenarios with varying action availability is widely studied and discussed for Multi-Armed Bandits (MABs, Lattimore & Szepesvári, 2020) under the name of Sleeping MABs. This research line comprises stochastic and adversarial choices for both rewards and action availability (Kleinberg et al., 2008; Kanade et al., 2009; Saha et al., 2020; Nguyen & Mehta, 2024) and several notions of regret (Gaillard et al., 2023). In RL, instead, this problem remains unaddressed despite its practical relevance.¹ This work aims to fill such a gap in the literature.

**Original Contributions.** In this paper, we propose the novel paradigm of Sleeping Reinforcement Learning (SleRL) and study it from a theoretical perspective. The contributions of this work are summarized as follows:

- In Section 2, we formally introduce the framework of the episodic finite-horizon Sleeping Markov Decision Processes (SleMDPs) setting, a generalization of MDPs in which the available action set changes throughout the interaction. Then, we introduce the two types of action set disclosure, namely per-episode and per-stage disclosure. Finally, we present two stochastic models governing the action availability, namely Markovian and independent.
- In Section 3, we consider the simpler case of per-episode disclosure, where the agent is informed of the available action sets at the beginning of each episode. We introduce the definitions of value function, optimality, and regret. Furthermore, we present Action-Restricted UCBVI (AR-UCBVI), an algorithm based on UCBVI (Azar et al., 2017), and analyze its regret, showing that it matches the lower bound of standard RL up to logarithmic factors for sufficiently large $T$ (Theorem 3.2).

- In Section 4, we address the more realistic and challenging setting of per-stage disclosure, where the available action set is revealed for the current stage only, immediately before the agent selects an action. We address the general scenario in which the action availabilities are governed by a Markovian process and illustrate how the problem can be framed as solving an augmented MDP in which the available action set is incorporated in the state. Based on this transformation, we define the value function, optimality, and regret. Then, through a novel construction (Figure 3), we demonstrate a statistical barrier of this setting, showing that an exponential dependence on the number of actions $A$ is unavoidable in the regret, proving a lower bound of order $\Omega(H\sqrt{SAT\,2^{A/2}})$ (Theorem 4.2).

- In Section 5, we turn to the tractable case of independent per-stage disclosure, where the action availabilities are sampled independently at every stage. We propose a novel optimistic algorithm, Sleeping UCBVI (S-UCBVI), that extends the classical UCBVI with the estimate of the action availability probabilities and appropriately defined new bonuses. We show that S-UCBVI enjoys a regret bound of order $\tilde{O}(H\sqrt{SAT})$ for sufficiently large $T$ (Theorem 5.2), matching the lower bound up to logarithmic factors.

[Footnote 1] The scenario in which not all the actions are available for every state in a deterministic manner is discussed under the name of action masking (Huang & Ontañón, 2022), see Appendix B.
Related works are discussed in Appendix B. Omitted proofs are provided in Appendices C and D. A numerical validation of the work is provided in Appendix E.

## 2. The Sleeping MDPs Setting

In this section, we first introduce the notation and the standard episodic finite-horizon MDP setting, and then we present the novel Sleeping MDPs (SleMDPs) framework.

**Notation.** Given $a, b \in \mathbb{N}$ with $a < b$, we define $[\![a, b]\!] := \{a, a+1, \ldots, b\}$ and $[\![a]\!] := [\![1, a]\!]$. Given a finite set $X$, we denote by $\Delta(X)$ the probability simplex over $X$, by $|X|$ its cardinality, and by $\mathcal{P}(X)$ its power set. Let $q \in \Delta(X)$; we denote its support as $\mathrm{supp}(q) = \{x \in X : q(x) > 0\}$.

**Markov Decision Processes.** We define a finite-horizon undiscounted MDP as a tuple $\mathcal{M} := (\mathcal{S}, \mathcal{A}, P, R, H, \bar{s})$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, respectively, $H$ is the horizon of the episode, $P : \mathcal{S} \times \mathcal{A} \times [\![H]\!] \to \Delta(\mathcal{S})$ is the stage-dependent transition probability distribution, $R : \mathcal{S} \times \mathcal{A} \times [\![H]\!] \to [0, 1]$ is the deterministic reward function, assumed to be known,² and $\bar{s} \in \mathcal{S}$ is the initial state. We assume the state space and the action space are finite sets, and we denote their cardinalities as $|\mathcal{S}| =: S < \infty$ and $|\mathcal{A}| =: A < \infty$. The agent's behavior is modeled with a Markovian policy $\pi : \mathcal{S} \times [\![H]\!] \to \Delta(\mathcal{A})$. The agent interacts with the environment for $K$ episodes of length $H$, and we denote with $T = KH$ the total number of decisions.

[Footnote 2] This is a mild assumption that can be removed with no additional complexity, as learning the transition probability $P$ is more challenging than learning the reward function $R$.

**Sleeping Markov Decision Processes.** A SleMDP is defined as a tuple $\mathcal{M} := (\mathcal{S}, \mathcal{A}, \mathcal{C}, P, R, H, \bar{s})$, where $(\mathcal{S}, \mathcal{A}, P, R, H, \bar{s})$ is a standard MDP as defined above, and $\mathcal{C}$ is the action availability model, which will be characterized later. Formally, for every episode $k \in [\![K]\!]$, at every stage $h \in [\![H]\!]$, for the current state $s_{k,h} \in \mathcal{S}$, the available action set $\mathcal{A}_{k,h}(s) \subseteq \mathcal{A}$ is an element of the power set $\mathcal{P}(\mathcal{A})$, selected according to the model $\mathcal{C}$. We assume that at least one action can always be played (i.e., $\mathcal{A}_{k,h}(s_{k,h}) \neq \emptyset$, $\forall k \in [\![K]\!], h \in [\![H]\!], s_{k,h} \in \mathcal{S}$).

**Action Disclosure.** The available actions can be revealed to the agent either at the beginning of the current episode for every stage and state, i.e., per-episode disclosure, or when the agent is asked to choose an action in the current stage and state, i.e., per-stage disclosure. The two interaction protocols are presented in Algorithms 1 and 2, respectively.

```
Algorithm 1: Interaction Protocol, Per-episode.
1  for k ∈ [K] do
2      Agent observes A_{k,h}(s), ∀h ∈ [H], s ∈ S
3      for h ∈ [H] do
4          Agent observes state s_{k,h}
5          Agent plays a_{k,h} ∈ A_{k,h}(s_{k,h})
6          Environment returns r_{k,h} and s_{k,h+1}
```

```
Algorithm 2: Interaction Protocol, Per-stage.
1  for k ∈ [K] do
2      for h ∈ [H] do
3          Agent observes state s_{k,h}
4          Agent observes A_{k,h}
5          Agent plays a_{k,h} ∈ A_{k,h}
6          Environment returns r_{k,h} and s_{k,h+1}
```

**Example 1.** To illustrate the effect of the per-episode (PE) and per-stage (PS) disclosures of action availability and the effect of a low-level controller (LLC), we consider the MDP in Figure 2a, whose goal is to go from the initial state A to the absorbing state C. We have two paths to do so. First, we can play action Fs (i.e., forward safe) in state A, which deterministically leads to C with a reward of $-2$. Second, we can play action F (i.e., forward) in state A and deterministically reach B without costs ($r = 0$). Then, in state B, for all subsequent stages, action F is available with probability $p \in [0, 1]$. If F is not available, we can voluntarily stay in B and wait (i.e., play action Sv), obtaining a reward of $-1$. In state B, there is also another action Sf (i.e., stay forced), which makes the agent remain in state B and receive a reward of $-2$. Action Sf is employed by the LLC to override forbidden actions (i.e., the attempt to play F when it is not available).
We now compute and plot, as a function of $p$ (Figure 2b), the optimal value functions for PE and PS disclosure and for the LLC (see Appendix A for a complete discussion):

[Figure 2. Illustrative example: (a) the illustrative Sleeping MDP; (b) the optimal value functions $V^*_{\mathrm{PE}}(A)$, $V^*_{\mathrm{PS}}(A)$, and $V^*_{\mathrm{LLC}}(A)$ as a function of $p$.]

- **PE disclosure.** Since we know whether action F is available in state B at the beginning of the episode, we play F in state A if action F is available in state B in stages $h \in \{2, 3\}$, getting either reward $0$ or $-1$ (playing Sv at stage $h = 2$); otherwise, we play Fs, getting $-2$ as reward.

- **PS disclosure.** Here, we do not know if action F will be available in state B. If we play F in state A, we may wait several stages in B playing action Sv until F becomes available, getting $-(1-p)/p$ as expected total reward. Instead, if we play Fs in state A, we get $-2$ as reward.

- **LLC.** Here, when we are in state B, we do not observe whether action F is available. Since the LLC overrides F with Sf when F is not available, if we decide to go through state B, we will get $-2(1-p)/p$ as expected total reward; otherwise, by playing Fs in state A, we get $-2$ as reward.

As expected, the value functions are sorted as $V^*_{\mathrm{PE}}(A) \ge V^*_{\mathrm{PS}}(A) \ge V^*_{\mathrm{LLC}}(A)$, where $V^*_{\mathrm{PE}}(A)$, $V^*_{\mathrm{PS}}(A)$, and $V^*_{\mathrm{LLC}}(A)$ represent the optimal state value function in state A in the PE, PS, and LLC cases, respectively (see also Section 6).

**Action Availability Models.** We admit the available action sets to be chosen in a stochastic way as follows:

- **Independent** (referred to as ind) action availabilities: we define $\mathcal{C} = \mathcal{C}^{\mathrm{ind}} : \mathcal{S} \times [\![H]\!] \to \Delta(\mathcal{P}(\mathcal{A}))$. For every action subset $B \subseteq \mathcal{A}$, state $s \in \mathcal{S}$, and stage $h \in [\![H]\!]$, $\mathcal{C}^{\mathrm{ind}}_h(B|s) = \Pr(\mathcal{A}_{k,h} = B \mid s_{k,h} = s)$ represents the probability that $B$ is the available action set in state $s$ at stage $h$. Notably, the availability of an action does not depend on whether it was available in the past.

- **Markovian** (referred to as Markov) action availabilities: we define $\mathcal{C} = \mathcal{C}^{\mathrm{Markov}} : \mathcal{S}^2 \times \mathcal{A} \times [\![H]\!] \times \mathcal{P}(\mathcal{A}) \to \Delta(\mathcal{P}(\mathcal{A}))$.
For every pair of action subsets $B, B' \subseteq \mathcal{A}$, states $s, s' \in \mathcal{S}$, action $a \in \mathcal{A}$, and stage $h \in [\![H]\!]$, $\mathcal{C}^{\mathrm{Markov}}_h(B'|s', s, a, B) = \Pr(\mathcal{A}_{k,h} = B' \mid s_{k,h} = s', s_{k,h-1} = s, a_{k,h-1} = a, \mathcal{A}_{k,h-1} = B)$ represents the probability that $B'$ is the available action set observed in state $s'$ at stage $h$, conditioned on the fact that $B$ was the available action set, $s$ the state, and $a$ the played action at stage $h-1$, respectively.³

## 3. Per-episode Disclosure

In this section, we face the simpler case in which the available actions are revealed at the beginning of the episode for every state and stage. We start by discussing the notions of policy, value function, optimality, and regret. Then, we present an algorithm matching the regret of standard RL. We highlight that per-episode disclosure can be seen as a generalization of action masking (see Appendix B).

**Policies, Value Functions, and Optimality.** The action availabilities $\mathcal{A}_{k,h}(s)$ are revealed at the beginning of each episode $k \in [\![K]\!]$, for every state $s \in \mathcal{S}$ and stage $h \in [\![H]\!]$. Thus, we restrict the policy space to the policies that play available actions only, i.e., $\Pi_k := \{\pi : \mathcal{S} \times [\![H]\!] \to \Delta(\mathcal{A}) \text{ s.t. } \forall (s, h) \in \mathcal{S} \times [\![H]\!] : \mathrm{supp}(\pi_h(\cdot|s)) \subseteq \mathcal{A}_{k,h}(s)\}$. For episode $k$ and policy $\pi \in \Pi_k$, we denote with $V^\pi_h(s)$ the value function in state $s$ at stage $h$ following policy $\pi$ and with $Q^\pi_h(s, a)$ the state-action value function, restricted to the available actions $a \in \mathcal{A}_{k,h}(s)$. For episode $k$, we denote as optimal policy any policy fulfilling $\pi^*_k \in \arg\max_{\pi \in \Pi_k} V^\pi_1(\bar{s})$ and the optimal value functions as $V^*_{k,h}(s) = V^{\pi^*_k}_h(s)$ and $Q^*_{k,h}(s, a) = Q^{\pi^*_k}_h(s, a)$. An optimal policy $\pi^*_k$ can be retrieved as the greedy policy w.r.t. $Q^*_{k,h}$, restricting to the available actions, using a variation of value iteration, namely Action-Restricted Value Iteration (AR-VI, Algorithm 3).⁴

**Regret.** We evaluate the performance of an algorithm $\mathfrak{A}$ in terms of cumulative regret over the $K \in \mathbb{N}$ episodes against the optimal policy $\pi^*_k$ of each episode.
**Definition 3.1 (Per-episode Disclosure Regret).** Let $\mathfrak{A}$ be an algorithm playing a sequence of policies $(\pi_k)_{k=1}^K \in \Pi_1 \times \cdots \times \Pi_K$; we define the per-episode (PE) disclosure regret as:

$$R_{\mathrm{PE}}(\mathfrak{A}, T) := \sum_{k \in [\![K]\!]} \left( V^*_{k,1}(\bar{s}) - V^{\pi_k}_1(\bar{s}) \right). \quad (1)$$

Since we know which actions are available at every stage and for every state in advance, we can afford to compete against the best policy $\pi^*_k$ in every episode. Indeed, regardless of whether the action availabilities $\mathcal{A}_{k,h}(s)$ are chosen in a stochastic or adversarial way, we are able to tackle the problem as if we were in a different MDP (defined in terms of the available actions) in each episode $k$.

[Footnote 3] With a little abuse, we are using the same symbol even for stage $h = 1$, where $s_{h-1}$, $a_{h-1}$, and $\mathcal{A}_{k,h-1}$ are not defined. In such a case, we consider $\mathcal{C}^{\mathrm{Markov}}_1(B'|s') = \Pr(\mathcal{A}_{k,1} = B' \mid s_{k,1} = s')$ as the initial action-availability distribution.

[Footnote 4] This is totally equivalent to solving an MDP with action sets that depend on the state and stage.

```
Algorithm 3: Action-Restricted Value Iteration (AR-VI) for episode k.
Input: Sleeping MDP M = (S, A, C, P, R, H, s̄), available actions A_{k,h}(s), ∀h ∈ [H], s ∈ S
1  V*_{k,H+1}(s) ← 0, ∀s ∈ S
2  for h ∈ {H, H−1, ..., 1} do
3      Q*_{k,h}(s,a) ← R_h(s,a) + E_{s'∼P_h(·|s,a)}[V*_{k,h+1}(s')], ∀a ∈ A_{k,h}(s), s ∈ S
4      V*_{k,h}(s) ← max_{a∈A_{k,h}(s)} Q*_{k,h}(s,a), ∀s ∈ S
5  return π*_{k,h}(s) ∈ argmax_{a∈A_{k,h}(s)} Q*_{k,h}(s,a), ∀s ∈ S, h ∈ [H]
```

**Lower Bound.** We now present a minimax lower bound for SleMDPs with per-episode disclosure.

**Theorem 3.1 (Lower Bound Per-episode Disclosure).** For any algorithm $\mathfrak{A}$, there exists an instance of Sleeping MDP such that, for $T \ge \Omega(H^2SA)$, the per-episode disclosure regret satisfies:

$$\mathbb{E}[R_{\mathrm{PE}}(\mathfrak{A}, T)] \ge \Omega(H\sqrt{SAT}).$$

*Proof.* The proof directly follows from that of (Domingues et al., 2021, Theorem 9).⁵ We have to lower bound the regret on the worst instance of SleMDPs with per-episode disclosure.
Since SleMDPs with per-episode disclosure are a generalization of MDPs (an MDP is a SleMDP where $\mathcal{A}_{k,h}(s) = \mathcal{A}$, $\forall s \in \mathcal{S}, h \in [\![H]\!], k \in [\![K]\!]$), the regret lower bound for MDPs holds for SleMDPs too. ∎

[Footnote 5] This result differs from the one of (Domingues et al., 2021) due to the different notation adopted.

**Algorithm.** To learn in the per-episode scenario, we modify UCBVI (Azar et al., 2017) to handle the action availabilities. We design Action-Restricted UCBVI (AR-UCBVI, Algorithm 4), the optimistic counterpart of AR-VI (Algorithm 3). From a high-level perspective, besides the fact that the maximization in Bellman's equation is computed over the available actions $\mathcal{A}_{k,h}(s)$ only, AR-UCBVI has to carefully handle the optimism to guarantee a monotonicity property of the sequence of estimated state-action value functions. The algorithm starts by initializing the visitation counters (line 1). Then, for every episode $k \in [\![K]\!]$, the algorithm observes the action availabilities $\mathcal{A}_{k,h}(s)$ (line 3) and estimates the transition model $\hat{P}_{k,h}(s'|s,a)$ (line 4):

$$\hat{P}_{k,h}(s'|s,a) = \frac{N_{k,h}(s,a,s')}{N_{k,h}(s,a)}, \quad (2)$$

where $N_{k,h}(s,a)$ is the number of times action $a$ was played in state $s$ at stage $h$, and $N_{k,h}(s,a,s')$ is the number of times the next state was $s'$. Then, AR-UCBVI runs optimistic value iteration to obtain optimistic estimates of the optimal action-value function $Q^*_{k,h}(s,a)$. As mentioned above, this step requires more attention w.r.t. that of (Azar et al., 2017).

```
Algorithm 4: Action-Restricted UCBVI (AR-UCBVI).
1   Initialize: N_{0,h}(s,a,s') ← 0, N_{0,h}(s,a) ← 0, N_{0,h}(s) ← 0, ∀(s,a,s',h) ∈ S×A×S×[H]
2   for k ∈ [K] do
3       Agent observes A_{k,h}(s), ∀h ∈ [H], s ∈ S
4       Estimate P̂_{k,h}(s'|s,a) as in Eq. (2)
5       // Compute V̂^k_{j,·}(·), Q̂^k_{j,·}(·,·) for episode k
6       Q̂^k_{0,h}(s,a) ← H − h + 1, ∀(s,a,h) ∈ S×A×[H]
7       for j ∈ [k] do
8           Initialize V̂^k_{j,H+1}(s) ← 0, ∀s ∈ S
9           for h ∈ {H, H−1, ..., 1} do
10              for s ∈ S do
11                  Compute Q̂^k_{j,h}(s,a), ∀a ∈ A_{k,h}(s), as in Eq. (3)
12                  Compute V̂^k_{j,h}(s) ← max_{a∈A_{k,h}(s)} Q̂^k_{j,h}(s,a)
16      // Play optimistically for episode k
17      Agent observes state s_{k,1}
18      for h ∈ [H] do
19          Agent plays a_{k,h} ∈ argmax_{a∈A_{k,h}(s_{k,h})} Q̂^k_{k,h}(s_{k,h}, a)
20          Environment returns r_{k,h} and s_{k,h+1}
21      Increment counters
```

Indeed, the original UCBVI, to ensure a monotonically non-increasing sequence of the estimates $\hat{Q}_{k,h}(s,a)$, limits the current estimate $\hat{Q}_{k,h}(s,a)$ (computed with all samples up to episode $k-1$) by the previous episode's estimate $\hat{Q}_{k-1,h}(s,a)$ (computed with all samples up to episode $k-2$). This operation makes no sense in a SleMDP, since the action availabilities $\mathcal{A}_{k,h}(s)$ and $\mathcal{A}_{k-1,h}(s)$ may change between consecutive episodes, making $\hat{Q}_{k-1,h}(s,a)$ no longer an optimistic estimate of the true $Q^*_{k,h}(s,a)$. For this reason, we have to compute the sequence of optimistic value functions $\hat{Q}^k_{j,h}(s,a)$ for the action availabilities $\mathcal{A}_{k,h}(s)$ of episode $k$ using the estimates $\hat{P}_{j,h}$ of all episodes $j \in [\![k]\!]$. This way, we make use of all the samples collected so far (even in episodes $j$ with action availabilities different from the current ones $\mathcal{A}_{k,h}(s)$). Furthermore, this ensures that the sequence of estimates $\hat{Q}^k_{j,h}(s,a)$ is monotonically non-increasing in $j$. As shown in Algorithm 4 (lines 6-15), AR-UCBVI starts from $h = H$ and goes backward, computing the optimistic $\hat{Q}^k_{j,h}(s,a)$ for every $a \in \mathcal{A}_{k,h}(s)$:

$$\hat{Q}^k_{j,h}(s,a) = \min\Big\{\hat{Q}^k_{j-1,h}(s,a),\; R_h(s,a) + \sum_{s' \in \mathcal{S}} \hat{P}_{j,h}(s'|s,a)\,\hat{V}^k_{j,h+1}(s') + b^{Q,k}_{j,h}(s,a)\Big\}, \quad (3)$$

where $b^{Q,k}_{j,h}(s,a)$ is the exploration bonus obtained from a refined analysis of UCBVI (Drago et al., 2025) and is defined as:

$$b^{Q,k}_{j,h}(s,a) := \sqrt{\frac{4L\,\hat{\mathbb{V}}^k_{j,h}(s,a)}{N_{j,h}(s,a)}} + \frac{7HL}{3N_{j,h}(s,a)} + \sqrt{\frac{8\,\mathbb{E}_{s' \sim \hat{P}_{j,h}(\cdot|s,a)}[b^{Q,k}_{j,h+1}(s')]}{N_{j,h}(s,a)}},$$

where $\hat{\mathbb{V}}^k_{j,h}(s,a) = \mathrm{Var}_{s' \sim \hat{P}_{j,h}(\cdot|s,a)}[\hat{V}^k_{j,h+1}(s')]$ is the empirical variance of the next-state value estimate, $b^{Q,k}_{j,h+1}(s') = \min\{84^2 H^3 S^2 A L / N_{j,h+1}(s'),\, H^2\}$ is the additional bonus, and $L = \ln(5HSAT/\delta)$.
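For concreteness, the exploration bonus can be evaluated as below. This is a hedged sketch: the constants follow our reading of the formula in the text, and the function and argument names are ours.

```python
import math

def exploration_bonus(var_next_value, n_sa, avg_next_bonus, H, L):
    """Bernstein-style bonus b^{Q,k}_{j,h}(s,a): a variance term, a
    lower-order term, and a term propagating the next-stage bonus.

    var_next_value : empirical variance of the next-state value estimate
    n_sa           : visit count N_{j,h}(s, a)
    avg_next_bonus : E_{s' ~ P_hat_{j,h}(.|s,a)}[b^{Q,k}_{j,h+1}(s')]
    """
    return (math.sqrt(4.0 * L * var_next_value / n_sa)
            + 7.0 * H * L / (3.0 * n_sa)
            + math.sqrt(8.0 * avg_next_bonus / n_sa))
```

As expected for an optimistic bonus, its value shrinks as the visit count $N_{j,h}(s,a)$ grows.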
Then, we compute the value estimate as $\hat{V}^k_{j,h}(s) = \max_{a \in \mathcal{A}_{k,h}(s)} \hat{Q}^k_{j,h}(s,a)$. Finally, the algorithm plays an action greedily w.r.t. the optimistic estimate $\hat{Q}^k_{k,h}(s,a)$ (lines 17-22). The following result provides the regret of AR-UCBVI, showing that learning in a SleMDP with per-episode disclosure does not increase the regret w.r.t. standard RL.

**Theorem 3.2 (Upper Bound Per-episode Disclosure).** For any $\delta \in (0, 1)$, with probability $1 - \delta$, the per-episode disclosure regret of AR-UCBVI is bounded by:

$$R_{\mathrm{PE}}(\text{AR-UCBVI}, T) \le 34HL\sqrt{SAT} + 2500 H^4 S^2 A L^2,$$

where $L = \ln(5HSAT/\delta)$. For $T \ge \Omega(H^6 S^3 A)$, selecting $\delta = 1/T$, we have:

$$\mathbb{E}[R_{\mathrm{PE}}(\text{AR-UCBVI}, T)] \le \tilde{O}(H\sqrt{SAT}),$$

where the expectation is taken w.r.t. the stochasticity of the environment.

*Proof.* The proof of this theorem follows the one of (Azar et al., 2017, Theorem 2). The key challenge is the computation of the bonus $b^{Q,k}_{j,h}$. Indeed, Lemma 17 of (Azar et al., 2017) heavily relies on the fact that the sequence $\hat{V}_{k,h}(s) - V^*_h(s)$ is monotonically non-increasing in $k$. This is not the case for our sequence $\hat{V}^k_{k,h}(s) - V^*_{k,h}(s)$ since, as already explained, the action availabilities $\mathcal{A}_{k,h}(s)$ change across episodes. For this reason, we resort to the sequence $\hat{V}^k_{j,h}(s) - V^*_{k,h}(s)$ for fixed $k$, which is monotonically non-increasing in $j$. This allows us to apply Lemma 16 of (Azar et al., 2017) pretending to have played in the MDP with the action availabilities of episode $k$, i.e., $\mathcal{A}_{k,h}(s)$, for all episodes $j \in [\![k]\!]$ and, ultimately, getting the bonus as in Lemma 17. Notice that we apply the pigeonhole principle considering $\mathcal{A}_{k,h}(s) = \mathcal{A}$, which represents the worst case. ∎

## 4. Per-stage Disclosure: Markovian Case

We now analyze the realistic scenario where the set of available actions is revealed for the current stage only, with no knowledge of future availabilities. In this section, we focus on Markovian availabilities, i.e., $\mathcal{A}_{k,h} \sim \mathcal{C}^{\mathrm{Markov}}_h(\cdot \mid s_{k,h}, s_{k,h-1}, a_{k,h-1}, \mathcal{A}_{k,h-1})$.
**Augmented MDP, Policies, Value Functions, and Optimality.** To address this scenario, we can map the SleMDP to an augmented MDP in which we encode the action availability sets in the state representation.

**Definition 4.1 (Augmented MDP).** Let $\mathcal{M} := (\mathcal{S}, \mathcal{A}, \mathcal{C}, P, R, H, \bar{s})$ be a SleMDP; we define the augmented MDP $\tilde{\mathcal{M}} := (\tilde{\mathcal{S}}, \mathcal{A}, \tilde{P}, \tilde{R}, H, \tilde{P}_0)$, with:

- augmented state space $\tilde{\mathcal{S}} := \mathcal{S} \times \mathcal{P}(\mathcal{A})$;
- augmented transition probability, defined for every $\tilde{s} = (s, B)$, $\tilde{s}' = (s', B') \in \tilde{\mathcal{S}}$, $a \in \mathcal{A}$, and $h \in [\![H]\!]$ as: $\tilde{P}_h(\tilde{s}'|\tilde{s}, a) = P_h(s'|s, a)\, \mathcal{C}^{\mathrm{Markov}}_{h+1}(B'|s', s, a, B)$;
- reward function, defined for every $\tilde{s} = (s, B) \in \tilde{\mathcal{S}}$ and $a \in \mathcal{A}$ as: $\tilde{R}(\tilde{s}, a) = R(s, a)$;
- initial augmented-state distribution, defined for every $\tilde{s} = (s, B) \in \tilde{\mathcal{S}}$ as $\tilde{P}_0(\tilde{s}) = \mathcal{C}^{\mathrm{Markov}}_1(B|s)\,\mathbb{1}\{s = \bar{s}\}$.

Note that the augmented MDP has a state space with cardinality $|\tilde{\mathcal{S}}| = S2^A$, exponential in $A$. The policy space in the augmented MDP is defined as $\tilde{\Pi} := \{\tilde{\pi} : \tilde{\mathcal{S}} \times [\![H]\!] \to \Delta(\mathcal{A}) \text{ s.t. } \forall \tilde{s} = (s, B) \in \tilde{\mathcal{S}} : \mathrm{supp}(\tilde{\pi}(\cdot|\tilde{s})) \subseteq B\}$, to ensure that only available actions are played.⁶ For a policy $\tilde{\pi}$, augmented state $\tilde{s} = (s, B)$, available action $a \in B$, and stage $h \in [\![H]\!]$, we denote with $\tilde{V}^{\tilde{\pi}}_h(\tilde{s}) = \tilde{V}^{\tilde{\pi}}_h((s, B))$ and $\tilde{Q}^{\tilde{\pi}}_h(\tilde{s}, a) = \tilde{Q}^{\tilde{\pi}}_h((s, B), a)$ the value and state-action value functions, respectively, in the augmented MDP, the latter restricted to the available actions $a \in B$. An optimal policy for the augmented MDP is any policy such that $\tilde{\pi}^* \in \arg\max_{\tilde{\pi} \in \tilde{\Pi}} \tilde{V}^{\tilde{\pi}}_1$, and the optimal value functions are $\tilde{V}^*_h(\tilde{s}) = \tilde{V}^{\tilde{\pi}^*}_h(\tilde{s})$ and $\tilde{Q}^*_h(\tilde{s}, a) = \tilde{Q}^{\tilde{\pi}^*}_h(\tilde{s}, a)$. An optimal policy $\tilde{\pi}^*$ can be obtained as greedy w.r.t. $\tilde{Q}^*$.⁷ The optimal value functions can be computed using value iteration on the augmented MDP (see Algorithm 5).

Considering the protocol of Algorithm 2, we define the value function for the original SleMDP $V^{\tilde{\pi}}_1(\bar{s})$ in the initial state $\bar{s} \in \mathcal{S}$ as the expectation of that of the augmented MDP $\tilde{V}^{\tilde{\pi}}_1((\bar{s}, B))$ over the available action sets $B \in \mathcal{P}(\mathcal{A})$:

$$V^{\tilde{\pi}}_1(\bar{s}) := \mathbb{E}_{B \sim \mathcal{C}^{\mathrm{Markov}}_1(\cdot|\bar{s})}[\tilde{V}^{\tilde{\pi}}_1((\bar{s}, B))]. \quad (4)$$

Similarly, for the optimal value function, we have $V^*_1(\bar{s}) := \mathbb{E}_{B \sim \mathcal{C}^{\mathrm{Markov}}_1(\cdot|\bar{s})}[\tilde{V}^*_1((\bar{s}, B))]$.

**Regret.** We evaluate the performance of an algorithm in terms of cumulative regret over the $K \in \mathbb{N}$ episodes against the optimal policy $\tilde{\pi}^*$, constant throughout the episodes.

**Definition 4.2 (Per-stage Disclosure Regret).** Let $\mathfrak{A}$ be an algorithm playing a sequence of policies $(\tilde{\pi}_k)_{k=1}^K \in \tilde{\Pi}^K$; we define the per-stage (PS) disclosure policy regret as:

$$R_{\mathrm{PS}}(\mathfrak{A}, T) := K V^*_1(\bar{s}) - \sum_{k \in [\![K]\!]} V^{\tilde{\pi}_k}_1(\bar{s}).$$

Differently from the per-episode case (Definition 3.1), where the comparator $\pi^*_k$ changes over episodes, in the per-stage case (Definition 4.2) we consider a constant comparator $\tilde{\pi}^*$.

[Footnote 6] Note the fundamental difference w.r.t. the per-episode case, where the policy space was different for every episode $k$; here, in the per-stage case, the policy space is the same for all episodes, but the policy is conditioned on the actual available action set $B$.

[Footnote 7] As usual, we are in an MDP and, therefore, there exists a policy $\tilde{\pi}^*$ optimal from every state $\tilde{s}$.

**Algorithm.** Given the mapping to the augmented MDP, we can resort to the original UCBVI (Azar et al., 2017) to learn in the SleMDP with the following regret guarantees.

**Theorem 4.1 (Upper Bound Per-stage Disclosure: Markovian).** For any $\delta \in (0, 1)$, with probability $1 - \delta$, the per-stage disclosure regret of UCBVI with Bernstein-Freedman bonuses (Azar et al., 2017, bonus 2) on the augmented MDP is bounded by:

$$R_{\mathrm{PS}}(\text{UCBVI}, T) \le 34H\tilde{L}\sqrt{SAT\,2^A} + 2500 H^4 S^2 2^{2A} A \tilde{L}^2,$$

where $\tilde{L} = \log(5H^2 S 2^A A T/\delta)$. In particular, for $T \ge \Omega(H^6 S^3 A^3 2^{3A})$ and selecting $\delta = 2^{-A}/T$, we have:

$$\mathbb{E}[R_{\mathrm{PS}}(\text{UCBVI}, T)] \le \tilde{O}(H\sqrt{SAT\,2^A}).$$

*Proof.* The proof is an application of (Azar et al., 2017, Theorem 2), where we consider the state space cardinality $S2^A$ of the augmented MDP and the stage-dependent transition probabilities (equivalent to increasing the state space cardinality by a multiplicative factor of $H$). ∎
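The augmented transition kernel of Definition 4.1 can be built explicitly for small instances. The sketch below uses dict-based tabular kernels; the data layout and names are ours, and we enumerate only non-empty availability sets, since at least one action can always be played.

```python
from itertools import combinations

def powerset(actions):
    """All non-empty subsets of the action set."""
    return [frozenset(c) for r in range(1, len(actions) + 1)
            for c in combinations(actions, r)]

def augmented_transition(P, C, states, actions, h):
    """Build P_tilde[((s,B), a)][(s',B')] = P_h(s'|s,a) * C(B'|s',s,a,B).

    P[(s, a, h)]          : dict s' -> transition probability
    C[(s', s, a, B, h)]   : dict B' -> probability of the next availability set
    """
    subsets = powerset(actions)
    P_tilde = {}
    for s in states:
        for B in subsets:
            for a in B:  # only available actions can be played
                dist = {}
                for s1 in states:
                    p_s1 = P[(s, a, h)].get(s1, 0.0)
                    if p_s1 == 0.0:
                        continue
                    for B1, p_B1 in C[(s1, s, a, B, h)].items():
                        dist[(s1, B1)] = dist.get((s1, B1), 0.0) + p_s1 * p_B1
                P_tilde[((s, B), a)] = dist
    return P_tilde
```

Each row of the resulting kernel is a distribution over augmented states $(s', B')$, which makes the $S2^A$ blow-up of the augmented state space tangible even on toy instances.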
The regret guarantees of Theorem 4.1 are clearly unsatisfactory due to the presence of the exponential dependence on the cardinality of the action space $A$. In the following, we show that such an exponential dependence is unavoidable when the action availability follows a Markovian process.

**Lower Bound.** The following theorem presents the lower bound on the Markovian per-stage disclosure regret.

**Theorem 4.2 (Lower Bound Per-stage Disclosure: Markovian).** For any algorithm $\mathfrak{A}$, there exists an instance of Sleeping MDP with per-stage disclosure and Markovian action availability such that, for $T \ge \Omega(HSA2^A)$ and $H \ge \Omega(A)$, the per-stage disclosure policy regret satisfies:

$$\mathbb{E}[R_{\mathrm{PS}}(\mathfrak{A}, T)] \ge \Omega(H\sqrt{SAT\,2^{A/2}}).$$

*Proof Sketch.* The instances are made of two parts (see Figure 3). First, every instance has an $A$-ary tree MDP as in (Domingues et al., 2021), where all actions are available. Then, we attach a partial sleeping lattice to every leaf, in which the state $s_i$ does not change and, whenever an action $a_h$ is played, it gets removed from the next set of available actions, i.e., $\mathcal{A}_{h+1}(s_i) = \mathcal{A}_h(s_i) \setminus \{a_h\}$. This happens $A/2$ times, until we reach the largest layer of the lattice, i.e., the one with $\binom{A}{A/2} \ge 2^{A/2}$ nodes. At this point, whatever action is played, the next action availability set will be a singleton containing either $a^+$ or $a^-$ with equal probability; $a^+$ (resp. $a^-$) leads to an absorbing state with reward 1 (resp. 0).

[Figure 3. Instances used in the proof of Theorem 4.2: an $A$-ary tree (Domingues et al., 2021) followed by our sleeping lattice. Available actions are reported next to the nodes; the final availability probabilities are $1/2 + \varepsilon$ and $1/2 - \varepsilon$.]
Hard instances are constructed by electing a leaf state $s^\star$, an action $a^\star$, a stage $h^\star$, and an availability set $B^\star$ (with $|B^\star| = A/2$) for which the probability of getting the singleton $\{a^\star\}$ in the next stage is increased by $\varepsilon > 0$. This lower bound, which relies on a novel construction (i.e., the sleeping lattice, Figure 3), shows that an exponential dependence on $A$ is unavoidable when there is temporal correlation among the sets of available actions. Given this statistical barrier, we next discuss a tractable scenario in which the action availability is sampled independently at each stage.

5. Per-Stage Disclosure: Independent Case

We now discuss the scenario in which, at every stage, the available action set $\mathcal{A}_{k,h}$ is sampled independently from the past, i.e., $\mathcal{A}_{k,h} \sim C^{\mathrm{ind}}_h(\cdot \mid s_{k,h})$. The lack of a temporal structure allows removing the dependence of the value functions on the available action sets, differently from the augmented MDP of Section 4. Indeed, looking at the Bellman equation in the augmented MDP (line 3 of Algorithm 5), we realize that $C^{\mathrm{ind}}_h(\cdot \mid s)$ does not depend on $B$, removing such a dependence from the state-action value function $\widetilde{Q}^\star_h((s,B),a)$, which, from now on, we denote as $Q^\star_h(s,a)$. Moreover, similarly to what has been done in Equation (4), it is convenient to introduce the value function $V^\star_h(s) := \mathbb{E}_{B \sim C^{\mathrm{ind}}_h(\cdot \mid s)}[\widetilde{V}^\star_h(s,B)]$. Given these quantities, to solve a SleMDP with independent availabilities, we resort to a simpler value iteration approach (Algorithm 6).

Algorithm 5: Sleeping VI for $C = C^{\mathrm{Markov}}$.
Input: Sleeping MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, C^{\mathrm{Markov}}, P, R, H)$
1: $\widetilde{V}^\star_{H+1}((s,B)) \leftarrow 0$, $\forall s \in \mathcal{S}, B \in \mathcal{P}(\mathcal{A})$
2: for $h \in \{H, H-1, \ldots, 1\}$ do
3:   $\widetilde{Q}^\star_h((s,B),a) \leftarrow R_h(s,a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}\big[\mathbb{E}_{B' \sim C^{\mathrm{Markov}}_h(\cdot \mid s',a,B)}\big[\widetilde{V}^\star_{h+1}((s',B'))\big]\big]$, $\forall s \in \mathcal{S}, B \in \mathcal{P}(\mathcal{A}), a \in B$
4:   $\widetilde{V}^\star_h((s,B)) \leftarrow \max_{a \in B} \widetilde{Q}^\star_h((s,B),a)$, $\forall s \in \mathcal{S}, B \in \mathcal{P}(\mathcal{A})$
5: end for
6: return $\widetilde{\pi}^\star_h((s,B)) \in \arg\max_{a \in B} \widetilde{Q}^\star_h((s,B),a)$, $\forall s \in \mathcal{S}, B \in \mathcal{P}(\mathcal{A}), h \in [H]$

Algorithm 6: Sleeping VI for $C = C^{\mathrm{ind}}$.
Input: Sleeping MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, C^{\mathrm{ind}}, P, R, H)$
1: $V^\star_{H+1}(s) \leftarrow 0$, $\forall s \in \mathcal{S}$
2: for $h \in \{H, H-1, \ldots, 1\}$ do
3:   $Q^\star_h(s,a) \leftarrow R_h(s,a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}[V^\star_{h+1}(s')]$, $\forall a \in \mathcal{A}, s \in \mathcal{S}$
4:   $V^\star_h(s) \leftarrow \mathbb{E}_{B \sim C^{\mathrm{ind}}_h(\cdot \mid s)}\big[\max_{a \in B} Q^\star_h(s,a)\big]$, $\forall s \in \mathcal{S}$
5: end for
6: return $\widetilde{\pi}^\star_h(s,B) \in \arg\max_{a \in B} Q^\star_h(s,a)$, $\forall s \in \mathcal{S}, B \in \mathcal{P}(\mathcal{A}), h \in [H]$

Lower Bound. The following theorem provides a regret lower bound, obtained by reducing the considered scenario to an MDP in which all actions are always available.

Theorem 5.1 (Lower Bound Per-Stage Disclosure: Independent). For any algorithm $\mathfrak{A}$, there exists an instance of SleMDP with independent action availability and per-stage disclosure such that, for $T \ge \Omega(H^2 S A)$, the per-stage disclosure regret satisfies:
$$\mathbb{E}[R_{\mathrm{PS}}(\mathfrak{A}, T)] \ge \Omega\big(H\sqrt{SAT}\big).$$

Proof Sketch. The formal proof is provided in Appendix C. Similarly to Theorem 3.1, the proof directly follows from that of (Domingues et al., 2021, Theorem 9). We have to lower bound the regret on the worst instance of SleMDPs with independent per-stage disclosure action availability. Since SleMDPs with independent per-stage action availability are a generalization of MDPs (an MDP is a SleMDP where $C^{\mathrm{ind}}_h(\mathcal{A} \mid s) = 1$, $\forall s \in \mathcal{S}, h \in [H]$), the regret lower bound for MDPs holds for SleMDPs too.

Algorithm. To efficiently learn in this setting, we propose an algorithm that extends UCBVI (Azar et al., 2017) with the estimation of the action availabilities $C^{\mathrm{ind}}_h(\cdot \mid s)$.

Algorithm 7: Sleeping UCBVI (S-UCBVI).
1: Initialize: $N_{0,h}(s,a,s') \leftarrow 0$, $N_{0,h}(s,a) \leftarrow 0$, $N_{0,h}(s) \leftarrow 0$, $N_{0,h}(s,B) \leftarrow 0$, $\forall (s,a,s',h,B) \in \mathcal{S} \times \mathcal{A} \times \mathcal{S} \times [H] \times \mathcal{P}(\mathcal{A})$
4: $\widehat{Q}_{0,h}(s,a) \leftarrow H - h + 1$, $\forall (s,a,h) \in \mathcal{S} \times \mathcal{A} \times [H]$
5: for $k \in [K]$ do
6:   // Compute $\widehat{V}_k$ and $\widehat{Q}_k$ for episode $k$
7:   Estimate $\widehat{P}_{k,h}(s' \mid s,a)$ as in Eq. (2)
8:   Estimate $\widehat{C}^{\mathrm{ind}}_{k,h}(B \mid s)$ as in Eq. (5)
9:   Initialize $\widehat{V}_{k,H+1}(s) \leftarrow 0$, $\forall s \in \mathcal{S}$
10:  for $h \in \{H, H-1, \ldots, 1\}$ do
11:    for $s \in \mathcal{S}$ do
12:      Compute $\widehat{Q}_{k,h}(s,a)$, $\forall a \in \mathcal{A}$, as in Eq. (6)
13:      Compute $\widehat{V}_{k,h}(s)$ as in Eq. (7)
14:    end for
15:  end for
16:  // Play optimistically for episode $k$
17:  Agent observes state $s_{k,1}$
18:  for $h \in [H]$ do
19:    Agent observes action set $\mathcal{A}_{k,h}(s_{k,h})$
20:    Agent plays $a_{k,h} \in \arg\max_{a \in \mathcal{A}_{k,h}(s_{k,h})} \widehat{Q}_{k,h}(s_{k,h}, a)$
21:    Environment returns $r_{k,h}$ and $s_{k,h+1}$
22:    Increment counters
23:  end for

Sleeping UCBVI (S-UCBVI, Algorithm 7) is an optimistic algorithm whose key innovation is the use of two bonuses: one for the state-action value functions $\widehat{Q}_{k,h}(s,a)$ (as in UCBVI, to handle the uncertainty on $\widehat{P}$) and one for the state value functions $\widehat{V}_{k,h}(s)$ (to handle the uncertainty on $\widehat{C}$). We estimate the transition model as in Equation (2) (line 7), and we keep an estimate of the action availability as follows (line 8):
$$\widehat{C}^{\mathrm{ind}}_{k,h}(B \mid s) = \frac{N_{k,h}(s,B)}{N_{k,h}(s)}, \qquad (5)$$
where $N_{k,h}(s)$ is the number of times state $s$ is visited at stage $h$ and $N_{k,h}(s,B)$ is the number of times we observe action availability $B$. Then, we perform an optimistic value iteration (lines 9-15) to obtain the optimistic estimates $\widehat{V}_{k,h}(s)$ and $\widehat{Q}_{k,h}(s,a)$ by means of two additive optimistic bonuses. Going backward from $h = H$, we compute $\widehat{Q}_{k,h}(s,a)$ as:
$$\widehat{Q}_{k,h}(s,a) = \min\Big\{\widehat{Q}_{k-1,h}(s,a),\ R_h(s,a) + b^Q_{k,h}(s,a) + \sum_{s' \in \mathcal{S}} \widehat{P}_{k,h}(s' \mid s,a)\,\widehat{V}_{k,h+1}(s')\Big\}, \qquad (6)$$
where $b^Q_{k,h}(s,a)$ is the exploration bonus accounting for the uncertainty on the transition model estimate $\widehat{P}$:
$$b^Q_{k,h}(s,a) := \sqrt{\frac{4 L\, \widehat{\mathbb{V}}_{k,h}(s,a)}{N_{k,h}(s,a)}} + \frac{7 H L}{3 (N_{k,h}(s,a) + 1)} + \sqrt{\frac{4\, \mathbb{E}_{s' \sim \widehat{P}_{k,h}(\cdot \mid s,a)}\big[\bar{b}^Q_{k,h+1}(s')\big]}{N_{k,h}(s,a)}},$$
where $\widehat{\mathbb{V}}_{k,h}(s,a) = \mathrm{Var}_{s' \sim \widehat{P}_{k,h}(\cdot \mid s,a)}[\widehat{V}_{k,h+1}(s')]$ is the empirical variance of the next-state estimated value function, $\bar{b}^Q_{k,h+1}(s') = \min\{2900^2 H^3 S^3 A 2^A L^3 / N_{k,h+1}(s'),\ H^2\}$ is the additional bonus term, and $L = \log(80 H S^2 A 2^A T / \delta)$.
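To make the count-based estimator of Equation (5) concrete, here is a minimal Python sketch (the class, its method names, and the keying scheme are our own illustrative assumptions, not part of the paper): it maintains the counters $N_{k,h}(s)$ and $N_{k,h}(s,B)$ and returns their ratio, treating availability sets as frozensets so they can be dictionary keys.

```python
from collections import defaultdict

class AvailabilityEstimator:
    """Empirical estimate of C^ind_{k,h}(B|s) = N_{k,h}(s, B) / N_{k,h}(s)."""

    def __init__(self):
        self.n_s = defaultdict(int)    # N_{k,h}(s), keyed by (h, s)
        self.n_sb = defaultdict(int)   # N_{k,h}(s, B), keyed by (h, s, B)

    def update(self, h, s, available_set):
        # Called once per visit to (h, s), with the observed available set.
        self.n_s[(h, s)] += 1
        self.n_sb[(h, s, frozenset(available_set))] += 1

    def prob(self, h, s, candidate_set):
        # Estimated probability of observing `candidate_set` in s at stage h.
        n = self.n_s[(h, s)]
        return self.n_sb[(h, s, frozenset(candidate_set))] / n if n else 0.0
```

Note that the support of this estimator is the power set $\mathcal{P}(\mathcal{A})$, which is consistent with the requirement on the minimum $T$ in Theorem 5.2: a number of samples exponential in $A$ may be needed before every availability set with non-negligible probability has been observed.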
Then, we compute the optimistic value function $\widehat{V}_{k,h}(s)$ as:
$$\widehat{V}_{k,h}(s) = \sum_{B \in \mathcal{P}(\mathcal{A})} \widehat{C}^{\mathrm{ind}}_{k,h}(B \mid s)\, \max_{a \in B} \widehat{Q}_{k,h}(s,a) + b^V_{k,h}(s), \qquad (7)$$
where $b^V_{k,h}$ is a bonus accounting for the uncertainty on the action set availability, defined as:
$$b^V_{k,h}(s) := \sqrt{\frac{4 L\, \widehat{\mathbb{Q}}_{k,h}(s)}{N_{k,h}(s)}} + \frac{7 H L}{3 (N_{k,h}(s) + 1)} + \sqrt{\frac{4\, \mathbb{E}_{B \sim \widehat{C}^{\mathrm{ind}}_{k,h}(\cdot \mid s)}\big[\bar{b}^V_{k,h}(s, \pi_{k,h}(s, B))\big]}{N_{k,h}(s)}},$$
where $\widehat{\mathbb{Q}}_{k,h}(s) = \mathrm{Var}_{B \sim \widehat{C}^{\mathrm{ind}}_{k,h}(\cdot \mid s)}[\widehat{Q}_{k,h}(s, \pi_{k,h}(s, B))]$ and $\bar{b}^V_{k,h}(s,a) = \min\{1350^2 H^3 S^3 A 2^A L^3 / N_{k,h}(s,a),\ H^2\}$ is an additional bonus. Finally, the algorithm plays an action greedily w.r.t. $\widehat{Q}_{k,h}(s,a)$ (lines 17-23).

We now provide the regret upper bound of S-UCBVI for per-stage disclosure and independent availabilities.

Theorem 5.2 (Regret Upper Bound of S-UCBVI with Independent Availability and Per-Stage Disclosure). For any $\delta \in (0,1)$, with probability at least $1-\delta$, the per-stage disclosure regret of S-UCBVI on any SleMDP with per-stage disclosure and independent action availabilities is bounded by:
$$R_{\mathrm{PS}}(\text{S-UCBVI}, T) \le 512\, H \sqrt{SAT}\, L + 498^2 H^6 S^3 A 2^A L^2 G,$$
where $L = \log(80 H S^2 A 2^A T / \delta)$ and $G = \log(HSAT)$. In particular, for $T \ge \Omega(H^{10} S^5 A^4 2^{2A})$ and selecting $\delta = 2^A / T$, we have:
$$\mathbb{E}[R_{\mathrm{PS}}(\text{S-UCBVI}, T)] \le \widetilde{O}\big(H\sqrt{SAT}\big).$$

Proof Sketch. The proof is provided in Appendix D and extends (Azar et al., 2017, Theorem 2). The key difference is the use of two bonus terms to ensure the optimism of both $\widehat{V}_{k,h}$ and $\widehat{Q}_{k,h}$, due to the estimation of the action availabilities $\widehat{C}^{\mathrm{ind}}_{k,h}$ and of the transition model $\widehat{P}_{k,h}$. Given $\widetilde{V}_{k,h} = \widehat{V}_{k,h} - V^{\pi_k}_h$ and $\widetilde{Q}_{k,h} = \widehat{Q}_{k,h} - Q^{\pi_k}_h$, we notice a recursive dependence, which we unfold since $\widetilde{V}$ can be derived from $\widetilde{Q}$ and $\widetilde{Q}$ is obtained by upper-bounding $\widetilde{V}$ as:
$$\widetilde{V}_{k,h} \le b^V_{k,h} + \varepsilon^Q_{k,h} + \frac{4 H^2 2^A L}{N_{k,h}} + \widetilde{Q}_{k,h},$$
where $\varepsilon^Q_{k,h}$ is a martingale difference sequence that leads to a lower-order term. As such, we upper bound the two quantities as done in Lemma 3 of (Azar et al., 2017), at the cost of a multiplicative $e$ constant, avoiding any exponential dependence on $A$ in the higher-order terms. Then, extending the rationale of (Azar et al., 2017), via backward induction, we show that the empirical variance is small enough that the additional bonuses $\bar{b}^V_{k,h}$ and $\bar{b}^Q_{k,h}$ guarantee that $\widehat{V}_{k,h}$ and $\widehat{Q}_{k,h}$ are indeed optimistic. This is done by deriving an upper bound on $\widehat{Q}_{k,h}(s,a) - Q^\star_h(s,a)$ of order:
$$\widehat{Q}_{k,h}(s,a) - Q^\star_h(s,a) \le \min\big\{O\big(H^3 S^3 A 2^A L^3 / N_{k,h}(s,a)\big),\ H\big\}.$$
Then, we use the latter result to derive an upper bound on $\widehat{V}_{k,h}(s) - V^\star_h(s)$ of order:
$$\widehat{V}_{k,h}(s) - V^\star_h(s) \le \min\big\{O\big(H^3 S^3 A 2^A L^3 / N_{k,h}(s)\big),\ H\big\}.$$
Subsequently, we use these inequalities to show that $\widehat{Q}_{k,h} \ge Q^\star_h$ and $\widehat{V}_{k,h} \ge V^\star_h$ hold, thus demonstrating optimism. Finally, we derive the regret bound by combining the terms in the upper bound of $\widetilde{V}_{k,h}$, observing that, when applying the pigeonhole principle, we consider all the actions in $\mathcal{A}$ as available, as this provides the worst-case allocation.

This result shows that, for large enough $T$, the regret suffered by S-UCBVI is of the same order as that of UCBVI-BF (Azar et al., 2017, Theorem 2), matching the lower bound for standard RL up to logarithmic terms. Thus, the need for estimating the action availability $C^{\mathrm{ind}}$, which is a distribution over $\mathcal{P}(\mathcal{A})$, results only in a minimum value of $T$ that scales exponentially with $A$. Nevertheless, no exponential dependence on $A$ is present in the leading term.

6. Discussion and Conclusions

We summarize the results presented in this paper in Table 1.

Table 1. Summary of the results.

Setting                 | Lower Bound                                                            | Upper Bound
Per-episode disclosure  | $\Omega(H\sqrt{SAT})$ for $T \ge \Omega(H^2 S A)$ (Thm 3.1)            | $\widetilde{O}(H\sqrt{SAT})$ for $T \ge \Omega(H^6 S^3 A)$ (Thm 3.2)
Per-stage, independent  | $\Omega(H\sqrt{SAT})$ for $T \ge \Omega(H^2 S A)$ (Thm 5.1)            | $\widetilde{O}(H\sqrt{SAT})$ for $T \ge \Omega(H^{10} S^5 A^4 2^{2A})$ (Thm 5.2)
Per-stage, Markovian    | $\Omega(H\sqrt{SAT\,2^{A/2}})$ for $T \ge \Omega(H S A 2^A)$ (Thm 4.2) | see Thm 4.1, for $T \ge \Omega(H^6 S^3 A^3 2^{3A})$

We motivated the introduction of approaches aware of the action availability, as opposed to an LLC, since they allow learning better-performing behaviors. We illustrated this in Example 1. Now, we formally prove that this is the case.

Per-episode ≥ Per-stage.
We restrict to stochastic availabilities sampled independently at every stage, i.e., via $C^{\mathrm{ind}}$. Indeed, in this case, we can imagine the availabilities $\mathcal{A}_{k,h}(s)$ to be pre-sampled for every $k \in [K]$ and $(s,h) \in \mathcal{S} \times [H]$. In the per-episode disclosure, $\mathcal{A}_{k,h}(s)$ is revealed at the beginning of episode $k$, while in the per-stage disclosure $\mathcal{A}_{k,h}(s)$ is revealed only when, at episode $k$, we reach state $s$ at stage $h$. This observation allows concluding that the expected optimal performance under per-episode disclosure is superior to that under per-stage disclosure:
$$\underbrace{\mathbb{E}\Big[\max_{\pi_k \in \Pi_k} V^{\pi_k}_1(s)\Big]}_{\text{Per-episode}} \;\ge\; \underbrace{\max_{\pi \in \Pi} V^{\pi}_1(s) = V^\star_1(s)}_{\text{Per-stage}},$$
where the expectation is taken w.r.t. the randomness of $\mathcal{A}_{k,h}(s)$, $\Pi_k$ and $\Pi$ are defined in Sections 3 and 4, respectively, and the inequality follows from Jensen's inequality.

Per-stage ≥ LLC. The LLC can be regarded as a (possibly stochastic) function that, given an unavailable action $a_{k,h} \notin \mathcal{A}_{k,h}$ sampled from policy $\pi_h(\cdot \mid s_{k,h})$, overrides it with an available action $a'_{k,h} \in \mathcal{A}_{k,h}$ sampled according to some strategy $\rho^{\mathrm{LLC}}_h(\cdot \mid s_{k,h}, a_{k,h}, \mathcal{A}_{k,h})$. Thus, the overall effect of the LLC is equivalent to playing a policy $\widetilde{\pi}^{\mathrm{LLC}}_h(\cdot \mid s_{k,h}, \mathcal{A}_{k,h}) = \sum_{a \in \mathcal{A}} \rho^{\mathrm{LLC}}_h(\cdot \mid s_{k,h}, a, \mathcal{A}_{k,h})\, \pi_h(a \mid s_{k,h})$ that belongs to the policy space $\Pi$ over which we optimize in the per-stage disclosure case. Thus, the performance of the LLC cannot exceed that of the optimal policy of the per-stage case (i.e., $V^{\widetilde{\pi}^{\mathrm{LLC}}}_1(s) \le \max_{\pi \in \Pi} V^\pi_1(s) = V^\star_1(s)$).

Future Works. Interesting future research directions include investigating action availability structures that are more general than the independent case while preserving statistical tractability. Moreover, it would be of interest to devise instance-dependent quantities that characterize the complexity of an instance in the regret bounds, based on the characteristics of the action availability.
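As a mechanical check of the LLC argument above, the mixture identity defining $\widetilde{\pi}^{\mathrm{LLC}}$ can be computed directly. The following sketch (function names and the uniform override rule are our own illustrative assumptions, not the paper's) builds the induced policy for a fixed state and verifies it is a probability distribution supported on the available set, i.e., a valid member of $\Pi$.

```python
def induced_llc_policy(pi, rho, B):
    """tilde_pi(a'|B) = sum_a rho(a'|a, B) * pi(a), for a fixed state.

    pi  : dict a -> prob, the agent's proposal over the full action set
    rho : override rule, rho(a, B) -> dict a' -> prob over available actions
    B   : set of currently available actions
    """
    tilde = {a2: 0.0 for a2 in B}
    for a, p in pi.items():
        # Available proposals pass through; unavailable ones are overridden.
        override = {a: 1.0} if a in B else rho(a, B)
        for a2, q in override.items():
            tilde[a2] += p * q
    return tilde

def uniform_override(a, B):
    # A hypothetical LLC strategy: replace an unavailable action uniformly.
    return {a2: 1.0 / len(B) for a2 in B}
```

For instance, with a proposal $(0.5, 0.2, 0.3)$ over three actions and $B = \{0, 1\}$, the mass of the unavailable action 2 is redistributed over $B$, and the result sums to one, illustrating why the LLC never leaves the per-stage policy class.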
Acknowledgments

Funded by the European Union, Next Generation EU, within the project NRRP M4C2, Investment 1.3, DD. 341, 15 March 2022, FAIR, Future Artificial Intelligence Research, Spoke 4, PE00000013, D53C22002380006.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 89–96, 2008.

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research, pp. 263–272. PMLR, 2017.

Bartlett, P. L. and Tewari, A. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pp. 35–42. AUAI Press, 2009.

Blitzstein, J. K. and Hwang, J. Introduction to Probability, Second Edition. Chapman and Hall/CRC, 2019.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Chatterjee, A., Ghalme, G., Jain, S., Vaish, R., and Narahari, Y. Analysis of Thompson sampling for stochastic sleeping bandits. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI). AUAI Press, 2017.

Cortes, C., DeSalvo, G., Gentile, C., Mohri, M., and Yang, S. Online learning with sleeping experts and feedback graphs. In Proceedings of the International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pp. 1370–1378. PMLR, 2019.

Domingues, O. D., Ménard, P., Kaufmann, E., and Valko, M. Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited. In Algorithmic Learning Theory (ALT), volume 132 of Proceedings of Machine Learning Research, pp. 578–598. PMLR, 2021.

Drago, S., Mussi, M., and Metelli, A. M. A refined analysis of UCBVI. arXiv preprint arXiv:2502.17370, 2025.

Gaillard, P., Saha, A., and Dan, S. One arrow, two kills: A unified framework for achieving optimal regret guarantees in sleeping bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 206 of Proceedings of Machine Learning Research, pp. 7755–7773. PMLR, 2023.

Huang, S. and Ontañón, S. A closer look at invalid action masking in policy gradient algorithms. In Proceedings of the International Florida Artificial Intelligence Research Society Conference (FLAIRS), 2022.

Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

Kanade, V. and Steinke, T. Learning hurdles for sleeping experts. ACM Transactions on Computation Theory, 6(3):11:1–11:16, 2014.

Kanade, V., McMahan, H. B., and Bryan, B. Sleeping experts and bandits with stochastic action availability and adversarial rewards. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 5 of JMLR Proceedings, pp. 272–279. JMLR.org, 2009.

Kleinberg, R. D., Niculescu-Mizil, A., and Sharma, Y. Regret bounds for sleeping experts and bandits. In Proceedings of the Annual Conference on Learning Theory (COLT), pp. 425–436. Omnipress, 2008.

Kleinberg, R. D., Niculescu-Mizil, A., and Sharma, Y. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2-3):245–272, 2010.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Maurer, A. and Pontil, M. Empirical Bernstein bounds and sample-variance penalization. In Proceedings of the Annual Conference on Learning Theory (COLT), 2009.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. A. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.

Nguyen, Q. M. and Mehta, N. Near-optimal per-action regret bounds for sleeping bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 238 of Proceedings of Machine Learning Research, pp. 2827–2835. PMLR, 2024.

Osband, I. and Van Roy, B. On lower bounds for regret in reinforcement learning. CoRR, abs/1608.02732, 2016.

Saha, A., Gaillard, P., and Valko, M. Improved sleeping bandits with stochastic action sets and adversarial rewards. In Proceedings of the International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pp. 8357–8366. PMLR, 2020.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J. P., Schrittwieser, J., Quan, J., Gaffney, S., Petersen, S., Simonyan, K., Schaul, T., van Hasselt, H., Silver, D., Lillicrap, T. P., Calderone, K., Keet, P., Brunasso, A., Lawrence, D., Ekermo, A., Repp, J., and Tsing, R. StarCraft II: A new challenge for reinforcement learning. CoRR, abs/1708.04782, 2017.

Weissman, T., Ordentlich, E., Seroussi, G., Verdú, S., and Weinberger, M. J. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep., 2003.

Zhang, Z., Chen, Y., Lee, J. D., and Du, S. S. Settling the sample complexity of online reinforcement learning. In Proceedings of the Annual Conference on Learning Theory (COLT), volume 247 of Proceedings of Machine Learning Research, pp. 5213–5219. PMLR, 2024.

Figure 4. Illustrative Sleeping MDP.
A. Illustrative Example

In this appendix, we discuss an example of a Sleeping MDP, analyzing the two disclosure scenarios described in the paper (i.e., per-episode and per-stage disclosure, denoted by PE and PS, respectively) and the case in which we have no information about the available actions and make use of a low-level controller (LLC) to correct taken actions that are not available, generating an equivalent MDP as depicted in Figure 1a.

Consider the Sleeping MDP depicted in Figure 4, where the goal is to go from the initial state A to the absorbing state C. To reach this goal, we have two paths: the first (the lower one) is safe but costly; the second is unsafe and may not always be available, but has no cost. More formally, we consider an undiscounted finite-horizon Sleeping MDP with initial state A and absorbing state C (we can reason as in the infinite-horizon setting, since state C is reached with probability 1). To reach the absorbing state, we have two possible paths, which can be chosen using deterministic actions. The first follows action Fs (i.e., forward safe), which deterministically leads to the final state with total reward $-2$. The second path leads to the final state through state B. We assume we can always take action F (i.e., forward) in state A and deterministically reach B at no cost ($r = 0$). Then, when we are in B, the next forward action may or may not be available, and we assume it is available with probability $p$ (formally, $\Pr(F \in \mathcal{A}_h(B)) = p$, for every stage $h$). If F is not available, we can voluntarily stay in B and wait (i.e., play action Sv), obtaining a reward of $-1$, or try to go forward anyway. In state B, there is also another action Sf (i.e., stay forced), which makes the agent remain in state B and receive a reward of $-2$. This latter action is employed by the LLC to override forbidden actions (i.e., the attempt to play F in state B when it is not available).
We now compute the value functions of this Sleeping MDP for the three scenarios under analysis.

Sleeping MDP with per-episode disclosure. We first analyze the optimal value function $V^\star_{\mathrm{PE},1}(A)$ in state A for the per-episode disclosure scenario. We recall that, in this case, the action availability is known at the beginning of the episode. It is easy to observe that, considering $h = 1$ as the moment in which the first decision is made, the path through B is convenient if action F will be available in state B at $h = 2$ or $h = 3$ (if action F becomes available for the first time at $h = 4$ or later, we can choose the safe path).⁸ Knowing that such an action is available with probability $p$ at each stage, the probability that it is available at $h = 2$ is indeed $p$, while the probability that it is not available at $h = 2$ but is available at $h = 3$ is $p(1-p)$. Given that, the value function for state A is:
$$V^\star_{\mathrm{PE},1}(A) = \underbrace{0 \cdot p}_{\text{F available at } h=2}\ \underbrace{-\,1 \cdot p(1-p)}_{\text{F available at } h=3\text{, not at } h=2}\ \underbrace{-\,2 \cdot \big(1 - p - p(1-p)\big)}_{\text{otherwise (safe path)}}.$$

⁸ For the sake of precision, it is not relevant whether action F is available in state B at $h = 1$, because we are not in that state and cannot play it.

Figure 5. Optimal value functions $V^\star_\diamond(A)$ for $\diamond \in \{\mathrm{PE}, \mathrm{PS}, \mathrm{LLC}\}$, plotted for $p \in [0, 1]$.

Sleeping MDP with per-stage disclosure. We now analyze the optimal value function $V^\star_{\mathrm{PS},1}(A)$ in state A for the per-stage disclosure scenario, where we do not know the actual action availability in advance. Given the simplicity of the problem, we analyze the value functions from the final state backward. We first observe that, for the absorbing state C, we have $V^\star_{\mathrm{PS},h}(C) = 0$ for every $h \in \mathbb{N}$. Then, we move to state B, and we trivially observe that the optimal policy is to go to state C if possible (i.e., if action F is available), and to use action Sv otherwise. The value function for state B is:
$$V^\star_{\mathrm{PS},h}(B) = p\,\big(0 + V^\star_{\mathrm{PS},h+1}(C)\big) + (1-p)\,\big({-1} + V^\star_{\mathrm{PS},h+1}(B)\big).$$
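Taking the closed forms of the three scenarios (the per-stage and LLC fixed points are solved just below in the text), the comparison of Figure 5 can be reproduced numerically. In this sketch (function names are ours), rewards are the costs $0$, $-1$, $-2$ of the example:

```python
def v_pe(p):
    # Per-episode: F at h=2 (prob p, cost 0), only at h=3 (wait once, -1),
    # otherwise take the safe path (-2).
    return -1.0 * p * (1 - p) - 2.0 * (1 - p - p * (1 - p))

def v_ps(p):
    # Per-stage: fixed point of V(B) = p*0 + (1-p)*(-1 + V(B)).
    return max(-2.0, -(1 - p) / p)

def v_llc(p):
    # LLC: the forced override Sf costs -2 instead of -1.
    return max(-2.0, -2.0 * (1 - p) / p)

# More information can only help: PE >= PS >= LLC for every p.
for i in range(1, 101):
    p = i / 100
    assert v_pe(p) >= v_ps(p) >= v_llc(p)
```

The loop at the end verifies, over a grid of $p$, exactly the ordering discussed in the comparison paragraph: the per-episode value dominates the per-stage one, which in turn dominates the LLC value.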
Since $V^\star_{\mathrm{PS},h}(B) = V^\star_{\mathrm{PS},h+1}(B)$, by solving for $V^\star_{\mathrm{PS}}(B)$, we get:
$$V^\star_{\mathrm{PS}}(B) = -\frac{1-p}{p}.$$
Given that, in state A the optimal policy chooses the best path in expectation, so its value function is:
$$V^\star_{\mathrm{PS},1}(A) = \max\Big\{{-2},\ -\frac{1-p}{p}\Big\}.$$

Sleeping MDP as an MDP with LLC. We finally suppose that we want to handle this Sleeping MDP as if it were a standard MDP with stochastic actions and rewards depending also on the landing state. We start, as before, by observing that $V^\star_{\mathrm{LLC}}(C) = 0$. Then, we can reason about state B. It is clearly convenient to try to go to state C; otherwise, we keep paying the cost of staying in B (for a sufficiently large horizon). Given that, the expected reward in this state is $0$ if we play action F and it is available, and $-2$ if we play it when it is not available. Formally:
$$V^\star_{\mathrm{LLC},h}(B) = p\,\big(0 + V^\star_{\mathrm{LLC},h+1}(C)\big) + (1-p)\,\big({-2} + V^\star_{\mathrm{LLC},h+1}(B)\big).$$
Since $V^\star_{\mathrm{LLC},h+1}(B) = V^\star_{\mathrm{LLC},h}(B)$, we get:
$$V^\star_{\mathrm{LLC}}(B) = -\frac{2(1-p)}{p}.$$
As before, in state A, we can select the best action, leading to:
$$V^\star_{\mathrm{LLC},1}(A) = \max\Big\{{-2},\ -\frac{2(1-p)}{p}\Big\}.$$

Comparison of the Results. Figure 5 shows the value functions for the three cases for all values of $p \in [0,1]$. We can observe that, as supported by intuition, for every $p \in [0,1]$, it is always better to know the action availability in advance ($V^\star_{\mathrm{PE}}(A) \ge V^\star_{\mathrm{PS}}(A)$), as we have more information and can make better decisions. This implies that the two notions of regret must differ (given that the two optimal value functions differ), in particular in terms of the comparator we use, which should be appropriate and attainable. Finally, we observe that the performance of the equivalent MDP integrating the low-level controller is the worst, as expected.

B. Related Works

In this appendix, we summarize the literature relevant to this work. We start by presenting an overview of the fundamental works in the sleeping MAB literature, as this problem has never been faced in the RL scenario.
Then, we introduce works on invalid action masking in RL. Finally, we summarize the main results on regret bounds for standard RL.

Sleeping MAB. Kleinberg et al. (2008; 2010) are the seminal works on the Sleeping MAB setting. The authors study both the full- and partial-information settings, considering both stochastic and adversarial reward models with adversarially chosen action sets. Kanade et al. (2009) present the first polynomial-time algorithm for MABs with adversarial rewards and stochastic action sets. In the same scenario, Saha et al. (2020) improve the performance while keeping the computational complexity polynomial. Kanade & Steinke (2014) present the first polynomial-time algorithm for bandits with both adversarial rewards and adversarial action sets, and Nguyen & Mehta (2024) achieve near-optimal regret bounds in this scenario. Chatterjee et al. (2017) study the setting with adversarial action sets and Bernoulli rewards. Cortes et al. (2019) extend the sleeping framework to graph feedback. Gaillard et al. (2023) study different notions of regret in the sleeping MAB setting and discuss their relations.

Invalid Action Masking for RL. Several works consider the possibility that not all actions are available in all states (Vinyals et al., 2017). This deterministic masking operation can be applied to several types of algorithms, such as policy gradient methods (Huang & Ontañón, 2022) and state-of-the-art deep RL algorithms such as DQN (Mnih et al., 2013). However, in all these works, given a state $s$, there is a deterministic mapping to the action availability (i.e., $C : \mathcal{S} \to \mathcal{P}(\mathcal{A})$). In this work, instead, we consider considerably more challenging scenarios, since, given a state, we have a probability distribution over the available action sets (i.e., $C : \mathcal{S} \to \Delta(\mathcal{P}(\mathcal{A}))$), also in the per-episode disclosure scenario, which can be seen as a stochastic generalization of action masking.

Minimax Regret Bounds for RL.
Auer et al. (2008) and Jaksch et al. (2010) present the first minimax lower bound, of order $\Omega(\sqrt{DSAT})$, for average-reward MDPs with stationary transition probabilities, where $D$ is the diameter of the MDP.⁹,¹⁰ Domingues et al. (2021) generalize this result by providing a standard proof framework for episodic MDPs and demonstrate a lower bound of order $\Omega(H\sqrt{SAT})$ with stage-dependent transitions and $\Omega(\sqrt{HSAT})$ with stage-independent ones. From the algorithmic perspective, Jaksch et al. (2010) propose UCRL2, which enjoys $\widetilde{O}(DS\sqrt{AT})$ regret with stage-independent transitions. Azar et al. (2017) propose UCBVI, which enjoys $\widetilde{O}(\sqrt{HSAT})$ regret with stage-independent transitions and $\widetilde{O}(H\sqrt{SAT})$ with stage-dependent ones.¹¹ Zhang et al. (2024) theoretically improve the result of Azar et al. (2017), while preserving the same (optimal, up to logarithmic factors) rate, by reducing the minimum $T$ required to match the lower bound.

C. Omitted Proofs of the Lower Bounds

C.1. Proof of Theorem 5.1

Theorem 5.1 (Lower Bound Per-Stage Disclosure: Independent). For any algorithm $\mathfrak{A}$, there exists an instance of SleMDP with independent action availability and per-stage disclosure such that, for $T \ge \Omega(H^2 S A)$, the per-stage disclosure regret satisfies:
$$\mathbb{E}[R_{\mathrm{PS}}(\mathfrak{A}, T)] \ge \Omega\big(H\sqrt{SAT}\big).$$

Proof. This proof closely follows that of (Domingues et al., 2021, Theorem 9). To demonstrate the lower bound for Sleeping MDPs with per-stage disclosure and independent action availability, we first modify the class of hard MDP instances provided in (Domingues et al., 2021, Section 3.1), and then follow derivations similar to the original proof, reporting only the relevant modifications.

Definition of the class of Sleeping MDPs. We start by modifying the class $\mathcal{C}_{H,\varepsilon'}$ provided in (Domingues et al., 2021) to transform it into a specific class of Sleeping MDPs.
As in the original proof, the SleMDPs all have three special states: a waiting state $s_w$, a good state $s_g$, and a bad state $s_b$. The remaining $S - 3$ states are arranged in an $A$-ary tree of depth $d - 1$. The agent starts each episode in $s_w$ and can choose to remain in $s_w$ by playing an action $a_w$ up to stage $\bar{H}$, after which it is forced to transition to state $s_{\mathrm{root}}$, the root of the $A$-ary tree.

⁹ This result considers a different setting w.r.t. the finite-horizon MDPs considered in this work. However, we can generalize it by observing that $H = O(D)$, see (Domingues et al., 2021).
¹⁰ Bartlett & Tewari (2009) present a variant of the lower bound which does not hold in general; see (Jaksch et al., 2010; Osband & Van Roy, 2016) for a detailed discussion.
¹¹ The result of (Azar et al., 2017) is derived with stage-independent transitions. However, the stage-dependent result can be derived by considering a fictitious MDP with augmented state space $\mathcal{S} \times [H]$.

For every triplet
$$(h^\star, l^\star, a^\star) \in \{1 + d, \ldots, \bar{H} + d\} \times \mathcal{L} \times \mathcal{A} \setminus \{a_w\},$$
we define a Sleeping MDP $\mathcal{M}_{S,(h^\star,l^\star,a^\star)}$ as follows. The action availabilities in $s_w$ are defined as $C^{\mathrm{ind}}(\mathcal{A} \mid s_w) = 1$, $\forall h \in [H]$, meaning that all actions are available in the waiting state, and the transition probabilities from $s_w$ are defined as:
$$P_h(s_w \mid s_w, a_w) = \mathbb{1}\{h \le \bar{H}\}, \qquad P_h(s_{\mathrm{root}} \mid s_w, a_w) = 1 - P_h(s_w \mid s_w, a_w), \qquad P_h(s_{\mathrm{root}} \mid s_w, a) = 1, \ \forall a \neq a_w, h \in [H].$$
Let $\mathcal{I} = \mathcal{S} \setminus \{\{s_w\} \cup \mathcal{L} \cup \{s_g, s_b\}\}$ be the set of internal nodes of the $A$-ary tree; the action availabilities inside the tree are defined as $C^{\mathrm{ind}}(\mathcal{A} \mid s) = 1$, $\forall s \in \mathcal{I}$. The transition probabilities for any state in the tree are deterministic: playing the $a$-th action leads deterministically to the $a$-th child of the current node. For every leaf node $s_l \in \mathcal{L}$, we define the action availabilities such that:
$$\Pr(a_w \in \mathcal{A}_{k,h}(s)) = 1, \quad \forall s \in \mathcal{L}, k \in [K], h \in [H],$$
$$\Pr(a \in \mathcal{A}_{k,h}(s)) = \sum_{B \in \mathcal{P}(\mathcal{A}) \,:\, a \in B} C^{\mathrm{ind}}(B \mid s) = c \in [0,1], \quad \forall s \in \mathcal{L}, a \in \mathcal{A} \setminus \{a_w\}, k \in [K], h \in [H].$$
As such, at least one action is guaranteed to be available at every stage of every episode, and every action (except $a_w$) is available with probability $c$. The leaf nodes are the only nodes that can transition to $s_g$ and $s_b$, and they do so according to the following transition probabilities:
$$P_h(s_g \mid s_l, a) = \frac{1}{2} + \Delta_{(h^\star,l^\star,a^\star)}(h, s_l, a), \qquad P_h(s_b \mid s_l, a) = 1 - P_h(s_g \mid s_l, a),$$
where $\Delta_{(h^\star,l^\star,a^\star)}(h, s_l, a) = \varepsilon'\,\mathbb{1}\{(h, l, a) = (h^\star, l^\star, a^\star)\}$. Finally, we also define a reference Sleeping MDP $\mathcal{M}_{S,0}$, which has the same structure as the SleMDPs defined above, but with $\Delta_0(h, s_l, a) = 0$. The rewards are defined as:
$$R(s, a, h) = \mathbb{1}\{s = s_g,\ h \ge \bar{H} + d + 1\}, \quad \forall a \in \mathcal{A}.$$

Regret of an algorithm $\mathfrak{A}$ in $\mathcal{M}_{S,(h^\star,l^\star,a^\star)}$. Following the same reasoning as in (Domingues et al., 2021), it is clear that the learner is required to learn the optimal trajectory, which enables it to play action $a^\star$ in leaf node $s_{l^\star}$ at stage $h^\star$. However, this trajectory is only achievable with probability $c$, due to the availability of action $a^\star$ in node $s_{l^\star}$. Notice that, if the optimal trajectory is not available in an episode, the learner does not incur any regret, as neither the agent nor the optimal policy can follow it. We now follow the proof of Theorem 9 of (Domingues et al., 2021), adapting it to the Sleeping MDP setting and reporting only the relevant modifications. Let $S_{k,h}$ and $A_{k,h}$ be the random variables representing, respectively, the state occupied and the action selected at stage $h$ of episode $k$. We start by observing that the average reward gathered by an algorithm $\mathfrak{A}$ is again:
$$\mathbb{E}_{(h^\star,l^\star,a^\star)}\Big[\sum_{k=1}^{K}\sum_{h=1}^{H} R(S_{k,h}, A_{k,h}, h)\Big] = (H - \bar{H} - d)\sum_{k=1}^{K} \Pr\big(S_{k,\bar{H}+d+1} = s_g\big).$$
For any stage $h \in \{1+d, \ldots, \bar{H}+d\}$, we can rewrite Eq. (7) of (Domingues et al., 2021) as:
$$\Pr\nolimits_{(h^\star,l^\star,a^\star)}(S_{k,h+1} = s_g) = \Pr\nolimits_{(h^\star,l^\star,a^\star)}(S_{k,h} = s_g) + \frac{1}{2}\Pr\nolimits_{(h^\star,l^\star,a^\star)}(S_{k,h} \in \mathcal{L}) + \varepsilon'\,\mathbb{1}\{h = h^\star\}\,\Pr\nolimits_{(h^\star,l^\star,a^\star)}(S_{k,h} = s_{l^\star}, A_{k,h} = a^\star)$$
$$= \Pr\nolimits_{(h^\star,l^\star,a^\star)}(S_{k,h} = s_g) + \frac{1}{2}\Pr\nolimits_{(h^\star,l^\star,a^\star)}(S_{k,h} \in \mathcal{L}) + c\,\varepsilon'\,\mathbb{1}\{h = h^\star\}\,\Pr\nolimits_{(h^\star,l^\star,a^\star)}(S_{k,h} = s_{l^\star}, A_{k,h} = a^\star \mid a^\star \in \mathcal{A}_{k,h}), \qquad (8)$$
where (8) is obtained by applying Bayes' rule and observing that $\Pr_{(h^\star,l^\star,a^\star)}(a^\star \in \mathcal{A}_{k,h} \mid S_{k,h} = s_{l^\star}, A_{k,h} = a^\star) = 1$, so that the unconditional probability equals $c$ times the conditional one. Following the proof, we obtain that the optimal value in any of the SleMDPs is $\rho^\star = (H - \bar{H} - d)(1/2 + c\,\varepsilon')$. We can then rewrite the suffered regret as:
$$R_T(\mathfrak{A}) = K (H - \bar{H} - d)\, c\,\varepsilon' \Big(1 - \frac{1}{K}\,\mathbb{E}_{(h^\star,l^\star,a^\star)}\big[Z_{K,(h^\star,l^\star,a^\star)}\big]\Big),$$
where $Z_{K,(h^\star,l^\star,a^\star)} = \sum_{k=1}^{K} \mathbb{1}\{S_{k,h^\star} = s_{l^\star}, A_{k,h^\star} = a^\star \mid a^\star \in \mathcal{A}_{k,h^\star}\}$. Observe that:
$$\frac{1}{K}\,\mathbb{E}_{(h^\star,l^\star,a^\star)}\big[Z_{K,(h^\star,l^\star,a^\star)}\big] = \frac{1}{K}\sum_{k=1}^{K} \Pr\nolimits_{(h^\star,l^\star,a^\star)}\big(S_{k,h^\star} = s_{l^\star}, A_{k,h^\star} = a^\star \mid a^\star \in \mathcal{A}_{k,h^\star}\big) = \frac{1}{cK}\sum_{k=1}^{K} \Pr\nolimits_{(h^\star,l^\star,a^\star)}\big(S_{k,h^\star} = s_{l^\star}, A_{k,h^\star} = a^\star\big) = \frac{1}{cK}\,\mathbb{E}_{(h^\star,l^\star,a^\star)}\big[N_{K,(h^\star,l^\star,a^\star)}\big],$$
where $N_{K,(h^\star,l^\star,a^\star)}$ is defined as in (Domingues et al., 2021). We can then bound the maximum regret over all possible instances by averaging, as in the original proof:
$$\max_{(h^\star,l^\star,a^\star)} R_T(\mathfrak{A}) \ge K (H - \bar{H} - d)\, c\,\varepsilon' \Big(1 - \frac{1}{K}\,\overline{\mathbb{E}}\big[Z_{K,(h^\star,l^\star,a^\star)}\big]\Big) \ge K (H - \bar{H} - d)\, c\,\varepsilon' \Big(1 - \frac{1}{cK}\,\overline{\mathbb{E}}\big[N_{K,(h^\star,l^\star,a^\star)}\big]\Big),$$
where $\overline{\mathbb{E}}$ denotes the average over the instances. Following the derivation, we observe that $c$ appears only in the denominators of the negative terms; since $c \in [0,1]$ by definition, the value of $c$ that maximizes the regret is $c = 1$. Intuitively, given a finite number of episodes, and given that the agent pays no regret when the optimal trajectory is unavailable, the maximum regret is paid when the optimal trajectory is available in every episode. Finally, we conclude the proof by plugging in the same optimal values for $\varepsilon'$ and $\bar{H}$ as in the original proof, obtaining a lower bound of $\Omega(H\sqrt{SAT})$.

C.2. Proof of Theorem 4.2

Theorem 4.2 (Lower Bound Per-Stage Disclosure: Markovian).
For any algorithm $\mathfrak{A}$, there exists an instance of Sleeping MDP with per-stage disclosure and Markovian action availability such that, for $T \ge \Omega(H S A 2^A)$ and $H \ge \Omega(A)$, the per-stage disclosure policy regret satisfies:
$$\mathbb{E}[R_{\mathrm{PS}}(\mathfrak{A}, T)] \ge \Omega\big(H\sqrt{SAT\,2^{A/2}}\big).$$

Proof. We start the proof by describing the instance. The goal of this proof is to show that an exponential component in the regret is unavoidable if we consider Markovian action availability and per-stage disclosure.

Construction of the Instances. To build the hard instances of SleMDPs needed to obtain the lower bound exponential in the number of actions, we consider two building blocks, as depicted in Figure 6. In these instances, we consider $H > \frac{A}{2} + 2\log_A S$, and we consider the state set to be composed of $S$ states plus 3 additional states, $s_w$, $s^\star$, and $\bar{s}$, which we will define later. Moreover, we consider the action set to be composed of $A + 2$ actions, with $A$ even, including 2 special actions, $a^\star$ and $\bar{a}$, which we will define later.

Figure 6. Instance used in the lower bound construction for $C = C^{\mathrm{Markov}}$ with per-stage disclosure: the $A$-ary tree of (Domingues et al., 2021) composed with the sleeping lattice (the available actions are reported next to each state; the distinguishable layer transitions with probabilities $\frac{1}{2} \pm \varepsilon$).

$A$-ary Tree. The first building block is the tree structure proposed by (Domingues et al., 2021), depicted in the grey area of Figure 6. In this structure, we have a starting state $s_w$ from which we have to exit at the proper moment, which is designed to obtain the proper dependence on the horizon $H$. Then, we have an $A$-ary tree structure that we treat in the same way as (Domingues et al., 2021). For this part, we consider all actions to be available in all states. Moreover, we consider deterministic transitions.
We refer the reader to (Domingues et al., 2021) for further details. This construction yields a lower bound of order $\Omega(H\sqrt{SAT})$ and, given that we consider full action availability, we are in the same scenario as the original paper; hence, we omit the derivation and refer the reader to (Domingues et al., 2021, Theorem 9) for the analysis.

Sleeping MDP Lattice. The second building block is a lattice structure that generates the hard instances in terms of Markovian action availability sets. To every leaf of the $A$-ary tree, we append a lattice of depth $A/2$, designed as follows. We start, as depicted in Figure 6, in a generic leaf state $s_i$, in which all actions are available to build the lattice. When we are in the lattice and we play an action $a_i$, such an action is no longer available at the next stage, while we remain in the same state $s_i \in \mathcal{S}$; thus, the only evolution we trigger is a deterministic evolution of the available action set $B$, which becomes $B \setminus \{a_i\}$. We stop the construction of the lattice at the layer of maximum width. It is simple to verify that, in such a layer, we have $n = \binom{A}{A/2}$ different availability combinations. Then, at step $\frac{A}{2} + 1$ of the lattice, the different instances become distinguishable, and there is an action set $i^\star \in [n]$ which is the optimal one. We call $m_i$ the instance in which $i$ is the optimum. Now, if we are in the optimal action availability set, with probability $\frac{1}{2} + \varepsilon$ we remain in $s_i$ and have the chance of playing action $a^\star$, and with probability $\frac{1}{2} - \varepsilon$ we remain in $s_i$ with only action $\bar{a}$ available. Then, if we play $a^\star$ in $s_i$ at the proper stage $h^\star \in [H]$, we get reward 1; otherwise, we get 0. For every $i \in [n]$ with $i \neq i^\star$, the probabilities of reaching the good and the bad action sets are both $\frac{1}{2}$. Given that, an instance is characterized by a tuple $(s^\star, h^\star, a^\star, B^\star)$.
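The source of the exponential term is the width of the lattice's middle layer: starting from the full action set and deleting one played action per stage, after $A/2$ stages every subset of size $A/2$ is reachable, giving $n = \binom{A}{A/2} \ge 2^{A/2}$ candidate availability sets. A quick combinatorial check of this counting step (a sketch of the counting argument only, with our own function names, not of the full construction):

```python
from math import comb

def lattice_layer(A, depth):
    """Availability sets reachable after `depth` action removals from the full set."""
    layer = {frozenset(range(A))}
    for _ in range(depth):
        layer = {B - {a} for B in layer for a in B}
    return layer

A = 8
widest = lattice_layer(A, A // 2)
# Every (A/2)-subset is reachable: C(A, A/2) sets, a quantity >= 2^(A/2),
# which is the factor appearing in the lower bound.
assert len(widest) == comb(A, A // 2)
assert all(len(B) == A // 2 for B in widest)
assert comb(A, A // 2) >= 2 ** (A // 2)
```

Enumerating the layer explicitly is of course only feasible for small $A$; the point is that the number of indistinguishable candidate sets, and hence the exploration burden, grows exponentially with $A$.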
While the first 3 terms are the same as in (Domingues et al., 2021), the last term is the optimal available action set, i.e., the one allowing us, if properly triggered, to play the rewarding action and collect reward 1; $B^\star$ is the set corresponding to $i^\star \in [\![n]\!]$. Given that we already know the result of the first building block (i.e., the $A$-ary tree), we dedicate our efforts to obtaining the exponential dependence on $A$; combining the results together then requires only straightforward algebra.

Fix $i^\star \in [\![n]\!]$. We call $N_i(K)$ the number of times we observed $i$ over $K$ episodes. We define $i^\dagger \in [\![n]\!]$ as one such that:
$$ i^\dagger \in \operatorname*{arg\,min}_{i \in [\![n]\!]\setminus\{i^\star\}} \mathbb{E}_{m_{i^\star}}[N_i(K)]. $$
Given that, since we are in a loop-free structure, we know that:
$$ \sum_{i \in [\![n]\!]\setminus\{i^\star\}} \mathbb{E}_{m_{i^\star}}[N_i(K)] \le K, $$
by the so-called averaging hammer we have:
$$ \mathbb{E}_{m_{i^\star}}[N_{i^\dagger}(K)] \le \frac{K}{n-1}. $$
We now consider instance $m_{i^\dagger}$, which is defined similarly to $m_{i^\star}$, but this time the best available action set, i.e., the one with a probability of activating the rewarding action equal to $\frac{1}{2}+2\varepsilon$, is $i^\dagger$. We highlight that in this instance we have a multiplicative term $2$ before $\varepsilon$. We can now compute the regret and optimize the value of $\varepsilon$. We use the notation $R_T(m)$ to indicate the regret after $T$ interactions on instance $m$:
$$ \max\{R_T(m_{i^\star}), R_T(m_{i^\dagger})\} \ge \frac{1}{2}\big(R_T(m_{i^\star}) + R_T(m_{i^\dagger})\big) \ge \cdots \ge \frac{\varepsilon K}{8}\exp\big(-\mathbb{E}_{m_{i^\star}}[N_{i^\dagger}(K)]\, D_{\mathrm{KL}}\big) \ge \frac{\varepsilon K}{8}\exp\big(-cK\varepsilon^2/(n-1)\big), $$
where $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence between the availability processes of the two instances, the third inequality is the Bretagnolle-Huber inequality (see Lattimore & Szepesvári, 2020, Theorem 14.2), and the last inequality holds for sufficiently small $\varepsilon$ (for an absolute constant $c$). Choosing $\varepsilon \propto \sqrt{(n-1)/K}$, we get:
$$ \max\{R_T(m_{i^\star}), R_T(m_{i^\dagger})\} \ge \Omega\big(\sqrt{(n-1)K}\big) \ge \Omega\big(\sqrt{2^{A/2}K}\big), $$
where the last inequality is derived after having observed that $n = \binom{A}{A/2} \ge 2^{A/2}$. Now, applying the same reasoning as (Domingues et al., 2021), we can retrieve the multiplicative factor $\Omega(H^{3/2}\sqrt{SA})$, leading to a bound of order $\Omega\big(H\sqrt{SAT\,2^{A/2}}\big)$. This concludes the proof.

D.
Proof of Theorem 5.2
In this appendix, we provide the formal proof of Theorem 5.2. The proof follows the one of (Azar et al., 2017, Theorem 2) and, as such, some of the original lemmas that are used as-is are restated to increase the readability of the proof.

D.1. Notation
We now collect the notation necessary for the understanding of the proof of Theorem 5.2.

$\mathcal{S}$ : State space
$\mathcal{A}$ : Action space
$P$ : Transition distribution
$C$ : Action set availability distribution
$R$ : Reward function
$H$ : Length of the episode
$K$ : Total number of episodes
$T$ : Total number of steps
$T_k$ : Total number of steps up to episode $k$
$S$ : Cardinality of the state space
$A$ : Cardinality of the action space
$s_{k,h}$ : State occupied at stage $h$ of episode $k$
$a^{\pi_k(B)}_{k,h}$ : Action played at stage $h$ of episode $k$ under policy $\pi_k$ with action set $B$ available
$N_k(s,a)$ : Number of visits to state-action pair $(s,a)$ up to episode $k$
$N_k(s,a,s')$ : Number of transitions to state $s'$ from state $s$ after playing action $a$, up to episode $k$
$N_{k,h}(s)$ : Number of visits to state $s$ at stage $h$ up to episode $k$
$N_{k,h}(s,a)$ : Number of visits to state-action pair $(s,a)$ at stage $h$ up to episode $k$
$\hat{P}_k$ : Estimated transition distribution
$\hat{C}$ : Estimated action set availability distribution
$b^Q$ : State-action value function exploration bonus, $b^Q_{k,h}(s) = \min\{2900^2H^3S^3A2^AL^3/N_{k,h}(s),\ H^2\}$
$b^V$ : State value function exploration bonus, $b^V_{k,h}(s,B) = \min\{1350^2H^3S^3A2^AL^3/N'_{k,h}(s, a^{\pi_k(B)}_{k,h}),\ H^2\}$
$\pi_k$ : Policy played during episode $k$
$\pi^\star$ : Optimal policy
$Q^\star_h$ : State-action value function of the optimal policy
$Q^\pi_h$ : State-action value function following policy $\pi$
$\hat{Q}_{k,h}$ : Optimistic state-action value function
$V^\star_h$ : Value function of the optimal policy at stage $h$
$V^\pi_h$ : Value function under policy $\pi$ at stage $h$
$\hat{V}_{k,h}$ : Optimistic estimator of the optimal value function at stage $h$ of episode $k$
$V_{k,h}(s)$ : Regret in state $s$, at stage $h$ of episode $k$, following policy $\pi_k$
$\tilde{V}_{k,h}(s)$ : Pseudo-regret in state $s$, at stage $h$ of episode $k$, following policy $\pi_k$
$Q_{k,h}(s,a) = Q^\star_h(s,a) - Q^{\pi_k}_h(s,a)$ : Regret in terms of state-action value function
$\tilde{Q}_{k,h}(s,a) = \hat{Q}_{k,h}(s,a) - Q^{\pi_k}_h(s,a)$ : Pseudo-regret in terms of state-action value function
$E$ : Concentration inequalities event
$\Omega_{k,h}$ : Optimism event
$\varepsilon^V, \bar{\varepsilon}^V, \xi^V, \bar{\xi}^V, \varepsilon^Q, \bar{\varepsilon}^Q$ : Martingale difference sequences
$[k]_{\mathrm{typ}}, [k]_{\mathrm{typ},s}, [k]_{\mathrm{typ},s,a}$ : Sets of typical episodes
$\mathcal{H}_{k,h}$ : History of the interactions up to, and including, stage $h$ of episode $k$, not including the observed available action set $B_{k,h}$
$\mathcal{H}_{k,h,B}$ : History of the interactions up to, and including, stage $h$ of episode $k$, including the observed available action set $B_{k,h}$
$L$ : Logarithmic term $\log(80HS^2A2^AT/\delta)$
$G$ : Logarithmic term $\log(HSAT)$
$\mathbb{V}^{\pi_k}_h$ : Next-state variance of $V^{\pi_k}$
$\mathbb{V}^\star_h$ : Next-state variance of $V^\star$
$\hat{\mathbb{V}}_{k,h}$ : Empirical next-state variance of $\hat{V}_{k,h}$
$\hat{\mathbb{V}}^\star_{k,h}$ : Empirical next-state variance of $V^\star$
$\mathbb{Q}^{\pi_k}_h$ : Variance of $Q^{\pi_k}$
$\mathbb{Q}^\star_h$ : Variance of $Q^\star$
$\hat{\mathbb{Q}}_{k,h}$ : Empirical variance of $\hat{Q}_{k,h}$
$\hat{\mathbb{Q}}^\star_{k,h}$ : Empirical variance of $Q^\star$

Table 2: Table of notation

We now restate the definitions of the Martingale Difference Sequences:
$$\varepsilon^V_{k,h} := \big(P\tilde{V}_{k,h+1}\big)(s_{k,h}, a^{\pi_k(B_{k,h})}_{k,h}) - \tilde{V}_{k,h+1}(s_{k,h+1}),$$
$$\bar{\varepsilon}^V_{k,h} := \sum_{s'\in\mathcal{S}} P(s'|s_{k,h}, a^{\pi_k(B_{k,h})}_{k,h})\sqrt{\frac{\mathbb{I}\{s'\in[s]_{k,h}\}}{N_k(s_{k,h}, a^{\pi_k(B_{k,h})}_{k,h})\,P(s'|s_{k,h}, a^{\pi_k(B_{k,h})}_{k,h})}}\,\tilde{V}_{k,h+1}(s') - \sqrt{\frac{\mathbb{I}\{s_{k,h+1}\in[s]_{k,h}\}}{N_k(s_{k,h}, a^{\pi_k(B_{k,h})}_{k,h})\,P(s_{k,h+1}|s_{k,h}, a^{\pi_k(B_{k,h})}_{k,h})}}\,\tilde{V}_{k,h+1}(s_{k,h+1}),$$
$$\xi^V_{k,h} := \sum_{B\in\mathcal{P}(\mathcal{A})} C^{\mathrm{ind}}(B|s_{k,h}) \sum_{s'\in\mathcal{S}} P(s'|s_{k,h}, a^{\pi_k(B)}_{k,h})\,\tilde{V}_{k,h+1}(s') - \tilde{V}_{k,h+1}(s_{k,h+1}),$$
$$\bar{\xi}^V_{k,h} := \sum_{B\in\mathcal{P}(\mathcal{A})} C^{\mathrm{ind}}(B|s_{k,h}) \sum_{s'\in\mathcal{S}} P(s'|s_{k,h}, a^{\pi_k(B)}_{k,h})\sqrt{\frac{\mathbb{I}\{s'\in[s]_{k,h}\}}{N_k(s_{k,h}, a^{\pi_k(B)}_{k,h})\,P(s'|s_{k,h}, a^{\pi_k(B)}_{k,h})}}\,\tilde{V}_{k,h+1}(s') - \sqrt{\frac{\mathbb{I}\{s_{k,h+1}\in[s]_{k,h}\}}{N_k(s_{k,h}, a^{\pi_k(B_{k,h})}_{k,h})\,P(s_{k,h+1}|s_{k,h}, a^{\pi_k(B_{k,h})}_{k,h})}}\,\tilde{V}_{k,h+1}(s_{k,h+1}),$$
$$\varepsilon^Q_{k,h} := \big(C^{\mathrm{ind}}\tilde{Q}_{k,h}\big)(s_{k,h}) - \tilde{Q}_{k,h}(s_{k,h}, a^{\pi_k(B_{k,h})}_{k,h}),$$
$$\bar{\varepsilon}^Q_{k,h} := \sum_{B\in\mathcal{P}(\mathcal{A})} C^{\mathrm{ind}}(B|s_{k,h})\sqrt{\frac{\mathbb{I}\{B\in[B]_{k,h}\}}{N_k(s_{k,h})\,C^{\mathrm{ind}}(B|s_{k,h})}}\,\tilde{Q}_{k,h}(s_{k,h}, a^{\pi_k(B)}_{k,h}) - \sqrt{\frac{\mathbb{I}\{B_{k,h}\in[B]_{k,h}\}}{N_k(s_{k,h})\,C^{\mathrm{ind}}(B_{k,h}|s_{k,h})}}\,\tilde{Q}_{k,h}(s_{k,h}, a^{\pi_k(B_{k,h})}_{k,h}),$$
and of the variance terms:
$$\mathbb{V}^{\pi_k}_h(s,a) := \mathrm{Var}_{s'\sim P(\cdot|s,a)}\big[V^{\pi_k}_h(s')\big], \qquad \mathbb{V}^\star_h(s,a) := \mathrm{Var}_{s'\sim P(\cdot|s,a)}\big[V^\star_h(s')\big],$$
$$\hat{\mathbb{V}}_{k,h}(s,a) := \mathrm{Var}_{s'\sim \hat{P}_k(\cdot|s,a)}\big[\hat{V}_{k,h}(s')\big], \qquad \hat{\mathbb{V}}^\star_{k,h}(s,a) := \mathrm{Var}_{s'\sim \hat{P}_k(\cdot|s,a)}\big[V^\star_h(s')\big],$$
$\mathbb{Q}^{\pi_k}_h(s) :=$
$\mathrm{Var}_{B\sim C^{\mathrm{ind}}(\cdot|s)}\big[Q^{\pi_k}_h(s, \pi_{k,h}(s,B))\big]$, $\mathbb{Q}^\star_h(s) := \mathrm{Var}_{B\sim C^{\mathrm{ind}}(\cdot|s)}\big[Q^\star_h(s, \pi^\star_h(s,B))\big]$, $\hat{\mathbb{Q}}_{k,h}(s) := \mathrm{Var}_{B\sim \hat{C}^{\mathrm{ind}}_k(\cdot|s)}\big[\hat{Q}_{k,h}(s, \pi_{k,h}(s,B))\big]$, $\hat{\mathbb{Q}}^\star_{k,h}(s) := \mathrm{Var}_{B\sim \hat{C}^{\mathrm{ind}}_k(\cdot|s)}\big[Q^\star_h(s, \pi^\star_h(s,B))\big]$.

For ease of notation, we will employ the following conventions throughout the appendix. Let $F : \mathcal{X} \to \Delta(\mathcal{Y})$ be a probability distribution over a set $\mathcal{Y}$ conditioned on a set $\mathcal{X}$, let $\hat{F} : \mathcal{X} \to \Delta(\mathcal{Y})$ be an estimator of $F$, and let $G : \mathcal{X}\times\mathcal{Y} \to \mathbb{R}$ be a real-valued function. Then we define the following notations:
$$\big(FG\big)(x) := \sum_{y\in\mathcal{Y}} F(y|x)\,G(x,y), \qquad \big((\hat{F}-F)G\big)(x) := \sum_{y\in\mathcal{Y}} \big(\hat{F}(y|x)-F(y|x)\big)\,G(x,y).$$
Similarly, let $F : \mathcal{X}\times\mathcal{Y} \to \Delta(\mathcal{Z})$, $\hat{F} : \mathcal{X}\times\mathcal{Y} \to \Delta(\mathcal{Z})$, and $G : \mathcal{Z} \to \mathbb{R}$ be defined with the same meaning as above; then we define the following notations:
$$\big(FG\big)(x,y) := \sum_{z\in\mathcal{Z}} F(z|x,y)\,G(z), \qquad \big((\hat{F}-F)G\big)(x,y) := \sum_{z\in\mathcal{Z}} \big(\hat{F}(z|x,y)-F(z|x,y)\big)\,G(z).$$

D.2. High Probability Events
In this section, we state the high probability events $\Omega_{k,h}$ and $E$. For ease of reference, we employ the same notation as (Azar et al., 2017). Let $\Omega_{k,h}$ be the set of events:
$$\Omega_{k,h} := \big\{\hat{V}_{i,j}(s) \ge V^\star_j(s) \,\wedge\, \hat{Q}_{i,j}(s,a) \ge Q^\star_j(s,a),\ \forall (i,j)\in[k,h]_{\mathrm{hist}},\ s\in\mathcal{S},\ a\in\mathcal{A}\big\},$$
for $k\in[\![K]\!]$ and $h\in[\![H]\!]$, where:
$$[k,h]_{\mathrm{hist}} := \{(i,j) : i\in[\![K]\!],\ j\in[\![H]\!],\ (i<k) \vee (i=k \wedge j\ge h)\},$$
under which optimism holds. Event $E$ is the event defined as:
$E := E_{\hat{P}} \cap E_{\hat{C}^{\mathrm{ind}}} \cap \bigcap_{k\in[\![K]\!]} \bigcap_{h\in[\![H]\!]} \bigcap_{s\in\mathcal{S}} \bigcap_{a\in\mathcal{A}} \bigcap_{B\in\mathcal{P}(\mathcal{A})} \{$
E_az(F_Ṽ,k,h, H, L) ∩ E_az(F_Ṽ,k,h,s, H, L) ∩ E_az(F_Ṽ,k,h,s,a, H, L)
∩ E_az(F′_Ṽ,k,h, 1/√L, L) ∩ E_az(F′_Ṽ,k,h,s, 1/√L, L) ∩ E_az(F′_Ṽ,k,h,s,a, 1/√L, L)
∩ E_az(F_Ṽ,k,h,B, H, L) ∩ E_az(F_Ṽ,k,h,B,s, H, L) ∩ E_az(F_Ṽ,k,h,B,s,a, H, L)
∩ E_az(F′_Ṽ,k,h,B, 1/√L, L) ∩ E_az(F′_Ṽ,k,h,B,s, 1/√L, L) ∩ E_az(F′_Ṽ,k,h,B,s,a, 1/√L, L)
∩ E_az(F_Q̃,k,h, H, L) ∩ E_az(F_Q̃,k,h,s, H, L) ∩ E_az(F_Q̃,k,h,s,a, H, L)
∩ E_az(F′_Q̃,k,h, 1/√L, L) ∩ E_az(F′_Q̃,k,h,s, 1/√L, L) ∩ E_az(F′_Q̃,k,h,s,a, 1/√
L , Lq č Efrp GV,k,h, H4Tk, H3, Lq č Efrp GV,k,h,s, H5Nk,h, H3, Lq č Efrp GV,k,h,s,a, H5Nk,h, H3, Lq č Efrp GQ,k,h, H4Tk, H3, Lq č Efrp GQ,k,h,s, H5Nk,h, H3, Lq č Efrp GQ,k,h,s,a, H5Nk,h, H3, Lq č Eaz Fb V ,k,h, H2, L č Eaz Fb V ,k,h,s, H2, L č Eaz Fb V ,k,h,s,a, H2, L č Eaz Fb Q,k,h, H2, L č Eaz Fb Q,k,h,s, H2, L č Eaz Fb Q,k,h,s,a, H2, L + Sleeping Reinforcement Learning The proof that event E holds with probability at least 1 δ directly follows from Lemma 1 of (Azar et al., 2017). We now state the definitions of the events that compose E. Events E p P and E p Cind concern the estimation of the transition probability and of the action availability distributions: E p P : t p Pkps1|s, aq P Ppk, h, Nkps, aq, s, a, s1q, @k P JKK, h P JHK, ps, a, s1q P S ˆ A ˆ Su, E p Cind : t p Cind k p B|sq P Cpk, h, Nkpsq, s, Bq, @k P JKK, h P JHK, s P S, B P Pp Aqu. Ppk, h, n, s, a, s1q is the subset of the set of all transition probability distributions P such that: Ppk, h, n, s, a, s1q : # r Pp |s, aq P P : } r Pp |s, aq Pp |s, aq}1 ď 2 s1PS p r Pps1|s, aq Pps1|s, aqq V h ps1q 2p V k,h 1ps, aq L n 7HL 3pn 1q, 2V h 1ps, aq L ˇˇˇ r Pps1|s, aq Pps1|s, aq ˇˇˇ ď 2Pps1|s, aqp1 Pps1|s, aqq L and Cpk, h, Nkpsq, s, Bq is the subset of all the action availability distributions C such that: Cpk, h, Nkpsq, s, Bq : # r Cindp |sq P C : } r Cindp |sq Cindp |sq}1 ď 2 BPPp Aq p r Cindp B|sq Cindp B|sqq Q hps, aπ p Bq k,h q 2p Q k,hpsq L n 7HL 3pn 1q, ˇˇˇ r Cindp B|sq Cindp B|sq ˇˇˇ ď 2Cindp B|sqp1 Cindp B|sqq L where Eq. (11) and Eq. (14) follows by applying Lemma 2.1 of (Weissman et al., 2003), Eq. (12) and Eq. (15) follows by applying both the Bernstein inequality (see, e.g., Cesa-Bianchi & Lugosi, 2006) and the Empirical Bernstein inequality (Maurer & Pontil, 2009), and Eq. (13) and Eq. (16) derive from the application of the Bernstein inequality for Bernoulli random variables. The remaining events concern the summation of Martingale difference sequences. 
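The confidence sets above combine a Weissman-style L1 bound for estimated distributions with Bernstein-style terms for individual probabilities. A numeric sanity check of their shapes (function names and constants here are ours and schematic, not the paper's exact ones):

```python
from math import sqrt, log

def bernstein_width(p: float, n: int, L: float) -> float:
    """Bernstein-style deviation for a Bernoulli mean estimated from n samples:
    a variance term shrinking as 1/sqrt(n) plus a range term shrinking as 1/n.
    Schematic constants, not the paper's exact ones."""
    return sqrt(2 * p * (1 - p) * L / n) + 2 * L / (3 * n)

def weissman_l1_width(S: int, n: int, L: float) -> float:
    """Weissman et al. (2003)-style L1 deviation for a distribution over S
    outcomes estimated from n samples (schematic constants)."""
    return sqrt(2 * S * L / n)

L = log(1e4)  # stands in for the paper's logarithmic term L
# Both widths shrink as the visit count n grows.
print(bernstein_width(0.3, 10, L), bernstein_width(0.3, 1000, L))
print(weissman_l1_width(5, 10, L), weissman_l1_width(5, 1000, L))
```

Note how the Bernstein width vanishes at rate $1/n$ when the variance term $p(1-p)$ is zero, which is why variance-aware bounds are sharper than Hoeffding-style ones for near-deterministic transitions.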
For ease of reading, let us introduce the following shorthand notation: Is : Itsi,h su, Is,a : Itsi,h s, aπip Bi,hq i,h au, where I represents the indicator function. Sleeping Reinforcement Learning Eazp F r V ,k,h, H, Lq : j h p P r V i,j 1qpsi,j, aπip Bi,jq i,j q j h r V i,j 1psi,j 1q kp H hq H2L Eazp F r V ,k,h,s, H, Lq : j h p P r V i,j 1qpsi,j, aπip Bi,jq i,j q j h r V i,j 1psi,j 1q Nk,hpsqp H hq H2L Eazp F r V ,k,h,s,a, H, Lq : j h p P r V i,j 1qpsi,j, aπip Bi,jq i,j q j h r V i,j 1psi,j 1q Nk,hps, aqp H hq H2L Eazp F1 r V ,k,h, 1 ? s1PS Pps1|si,j, aπip Bi,jq i,j q Its1 P rssi,ju Nipsi,j, aπip Bi,jq i,j q Pps1|si,j, aπip Bi,jq i,j q r V i,j 1ps1q Itsi,j 1 P rssi,ju Nipsi,j, aπip Bi,jq i,j q Ppsi,j 1|si,j, aπip Bi,jq i,j q r V i,j 1psi,j 1q Eazp F1 r V ,k,h,s, 1 ? s1PS Pps1|si,j, aπip Bi,jq i,j q Its1 P rssi,ju Nipsi,j, aπip Bi,jq i,j q Pps1|si,j, aπip Bi,jq i,j q r V i,j 1ps1q Itsi,j 1 P rssi,ju Nipsi,j, aπip Bi,jq i,j q Ppsi,j 1|si,j, aπip Bi,jq i,j q r V i,j 1psi,j 1q Nk,hpsqp H hq Eazp F1 r V ,k,h,s,a, 1 ? s1PS Pps1|si,j, aπip Bi,jq i,j q Its1 P rssi,ju Nipsi,j, aπip Bi,jq i,j q Pps1|si,j, aπip Bi,jq i,j q r V i,j 1ps1q Itsi,j 1 P rssi,ju Nipsi,j, aπip Bi,jq i,j q Ppsi,j 1|si,j, aπip Bi,jq i,j q r V i,j 1psi,j 1q Nk,hps, aqp H hq Eazp F r V ,k,h,B, H, Lq : BPPp Aq Cindp B|si,jq ÿ s1PS Pps1|si,j, aπip Bq i,j qr V i,j 1ps1q j h r V i,j 1psi,j 1q kp H hq H2L Sleeping Reinforcement Learning Eazp F r V ,k,h,B,s, H, Lq : BPPp Aq Cindp B|si,jq ÿ s1PS Pps1|si,j, aπip Bq i,j qr V i,j 1ps1q j h r V i,j 1psi,j 1q Nk,hpsqp H hq H2L Eazp F r V ,k,h,B,s,a, H, Lq : BPPp Aq Cindp B|si,jq ÿ s1PS Pps1|si,j, aπip Bq i,j qr V i,j 1ps1q j h r V i,j 1psi,j 1q Nk,hps, aqp H hq H2L Eazp F1 r V ,k,h,B, 1 ? j h E B Cindp |si,jq s1 P p |si,j,a πip Bq i,j q Its1 P rssi,ju Nipsi,j, aπip Bq i,j q Pps1|si,j, aπip Bq i,j q r V i,j 1ps1q Itsi,j 1 P rssi,ju Nipsi,j, aπip Bq i,j q Pps1|si,j, aπip Bq i,j q r V i,j 1psi,j 1q Eazp F1 r V ,k,h,B,s, 1 ? 
j h E B Cindp |si,jq s1 P p |si,j,a πip Bq i,j q Its1 P rssi,ju Nipsi,j, aπip Bq i,j q Pps1|si,j, aπip Bq i,j q r V i,j 1ps1q Itsi,j 1 P rssi,ju Nipsi,j, aπip Bq i,j q Pps1|si,j, aπip Bq i,j q r V i,j 1psi,j 1q Nk,hpsqp H hq Eazp F1 r V ,k,h,B,s,a, 1 ? j h E B Cindp |si,jq s1 P p |si,j,a πip Bq i,j q Its1 P rssi,ju Nipsi,j, aπip Bq i,j q Pps1|si,j, aπip Bq i,j q r V i,j 1ps1q Itsi,j 1 P rssi,ju Nipsi,j, aπip Bq i,j q Pps1|si,j, aπip Bq i,j q r V i,j 1psi,j 1q Nk,hps, aqp H hq Eazp F r Q,k,h, H, Lq : j h p Cind r Q i,jqpsi,jq j h r Q i,jpsi,j, aπip Bi,jq i,j q kp H hq H2L Eazp F r Q,k,h,s, H, Lq : j h p Cind r Q i,jqpsi,jq j h r Q i,jpsi,j, aπip Bi,jq i,j q Sleeping Reinforcement Learning Nk,hpsqp H hq H2L Eazp F r Q,k,h,s,a, H, Lq : j h p Cind r Q i,jqpsi,jq j h r Q i,jpsi,j, aπip Bi,jq i,j q Nk,hps, aqp H hq H2L Eazp F1 r Q,k,h, 1 ? BPPp Aq Cindp B|si,jq It B P r Bsi,ju Nipsi,jq Cindp B|si,jq r Q i,jpsi,j, aπip Bq i,j q It Bi,j P r Bsi,ju Nipsi,jq Cindp B|si,jq r Q i,jpsi,j, aπip Bi,jq i,j q Eazp F1 r Q,k,h,s, 1 ? BPPp Aq Cindp B|si,jq It B P r Bsi,ju Nipsi,jq Cindp B|si,jq r Q i,jpsi,j, aπip Bq i,j q It Bi,j P r Bsi,ju Nipsi,jq Cindp B|si,jq r Q i,jpsi,j, aπip Bi,jq i,j q Nk,hpsqp H hq Eazp F1 r Q,k,h,s,a, 1 ? 
BPPp Aq Cindp B|si,jq It B P r Bsi,ju Nipsi,jq Cindp B|si,jq r Q i,jpsi,j, aπip Bq i,j q It Bi,j P r Bsi,ju Nipsi,jq Cindp B|si,jq r Q i,jpsi,j, aπip Bi,jq i,j q Nk,hps, aqp H hq Efrp GV,k,h, H4Tk, H3, Lq : j h Vπi j 1psi,j, aπip Bi,jq i,j q|Hi,h j h Vπi j 1psi,j, aπip Bi,jq i,j q Efrp GV,k,h,s, H5Nk,h, H3, Lq : j h Vπi j 1psi,j, aπip Bi,jq i,j q|Hi,h j h Vπi j 1psi,j, aπip Bi,jq i,j q H5Nk,hpsq L 4 Efrp GV,k,h,s,a, H5Nk,h, H3, Lq : j h Vπi j 1psi,j, aπip Bi,jq i,j q|Hi,h j h Vπi j 1psi,j, aπip Bi,jq i,j q H5Nk,hps, aq L 4 Sleeping Reinforcement Learning Efrp GQ,k,h, H4Tk, H3, Lq : j h Qπi j psi,jq|Hi,h j h Qπi j psi,jq Efrp GQ,k,h,s, H5Nk,h, H3, Lq : j h Qπi j psi,jq|Hi,h j h Qπi j psi,jq H5Nk,hpsq L 4 Efrp GQ,k,h,s,a, H5Nk,h, H3, Lq : j h Qπi j psi,jq|Hi,h j h Qπi j psi,jq H5Nk,hps, aq L 4 Eaz Fb Q,k,h, H2, L : j h p Pb Qqpsi,j, aπip Bi,jq i,j q b Qpsi,j 1q kp H hq H4L Eaz Fb Q,k,h,s, H2, L : j h p Pb Qqpsi,j, aπip Bi,jq i,j q b Qpsi,j 1q Nk,hpsqp H hq H4L Eaz Fb Q,k,h,s,a, H2, L : j h p Pb Qqpsi,j, aπip Bi,jq i,j q b Qpsi,j 1q Nk,hps, aqp H hq H4L Eaz Fb V ,k,h, H2, L : j h p Cb V qpsi,jq b V psi,j, aπip Bi,jq i,j q kp H hq H4L Eaz Fb V ,k,h,s, H2, L : j h p Cb V qpsi,jq b V psi,j, aπip Bi,jq i,j q Nk,hpsqp H hq H4L Eaz Fb V ,k,h,s,a, H2, L : j h p Cb V qpsi,jq b V psi,j, aπip Bi,jq i,j q Nk,hps, aqp H hq H4L Sleeping Reinforcement Learning The choice of L derives from a union bound over the 40 events, for every episode k P JKK, stage h P JHK, state s P S, action a P A, and action subset B P Pp Aq, noting that each event holds with at least probability 1 δ. D.3. Technical Lemmas Lemma D.1 (Regret decomposition upper bound V k,h). Let k P JKK and h P JHK. Assume events E and Ωk,h hold. Then, the regret from stage h onward of all episodes up to k, in terms of state value function, can be upper bounded as follows: i 1 V i,hpsi,hq ď i 1 r V i,hpsi,hq b V i,jpsi,jq EB Cindp |si,jqrb Q i,jpsi,j, aπip Bq i,j qs 1 H b Q i,jpsi,j, aπip Bi,jq i,j q 2L H εV i,j ? 
EB Cindp |si,jq p p Pi Pq V j 1 psi,j, aπip Bq i,j q ı 1 p p Pi Pq V j 1 psi,j, aπip Bi,jq i,j q EB Cindp |si,jq 3Nipsi,j, aπip Bq i,j q 3Nipsi,j, aπip Bi,jq i,j q Nipsi,jq 2H2AL 2Qπi j psi,jq L Nipsi,jq 2HL 3Nipsi,jq Proof. Considering a single value of k P JKK, we first observe that, under Ωk,h: V k,hpsk,hq V h psk,hq V πk h psk,hq ď p Vk,hpsk,hq V πk h psk,hq r V k,hpsk,hq. As such, we can then bound the pseudo-regret r V k,hpsk,hq: r V k,hpsk,hq p Vk,hpsk,hq V πk h psk,hq b V k,hpsk,hq p Cind k p Qk,h psk,hq Cind Qπk h psk,hq b V k,hpsk,hq p p Cind k Cindq p Qk,h psk,hq loooooooooooooooomoooooooooooooooon EB Cindp |sk,hqrb Q k,hpsk,h, aπkp Bq k,h qs EB Cindp |sk,hq p Pk p Vk,h 1 psk,h, aπkp Bq k,h q PV πk h 1 psk,h, aπkp Bq k,h q ı paq b V k,hpsk,hq EB Cindp |sk,hqrb Q k,hpsk,h, aπkp Bq k,h qs EB Cindp |sk,hq p p Pk Pqp Vk,h 1 psk,h, aπkp Bq k,h q ı E B Cindp |sk,hq s1 P p |sk,h,a πkp Bq k,h q r V k,h 1ps1q ı paq b V k,hpsk,hq EB Cindp |sk,hqrb Q k,hpsk,h, aπkp Bq k,h qs E B Cindp |sk,hq s1 P p |sk,h,a πkp Bq k,h q r V k,h 1ps1q ı EB Cindp |sk,hq p p Pk Pq V h 1 psk,h, aπkp Bq k,h q ı Sleeping Reinforcement Learning EB Cindp |sk,hq p p Pk Pqpp Vk,h 1 V h 1q psk,h, aπkp Bq k,h q ı paq b V k,hpsk,hq EB Cindp |sk,hqrb Q k,hpsk,h, aπkp Bq k,h qs r V k,h 1psk,h 1q ξV k,h EB Cindp |sk,hq p p Pk Pq V h 1 psk,h, aπkp Bq k,h q ı EB Cindp |sk,hqr p p Pk Pqpp Vk,h 1 V h 1q psk,h, aπkp Bq k,h q looooooooooooooooooooooooooooooooooooomooooooooooooooooooooooooooooooooooooon We now bound term pbq following the procedure of (Azar et al., 2017, Lemma 3), obtaining the following bound: 2Lξ V k,h 1 H r V k,h 1psk,h 1q EB Cindp |sk,hq 3Nkpsk,h, aπkp Bq k,h q To bound term paq we first derive that: paq p p Cind k Cindq p Qk,h psk,hq p p Cind k Cindqp p Qk,h Qπk h q psk,hq loooooooooooooooooooooomoooooooooooooooooooooon p p Cind k Cindq Qπk h psk,hq 2Qπk h psk,hq L Nkpsk,hq 2HL 3Nkpsk,hq (18) where Eq. (18) is obtained by applying Bernstein s inequality. 
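Since this Bernstein step recurs throughout the decomposition, it may help to record the generic inequality being invoked (a standard statement; the shapes below mirror the terms of Eq. (18), with schematic constants):

```latex
% Bernstein's inequality for the empirical mean \hat{p} of n i.i.d. samples of a
% [0,1]-bounded random variable with mean p and variance \sigma^2,
% with probability at least 1-\delta:
|\hat{p} - p| \;\le\; \sqrt{\frac{2\sigma^2 L}{n}} \;+\; \frac{2L}{3n},
\qquad L = \log\tfrac{2}{\delta}.
% Summed against a function bounded by H, this produces terms of the form
% \sqrt{2\,\mathbb{Q}^{\pi_k}_h(s_{k,h})\,L / N_k(s_{k,h})} + 2HL/(3N_k(s_{k,h})),
% which is exactly the shape appearing in Eq. (18).
```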
We now bound term pcq: p Cind k p B|sk,hq Cindp B|sk,hq r Q k,hpsk,h, aπkp Bq k,h q 2Cindp B|sk,hqp1 Cindp B|sk,hqq L Nkpsk,hq 2L 3Nkpsk,hq r Q k,hpsk,h, aπkp Bq k,h q (19) 2Cindp B|sk,hq L Nkpsk,hq 2L 3Nkpsk,hq r Q k,hpsk,h, aπkp Bq k,h q Cindp B|sk,hq Nkpsk,hq r Q k,hpsk,h, aπkp Bq k,h q looooooooooooooooooooooooomooooooooooooooooooooooooon 3Nkpsk,hq (20) where Eq. (19) is obtained by applying Bernstein s inequality for Bernoulli Random Variables, and Eq. (20) follows by upper bounding r Q k,hpsk,h, aπkp Bq k,h q with H. Let r Bsk,h t B P Pp Aq : Nkpsk,hq Cindp B|sk,hq ě 2H2Lu. To bound term pdq, we first rewrite it as: Cindp B|sk,hq Nkpsk,hq r Q k,hpsk,h, aπkp Bq k,h q loooooooooooooooooooooooooomoooooooooooooooooooooooooon Cindp B|sk,hq Nkpsk,hq r Q k,hpsk,h, aπkp Bq k,h q loooooooooooooooooooooooooomoooooooooooooooooooooooooon Sleeping Reinforcement Learning we can then bound term pfq as: Nkpsk,hq Cindp B|sk,hq Nkpsk,hq r Q k,hpsk,h, aπkp Bq k,h q 2L Nkpsk,hq , by applying the condition of r Bsk,h, and term peq as: Cindp B|sk,hq Nkpsk,hq r Q k,hpsk,h, aπkp Bq k,h q It Bk,h P r Bsk,hu Nkpsk,hq Cindp Bk,h|sk,hq r Q k,hpsk,h, aπkp Bk,hq k,h q (22) ď εQ k,h 1 ? 2H2L r Q k,hpsk,h, aπkp Bk,hq k,h q, (23) where Eq. (22) is obtained by applying the definition of εQ k,h, Eq. (23) follows from the definition of r Bsk,h. Plugging the bounds of terms peq and pfq into Eq. (21), we get: pdq ď εQ k,h 1 ? 2H2L r Q k,hpsk,h, aπkp Bk,hq k,h q H22A? 2L Nkpsk,hq . We can then bound r Q k,hpsk,h, aπkp Bk,hq k,h q as in Lemma 3 of (Azar et al., 2017) and plug the bound of term pdq into Eq. (20), thus obtaining: H b Q k,hpsk,h, aπkp Bk,hq k,h q ˆ 1 r V k,h 1psk,h 1q p p Pk Pq V h 1 psk,h, aπkp Bk,hq k,h q 1 2L H εV k,h 3Nkpsk,h, aπkp Bk,hq k,h q 2H22AL Nkpsk,hq 2H2AL 3Nkpsk,hq. Finally, putting together, the bounds of terms paq, pbq, and pcq, and plugging them into Eq. 
(17), we obtain: r V k,hpsk,hq ď b V k,hpsk,hq EB Cindp |sk,hqrb Q k,hpsk,h, aπkp Bq k,h qs 1 H b Q k,hpsk,h, aπkp Bk,hq k,h q r V k,h 1psk,h 1q ξV k,h 1 2L H εV k,h ? EB Cindp |sk,hq p p Pk Pq V h 1 psk,h, aπkp Bq k,h q ı 1 p p Pk Pq V h 1 psk,h, aπkp Bk,hq k,h q EB Cindp |sk,hq 3Nkpsk,h, aπkp Bq k,h q 3Nkpsk,h, aπkp Bk,hq k,h q Sleeping Reinforcement Learning Nkpsk,hq 2H2AL 3Nkpsk,hq 2Qπk h psk,hq L Nkpsk,hq 2HL 3Nkpsk,hq. By applying the same inductive argument of (Azar et al., 2017, Lemma 3), we can isolate the term r V k,h and rewrite: r V k,hpsk,hq ď e2 H 1 ÿ b V k,jpsk,jq EB Cindp |sk,jqrb Q k,jpsk,j, aπkp Bq k,j qs 1 H b Q k,jpsk,j, aπkp Bk,jq k,j q 2L H εV k,j ? EB Cindp |sk,jq p p Pk Pq V j 1 psk,j, aπkp Bq k,j q ı 1 p p Pk Pq V j 1 psk,j, aπkp Bk,jq k,j q EB Cindp |sk,jq 3Nkpsk,j, aπkp Bq k,j q 3Nkpsk,j, aπkp Bk,jq k,j q Nkpsk,jq 2H2AL 2Qπk j psk,jq L Nkpsk,jq 2HL 3Nkpsk,jq observing that our multiplicative constant is e2 instead of e due to the p1 2{H 1{H2q coefficient of the recursive term. Finally, recalling the definition of Ωk,h: Ωk,h : ! p Vi,jpsq ě V j psq p Qi,jps, aq ě Q j ps, aq, @pi, jq P rk, hshist, s P S, a P A ) , where rk, hshist : tpi, jq : i P JKK, j P JHK, pi ă kq _ pi k, j ě hqu, we observe that if Ωk,h holds then also the events Ωi,j for pi, jq P rk, hshist hold. As such, we can sum up the bound of Eq. (24) over all the episodes i P Jk K, thus concluding the proof. Lemma D.2 (Regret decomposition upper bound Q k,h). Let k P JKK and h P JHK. Assume events E and Ωk,h hold. Then, the regret from stage h onward of all episodes up to k, in terms of state-action value function, can be upper bounded as follows: i 1 Q k,hpsi,h, aπip Bi,hq i,h q ď i 1 r Q k,hpsi,h, aπip Bi,hq i,h q b Q i,jpsi,j, aπip Bi,jq i,j q ˆ 1 1 b V i,j 1psi,j 1q 2LεV i,j εV i,j ˆ 1 1 p p Pi Pq V j 1 psi,j, aπip Bi,jq i,j q Nipsi,j, aπip Bi,jq i,j q ˆ 1 1 2AL Nipsi,j 1q Proof. 
Considering a single value of k P JKK, we first observe that, under Ωk,h: Q k,hpsk,h, aπkp Bk,hq k,h q Q hpsk,h, aπkp Bk,hq k,h q Qπk h psk,h, aπkp Bk,hq k,h q ď p Qk,hpsk,h, aπkp Bk,hq k,h q Qπk h psk,h, aπkp Bk,hq k,h q Sleeping Reinforcement Learning r Q k,hpsk,h, aπkp Bk,hq k,h q. As such, we can bound the pseudo-regret r Q k,hpsk,h, aπkp Bk,hq k,h q: r Q k,hpsk,h, aπkp Bk,hq k,h q p Qk,hpsk,h, aπkp Bk,hq k,h q Qπk h psk,h, aπkp Bk,hq k,h q b Q k,hpsk,h, aπkp Bk,hq k,h q p Pk p Vk,h 1 psk,h, aπkp Bk,hq k,h q PV πk h 1 psk,h, aπkp Bk,hq k,h q b Q k,hpsk,h, aπkp Bk,hq k,h q p p Pk Pqp Vk,h 1 psk,h, aπkp Bk,hq k,h q Ppp Vk,h 1 V πk h 1q psk,h, aπkp Bk,hq k,h q b Q k,hpsk,h, aπkp Bk,hq k,h q p p Pk Pqpp Vk,h 1 V h 1q psk,h, aπkp Bk,hq k,h q Ppp Vk,h 1 V πk h 1q psk,h, aπkp Bk,hq k,h q p p Pk Pq V h 1 psk,h, aπkp Bk,hq k,h q ď b Q k,hpsk,h, aπkp Bk,hq k,h q ? 2LεV k,h εV k,h ˆ 1 1 r V k,h 1psk,h 1q Nkpsk,h, aπkp Bk,hq k,h q p p Pk Pq V h 1 psk,h, aπkp Bk,hq k,h q, (25) where Eq. (25) is obtained by bounding p p Pk Pqpp Vk,h 1 V h 1q according to the procedure of (Azar et al., 2017, Lemma 3). Observing that: r V k,h 1psk,h 1q b V k,h 1psk,h 1q p Cind k p Qk,h 1 psk,h 1q Cind Qπk h 1 psk,h 1q b V k,h 1psk,h 1q p p Cind k Cindq p Qk,h 1 psk,h 1q Cindp p Qk,h 1 Qπk h 1q psk,h 1q ď b V k,h 1psk,h 1q εQ k,h 1 2H 2AL Nkpsk,h 1q r Q k,h 1psk,h 1, aπkp Bk,h 1q k,h 1 q, (26) where Eq. (26) is obtained by applying the definition of εQ k,h and by bounding p p Cind k Cindq p Qk,h 1 using Thr. 2.1 of (Weissman et al., 2003), then we can rewrite: r Q k,hpsk,h, aπkp Bk,hq k,h q ď b Q k,hpsk,h, aπkp Bk,hq k,h q ? 2LεV k,h εV k,h 8H2SL Nkpsk,h, aπkp Bk,hq k,h q p p Pk Pq V h 1 psk,h, aπkp Bk,hq k,h q ˆ 1 1 b V k,h 1psk,h 1q ˆ 1 1 2AL Nkpsk,h 1q ˆ 1 1 r Q k,h 1psk,h 1, aπkp Bk,h 1q k,h 1 q. 
By applying the same inductive argument of (Azar et al., 2017, Lemma 3), we can isolate the term r Q k,h and rewrite: r Q k,hpsk,h, aπkp Bk,hq k,h q ď e b Q k,jpsk,j, aπkp Bk,jq k,j q ˆ 1 1 b V k,j 1psk,j 1q 2LεV k,j εV k,j ˆ 1 1 Sleeping Reinforcement Learning p p Pk Pq V j 1 psk,j, aπkp Bk,jq k,j q Nkpsk,j, aπkp Bk,jq k,j q ˆ 1 1 2AL Nkpsk,j 1q Finally, recalling the definition of Ωk,h: Ωk,h : ! p Vi,jpsq ě V j psq p Qi,jps, aq ě Q j ps, aq, @pi, jq P rk, hshist, s P S, a P A ) , where rk, hshist : tpi, jq : i P JKK, j P JHK, pi ă kq _ pi k, j ě hqu, we observe that if Ωk,h holds then also the events Ωi,j for pi, jq P rk, hshist hold. As such, we can sum up the bound of Eq. (27) over all the episodes i P Jk K, thus concluding the proof. Lemma D.3. Let k P JKK and h P JHK. Let events E and Ωk,h hold. Then the following bounds hold: j h εV i,j ď 2 a j h ξV i,j ď 2 a j h εQ i,j ď 2 a ξ V i,j ď a Moreover, for every s P S and a P A, the following bounds also hold: i 1 Itsi,h su j h εV i,j ď 2 b H3Nk,hpsq L, i 1 Itsi,h su j h ξV i,j ď 2 b H3Nk,hpsq L, i 1 Itsi,h su j h εQ i,j ď 2 b H3Nk,hpsq L, i 1 Itsi,h su i 1 Itsi,h su ξ V i,j ď b Sleeping Reinforcement Learning i 1 Itsi,h su i 1 Itsi,h s, aπip Bi,hq i,h au j h εV i,j ď 2 b H3Nk,hps, aq L, i 1 Itsi,h s, aπip Bi,hq i,h au j h ξV i,j ď 2 b H3Nk,hps, aq L, i 1 Itsi,h s, aπip Bi,hq i,h au j h εQ i,j ď 2 b H3Nk,hps, aq L, i 1 Itsi,h s, aπip Bi,hq i,h au HNk,hps, aq, i 1 Itsi,h s, aπip Bi,hq i,h au ξ V i,j ď b HNk,hps, aq, i 1 Itsi,h s, aπip Bi,hq i,h au HNk,hps, aq. Proof. The proof follows a similar reasoning as that of Lemma 5 of (Azar et al., 2017), observing that under E and Ωk,h the following events hold: Eazp F r V ,k,h, H, Lq, Eazp F r V ,k,h,s, H, Lq, Eazp F r V ,k,h,s,a, H, Lq, Eazp F r V ,k,h,B, H, Lq, Eazp F r V ,k,h,B,s, H, Lq, Eazp F r V ,k,h,B,s,a, H, Lq, Eazp F r Q,k,h, H, Lq, Eazp F r Q,k,h,s, H, Lq, Eazp F r Q,k,h,s,a, H, Lq, Eazp F1 r V ,k,h, 1 ? L , Lq, Eazp F1 r V ,k,h,s, 1 ? 
L, L), E_az(F′_Ṽ,k,h,s,a, 1/√L, L), E_az(F′_Ṽ,k,h,B, 1/√L, L), E_az(F′_Ṽ,k,h,B,s, 1/√L, L), E_az(F′_Ṽ,k,h,B,s,a, 1/√L, L), E_az(F′_Q̃,k,h, 1/√L, L), E_az(F′_Q̃,k,h,s, 1/√L, L), E_az(F′_Q̃,k,h,s,a, 1/√L, L).

Remark D.1. With a slight abuse of notation, but for ease of reading, we will refer to the following lemmas also for the summations in which the action set is not fixed but is taken in expectation over $C^{\mathrm{ind}}$. As an example, we will refer to Lemma D.4 when upper bounding the summation:
$$\sum_{j=h}^{H-1} \mathbb{E}_{B\sim C^{\mathrm{ind}}(\cdot|s_{i,j})}\Big[\mathbb{V}^{\pi_i}_{j+1}\big(s_{i,j}, a^{\pi_i(B)}_{i,j}\big)\Big]. \qquad (29)$$
Considering the summations of the random variables that define the events (see Lemma 1 of Azar et al., 2017) and changing the conditioning from $\mathcal{H}_{k,h,B}$ to $\mathcal{H}_{k,h}$ (i.e., removing $B_{k,h}$ from the history), we see that the quantities in play retain the properties that make them Martingale difference sequences. As such, we can apply the same derivations, obtaining the same bounds.

Lemma D.4 (Lemma 8 of Azar et al. (2017)). Let $k \in [\![K]\!]$ and $h \in [\![H]\!]$. Let $\pi_k$ be the policy followed during episode $k$. Under the events $E$ and $\Omega_{k,h}$, the following holds for every $s \in \mathcal{S}$:
$$\sum_{i=1}^{k}\sum_{j=h}^{H-1} \mathbb{V}^{\pi_i}_{j+1}\big(s_{i,j}, a^{\pi_i(B_{i,j})}_{i,j}\big) \le HT_k + 2\sqrt{H^4T_kL} + 4H^3L,$$
$$\sum_{i=1}^{k}\mathbb{I}\{s_{i,h}=s\}\sum_{j=h}^{H-1} \mathbb{V}^{\pi_i}_{j+1}\big(s_{i,j}, a^{\pi_i(B_{i,j})}_{i,j}\big) \le H^2N_{k,h}(s) + 2\sqrt{H^5N_{k,h}(s)L} + 4H^3L,$$
$$\sum_{i=1}^{k}\mathbb{I}\{s_{i,h}=s,\ a^{\pi_i(B_{i,h})}_{i,h}=a\}\sum_{j=h}^{H-1} \mathbb{V}^{\pi_i}_{j+1}\big(s_{i,j}, a^{\pi_i(B_{i,j})}_{i,j}\big) \le H^2N_{k,h}(s,a) + 2\sqrt{H^5N_{k,h}(s,a)L} + 4H^3L.$$
Proof. The proof of this lemma directly derives from that of Lemma 8 of (Azar et al., 2017), observing that under event $E$ the following events hold: $E_{\mathrm{fr}}(G_{V,k,h}, H^4T_k, H^3, L)$, $E_{\mathrm{fr}}(G_{V,k,h,s}, H^5N'_{k,h}, H^3, L)$, $E_{\mathrm{fr}}(G_{V,k,h,s,a}, H^5N'_{k,h}, H^3, L)$, and by taking the expectation over both the states and the action availabilities. Eq. (43) and Eq. (44) of (Azar et al., 2017) hold under this modified expectation since the cumulative reward is bounded by $H$.

Lemma D.5 (Lemma 9 of Azar et al. (2017)). Let $k \in [\![K]\!]$ and $h \in [\![H]\!]$. Let $\pi_k$ be the policy followed during episode $k$.
Under the events E and Ωk,h, the following holds for every ps, aq P S ˆ A: V j 1psi,j, aπip Bi,jq i,j q Vπi j 1psi,j, aπip Bi,jq i,j q j h r V i,jpsi,jq, i 1 Itsi,h su V j 1psi,j, aπip Bi,jq i,j q Vπi j 1psi,j, aπip Bi,jq i,j q H5Nk,hpsq L 2H i 1 Itsi,h su j h r V i,jpsi,jq, i 1 Itsi,h s, aπip Bi,hq i,h au V j 1psi,j, aπip Bi,jq i,j q Vπi j 1psi,j, aπip Bi,jq i,j q H5Nk,hps, aq L 2H i 1 Itsi,h s, aπip Bi,hq i,h au j h r V i,jpsi,jq. Proof. The proof directly follows from the proof of Lemma 9 of (Azar et al., 2017), observing that events Eazp F r V ,k,h, H, Lq, Eazp F r V ,k,h,s, H, Lq, and Eazp F r V ,k,h,s, H, Lq hold under E, and Ωk,h holds. Lemma D.6 (Lemma 10 of Azar et al. (2017)). Let k P JKK and h P JHK. Let πk be the policy followed during episode k. Under the events E and Ωk,h, the following holds for every s P S: p Vi,j 1psi,j, aπip Bi,jq i,j q Vπi j 1psi,j, aπip Bi,jq i,j q Sleeping Reinforcement Learning j h r V i,j 1psi,j 1q, i 1 Itsi,h su p Vi,j 1psi,j, aπip Bi,jq i,j q Vπi j 1psi,j, aπip Bi,jq i,j q HANk,hpsq L 2H i 1 Itsi,h su j h r V i,j 1psi,j 1q, i 1 Itsi,h s, aπip Bi,hq i,h au p Vi,j 1psi,j, aπip Bi,jq i,j q Vπi j 1psi,j, aπip Bi,jq i,j q HANk,hps, aq L 2H i 1 Itsi,h s, aπip Bi,hq i,h au j h r V i,j 1psi,j 1q. Proof. The proof directly follows that of Lemma 10 of (Azar et al., 2017), observing that under Ωk,h and E, the following events hold: Eazp F r V ,k,h, H, Lq, Eazp F r V ,k,h,s, H, Lq, Eazp F r V ,k,h,s,a, H, Lq. Lemma D.7. Let k P JKK and h P JHK. Let πk be the policy followed during episode k. Under the events E and Ωk,h, the following holds for every s P S: j h Qπi j psi,jq ď HTk 2 a i 1 Itsi,h su j h Qπi j psi,jq ď H2Nk,hpsq 2 b H5Nk,hpsq L 4 i 1 Itsi,h s, aπip Bi,hq i,h au j h Qπi j psi,jq ď H2Nk,hps, aq 2 b H5Nk,hps, aq L 4 Proof. This proof follows a derivation similar to that of Lemma 8 of (Azar et al., 2017). Let us first recall the definition of Qπi j psi,jq: Qπi j psi,jq Var B Cindp |si,jq Qπi j psi,j, aπip Bq i,j q ı . 
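The proof below recursively applies the Law of Total Variance to this quantity. The identity it relies on, $\mathrm{Var}(X) = \mathbb{E}[\mathrm{Var}(X|B)] + \mathrm{Var}(\mathbb{E}[X|B])$, can be checked exactly on a toy availability distribution (all names and numbers here are illustrative, not from the paper):

```python
# Toy check of the Law of Total Variance on a made-up joint distribution
# over (availability set B, outcome X): P(B) and P(X | B) are illustrative.
p_B = {"B1": 0.4, "B2": 0.6}
p_X_given_B = {"B1": {0.0: 0.5, 1.0: 0.5},
               "B2": {0.0: 0.1, 1.0: 0.9}}

def mean(dist):
    return sum(x * p for x, p in dist.items())

def var(dist):
    m = mean(dist)
    return sum(p * (x - m) ** 2 for x, p in dist.items())

p_X = {}  # marginal distribution of X
for B, pb in p_B.items():
    for x, px in p_X_given_B[B].items():
        p_X[x] = p_X.get(x, 0.0) + pb * px

total_var = var(p_X)
expected_cond_var = sum(p_B[B] * var(p_X_given_B[B]) for B in p_B)
cond_mean_dist = {}  # distribution of the conditional mean E[X | B]
for B, pb in p_B.items():
    m = mean(p_X_given_B[B])
    cond_mean_dist[m] = cond_mean_dist.get(m, 0.0) + pb
var_of_cond_mean = var(cond_mean_dist)

# Law of Total Variance: Var(X) = E[Var(X|B)] + Var(E[X|B]).
assert abs(total_var - (expected_cond_var + var_of_cond_mean)) < 1e-12
```

In the proof, this identity is what lets the per-stage variances telescope into a total-variance term bounded by $H^2$ per episode.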
Under event E, the following events hold: Efrp GQ,k,h, H4Tk, H3, Lq, Efrp GQ,k,h,s, H5Nk,h, H3Lq, and Efrp GQ,k,h,s,a, H5Nk,h, H3Lq. Event Efrp GQ,k,h, H4Tk, H3, Lq implies that: j h Qπi j psi,jq ď j h Qπi j psi,jq|Hk,h Observing that, conditioned to Hk,h, Qπi j psi,jq Vπi j 1psi,j, aπip Bi,jq i,j q since there is no variance on the reward at stage j, we can then recursively apply the Law of Total Variance (see e.g., Thr. 9.5.5 of Blitzstein & Hwang, 2019) as done in (Azar et al., 2017, Eq. 26) to obtain the following bound: Sleeping Reinforcement Learning j h Qπi j psi,jq|Hk,h Var psi,j,Bi,jqj PJh 1,H 1K j h 1 Rπipsi,j, aπip Bi,jq i,j q Putting everything together, we obtain the following bound: j h Qπi j psi,jq ď HTk 2 a In a similar manner, from events Eazp F r Q,k,h,s, H, Lq and Eazp F r Q,k,h,s,a, H, Lq we can derive the bounds for the two remaining summations, thus concluding the proof. Lemma D.8. Let k P JKK and h P JHK. Let πk be the policy followed during episode k. Under the events E and Ωk,h, the following holds for every s P S: p Qi,jpsi,jq Qπi j psi,jq ď H2a S2ATk L 7H2S a ATk L 4H3SLG j h b Q i,jpsi,j, aπip Bi,jq i,j q 2H j h r V i,j 1psi,j 1q, i 1 Itsi,h su p Qi,jpsi,jq Qπi j psi,jq H5S2ANk,hpsq L 7 b H5S2ANk,hpsq L 4H3SLG i 1 Itsi,h su j h b Q i,jpsi,j, aπip Bi,jq i,j q i 1 Itsi,h su j h r V i,j 1psi,j 1q, i 1 Itsi,h s, aπip Bi,hq i,h au p Qi,jpsi,jq Qπi j psi,jq H5S2ANk,hps, aq L 7 b H5S2ANk,hps, aq L 4H3SLG i 1 Itsi,h s, aπip Bi,hq i,h au j h b Q i,jpsi,j, aπip Bi,jq i,j q i 1 Itsi,h s, aπip Bi,hq i,h au j h r V i,j 1psi,j 1q. Proof. We first provide an upper bound to p Qi,jpsi,jq Qπi j psi,jq, and we then bound its summation over episodes and stages. 
p Qi,jpsi,jq Qπi j psi,jq EB p Cind i p |si,jq p Qi,jpsi,j, aπip Bq i,j q2ı EB p Cind i p |si,jq p Qi,jpsi,j, aπip Bq i,j q ı2 EB Cindp |si,jq Qπi j psi,j, aπip Bq i,j q2ı EB Cindp |si,jq Qπi j psi,j, aπip Bq i,j q ı2 Sleeping Reinforcement Learning ď EB p Cind i p |si,jq p Qi,jpsi,j, aπip Bq i,j q2ı EB Cindp |si,jq Qπi j psi,j, aπip Bq i,j q2ı EB p Cind i p |si,jq Q j psi,j, aπ p Bq i,j q ı2 EB Cindp |si,jq Q j psi,j, aπ p Bq i,j q ı2 (30) ď EB p Cind i p |si,jq p Qi,jpsi,j, aπip Bq i,j q2ı EB Cindp |si,jq p Qi,jpsi,j, aπip Bq i,j q2ı EB Cindp |si,jq p Qi,jpsi,j, aπip Bq i,j q2ı EB Cindp |si,jq Qπi j psi,j, aπip Bq i,j q2ı 2H p p Cind i Cindq Q j psi,jq ď EB p Cind i p |si,jq p Qi,jpsi,j, aπip Bq i,j q2ı EB Cindp |si,jq p Qi,jpsi,j, aπip Bq i,j q2ı EB Cindp |si,jq p Qi,jpsi,j, aπip Bq i,j q2ı EB Cindp |si,jq Qπi j psi,j, aπip Bq i,j q2ı 4H H2L Nipsi,jq, (31) where Eq. (30) follows from observing that, under Ωk,h, p Qi,j ě Q j ě Qπi j , @ps, aq P S ˆ A, and Eq. (31) is obtained via Hoeffding s inequality. Putting this bound in the double summation, we get: p Qi,jpsi,jq Qπi j psi,jq ď EB p Cind i p |si,jq p Qi,jpsi,j, aπip Bq i,j q2ı EB Cindp |si,jq p Qi,jpsi,j, aπip Bq i,j q2ı looooooooooooooooooooooooooooooooooooooooooooooooooooomooooooooooooooooooooooooooooooooooooooooooooooooooooon EB Cindp |si,jq p Qi,jpsi,j, aπip Bq i,j q2 Qπi j psi,j, aπip Bq i,j q2ı loooooooooooooooooooooooooooooooooooooooooomoooooooooooooooooooooooooooooooooooooooooon ˆ 4H H2L Nipsi,jq looooooooooooomooooooooooooon We can bound term paq with H2a S2ATk L by bounding p Qi,j with H and applying Thr. 2.1 of (Weissman et al., 2003), and term pcq with 4H3SLG by applying a pigeonhole argument. 
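The pigeonhole arguments used here and below bound sums of inverse visit counts: each visit to $(s,a)$ increments its counter, so summing $1/N$ or $1/\sqrt{N}$ along a trajectory reduces to partial harmonic sums. A quick numeric check of the two standard inequalities they rely on (our illustration):

```python
from math import log, sqrt

def harmonic_sum(N: int) -> float:
    """sum_{n=1}^N 1/n: the source of logarithmic factors like G."""
    return sum(1.0 / n for n in range(1, N + 1))

def inv_sqrt_sum(N: int) -> float:
    """sum_{n=1}^N 1/sqrt(n): the source of sqrt(N) factors."""
    return sum(1.0 / sqrt(n) for n in range(1, N + 1))

for N in (10, 1000, 100000):
    assert harmonic_sum(N) <= 1 + log(N)   # pigeonhole -> log factor
    assert inv_sqrt_sum(N) <= 2 * sqrt(N)  # pigeonhole -> sqrt factor
```

Both inequalities follow from comparing the sum with the corresponding integral, which is exactly how the $\log$ and $\sqrt{\cdot}$ factors enter the regret bounds.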
We now bound term pbq as follows: BPPp Aq Cindp B|si,jq p p Qi,jpsi,j, aπip Bq i,j q Qπi j psi,j, aπip Bq i,j qqp p Qi,jpsi,j, aπip Bq i,j q Qπi j psi,j, aπip Bq i,j qq BPPp Aq Cindp B|si,jqp p Qi,jpsi,j, aπip Bq i,j q Qπi j psi,j, aπip Bq i,j qq j h εQ i,j 2H j h b Q i,jpsi,j, aπip Bi,jq i,j q ˆ Es1P p Pip |si,j,a πip Bi,j q i,j q p Vi,j 1ps1q Es1 P p |si,j,a πip Bi,j q i,j q V πi j 1ps1q j h εQ i,j 2H j h b Q i,jpsi,j, aπip Bi,jq i,j q 2H j h εV i,j 2H j h r V i,j 1psi,j 1q Sleeping Reinforcement Learning j h p Pip |si,j, aπip Bi,jq i,j q Ppp |si,j, aπip Bi,jq i,j q 1 (33) j h b Q i,jpsi,j, aπip Bi,jq i,j q H2S a j h r V i,j 1psi,j 1q, (34) where Eq. (33) is obtained by adding and subtracting 2H řk i 1 řH 1 j h p P p Vi,j 1qpsi,j, aπip Bi,jq i,j q and 2H řk i 1 řH 1 j h r V i,j 1psi,j 1q, and Eq. (34) is obtained since event Eazp F r Q,k,h, H, Lq holds under E. Plugging the bounds of terms paq, pbq, and pcq into Eq. (32), we finally obtain: p Qi,jpsi,jq Qπi j psi,jq ď H2a S2ATk L 2H2a j h b Q i,jpsi,j, aπip Bi,jq i,j q j h r V i,j 1psi,j 1q 4H3SLG S2ATk L 7H2S a j h b Q i,jpsi,j, aπip Bi,jq i,j q j h r V i,j 1psi,j 1q 4H3SLG. In a similar manner, and observing that events Eazp F r Q,k,h,s, H, Lq and Eazp F r Q,k,h,s,a, H, Lq hold under E, we can derive the bounds to the remaining summations, thus concluding the proof. Lemma D.9 (Summation over typical episodes of state-action wise model errors, see Lemma 11 of Azar et al. (2017)). Let k P JKK and h P JHK. Let πk be the policy followed during episode k. 
Under events E and Ωk,h the following inequalities hold for every ps, aq P S ˆ A: i 1 Iti P rkstypu p p Pi Pq V j 1 psi,j, aπip Bi,jq i,j q 6HSATk LG 2 g f f e HSALG j h r V i,jpsi,jq, (35) i 1 Iti P rkstyp,s, si,h su p p Pi Pq V j 1 psi,j, aπip Bi,jq i,j q 6H2SANk,hpsq LG 2 g f f e HSALG i 1 Itsi,h su j h r V i,jpsi,jq, i 1 Iti P rkstyp,s,a, si,h s, aπip Bi,hq i,h au p p Pi Pq V j 1 psi,j, aπip Bi,jq i,j q 6H2SANk,hps, aq LG 2 Sleeping Reinforcement Learning g f f e HSALG i 1 Itsi,h s, aπip Bi,hq i,h u j h r V i,jpsi,jq, rkstyp : ! i P Jk K : psi,h, aπip Bi,hq i,h q P rps, aqsk, i ě 11600H3S3A2AL2G, @h P JHK ) , rkstyp,s : ! i P Jk K : psi,h, aπip Bi,hq i,h q P rps, aqsk, Nk,hpsq ě 11600H3S3A2AL2G, @h P JHK ) , rkstyp,s,a : ! i P Jk K : psi,h, aπip Bi,hq i,h q P rps, aqsk, Nk,hps, aq ě 11600H3S3A2AL2G, @h P JHK ) , rps, aqsk : tps, aq P S ˆ A : Nkps, aq ě H, Nk,hpsq ě H, @h P JHKu . Proof. We adapt the proof of (Azar et al., 2017, Lemma 11). We begin by demonstrating the bound of Eq. (35): i 1 Iti P rkstypu p p Pi Pq V j 1 psi,j, aπip Bi,jq i,j q i 1 Iti P rkstypu g f f e2V j 1psi,j, aπip Bi,jq i,j q L Nipsi,j, aπip Bi,jq i,j q 2HL 3Nipsi,j, aπip Bi,jq i,j q g f f f f f e j h V j 1psi,j, aπip Bi,jq i,j q loooooooooooooooomoooooooooooooooon g f f f f f e i 1 Iti P rkstypu Nipsi,j, aπip Bi,jq i,j q loooooooooooooooooooooooomoooooooooooooooooooooooon 3HSALG, (37) where Eq. (36) follows from the application of Bernstein s inequality and Eq. (37) follows from the application of Cauchy Schwarz s inequality together with a pigeonhole argument. Using another pigeonhole argument, we can bound term pbq with SAG. 
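The pigeonhole arguments invoked throughout these lemmas all reduce to the following fact: along any visit sequence, the sum of inverse visit counts over state-action pairs is at most SA(1 + log T), since each pair contributes a harmonic sum. A minimal numerical illustration (the values of S, A, T are hypothetical):

```python
import math
import random

# Pigeonhole argument: sum_t 1/N_t(s_t, a_t) <= S*A*(1 + log T), because each
# (s, a) pair visited n times contributes the harmonic sum 1 + 1/2 + ... + 1/n.
random.seed(0)
S, A, T = 5, 3, 10_000
counts = {}
total = 0.0
for _ in range(T):
    sa = (random.randrange(S), random.randrange(A))  # arbitrary visit sequence
    counts[sa] = counts.get(sa, 0) + 1
    total += 1.0 / counts[sa]                        # 1 / N_t(s_t, a_t)

assert total <= S * A * (1 + math.log(T))
```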
By adding and subtracting Vπi j 1psi,j, aπip Bi,jq i,j q to term paq, we can rewrite it as: j h Vπi j 1psi,j, aπip Bi,jq i,j q loooooooooooooooomoooooooooooooooon j h p V j 1psi,j, aπip Bi,jq i,j q Vπi j 1psi,j, aπip Bi,jq i,j qq looooooooooooooooooooooooooooooooomooooooooooooooooooooooooooooooooon As events E and Ωk,h hold, we can apply Lemma D.4 and Lemma D.5 to bound terms pcq and pdq, respectively, thus obtaining: paq ď HTk 6 a j h r V i,jpsi,jq j h r V i,jpsi,jq, (38) where Eq. (38) holds under the condition of rkstyp. Plugging the bounds of terms paq and pbq into Eq. (37), rearranging the terms, and applying the subadditivity of the square root, we finally get: Sleeping Reinforcement Learning i 1 Iti P rkstypu p p Pi Pq V j 1 psi,j, aπip Bi,jq i,j q ď a 6HSATk LG 2 g f f e HSALG j h r V i,jpsi,jq. In a similar manner, we can demonstrate the bound on the remaining summations, thus concluding the proof. Lemma D.10 (Summation over typical episodes of state-action value function bonus term). Let k P JKK and h P JHK. Let πk be the policy followed during episode k. Let the UCB bonus for the state-action value function be defined as: b Q k,hps, aq g f f e4L Vars1 p Pkp |s,aqrp Vk,h 1ps1qs Nkps, aq 7HL 3p Nkps, aq 1q g f f e4Es1 p Pkp |s,aqrmint 29002H3S3A2AL3 N1 k,h 1ps1q , H2us Under the events E and Ωk,h, the following inequalities hold for every ps, aq P S ˆ A: i 1 Iti P rkstypu j h b Q i,jpsi,j, aπip Bi,jq i,j q 28HSATk LG 7 3HSALG 5800 ? H3S5A22AL3G2 g f f e8HSALG j h r V i,j 1psi,j 1q, i 1 Iti P rkstyp,s, si,h su j h b Q i,jpsi,j, aπip Bi,jq i,j q 28H2SANk,hpsq LG 7 3HSALG 5800 ? H3S5A22AL3G2 g f f e8HSALG i 1 Itsi,h su j h r V i,j 1psi,j 1q, i 1 Iti P rkstyp,s,a, si,h s, aπip Bi,hq i,h au j h b Q i,jpsi,j, aπip Bi,jq i,j q 28H2SANk,hps, aq LG 7 3HSALG 5800 ? H3S5A22AL3G2 g f f e8HSALG i 1 Itsi,h s, aπip Bi,hq i,h au j h r V i,j 1psi,j 1q. rkstyp : ! 
i P Jk K : psi,h, aπip Bi,hq i,h q P rps, aqsk, i ě 11600H3S3A2AL2G, @h P JHK ) , Sleeping Reinforcement Learning rkstyp,s : ! i P Jk K : psi,h, aπip Bi,hq i,h q P rps, aqsk, Nk,hpsq ě 11600H3S3A2AL2G, @h P JHK ) , rkstyp,s,a : ! i P Jk K : psi,h, aπip Bi,hq i,h q P rps, aqsk, Nk,hps, aq ě 11600H3S3A2AL2G, @h P JHK ) , rps, aqsk : tps, aq P S ˆ A : Nkps, aq ě H, Nk,hpsq ě H, @h P JHKu . Proof. The proof of this lemma closely follows that of (Azar et al., 2017, Lem. 12). We can rewrite the summation as: i 1 Iti P rkstypu j h b Q i,jpsi,j, aπip Bi,jq i,j q ď i 1 Iti P rkstypu 4L Vars1 p Pip |si,j,a πip Bi,j q i,j qrp Vi,j 1ps1qs Nipsi,j, aπip Bi,jq i,j q looooooooooooooooooooooooooooooooooooooomooooooooooooooooooooooooooooooooooooooon i 1 Iti P rkstypu 3p Nipsi,j, aπip Bi,jq i,j q 1q looooooooooooooooooooooooooooomooooooooooooooooooooooooooooon i 1 Iti P rkstypu 4Es1 p Pip |si,j,a πip Bi,j q i,j qb Q i,j 1ps1q Nipsi,j, aπip Bi,jq i,j q looooooooooooooooooooooooooooooooooomooooooooooooooooooooooooooooooooooon where b Q i,j 1ps1q mint 29002H3S3A2AL3 N1 i,j 1ps1q , H2u. First of all, we bound term pbq with 7 3HSALG by applying a pigeonhole argument. To bound term paq, we apply Cauchy-Schwarz s inequality to obtain: g f f f f f e j h p Vi,j 1psi,j, aπip Bi,jq i,j q looooooooooooooooomooooooooooooooooon g f f f f f e i 1 Iti P rkstypu Nipsi,j, aπip Bi,jq i,j q loooooooooooooooooooooooomoooooooooooooooooooooooon By applying a pigeonhole argument, we bound term peq with SAG, and we rewrite term pdq as follows: j h Vπi j 1psi,j, aπip Bi,jq i,j q loooooooooooooooomoooooooooooooooon p Vi,j 1psi,j, aπip Bi,jq i,j q Vπi j 1psi,j, aπip Bi,jq i,j q looooooooooooooooooooooooooooooooooomooooooooooooooooooooooooooooooooooon By applying Lemma D.4 and Lemma D.6 to bound terms pfq and pgq, respectively, we obtain the following bound: pdq ď HTk 2 a 3H3L 7H2S a j h r V i,j 1psi,j 1q j h r V i,j 1psi,j 1q, (41) where Eq. (41) holds under the condition of rkstyp. 
Combining the bounds of terms pdq and peq into Eq. (40) and applying the subadditivity of the square root, we obtain: Sleeping Reinforcement Learning g f f e8HSALG j h r V i,j 1psi,j 1q. To bound term pcq, we first apply Cauchy-Schwarz s inequality, obtaining: g f f f f f e j h Es1 p Pip |si,j,a πip Bi,j q i,j qb Q i,j 1ps1q looooooooooooooooooooooomooooooooooooooooooooooon g f f f f f e i 1 Iti P rkstypu Nipsi,j, aπip Bi,jq i,j q loooooooooooooooooooooooomoooooooooooooooooooooooon Similar to term peq, we bound term piq with SAG. To bound term phq, we first rewrite it as: p Pib Q i,j 1 psi,j, aπip Bi,jq i,j q p p Pi Pqb Q i,j 1 psi,j, aπip Bi,jq i,j q looooooooooooooooooooooooomooooooooooooooooooooooooon Pb Q i,j 1 psi,j, aπip Bi,jq i,j q Pb Q i,j 1 psi,j, aπip Bi,jq i,j q b Q i,j 1psi,j 1q loooooooooooooooooooooooooooooooooomoooooooooooooooooooooooooooooooooon b Q i,j 1psi,j 1q loooooooooooomoooooooooooon By bounding b Q i,j 1 with H2 and combining the application of (Weissman et al., 2003, Theorem 2.1) and of a pigeonhole argument, we can upper bound term pjq as: To bound term pkq, we first observe that it is a Martingale difference sequence, and as such we can bound it via the event Eaz Fb Q,k,h, H2, L , which holds under E, thus obtaining: By applying the definition of b Q i,j 1 together with a pigeonhole argument, we can bound term plq as: plq ď 29002H3S4A2AL3G. By applying the bounds of terms pjq, pkq, and plq into Eq. (43), we obtain: phq ď H2S a Tk L 29002H3S4A2AL3G. By applying the bounds of terms phq and piq to Eq. (42), applying the definition of rkstyp, and applying the subadditivity of the square root, we get: Sleeping Reinforcement Learning 3HSATk L 5800 ? H3S5A22AL3G2. Finally, we can combine the bounds of terms paq, pbq, and pcq into Eq. (39), obtaining the following bound: i 1 Iti P rkstypu j h b Q i,jpsi,j, aπip Bi,jq i,j q ď a g f f e8HSAL3G j h r V i,j 1psi,j 1q 3HSATk L 5800 ? H3S5A22AL3G2 g f f e8HSALG j h r V i,j 1psi,j 1q 3HSALG 5800 ? 
H3S5A22AL3G2. In a similar manner, observing that events Eaz Fb Q,k,h,s, H2, L , and Eaz Fb Q,k,h,s,a, H2, L also hold under E, we can derive the bounds for the remaining summations, thus concluding the proof. Lemma D.11 (Summation over typical episodes of state value function bonus term). Let k P JKK and h P JHK. Let πk be the policy followed during episode k. Let the UCB bonus for the state value function be defined as: g f f e4L Var B p Cind k p |sqr p Qk,hps, πk,hps, Bqqs Nkpsq 7HL 3p Nkpsq 1q g f f e4EB p Cind k p |sqrmint 13502H3S3A2AL3 Nk,hps,πk,hps,Bqq , H2us Under the events E and Ωk,h, the following inequalities hold for every s P S: i 1 Iti P rkstypu j h b V i,jpsi,jq ď a 45HSATk LG 7 3HSLG 2700 ? H3S5A22AL3G2 g f f e31H2S2ALG2 kÿ j h r V i,j 1psi,j 1q, i 1 Iti P rkstyp,s, si,h su j h b V i,jpsi,jq ď b 45H2SANk,hpsq LG 7 3HSLG 2700 ? H3S5A22AL3G2 g f f e31H2S2ALG2 kÿ i 1 Itsi,h su j h r V i,j 1psi,j 1q, i 1 Iti P rkstyp,s,a, si,h s, aπip Bi,hq i,h au j h b V i,jpsi,jq 45H2SANk,hps, aq LG 7 3HSLG 2700 ? H3S5A22AL3G2 Sleeping Reinforcement Learning g f f e31H2S2ALG2 kÿ i 1 Itsi,h s, aπip Bi,hq i,h au j h r V i,j 1psi,j 1q. rkstyp : ! i P Jk K : psi,h, aπip Bi,hq i,h q P rps, aqsk, i ě 11600H3S3A2AL2G, @h P JHK ) , rkstyp,s : ! i P Jk K : psi,h, aπip Bi,hq i,h q P rps, aqsk, Nk,hpsq ě 11600H3S3A2AL2G, @h P JHK ) , rkstyp,s,a : ! i P Jk K : psi,h, aπip Bi,hq i,h q P rps, aqsk, Nk,hps, aq ě 11600H3S3A2AL2G, @h P JHK ) , rps, aqsk : tps, aq P S ˆ A : Nkps, aq ě H, Nk,hpsq ě H, @h P JHKu . Proof. 
We can rewrite the summation as: i 1 Iti P rkstypu j h b V i,jpsi,jq ď i 1 Iti P rkstypu g f f e4L Var B p Cind i p |si,jqr p Qi,jpsi,j, aπip Bq i,j qs Nipsi,jq looooooooooooooooooooooooooooooooooooooomooooooooooooooooooooooooooooooooooooooon i 1 Iti P rkstypu 7HL 3p Nipsi,jq 1q loooooooooooooooooooooomoooooooooooooooooooooon i 1 Iti P rkstypu g f f e4EB p Cind i p |si,jqb V i,jpsi,j, Bq Nipsi,jq loooooooooooooooooooooooooooooooomoooooooooooooooooooooooooooooooon where b V i,jpsi,j, Bq mint 13502H3S3A2AL3 N1 i,jpsi,j,a πip Bq i,j q , H2u. First of all, we can bound term pbq by applying a pigeonhole argument as: We now focus on bounding term paq. By applying Cauchy-Schwarz s inequality, we obtain: g f f f f f e j h p Qi,jpsi,jq looooooooomooooooooon g f f f f f e i 1 Iti P rkstypu 1 Nipsi,jq looooooooooooooooomooooooooooooooooon Using the pigeonhole argument, we bound term peq with SG. We can rewrite term pdq as follows: j h Qπi j psi,jq looooooooomooooooooon p Qi,jpsi,jq Qπi j psi,jq loooooooooooooooooooomoooooooooooooooooooon Sleeping Reinforcement Learning We can bound term pfq by using Lemma D.7 as: j h Qπi j psi,jq and term pgq by using Lemma D.8 as: p Qi,jpsi,jq Qπi j psi,jq S2ATk L 7H2S a ATk L 4H3SLG j h b Q i,jpsi,j, aπip Bi,jq i,j q 2H j h r V i,j 1psi,j 1q By plugging the bounds of terms pfq and pgq into Eq. (46) we get: pdq ď HTk 2 a S2ATk L 7H2S a ATk L 4H3SLG j h b Q i,jpsi,j, aπip Bi,jq i,j q 2H j h r V i,j 1psi,j 1q S2ATk L 7H2S a ATk L 4H3SLG 28HSATk LG 2H g f f e8HSALG j h r V i,j 1psi,j 1q 3 H2SALG 11600 ? H5S5A22AL3G2 2H j h r V i,j 1psi,j 1q (47) j h r V i,j 1psi,j 1q g f f e32H3SALG j h r V i,j 1psi,j 1q, (48) where Eq. (47) follows by applying Lemma D.10 and Eq. (48) holds under rkstyp. By plugging the bounds of terms pdq and peq into Eq. (45), rearranging the terms, and applying the subadditivity of the square root, we obtain: g f f e31H2S2ALG2 kÿ j h r V i,j 1psi,j 1q. 
(49) The derivation of the bound of term pcq is similar to the derivation in Lemma D.10 from Eq. (42) onward. First we apply Cauchy-Schwarz s inequality, rewriting the term as: Sleeping Reinforcement Learning g f f f f f e j h EB p Cind i p |si,jqb V i,jpsi,j, Bq loooooooooooooooooooomoooooooooooooooooooon g f f f f f e i 1 Iti P rkstypu 1 Nipsi,jq looooooooooooooooomooooooooooooooooon With a pigeonhole argument, we bound term piq with SG. We now rewrite term phq as: p Cind i b V i,j psi,jq p p Cind i Cindqb V i,j psi,jq loooooooooooooooooooomoooooooooooooooooooon Cindb V i,j psi,jq Cindb V i,j psi,jq b V i,jpsi,jq loooooooooooooooooooooooomoooooooooooooooooooooooon b V i,jpsi,jq looooooooomooooooooon By bounding b V i,j 1 with H2 and combining the application of (Weissman et al., 2003, Theorem 2.1) and of a pigeonhole argument, we can upper bound term pjq as: To bound term pkq, we first observe that it is a Martingale difference sequence, and as such we can bound it via the event Eazp Fb V ,k,h, H2, Lq, which holds under E, thus obtaining: By applying the definition of b V i,j together with a pigeonhole argument, we can bound term plq as: plq ď 13502H3S4A22AL3G. By putting together the bounds of terms pjq, pkq, and plq into Eq. (51), we get: S2ATk L 2H2a Tk L 13502H3S4A22AL3G. (52) By applying the bounds of terms phq and piq to Eq. (50), applying the definition of rkstyp, and applying the subadditivity of the square root, we get: HSATk LG 2700 ? H3S5A22AL3G2. (53) Finally, we can combine the bounds of terms paq, pbq, and pcq into Eq. (44), obtaining the following bound: Sleeping Reinforcement Learning i 1 Iti P rkstypu j h b V i,jpsi,jq ď a g f f e31H2S2ALG2 kÿ j h r V i,j 1psi,j 1q HSATk LG 2700 ? H3S5A22AL3G2 g f f e31H2S2ALG2 kÿ j h r V i,j 1psi,j 1q 3HSLG 2700 ? H3S5A22AL3G2. 
In a similar manner, observing that events Eazp Fb V ,k,h,s, H2, Lq and Eazp Fb V ,k,h,s,a, H2, Lq also hold under E, we can derive the bounds for the remaining summations, thus concluding the proof. Lemma D.12 (Upper bound of state-action value function estimation error). Let k P JKK and h P JHK. Let πk be the policy followed during episode k. Under E and Ωk,h, the following holds for every ps, aq P S ˆ A: p Qk,hps, aq Q hps, aq ď min Nk,hps, aq , H Proof. We begin the proof by observing that, under Ωk,h, p Qk,hps, aq ě Q hps, aq for every s P S and a P A. We can rewrite: p Qk,hps, aq Q hps, aq 1 Nk,hps, aq i 1 Itsi,h s, aπip Bi,hq i,h au p Qk,hpsi,h, aπip Bi,hq i,h q Q hpsi,h, aπip Bi,hq i,h q ď 1 Nk,hps, aq i 1 Itsi,h s, aπip Bi,hq i,h au p Qi,hpsi,h, aπip Bi,hq i,h q Qπi h psi,h, aπip Bi,hq i,h q (54) 1 Nk,hps, aq i 1 Itsi,h s, aπip Bi,hq i,h aur Q i,hpsi,h, aπip Bi,hq i,h q, (55) where Eq. (54) follows from the fact that p Qk,h is monotonically decreasing in k by definition and by observing that Q h ě Qπi h . Recalling the upper bound of řk i 1 Itsi,h s, aπip Bi,hq i,h aur Q i,hpsi,h, aπip Bi,hq i,h q: i 1 Itsi,h s, aπip Bi,hq i,h aur Q k,hpsi,h, aπip Bi,hq i,h q i 1 Itsi,h s, aπip Bi,hq i,h au b Q i,jpsi,j, aπip Bi,jq i,j q 2b V i,j 1psi,j 1q p p Pi Pq V j 1 psi,j, aπip Bi,jq i,j q 8H2SL Nipsi,j, aπip Bi,jq i,j q 4H 2AL Nipsi,j 1q H3Nk,hps, aq L 4e b HNk,hps, aq L 4e b H3Nk,hps, aq L, we can apply two pigeonhole arguments, obtaining the following upper bound: Sleeping Reinforcement Learning i 1 Itsi,h s, aπip Bi,hq i,h aur Q k,hpsi,h, aπip Bi,hq i,h q i 1 Itsi,h s, aπip Bi,hq i,h au b Q i,jpsi,j, aπip Bi,jq i,j q 2b V i,j 1psi,j 1q p p Pi Pq V j 1 psi,j, aπip Bi,jq i,j q 8H2S2ALG 4 b H3S2ANk,hps, aq L H3Nk,hps, aq L 4e b HNk,hps, aq L 4e b H3Nk,hps, aq L (56) : Uk,h,s,a. We now bound the summations over episodes and stages of the different terms. 
By applying Lemma D.11, we can bound the summation over typical episodes of the state value function bonus term as: i 1 Iti P rkstyp, si,h s, aπip Bi,hq i,h au j h b V i,jpsi,jq 45H2SANk,hps, aq LG 7 3HSLG 2700 ? H3S5A22AL3G2 g f f f f f e 31H2S2ALG2 kÿ i 1 Itsi,h s, aπip Bi,hq i,h au j h r V i,j 1psi,j 1q loooooooooooooooooooooooooooooomoooooooooooooooooooooooooooooon By applying the same procedure as in Eq. (26), we can bound term paq as: i 1 Iti P rkstyp, si,h s, aπip Bi,hq i,h au j h b V i,j 1psi,j 1q i 1 Iti P rkstyp, si,h s, aπip Bi,hq i,h au j h εQ i,j 1 i 1 Iti P rkstyp, si,h s, aπip Bi,hq i,h au 2AL Nipsi,i 1q i 1 Iti P rkstyp, si,h s, aπip Bi,hq i,h au j h r Q i,j 1psi,j 1, aπip Bi,j 1q i,j 1 q 45H2SANk,hps, aq LG 7 3HSLG 2700 ? H3S5A22AL3G2 ? H3Nk,hps, aq L 2 b H3S2ANk,hps, aq L HUk,h,s,a, (58) where Eq. (58) follows from the application of Lemma D.3 and Lemma D.11, the application of a pigeonhole argument, and by bounding řk i 1 Itsi,h s, aπip Bi,hq i,h au řH 1 j h r Q i,j 1psi,j 1, aπip Bi,j 1q i,j 1 q with HUk,h,s,a. By applying the bound of term paq into Eq. (57), applying the condition of rkstyp, and rearranging terms, we obtain: Sleeping Reinforcement Learning i 1 Iti P rkstyp, si,h s, aπip Bi,hq i,h au j h b V i,jpsi,jq 45H2SANk,hps, aq LG b 31H3S2ALG2Uk,h,s,a H4S5A22AL3G2. By applying Lemma D.10, we can bound the summation over typical episodes of the state-action value function bonus term as: i 1 Iti P rkstyp, si,h s, aπip Bi,hq i,h au j h b Q i,jpsi,j, aπip Bi,jq i,j q 28H2SANk,hps, aq LG 7 3HSAL2 5800 ? H3S5A22AL3G2 g f f f f f e i 1 It, si,h s, aπip Bi,hq i,h aur V i,j 1psi,j 1qq loooooooooooooooooooooooooooomoooooooooooooooooooooooooooon By applying the bound of term paq, as computed in Eq. (58), and plugging it into Eq. (59), we get: i 1 Iti P rkstyp, si,h s, aπip Bi,hq i,h au j h b Q i,jpsi,j, aπip Bi,jq i,j q 28H2SANk,hps, aq LG b 8H2SALGUk,h,s,a H3S5A22AL3G2. 
By applying Lemma D.9 we can bound the summation over typical episodes of the state-action wise model errors as: i 1 Iti P rkstyp, si,h s, aπip Bi,hq i,h au p p Pi Pq V j 1 psi,j, aπip Bi,jq i,j q 6H2SANk,hps, aq LG 2 H2SALGUk,h,s,a. By applying two pigeonhole arguments, combining the bounds, and accounting for the regret of non-typical episodes, we can then upper bound Eq. (56) as: Uk,h,s,a ď 92e b H3SA2ANk,hps, aq LG 16e b H3S2ALG2Uk,h,s,a H4S5A22AL3G2 11600H3S3A2AL3. H3SA2ANk,hps, aq LG 11996 ? H4S5A22AL3G2 ȷ 11600H3S3A2AL2G, Sleeping Reinforcement Learning we can solve for Uk,h,s,a as Uk,h,s,a ď 2α β2, and obtain the following bound: Uk,h,s,a ď 184e b H3SA2ANk,hps, aq LG 23992e ? H4S5A22AL3G2 23200H3S3A2AL2G 256e2H3S2ALG2 H3S3A2ANk,hps, aq L2G, (60) where Eq. (60) follows from the application of the rkstyp condition. Plugging this result into Eq. (55), and observing that the error cannot be greater than H, we get the following bound to the estimation error of the state-action value function due to the optimistic approach: p Qk,hps, aq Q hps, aq ď min Nk,hps, aq , H thus concluding the proof. Lemma D.13 (Upper bound of state value function estimation error). Let k P JKK and h P JHK. Let πk be the policy followed during episode k. Under E and Ωk,h, the following holds for every ps, aq P S ˆ A: p Vk,hpsq V h psq ď min Nk,hpsq , H Proof. We begin the proof by observing that, under Ωk,h, p Qk,hps, aq ě Q hps, aq for every s P S and a P A. We can rewrite: p Vk,hpsq V h psq 1 Nk,hpsq i 1 Itsi,h su p Vk,hpsi,hq V h psi,hq ď 1 Nk,hpsq i 1 Itsi,h su p Vi,hpsi,hq V πi h psi,hq (61) i 1 Itsi,h sur V i,hpsi,hq, (62) where Eq. (61) follows from the fact that p Vk,h is monotonically decreasing in k by definition and by observing that V h ě V πi h . 
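Here and in the surrounding lemmas, the final step resolves a self-bounding inequality of the form U <= alpha + beta * sqrt(U) into the explicit bound U <= 2*alpha + beta^2. A minimal numerical check of this elementary fact:

```python
import math
import random

# If U <= alpha + beta*sqrt(U), then sqrt(U) is at most the larger root of
# x^2 - beta*x - alpha = 0, and squaring that root shows U <= 2*alpha + beta^2.
random.seed(0)
for _ in range(1000):
    alpha, beta = random.uniform(0, 100), random.uniform(0, 100)
    x = (beta + math.sqrt(beta**2 + 4 * alpha)) / 2   # larger root in sqrt(U)
    U_max = x**2                                       # largest feasible U
    assert U_max <= 2 * alpha + beta**2 + 1e-9
```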
Recalling the upper bound of řk i 1 Itsi,h sur V i,hpsi,hq: i 1 Itsi,h sur V i,jpsi,jq ď e2 kÿ b V i,jpsi,jq 2b Q i,jpsi,j, aπip Bi,jq i,j q 2 p p Pi Pq V j 1 psi,j, aπip Bi,jq i,j q EB Cindp |si,jq 3Nipsi,j, aπip Bq i,j q 3Nipsi,j, aπip Bi,jq i,j q 2H22AL Nipsi,jq 2H2AL 2H2L Nipsi,jq H3Nk,hpsq L 6e2b HNk,hpsq L 4e2b HNk,hpsq L, Sleeping Reinforcement Learning by applying several pigeonhole arguments, we can derive the following upper bound: i 1 Itsi,h sur V i,jpsi,jq ď e2 kÿ b V i,jpsi,jq 2b Q i,jpsi,j, aπip Bi,jq i,j q 2 p p Pi Pq V j 1 psi,j, aπip Bi,jq i,j q ı 8e2H2S22ALG e2b 2H3SNk,hpsq L 4e2b H3Nk,hpsq L 10e2b where Eq. (63) is obtained observing that, when applying the pigeonhole argument over a summation in which the argument depends on the expectation over B, the worst-case allocation of summation terms is the one in which all actions are available. We now bound the summations over the typical episodes of the different terms. By applying Lemma D.10, we can bound the summation over typical episodes of the state-action value function bonus term as: i 1 Iti P rkstyp, si,h su j h b Q i,jpsi,j, aπip Bi,jq i,j q ď b 28H2SANk,hpsq LG b 8H2SALGUk,h,s 3HSALG 5800 ? H3S5A22AL3G2. By applying Lemma D.11, we can bound the summation over typical episodes of the state value function bonus term as: i 1 Iti P rkstyp, si,h su j h b V i,jpsi,jq ď b 40H2SANk,hpsq LG b 31H3S2ALG2Uk,h,s 3HSLG 2700 ? H3S5A22AL3G2. By applying Lemma D.9 we can bound the summation over typical episodes of the state-action wise model errors as: i 1 Iti P rkstyp, si,h su p p Pi Pq V j 1 psi,j, aπip Bi,jq i,j q 6H2SANk,hpsq LG 2 H2SALGUk,h,s. By combining the bounds into Eq. (63), we get: Uk,h,s ď 38e2b H3SA2ANk,hpsq LG 16e2b H3S2ALG2Uk,h,s H3S5A22AL3G2 11600H3S3A2AL3 (64) H3SA2ANk,hpsq LG 14317e2? H3S5A22AL3G2 11600H3S3A2AL2G, H3S2ALG2, (65) Sleeping Reinforcement Learning we can solve for Uk,h,s as Uk,h,s ď 2α β2, and obtain the following bound: Uk,h,s ď 76e2b H3SA2ANk,hpsq LG 28634e2? 
H3S5A22AL3G2 23200H3S3A2AL2G 256e4H3S2ALG2 H3S3A2ANk,hpsq L2G, (66) where Eq. (66) holds if Nk,hpsq ě ? 11600H3S3A2AL2G. Plugging this result into Eq. (62), and observing that the error cannot be greater than H, we get the following bound to the estimation error of the state-action value function due to the optimistic approach: p Vk,hpsq V h psq ď min Nk,hpsq , H thus concluding the proof. Lemma D.14 (Optimism). Let the optimistic bonuses be defined as: b Q k,hps, aq g f f e4L Vars1 p Pkp |s,aqrp Vk,h 1ps1qs Nkps, aq 7HL 3p Nkps, aq 1q g f f e4Es1 p Pkp |s,aqrmint 29002H3S3A2AL3 N1 k,h 1ps1q , H2us g f f e4L Var B p Cind k p |sqr p Qk,hps, πk,hps, Bqqs Nkpsq 7HL 3p Nkpsq 1q g f f e4EB p Cind k p |sqrmint 13502H3S3A2AL3 Nk,hps,πk,hps,Bqq , H2us Then, under event E, the following set of events hold: Ωk,h : ! p Vi,jpsq ě V j psq p Qi,jps, aq ě Q j ps, aq, @pi, jq P rk, hshist, s P S, a P A ) , for k P JKK and h P JHK, where: rk, hshist : tpi, jq : i P JKK, j P JHK, pi ă kq _ pi k, j ě hqu. Proof. We demonstrate this result by induction. We begin by observing that p Vk,H 1psq V H 1psq 0 and p Qk,H 1ps, aq Q H 1ps, aq 0 for every k P JKK, s P S, and a P A. To prove the induction, we need to prove that, if Ωk,h 1 holds, then also Ωk,h holds. We prove this result for a generic k P JKK, and we observe that we can then apply this procedure recursively for all values of k, starting from k 1. We begin by demonstrating that p Qk,hps, aq ě Q hps, aq. Let us recall the definition of p Qk,hps, aq: Sleeping Reinforcement Learning p Qk,hps, aq Rps, aq b Q k,hps, aq p Pk p Vk,h 1 ps, aq, observing that if p Qk,hps, aq ě H h, the optimism holds trivially. We can write: p Qk,hps, aq Q hps, aq b Q k,hps, aq p Pk p Vk,h 1 ps, aq PV h 1 ps, aq b Q k,hps, aq p Pkpp Vk,h 1 V h 1q ps, aq p p Pk Pq V h 1 ps, aq ě b Q k,hps, aq p p Pk Pq V h 1 ps, aq (67) where Eq. (67) follows from the induction assumption. 
Under event E, we can apply the empirical Bernstein inequality (Maurer & Pontil, 2009): ˇˇˇ p p Pk Pq V h 1 ps, aq ˇˇˇ ď 2p V k,h 1ps, aq L Nkps, aq 7HL 3p Nkps, aq 1q, where p V k,h 1ps, aq : Vars1 p Pkp |s,aqr V h 1ps1qs. As such, we obtain: p Qk,hps, aq Q hps, aq ě b Q k,hps, aq 2p V k,h 1ps, aq L Nkps, aq 7HL 3p Nkps, aq 1q 4p Vk,h 1ps, aq L 2p V k,h 1ps, aq L Nkps, aq loooooooooooooooooooooooomoooooooooooooooooooooooon s1PS p Pkps1|s, aq min ! 29002H3S3A22AL3 N1 k,h 1ps1q , H2 ) Nkps, aq . (68) Observing that: 2p V k,h 1ps,aq 4p Vk,h 1ps,aq Nkps,aq if p Vk,h 1ps, aq ď p V k,h 1ps, aq, 0 otherwise, we now bound p V k,h 1 in terms of p Vk,h 1 from above. Observing that: Varr Xs E r X Er Xss2 E r X Y Er Xs Er Y ss2 E rp X Y q Er X Y s Y Er Y ss2 ď 2E rp X Y q Er X Y ss2 2E r Y Er Y ss2 2 Varr X Y s 2 Varr Y s, (69) we can then write: Sleeping Reinforcement Learning p V k,h 1ps, aq ď 2p Vk,h 1ps, aq 2 Var y p Pkp |s,aq r V h 1ps1q p Vk,h 1ps1qs ď 2p Vk,h 1ps, aq 2 ÿ s1PS p Pkps1|s, aq p Vk,h 1ps1q V h 1ps1q 2 . By combining this bound with the result of Lemma D.13, we obtain the following bound on term paq: s1PS p Pkps1|s,aq min " 29002H3S3A22AL3 N1 k,h 1ps1q ,H2 * Nkps,aq if p Vk,h 1ps, aq ď p V k,h 1ps, aq, 0 otherwise. By plugging this result into Eq. (68), we obtain that p Qk,hps, aq Q hps, aq ě 0. We now demonstrate that p Vk,hpsq ě V h psq. Let us recall the definition of p Vk,hpsq: p Vk,hpsq mintp Vk 1,hpsq, H, b V k,hpsq EB p Cind k p |sqr p Qk,hps, aπkp Bq k,h qsu. Again, observe that, if p Vk,hpsq H, the optimism holds trivially. Moreover, if p Vk,hpsq p V k 1, hpsq, the optimism holds trivially under Ωk,h. As such, we only need to demonstrate the case in which p Vk,hpsq b V k,hpsq EB p Cind k p |sqr p Qk,hps, aπkp Bq k,h qs. 
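The variance decomposition of Eq. (69), Var[X] <= 2 Var[X - Y] + 2 Var[Y], follows from writing X = (X - Y) + Y and applying (u + v)^2 <= 2u^2 + 2v^2 to the centered terms. A quick randomized verification (illustrative only; it holds for sample variances as well, since the pointwise inequality is averaged):

```python
import numpy as np

# Eq. (69): Var[X] <= 2*Var[X - Y] + 2*Var[Y], for any coupling Y of X.
rng = np.random.default_rng(0)
for _ in range(100):
    X = rng.normal(size=10_000) ** 2            # an arbitrary random variable
    Y = X + rng.uniform(-1, 1, size=10_000)     # a coupled perturbation of X
    assert X.var() <= 2 * (X - Y).var() + 2 * Y.var() + 1e-9
```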
We can write: p Vk,hpsq V h psq b V k,hpsq p Cind k p Qk,h psq Cind Q h psq b V k,hpsq p Cind k p p Qk,h Q hq psq p p Cind k Cindq Q h psq ě b V k,hpsq ÿ BPPp Aq p Cind k p B|sq p Qk,hps, π hps, Bqq Q hps, π hps, Bqq p p Cind k Cindq Q h psq (70) ě b V k,hpsq p p Cind k Cindq Q h psq, (71) where Eq. (70) derives by observing that πk is the greedy policy w.r.t. p Vk,h, and Eq. (71) follows from the optimism over the state-action value function we just demonstrated. Under event E, we can apply the empirical Bernstein inequality: ˇˇˇ p p Cind k Cindq Q h psq ˇˇˇ ď 2p Q k,hpsq L Nkpsq 7HL 3p Nkpsq 1q, (72) where p Q k,hpsq : Var B p Cind k p |sqr Q hps, aπ p Bq k,h qs. As such, we obtain: p Vk,hpsq V h psq ě b V k,hpsq 2p Q k,hpsq L Nkpsq 7HL 3p Nkpsq 1q Sleeping Reinforcement Learning 4p Qk,hpsq L 2p Q k,hpsq L Nkpsq loooooooooooooooooomoooooooooooooooooon g f f f e4 ř BPPp Aq p Cind k p B|sq min " 13502H3S3A22AL3 Nk,hps,a πkp Bq k,h q , H2 * Nkpsq . (73) Observing that: 2p Q k,hpsq 4p Qk,hpsq Nkpsq if p Qk,hpsq ď p Q k,hpsq, 0 otherwise, we now bound p Q k,h in terms of p Qk,h from above. By applying the result of Eq. (69), we get that: p Q k,hpsq ď 2p Qk,hpsq 2 Var B p Cind k p |sq r Q hps, aπkp Bq k,h q p Qk,hps, aπkp Bq k,h qs ď 2p Qk,hpsq 2 ÿ BPPp Aq p Cind k p B|sq Q hps, aπkp Bq k,h q p Qk,hps, aπkp Bq k,h q 2 . (74) By combining this bound with the result of Lemma D.12, we obtain the following bound on term pbq: g f f e 4 ř BPPp Aq p Cind k p B|sq min 13502H3S3A22AL3 Nk,hps,aπkp Bq k,h q ,H2 + Nkpsq if p Qk,hpsq ď p Q k,hpsq, 0 otherwise. By plugging this result into Eq. (73), we finally obtain that p Vk,hpsq ě V h psq, thus demonstrating optimism. Theorem 5.2 (Regret Upper Bound S-UCBVI with independent availability and per-stage disclosure). For any δ P p0, 1q, with probability 1 δ, the per-stage disclosure regret of S-UCBVI on any Sle MDP with per-stage disclosure independent action availabilities is bounded by: RPSp S-UCBVI, Tq ď512H ? 
4982H6S3A2AL2G, where L logp80HS2A2AT{δq and G logp HSATq. In particular, for T ě Ωp H10S5A422Aq and selecting δ 2A{T, we have: E r RPSp S-UCBVI, Tqs ď r O H ? Proof. This proof adapts the result of (Azar et al., 2017, Theorem 2) to the Sleeping MDP setting in the case of i.i.d. action set availability and per-stage disclosure. We start by considering a Sleeping MDP in which the transition probabilities are stage-independent, as the generalization is straightforward and we consider it at the end of the proof. Let events E and Ωk,h hold. Under these events, Lemma D.14 holds. As such, we define the regret suffered by an algorithm A after T time steps as: i 1 p V 1 psi,1q V πi 1 psi,1qq : V i,1psi,1q. Sleeping Reinforcement Learning We can define a pseudo-regret suffered by an algorithm A as: p Vi,1psi,1q ˆV πi 1 psi,1q : r V i,1psi,1q. By applying Lemma D.1, we first observe that řk i 1 V i,1psi,1q ď řk i 1 r V i,1psi,1q, and we decompose the pseudo-regret as: i 1 r V i,hpsi,hq ď e2 K ÿ b V i,jpsi,jq EB Cindp |si,jqrb Q i,jpsi,j, aπip Bq i,j qs 1 H b Q i,jpsi,j, aπip Bi,jq i,j q EB Cindp |si,jq εV i,j,B 1 H εV i,j EB Cindp |si,jq ? 2LεV i,j,B ı 2L H εV i,j ? EB Cindp |si,jq p p Pi Pq V j 1 psi,j, aπip Bq i,j q ı 1 p p Pi Pq V j 1 psi,j, aπip Bi,jq i,j q EB Cindp |si,jq 3Nipsi,j, aπip Bq i,j q 3Nipsi,j, aπip Bi,jq i,j q Nipsi,jq 2H2AL 2Qπi j psi,jq L Nipsi,jq 2HL 3Nipsi,jq By applying several pigeonhole arguments, we obtain: i 1 r V i,hpsi,hq ď e2 K ÿ b V i,jpsi,jq EB Cindp |si,jqrb Q i,jpsi,j, aπip Bq i,j qs 1 H b Q i,jpsi,j, aπip Bi,jq i,j q EB Cindp |si,jq εV i,j,B 1 H εV i,j EB Cindp |si,jq ? 2LεV i,j,B ı 2L H εV i,j ? 
EB Cindp |si,jq p p Pi Pq V j 1 psi,j, aπip Bq i,j q ı 1 p p Pi Pq V j 1 psi,j, aπip Bi,jq i,j q 2Qπi j psi,jq L Nipsi,jq looooooomooooooon 9e2H2S2A2ALG (75) We can bound the summation of term paq over typical episodes as follows: i 1 Iti P rkstypu i 1 Iti P rkstypu 2Qπi j psi,jq L Nipsi,jq j 1 Qπi j psi,jq i 1 Iti P rkstypu 1 Nipsi,jq (76) 4HSTLG, (78) where Eq. (76) is obtained by applying Cauchy-Schwarz s inequality, Eq. (77) follows from the application of Lemma D.7 and of a pigeonhole argument, and Eq. (78) holds under rkstyp. Sleeping Reinforcement Learning By applying this bound, together with Lemmas D.3, D.9, D.10, D.11, the condition of rkstyp, accounting for the regret of non-typical episodes, and rearranging terms, we get that: UK,1 ď 256 a HSATk LG 117392H3S3A2AL2G 113 b H2S2ALG2UK,1. (79) By letting: HSATLG 117392H3S3A2AL2G, H2S2ALG2, (80) we can solve for UK,1 as UK,1 ď 2α β2, and obtain the following bound: UK,1 ď 512 ? HSATLG 247533H3S3A2AL2G, which we can plug into Eq. (75) to get the following upper bound on the regret: RPSp S-UCBVI, Tq ď 512 ? HSATLG 4982H3S3A2AL2G. Considering now stage-dependent state transitions, we can address this generalization by considering a new Sleeping MDP with state space X such that |X| SH. As such, the regret in this case is upper bounded by: RPSp S-UCBVI, Tq ď O ? H2SATLG H6S3A2AL2G : UBPSpδq, w.p. at least 1 δ. To obtain an upper bound for the expected regret in the case of stage-independent transitions, we select δ 2A{T, and obtain that, if T ě 2A, it holds that: E r RPSp S-UCBVI, Tqs E r RPSp S-UCBVI, Tq I t RPSp S-UCBVI, Tq ď UBPSpδqus E r RPSp S-UCBVI, Tq I t RPSp S-UCBVI, Tq ą UBPSpδqus ď UBPSpδq Tδ (81) HSAT log p80HS2AT 2q2 logp HSATq 4982H3S3A2A logp80HS2AT 2q2 logp HSATq 2A, (82) where Equation (81) follows by applying the high probability regret upper bound and by observing that RPSp S-UCBVI, Tq ď T. 
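The passage from the high-probability bound to the expectation bound in Eq. (81) can be spelled out as follows, using that the regret never exceeds T almost surely and that the high-probability event fails with probability at most delta:

```latex
\mathbb{E}[R] = \mathbb{E}\big[R\,\mathbb{I}\{R \le \mathrm{UB}(\delta)\}\big]
             + \mathbb{E}\big[R\,\mathbb{I}\{R > \mathrm{UB}(\delta)\}\big]
\le \mathrm{UB}(\delta) + T\,\mathbb{P}\big[R > \mathrm{UB}(\delta)\big]
\le \mathrm{UB}(\delta) + T\delta.
```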
Moving to stage-dependent state transitions, we can generalize as above and rewrite Equation (82) as:

E[R_PS(S-UCBVI, T)] ≤ 512H √(SAT log(80H^3 S^2 A T^2)^2 log(H^2 SAT)) + 4982 H^6 S^3 A 2^A log(80H^3 S^2 A T^2)^2 log(H^2 SAT) + 2^A = Õ(H √(SAT) + H^6 S^3 A 2^A). (83)

Finally, we observe that whenever T ≥ Ω(H^10 S^5 A 2^{2A}), the upper bound on the expected regret is of order Õ(H √(SAT)), thus concluding the proof.

E. Numerical Validation

In this appendix, we propose the Stochastic Frozen Lake setting and numerically validate our S-UCBVI against UCBVI, showing the efficacy of exploiting the knowledge of action availability. The code to reproduce the experiments is available at https://github.com/marcomussi/SleepingRL.

Setting. The Stochastic Frozen Lake environment is a modification of the well-known Frozen Lake in which holes in the lake open and close stochastically, effectively limiting the action availability of the agent at random during the episode. The probability that a cell of the grid is a hole at any given stage is denoted by the parameter p; the goal cell and the cell in which the agent is located at the beginning of the stage can never be holes. We vary the probability of holes in the lake as p ∈ {0, 0.5, 0.75} and the grid size of the lake as G ∈ {2, 3, 4}. We consider a horizon H = 10 to ensure that the agent can reach the goal. We run K = 2·10^5 episodes and compare S-UCBVI and UCBVI in terms of instantaneous reward averaged over 5 runs, with a 95% confidence interval. We also report the optimum, computed a priori, for reference.

Results. The results of the experiment are reported in Figure 7. We observe that, when p = 0, i.e., there are no holes in the lake, both S-UCBVI and UCBVI achieve the optimal instantaneous reward.
As p and G increase, we observe that S-UCBVI still achieves the optimum, whereas UCBVI settles at a suboptimal value, with the gap between the two algorithms widening as both parameters grow.

Figure 7. Performance in terms of instantaneous reward in the Stochastic Frozen Lake environment with horizon H = 10, number of episodes K = 2·10^5, hole probability p ∈ {0, 0.5, 0.75}, and grid size G ∈ {2, 3, 4} (5 runs, mean ± 95% C.I.). [Nine panels, (a)-(i), one per combination of p and G, each comparing the a-priori Optimum, S-UCBVI, and UCBVI.]
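To make the availability mechanism of the Stochastic Frozen Lake concrete, here is a minimal, hypothetical sketch (the function name and grid encoding are ours, not the authors'; their actual implementation is in the linked repository): at each stage, every cell except the goal and the agent's current cell independently becomes a hole with probability p, which removes the actions leading into holes.

```python
import random

def available_actions(agent, goal, G, p, rng):
    """Return the moves whose target cell lies inside the G x G grid
    and is not currently a hole (holes resampled i.i.d. each stage)."""
    moves = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
    holes = {(x, y) for x in range(G) for y in range(G)
             if (x, y) not in (agent, goal) and rng.random() < p}
    avail = []
    for name, (dx, dy) in moves.items():
        nx, ny = agent[0] + dx, agent[1] + dy
        if 0 <= nx < G and 0 <= ny < G and (nx, ny) not in holes:
            avail.append(name)
    return avail

rng = random.Random(0)
# With p = 0 no cell is ever a hole, so only the grid boundary limits actions,
# matching the observation that S-UCBVI and UCBVI behave alike in that case.
assert set(available_actions((0, 0), (1, 1), 2, 0.0, rng)) == {"down", "right"}
```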