# Policy Certificates: Towards Accountable Reinforcement Learning

Christoph Dann¹, Lihong Li², Wei Wei², Emma Brunskill³

¹Carnegie Mellon University, ²Google Research, ³Stanford University. Correspondence to: Christoph Dann.

*Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).*

**Abstract.** The performance of a reinforcement learning algorithm can vary drastically during learning because of exploration. Existing algorithms provide little information about the quality of their current policy before executing it, and thus have limited use in high-stakes applications like healthcare. We address this lack of accountability by proposing that algorithms output policy certificates. These certificates bound the sub-optimality and return of the policy in the next episode, allowing humans to intervene when the certified quality is not satisfactory. We further introduce two new algorithms with certificates and present a new framework for theoretical analysis that guarantees the quality of their policies and certificates. For tabular MDPs, we show that computing certificates can even improve the sample-efficiency of optimism-based exploration. As a result, one of our algorithms is the first to achieve minimax-optimal PAC bounds up to lower-order terms, and this algorithm also matches (and in some settings slightly improves upon) existing minimax regret bounds.

## 1. Introduction

There is increasing excitement around applications of machine learning, but also growing awareness and concerns about fairness, accountability and transparency. Recent research aims to address these concerns, but most work focuses on supervised learning and only a few results (Jabbari et al., 2016; Joseph et al., 2016; Kannan et al., 2017; Raghavan et al., 2018) exist for reinforcement learning (RL). One challenge when applying RL in practice is that, unlike in supervised learning, the performance of an RL algorithm is typically not monotonically increasing with more data, due to the trial-and-error nature of RL that necessitates exploration. Even sharp drops in policy performance during learning are common, e.g., when the agent starts to explore a new part of the state space. Such unpredictable performance fluctuation has limited the use of RL in high-stakes applications like healthcare, and calls for more accountable algorithms that can quantify and reveal their performance online during learning.

To address this lack of accountability, we propose that RL algorithms output policy certificates in episodic RL. Policy certificates consist of (1) a confidence interval on the algorithm's expected sum of rewards (return) in the next episode (policy return certificates) and (2) a bound on how far from the optimal return the performance can be (policy optimality certificates). Certificates make the policy's performance more transparent and accountable, and allow designers to intervene if necessary. For example, in medical applications, one would need to intervene unless the policy achieves a certain minimum treatment outcome; in financial applications, policy optimality certificates can be used to assess the potential loss when learning a trading strategy. In addition to accountability, we also want RL algorithms to be sample-efficient and quickly achieve good performance.
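To make the intended use concrete, here is a minimal Python sketch of how a designer could act on the certificates announced before each episode. The names (`PolicyCertificate`, `should_intervene`) and the thresholding rules are illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PolicyCertificate:
    """Certificate announced before an episode (illustrative structure)."""
    return_lower: float    # lower end of the return confidence interval I_k
    return_upper: float    # upper end of the return confidence interval I_k
    optimality_gap: float  # policy optimality certificate eps_k >= Delta_k

def should_intervene(cert: PolicyCertificate,
                     min_return: Optional[float] = None,
                     max_gap: Optional[float] = None) -> bool:
    """Flag the next episode for human intervention if certified quality is inadequate."""
    if min_return is not None and cert.return_lower < min_return:
        return True   # cannot guarantee the minimum required outcome
    if max_gap is not None and cert.optimality_gap > max_gap:
        return True   # policy may be far from optimal for this episode
    return False

# Example: require a certified return of at least 2.0 or a certified gap below 0.5.
cert = PolicyCertificate(return_lower=1.3, return_upper=2.1, optimality_gap=0.8)
print(should_intervene(cert, min_return=2.0, max_gap=0.5))  # True -> hand off to a human
```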
To formally quantify the accountability and sample-efficiency of an algorithm, we introduce a new framework for theoretical analysis called IPOC. IPOC bounds guarantee that certificates indeed bound the algorithm's expected performance in an episode, and prescribe the rate at which the algorithm's policy and certificates improve with more data. IPOC is stronger than other frameworks like regret (Jaksch et al., 2010), PAC (Kakade, 2003) and Uniform-PAC (Dann et al., 2017), which only guarantee the cumulative performance of the algorithm but do not provide bounds for individual episodes during learning. IPOC also provides stronger bounds and more nuanced guarantees on per-episode performance than KWIK (Li et al., 2008).

A natural way to create accountable and sample-efficient RL algorithms is to combine existing sample-efficient algorithms with off-policy policy evaluation approaches to estimate the return (expected sum of rewards) of the algorithm's policy before each episode. Existing policy evaluation approaches estimate the return of a fixed policy from a batch of data (e.g., Thomas et al., 2015b; Jiang & Li, 2016; Thomas & Brunskill, 2016). They provide little to no guarantees when the policy is not fixed but computed from that same batch of data, as is the case here. They also do not reason about the return of the unknown optimal policy, which is necessary for providing policy optimality certificates. We found that by focusing on optimism-in-the-face-of-uncertainty (OFU) based RL algorithms for updating the policy and model-based policy evaluation techniques for estimating the policy returns, we can create sample-efficient algorithms that compute policy certificates on both the current policy's return and its difference to the optimal return. The main insight is that OFU algorithms compute an upper confidence bound on the optimal return from an empirical model when updating the policy. Model-based policy evaluation can leverage the same empirical model to compute a confidence interval on the policy return, even when the policy depends on the data. We illustrate this approach with new algorithms for two different episodic settings.

Perhaps surprisingly, we show that in tabular Markov decision processes (MDPs) it can be beneficial to explicitly leverage the combination of OFU-based policy optimization and model-based policy evaluation to improve either component. Specifically, computing the certificates can directly improve the underlying OFU approach, and knowing that the policy converges to the optimal policy at a certain rate improves the accuracy of policy return certificates. As a result, the guarantees for our new algorithm improve state-of-the-art regret and PAC bounds in problems with large horizons and are minimax-optimal up to lower-order terms.
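Informally, the insight above can be written as a sandwich argument (using the notation introduced later in Sections 2 and 4: $\rho_k(\pi_k)$ is the return of the played policy, $\rho^*_k$ the optimal return, and $\underline V_{k,1}, \tilde V_{k,1}$ the pessimistic and optimistic value estimates computed from the same empirical model):

$$\underline V_{k,1}(s_{k,1}) \;\le\; \rho_k(\pi_k) \;\le\; \rho^*_k \;\le\; \tilde V_{k,1}(s_{k,1}),$$

so the interval $I_k = [\underline V_{k,1}(s_{k,1}), \tilde V_{k,1}(s_{k,1})]$ is a valid return certificate and its length is a valid optimality certificate, $\epsilon_k \ge \rho^*_k - \rho_k(\pi_k)$.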
The second setting we consider is finite MDPs with linear side information (context) (Abbasi-Yadkori & Neu, 2014; Hallak et al., 2015; Modi et al., 2018), which is of particular interest in practice. For example, in a drug treatment optimization task where each patient is one episode, the context is the background information of the patient which influences the treatment outcome. While one expects the algorithm to learn a good policy quickly for frequent contexts, the performance for unusual patients may be significantly more variable due to the limited prior experience of the algorithm. Policy certificates allow humans to detect when the current policy is good for the current patient and to intervene if the certified performance is deemed inadequate. For example, in this health monitoring application, a human expert could intervene to directly specify the policy for that episode; in the context of automated customer service, the service could be provided at reduced cost to the customer.

To summarize, we make the following main contributions:

1. We introduce policy certificates and the IPOC framework for evaluating RL algorithms with certificates. Similar to existing frameworks like PAC, it provides formal requirements to be satisfied by the algorithm, here requiring the algorithm to be an efficient learner and to quantify its performance online through policy certificates.
2. We provide a new RL algorithm for finite, episodic MDPs that satisfies this definition, and show that it has stronger, minimax regret and PAC guarantees than prior work. Formally, our sample complexity bound is $\tilde O(SAH^2/\epsilon^2 + S^2AH^3/\epsilon)$ vs. the prior $\tilde O(SAH^4/\epsilon^2 + S^2AH^3/\epsilon)$ (Dann et al., 2017), and our regret bound $\tilde O(\sqrt{SAH^2T} + S^2AH^3)$ improves prior work (Azar et al., 2017) since it achieves the minimax rate up to log-terms in the dominant term even for long horizons $H > SA$.
3. We introduce a new RL algorithm for finite, episodic MDPs with linear side information that has a cumulative IPOC bound, which is tighter than past results (Abbasi-Yadkori & Neu, 2014) by a factor of $\sqrt{SAH}$.

## 2. Setting and Notation

We consider episodic RL problems where the agent interacts with the environment in episodes of a certain length. While the framework for policy certificates applies more broadly, we focus on finite MDPs with linear side information (Modi et al., 2018; Hallak et al., 2015; Abbasi-Yadkori & Neu, 2014) for concreteness. This setting includes tabular MDPs as a special case but is more general and can model variations in the environment across episodes, e.g., because different episodes correspond to treating different patients in a healthcare application. Unlike in the tabular special case, function approximation is necessary for efficient learning.

**Tabular MDPs.** The agent interacts with the MDP in episodes indexed by $k$. Each episode is a sequence $(s_{k,1}, a_{k,1}, r_{k,1}, \ldots, s_{k,H}, a_{k,H}, r_{k,H})$ of $H$ states $s_{k,h} \in \mathcal{S}$, actions $a_{k,h} \in \mathcal{A}$ and scalar rewards $r_{k,h} \in [0,1]$. For notational simplicity, we assume that the initial state $s_{k,1}$ is deterministic. The actions are taken as prescribed by the agent's policy $\pi_k$, and we here focus on deterministic time-dependent policies, i.e., $a_{k,h} = \pi_k(s_{k,h}, h)$ for all time steps $h \in [H] := \{1, 2, \ldots, H\}$. The successor states and rewards are sampled from the MDP as $s_{k,h+1} \sim P(s_{k,h}, a_{k,h})$ and $r_{k,h} \sim P_R(s_{k,h}, a_{k,h})$. In tabular MDPs the sizes of the state space $S = |\mathcal{S}|$ and action space $A = |\mathcal{A}|$ are finite.
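For concreteness, a minimal sketch of one episode of interaction under this notation; the environment representation (`P`, `R` arrays) and the Bernoulli reward model are illustrative assumptions rather than part of the paper's setting.

```python
import numpy as np

def run_episode(P, R, pi, H, s0=0, rng=np.random.default_rng(0)):
    """Roll out one episode of length H in a tabular MDP.

    P:  transition probabilities, shape (S, A, S); P[s, a] is a distribution over next states.
    R:  mean rewards in [0, 1], shape (S, A); modeled here as Bernoulli for simplicity.
    pi: deterministic time-dependent policy, shape (H, S); pi[h, s] is the action at step h.
    """
    s, trajectory, total_reward = s0, [], 0.0
    for h in range(H):
        a = pi[h, s]                                   # a_{k,h} = pi_k(s_{k,h}, h)
        r = float(rng.random() < R[s, a])              # r_{k,h} ~ P_R(s, a)
        s_next = rng.choice(len(P[s, a]), p=P[s, a])   # s_{k,h+1} ~ P(. | s, a)
        trajectory.append((s, a, r))
        total_reward += r
        s = s_next
    return trajectory, total_reward  # total_reward is one sample of the return rho_k(pi)
```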
**Finite MDPs with linear side information.** We assume that state and action space are finite as in tabular MDPs, but here the agent essentially interacts with a family of infinitely many tabular MDPs that is parameterized by linear contexts. At the beginning of episode $k$, two contexts, $x^{(r)}_k \in \mathbb{R}^{d^{(r)}}$ and $x^{(p)}_k \in \mathbb{R}^{d^{(p)}}$, are observed and the agent interacts in this episode with a tabular MDP whose dynamics and reward function depend on the contexts in a linear fashion. Specifically, it is assumed that the rewards are sampled from $P_R(s,a)$ with means $r_k(s,a) = (x^{(r)}_k)^\top \theta^{(r)}_{s,a}$ and the transition probabilities are $P_k(s'|s,a) = (x^{(p)}_k)^\top \theta^{(p)}_{s',s,a}$, where $\theta^{(r)}_{s,a} \in \mathbb{R}^{d^{(r)}}$ and $\theta^{(p)}_{s',s,a} \in \mathbb{R}^{d^{(p)}}$ are unknown parameter vectors for each $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$. As a regularity condition, we assume bounded parameters, i.e., $\|\theta^{(r)}_{s,a}\|_2 \le \xi_{\theta^{(r)}}$ and $\|\theta^{(p)}_{s',s,a}\|_2 \le \xi_{\theta^{(p)}}$, as well as bounded contexts $\|x^{(r)}_k\|_2 \le \xi_{x^{(r)}}$ and $\|x^{(p)}_k\|_2 \le \xi_{x^{(p)}}$. We allow $x^{(r)}_k$ and $x^{(p)}_k$ to be different, and use $x_k$ to denote $(x^{(r)}_k, x^{(p)}_k)$ in the following. Note that our framework and algorithms can handle adversarially chosen contexts.

**Return and optimality gap.** The quality of a policy $\pi$ in any episode $k$ is evaluated by the total expected reward or return $\rho_k(\pi) := \mathbb{E}\big[\sum_{h=1}^H r_{k,h} \,\big|\, a_{k,1:H} \sim \pi\big]$, where this notation means that all actions in the episode are taken as prescribed by the policy $\pi$. The optimal policy and its return $\rho^*_k = \max_\pi \rho_k(\pi)$ may depend on the episode's contexts. The difference between achieved and optimal return is called the optimality gap $\Delta_k = \rho^*_k - \rho_k(\pi_k)$ for each episode $k$, where $\pi_k$ is the algorithm's policy in that episode.

**Additional notation.** We denote the largest possible optimality gap by $\Delta_{\max} = H$, and the value functions of $\pi$ in episode $k$ by $Q^{\pi}_{k,h}(s,a) = \mathbb{E}[\sum_{t=h}^H r_{k,t} \,|\, a_{k,h} = a,\, a_{k,h+1:H} \sim \pi]$ and $V^{\pi}_{k,h}(s) = Q^{\pi}_{k,h}(s, \pi(s,h))$. Optimal versions are marked by a superscript $*$, and subscripts are omitted when unambiguous. We treat $P(s,a)$ as a linear operator, that is, $P(s,a)f = \sum_{s' \in \mathcal{S}} P(s'|s,a)f(s')$ for any $f: \mathcal{S} \rightarrow \mathbb{R}$. We also use $\sigma_q(f) = \sqrt{q(f - qf)^2}$ for the standard deviation of $f$ with respect to a state distribution $q$, and $V^{\max}_h = H - h + 1$ for all $h \in [H]$. We also use the common shorthand notation $a \vee b = \max\{a,b\}$ and $a \wedge b = \min\{a,b\}$ as well as $\tilde O(f) = O(f \cdot \mathrm{poly}(\log f))$.
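A small sketch of how the contexts instantiate a tabular MDP under this linear parameterization; the array layout and function name are illustrative, and the parameters are assumed to already yield valid rewards and transition distributions.

```python
import numpy as np

def contextual_mdp(theta_r, theta_p, x_r, x_p):
    """Instantiate the tabular MDP of one episode from its contexts.

    theta_r: reward parameters,     shape (S, A, d_r)    -> mean reward  r_k(s, a)    = <x_r, theta_r[s, a]>
    theta_p: transition parameters, shape (S, S, A, d_p) -> probability  P_k(s'|s, a) = <x_p, theta_p[s', s, a]>
    Assumes the parameters are such that rewards lie in [0, 1] and each P_k(.|s, a) sums to one.
    """
    r_k = theta_r @ x_r                            # shape (S, A)
    P_k = np.einsum("tsad,d->tsa", theta_p, x_p)   # shape (S', S, A)
    return r_k, np.moveaxis(P_k, 0, -1)            # P_k returned with shape (S, A, S')
```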
## 3. The IPOC Framework

During execution, the optimality gaps $\Delta_k$ are hidden and the algorithm only observes the sum of rewards, which is a sample of $\rho_k(\pi_k)$. This causes risk, as one does not know whether the algorithm is playing a good or potentially bad policy. We introduce a new learning framework that mitigates this limitation. This framework forces the algorithm to output its current policy $\pi_k$ as well as certificates $\epsilon_k \in \mathbb{R}^+$ and $I_k \subseteq \mathbb{R}$ before each episode $k$. The return certificate $I_k$ is a confidence interval on the return of the policy, while the optimality certificate $\epsilon_k$ informs the user how sub-optimal the policy can be for the current context, i.e., $\epsilon_k \ge \Delta_k$. Certificates allow one to intervene if needed. For example, in automated customer services, one might reduce the service price in episode $k$ if the certificate $\epsilon_k$ is above a certain threshold, since the quality of the provided service cannot be guaranteed. When there is no context, an optimality certificate upper bounds the sub-optimality of the current policy in any episode, which makes algorithms anytime interruptable (Zilberstein & Russell, 1996): one is guaranteed to always know a policy with improving performance. Our learning framework is formalized as follows:

**Definition 1 (Individual Policy Certificates (IPOC) Bounds).** An algorithm satisfies an individual policy certificate (IPOC) bound $F$ if, for a given $\delta \in (0,1)$, it outputs the current policy $\pi_k$, a return certificate $I_k \subseteq \mathbb{R}$ and an optimality certificate $\epsilon_k$ with $\epsilon_k \ge |I_k|$ before each episode $k$ (after observing the contexts) so that with probability at least $1-\delta$:

1. all return certificates contain the return of the policy $\pi_k$ played in episode $k$ and all optimality certificates are upper bounds on the sub-optimality of $\pi_k$, i.e., $\forall k \in \mathbb{N}: \epsilon_k \ge \Delta_k$ and $\rho_k(\pi_k) \in I_k$; and either

2a. for all numbers of episodes $T$, the cumulative sum of certificates is bounded, $\sum_{k=1}^T \epsilon_k \le F(W, T, \delta)$ (Cumulative Version), or

2b. for any threshold $\epsilon$, the number of times certificates can exceed the threshold is bounded as $\sum_{k=1}^{\infty} \mathbf{1}\{\epsilon_k > \epsilon\} \le F(W, \epsilon, \delta)$ (Mistake Version).

Here, $W$ can be (known or unknown) properties of the environment. If conditions 1 and 2a hold, we say the algorithm has a cumulative IPOC bound, and if conditions 1 and 2b hold, we say the algorithm has a mistake IPOC bound.

Condition 1 alone would be trivial to satisfy with $\epsilon_k = \Delta_{\max}$ and $I_k = [0, \Delta_{\max}]$, but condition 2 prohibits this by controlling the size of $\epsilon_k$ (and therefore the size of $|I_k| \le \epsilon_k$). Condition 2a bounds the cumulative sum of optimality certificates (similar to regret bounds), and condition 2b bounds the size of the superlevel sets of $\epsilon_k$ (similar to PAC bounds). We allow both alternatives because condition 2b is stronger but one sometimes can only prove condition 2a (see Appendix D¹). An IPOC bound simultaneously controls the quality of the certificates (how big $\epsilon_k - \Delta_k$ and $|I_k|$ are) as well as the optimality gaps $\Delta_k$ themselves; hence, an IPOC bound not only guarantees that the algorithm improves its policy but also that it becomes better at telling us how well the policy performs. Note that the condition $\epsilon_k \ge |I_k|$ in Definition 1 is natural, as any upper bound on $\rho^*_k$ is also an upper bound on $\rho_k(\pi_k)$, and is made for notational convenience.

We would like to emphasize that we provide certificates on the return, the expected sum of rewards, in the next episode. Due to the stochasticity in the environment, one in general cannot hope to accurately predict the sum of rewards directly. Since return is the default optimization criterion in RL, certificates for it are a natural starting point and relevant in many scenarios. Nonetheless, certificates for other properties of the sum-of-rewards distribution of a policy are an interesting direction for future work. For example, one might want certificates on properties that take into account the variability of the sum of rewards (e.g., conditional value at risk) in high-stakes applications, which are often the objective in risk-sensitive RL.

¹ See the full version of this paper at https://arxiv.org/abs/1811.03056.
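As an illustration of how the two variants of Definition 1 could be checked empirically in simulation (where the optimality gaps $\Delta_k$ are known), a small hypothetical sketch:

```python
def check_ipoc_log(eps, intervals, returns, gaps, threshold=None):
    """eps[k] = eps_k, intervals[k] = (lo, hi) = I_k, returns[k] = rho_k(pi_k), gaps[k] = Delta_k."""
    # Condition 1: every certificate is valid for its episode.
    valid = all(e >= g and lo <= ret <= hi
                for e, (lo, hi), ret, g in zip(eps, intervals, returns, gaps))
    # Condition 2a (cumulative version): sum of optimality certificates.
    cumulative = sum(eps)
    # Condition 2b (mistake version): how often certificates exceed a threshold epsilon.
    mistakes = sum(e > threshold for e in eps) if threshold is not None else None
    return valid, cumulative, mistakes
```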
### 3.1. Relation to Existing Frameworks

Unlike IPOC, existing frameworks for RL only guarantee sample-efficiency of the algorithm over multiple episodes and do not provide performance bounds for single episodes during learning. The common existing frameworks are:

- Mistake-style PAC bounds (Strehl et al., 2006; 2009; Szita & Szepesvári, 2010; Lattimore & Hutter, 2012; Dann & Brunskill, 2015) bound the number of $\epsilon$-mistakes, that is, the size of the set $\{k \in \mathbb{N} : \Delta_k > \epsilon\}$, with high probability, but do not tell us when mistakes happen. The same is true for the stronger Uniform-PAC bounds (Dann et al., 2017), which hold for all $\epsilon$ jointly.
- Supervised-learning-style PAC bounds (Kearns & Singh, 2002; Jiang et al., 2017; Dann et al., 2018) ensure that the algorithm outputs an $\epsilon$-optimal policy for a given $\epsilon$, i.e., they ensure $\Delta_k \le \epsilon$ for $k$ greater than the bound. Yet, they need to know $\epsilon$ ahead of time and tell us nothing about $\Delta_k$ during learning (for $k$ smaller than the bound).
- Regret bounds (Osband et al., 2013; 2016; Azar et al., 2017; Jin et al., 2018) control the cumulative sum of optimality gaps $\sum_{k=1}^T \Delta_k$ (regret), which does not yield any non-trivial guarantee for individual $\Delta_k$ because it does not reveal which optimality gaps are small.

We show that mistake IPOC bounds are stronger than any of the above guarantees, i.e., they imply Uniform-PAC, PAC, and regret bounds. Cumulative IPOC bounds are slightly weaker but still imply regret bounds. Both versions of IPOC also ensure that the algorithm is anytime interruptable, i.e., it can be used to find better and better policies that have small $\Delta_k$ with high probability $1-\delta$. That means IPOC bounds imply supervised-learning-style PAC bounds for all $\epsilon$ jointly. These claims are formalized as follows:

**Proposition 2.** Assume an algorithm has a cumulative IPOC bound $F(W, T, \delta)$.

1. Then it has a regret bound of the same order, i.e., with probability at least $1-\delta$, for all $T$ the regret $R(T) := \sum_{k=1}^T \Delta_k$ is bounded by $F(W, T, \delta)$.
2. If $F$ has the form $\sum_{p=0}^N (C_p(W, \delta)\, T)^{\frac{p}{p+1}}$ for appropriate functions $C_p$, then with probability at least $1-\delta$, for any $\epsilon$, it outputs a certificate $\epsilon_k \le \epsilon$ within $\max_{0 \le p \le N} C_p(W, \delta)^p (N+1)^{p+1} / \epsilon^{p+1}$ episodes. Hence, for settings without context, the algorithm outputs an $\epsilon$-optimal policy within that number of episodes (supervised-learning-style PAC bound).

**Proposition 3.** If an algorithm has a mistake IPOC bound $F(W, \epsilon, \delta)$, then

1. it has a Uniform-PAC bound $F(W, \epsilon, \delta)$, i.e., with probability at least $1-\delta$, the number of episodes with $\Delta_k > \epsilon$ is at most $F(W, \epsilon, \delta)$ for all $\epsilon > 0$;
2. with probability $1-\delta$, for all $\epsilon$, it outputs a certificate $\epsilon_k \le \epsilon$ within $F(W, \epsilon, \delta) + 1$ episodes. For settings without context, that means the algorithm outputs an $\epsilon$-optimal policy within that many episodes (supervised-learning-style PAC);
3. if $F$ has the form $\sum_{p=1}^N \frac{C_p(W,\delta)}{\epsilon^p} \ln^n \frac{C(W,\delta)}{\epsilon}$ with $C_p(W,\delta) \ge 1$ and constants $N, n \in \mathbb{N}$, it also has a cumulative IPOC bound of order $\sum_{p=1}^N C_p(W,\delta)^{1/p}\, T^{\frac{p-1}{p}} \cdot \mathrm{polylog}(\Delta_{\max}, C(W,\delta), T)$.

The functional form in part 2 of Proposition 2 includes common polynomial bounds like $O(\sqrt{T})$ or $O(T^{2/3})$ with appropriate factors, and similarly for part 3 of Proposition 3, which covers for example $O(1/\epsilon^2)$.
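For intuition on part 2 of Proposition 2, a short derivation under the stated functional form (a sketch of the argument, not the paper's proof): if $\sum_{k=1}^T \epsilon_k \le \sum_{p=0}^N (C_p T)^{\frac{p}{p+1}}$, then

$$\min_{k \le T}\epsilon_k \;\le\; \frac{1}{T}\sum_{p=0}^{N}(C_p T)^{\frac{p}{p+1}} \;=\; \sum_{p=0}^{N} C_p^{\frac{p}{p+1}}\, T^{-\frac{1}{p+1}},$$

and once $T \ge \max_{0 \le p \le N} C_p^{p}(N+1)^{p+1}/\epsilon^{p+1}$, each of the $N+1$ summands is at most $\epsilon/(N+1)$, so some certificate output so far is at most $\epsilon$.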
Our IPOC framework is similar to KWIK (Li et al., 2008), in that the algorithm is required to declare how well it will perform. However, KWIK only requires an algorithm to declare whether the output will perform better than a single pre-specified input threshold. Existing KWIK methods for RL only provide such a binary classification and have weaker learning guarantees. In a sense, IPOC is a generalization of KWIK.

## 4. Algorithms with Policy Certificates

A natural path to obtaining RL algorithms with IPOC bounds is to combine existing provably efficient online RL algorithms with an off-policy policy evaluation method to compute a confidence interval on the online RL algorithm's policy for the current episode. This yields policy return certificates, but not necessarily policy optimality certificates, i.e., bounds on the difference between the optimal and the current policy's return. Estimating the optimal return using off-policy evaluation algorithms in order to compute optimality certificates would require a significant computational burden, e.g., evaluating all (exponentially many) policies. However, optimism-in-the-face-of-uncertainty (OFU) algorithms can be modified to provide both policy return certificates and optimality certificates without the need for a separate off-policy policy optimization step.

Specifically, we here consider OFU algorithms that maintain an upper confidence bound (for a potentially changing confidence level) on the optimal value function $Q^*_{k,h}$ and therefore on the optimal return $\rho^*_k$. This bound is also an upper bound on the return of the current policy, which is chosen to maximize this bound. Many OFU methods explicitly maintain a confidence set of the MDP model to compute the upper confidence bound on $Q^*_{k,h}$. These same confidence sets of the model can be used to compute a lower bound on the value function of the current policy. In doing so, OFU algorithms can be modified with little computational overhead to provide policy return and optimality certificates.

For these reasons, we focus on OFU methods, introducing two new algorithms with policy certificates, one for tabular MDPs and one for the more general setting of MDPs with linear side information. Both approaches have a similar structure, but leverage different confidence sets and model estimators. In the first case, we show that maintaining lower bounds on the current policy's value has significant benefits beyond enabling policy certificates: lower bounds help us to derive a tighter bound on our uncertainty over the range of future values. Thus we are able to provide the strongest, to our knowledge, PAC and regret bounds for tabular MDPs. It remains an intriguing but non-trivial question whether we can create confidence sets that leverage explicit upper and lower bounds for the linear side information setting.
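Both algorithms share the planning step just described; the following minimal Python sketch shows that shared structure. Function and variable names are ours, and the confidence width `psi` is a generic placeholder supplied by the caller, not the paper's exact bonuses.

```python
import numpy as np

def certified_planning(P_hat, r_hat, psi, H):
    """Backward induction on an empirical model with confidence widths psi.

    P_hat: (S, A, S) empirical transition probabilities
    r_hat: (S, A)    empirical mean rewards
    psi:   (S, A)    confidence widths (recomputed by the algorithms before each episode)
    Returns a greedy policy pi (H, S), optimistic values V_up and pessimistic values V_lo, each (H+1, S).
    """
    S, A = r_hat.shape
    V_up = np.zeros((H + 1, S))            # upper confidence bounds on V*_h
    V_lo = np.zeros((H + 1, S))            # lower confidence bounds on V^pi_h
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        v_max = H - h                      # values at this step lie in [0, H - h]
        Q_up = np.clip(r_hat + P_hat @ V_up[h + 1] + psi, 0.0, v_max)
        Q_lo = np.clip(r_hat + P_hat @ V_lo[h + 1] - psi, 0.0, v_max)
        pi[h] = Q_up.argmax(axis=1)        # greedy w.r.t. the optimistic estimate
        rows = np.arange(S)
        V_up[h] = Q_up[rows, pi[h]]
        V_lo[h] = Q_lo[rows, pi[h]]        # pessimistic value of the *same* policy
    return pi, V_up, V_lo

# The certificates for the (deterministic) initial state s1 are then
#   I_k  = [V_lo[0][s1], V_up[0][s1]]   and   eps_k = V_up[0][s1] - V_lo[0][s1].
```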
### 4.1. Tabular MDPs

We present ORLC (Optimistic Reinforcement Learning with Certificates), shown in Algorithm 1 (see the appendix for a version with empirically tighter confidence bounds but the same theoretical guarantees). It shares a similar structure with recent OFU algorithms like UBEV (Dann et al., 2017) and UCBVI-BF (Azar et al., 2017) but has some significant differences. Before each episode $k$, Algorithm 1 computes an optimistic estimate $\tilde Q_{k,h}$ of $Q^*_h$ in Line 10 by dynamic programming on the empirical model $(\hat P_k, \hat r_k)$ with confidence widths $\psi_{k,h}$. Importantly, it also computes $\underline Q_{k,h}$, a pessimistic estimate of $Q^{\pi_k}_h$, in similar fashion in Line 11. The optimistic and pessimistic estimates $\tilde Q_{k,h}, \underline Q_{k,h}$ (resp. $\tilde V_{k,h}, \underline V_{k,h}$) allow us to compute the certificates $\epsilon_k$ and $I_k$ and enable more sample-efficient learning. Specifically, Algorithm 1 uses a novel form of confidence widths $\psi$ that explicitly depends on this difference. These confidence widths are key for proving the following IPOC bound:

**Theorem 4 (Mistake IPOC Bound of Alg. 1).** For any given $\delta \in (0,1)$, Alg. 1 satisfies, in any tabular MDP with $S$ states, $A$ actions and horizon $H$, the following mistake IPOC bound: for all $\epsilon > 0$, the number of episodes where Alg. 1 outputs a certificate $|I_k| = \epsilon_k > \epsilon$ is at most
$$\left(\frac{SAH^2}{\epsilon^2} + \frac{S^2AH^3}{\epsilon}\right)\mathrm{polylog}\!\left(S, A, H, \tfrac{1}{\epsilon}, \tfrac{1}{\delta}\right).$$

By Proposition 3, this implies a Uniform-PAC bound of the same order as well as the regret and PAC bounds listed in Table 1. This table also contains the previous state-of-the-art bounds of each type² as well as lower bounds. The IPOC lower bound follows from the PAC lower bound by Dann & Brunskill (2015) and Proposition 3. For $\epsilon$ small enough ($\tilde O(1/(SH))$ specifically), our IPOC bound is minimax, i.e., the best achievable, up to log-factors. This is also true for the Uniform-PAC and PAC bounds implied by Theorem 4, as well as the implied regret bound when the number of episodes $T = \Omega(S^3AH^4)$ is large enough. ORLC is the first algorithm to achieve this minimax rate for PAC and Uniform-PAC. While UCBVI-BF achieves minimax regret for problems with small horizon, its bound is suboptimal when $H > SA$. The lower-order term in our regret bound, $\tilde O(S^2AH^3)$, has a slightly worse dependency on $H$ than Azar et al. (2017), but we can trade off a factor of $H$ for $A$ (see appendix) and believe that this term can be further reduced by a more involved analysis.

We defer details of our IPOC analysis to the appendix, available at https://arxiv.org/abs/1811.03056, but the main advances leverage that $[\underline Q_{k,h}(s,a), \tilde Q_{k,h}(s,a)]$ is an observable confidence interval for both $Q^*_h(s,a)$ and $Q^{\pi_k}_h(s,a)$. Specifically, our main novel insights are:

- While prior works (e.g., Lattimore & Hutter, 2012; Dann & Brunskill, 2015) control the suboptimality $Q^*_h - Q^{\pi_k}_h$ of the policy by recursively bounding $\tilde Q_{k,h} - Q^{\pi_k}_h$, we instead recursively bound $\tilde Q_{k,h} - \underline Q_{k,h} \le 2\psi_{k,h} + \hat P_k(\tilde V_{k,h+1} - \underline V_{k,h+1})$, which is not only simpler but also controls both the suboptimality of the policy and the size of the certificates simultaneously.
- As in existing work (e.g., Azar et al., 2017; Jin et al., 2018), we use empirical Bernstein-type concentration inequalities to construct $\tilde Q_{k,h}(s,a)$ as an upper bound on $Q^*_h(s,a) = r(s,a) + P(s,a)V^*_{h+1}$. This results in a dependency of the upper bound on the variance of the optimal next-state value $\sigma_{\hat P_k(s,a)}(V^*_{h+1})^2$ under the empirical model. Since $V^*_{h+1}$ is unknown, this has to be upper-bounded by $\sigma_{\hat P_k(s,a)}(\tilde V_{k,h+1})^2 + B$ with an additional bonus $B$ to account for the difference between the values, $\tilde V_{k,h+1} - V^*_{h+1}$, which is again unobservable. Azar et al. (2017) construct an observable bound on $B$ through an intricate regret analysis that involves additional high-probability bounds on error terms (see their Efr/Eaz events), which causes the suboptimal $\sqrt{H^3T}$ term in their regret bound. Instead, we use the fact that $\tilde V_{k,h+1} - \underline V_{k,h+1}$ is an observable upper bound on $\tilde V_{k,h+1} - V^*_{h+1}$, which we can directly use in our confidence widths $\psi_{k,h}$ (see the last term in Line 9 of Alg. 1). Hence, the availability of lower bounds through certificates also improves our upper confidence bounds on $Q^*_h$ and yields more sample-efficient exploration with improved performance bounds, as we avoid additional high-probability bounds on error terms.

Note that by augmenting our state space with a time index, our algorithm also achieves minimax optimality with $\tilde O(\sqrt{SAH^3T})$ regret up to lower-order terms in their (time-dependent dynamics) setting.

Algorithm 1: ORLC (Optimistic Reinforcement Learning with Certificates)

Input: failure tolerance $\delta \in (0, 1]$
1: $\phi(n) = 1 \wedge \sqrt{\frac{1.4\ln\ln(e \vee n) + \ln\frac{26SA(H+1+S)}{\delta}}{n}}$;  $\tilde V_{k,H+1}(s) = 0$; $\underline V_{k,H+1}(s) = 0$  $\forall s \in \mathcal{S}, k \in \mathbb{N}$
2: for $k = 1, 2, 3, \ldots$ do
3:  for $s', s \in \mathcal{S}, a \in \mathcal{A}$ do  // update empirical model and number of observations
4:   $n_k(s,a) = \sum_{i=1}^{k-1}\sum_{h=1}^{H}\mathbf{1}\{s_{i,h}=s, a_{i,h}=a\}$  // number of times $(s,a)$ was observed
5:   $\hat r_k(s,a) = \frac{1}{n_k(s,a)}\sum_{i=1}^{k-1}\sum_{h=1}^{H} r_{i,h}\mathbf{1}\{s_{i,h}=s, a_{i,h}=a\}$  // avg. reward observed for $(s,a)$
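A compact sketch of one iteration of Algorithm 1's loop, reusing the hypothetical `certified_planning` helper from the sketch at the start of this section. To stay short, it uses a simplified Hoeffding-style width instead of the Bernstein-style $\psi_{k,h}$ of Line 9, so it illustrates the structure of ORLC rather than its exact bonuses or its sharpened guarantees.

```python
import numpy as np

def orlc_episode(counts, reward_sums, trans_counts, H, delta):
    """One planning step from aggregated counts (hypothetical bookkeeping layout).

    counts:       (S, A)    visit counts n_k(s, a)
    reward_sums:  (S, A)    sum of observed rewards per (s, a)
    trans_counts: (S, A, S) observed transition counts per (s, a, s')
    """
    S, A = counts.shape
    n = np.maximum(counts, 1)
    r_hat = reward_sums / n
    P_hat = trans_counts / n[:, :, None]
    P_hat[counts == 0] = 1.0 / S                      # arbitrary prior for unvisited pairs
    # Simplified Hoeffding-style width; Algorithm 1 uses the tighter Bernstein-style psi_{k,h}.
    psi = H * np.sqrt(np.log(6 * S * A * H / delta) / n)
    pi, V_up, V_lo = certified_planning(P_hat, r_hat, psi, H)
    s1 = 0                                            # deterministic initial state
    certificate = (V_lo[0, s1], V_up[0, s1])          # I_k; eps_k is its length
    return pi, certificate
```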
6:   $\hat P_k(s'|s,a) = \frac{1}{n_k(s,a)}\sum_{i=1}^{k-1}\sum_{h=1}^{H}\mathbf{1}\{s_{i,h}=s, a_{i,h}=a, s_{i,h+1}=s'\}$
7:  for $h = H$ to $1$ and $s \in \mathcal{S}$ do  // optimistic planning with upper and lower confidence bounds
8:   for $a \in \mathcal{A}$ do
9:    $\psi_{k,h}(s,a) = \big(1 + 12\,\sigma_{\hat P_k(s,a)}(\tilde V_{k,h+1})\big)\phi(n_k(s,a)) + 45SH^2\phi(n_k(s,a))^2 + \frac{1}{H}\hat P_k(s,a)\big(\tilde V_{k,h+1} - \underline V_{k,h+1}\big)$
10:    $\tilde Q_{k,h}(s,a) = 0 \vee \big(\hat r_k(s,a) + \hat P_k(s,a)\tilde V_{k,h+1} + \psi_{k,h}(s,a)\big) \wedge V^{\max}_h$  // UCB of $Q^*_h$
11:    $\underline Q_{k,h}(s,a) = 0 \vee \big(\hat r_k(s,a) + \hat P_k(s,a)\underline V_{k,h+1} - \psi_{k,h}(s,a)\big) \wedge V^{\max}_h$  // LCB of $Q^{\pi_k}_h$
12:   $\pi_k(s,h) = \arg\max_a \tilde Q_{k,h}(s,a)$;  $\tilde V_{k,h}(s) = \tilde Q_{k,h}(s, \pi_k(s,h))$;  $\underline V_{k,h}(s) = \underline Q_{k,h}(s, \pi_k(s,h))$
13:  output policy $\pi_k$ with certificates $I_k = [\underline V_{k,1}(s_{1,1}), \tilde V_{k,1}(s_{1,1})]$ and $\epsilon_k = |I_k|$
14:  sample episode $k$ with policy $\pi_k$  // observe $s_{k,1}, a_{k,1}, r_{k,1}, s_{k,2}, \ldots, s_{k,H}, a_{k,H}, r_{k,H}$

| Algorithm | Regret | PAC | Mistake IPOC |
|---|---|---|---|
| UCBVI-BF (Azar et al., 2017) | $\tilde O(\sqrt{SAH^2T} + \sqrt{H^3T} + S^2AH^2)$ | – | – |
| Q-learning w/ UCB² (Jin et al., 2018) | $\tilde O(\sqrt{SAH^4T} + S^{1.5}A^{1.5}H^{4.5})$ | – | – |
| UCFH (Dann & Brunskill, 2015) | – | $\tilde O\big(\frac{S^2AH^2}{\epsilon^2}\big)$ | – |
| UBEV² (Dann et al., 2017) | $\tilde O(\sqrt{SAH^4T} + S^2AH^3)$ | $\tilde O\big(\frac{SAH^4}{\epsilon^2}\big)$ | – |
| ORLC (this work) | $\tilde O(\sqrt{SAH^2T} + S^2AH^3)$ | $\tilde O\big(\frac{SAH^2}{\epsilon^2}\big)$ | $\tilde O\big(\frac{SAH^2}{\epsilon^2}\big)$ |
| Lower bounds | $\Omega(\sqrt{SAH^2T})$ | $\Omega\big(\frac{SAH^2}{\epsilon^2}\big)$ | $\Omega\big(\frac{SAH^2}{\epsilon^2}\big)$ |

Table 1. Comparison of the state of the art and our bounds for episodic RL in tabular MDPs. A dash means that the algorithm does not satisfy a non-trivial bound without modifications. $T$ is the number of episodes, and $\ln(1/\delta)$ factors are omitted for readability. For an empirical comparison of the sample-complexity of these approaches, see Appendix E.2, available at https://arxiv.org/abs/1811.03056.

² These model-free and model-based methods have the best known bounds in our problem class. Q-learning with UCB and UBEV allow time-dependent dynamics. One might be able to improve their regret bounds by $\sqrt{H}$ when adapting them to our setting.

As opposed to the upper bounds, we cannot simply apply concentration inequalities to construct $\underline Q_{k,h}(s,a)$ as a lower bound on $Q^{\pi_k}_h$, because the estimation target $Q^{\pi_k}_h(s,a) = r(s,a) + P(s,a)V^{\pi_k}_{h+1}$ is itself random. The policy $\pi_k$ depends in highly non-trivial ways on all samples from which we also estimate the empirical model $\hat P_k, \hat r_k$. A prevalent approach in model-based policy evaluation (e.g., Strehl & Littman, 2008; Ghavamzadeh et al., 2016) to deal with this challenge is to instead apply a concentration argument to the $\ell_1$ distance of the transition estimates, $\|P(s,a) - \hat P_k(s,a)\|_1 \le \sqrt{S}\,\phi(n_k(s,a))$. This yields confidence intervals that shrink at a rate of $H\sqrt{S}\,\phi(n_k(s,a))$. Instead, we can exploit that $\pi_k$ is generated by a sample-efficient algorithm and construct $\underline Q_{k,h}$ as a lower bound on the non-random quantity $r(s,a) + P(s,a)V^*_{h+1}$. We account for the difference $P(s,a)(V^*_{h+1} - V^{\pi_k}_{h+1}) \le P(s,a)(\tilde V_{k,h+1} - \underline V_{k,h+1})$ explicitly, again through a recursive bound. This allows us to achieve confidence intervals that shrink at a faster rate of $\psi_{k,h} \approx H\phi(n_k(s,a)) + SH^2\phi(n_k(s,a))^2$, without the $\sqrt{S}$ dependency in the dominating $\phi(n_k(s,a))$ term (recall that $\phi(n_k(s,a)) \le 1$ and goes to $0$). Hence, by leveraging that $\pi_k$ is computed by a sample-efficient approach, we improve the tightness of the certificates.

### 4.2. MDPs With Linear Side Information

We now present an algorithm for the more general setting with side information, which, for example, allows us to take background information about a customer into account and generalize across different customers. Algorithm 2 gives an extension, called ORLC-SI, of the OFU algorithm by Abbasi-Yadkori & Neu (2014). Its overall structure is the same as that of the tabular Algorithm 1, but here the empirical model consists of least-squares estimates of the model parameters, evaluated at the current contexts.
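A simplified sketch of this least-squares model: one shared context vector is used for both rewards and transitions, and the exact log terms of the ellipsoid width $\phi(N, x, \xi)$ are replaced by a generic scale `beta`; names are illustrative.

```python
import numpy as np

class ContextualModel:
    """Per-(s,a) ridge regression for rewards and per-(s',s,a) for transition indicators."""

    def __init__(self, S, A, d, lam=1.0):
        self.N = np.tile(lam * np.eye(d), (S, A, 1, 1))  # Gram matrices N_{k,s,a}
        self.b_r = np.zeros((S, A, d))                    # sums of x * r       per (s, a)
        self.b_p = np.zeros((S, A, S, d))                 # sums of x * 1{s'}   per (s, a, s')

    def update(self, s, a, s_next, r, x):
        self.N[s, a] += np.outer(x, x)
        self.b_r[s, a] += x * r
        self.b_p[s, a, s_next] += x

    def estimate(self, s, a, x, beta=1.0):
        """Return clipped reward/transition estimates at context x and an ellipsoid width."""
        N_inv = np.linalg.inv(self.N[s, a])
        r_hat = float(np.clip(x @ (N_inv @ self.b_r[s, a]), 0.0, 1.0))
        p_hat = np.clip(self.b_p[s, a] @ N_inv @ x, 0.0, 1.0)   # one estimate per next state s'
        width = beta * np.sqrt(x @ N_inv @ x)                   # beta * ||x||_{N^{-1}}
        return r_hat, p_hat, width
```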
Algorithm 2: ORLC-SI (Optimistic Reinforcement Learning with Certificates and Side Information)

Input: failure probability $\delta \in (0, 1]$, regularizer $\lambda > 0$
1: $\tilde V_{k,H+1}(s) = 0$; $\underline V_{k,H+1}(s) = 0$  $\forall s \in \mathcal{S}, k \in \mathbb{N}$
2: $\phi(N, x, \xi) := \Big[\sqrt{\tfrac{1}{2}\ln\tfrac{S(SA+A+H)}{\delta} + \tfrac{1}{4}\ln\tfrac{\det N}{\det(\lambda I)}} + \sqrt{\lambda}\,\xi\Big]\,\|x\|_{N^{-1}}$
3: for $k = 1, 2, 3, \ldots$ do
4:  observe current contexts $x^{(r)}_k$ and $x^{(p)}_k$
5:  for $s, s' \in \mathcal{S}, a \in \mathcal{A}$ do  // estimate model with least-squares
6:   $N^{(q)}_{k,s,a} = \lambda I + \sum_{i=1}^{k-1}\sum_{h=1}^{H}\mathbf{1}\{s_{i,h}=s, a_{i,h}=a\}\, x^{(q)}_i (x^{(q)}_i)^\top$ for $q \in \{r, p\}$
7:   $\hat\theta^{(r)}_{k,s,a} = (N^{(r)}_{k,s,a})^{-1}\sum_{i=1}^{k-1}\sum_{h=1}^{H}\mathbf{1}\{s_{i,h}=s, a_{i,h}=a\}\, x^{(r)}_i r_{i,h}$;  $\hat r_k(s,a) = 0 \vee (x^{(r)}_k)^\top\hat\theta^{(r)}_{k,s,a} \wedge 1$
8:   $\hat\theta^{(p)}_{k,s',s,a} = (N^{(p)}_{k,s,a})^{-1}\sum_{i=1}^{k-1}\sum_{h=1}^{H}\mathbf{1}\{s_{i,h}=s, a_{i,h}=a, s_{i,h+1}=s'\}\, x^{(p)}_i$
9:   $\hat P_k(s'|s,a) = 0 \vee (x^{(p)}_k)^\top\hat\theta^{(p)}_{k,s',s,a} \wedge 1$
10:  for $h = H$ to $1$ and $s \in \mathcal{S}$ do  // optimistic planning with ellipsoid confidence bounds
11:   for $a \in \mathcal{A}$ do
12:    $\psi_{k,h}(s,a) = \|\tilde V_{k,h+1}\|_1\,\phi(N^{(p)}_{k,s,a}, x^{(p)}_k, \xi_{\theta^{(p)}}) + \phi(N^{(r)}_{k,s,a}, x^{(r)}_k, \xi_{\theta^{(r)}})$
13:    $\tilde Q_{k,h}(s,a) = 0 \vee \big(\hat r_k(s,a) + \hat P_k(s,a)\tilde V_{k,h+1} + \psi_{k,h}(s,a)\big) \wedge V^{\max}_h$  // UCB of $Q^*_h$
14:    $\underline Q_{k,h}(s,a) = 0 \vee \big(\hat r_k(s,a) + \hat P_k(s,a)\underline V_{k,h+1} - \psi_{k,h}(s,a)\big) \wedge V^{\max}_h$  // LCB of $Q^{\pi_k}_h$
15:   $\pi_k(s,h) = \arg\max_a \tilde Q_{k,h}(s,a)$;  $\tilde V_{k,h}(s) = \tilde Q_{k,h}(s, \pi_k(s,h))$;  $\underline V_{k,h}(s) = \underline Q_{k,h}(s, \pi_k(s,h))$
16:  output policy $\pi_k$ with certificates $I_k = [\underline V_{k,1}(s_{1,1}), \tilde V_{k,1}(s_{1,1})]$ and $\epsilon_k = |I_k|$
17:  sample episode $k$ with policy $\pi_k$  // observe $s_{k,1}, a_{k,1}, r_{k,1}, s_{k,2}, \ldots, s_{k,H}, a_{k,H}, r_{k,H}$

Specifically, the empirical transition probability $\hat P_k(s'|s,a)$ is $(x^{(p)}_k)^\top\hat\theta^{(p)}_{k,s',s,a}$, where $\hat\theta^{(p)}_{k,s',s,a}$ is the least-squares estimate of the model parameter $\theta^{(p)}_{s',s,a}$. Since transition probabilities are normalized, this estimate is then clipped to $[0,1]$. The model is estimated separately for each $(s', s, a)$-triple, but generalizes across different contexts. The confidence widths $\psi_{k,h}$ are derived using ellipsoid confidence sets on the model parameters. We show the following IPOC bound:

**Theorem 5 (Cumulative IPOC Bound for Alg. 2).** For any $\delta \in (0,1)$ and regularizer $\lambda > 0$, Alg. 2 satisfies the following cumulative IPOC bound in any MDP with contexts of dimensions $d^{(r)}$ and $d^{(p)}$ and bounded parameters $\xi_{\theta^{(r)}} \le \sqrt{d^{(r)}}$, $\xi_{\theta^{(p)}} \le \sqrt{d^{(p)}}$. With probability at least $1-\delta$, all return certificates contain the return of $\pi_k$, all optimality certificates are upper bounds on the optimality gaps, and their total sum after $T$ episodes is bounded for all $T$ by
$$\tilde O\left(\sqrt{S^3AH^4\,T\,\lambda\,\big(d^{(p)} + d^{(r)}\big)}\,\log\frac{\xi^2_{x^{(p)}} + \xi^2_{x^{(r)}}}{\lambda\delta}\right).$$

By Proposition 2, this IPOC bound implies a regret bound of the same order, which improves on the $\tilde O\big(\sqrt{d^2S^4AH^5T\log(1/\delta)}\big)$ regret bound of Abbasi-Yadkori & Neu (2014) with $d = d^{(p)} + d^{(r)}$ by a factor of $\sqrt{SAH}$. While they make a different modelling assumption (generalized linear instead of linear), we believe at least our better $S$ dependency is due to using improved least-squares estimators for the transition dynamics³ and can likely be transferred to their setting. The mistake-type PAC bound by Modi et al. (2018) is not comparable, because our cumulative IPOC bound does not imply a mistake-type PAC bound.⁴ Nonetheless, loosely translating our result to a PAC-like bound yields $\tilde O\big(\frac{d^2S^3AH^5}{\epsilon^2}\big)$, which is much smaller than their $\tilde O\big(\frac{d^2SAH^4}{\epsilon^5}\max\{d^2, S^2\}\big)$ bound for small $\epsilon$.

³ They estimate $\theta_{s',s,a}$ only from samples where the transition $s, a \rightarrow s'$ was observed, instead of from all occurrences of $s, a$ (no matter whether $s'$ was the next state).

⁴ An algorithm with a sub-linear cumulative IPOC bound can output a certificate larger than a threshold, $\epsilon_k \ge \epsilon$, infinitely often, as long as it does so sufficiently infrequently (see Section D).

The confidence bounds in Alg. 2 are more general but looser than those for the tabular case in Alg. 1. Instantiating the IPOC bound for Alg. 2 from Theorem 5 for tabular MDPs ($x^{(r)}_k = x^{(p)}_k = 1$) yields $\tilde O(\sqrt{S^3AH^4T})$, which is worse than the cumulative IPOC bound $\tilde O(\sqrt{SAH^2T} + S^2AH^3)$ of Alg. 1 implied by Thm. 4 and Prop. 3.
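For concreteness, the tabular instantiation mentioned above works out as follows under the bound of Theorem 5 as stated: with scalar contexts $x^{(r)}_k = x^{(p)}_k = 1$, dimensions $d^{(r)} = d^{(p)} = 1$, context bounds $\xi_{x^{(r)}} = \xi_{x^{(p)}} = 1$ and regularizer $\lambda = 1$,

$$\tilde O\left(\sqrt{S^3AH^4T \cdot 1 \cdot (1+1)}\,\log\frac{1+1}{\delta}\right) \;=\; \tilde O\left(\sqrt{S^3AH^4T}\right).$$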
By Prop. 3, a mistake IPOC bound is stronger than the cumulative version we proved for Algorithm 2. One might wonder whether Alg. 2 also satisfies a mistake bound, but in Appendix D (at https://arxiv.org/abs/1811.03056) we show that this is not the case, because of its non-decreasing ellipsoid confidence sets. There could be other algorithms with mistake IPOC bounds for this setting, but they would likely require entirely different confidence sets.

[Figure 1. Certificates and (unobserved) optimality gaps of Algorithm 2 for 4M episodes on an MDP with a context distribution shift after 2M episodes (episodes sub-sampled for better visualization).]

## 5. Simulation Experiment

One important use case for certificates is to detect sudden performance drops when the distribution of contexts changes. For example, in a call center dialogue system, there can be a sudden increase of customers calling due to a certain regional outage. We demonstrate that certificates can identify such performance drops caused by context shifts. We consider a simulated MDP with 10 states, 40 actions and horizon 5, where rewards depend on a 10-dimensional context, and let the distribution of contexts change after 2M episodes. As seen in Figure 1, this causes a spike in the optimality gap as well as in the optimality certificates. While our certificates only need to upper bound the optimality gap / contain the return in each episode up to a small failure probability, even for the worst case, our algorithm can reliably detect this sudden decrease of performance. In fact, the optimality certificates have a very high correlation of 0.94 with the unobserved optimality gaps.

One may also wonder whether our algorithms lead to improvements over prior approaches in practice, or only in the theoretical bounds. To help answer this, we present results in Appendix E at https://arxiv.org/abs/1811.03056, both analyzing the policy certificates provided and examining ORLC's performance in tabular MDPs versus other recent papers with similar regret (Azar et al., 2017) or PAC (Dann et al., 2017) bounds. Encouragingly, in the small simulated MDPs considered, we find that our algorithms lead to faster learning and better performance. Therefore, while our primary contribution is theoretical results, these simulations suggest the potential benefits of the ideas underlying our proposed framework and algorithms.
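A sketch of the kind of monitoring motivated by this experiment: track the announced optimality certificates and flag episodes whose certificate jumps well above the recent average, e.g., after a context distribution shift. The thresholding rule and names are illustrative.

```python
def monitor_certificates(eps_stream, window=1000, factor=3.0):
    """Yield episode indices whose certificate exceeds `factor` times the recent average."""
    recent = []
    for k, eps_k in enumerate(eps_stream):
        if len(recent) == window and eps_k > factor * (sum(recent) / window):
            yield k                      # e.g., hand control to a human or reduce the service price
        recent.append(eps_k)
        if len(recent) > window:
            recent.pop(0)
```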
## 6. Related Work

The connection of IPOC to other frameworks is formally discussed in Section 3. Our algorithms essentially compute confidence bounds as in OFU methods, and then use those in model-based policy evaluation to obtain policy certificates. There are many works on off-policy policy evaluation (e.g., Jiang & Li, 2016; Thomas & Brunskill, 2016; Mahmood et al., 2017), some of which provide non-asymptotic confidence intervals (e.g., Thomas et al., 2015b;a; Sajed et al., 2018). However, these methods focus on the batch setting where a set of episodes sampled by fixed policies is given. Many approaches rely on importance weights that require stochastic data-collecting policies, but most sample-efficient algorithms for which we would like to provide certificates deploy deterministic policies. One could treat previous episodes as being collected by one stochastic data-dependent policy, but that introduces bias in the importance-weighting estimators that is not accounted for in the analyses.

Interestingly, there is very recent work (Zanette & Brunskill, 2019) that also observed the benefits of using lower bounds in optimism-based exploration in tabular episodic RL. Though both their and our work obtain improved theoretical results, the specific forms of the optimistic bonuses are distinct and the analyses differ in many parts (e.g., we provide (Uniform-)PAC and regret bounds instead of only regret bounds). Most importantly, our work provides policy certificate guarantees as a main contribution, whereas that work focuses on problem-dependent regret bounds.

Approaches to safe exploration (Kakade & Langford, 2002; Pirotta et al., 2013; Thomas et al., 2015a; Ghavamzadeh et al., 2016) guarantee monotonically increasing performance by operating in a batch loop. Our work is orthogonal, as we are not restricting exploration but rather exposing its impact to the users and giving them the choice to intervene.

## 7. Conclusion and Future Work

We introduced policy certificates to improve accountability in RL by enabling users to intervene if the guaranteed performance is deemed inadequate. Bounds in our new theoretical framework IPOC ensure that certificates indeed bound the return and suboptimality in each episode and prescribe the rate at which certificates and policy improve. By combining optimism-based exploration with model-based policy evaluation, we have created two algorithms for RL with policy certificates, for tabular MDPs with and without linear side information. For tabular MDPs, we demonstrated that policy certificates help optimism-based policy learning and vice versa. As a result, our new algorithm is the first to achieve minimax-optimal PAC bounds up to lower-order terms for tabular episodic MDPs, and also the first to have both minimax PAC and regret bounds for this setting. Future areas of interest include scaling up these ideas to continuous state spaces, extending them to model-free RL, and providing per-episode risk-sensitive guarantees on the reward obtained.

## Acknowledgements

Parts of this work were completed while Christoph Dann was an intern at Google. We appreciate support from a Microsoft Faculty Fellowship and an NSF CAREER award.

## References

Abbasi-Yadkori, Y. and Neu, G. Online learning in MDPs with side information. arXiv preprint arXiv:1406.6812, 2014.

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263–272, 2017.

Dann, C. and Brunskill, E. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2818–2826, 2015.

Dann, C., Lattimore, T., and Brunskill, E. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5713–5723, 2017.

Dann, C., Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. On oracle-efficient PAC RL with rich observations. In Advances in Neural Information Processing Systems, pp. 1422–1432, 2018.

Ghavamzadeh, M., Petrik, M., and Chow, Y. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, pp. 2298–2306, 2016.

Hallak, A., Di Castro, D., and Mannor, S. Contextual Markov decision processes. arXiv preprint arXiv:1502.02259, 2015.
Howard, S. R., Ramdas, A., McAuliffe, J., and Sekhon, J. Uniform, nonparametric, non-asymptotic confidence sequences. arXiv preprint arXiv:1810.08240, 2018.

Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., and Roth, A. Fair learning in Markovian environments. arXiv preprint arXiv:1611.03071, 2016.

Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 652–661, 2016.

Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pp. 1704–1713, 2017.

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? arXiv preprint arXiv:1807.03765, 2018.

Joseph, M., Kearns, M., Morgenstern, J. H., and Roth, A. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems, pp. 325–333, 2016.

Kakade, S. On the sample complexity of reinforcement learning. PhD thesis, University College London, 2003.

Kakade, S. M. and Langford, J. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002.

Kannan, S., Kearns, M., Morgenstern, J., Pai, M., Roth, A., Vohra, R., and Wu, Z. S. Fairness incentives for myopic agents. In Proceedings of the 2017 ACM Conference on Economics and Computation, pp. 369–386. ACM, 2017.

Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2018.

Lattimore, T. and Hutter, M. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pp. 320–334. Springer, 2012.

Li, L., Littman, M. L., and Walsh, T. J. Knows what it knows: a framework for self-aware learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 568–575. ACM, 2008.

Mahmood, A. R., Yu, H., and Sutton, R. S. Multi-step off-policy learning without importance sampling ratios. arXiv preprint arXiv:1702.03006, 2017.

Modi, A., Jiang, N., Singh, S., and Tewari, A. Markov decision processes with continuous side information. In Algorithmic Learning Theory, pp. 597–618, 2018.

Osband, I., Russo, D., and Van Roy, B. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pp. 3003–3011, 2013.

Osband, I., Van Roy, B., and Wen, Z. Generalization and exploration via randomized value functions. In International Conference on Machine Learning, pp. 2377–2386, 2016.

Pirotta, M., Restelli, M., Pecorino, A., and Calandriello, D. Safe policy iteration. In International Conference on Machine Learning, pp. 307–315, 2013.

Raghavan, M., Slivkins, A., Vaughan, J. W., and Wu, Z. S. The externalities of exploration and how data diversity helps exploitation. arXiv preprint arXiv:1806.00543, 2018.

Sajed, T., Chung, W., and White, M. High-confidence error estimates for learned value functions. arXiv preprint arXiv:1808.09127, 2018.

Strehl, A. L. and Littman, M. L. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
Strehl, A. L., Li, L., Wiewiora, E., Langford, J., and Littman, M. L. PAC model-free reinforcement learning. In International Conference on Machine Learning, 2006.

Strehl, A. L., Li, L., and Littman, M. L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.

Szita, I. and Szepesvári, C. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 1031–1038, 2010.

Thomas, P. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148, 2016.

Thomas, P., Theocharous, G., and Ghavamzadeh, M. High confidence policy improvement. In International Conference on Machine Learning, pp. 2380–2388, 2015a.

Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. High-confidence off-policy evaluation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3000–3006, 2015b.

Zanette, A. and Brunskill, E. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. https://arxiv.org/abs/1901.00210, 2019.

Zilberstein, S. and Russell, S. Optimal composition of real-time systems. Artificial Intelligence, 82(1-2):181–213, 1996.