# Policy Certificates: Towards Accountable Reinforcement Learning

Christoph Dann¹, Lihong Li², Wei Wei², Emma Brunskill³

¹Carnegie Mellon University, ²Google Research, ³Stanford University. Correspondence to: Christoph Dann.

*Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).*

**Abstract.** The performance of a reinforcement learning algorithm can vary drastically during learning because of exploration. Existing algorithms provide little information about the quality of their current policy before executing it, and thus have limited use in high-stakes applications like healthcare. We address this lack of accountability by proposing that algorithms output policy certificates. These certificates bound the sub-optimality and return of the policy in the next episode, allowing humans to intervene when the certified quality is not satisfactory. We further introduce two new algorithms with certificates and present a new framework for theoretical analysis that guarantees the quality of their policies and certificates. For tabular MDPs, we show that computing certificates can even improve the sample-efficiency of optimism-based exploration. As a result, one of our algorithms is the first to achieve minimax-optimal PAC bounds up to lower-order terms, and this algorithm also matches (and in some settings slightly improves upon) existing minimax regret bounds.

## 1. Introduction

There is increasing excitement around applications of machine learning, but also growing awareness and concerns about fairness, accountability and transparency. Recent research aims to address these concerns, but most work focuses on supervised learning and only a few results (Jabbari et al., 2016; Joseph et al., 2016; Kannan et al., 2017; Raghavan et al., 2018) exist for reinforcement learning (RL). One challenge when applying RL in practice is that, unlike in supervised learning, the performance of an RL algorithm is typically not monotonically increasing with more data, due to the trial-and-error nature of RL that necessitates exploration. Even sharp drops in policy performance during learning are common, e.g., when the agent starts to explore a new part of the state space. Such unpredictable performance fluctuation has limited the use of RL in high-stakes applications like healthcare, and calls for more accountable algorithms that can quantify and reveal their performance online during learning.

To address this lack of accountability, we propose that RL algorithms output policy certificates in episodic RL. Policy certificates consist of (1) a confidence interval on the algorithm's expected sum of rewards (return) in the next episode (policy return certificates) and (2) a bound on how far from the optimal return the performance can be (policy optimality certificates). Certificates make the policy's performance more transparent and accountable, and allow designers to intervene if necessary. For example, in medical applications, one would need to intervene unless the policy achieves a certain minimum treatment outcome; in financial applications, policy optimality certificates can be used to assess the potential loss when learning a trading strategy. In addition to accountability, we also want RL algorithms to be sample-efficient and quickly achieve good performance.
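To make the intended use concrete, here is a minimal Python sketch of how a designer could act on the certificates announced before each episode. The names (`PolicyCertificate`, `should_intervene`) and the thresholding rules are illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PolicyCertificate:
    """Certificate announced before an episode (illustrative structure)."""
    return_lower: float    # lower end of the return confidence interval I_k
    return_upper: float    # upper end of the return confidence interval I_k
    optimality_gap: float  # policy optimality certificate eps_k >= Delta_k

def should_intervene(cert: PolicyCertificate,
                     min_return: Optional[float] = None,
                     max_gap: Optional[float] = None) -> bool:
    """Flag the next episode for human intervention if certified quality is inadequate."""
    if min_return is not None and cert.return_lower < min_return:
        return True   # cannot guarantee the minimum required outcome
    if max_gap is not None and cert.optimality_gap > max_gap:
        return True   # policy may be far from optimal for this episode
    return False

# Example: require a certified return of at least 2.0 or a certified gap below 0.5.
cert = PolicyCertificate(return_lower=1.3, return_upper=2.1, optimality_gap=0.8)
print(should_intervene(cert, min_return=2.0, max_gap=0.5))  # True -> hand off to a human
```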
To formally quantify the accountability and sample-efficiency of an algorithm, we introduce a new framework for theoretical analysis called IPOC. IPOC bounds guarantee that certificates indeed bound the algorithm's expected performance in an episode, and prescribe the rate at which the algorithm's policy and certificates improve with more data. IPOC is stronger than other frameworks like regret (Jaksch et al., 2010), PAC (Kakade, 2003) and Uniform-PAC (Dann et al., 2017), which only guarantee the cumulative performance of the algorithm but do not provide bounds for individual episodes during learning. IPOC also provides stronger bounds and more nuanced guarantees on per-episode performance than KWIK (Li et al., 2008).

A natural way to create accountable and sample-efficient RL algorithms is to combine existing sample-efficient algorithms with off-policy policy evaluation approaches to estimate the return (expected sum of rewards) of the algorithm's policy before each episode. Existing policy evaluation approaches estimate the return of a fixed policy from a batch of data (e.g., Thomas et al., 2015b; Jiang & Li, 2016; Thomas & Brunskill, 2016). They provide little to no guarantees when the policy is not fixed but computed from that same batch of data, as is the case here. They also do not reason about the return of the unknown optimal policy, which is necessary for providing policy optimality certificates. We found that by focusing on optimism-in-the-face-of-uncertainty (OFU) based RL algorithms for updating the policy and model-based policy evaluation techniques for estimating the policy returns, we can create sample-efficient algorithms that compute policy certificates on both the current policy's return and its difference to the optimal return. The main insight is that OFU algorithms compute an upper confidence bound on the optimal return from an empirical model when updating the policy. Model-based policy evaluation can leverage the same empirical model to compute a confidence interval on the policy return, even when the policy depends on the data. We illustrate this approach with new algorithms for two different episodic settings.

Perhaps surprisingly, we show that in tabular Markov decision processes (MDPs) it can be beneficial to explicitly leverage the combination of OFU-based policy optimization and model-based policy evaluation to improve either component. Specifically, computing the certificates can directly improve the underlying OFU approach, and knowing that the policy converges to the optimal policy at a certain rate improves the accuracy of policy return certificates. As a result, the guarantees for our new algorithm improve state-of-the-art regret and PAC bounds in problems with large horizons and are minimax-optimal up to lower-order terms.
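Informally, the insight above can be written as a sandwich argument (using the notation introduced later in Sections 2 and 4: $\rho_k(\pi_k)$ is the return of the played policy, $\rho^*_k$ the optimal return, and $\underline V_{k,1}, \tilde V_{k,1}$ the pessimistic and optimistic value estimates computed from the same empirical model):

$$\underline V_{k,1}(s_{k,1}) \;\le\; \rho_k(\pi_k) \;\le\; \rho^*_k \;\le\; \tilde V_{k,1}(s_{k,1}),$$

so the interval $I_k = [\underline V_{k,1}(s_{k,1}), \tilde V_{k,1}(s_{k,1})]$ is a valid return certificate and its length is a valid optimality certificate, $\epsilon_k \ge \rho^*_k - \rho_k(\pi_k)$.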
The second setting we consider is finite MDPs with linear side information (context) (Abbasi-Yadkori & Neu, 2014; Hallak et al., 2015; Modi et al., 2018), which is of particular interest in practice. For example, in a drug treatment optimization task where each patient is one episode, the context is the background information of the patient which influences the treatment outcome. While one expects the algorithm to learn a good policy quickly for frequent contexts, the performance for unusual patients may be significantly more variable due to the limited prior experience of the algorithm. Policy certificates allow humans to detect when the current policy is good for the current patient and to intervene if the certified performance is deemed inadequate. For example, in this health monitoring application, a human expert could intervene to directly specify the policy for that episode; in the context of automated customer service, the service could be provided at reduced cost to the customer.

To summarize, we make the following main contributions:

1. We introduce policy certificates and the IPOC framework for evaluating RL algorithms with certificates. Similar to existing frameworks like PAC, it provides formal requirements to be satisfied by the algorithm, here requiring the algorithm to be an efficient learner and to quantify its performance online through policy certificates.
2. We provide a new RL algorithm for finite, episodic MDPs that satisfies this definition, and show that it has stronger, minimax regret and PAC guarantees than prior work. Formally, our sample complexity bound is $\tilde O(SAH^2/\epsilon^2 + S^2AH^3/\epsilon)$ vs. the prior $\tilde O(SAH^4/\epsilon^2 + S^2AH^3/\epsilon)$ (Dann et al., 2017), and our regret bound $\tilde O(\sqrt{SAH^2T} + S^2AH^3)$ improves prior work (Azar et al., 2017) since it achieves the minimax rate up to log-terms in the dominant term even for long horizons $H > SA$.
3. We introduce a new RL algorithm for finite, episodic MDPs with linear side information that has a cumulative IPOC bound, which is tighter than past results (Abbasi-Yadkori & Neu, 2014) by a factor of $\sqrt{SAH}$.

## 2. Setting and Notation

We consider episodic RL problems where the agent interacts with the environment in episodes of a certain length. While the framework for policy certificates applies more broadly, we focus on finite MDPs with linear side information (Modi et al., 2018; Hallak et al., 2015; Abbasi-Yadkori & Neu, 2014) for concreteness. This setting includes tabular MDPs as a special case but is more general and can model variations in the environment across episodes, e.g., because different episodes correspond to treating different patients in a healthcare application. Unlike in the tabular special case, function approximation is necessary for efficient learning.

**Tabular MDPs.** The agent interacts with the MDP in episodes indexed by $k$. Each episode is a sequence $(s_{k,1}, a_{k,1}, r_{k,1}, \ldots, s_{k,H}, a_{k,H}, r_{k,H})$ of $H$ states $s_{k,h} \in \mathcal{S}$, actions $a_{k,h} \in \mathcal{A}$ and scalar rewards $r_{k,h} \in [0,1]$. For notational simplicity, we assume that the initial state $s_{k,1}$ is deterministic. The actions are taken as prescribed by the agent's policy $\pi_k$, and we here focus on deterministic time-dependent policies, i.e., $a_{k,h} = \pi_k(s_{k,h}, h)$ for all time steps $h \in [H] := \{1, 2, \ldots, H\}$. The successor states and rewards are sampled from the MDP as $s_{k,h+1} \sim P(s_{k,h}, a_{k,h})$ and $r_{k,h} \sim P_R(s_{k,h}, a_{k,h})$. In tabular MDPs the sizes of the state space $S = |\mathcal{S}|$ and action space $A = |\mathcal{A}|$ are finite.
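For concreteness, a minimal sketch of one episode of interaction under this notation; the environment representation (`P`, `R` arrays) and the Bernoulli reward model are illustrative assumptions rather than part of the paper's setting.

```python
import numpy as np

def run_episode(P, R, pi, H, s0=0, rng=np.random.default_rng(0)):
    """Roll out one episode of length H in a tabular MDP.

    P:  transition probabilities, shape (S, A, S); P[s, a] is a distribution over next states.
    R:  mean rewards in [0, 1], shape (S, A); modeled here as Bernoulli for simplicity.
    pi: deterministic time-dependent policy, shape (H, S); pi[h, s] is the action at step h.
    """
    s, trajectory, total_reward = s0, [], 0.0
    for h in range(H):
        a = pi[h, s]                                   # a_{k,h} = pi_k(s_{k,h}, h)
        r = float(rng.random() < R[s, a])              # r_{k,h} ~ P_R(s, a)
        s_next = rng.choice(len(P[s, a]), p=P[s, a])   # s_{k,h+1} ~ P(. | s, a)
        trajectory.append((s, a, r))
        total_reward += r
        s = s_next
    return trajectory, total_reward  # total_reward is one sample of the return rho_k(pi)
```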
**Finite MDPs with linear side information.** We assume that state and action space are finite as in tabular MDPs, but here the agent essentially interacts with a family of infinitely many tabular MDPs that is parameterized by linear contexts. At the beginning of episode $k$, two contexts, $x^{(r)}_k \in \mathbb{R}^{d^{(r)}}$ and $x^{(p)}_k \in \mathbb{R}^{d^{(p)}}$, are observed and the agent interacts in this episode with a tabular MDP whose dynamics and reward function depend on the contexts in a linear fashion. Specifically, it is assumed that the rewards are sampled from $P_R(s,a)$ with means $r_k(s,a) = (x^{(r)}_k)^\top \theta^{(r)}_{s,a}$ and the transition probabilities are $P_k(s'|s,a) = (x^{(p)}_k)^\top \theta^{(p)}_{s',s,a}$, where $\theta^{(r)}_{s,a} \in \mathbb{R}^{d^{(r)}}$ and $\theta^{(p)}_{s',s,a} \in \mathbb{R}^{d^{(p)}}$ are unknown parameter vectors for each $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$. As a regularity condition, we assume bounded parameters, i.e., $\|\theta^{(r)}_{s,a}\|_2 \le \xi_{\theta^{(r)}}$ and $\|\theta^{(p)}_{s',s,a}\|_2 \le \xi_{\theta^{(p)}}$, as well as bounded contexts $\|x^{(r)}_k\|_2 \le \xi_{x^{(r)}}$ and $\|x^{(p)}_k\|_2 \le \xi_{x^{(p)}}$. We allow $x^{(r)}_k$ and $x^{(p)}_k$ to be different, and use $x_k$ to denote $(x^{(r)}_k, x^{(p)}_k)$ in the following. Note that our framework and algorithms can handle adversarially chosen contexts.

**Return and optimality gap.** The quality of a policy $\pi$ in any episode $k$ is evaluated by the total expected reward or return $\rho_k(\pi) := \mathbb{E}\big[\sum_{h=1}^H r_{k,h} \,\big|\, a_{k,1:H} \sim \pi\big]$, where this notation means that all actions in the episode are taken as prescribed by the policy $\pi$. The optimal policy and its return $\rho^*_k = \max_\pi \rho_k(\pi)$ may depend on the episode's contexts. The difference between achieved and optimal return is called the optimality gap $\Delta_k = \rho^*_k - \rho_k(\pi_k)$ for each episode $k$, where $\pi_k$ is the algorithm's policy in that episode.

**Additional notation.** We denote the largest possible optimality gap by $\Delta_{\max} = H$, and the value functions of $\pi$ in episode $k$ by $Q^{\pi}_{k,h}(s,a) = \mathbb{E}[\sum_{t=h}^H r_{k,t} \,|\, a_{k,h} = a,\, a_{k,h+1:H} \sim \pi]$ and $V^{\pi}_{k,h}(s) = Q^{\pi}_{k,h}(s, \pi(s,h))$. Optimal versions are marked by a superscript $*$, and subscripts are omitted when unambiguous. We treat $P(s,a)$ as a linear operator, that is, $P(s,a)f = \sum_{s' \in \mathcal{S}} P(s'|s,a)f(s')$ for any $f: \mathcal{S} \rightarrow \mathbb{R}$. We also use $\sigma_q(f) = \sqrt{q(f - qf)^2}$ for the standard deviation of $f$ with respect to a state distribution $q$, and $V^{\max}_h = H - h + 1$ for all $h \in [H]$. We also use the common shorthand notation $a \vee b = \max\{a,b\}$ and $a \wedge b = \min\{a,b\}$ as well as $\tilde O(f) = O(f \cdot \mathrm{poly}(\log f))$.
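A small sketch of how the contexts instantiate a tabular MDP under this linear parameterization; the array layout and function name are illustrative, and the parameters are assumed to already yield valid rewards and transition distributions.

```python
import numpy as np

def contextual_mdp(theta_r, theta_p, x_r, x_p):
    """Instantiate the tabular MDP of one episode from its contexts.

    theta_r: reward parameters,     shape (S, A, d_r)    -> mean reward  r_k(s, a)    = <x_r, theta_r[s, a]>
    theta_p: transition parameters, shape (S, S, A, d_p) -> probability  P_k(s'|s, a) = <x_p, theta_p[s', s, a]>
    Assumes the parameters are such that rewards lie in [0, 1] and each P_k(.|s, a) sums to one.
    """
    r_k = theta_r @ x_r                            # shape (S, A)
    P_k = np.einsum("tsad,d->tsa", theta_p, x_p)   # shape (S', S, A)
    return r_k, np.moveaxis(P_k, 0, -1)            # P_k returned with shape (S, A, S')
```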
## 3. The IPOC Framework

During execution, the optimality gaps $\Delta_k$ are hidden and the algorithm only observes the sum of rewards, which is a sample of $\rho_k(\pi_k)$. This causes risk, as one does not know whether the algorithm is playing a good or potentially bad policy. We introduce a new learning framework that mitigates this limitation. This framework forces the algorithm to output its current policy $\pi_k$ as well as certificates $\epsilon_k \in \mathbb{R}^+$ and $I_k \subseteq \mathbb{R}$ before each episode $k$. The return certificate $I_k$ is a confidence interval on the return of the policy, while the optimality certificate $\epsilon_k$ informs the user how sub-optimal the policy can be for the current context, i.e., $\epsilon_k \ge \Delta_k$. Certificates allow one to intervene if needed. For example, in automated customer services, one might reduce the service price in episode $k$ if the certificate $\epsilon_k$ is above a certain threshold, since the quality of the provided service cannot be guaranteed. When there is no context, an optimality certificate upper bounds the sub-optimality of the current policy in any episode, which makes algorithms anytime interruptable (Zilberstein & Russell, 1996): one is guaranteed to always know a policy with improving performance. Our learning framework is formalized as follows:

**Definition 1 (Individual Policy Certificates (IPOC) Bounds).** An algorithm satisfies an individual policy certificate (IPOC) bound $F$ if, for a given $\delta \in (0,1)$, it outputs the current policy $\pi_k$, a return certificate $I_k \subseteq \mathbb{R}$ and an optimality certificate $\epsilon_k$ with $\epsilon_k \ge |I_k|$ before each episode $k$ (after observing the contexts) so that with probability at least $1-\delta$:

1. all return certificates contain the return of the policy $\pi_k$ played in episode $k$ and all optimality certificates are upper bounds on the sub-optimality of $\pi_k$, i.e., $\forall k \in \mathbb{N}: \epsilon_k \ge \Delta_k$ and $\rho_k(\pi_k) \in I_k$; and either

2a. for all numbers of episodes $T$, the cumulative sum of certificates is bounded, $\sum_{k=1}^T \epsilon_k \le F(W, T, \delta)$ (Cumulative Version), or

2b. for any threshold $\epsilon$, the number of times certificates can exceed the threshold is bounded as $\sum_{k=1}^{\infty} \mathbf{1}\{\epsilon_k > \epsilon\} \le F(W, \epsilon, \delta)$ (Mistake Version).

Here, $W$ can be (known or unknown) properties of the environment. If conditions 1 and 2a hold, we say the algorithm has a cumulative IPOC bound, and if conditions 1 and 2b hold, we say the algorithm has a mistake IPOC bound.

Condition 1 alone would be trivial to satisfy with $\epsilon_k = \Delta_{\max}$ and $I_k = [0, \Delta_{\max}]$, but condition 2 prohibits this by controlling the size of $\epsilon_k$ (and therefore the size of $|I_k| \le \epsilon_k$). Condition 2a bounds the cumulative sum of optimality certificates (similar to regret bounds), and condition 2b bounds the size of the superlevel sets of $\epsilon_k$ (similar to PAC bounds). We allow both alternatives because condition 2b is stronger but one sometimes can only prove condition 2a (see Appendix D¹). An IPOC bound simultaneously controls the quality of the certificates (how big $\epsilon_k - \Delta_k$ and $|I_k|$ are) as well as the optimality gaps $\Delta_k$ themselves; hence, an IPOC bound not only guarantees that the algorithm improves its policy but also that it becomes better at telling us how well the policy performs. Note that the condition $\epsilon_k \ge |I_k|$ in Definition 1 is natural, as any upper bound on $\rho^*_k$ is also an upper bound on $\rho_k(\pi_k)$, and is made for notational convenience.

We would like to emphasize that we provide certificates on the return, the expected sum of rewards, in the next episode. Due to the stochasticity in the environment, one in general cannot hope to accurately predict the sum of rewards directly. Since return is the default optimization criterion in RL, certificates for it are a natural starting point and relevant in many scenarios. Nonetheless, certificates for other properties of the sum-of-rewards distribution of a policy are an interesting direction for future work. For example, one might want certificates on properties that take into account the variability of the sum of rewards (e.g., conditional value at risk) in high-stakes applications, which are often the objective in risk-sensitive RL.

¹ See the full version of this paper at https://arxiv.org/abs/1811.03056.
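As an illustration of how the two variants of Definition 1 could be checked empirically in simulation (where the optimality gaps $\Delta_k$ are known), a small hypothetical sketch:

```python
def check_ipoc_log(eps, intervals, returns, gaps, threshold=None):
    """eps[k] = eps_k, intervals[k] = (lo, hi) = I_k, returns[k] = rho_k(pi_k), gaps[k] = Delta_k."""
    # Condition 1: every certificate is valid for its episode.
    valid = all(e >= g and lo <= ret <= hi
                for e, (lo, hi), ret, g in zip(eps, intervals, returns, gaps))
    # Condition 2a (cumulative version): sum of optimality certificates.
    cumulative = sum(eps)
    # Condition 2b (mistake version): how often certificates exceed a threshold epsilon.
    mistakes = sum(e > threshold for e in eps) if threshold is not None else None
    return valid, cumulative, mistakes
```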
### 3.1. Relation to Existing Frameworks

Unlike IPOC, existing frameworks for RL only guarantee sample-efficiency of the algorithm over multiple episodes and do not provide performance bounds for single episodes during learning. The common existing frameworks are:

- Mistake-style PAC bounds (Strehl et al., 2006; 2009; Szita & Szepesvári, 2010; Lattimore & Hutter, 2012; Dann & Brunskill, 2015) bound the number of $\epsilon$-mistakes, that is, the size of the set $\{k \in \mathbb{N} : \Delta_k > \epsilon\}$, with high probability, but do not tell us when mistakes happen. The same is true for the stronger Uniform-PAC bounds (Dann et al., 2017), which hold for all $\epsilon$ jointly.
- Supervised-learning-style PAC bounds (Kearns & Singh, 2002; Jiang et al., 2017; Dann et al., 2018) ensure that the algorithm outputs an $\epsilon$-optimal policy for a given $\epsilon$, i.e., they ensure $\Delta_k \le \epsilon$ for $k$ greater than the bound. Yet, they need to know $\epsilon$ ahead of time and tell us nothing about $\Delta_k$ during learning (for $k$ smaller than the bound).
- Regret bounds (Osband et al., 2013; 2016; Azar et al., 2017; Jin et al., 2018) control the cumulative sum of optimality gaps $\sum_{k=1}^T \Delta_k$ (regret), which does not yield any non-trivial guarantee for individual $\Delta_k$ because it does not reveal which optimality gaps are small.

We show that mistake IPOC bounds are stronger than any of the above guarantees, i.e., they imply Uniform-PAC, PAC, and regret bounds. Cumulative IPOC bounds are slightly weaker but still imply regret bounds. Both versions of IPOC also ensure that the algorithm is anytime interruptable, i.e., it can be used to find better and better policies that have small $\Delta_k$ with high probability $1-\delta$. That means IPOC bounds imply supervised-learning-style PAC bounds for all $\epsilon$ jointly. These claims are formalized as follows:

**Proposition 2.** Assume an algorithm has a cumulative IPOC bound $F(W, T, \delta)$.

1. Then it has a regret bound of the same order, i.e., with probability at least $1-\delta$, for all $T$ the regret $R(T) := \sum_{k=1}^T \Delta_k$ is bounded by $F(W, T, \delta)$.
2. If $F$ has the form $\sum_{p=0}^N (C_p(W, \delta)\, T)^{\frac{p}{p+1}}$ for appropriate functions $C_p$, then with probability at least $1-\delta$, for any $\epsilon$, it outputs a certificate $\epsilon_k \le \epsilon$ within $\max_{0 \le p \le N} C_p(W, \delta)^p (N+1)^{p+1} / \epsilon^{p+1}$ episodes. Hence, for settings without context, the algorithm outputs an $\epsilon$-optimal policy within that number of episodes (supervised-learning-style PAC bound).

**Proposition 3.** If an algorithm has a mistake IPOC bound $F(W, \epsilon, \delta)$, then

1. it has a Uniform-PAC bound $F(W, \epsilon, \delta)$, i.e., with probability at least $1-\delta$, the number of episodes with $\Delta_k > \epsilon$ is at most $F(W, \epsilon, \delta)$ for all $\epsilon > 0$;
2. with probability $1-\delta$, for all $\epsilon$, it outputs a certificate $\epsilon_k \le \epsilon$ within $F(W, \epsilon, \delta) + 1$ episodes. For settings without context, that means the algorithm outputs an $\epsilon$-optimal policy within that many episodes (supervised-learning-style PAC);
3. if $F$ has the form $\sum_{p=1}^N \frac{C_p(W,\delta)}{\epsilon^p} \ln^n \frac{C(W,\delta)}{\epsilon}$ with $C_p(W,\delta) \ge 1$ and constants $N, n \in \mathbb{N}$, it also has a cumulative IPOC bound of order $\sum_{p=1}^N C_p(W,\delta)^{1/p}\, T^{\frac{p-1}{p}} \cdot \mathrm{polylog}(\Delta_{\max}, C(W,\delta), T)$.

The functional form in part 2 of Proposition 2 includes common polynomial bounds like $O(\sqrt{T})$ or $O(T^{2/3})$ with appropriate factors, and similarly for part 3 of Proposition 3, which covers for example $O(1/\epsilon^2)$.
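For intuition on part 2 of Proposition 2, a short derivation under the stated functional form (a sketch of the argument, not the paper's proof): if $\sum_{k=1}^T \epsilon_k \le \sum_{p=0}^N (C_p T)^{\frac{p}{p+1}}$, then

$$\min_{k \le T}\epsilon_k \;\le\; \frac{1}{T}\sum_{p=0}^{N}(C_p T)^{\frac{p}{p+1}} \;=\; \sum_{p=0}^{N} C_p^{\frac{p}{p+1}}\, T^{-\frac{1}{p+1}},$$

and once $T \ge \max_{0 \le p \le N} C_p^{p}(N+1)^{p+1}/\epsilon^{p+1}$, each of the $N+1$ summands is at most $\epsilon/(N+1)$, so some certificate output so far is at most $\epsilon$.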
Our IPOC framework is similar to KWIK (Li et al., 2008), in that the algorithm is required to declare how well it will perform. However, KWIK only requires an algorithm to declare whether the output will perform better than a single pre-specified input threshold. Existing KWIK methods for RL only provide such a binary classification and have weaker learning guarantees. In a sense, IPOC is a generalization of KWIK.

## 4. Algorithms with Policy Certificates

A natural path to obtaining RL algorithms with IPOC bounds is to combine existing provably efficient online RL algorithms with an off-policy policy evaluation method to compute a confidence interval on the online RL algorithm's policy for the current episode. This yields policy return certificates, but not necessarily policy optimality certificates, i.e., bounds on the difference between the optimal and the current policy's return. Estimating the optimal return using off-policy evaluation algorithms in order to compute optimality certificates would require a significant computational burden, e.g., evaluating all (exponentially many) policies. However, optimism-in-the-face-of-uncertainty (OFU) algorithms can be modified to provide both policy return certificates and optimality certificates without the need for a separate off-policy policy optimization step.

Specifically, we here consider OFU algorithms that maintain an upper confidence bound (for a potentially changing confidence level) on the optimal value function $Q^*_{k,h}$ and therefore on the optimal return $\rho^*_k$. This bound is also an upper bound on the return of the current policy, which is chosen to maximize this bound. Many OFU methods explicitly maintain a confidence set of the MDP model to compute the upper confidence bound on $Q^*_{k,h}$. These same confidence sets of the model can be used to compute a lower bound on the value function of the current policy. In doing so, OFU algorithms can be modified with little computational overhead to provide policy return and optimality certificates.

For these reasons, we focus on OFU methods, introducing two new algorithms with policy certificates, one for tabular MDPs and one for the more general setting of MDPs with linear side information. Both approaches have a similar structure, but leverage different confidence sets and model estimators. In the first case, we show that maintaining lower bounds on the current policy's value has significant benefits beyond enabling policy certificates: lower bounds help us to derive a tighter bound on our uncertainty over the range of future values. Thus we are able to provide the strongest, to our knowledge, PAC and regret bounds for tabular MDPs. It remains an intriguing but non-trivial question whether we can create confidence sets that leverage explicit upper and lower bounds for the linear side information setting.
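Both algorithms share the planning step just described; the following minimal Python sketch shows that shared structure. Function and variable names are ours, and the confidence width `psi` is a generic placeholder supplied by the caller, not the paper's exact bonuses.

```python
import numpy as np

def certified_planning(P_hat, r_hat, psi, H):
    """Backward induction on an empirical model with confidence widths psi.

    P_hat: (S, A, S) empirical transition probabilities
    r_hat: (S, A)    empirical mean rewards
    psi:   (S, A)    confidence widths (recomputed by the algorithms before each episode)
    Returns a greedy policy pi (H, S), optimistic values V_up and pessimistic values V_lo, each (H+1, S).
    """
    S, A = r_hat.shape
    V_up = np.zeros((H + 1, S))            # upper confidence bounds on V*_h
    V_lo = np.zeros((H + 1, S))            # lower confidence bounds on V^pi_h
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        v_max = H - h                      # values at this step lie in [0, H - h]
        Q_up = np.clip(r_hat + P_hat @ V_up[h + 1] + psi, 0.0, v_max)
        Q_lo = np.clip(r_hat + P_hat @ V_lo[h + 1] - psi, 0.0, v_max)
        pi[h] = Q_up.argmax(axis=1)        # greedy w.r.t. the optimistic estimate
        rows = np.arange(S)
        V_up[h] = Q_up[rows, pi[h]]
        V_lo[h] = Q_lo[rows, pi[h]]        # pessimistic value of the *same* policy
    return pi, V_up, V_lo

# The certificates for the (deterministic) initial state s1 are then
#   I_k  = [V_lo[0][s1], V_up[0][s1]]   and   eps_k = V_up[0][s1] - V_lo[0][s1].
```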
### 4.1. Tabular MDPs

We present ORLC (Optimistic Reinforcement Learning with Certificates), shown in Algorithm 1 (see the appendix for a version with empirically tighter confidence bounds but the same theoretical guarantees). It shares a similar structure with recent OFU algorithms like UBEV (Dann et al., 2017) and UCBVI-BF (Azar et al., 2017) but has some significant differences. Before each episode $k$, Algorithm 1 computes an optimistic estimate $\tilde Q_{k,h}$ of $Q^*_h$ in Line 10 by dynamic programming on the empirical model $(\hat P_k, \hat r_k)$ with confidence widths $\psi_{k,h}$. Importantly, it also computes $\underline Q_{k,h}$, a pessimistic estimate of $Q^{\pi_k}_h$, in similar fashion in Line 11. The optimistic and pessimistic estimates $\tilde Q_{k,h}, \underline Q_{k,h}$ (resp. $\tilde V_{k,h}, \underline V_{k,h}$) allow us to compute the certificates $\epsilon_k$ and $I_k$ and enable more sample-efficient learning. Specifically, Algorithm 1 uses a novel form of confidence widths $\psi$ that explicitly depends on this difference. These confidence widths are key for proving the following IPOC bound:

**Theorem 4 (Mistake IPOC Bound of Alg. 1).** For any given $\delta \in (0,1)$, Alg. 1 satisfies, in any tabular MDP with $S$ states, $A$ actions and horizon $H$, the following mistake IPOC bound: for all $\epsilon > 0$, the number of episodes where Alg. 1 outputs a certificate $|I_k| = \epsilon_k > \epsilon$ is at most
$$\left(\frac{SAH^2}{\epsilon^2} + \frac{S^2AH^3}{\epsilon}\right)\mathrm{polylog}\!\left(S, A, H, \tfrac{1}{\epsilon}, \tfrac{1}{\delta}\right).$$

By Proposition 3, this implies a Uniform-PAC bound of the same order as well as the regret and PAC bounds listed in Table 1. This table also contains the previous state-of-the-art bounds of each type² as well as lower bounds. The IPOC lower bound follows from the PAC lower bound by Dann & Brunskill (2015) and Proposition 3. For $\epsilon$ small enough ($\tilde O(1/(SH))$ specifically), our IPOC bound is minimax, i.e., the best achievable, up to log-factors. This is also true for the Uniform-PAC and PAC bounds implied by Theorem 4, as well as the implied regret bound when the number of episodes $T = \Omega(S^3AH^4)$ is large enough. ORLC is the first algorithm to achieve this minimax rate for PAC and Uniform-PAC. While UCBVI-BF achieves minimax regret for problems with small horizon, its bound is suboptimal when $H > SA$. The lower-order term in our regret bound, $\tilde O(S^2AH^3)$, has a slightly worse dependency on $H$ than Azar et al. (2017), but we can trade off a factor of $H$ for $A$ (see appendix) and believe that this term can be further reduced by a more involved analysis.

We defer details of our IPOC analysis to the appendix, available at https://arxiv.org/abs/1811.03056, but the main advances leverage that $[\underline Q_{k,h}(s,a), \tilde Q_{k,h}(s,a)]$ is an observable confidence interval for both $Q^*_h(s,a)$ and $Q^{\pi_k}_h(s,a)$. Specifically, our main novel insights are:

- While prior works (e.g., Lattimore & Hutter, 2012; Dann & Brunskill, 2015) control the suboptimality $Q^*_h - Q^{\pi_k}_h$ of the policy by recursively bounding $\tilde Q_{k,h} - Q^{\pi_k}_h$, we instead recursively bound $\tilde Q_{k,h} - \underline Q_{k,h} \le 2\psi_{k,h} + \hat P_k(\tilde V_{k,h+1} - \underline V_{k,h+1})$, which is not only simpler but also controls both the suboptimality of the policy and the size of the certificates simultaneously.
- As in existing work (e.g., Azar et al., 2017; Jin et al., 2018), we use empirical Bernstein-type concentration inequalities to construct $\tilde Q_{k,h}(s,a)$ as an upper bound on $Q^*_h(s,a) = r(s,a) + P(s,a)V^*_{h+1}$. This results in a dependency of the upper bound on the variance of the optimal next-state value $\sigma_{\hat P_k(s,a)}(V^*_{h+1})^2$ under the empirical model. Since $V^*_{h+1}$ is unknown, this has to be upper-bounded by $\sigma_{\hat P_k(s,a)}(\tilde V_{k,h+1})^2 + B$ with an additional bonus $B$ to account for the difference between the values, $\tilde V_{k,h+1} - V^*_{h+1}$, which is again unobservable. Azar et al. (2017) construct an observable bound on $B$ through an intricate regret analysis that involves additional high-probability bounds on error terms (see their Efr/Eaz events), which causes the suboptimal $\sqrt{H^3T}$ term in their regret bound. Instead, we use the fact that $\tilde V_{k,h+1} - \underline V_{k,h+1}$ is an observable upper bound on $\tilde V_{k,h+1} - V^*_{h+1}$, which we can directly use in our confidence widths $\psi_{k,h}$ (see the last term in Line 9 of Alg. 1). Hence, the availability of lower bounds through certificates also improves our upper confidence bounds on $Q^*_h$ and yields more sample-efficient exploration with improved performance bounds, as we avoid additional high-probability bounds on error terms.

Note that by augmenting our state space with a time index, our algorithm also achieves minimax optimality with $\tilde O(\sqrt{SAH^3T})$ regret up to lower-order terms in their (time-dependent dynamics) setting.

Algorithm 1: ORLC (Optimistic Reinforcement Learning with Certificates)

Input: failure tolerance $\delta \in (0, 1]$
1: $\phi(n) = 1 \wedge \sqrt{\frac{1.4\ln\ln(e \vee n) + \ln\frac{26SA(H+1+S)}{\delta}}{n}}$;  $\tilde V_{k,H+1}(s) = 0$; $\underline V_{k,H+1}(s) = 0$  $\forall s \in \mathcal{S}, k \in \mathbb{N}$
2: for $k = 1, 2, 3, \ldots$ do
3:  for $s', s \in \mathcal{S}, a \in \mathcal{A}$ do  // update empirical model and number of observations
4:   $n_k(s,a) = \sum_{i=1}^{k-1}\sum_{h=1}^{H}\mathbf{1}\{s_{i,h}=s, a_{i,h}=a\}$  // number of times $(s,a)$ was observed
5:   $\hat r_k(s,a) = \frac{1}{n_k(s,a)}\sum_{i=1}^{k-1}\sum_{h=1}^{H} r_{i,h}\mathbf{1}\{s_{i,h}=s, a_{i,h}=a\}$  // avg. reward observed for $(s,a)$
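A compact sketch of one iteration of Algorithm 1's loop, reusing the hypothetical `certified_planning` helper from the sketch at the start of this section. To stay short, it uses a simplified Hoeffding-style width instead of the Bernstein-style $\psi_{k,h}$ of Line 9, so it illustrates the structure of ORLC rather than its exact bonuses or its sharpened guarantees.

```python
import numpy as np

def orlc_episode(counts, reward_sums, trans_counts, H, delta):
    """One planning step from aggregated counts (hypothetical bookkeeping layout).

    counts:       (S, A)    visit counts n_k(s, a)
    reward_sums:  (S, A)    sum of observed rewards per (s, a)
    trans_counts: (S, A, S) observed transition counts per (s, a, s')
    """
    S, A = counts.shape
    n = np.maximum(counts, 1)
    r_hat = reward_sums / n
    P_hat = trans_counts / n[:, :, None]
    P_hat[counts == 0] = 1.0 / S                      # arbitrary prior for unvisited pairs
    # Simplified Hoeffding-style width; Algorithm 1 uses the tighter Bernstein-style psi_{k,h}.
    psi = H * np.sqrt(np.log(6 * S * A * H / delta) / n)
    pi, V_up, V_lo = certified_planning(P_hat, r_hat, psi, H)
    s1 = 0                                            # deterministic initial state
    certificate = (V_lo[0, s1], V_up[0, s1])          # I_k; eps_k is its length
    return pi, certificate
```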
6:   $\hat P_k(s'|s,a) = \frac{1}{n_k(s,a)}\sum_{i=1}^{k-1}\sum_{h=1}^{H}\mathbf{1}\{s_{i,h}=s, a_{i,h}=a, s_{i,h+1}=s'\}$
7:  for $h = H$ to $1$ and $s \in \mathcal{S}$ do  // optimistic planning with upper and lower confidence bounds
8:   for $a \in \mathcal{A}$ do
9:    $\psi_{k,h}(s,a) = \big(1 + 12\,\sigma_{\hat P_k(s,a)}(\tilde V_{k,h+1})\big)\phi(n_k(s,a)) + 45SH^2\phi(n_k(s,a))^2 + \frac{1}{H}\hat P_k(s,a)\big(\tilde V_{k,h+1} - \underline V_{k,h+1}\big)$
10:    $\tilde Q_{k,h}(s,a) = 0 \vee \big(\hat r_k(s,a) + \hat P_k(s,a)\tilde V_{k,h+1} + \psi_{k,h}(s,a)\big) \wedge V^{\max}_h$  // UCB of $Q^*_h$
11:    $\underline Q_{k,h}(s,a) = 0 \vee \big(\hat r_k(s,a) + \hat P_k(s,a)\underline V_{k,h+1} - \psi_{k,h}(s,a)\big) \wedge V^{\max}_h$  // LCB of $Q^{\pi_k}_h$
12:   $\pi_k(s,h) = \arg\max_a \tilde Q_{k,h}(s,a)$;  $\tilde V_{k,h}(s) = \tilde Q_{k,h}(s, \pi_k(s,h))$;  $\underline V_{k,h}(s) = \underline Q_{k,h}(s, \pi_k(s,h))$
13:  output policy $\pi_k$ with certificates $I_k = [\underline V_{k,1}(s_{1,1}), \tilde V_{k,1}(s_{1,1})]$ and $\epsilon_k = |I_k|$
14:  sample episode $k$ with policy $\pi_k$  // observe $s_{k,1}, a_{k,1}, r_{k,1}, s_{k,2}, \ldots, s_{k,H}, a_{k,H}, r_{k,H}$

| Algorithm | Regret | PAC | Mistake IPOC |
|---|---|---|---|
| UCBVI-BF (Azar et al., 2017) | $\tilde O(\sqrt{SAH^2T} + \sqrt{H^3T} + S^2AH^2)$ | – | – |
| Q-learning w/ UCB² (Jin et al., 2018) | $\tilde O(\sqrt{SAH^4T} + S^{1.5}A^{1.5}H^{4.5})$ | – | – |
| UCFH (Dann & Brunskill, 2015) | – | $\tilde O\big(\frac{S^2AH^2}{\epsilon^2}\big)$ | – |
| UBEV² (Dann et al., 2017) | $\tilde O(\sqrt{SAH^4T} + S^2AH^3)$ | $\tilde O\big(\frac{SAH^4}{\epsilon^2}\big)$ | – |
| ORLC (this work) | $\tilde O(\sqrt{SAH^2T} + S^2AH^3)$ | $\tilde O\big(\frac{SAH^2}{\epsilon^2}\big)$ | $\tilde O\big(\frac{SAH^2}{\epsilon^2}\big)$ |
| Lower bounds | $\Omega(\sqrt{SAH^2T})$ | $\Omega\big(\frac{SAH^2}{\epsilon^2}\big)$ | $\Omega\big(\frac{SAH^2}{\epsilon^2}\big)$ |

Table 1. Comparison of the state of the art and our bounds for episodic RL in tabular MDPs. A dash means that the algorithm does not satisfy a non-trivial bound without modifications. $T$ is the number of episodes, and $\ln(1/\delta)$ factors are omitted for readability. For an empirical comparison of the sample-complexity of these approaches, see Appendix E.2, available at https://arxiv.org/abs/1811.03056.

² These model-free and model-based methods have the best known bounds in our problem class. Q-learning with UCB and UBEV allow time-dependent dynamics. One might be able to improve their regret bounds by $\sqrt{H}$ when adapting them to our setting.

As opposed to the upper bounds, we cannot simply apply concentration inequalities to construct $\underline Q_{k,h}(s,a)$ as a lower bound on $Q^{\pi_k}_h$, because the estimation target $Q^{\pi_k}_h(s,a) = r(s,a) + P(s,a)V^{\pi_k}_{h+1}$ is itself random. The policy $\pi_k$ depends in highly non-trivial ways on all samples from which we also estimate the empirical model $\hat P_k, \hat r_k$. A prevalent approach in model-based policy evaluation (e.g., Strehl & Littman, 2008; Ghavamzadeh et al., 2016) to deal with this challenge is to instead apply a concentration argument to the $\ell_1$ distance of the transition estimates, $\|P(s,a) - \hat P_k(s,a)\|_1 \le \sqrt{S}\,\phi(n_k(s,a))$. This yields confidence intervals that shrink at a rate of $H\sqrt{S}\,\phi(n_k(s,a))$. Instead, we can exploit that $\pi_k$ is generated by a sample-efficient algorithm and construct $\underline Q_{k,h}$ as a lower bound on the non-random quantity $r(s,a) + P(s,a)V^*_{h+1}$. We account for the difference $P(s,a)(V^*_{h+1} - V^{\pi_k}_{h+1}) \le P(s,a)(\tilde V_{k,h+1} - \underline V_{k,h+1})$ explicitly, again through a recursive bound. This allows us to achieve confidence intervals that shrink at a faster rate of $\psi_{k,h} \approx H\phi(n_k(s,a)) + SH^2\phi(n_k(s,a))^2$, without the $\sqrt{S}$ dependency in the dominating $\phi(n_k(s,a))$ term (recall that $\phi(n_k(s,a)) \le 1$ and goes to $0$). Hence, by leveraging that $\pi_k$ is computed by a sample-efficient approach, we improve the tightness of the certificates.

### 4.2. MDPs With Linear Side Information

We now present an algorithm for the more general setting with side information, which, for example, allows us to take background information about a customer into account and generalize across different customers. Algorithm 2 gives an extension, called ORLC-SI, of the OFU algorithm by Abbasi-Yadkori & Neu (2014). Its overall structure is the same as that of the tabular Algorithm 1, but here the empirical model consists of least-squares estimates of the model parameters, evaluated at the current contexts.
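A simplified sketch of this least-squares model: one shared context vector is used for both rewards and transitions, and the exact log terms of the ellipsoid width $\phi(N, x, \xi)$ are replaced by a generic scale `beta`; names are illustrative.

```python
import numpy as np

class ContextualModel:
    """Per-(s,a) ridge regression for rewards and per-(s',s,a) for transition indicators."""

    def __init__(self, S, A, d, lam=1.0):
        self.N = np.tile(lam * np.eye(d), (S, A, 1, 1))  # Gram matrices N_{k,s,a}
        self.b_r = np.zeros((S, A, d))                    # sums of x * r       per (s, a)
        self.b_p = np.zeros((S, A, S, d))                 # sums of x * 1{s'}   per (s, a, s')

    def update(self, s, a, s_next, r, x):
        self.N[s, a] += np.outer(x, x)
        self.b_r[s, a] += x * r
        self.b_p[s, a, s_next] += x

    def estimate(self, s, a, x, beta=1.0):
        """Return clipped reward/transition estimates at context x and an ellipsoid width."""
        N_inv = np.linalg.inv(self.N[s, a])
        r_hat = float(np.clip(x @ (N_inv @ self.b_r[s, a]), 0.0, 1.0))
        p_hat = np.clip(self.b_p[s, a] @ N_inv @ x, 0.0, 1.0)   # one estimate per next state s'
        width = beta * np.sqrt(x @ N_inv @ x)                   # beta * ||x||_{N^{-1}}
        return r_hat, p_hat, width
```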
Algorithm 2: ORLC-SI (Optimistic Reinforcement Learning with Certificates and Side Information)

Input: failure probability $\delta \in (0, 1]$, regularizer $\lambda > 0$
1: $\tilde V_{k,H+1}(s) = 0$; $\underline V_{k,H+1}(s) = 0$  $\forall s \in \mathcal{S}, k \in \mathbb{N}$
2: $\phi(N, x, \xi) := \Big[\sqrt{\tfrac{1}{2}\ln\tfrac{S(SA+A+H)}{\delta} + \tfrac{1}{4}\ln\tfrac{\det N}{\det(\lambda I)}} + \sqrt{\lambda}\,\xi\Big]\,\|x\|_{N^{-1}}$
3: for $k = 1, 2, 3, \ldots$ do
4:  observe current contexts $x^{(r)}_k$ and $x^{(p)}_k$
5:  for $s, s' \in \mathcal{S}, a \in \mathcal{A}$ do  // estimate model with least-squares
6:   $N^{(q)}_{k,s,a} = \lambda I + \sum_{i=1}^{k-1}\sum_{h=1}^{H}\mathbf{1}\{s_{i,h}=s, a_{i,h}=a\}\, x^{(q)}_i (x^{(q)}_i)^\top$ for $q \in \{r, p\}$
7:   $\hat\theta^{(r)}_{k,s,a} = (N^{(r)}_{k,s,a})^{-1}\sum_{i=1}^{k-1}\sum_{h=1}^{H}\mathbf{1}\{s_{i,h}=s, a_{i,h}=a\}\, x^{(r)}_i r_{i,h}$;  $\hat r_k(s,a) = 0 \vee (x^{(r)}_k)^\top\hat\theta^{(r)}_{k,s,a} \wedge 1$
8:   $\hat\theta^{(p)}_{k,s',s,a} = (N^{(p)}_{k,s,a})^{-1}\sum_{i=1}^{k-1}\sum_{h=1}^{H}\mathbf{1}\{s_{i,h}=s, a_{i,h}=a, s_{i,h+1}=s'\}\, x^{(p)}_i$
9:   $\hat P_k(s'|s,a) = 0 \vee (x^{(p)}_k)^\top\hat\theta^{(p)}_{k,s',s,a} \wedge 1$
10:  for $h = H$ to $1$ and $s \in \mathcal{S}$ do  // optimistic planning with ellipsoid confidence bounds
11:   for $a \in \mathcal{A}$ do
12:    $\psi_{k,h}(s,a) = \|\tilde V_{k,h+1}\|_1\,\phi(N^{(p)}_{k,s,a}, x^{(p)}_k, \xi_{\theta^{(p)}}) + \phi(N^{(r)}_{k,s,a}, x^{(r)}_k, \xi_{\theta^{(r)}})$
13:    $\tilde Q_{k,h}(s,a) = 0 \vee \big(\hat r_k(s,a) + \hat P_k(s,a)\tilde V_{k,h+1} + \psi_{k,h}(s,a)\big) \wedge V^{\max}_h$  // UCB of $Q^*_h$
14:    $\underline Q_{k,h}(s,a) = 0 \vee \big(\hat r_k(s,a) + \hat P_k(s,a)\underline V_{k,h+1} - \psi_{k,h}(s,a)\big) \wedge V^{\max}_h$  // LCB of $Q^{\pi_k}_h$
15:   $\pi_k(s,h) = \arg\max_a \tilde Q_{k,h}(s,a)$;  $\tilde V_{k,h}(s) = \tilde Q_{k,h}(s, \pi_k(s,h))$;  $\underline V_{k,h}(s) = \underline Q_{k,h}(s, \pi_k(s,h))$
16:  output policy $\pi_k$ with certificates $I_k = [\underline V_{k,1}(s_{1,1}), \tilde V_{k,1}(s_{1,1})]$ and $\epsilon_k = |I_k|$
17:  sample episode $k$ with policy $\pi_k$  // observe $s_{k,1}, a_{k,1}, r_{k,1}, s_{k,2}, \ldots, s_{k,H}, a_{k,H}, r_{k,H}$

Specifically, the empirical transition probability $\hat P_k(s'|s,a)$ is $(x^{(p)}_k)^\top\hat\theta^{(p)}_{k,s',s,a}$, where $\hat\theta^{(p)}_{k,s',s,a}$ is the least-squares estimate of the model parameter $\theta^{(p)}_{s',s,a}$. Since transition probabilities are normalized, this estimate is then clipped to $[0,1]$. The model is estimated separately for each $(s', s, a)$-triple, but generalizes across different contexts. The confidence widths $\psi_{k,h}$ are derived using ellipsoid confidence sets on the model parameters. We show the following IPOC bound:

**Theorem 5 (Cumulative IPOC Bound for Alg. 2).** For any $\delta \in (0,1)$ and regularizer $\lambda > 0$, Alg. 2 satisfies the following cumulative IPOC bound in any MDP with contexts of dimensions $d^{(r)}$ and $d^{(p)}$ and bounded parameters $\xi_{\theta^{(r)}} \le \sqrt{d^{(r)}}$, $\xi_{\theta^{(p)}} \le \sqrt{d^{(p)}}$. With probability at least $1-\delta$, all return certificates contain the return of $\pi_k$, all optimality certificates are upper bounds on the optimality gaps, and their total sum after $T$ episodes is bounded for all $T$ by
$$\tilde O\left(\sqrt{S^3AH^4\,T\,\lambda\,\big(d^{(p)} + d^{(r)}\big)}\,\log\frac{\xi^2_{x^{(p)}} + \xi^2_{x^{(r)}}}{\lambda\delta}\right).$$

By Proposition 2, this IPOC bound implies a regret bound of the same order, which improves on the $\tilde O\big(\sqrt{d^2S^4AH^5T\log(1/\delta)}\big)$ regret bound of Abbasi-Yadkori & Neu (2014) with $d = d^{(p)} + d^{(r)}$ by a factor of $\sqrt{SAH}$. While they make a different modelling assumption (generalized linear instead of linear), we believe at least our better $S$ dependency is due to using improved least-squares estimators for the transition dynamics³ and can likely be transferred to their setting. The mistake-type PAC bound by Modi et al. (2018) is not comparable, because our cumulative IPOC bound does not imply a mistake-type PAC bound.⁴ Nonetheless, loosely translating our result to a PAC-like bound yields $\tilde O\big(\frac{d^2S^3AH^5}{\epsilon^2}\big)$, which is much smaller than their $\tilde O\big(\frac{d^2SAH^4}{\epsilon^5}\max\{d^2, S^2\}\big)$ bound for small $\epsilon$.

³ They estimate $\theta_{s',s,a}$ only from samples where the transition $s, a \rightarrow s'$ was observed, instead of from all occurrences of $s, a$ (no matter whether $s'$ was the next state).

⁴ An algorithm with a sub-linear cumulative IPOC bound can output a certificate larger than a threshold, $\epsilon_k \ge \epsilon$, infinitely often, as long as it does so sufficiently infrequently (see Section D).

The confidence bounds in Alg. 2 are more general but looser than those for the tabular case in Alg. 1. Instantiating the IPOC bound for Alg. 2 from Theorem 5 for tabular MDPs ($x^{(r)}_k = x^{(p)}_k = 1$) yields $\tilde O(\sqrt{S^3AH^4T})$, which is worse than the cumulative IPOC bound $\tilde O(\sqrt{SAH^2T} + S^2AH^3)$ of Alg. 1 implied by Thm. 4 and Prop. 3.
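For concreteness, the tabular instantiation mentioned above works out as follows under the bound of Theorem 5 as stated: with scalar contexts $x^{(r)}_k = x^{(p)}_k = 1$, dimensions $d^{(r)} = d^{(p)} = 1$, context bounds $\xi_{x^{(r)}} = \xi_{x^{(p)}} = 1$ and regularizer $\lambda = 1$,

$$\tilde O\left(\sqrt{S^3AH^4T \cdot 1 \cdot (1+1)}\,\log\frac{1+1}{\delta}\right) \;=\; \tilde O\left(\sqrt{S^3AH^4T}\right).$$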
By Prop. 3, a mistake IPOC bound is stronger than the cumulative version we proved for Algorithm 2. One might wonder whether Alg. 2 also satisfies a mistake bound, but in Appendix D (at https://arxiv.org/abs/1811.03056) we show that this is not the case, because of its non-decreasing ellipsoid confidence sets. There could be other algorithms with mistake IPOC bounds for this setting, but they would likely require entirely different confidence sets.

[Figure 1. Certificates and (unobserved) optimality gaps of Algorithm 2 for 4M episodes on an MDP with a context distribution shift after 2M episodes (episodes sub-sampled for better visualization).]

## 5. Simulation Experiment

One important use case for certificates is to detect sudden performance drops when the distribution of contexts changes. For example, in a call center dialogue system, there can be a sudden increase of customers calling due to a certain regional outage. We demonstrate that certificates can identify such performance drops caused by context shifts. We consider a simulated MDP with 10 states, 40 actions and horizon 5, where rewards depend on a 10-dimensional context, and let the distribution of contexts change after 2M episodes. As seen in Figure 1, this causes a spike in the optimality gap as well as in the optimality certificates. While our certificates only need to upper bound the optimality gap / contain the return in each episode up to a small failure probability, even for the worst case, our algorithm can reliably detect this sudden decrease of performance. In fact, the optimality certificates have a very high correlation of 0.94 with the unobserved optimality gaps.

One may also wonder whether our algorithms lead to improvements over prior approaches in practice, or only in the theoretical bounds. To help answer this, we present results in Appendix E at https://arxiv.org/abs/1811.03056, both analyzing the policy certificates provided and examining ORLC's performance in tabular MDPs versus other recent papers with similar regret (Azar et al., 2017) or PAC (Dann et al., 2017) bounds. Encouragingly, in the small simulated MDPs considered, we find that our algorithms lead to faster learning and better performance. Therefore, while our primary contribution is theoretical results, these simulations suggest the potential benefits of the ideas underlying our proposed framework and algorithms.
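A sketch of the kind of monitoring motivated by this experiment: track the announced optimality certificates and flag episodes whose certificate jumps well above the recent average, e.g., after a context distribution shift. The thresholding rule and names are illustrative.

```python
def monitor_certificates(eps_stream, window=1000, factor=3.0):
    """Yield episode indices whose certificate exceeds `factor` times the recent average."""
    recent = []
    for k, eps_k in enumerate(eps_stream):
        if len(recent) == window and eps_k > factor * (sum(recent) / window):
            yield k                      # e.g., hand control to a human or reduce the service price
        recent.append(eps_k)
        if len(recent) > window:
            recent.pop(0)
```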
## 6. Related Work

The connection of IPOC to other frameworks is formally discussed in Section 3. Our algorithms essentially compute confidence bounds as in OFU methods, and then use those in model-based policy evaluation to obtain policy certificates. There are many works on off-policy policy evaluation (e.g., Jiang & Li, 2016; Thomas & Brunskill, 2016; Mahmood et al., 2017), some of which provide non-asymptotic confidence intervals (e.g., Thomas et al., 2015b;a; Sajed et al., 2018). However, these methods focus on the batch setting where a set of episodes sampled by fixed policies is given. Many approaches rely on importance weights that require stochastic data-collecting policies, but most sample-efficient algorithms for which we would like to provide certificates deploy deterministic policies. One could treat previous episodes as being collected by one stochastic data-dependent policy, but that introduces bias in the importance-weighting estimators that is not accounted for in the analyses.

Interestingly, there is very recent work (Zanette & Brunskill, 2019) that also observed the benefits of using lower bounds in optimism-based exploration in tabular episodic RL. Though both their and our work obtain improved theoretical results, the specific forms of the optimistic bonuses are distinct and the analyses differ in many parts (e.g., we provide (Uniform-)PAC and regret bounds instead of only regret bounds). Most importantly, our work provides policy certificate guarantees as a main contribution, whereas that work focuses on problem-dependent regret bounds.

Approaches to safe exploration (Kakade & Langford, 2002; Pirotta et al., 2013; Thomas et al., 2015a; Ghavamzadeh et al., 2016) guarantee monotonically increasing performance by operating in a batch loop. Our work is orthogonal, as we are not restricting exploration but rather exposing its impact to the users and giving them the choice to intervene.

## 7. Conclusion and Future Work

We introduced policy certificates to improve accountability in RL by enabling users to intervene if the guaranteed performance is deemed inadequate. Bounds in our new theoretical framework IPOC ensure that certificates indeed bound the return and suboptimality in each episode and prescribe the rate at which certificates and policy improve. By combining optimism-based exploration with model-based policy evaluation, we have created two algorithms for RL with policy certificates, for tabular MDPs with and without linear side information. For tabular MDPs, we demonstrated that policy certificates help optimism-based policy learning and vice versa. As a result, our new algorithm is the first to achieve minimax-optimal PAC bounds up to lower-order terms for tabular episodic MDPs, and also the first to have both minimax PAC and regret bounds for this setting. Future areas of interest include scaling up these ideas to continuous state spaces, extending them to model-free RL, and providing per-episode risk-sensitive guarantees on the reward obtained.

## Acknowledgements

Parts of this work were completed while Christoph Dann was an intern at Google. We appreciate support from a Microsoft Faculty Fellowship and an NSF CAREER award.

## References

Abbasi-Yadkori, Y. and Neu, G. Online learning in MDPs with side information. arXiv preprint arXiv:1406.6812, 2014.

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263–272, 2017.

Dann, C. and Brunskill, E. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2818–2826, 2015.

Dann, C., Lattimore, T., and Brunskill, E. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5713–5723, 2017.

Dann, C., Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. On oracle-efficient PAC RL with rich observations. In Advances in Neural Information Processing Systems, pp. 1422–1432, 2018.

Ghavamzadeh, M., Petrik, M., and Chow, Y. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, pp. 2298–2306, 2016.

Hallak, A., Di Castro, D., and Mannor, S. Contextual Markov decision processes. arXiv preprint arXiv:1502.02259, 2015.
Howard, S. R., Ramdas, A., McAuliffe, J., and Sekhon, J. Uniform, nonparametric, non-asymptotic confidence sequences. arXiv preprint arXiv:1810.08240, 2018.

Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., and Roth, A. Fair learning in Markovian environments. arXiv preprint arXiv:1611.03071, 2016.

Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 652–661, 2016.

Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pp. 1704–1713, 2017.

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? arXiv preprint arXiv:1807.03765, 2018.

Joseph, M., Kearns, M., Morgenstern, J. H., and Roth, A. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems, pp. 325–333, 2016.

Kakade, S. On the sample complexity of reinforcement learning. PhD thesis, University College London, 2003.

Kakade, S. M. and Langford, J. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002.

Kannan, S., Kearns, M., Morgenstern, J., Pai, M., Roth, A., Vohra, R., and Wu, Z. S. Fairness incentives for myopic agents. In Proceedings of the 2017 ACM Conference on Economics and Computation, pp. 369–386. ACM, 2017.

Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2018.

Lattimore, T. and Hutter, M. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pp. 320–334. Springer, 2012.

Li, L., Littman, M. L., and Walsh, T. J. Knows what it knows: a framework for self-aware learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 568–575. ACM, 2008.

Mahmood, A. R., Yu, H., and Sutton, R. S. Multi-step off-policy learning without importance sampling ratios. arXiv preprint arXiv:1702.03006, 2017.

Modi, A., Jiang, N., Singh, S., and Tewari, A. Markov decision processes with continuous side information. In Algorithmic Learning Theory, pp. 597–618, 2018.

Osband, I., Russo, D., and Van Roy, B. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pp. 3003–3011, 2013.

Osband, I., Van Roy, B., and Wen, Z. Generalization and exploration via randomized value functions. In International Conference on Machine Learning, pp. 2377–2386, 2016.

Pirotta, M., Restelli, M., Pecorino, A., and Calandriello, D. Safe policy iteration. In International Conference on Machine Learning, pp. 307–315, 2013.

Raghavan, M., Slivkins, A., Vaughan, J. W., and Wu, Z. S. The externalities of exploration and how data diversity helps exploitation. arXiv preprint arXiv:1806.00543, 2018.

Sajed, T., Chung, W., and White, M. High-confidence error estimates for learned value functions. arXiv preprint arXiv:1808.09127, 2018.

Strehl, A. L. and Littman, M. L. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
Strehl, A. L., Li, L., Wiewiora, E., Langford, J., and Littman, M. L. PAC model-free reinforcement learning. In International Conference on Machine Learning, 2006.

Strehl, A. L., Li, L., and Littman, M. L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.

Szita, I. and Szepesvári, C. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 1031–1038, 2010.

Thomas, P. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148, 2016.

Thomas, P., Theocharous, G., and Ghavamzadeh, M. High confidence policy improvement. In International Conference on Machine Learning, pp. 2380–2388, 2015a.

Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. High-confidence off-policy evaluation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3000–3006, 2015b.

Zanette, A. and Brunskill, E. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. https://arxiv.org/abs/1901.00210, 2019.

Zilberstein, S. and Russell, S. Optimal composition of real-time systems. Artificial Intelligence, 82(1-2):181–213, 1996.